# Stochastic Bandits

This chapter formally introduces stochastic bandits. 
Specifically, this notebook lays out the assumptions
and definitions needed to understand them.

## Core Assumptions

A stochastic bandit is a collection of distributions: 

$$\nu = (P_a : a \in \mathcal{A})$$

If $\mathcal{A}$ is the set of available actions, then $a$ is any action in the set. 
As such, $P_a$ is the *probability distribution* associated with action $a$. 
Furthermore, when we select $a$ as our action, $P_a$ is the distribution we will draw our reward from.

A learner will interact with a bandit problem for $n$ rounds.
This $n$ is our time-horizon.
In every round $t \in \{1,...,n\}$, the learner is responsible for selecting an $A_t \in \mathcal{A}$.
In other words, at every time step the learner must choose an action from the set of available actions.

When a learner selects $A_t \in \mathcal{A}$ (an action in the available choices), the environment samples a reward from the associated probability distribution:

$$
X_t \sim P_{A_t}
$$

In plain English: we select one of the arms (actions).
Doing so samples that arm's associated probability distribution, which generates the reward $X_t$ for this specific time step.
This reward is shown to the learner (allowing for learning to take place).

The selection of arms and receiving of rewards continues back-and-forth, creating a sequence of actions and rewards:

$$
A_1,X_1,A_2,X_2,...,A_n,X_n
$$

This sequence must satisfy two assumptions:

- $p(X_t | A_1,X_1,...,A_{t-1},X_{t-1},A_t) = P_{A_t}$
    - The conditional distribution of the reward at the current time step, given the full history, is simply the probability distribution of the selected arm $A_t$
    - This helps to assert that $X_t$ is drawn from the distribution associated with the arm
- The law for how a learner selects $A_t$ given the sequence $A_1,X_1,A_2,X_2,...,A_{t-1},X_{t-1}$ is simply $\pi_t(\cdot | A_1,X_1,A_2,X_2,...,A_{t-1},X_{t-1})$
    - This is saying that a learner's policy for selecting an action is solely determined by the past sequence of events
    - In other words, our agent can't see the future! (Obvious, but important.)
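
The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the names `run_bandit`, `bernoulli`, and `uniform_policy` are hypothetical, and Bernoulli arms are assumed purely for concreteness. The policy only ever receives the past history, and each reward is sampled from the chosen arm alone.

```python
import random


def run_bandit(arms, policy, n, seed=0):
    """Play n rounds and return the history [(A_1, X_1), ..., (A_n, X_n)]."""
    rng = random.Random(seed)
    history = []
    for _ in range(n):
        # The policy sees only past actions and rewards (second assumption).
        action = policy(history, len(arms), rng)
        # The reward depends on the past only through the chosen arm (first assumption).
        reward = arms[action](rng)
        history.append((action, reward))
    return history


def bernoulli(p):
    """Return a sampler for a Bernoulli(p) reward distribution (assumed arm type)."""
    return lambda rng: 1 if rng.random() < p else 0


def uniform_policy(history, k, rng):
    """A deliberately naive policy: ignore the history, pick an arm uniformly."""
    return rng.randrange(k)


history = run_bandit([bernoulli(0.2), bernoulli(0.7)], uniform_policy, n=10)
print(history)
```

Any smarter learner would plug in as a different `policy` function with the same signature.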

### Quick Summary / Parameters of Note!

- $|\mathcal{A}|$ -- the number of arms
- $P_a$ -- the probability distribution associated with arm $a$
- $n$ -- the time horizon, the limit to the number of time steps
- $t$ -- the current time step
- $A_t$ -- the action selected at a specific time step
- $X_t$ -- the reward drawn at time step $t$, given the selection of $A_t$

## Learning Objective

The goal of a bandit problem is (typically) to maximize the total reward!

$$
S_n = \sum_{t=1}^{n} X_t
$$

The textbook highlights 3 reasons why MAB problems are not *optimization* problems:
- Uncertainty in time horizon $n$... 
    - Do we know $n$? 
    - Is it important for our algorithm?
        - Adjustments can be made to bounds to handle uncertainty in the horizon, but at what penalty?
- Cumulative reward is a random quantity
    - Even if we know the exact distributions, we still need a measure of utility to understand how to interact with this specific instance
- The reward distributions are unknown! 
    - This is the beauty of bandits; we know the arms but don't know the distributions associated with them!
    - Must learn to estimate potential reward and maximize it
    - Must do it as efficiently as possible!

## Knowledge and Environment Class

Typically a learner has some information about the kind of problems/environments that it will face.
This information is the environment class $\mathcal{E}$.

### Unstructured Bandits

A class where the action set $\mathcal{A}$ is finite and there is a set of distributions $\mathcal{M}_a$ for each $a \in \mathcal{A}$ such that:

$$
\mathcal{E} = \{ \nu = (P_a : a \in \mathcal{A}) : P_a \in \mathcal{M}_a, \forall a \in \mathcal{A} \}
$$

Essentially, this means that pulling arm $a$ teaches you nothing about a different arm $b$.
In other words, for unstructured bandits, the learner only learns about the action it chooses.
Nothing is gained about the other arms.

Some examples:
- A 5-armed bandit where every arm is a Bernoulli distribution
- A 4-armed bandit where we know every arm is a Gaussian distribution with variance = 1
- An A/B test with 11 site variations where click-through is modelled with Bernoulli distributions

The idea is to encode what is known about the problem (that its rewards can be modelled with distributions of a certain property or class) in $\mathcal{E}$, which then becomes the basis for formally analyzing expected performance.

### Structured Bandits

Structured bandits are classes of bandits where a learner can infer something **more** about the other arms, based on some prior knowledge about how arms relate.
In short, arms that are _never_ played can still be learned about.

For example (Ex. 4.1), if there are two Bernoulli arms with means $\theta$ and $1-\theta$, then by pulling one arm a learner can begin to infer how the other will behave. There is only one parameter to really estimate between the two arms.
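
A small simulation makes this concrete. It is only a sketch: the value `theta = 0.3`, the sample size, and the variable names are hypothetical choices for illustration. The learner pulls the first arm only, yet ends up with an estimate for the second arm as well.

```python
import random

rng = random.Random(0)
theta = 0.3  # hypothetical true parameter, unknown to the learner
n = 2000     # number of pulls of the first arm (illustrative)

# Pull only the first arm, whose rewards are Bernoulli(theta).
rewards = [1 if rng.random() < theta else 0 for _ in range(n)]
theta_hat = sum(rewards) / n

# The structure (means theta and 1 - theta) yields an estimate for the
# second arm even though it was never played.
other_arm_hat = 1 - theta_hat
print(theta_hat, other_arm_hat)
```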

Another example (Ex. 4.2) gives a situation where the action set is a subset of d-dimensional real space ($\mathcal{A} \subset \mathbb{R}^d$).
The mean of each _arm_ is given by the inner product between its representation in this space and some unknown vector parameter $\theta$.
One could perhaps consider this as finding the optimal action vector in a d-dimensional space, as measured by the inner product.
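
As a rough illustration of the inner-product structure, the snippet below computes the mean reward of a few action vectors. The action set and the value of `theta` are made up for the example; in the actual problem `theta` is unknown and has to be learned.

```python
# Hypothetical action set in R^2 and a hypothetical parameter vector.
actions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
theta = (0.5, 1.0)


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# The mean reward of each action is its inner product with theta.
means = [inner(a, theta) for a in actions]
best = max(range(len(actions)), key=lambda i: means[i])
print(means, best)
```

Because every mean depends on the same `theta`, a reward observed for one action carries information about all the others.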

A final example (Ex. 4.3) features an action set where each action is a path through a graph.
Each edge is removed at random with some probability, and the goal is to find a path where no edges are removed.
Each time a path is tried, information is gained about the edges selected (i.e. the probability an edge does not get removed).
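
To see why paths share information, here is a toy sketch. It assumes edges are removed independently of each other, so the chance that a path survives is the product of its edges' survival probabilities; the graph, the probabilities, and all names below are hypothetical.

```python
import math

# Hypothetical probability that each edge is NOT removed.
keep_prob = {"s-a": 0.9, "a-t": 0.8, "a-b": 0.95, "b-t": 0.9, "s-t": 0.7}

# Hypothetical actions: each path is a list of edges from s to t.
paths = {
    "via a": ["s-a", "a-t"],
    "via a and b": ["s-a", "a-b", "b-t"],
    "direct": ["s-t"],
}


def path_success(edges):
    """Probability that no edge on the path is removed (independence assumed)."""
    return math.prod(keep_prob[e] for e in edges)


success = {name: path_success(edges) for name, edges in paths.items()}
best_path = max(success, key=success.get)
print(success, best_path)
```

The first two paths share the edge `s-a`, so trying either one tells the learner something about the other.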

## Regret

For a given bandit problem (represented by a collection of distributions, $\nu$)
and the policy of a learner, $\pi$,
the equation for regret is:

$$
R_n(\pi, \nu) = n \mu^{*}(\nu) - \mathbb{E}\left[ \sum_{t=1}^n X_t \right]
$$

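Here $\mu_a(\nu)$ is the mean of $P_a$ and $\mu^{*}(\nu) = \max_{a \in \mathcal{A}} \mu_a(\nu)$ is the largest mean reward of any arm. The regret is therefore the difference between the expected total reward from choosing an optimal action at every step and the expected total reward our learner gains by following policy $\pi$.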
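
The definition can be checked numerically with a Monte Carlo estimate. The sketch below assumes Bernoulli arms and a learner that picks arms uniformly at random; `estimate_regret` and its parameters are hypothetical names, not anything from the text.

```python
import random


def estimate_regret(means, n, runs=2000, seed=0):
    """Monte Carlo estimate of R_n for a uniformly random policy on Bernoulli arms."""
    rng = random.Random(seed)
    mu_star = max(means)
    total = 0
    for _ in range(runs):
        for _ in range(n):
            arm = rng.randrange(len(means))               # uniformly random action
            total += 1 if rng.random() < means[arm] else 0  # Bernoulli reward
    # n * mu_star minus the empirical average of S_n over all runs.
    return n * mu_star - total / runs


regret = estimate_regret([0.2, 0.7], n=100)
print(regret)
```

For this policy the expected reward per round is $(0.2 + 0.7)/2 = 0.45$, so the exact regret is $100 \cdot (0.7 - 0.45) = 25$, and the estimate should land close to that.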

A couple of things to be aware of from Lemma 4.4: 
- $R_n(\pi, \nu) \geq 0$, $\forall \pi$
    - This is saying that regret is never negative. In expectation, a learner never does better than choosing the optimal action at every step
- The policy $\pi$ that chooses an $A_t \in \text{argmax}_a{\mu_a}$ $\forall t$ satisfies $R_n = 0$
    - Choosing an optimal action at **every step** achieves zero regret.
    - This indicates that a learner who does not know the optimal action in advance should expect some positive regret
- If $R_n = 0$ then $\mathbb{P}(\mu_{A_t}=\mu^{*}) = 1$, $\forall t \in [n]$
    - The converse: zero regret is only possible if the learner picks an optimal action at every step with probability 1, which in practice requires oracle knowledge of the optimal actions
    
As one might infer, all of this is to say that some regret **will** occur. 
It then becomes a question of "how much regret?"
Or, perhaps, "how little regret can I guarantee?"

A couple of example goals from the text: 
- $\forall \nu \in \mathcal{E}: \lim_{n \rightarrow \infty} \frac{R_n}{n} = 0$
    - Asserting that the regret will be sublinear in $n$
    - Over time, learning ensures that the learner's average regret per round vanishes
- $\forall \nu \in \mathcal{E}: R_n \leq Cn^{p}$, $C > 0$ and $p < 1$
    - A stronger requirement; sub-linear and specifically polynomial in $n$
- $\forall n \in \mathbb{N}, \nu \in \mathcal{E}$, $R_n \leq C(\nu)f(n)$
    - Decomposing the regret into two components:
        - Regret due to the instance of the particular problem
        - Regret as a function of the time horizon
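
The first two goals are related: a bound of the second kind implies the first. A short worked step, using only $R_n \geq 0$ from Lemma 4.4 and the assumed bound $R_n \leq Cn^{p}$:

$$
0 \leq \frac{R_n}{n} \leq \frac{C n^{p}}{n} = C n^{p-1} \to 0 \quad \text{as } n \to \infty, \text{ since } p < 1
$$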

## Decomposing the Regret

A very important notion in bandit algorithms is that of the suboptimality gap (action gap, immediate regret) of a given action $a$:

$$
\Delta_{a}(\nu) = \mu^{*}(\nu) - \mu_{a}(\nu)
$$

As the name might suggest, this is the expected regret **immediately** incurred by choosing action $a$ instead of the optimal action.
In other words, this is the difference between the expected optimal reward and the expected reward of arm $a$.

Additionally, using indicator variables, we can define a random variable to denote the number of times an action was selected by the end of round $t$:

$$
T_{a}(t) = \sum_{s=1}^{t} \mathbb{I}\{A_s = a\}
$$
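
In code, this count is just a tally over the first $t$ actions. The action sequence and the function name `pull_count` below are hypothetical, and arms are indexed from 0 for convenience.

```python
# Hypothetical record of the actions A_1, ..., A_5 (arms indexed from 0).
actions = [0, 1, 1, 0, 1]


def pull_count(actions, a, t):
    """T_a(t): how many of the first t actions were equal to a."""
    return sum(1 for s in range(t) if actions[s] == a)


print(pull_count(actions, a=1, t=3))
print(pull_count(actions, a=0, t=5))
```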

The beauty of this definition is that it allows us to phrase a learner's regret as:

$$
R_n = \sum_{a \in \mathcal{A}} \Delta_a \mathbb{E}[T_a(n)]
$$

Or, the sum over arms of the expected number of times an arm was selected multiplied by the regret incurred each time that arm is selected (which is zero if the arm is the optimal one!).
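
The decomposition is easy to sanity-check on a fixed action sequence, where the expectation over $T_a(n)$ is trivial. The means and the sequence below are made-up values; the two ways of computing the regret should agree.

```python
# Hypothetical arm means and a fixed (non-random) action sequence.
means = [0.2, 0.7, 0.5]
actions = [0, 1, 2, 1, 1, 2, 1, 0, 1, 1]
n = len(actions)
mu_star = max(means)

# Regret from the definition: n * mu_star minus the expected total reward.
regret_direct = n * mu_star - sum(means[a] for a in actions)

# Regret from the decomposition: sum over arms of gap times pull count.
gaps = [mu_star - m for m in means]
counts = [actions.count(a) for a in range(len(means))]
regret_decomposed = sum(g * c for g, c in zip(gaps, counts))

print(regret_direct, regret_decomposed)
```

Both come out to $1.4$ (up to floating-point rounding): arm 0 contributes $0.5 \cdot 2$, arm 2 contributes $0.2 \cdot 2$, and the optimal arm contributes nothing.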