# Model Formulation
We have $n$ bands, where band $i$ contains $m_i$ bins.

![Band_Bins.png](attachment:Band_Bins.png)

Each bin $b_{i,j}$ has an associated random reward $R_{i,j}$ that is normally distributed $$R_{i,j} \sim N(\mu_{i,j}, \sigma_{i,j} ^2) \quad \forall \; b_{i,j}\in \mathbf{b}_i \tag{Bin Reward Distribution}$$

The mean $\mu_{i,j}$ of each $R_{i,j}$ is itself drawn from a bin mean reward distribution whose parameters $(\mu_i, \sigma_i^2)$ depend only on the band: $$\mu_{i,j} \sim N(\mu_i, \sigma_i^2) \quad \forall \; b_{i,j} \in \mathbf{b}_{i} \tag{Bin Mean Reward Distribution}$$
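
The two-level model above can be simulated in a few lines. The following is a minimal sketch using NumPy; the function names, the 0-based indexing, and all numeric values are illustrative assumptions, not part of the formulation.

```python
import numpy as np

def sample_bin_means(band_means, band_sds, bins_per_band, rng):
    """Draw mu_{i,j} ~ N(mu_i, sigma_i^2) for every bin of every band."""
    return [
        rng.normal(loc=mu_i, scale=sd_i, size=m_i)
        for mu_i, sd_i, m_i in zip(band_means, band_sds, bins_per_band)
    ]

def sample_reward(bin_means, bin_sds, i, j, rng):
    """Draw one reward R_{i,j} ~ N(mu_{i,j}, sigma_{i,j}^2) (0-based i, j)."""
    return rng.normal(loc=bin_means[i][j], scale=bin_sds[i][j])

rng = np.random.default_rng(0)
band_means = [0.0, 1.0, 2.0]   # mu_i (illustrative values)
band_sds = [1.0, 0.5, 2.0]     # sigma_i (standard deviations)
bins_per_band = [4, 3, 5]      # m_i

bin_means = sample_bin_means(band_means, band_sds, bins_per_band, rng)
bin_sds = [np.full(m_i, 0.3) for m_i in bins_per_band]  # sigma_{i,j}
reward = sample_reward(bin_means, bin_sds, i=1, j=2, rng=rng)
```

Note that `rng.normal` takes a standard deviation, so $\sigma_i$ and $\sigma_{i,j}$ are passed directly rather than their squares.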

___
## Modeling Assumptions
1. We know both the band variance $\sigma_i^2$ for all bands and the bin variance $\sigma_{i,j}^2$ for all bins.
 1. Note: This assumption allows for closed-form sequential updates to the band parameter estimates; it can be relaxed later on.


2. There are a large number of bins within each band ($m_i$ is large for all $i$).
 1. Thus the empirical distribution of the $\mu_{i,j}$ within Band $i$ closely resembles the Bin Mean Reward Distribution for that band.


3. We are operating within discrete time-steps and during each step $t$ we may:
    1. Sample/Resample a bin and receive reward $r_{i,j}(t)$ drawn independently from $R_{i,j}$
    1. Terminate the algorithm and return the estimate of the best bin
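
To illustrate the closed-form update mentioned under assumption 1, here is a minimal sketch of one possible update for the band mean $\mu_i$. It relies on two extra assumptions that are not part of the formulation: a normal prior on $\mu_i$, and that every reward comes from a bin of Band $i$ that has not been sampled before. Under those assumptions, marginally $r \sim N(\mu_i, \sigma_i^2 + \sigma_{i,j}^2)$, so the standard normal-normal conjugate update applies. The names `NormalBelief` and `update_band_belief` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NormalBelief:
    """Normal belief over the band mean mu_i (the prior is an assumption)."""
    mean: float
    var: float

def update_band_belief(belief, reward, sigma_band, sigma_bin):
    """Conjugate update after one reward from a previously unsampled bin.

    Given mu_i, such a reward has variance sigma_i^2 + sigma_{i,j}^2.
    """
    obs_var = sigma_band**2 + sigma_bin**2
    post_precision = 1.0 / belief.var + 1.0 / obs_var
    post_mean = (belief.mean / belief.var + reward / obs_var) / post_precision
    return NormalBelief(mean=post_mean, var=1.0 / post_precision)

# Prior N(0, 1); observation variance 0.6^2 + 0.8^2 = 1, so the posterior
# mean is halfway between the prior mean and the reward.
belief = update_band_belief(NormalBelief(0.0, 1.0), reward=2.0,
                            sigma_band=0.6, sigma_bin=0.8)
```

Resampling the same bin produces rewards that are correlated through the shared $\mu_{i,j}$, so this simple update would need to be adapted for that case.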
    
___

# Objective
*To do*
___


We assume we are operating in discrete time-steps indexed by $t$. At the beginning of each time-step, we decide whether to terminate the algorithm based on our current estimates and the termination conditions of the objective.

Conditioned on having not terminated by time $t_o$, we can define the history at time $t_o$ as:
$$\mathbf{h}(t_o) = \{a[1], r[1], a[2], r[2], \ldots, a[t_o-1], r[t_o-1] \}$$
where $a[t]$ refers to the action taken at time $t$ and $r[t]$ refers to the reward obtained at time $t$ 
+ *Note:* Brackets [ ] are used here for time-indexing
+ *Note 2:* A reward history $r^{(t_o)} =\{r[1], r[2], \ldots, r[t_o - 1]\}$ alone would not be enough, as we need to know which bin the reward at time $t$ is associated with

The action at time $t$ is based on our policy $\pi: \mathcal{H}\rightarrow \mathcal{A}$
$$a[t] = \pi(\mathbf{h}(t))$$
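
The interaction between history and policy can be sketched as a simple loop. This is only an illustration under stated assumptions: an action is represented as a bin index pair `(i, j)`, the policy signals termination by returning `None`, and the round-robin policy and toy reward function are hypothetical stand-ins. The sketch records the history only; it does not return a best-bin estimate.

```python
import random

def run_episode(policy, sample_reward, max_steps=1000):
    """Roll out a policy, recording h(t) = [(a[1], r[1]), ..., (a[t-1], r[t-1])]."""
    history = []
    for _ in range(max_steps):
        action = policy(history)        # a[t] = pi(h(t))
        if action is None:              # None means "terminate" (assumed convention)
            break
        reward = sample_reward(*action) # r[t] drawn from the chosen bin
        history.append((action, reward))
    return history

def make_round_robin_policy(bins, budget):
    """Hypothetical policy: cycle through the bins, then stop after `budget` steps."""
    def policy(history):
        t = len(history)
        return None if t >= budget else bins[t % len(bins)]
    return policy

rng = random.Random(0)
bins = [(0, 0), (0, 1), (1, 0)]
history = run_episode(
    make_round_robin_policy(bins, budget=6),
    lambda i, j: rng.gauss(float(i), 1.0),  # toy reward model, not the real one
)
```

Storing each action next to its reward keeps the bin identity attached to every observation.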


Given the history $\mathcal{H}_t$ (the same object written $\mathbf{h}(t)$ above), we can form parameter estimates $\hat{\mu}_i(\mathcal{H}_t)$ and $\hat{\mu}_{i,j}(\mathcal{H}_t)$ for all $i,j$
+ *Note: Parentheses ( ) are used here to indicate that the estimates are functions of the history*
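
As a minimal sketch of estimates written as functions of the history, the code below uses plain sample means. This estimator choice is an assumption made for illustration (in particular, averaging the bin estimates to obtain the band estimate is just one simple option), and the history is assumed to be a list of `((i, j), reward)` pairs.

```python
from collections import defaultdict

def bin_mean_estimates(history):
    """Sample-mean estimate of mu_{i,j} for every bin that appears in the history."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (i, j), reward in history:
        totals[(i, j)] += reward
        counts[(i, j)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def band_mean_estimates(history):
    """Estimate of mu_i: average of the bin estimates within each sampled band."""
    per_band = defaultdict(list)
    for (i, _j), estimate in bin_mean_estimates(history).items():
        per_band[i].append(estimate)
    return {i: sum(ests) / len(ests) for i, ests in per_band.items()}

history = [((0, 0), 1.0), ((0, 0), 3.0), ((0, 1), 4.0), ((1, 0), -1.0)]
print(bin_mean_estimates(history))   # {(0, 0): 2.0, (0, 1): 4.0, (1, 0): -1.0}
print(band_mean_estimates(history))  # {0: 3.0, 1: -1.0}
```

Bins and bands that were never sampled do not appear in the output, so a caller would need a fallback (for example a prior mean) for them.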


___

# Online Learning
We will focus on methods that simultaneously estimate parameters/distributions and decide which bin to sample within the same time-step.

This is in contrast to algorithms that operate in distinct phases, e.g. sampling $k$ bins within each band and then forming an estimate from those $k$ samples.


# Conventions
Subscripts indicate bands/bins, e.g. $\mathbf{b}_1$ is Band 1, while $b_{1,2}$ is bin $2$ within Band $1$.

Square brackets indicate indexing in time, e.g. the action at time $t$ is $a[t]$.

Parentheses indicate a function, e.g. the estimate $\hat{\mu}_i(\mathcal{H}_t)$.

Uppercase letters (and Greek letters with $\tilde{ }$ ) indicate random variables, e.g. the reward $R_{i,j}$ is the random variable representing the reward for sampling bin $b_{i,j}$.


Lowercase letters indicate deterministic variables.

Bolded letters/symbols refer to vectors/sets, e.g. the band is a vector of its $m_i$ bins, $\mathbf{b}_i= \{b_{i,1}, b_{i,2}, \ldots, b_{i, m_i}\}$.

 