# *Sample-average* method for estimating action values 

## One natural way to estimate the value of an action is by averaging the rewards actually received when that action was selected

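# $ Q_{t}(a) = \frac{  \sum_{i=1}^{t-1} R_{i} \, I_{A_{i} = a} }{  \sum_{i=1}^{t-1} I_{A_{i} = a}  } $

where $I_{A_{i} = a}$ is 1 if action $a$ was taken at step $i$ and 0 otherwise. If the denominator is zero, $Q_{t}(a)$ is instead set to some default value, such as 0.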
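As a rough illustration of the sample-average estimate, the sketch below computes it directly from a recorded history. The function name `sample_average`, the list-based history, and the `default` fallback for never-selected actions are assumptions made for this example.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Average of the rewards received on the steps where action `a` was taken."""
    total = 0.0
    count = 0
    for action, reward in zip(actions, rewards):
        if action == a:  # plays the role of the indicator I_{A_i = a}
            total += reward
            count += 1
    # The ratio is undefined if `a` was never selected; fall back to a default.
    return total / count if count > 0 else default


# Hypothetical history: action 0 was taken on three of the five steps.
actions = [0, 1, 0, 2, 0]
rewards = [1.0, 0.5, 3.0, 2.0, 2.0]
print(sample_average(actions, rewards, 0))  # mean of 1.0, 3.0 and 2.0
```

Storing the full history like this costs memory that grows with the number of steps, which is the motivation for an incremental form.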

## It is easy to devise incremental formulas for updating averages with small, constant computation required to process each new reward. 

# $  Q_{n + 1} = Q_{n} + \frac{1}{n}[  R_{n} - Q_{n}  ]  $
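The incremental rule can be checked against a plain average with a few lines of Python. This is only a sketch: the function name `incremental_mean` and the starting value of 0 are assumptions for the example (the first update, with step size 1, overwrites the starting value anyway).

```python
def incremental_mean(rewards):
    """Apply Q_{n+1} = Q_n + (1/n) * (R_n - Q_n) to each reward in turn."""
    q = 0.0  # arbitrary initial estimate; replaced by R_1 on the first update
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # constant memory and constant work per reward
    return q


rewards = [1.0, 3.0, 2.0, 6.0]
print(incremental_mean(rewards), sum(rewards) / len(rewards))
```

Both printed numbers should agree up to floating-point rounding, since the update is an algebraic rearrangement of the ordinary mean.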

# A simple bandit algorithm
```
Initialize, for a = 1 to k:
  Q(a) <- 0
  N(a) <- 0
Loop forever:  
  A <- { argmax_{a} Q(a)    with probability 1 - epsilon (breaking ties randomly)
       { a random action    with probability epsilon
  R <- bandit(A)
  N(A) <- N(A) + 1
  Q(A) <- Q(A) + 1/N(A) * (R-Q(A))
```
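The pseudocode can be turned into a small runnable sketch. Several details here are assumptions made for illustration and are not part of the notes: the name `run_bandit`, the finite `steps` count in place of "loop forever", and a `bandit` that draws rewards from a unit-variance Gaussian around hypothetical true means.

```python
import random


def run_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy action selection with sample-average value estimates."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # Q(a): current value estimates
    n = [0] * k    # N(a): number of times each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)  # explore: uniformly random action
        else:
            best = max(q)         # exploit: greedy action, ties broken randomly
            a = rng.choice([i for i in range(k) if q[i] == best])
        # Assumed reward model: Gaussian noise around the arm's true mean.
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental sample-average update
    return q, n


q, n = run_bandit([0.0, 1.0, 5.0], epsilon=0.1, steps=2000)
print("estimates:", q)
print("counts:   ", n)
```

With arms this far apart, most pulls should end up on the arm with the largest true mean, while the other arms are still sampled occasionally through exploration.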

# Upper-Confidence-Bound (UCB) Action Selection

# $A_{t} = \underset{a}{\operatorname{argmax}}{ [ Q_{t}(a) + c \sqrt{\frac{\ln{t}}{N_{t}(a)}} ] }$

where 
- $\ln{t}$ denotes the natural logarithm of $t$,
- $N_{t}(a)$ denotes the number of times that action $a$ has been selected prior to time $t$,
- and the number $c > 0$ controls the degree of exploration.

If $N_{t}(a) = 0$, then $a$ is considered to be a maximizing action.

The square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.
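A minimal sketch of the UCB selection rule follows. The function name `ucb_action` and the default `c=2.0` are assumptions for the example. Untried actions are treated as maximizing by returning the first one found, which is one possible convention; a random choice among untried actions would satisfy the rule equally well.

```python
import math


def ucb_action(q, n, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ] for time step t >= 1."""
    # An action that has never been selected counts as a maximizing action.
    for a, count in enumerate(n):
        if count == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=scores.__getitem__)


# Action 1 has a slightly lower estimate but was tried only once,
# so its uncertainty bonus is much larger.
print(ucb_action(q=[1.0, 0.9], n=[100, 1], t=101))
```

The bonus shrinks as $N_{t}(a)$ grows and slowly increases with $\ln{t}$ for actions that are not being selected, so every action keeps getting revisited, but less and less often.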

# Gradient Bandit

We consider learning a numerical preference for each action a, which we denote $H_t(a)$. 

# $  Pr\{A_{t} = a\} = \frac{ e^{H_{t}(a)} }{  \sum_{b=1}^{b=k} e^{H_{t}(b)}  } = \pi_{t}(a)  $

### $\pi_{t}(a)$ is the probability of taking action $a$ at time $t$

Initially all action preferences are the same (e.g., $H_1(a) = 0$ for all $a$), so that all actions have an equal probability of being selected.
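The soft-max distribution over preferences can be sketched as below. The function name `softmax` is an assumption for the example, and subtracting the maximum preference before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math


def softmax(h):
    """Turn action preferences H(a) into selection probabilities pi(a)."""
    m = max(h)  # shifting by a constant does not change the ratios
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    return [e / z for e in exps]


print(softmax([0.0, 0.0, 0.0, 0.0]))  # equal preferences give a uniform policy
print(softmax([1.0, 2.0, 3.0]))       # larger preference, larger probability
```

Only the differences between preferences matter: adding the same constant to every $H_t(a)$ gives the same probabilities.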

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. 

On each step, after selecting action $A_t$ and receiving the reward $R_t$, the action preferences are updated by:

# $  H_{t + 1} (A_t) = H_{t} (A_t) + \alpha (R_t - \bar{R}_t)(1-\pi_{t}(A_{t}))  $ &nbsp;&nbsp;&nbsp;&nbsp;, and
# $  H_{t + 1} (a) = H_{t} (a) - \alpha (R_t - \bar{R}_t) \pi_{t}(a)  $ &nbsp;&nbsp;&nbsp;&nbsp; for all $ a \neq A_{t} $

where 
 - $\alpha > 0$ is a step-size parameter,
 - $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up through and including time $t$, which can be computed incrementally.
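One step of the preference update can be sketched as follows. The function name `gradient_bandit_step` and its argument layout are assumptions for illustration; the caller is assumed to maintain the average reward (`baseline`) itself, for example with an incremental mean.

```python
import math


def gradient_bandit_step(h, action, reward, baseline, alpha=0.1):
    """Return updated preferences after taking `action` and observing `reward`."""
    # Soft-max probabilities pi(a) from the current preferences.
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    pi = [e / z for e in exps]

    advantage = reward - baseline  # R_t minus the average reward
    new_h = []
    for a in range(len(h)):
        if a == action:
            # Selected action: moves up when the reward beats the baseline.
            new_h.append(h[a] + alpha * advantage * (1.0 - pi[a]))
        else:
            # Every other action moves in the opposite direction.
            new_h.append(h[a] - alpha * advantage * pi[a])
    return new_h


# Three actions with equal preferences; action 0 earns a reward above baseline.
print(gradient_bandit_step([0.0, 0.0, 0.0], action=0, reward=1.0,
                           baseline=0.0, alpha=0.3))
```

In this example the preference for action 0 rises and the other two fall by half as much each, so the preferences still sum to the same total; only their relative sizes, and hence the probabilities, change.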
 