# Reinforcement Learning

(https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html)

## 1. The Problem

### Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements for a reinforcement learning system:

* **a policy**: (stimulus-response rule) defines the learning agent's way of behaving at a given time. It maps perceived states of the environment to the actions to be taken in those states.
* **a reward function**: defines the goal in an RL problem. It maps each perceived state (or state-action pair) of the environment to a single number indicating the desirability of that state. An RL agent's sole objective is to maximize the total reward it receives in the long run. This function must be unalterable by the agent! Reward functions may be stochastic. They indicate what is good in an immediate sense, since rewards are given directly by the environment.
* **a value function**: specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Values are predictions of rewards and must be estimated and re-estimated from the sequences of observations an agent makes.
* **a model of the environment (optional)**: mimics the behavior of the environment. E.g., given a state and action, the model might predict the resultant next state and next reward. Models are used for *planning*.



## 2. Evaluative Feedback

### n-armed Bandit

**Learning problem**: Choose repeatedly among $n$ different actions. After each choice, a numerical reward is received, drawn from a stationary probability distribution that depends on the selected action.

**Objective**: Maximize the expected total reward over some period of time (e.g. 1000 action selections). Each action selection is called a *play*.

The **value** of an action is the *mean* reward received when that action is selected; it is unknown to the agent at the beginning. When *exploring*, the agent chooses a random action; when *exploiting*, it greedily chooses the action with the best estimated value.

In [4]:
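from random import uniform

# Each arm draws its reward from a fixed (stationary) distribution.
# random() takes no arguments, so uniform(a, b) is used to sample from [a, b].
def h1():
    return uniform(-2, 2)
def h2():
    return uniform(0, 2)
def h3():
    return uniform(-1, 4)
bandit = [h1, h2, h3]

n = len(bandit)  # number of arms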


### Action-Value Methods

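The true value of action $a$ is denoted $Q^*(a) = \mathbb{E}[r \mid a]$, the expected reward when $a$ is selected, and its estimated value at the $t$th play is denoted $Q_t(a)$. If action $a$ has been chosen $k_a$ times prior to play $t$, yielding rewards $r_1, r_2, \dots, r_{k_a}$, the estimate is the sample average: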

$$Q_t(a)=\frac{r_1+r_2+...+r_{k_a}}{k_a}$$

If $k_a = 0$, we define $Q_t(a)$ to be some default value. As $k_a \rightarrow \infty$, $Q_t(a)$ converges to $Q^*(a)$ (**law of large numbers**).

**Simple action selection rule**: On play $t$ greedily select action $a^*$ for which $Q_t(a^*)=\max_aQ_t(a)$. This method always exploits current knowledge to maximize immediate reward.

**$\epsilon$-greedy method**: Follow the simple action selection rule above most of the time, but with small probability $\epsilon$ select an action uniformly at random instead.
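
The two selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy`, the `rng` parameter, and the random tie-breaking among equally good actions are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: pick an action with maximal estimate (ties broken at random).
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

q = [0.2, 1.5, -0.3]                   # hypothetical current estimates Q_t(a)
print(epsilon_greedy(q, epsilon=0.0))  # epsilon = 0 always exploits: index 1
print(epsilon_greedy(q, epsilon=0.1))  # usually 1, occasionally a random index
```

Setting `epsilon=0.0` recovers the purely greedy rule, so one function covers both methods.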

### Softmax Action Selection

$\epsilon$-greedy action selection is an effective way to balance exploration and exploitation. Its drawback is that, when it explores, it chooses equally among all actions. It is therefore just as likely to choose the worst action as the best one, which is a problem if the *worst* action is really bad.

**Solution**: vary action probabilities as a graded function of estimated value. Greedy action has the highest probability, but all others are ranked and weighted according to their value estimates.

$$\Pr_t^{Softmax}(a)=\frac{e^{\frac{Q_t(a)}{\tau}}}{\sum_{b=1}^{n}e^{\frac{Q_t(b)}{\tau}}}$$

$\tau$ is a positive parameter called the *temperature*.

* High temperatures cause actions to be nearly equiprobable
* Low temperatures produce greater difference in selection probability

When $\tau \rightarrow 0$, softmax action selection becomes the same as greedy action selection.
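
To see the effect of the temperature, the softmax probabilities can be computed directly. The sketch below assumes the estimates are held in a plain list; the names `softmax_probs` and `softmax_action` are illustrative, and subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math
import random

def softmax_probs(q_values, tau):
    """Selection probabilities for estimates q_values at temperature tau > 0."""
    m = max(q_values)  # shift by the max so that exp() cannot overflow
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_values, tau, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs)[0]

q = [1.0, 2.0, 3.0]              # hypothetical estimates Q_t(a)
print(softmax_probs(q, 100.0))   # high temperature: nearly equiprobable
print(softmax_probs(q, 0.1))     # low temperature: almost all mass on the greedy action
```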

### Evaluation vs Instruction

Example: Suppose there are $100$ possible actions and action $32$ is selected. **Evaluative** feedback would give a score, say $7.2$, while **instructive** training would say which other action, e.g. $67$, was correct instead.

#### $L_{R-P}$ algorithm

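**Linear reward penalty**: $\pi_t(a)$ is the probability that action $a$ is chosen on play $t$, and $d_t$ is the action inferred to be correct on play $t$ (in a two-action problem: the selected action if it succeeded, otherwise the other action). The probability of $d_t$ is moved a fraction $\alpha$ of the way towards one:

$$\pi_{t+1}(d_t)=\pi_t(d_t)+\alpha\biggr[1-\pi_t(d_t)\biggr]$$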
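
A single step of this update might look as follows. The sketch assumes a two-action problem with binary success/failure feedback, where $d_t$ is taken to be the selected action on success and the other action on failure, and where the other action's probability is set so that the two sum to one. The function name `lrp_update` and the list representation of $\pi_t$ are illustrative choices.

```python
def lrp_update(pi, action, success, alpha):
    """One L_{R-P} step for two actions; pi is [pi(0), pi(1)]."""
    # d is the action inferred to be correct on this play.
    d = action if success else 1 - action
    new_pi = list(pi)
    # Move pi(d) a fraction alpha of the way towards 1.
    new_pi[d] = pi[d] + alpha * (1.0 - pi[d])
    # Assumption: the other probability is adjusted so both sum to 1.
    new_pi[1 - d] = 1.0 - new_pi[d]
    return new_pi

pi = [0.5, 0.5]
pi = lrp_update(pi, action=0, success=True, alpha=0.1)   # action 0 becomes more likely
pi = lrp_update(pi, action=0, success=False, alpha=0.1)  # action 1 becomes more likely
print(pi)
```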

### Incremental Implementation

$$Q_t(a)=\frac{r_1+r_2+...+r_{k_a}}{k_a}$$

where $r_1, ..., r_{k_a}$ are all the rewards received following all selections of action $a$ prior to play $t$.

**Incremental update** to sensibly implement the function on a computer:
$Q_k$ denotes the average of the first $k$ rewards for some action $a$ (note that the subscript counts rewards of this action, not plays, so $Q_k$ is not the same as $Q_t(a)$ at play $t=k$). Given this average and a $(k+1)$st reward $r_{k+1}$, the average of all $k+1$ rewards can be computed by:


$$Q_{k+1}= \frac{1}{k+1}\sum_{i=1}^{k+1}r_i = Q_k + \frac{1}{k+1} \biggr[r_{k+1}-Q_k\biggr]$$

**General update rule**:
$$NewEstimate \leftarrow OldEstimate + StepSize \biggr[Target - OldEstimate \biggr]$$

$\biggr[Target - OldEstimate\biggr]$ is an *error* in the estimate which is reduced by taking a step towards the *Target*.
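
The incremental form needs only the current average and a count, not the full reward history. A minimal sketch, with the helper name `update_mean` and the sample rewards made up for illustration:

```python
def update_mean(q, k, reward):
    """Return the average of k + 1 rewards, given the average q of the first k."""
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), StepSize = 1/(k+1)
    return q + (reward - q) / (k + 1)

rewards = [1.0, 4.0, 2.0, 5.0]  # hypothetical rewards for one action
q = 0.0                         # the default value is overwritten by the first reward
for k, r in enumerate(rewards):
    q = update_mean(q, k, r)

# Both numbers agree up to floating-point rounding.
print(q, sum(rewards) / len(rewards))
```

Because the step size is $1$ on the first update, the initial default value has no influence on the result.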



### Tracking a Nonstationary Problem

In nonstationary problems, it makes sense to weight recent rewards more heavily than long-past ones. One solution is to use a
**constant step size** parameter $0 < \alpha \leq 1$.

$$Q_{k+1}=Q_k + \alpha \biggr[r_{k+1}-Q_k\biggr]$$

$Q_0$ is the initial estimate.

$$Q_k=Q_{k-1} + \alpha \biggr[r_k - Q_{k-1}\biggr]\\
=(1-\alpha)^kQ_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$$

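The weights sum to one, $(1-\alpha)^k + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i}=1$, so $Q_k$ is a weighted average of $Q_0$ and the past rewards in which the weight on $r_i$ decays exponentially with its age $k-i$.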
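
The closed form can be checked numerically against the recursive update. In this sketch the function names, the step size, and the reward values are all made up for illustration:

```python
def constant_step_estimate(q0, rewards, alpha):
    """Apply Q <- Q + alpha * (r - Q) once per reward, starting from q0."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def recency_weights(alpha, k):
    """Weight on Q_0, followed by the weights on r_1, ..., r_k after k updates."""
    w0 = (1 - alpha) ** k
    return [w0] + [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]

alpha, q0 = 0.3, 5.0
rewards = [1.0, 0.0, 2.0, 4.0]
weights = recency_weights(alpha, len(rewards))

# Weighted sum of Q_0 and the rewards, using the closed-form weights.
closed_form = weights[0] * q0 + sum(w * r for w, r in zip(weights[1:], rewards))

print(sum(weights))  # approximately 1
print(closed_form, constant_step_estimate(q0, rewards, alpha))  # approximately equal
```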

### Reinforcement comparison

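Maintain a *reference reward* $\overline{r_t}$, an average of previously received rewards, against which each new reward can be judged as good or bad. Each action has a *preference* $p_t(a)$, and the selection probabilities are computed from the preferences by a softmax. These methods are sometimes more efficient than **action-value** methods.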

$$\pi_t(a)=\Pr\{ a_t = a\} = \frac{e^{p_t(a)}}{\sum_{b=1}^{n}e^{p_t(b)}}$$

$$p_{t+1}(a_t)=p_t(a_t)+\beta \biggr[r_t-\overline{r_t}\biggr]$$

$\beta$ is a positive step-size parameter. The update implements the idea that rewards above the reference should increase the probability of reselecting the action taken, and rewards below it should decrease that probability.

$$\overline{r_{t+1}}=\overline{r_t}+\alpha\biggr[r_t-\overline{r_t}\biggr], 0 < \alpha \leq 1$$
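
The three update equations fit together as in the following sketch. It assumes the bandit is a list of zero-argument functions that each return a reward (the same shape as the `bandit` list earlier in these notes), that preferences and the reference reward start at zero, and that both step sizes are $0.1$. These starting values, the function name, and the toy two-armed bandit are choices made for the example only.

```python
import math
import random

def reinforcement_comparison(bandit, plays, alpha=0.1, beta=0.1, rng=random):
    """Run reinforcement comparison; return final preferences and reference reward."""
    n = len(bandit)
    prefs = [0.0] * n  # p_t(a), assumed to start at 0
    ref = 0.0          # reference reward, assumed to start at 0
    for _ in range(plays):
        # Softmax over preferences (shifted by the max for numerical stability).
        m = max(prefs)
        weights = [math.exp(p - m) for p in prefs]
        a = rng.choices(range(n), weights=weights)[0]
        r = bandit[a]()
        prefs[a] += beta * (r - ref)  # raise preference if r beats the reference
        ref += alpha * (r - ref)      # then move the reference towards r
    return prefs, ref

# Toy bandit: arm 0 always pays 0, arm 1 always pays 1.
toy = [lambda: 0.0, lambda: 1.0]
prefs, ref = reinforcement_comparison(toy, plays=500, rng=random.Random(0))
print(prefs, ref)  # the preference for arm 1 ends up above that for arm 0
```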

In [2]:
# imports
from random import randint, random