# Reinforcement Learning

(https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html)

## 1. The Problem

### Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements for a reinforcement learning system:

* **a policy**: (stimulus-response rule) defines the learning agent's way of behaving at a given time. It maps perceived states of the environment to the actions to be taken in those states.
* **a reward function**: defines the goal in an RL problem. It maps each perceived state (or state/action pair) of the environment to a single number indicating the desirability of that state. An RL agent's sole objective is to maximize the total reward it receives in the long run. This function must be unalterable by the agent! Reward functions may be stochastic. They indicate what is good in an immediate sense, as they are basically given directly by the environment.
* **a value function**: specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Values are predictions of rewards and must be estimated and re-estimated from the sequences of observations an agent makes.
* **a model of the environment (optional)**: mimics the behavior of the environment. E.g., given a state and action, the model might predict the resultant next state and next reward. Models are used for *planning*.
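
To make the four elements concrete, here is a minimal sketch on a hypothetical three-state chain. The states, actions, reward values, and the names `policy`, `reward`, `model`, and `value` are all illustrative assumptions, not taken from the book. The value is computed here by rolling the model forward, which is a very simple form of planning.

```python
# Toy deterministic chain: states 0 -> 1 -> 2, where 2 is terminal.
# Everything here (states, actions, rewards) is a made-up illustration.

TERMINAL = 2

# Policy: maps each perceived state to the action to take there.
policy = {0: "right", 1: "right"}

def reward(state, action):
    """Reward function: immediate desirability of a state/action pair."""
    return 1.0 if (state == 1 and action == "right") else 0.0

def model(state, action):
    """Model: predicts the resulting next state and next reward."""
    next_state = min(state + 1, TERMINAL) if action == "right" else state
    return next_state, reward(state, action)

def value(state, max_steps=10):
    """Value: total reward accumulated from `state` when following `policy`."""
    total = 0.0
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        state, r = model(state, policy[state])
        total += r
    return total

print(value(0), value(1), value(2))
```

State 0 earns no immediate reward, yet its value equals that of state 1, because the reward collected one step later counts toward it. This is the difference between "good now" (reward) and "good in the long run" (value).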



## 2. Evaluative Feedback

### n-armed Bandit

**Learning problem**: Repeated choice among $n$ different actions. After each choice, a numerical reward is received, drawn from a stationary probability distribution that depends on the selected action.

**Objective**: Maximize the expected total reward over some period of time (e.g. 1000 action selections). Each action selection is called a *play*.

The **value** of an action is the *mean* reward received when that action is selected; it is unknown to the agent at the beginning. When *exploring*, the agent chooses a random action; when *exploiting*, it greedily chooses the action with the best estimated value.

In [4]:
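from random import uniform

# Each arm returns one reward, assumed here to be drawn uniformly
# from the given interval.
def h1():
    return uniform(-2, 2)
def h2():
    return uniform(0, 2)
def h3():
    return uniform(-1, 4)
bandit = [h1, h2, h3]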

n = len(bandit)
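
Assuming each arm above is meant to draw its reward uniformly from the given interval (an interpretation, since the cell does not say so), the true action values are simply the interval midpoints. The short sketch below works them out; the names `ARMS`, `true_values`, and `best_arm` are introduced here for illustration only.

```python
# Assumed reward intervals (low, high), read off the three arms above.
ARMS = [(-2, 2), (0, 2), (-1, 4)]

# The mean of a uniform distribution on [low, high] is the midpoint.
true_values = [(low + high) / 2 for low, high in ARMS]

# Index of the arm an all-knowing agent would always pull.
best_arm = max(range(len(ARMS)), key=lambda a: true_values[a])

print(true_values)
print(best_arm)
```

The agent never sees these numbers directly; it only sees sampled rewards and has to estimate the values from them.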


### Action-Value Methods

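The true value of action $a$ is denoted as $Q^*(a) = \mathbb{E}[r \mid a]$, the expected reward when $a$ is selected, and the estimated value at the $t$th play as $Q_t(a)$. If action $a$ has been chosen $k_a$ times before play $t$, yielding rewards $r_1, r_2, \ldots, r_{k_a}$, the estimate is the sample average:

$$Q_t(a)=\frac{r_1+r_2+\cdots+r_{k_a}}{k_a}$$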

If $k_a = 0$, we define $Q_t(a)$ to be some default value. As $k_a \rightarrow \infty$, $Q_t(a)$ converges to $Q^*(a)$ (**law of large numbers**).
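
The sample-average estimate, including the default for $k_a = 0$, can be sketched as follows. The helper name `sample_average`, the default of `0.0`, and the uniform test arm on $[-1, 4]$ are assumptions made for this example; the seed is fixed only so the run is repeatable.

```python
from random import Random

def sample_average(rewards, default=0.0):
    """Q_t(a): mean of the rewards received for one action, or `default` if none."""
    if not rewards:          # k_a = 0, nothing observed yet
        return default
    return sum(rewards) / len(rewards)

print(sample_average([]))               # no plays yet, so the default is returned
print(sample_average([1.0, 0.0, 2.0]))  # (1 + 0 + 2) / 3

# Convergence check on a hypothetical arm paying uniform rewards in [-1, 4],
# whose true value is the midpoint 1.5.
rng = Random(0)
rewards = [rng.uniform(-1, 4) for _ in range(50000)]
for k_a in (10, 1000, 50000):
    print(k_a, round(sample_average(rewards[:k_a]), 3))
```

With few plays the estimate can be far from 1.5; as $k_a$ grows, it settles near the true value.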

**Simple action selection rule**: On play $t$, greedily select an action $a^*$ for which $Q_t(a^*)=\max_a Q_t(a)$. This method always exploits current knowledge to maximize immediate reward.
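
As a small illustration, greedy selection is just an argmax over the current estimates. The function name `greedy` and the example estimates are hypothetical; ties are broken in favor of the lowest index, which is one possible choice among several.

```python
def greedy(Q):
    """Return the index of an action with the highest estimate (first on ties)."""
    return max(range(len(Q)), key=lambda a: Q[a])

Q_t = [0.2, 1.1, 0.9]   # hypothetical current estimates for three actions
a_star = greedy(Q_t)
print(a_star, Q_t[a_star])
```

One consequence is that an action whose estimate happens to start out low is never tried again, so a wrong estimate for it is never corrected.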

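**$\epsilon$-greedy method**: Do as the simple action selection rule above, but, every once in a while, with small probability $\epsilon$, select an action uniformly at random instead, independently of the current estimates.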
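
A minimal sketch of $\epsilon$-greedy play on a three-armed bandit, assuming the usual formulation in which the agent picks a uniformly random action with probability $\epsilon$ and the greedy action otherwise. The arms are assumed to pay uniform rewards on the intervals in `ARMS`; the function names, the default estimate of `0.0`, and the seed are choices made for this example only.

```python
from random import Random

def epsilon_greedy(Q, epsilon, rng):
    """Pick a random arm with probability epsilon, else a greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))               # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit

def run(arms, plays=1000, epsilon=0.1, seed=0):
    rng = Random(seed)
    Q = [0.0] * len(arms)   # estimates, default value 0 before any play
    k = [0] * len(arms)     # number of times each arm was played
    total = 0.0
    for _ in range(plays):
        a = epsilon_greedy(Q, epsilon, rng)
        low, high = arms[a]
        r = rng.uniform(low, high)
        k[a] += 1
        # Running form of the sample average: same result as sum(rewards) / k.
        Q[a] += (r - Q[a]) / k[a]
        total += r
    return Q, k, total

ARMS = [(-2, 2), (0, 2), (-1, 4)]   # assumed (low, high) reward intervals
Q, k, total = run(ARMS)
print("estimates:", [round(q, 2) for q in Q])
print("plays per arm:", k)
print("average reward:", round(total / sum(k), 3))
```

Setting `epsilon=0.0` recovers the purely greedy rule, and `epsilon=1.0` gives purely random play, so the parameter directly controls the trade-off between exploiting and exploring.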


In [2]:
# imports
from random import randint, random