# The Reinforcement Learning Problem
(https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html)

## The Agent-Environment Interface

* Agent and environment interact in discrete time steps $t=0,1,2,3,\dots$
* At each time step $t$ the agent receives some representation of the environment's *state*, $S_t\in S$, where $S$ is the set of possible states
* On that basis the agent selects an *action*, $A_t\in A(S_t)$, where $A(S_t)$ is the set of actions available in state $S_t$.
* At each step the agent follows a **policy** $\pi_t$, where $\pi_t(a\vert s)$ is the probability that $A_t=a$ if $S_t=s$
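
The interaction loop described above can be sketched in a few lines. This is only an illustration: `CoinFlipEnv`, `uniform_policy` and `run` are hypothetical names, and the two-state environment with its reward rule is invented for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical two-state environment, used only for illustration."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def actions(self, state):
        # A(s): the set of actions available in state s (here the same everywhere).
        return [0, 1]

    def step(self, action):
        # The environment, not the agent, computes the reward and the next state.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.choice([0, 1])
        return self.state, reward


def uniform_policy(state, actions):
    """pi(a|s): the same probability for every available action."""
    return {a: 1.0 / len(actions) for a in actions}


def run(env, policy, n_steps, seed=0):
    """Run the agent-environment loop and record (S_t, A_t, R_{t+1}) per step."""
    rng = random.Random(seed)
    state = env.state
    trajectory = []
    for _ in range(n_steps):
        probs = policy(state, env.actions(state))
        # Sample A_t according to pi(. | S_t).
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


trajectory = run(CoinFlipEnv(), uniform_policy, n_steps=5)
for s, a, r in trajectory:
    print(f"S_t={s}  A_t={a}  R_t+1={r}")
```

Keeping the reward inside `step` mirrors the requirement that rewards are computed by the environment and not by the agent.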


## Goals and Rewards

* The reward at each step is a simple number $R_t\in I\!R$
* It must be computed in the environment and not in the agent!


## Returns

The sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3},\dots$ $\Rightarrow$ The agent seeks to **maximize the expected return** $G_t$. In the simplest case the return is the sum of rewards:

$$G_t = R_{t+1}+R_{t+2}+...+R_T$$

where $T$ is the final time step (**episodic tasks**).

**Continuing tasks** (e.g. continual process control): Agent-environment interaction goes on without limit.
**Problem**: $T\rightarrow \infty$, so the plain sum of rewards can itself be infinite. Therefore we need **discounting** with a *discount rate* $0\leq \gamma \leq 1$:

$$G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$

* $\gamma<1$: the infinite sum has a finite value as long as the reward sequence is bounded
* $\gamma=0$: **myopic**, the agent is only concerned with the immediate reward

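Episodic and continuing tasks can be written in one unified notation, allowing either $T=\infty$ or $\gamma=1$ (but not both):

$$G_t=\sum_{k=0}^{T-t-1}\gamma^kR_{t+k+1}$$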
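
For a finite reward sequence the return can be computed directly. The sketch below is a minimal illustration with a made-up reward list; `discounted_return` is a hypothetical helper name. It works backwards through the rewards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from the sum above.

```python
def discounted_return(rewards, gamma):
    """G_t for rewards = [R_{t+1}, R_{t+2}, ..., R_T]."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative values only
print(discounted_return(rewards, gamma=1.0))  # 3.0, the undiscounted sum
print(discounted_return(rewards, gamma=0.5))  # 1.5
print(discounted_return(rewards, gamma=0.0))  # 1.0, myopic: only R_{t+1} counts
```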

## Markov Decision Process

In a finite MDP, given any state $s$ and action $a$, the **transition probability** of each possible next state $s'$ is:

$$p(s'\ \vert \ s,a)=\Pr\{S_{t+1}=s' \ \vert \ S_t=s,A_t=a\}$$

The **expected value** of the next reward:

$$r(s,a,s')=I\!E\biggr[ R_{t+1} \ \vert \ S_t=s,A_t=a,S_{t+1}=s'\biggr]$$
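
One possible way to store a finite MDP is as two arrays indexed by $(s, a, s')$. The sketch below assumes a hypothetical MDP with 2 states and 2 actions; all numbers are invented for illustration, and the array names `P` and `R` are a choice made here, not notation from the book.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (numbers chosen for illustration).
# P[s, a, s2] = p(s2 | s, a)   and   R[s, a, s2] = r(s, a, s2)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[1.0, 0.0], [0.5, 0.5]],
])
R = np.array([
    [[0.0, 1.0], [0.0, 2.0]],
    [[-1.0, 0.0], [0.0, 5.0]],
])

# Each p(. | s, a) must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)

# Expected immediate reward for each (s, a): sum over s2 of p(s2|s,a) * r(s,a,s2).
expected_reward = (P * R).sum(axis=2)
print(expected_reward)

# Sample one next state from state 0 under action 1.
rng = np.random.default_rng(0)
next_state = rng.choice(2, p=P[0, 1])
print(next_state)
```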

## Value Functions

A policy $\pi$ is a mapping from each state $s\in S$ and action $a \in A(s)$ to the probability $\pi(a\ \vert \ s)$ of taking action $a$ when in state $s$.

A value function estimates *how good* it is for the agent to be in a given state.
**State-value function** for policy $\pi$:

$$v_\pi(s)=I\!E_\pi \biggr[ G_t \ \biggr\vert \ S_t=s \biggr] = I\!E_\pi \biggr[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \ \biggr\vert \ S_t=s \biggr]$$


$$v_\pi(s)=I\!E_\pi\biggr[ G_t \ \biggr\vert \ S_t=s \biggr]\\
=\sum_a\pi(a\ \vert \ s)\sum_{s'}p(s'\ \vert \ s,a)\biggr[r(s,a,s')+\gamma v_\pi(s')\biggr]$$

which is the **Bellman equation** for $v_{\pi}$. It expresses a relationship between the value of a state and the values of its successor states.
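
One common way to use the Bellman equation numerically is to apply its right-hand side repeatedly as an update until the values stop changing (iterative policy evaluation). The following is a minimal sketch under stated assumptions: the 2-state, 2-action MDP and the uniform policy are invented for illustration, and `policy_evaluation` is a hypothetical helper name.

```python
import numpy as np


def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman equation for v_pi until the values stop changing.

    P[s, a, s2] = p(s2 | s, a), R[s, a, s2] = r(s, a, s2), pi[s, a] = pi(a | s).
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Inner sum: sum over s2 of p(s2|s,a) * (r(s,a,s2) + gamma * v[s2]).
        q = (P * (R + gamma * v)).sum(axis=2)
        # Outer sum: sum over a of pi(a|s) * (inner sum).
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical MDP and policy (numbers chosen for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[-1.0, 0.0], [0.0, 5.0]]])
pi = np.full((2, 2), 0.5)  # uniform random policy

v = policy_evaluation(P, R, pi, gamma=0.9)
print(v)
```

With $\gamma<1$ the update is a contraction, so the iteration converges from any starting values; the zero initialisation is just a convenient choice.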

**Action-value function** for policy $\pi$:

$$q_\pi(s,a)=I\!E_\pi \biggr[ G_t \ \biggr\vert \ S_t=s,A_t=a\biggr] =
I\!E_\pi\biggr[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \ \biggr\vert \ S_t=s, A_t=a \biggr]$$
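
The two value functions are closely linked: $q_\pi(s,a)$ is the inner sum of the Bellman equation, and averaging it over the policy gives back $v_\pi(s)$. The sketch below checks this numerically. It assumes a hypothetical 2-state, 2-action MDP with invented numbers, and it obtains $v_\pi$ by solving the Bellman equation as a linear system, which is one option among several.

```python
import numpy as np

# Hypothetical MDP and policy (numbers chosen for illustration).
# P[s, a, s2] = p(s2 | s, a), R[s, a, s2] = r(s, a, s2), pi[s, a] = pi(a | s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[-1.0, 0.0], [0.0, 5.0]]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

# State-to-state transition matrix and expected one-step reward under pi.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * (P * R).sum(axis=2)).sum(axis=1)

# v_pi solves the linear system v = r_pi + gamma * P_pi @ v.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# q_pi(s, a): take action a first, then follow pi from the next state on.
q_pi = (P * (R + gamma * v_pi)).sum(axis=2)

# Averaging q_pi over the policy recovers v_pi.
print(np.allclose((pi * q_pi).sum(axis=1), v_pi))
```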


### Optimal Value Functions

A policy $\pi$ is defined to be **better** than or equal to a policy $\pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s)$ for all $s\in S$.

*optimal state-value function*:
$$v_*(s)=\max_\pi v_\pi(s)\\
=\max_a I\!E_{\pi^*}\biggr[ G_t \ \vert \ S_t=s, A_t =a \biggr]\\
=\max_{a\in A(s)} \sum_{s'} p(s' \ \vert \ s, a) \biggr[ r(s,a,s') + \gamma v_*(s') \biggr]$$

*optimal action-value function*:
$$q_*(s,a)=\max_\pi q_\pi(s,a) \\
=I\!E\biggr[R_{t+1} + \gamma v_*(S_{t+1}) \ \vert \ S_t=s, A_t=a \biggr]\\
=\sum_{s'}p(s' \ \vert \ s, a) \biggr[ r(s,a,s')+\gamma \max_{a'}q_*(s',a')\biggr]$$
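
The optimality equations can be turned into an algorithm by using the last line of $v_*$ as an update rule, a technique usually called value iteration. The following is a minimal sketch, not a reference implementation: the 2-state, 2-action MDP is hypothetical with invented numbers, and `value_iteration` is a name chosen here.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Approximate v_* and q_* by iterating the Bellman optimality equation.

    P[s, a, s2] = p(s2 | s, a), R[s, a, s2] = r(s, a, s2).
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # q[s, a] = sum over s2 of p(s2|s,a) * (r(s,a,s2) + gamma * v[s2])
        q = (P * (R + gamma * v)).sum(axis=2)
        # v_new[s] = max over a of q[s, a]
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Recompute q from the final values so that v and q are consistent.
    q = (P * (R + gamma * v)).sum(axis=2)
    return v, q


# Hypothetical MDP (numbers chosen for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[-1.0, 0.0], [0.0, 5.0]]])

v_star, q_star = value_iteration(P, R, gamma=0.9)
greedy = q_star.argmax(axis=1)  # a policy that is greedy with respect to q_*
print(v_star)
print(greedy)
```

Acting greedily with respect to $q_*$ needs no knowledge of the transition probabilities at decision time, which is the main practical appeal of the action-value form.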

```python
# Example: Grid-world
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
```

