# RL: Value function (Tabular case)

## MDPs

A Markov Decision Process (MDP) is a tuple <$\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$, $\gamma$>.  
$\mathcal{S}$ finite set of states.  
$\mathcal{A}$ finite set of actions.  
$\mathcal{P}$ state transition probability matrix.
$$\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' | S_t=s, A_t=a]$$
$\mathcal{R}$ reward function.
$$\mathcal{R}_s^a = \mathbb{E}[R_{t+1}|S_t=s, A_t=a]$$
$\gamma$ discount factor, $\gamma \in [0, 1]$.

Return $G_t$: total discounted reward from time-step $t$
$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
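
As a small illustration of the return, the sketch below computes $G_0$ for a finite list of rewards. The function name `discounted_return` and the reward list are hypothetical, chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return G_0 given rewards = [R_1, R_2, ..., R_T] (hypothetical helper)."""
    g = 0.0
    # Work backwards, using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 1 + 0.5 * 0 + 0.25 * 2
g0 = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```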

A policy $\pi$ is a distribution over actions given states.  
It defines the behaviour of the agent and depends only on the current state.
$$\pi (a|s) = \mathbb{P}[A_t=a|S_t=s]$$

Following policy $\pi$ on an MDP produces a Markov reward process <$\mathcal{S}$, $\mathcal{P}^\pi$, $\mathcal{R}^\pi$, $\gamma$>.
$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{P}_{ss'}^{a}$$
$$\mathcal{R}_{s}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{R}_{s}^{a}$$
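
A minimal sketch of these two averages, assuming the MDP is stored as nested lists `P[s][a][s2]`, `R[s][a]` and the policy as `pi[s][a]`. The 2-state, 2-action MDP below is made up for illustration.

```python
def induced_mrp(P, R, pi):
    """Average the MDP dynamics and rewards over the policy pi."""
    n_states, n_actions = len(P), len(P[0])
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_actions))
             for s2 in range(n_states)]
            for s in range(n_states)]
    R_pi = [sum(pi[s][a] * R[s][a] for a in range(n_actions))
            for s in range(n_states)]
    return P_pi, R_pi


# Hypothetical MDP: P[s][a] is a distribution over next states.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

P_pi, R_pi = induced_mrp(P, R, pi)
```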

State value function $v_\pi(s)$: Expected return starting from state $s$, following policy $\pi$.
$$v_\pi(s) = \mathbb{E}_\pi [G_t|S_t=s]$$

State-action value function $q_\pi(s, a)$: Expected return starting from state $s$, taking action $a$, and then following policy $\pi$.
$$q_\pi(s, a) = \mathbb{E}_\pi [G_t|S_t=s, A_t=a]$$

### Bellman Expectation Equations

$$v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1}) |S_t=s]$$  
$$q_\pi(s, a) = \mathbb{E}_\pi [R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) |S_t=s, A_t=a]$$

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)q_\pi(s, a)$$  
$$q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$  
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s'))$$  
$$q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a'|s')q_\pi(s', a')$$

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$$
$$v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$
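
The closed-form solution can be checked on a small example. The sketch below solves $(I - \gamma \mathcal{P}^\pi) v_\pi = \mathcal{R}^\pi$ with a plain Gauss-Jordan elimination, so it needs no external library. The helper names and the 2-state MRP are assumptions for illustration, and a direct solve like this is only practical for small state spaces.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_mrp(P_pi, R_pi, gamma):
    """Return v_pi = (I - gamma * P_pi)^{-1} R_pi."""
    n = len(R_pi)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R_pi)


# Hypothetical 2-state MRP.
v = evaluate_mrp([[0.5, 0.5], [0.25, 0.75]], [0.5, 1.0], gamma=0.5)
# v is approximately [9/7, 13/7]
```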

### Optimal value function

$v_*(s)$ optimal state-value function.
$$v_*(s) = \max_{\pi} v_\pi(s)$$  
$q_*(s, a)$ optimal action-value function.
$$q_*(s, a) = \max_{\pi} q_\pi(s, a)$$

Ordering over policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \forall s$.  
There always exists an optimal policy $\pi_*$: $\pi_* \geq \pi, \forall \pi$
$$v_{\pi_*}(s) = v_*(s)$$
$$q_{\pi_*}(s, a) = q_*(s, a)$$

$$
\pi_*(a|s) = 
\begin{cases}
    1 & \text{if } a = \text{arg}\max_{a \in \mathcal{A}} q_*(s, a)\\
    0 & \text{otherwise}
\end{cases}
$$

### Bellman Optimality Equations

$$v_*(s) = \max_{a \in \mathcal{A}} q_*(s, a)$$  
$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s')$$  
$$v_*(s) = \max_{a \in \mathcal{A}} (\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s'))$$  
$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a' \in \mathcal{A}} q_*(s', a')$$

## Dynamic Programming

### Iterative policy evaluation

$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s)(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s'))$$  

As $k \rightarrow \infty$, $v_k \rightarrow v_\pi$
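
A minimal sketch of this synchronous backup, assuming the MDP is stored as nested lists `P[s][a][s2]` and `R[s][a]`. The MDP, the stopping tolerance `tol` and the function name are hypothetical.

```python
def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Repeat the Bellman expectation backup until v stops changing."""
    n_states, n_actions = len(P), len(P[0])
    v = [0.0] * n_states
    while True:
        # Synchronous backup: v_new is computed entirely from the old v.
        v_new = [
            sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2]
                                                  for s2 in range(n_states)))
                for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if delta < tol:
            return v


# Hypothetical 2-state, 2-action MDP and a uniform random policy.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]

v_pi = policy_evaluation(P, R, pi, gamma=0.5)
# v_pi is approximately [9/7, 13/7]
```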

### Policy Iteration

- Evaluate policy $\pi$
- Improve $\pi$: $\pi' = \text{greedy}(v_\pi)$
- Repeat

$$
\pi'(a|s) = 
\begin{cases}
    1 & \text{if } a = \text{arg}\max_{a \in \mathcal{A}} \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')\\
    0 & \text{otherwise}
\end{cases}
$$

Converges to $\pi_*$
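
The evaluate/improve loop can be sketched as follows for deterministic policies, where `policy[s]` is an action index. The MDP, the helper names and the evaluation tolerance are assumptions for illustration.

```python
def q_value(P, R, v, gamma, s, a):
    """One-step lookahead: R_s^a + gamma * sum_s' P_ss'^a v(s')."""
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def evaluate(P, R, policy, gamma, tol=1e-10):
    """Iterative evaluation of a deterministic policy."""
    v = [0.0] * len(P)
    while True:
        v_new = [q_value(P, R, v, gamma, s, policy[s]) for s in range(len(P))]
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new
        v = v_new


def policy_iteration(P, R, gamma):
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states  # arbitrary initial policy
    while True:
        v = evaluate(P, R, policy, gamma)
        # Greedy improvement with respect to v_pi.
        greedy = [max(range(n_actions),
                      key=lambda a: q_value(P, R, v, gamma, s, a))
                  for s in range(n_states)]
        if greedy == policy:  # policy is stable, stop
            return policy, v
        policy = greedy


# Hypothetical 2-state, 2-action MDP.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

best_policy, v_star = policy_iteration(P, R, gamma=0.5)
# best_policy is [1, 0] and v_star is approximately [2.8, 3.6]
```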

### Value Iteration

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} (\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s'))$$ 

As $k \rightarrow \infty$, $v_k \rightarrow v_*$
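
A minimal sketch of value iteration under the same nested-list representation (`P[s][a][s2]`, `R[s][a]`). The MDP and the tolerance `tol` are hypothetical.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until v stops changing."""
    n_states, n_actions = len(P), len(P[0])
    v = [0.0] * n_states
    while True:
        v_new = [
            max(R[s][a] + gamma * sum(p * v[s2]
                                      for s2, p in enumerate(P[s][a]))
                for a in range(n_actions))
            for s in range(n_states)
        ]
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new
        v = v_new


# Hypothetical 2-state, 2-action MDP.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

v_star = value_iteration(P, R, gamma=0.5)
# v_star is approximately [2.8, 3.6]
```

Unlike policy iteration, no explicit policy is kept here. A greedy policy can be read off the final values with a one-step lookahead.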

### Asynchronous Dynamic Programming

- In-place: value iteration stores a single value function that is updated in place, with no separate copy per iteration $k$
- Prioritized sweeping: update the state with the largest remaining Bellman error
- Real-Time: update states visited by a real agent

Bellman error:
$$|\max_{a \in \mathcal{A}} (\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s')) - v(s)|$$

## Model-Free Prediction

### Monte-Carlo Policy Evaluation (MC)

$$V(s) = \frac{S(s)}{N(s)}$$
$S(s)$: sum of all returns $G_t$ from state $S_t=s$.  
$N(s)$: number of times state $s$ is visited.  
As $N(s) \rightarrow \infty$, $V(s) \rightarrow v_\pi(s)$

- First-visit MC: update $N$ and $S$ only the first time $s$ is visited in each episode
- Every-visit MC: update $N$ and $S$ every time $s$ is visited in each episode
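
The difference between the two variants shows up as soon as a state is revisited within an episode. In the sketch below, an episode is assumed to be a list of `(state, reward)` pairs, where the reward is the one received after leaving that state. This representation and the function name are hypothetical.

```python
def mc_evaluate(episodes, gamma, first_visit=True):
    """Estimate V(s) = S(s) / N(s) from complete episodes."""
    total, count = {}, {}
    for episode in episodes:
        # Compute the return G_t for every time-step, working backwards.
        g, returns = 0.0, []
        for _, r in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for (s, _), g_t in zip(episode, returns):
            if first_visit and s in seen:
                continue  # only the first visit counts in this episode
            seen.add(s)
            total[s] = total.get(s, 0.0) + g_t
            count[s] = count.get(s, 0) + 1
    return {s: total[s] / count[s] for s in total}


# State "A" is visited twice; returns are G_0 = 3, G_1 = 2, G_2 = 1.
episode = [("A", 1.0), ("B", 1.0), ("A", 1.0)]
v_first = mc_evaluate([episode], gamma=1.0, first_visit=True)
v_every = mc_evaluate([episode], gamma=1.0, first_visit=False)
# v_first == {"A": 3.0, "B": 2.0}, v_every == {"A": 2.0, "B": 2.0}
```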

Incremental updates:  
For each state $S_t$ with return $G_t$:
$$N(S_t) \leftarrow N(S_t) + 1$$
$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$$

Updates for non-stationary problems:
$$V(S_t) \leftarrow V(S_t) + \alpha(G_t - V(S_t))$$
Forget old episodes over time  
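
Both update rules can be compared on a stream of returns for a single state. This is only a sketch; the function names are made up for the example.

```python
def incremental_mean(returns):
    """Running mean with step-size 1 / N: equals the batch average."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v


def constant_step(returns, alpha, v0=0.0):
    """Constant step-size: recent returns weigh more than old ones."""
    v = v0
    for g in returns:
        v += alpha * (g - v)
    return v


mean_estimate = incremental_mean([1.0, 2.0, 3.0, 6.0])             # 3.0
recency_estimate = constant_step([0.0, 0.0, 10.0], alpha=0.5)      # 5.0
```

With $\alpha = 0.5$ the last return alone already moves the estimate halfway to 10, whereas the plain average of the same three returns would be about 3.3.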

### Temporal-Difference-Learning (TD)

### TD(0)

For each visited state $S_t$:
$$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$$  

TD target: $R_{t+1} + \gamma V(S_{t+1})$  
TD Error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
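
A single TD(0) update, written as a sketch with `V` stored in a dictionary. The `terminal` flag, which treats the value of a terminal state as 0, is an assumption added for the example.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # Assumption: the value of a terminal state is 0.
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha * delta
    return delta


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
# TD target = 1 + 0.9 * 1 = 1.9, so delta = 1.9 and V["A"] becomes 0.95
```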

MC must wait until the end of the episode to update its estimate.  
MC only works for terminating environments.  
TD can update its estimate after every step.

MC has high variance and no bias.  
TD has low variance and some bias.

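MC converges to the estimate that minimizes the mean squared error against the observed returns $G_t$.  
TD(0) converges to the solution of the maximum-likelihood Markov model that best fits the data.  
Usually TD is more effective than MC in Markov environments, and MC is more effective than TD in non-Markov environments.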

### n-step returns

$$G_t^{(n)} = \sum_{k = 1}^n \gamma^{k-1} R_{t+k} + \gamma^n V(S_{t+n})$$
$n = 1$: TD(0).  
$n = \infty$: MC.  
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t))$$
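
The sketch below computes the n-step return for a recorded episode. It assumes `rewards[k]` holds $R_{k+1}$ and `values[k]` holds $V(S_k)$, and it drops the bootstrap term when $t + n$ reaches the end of the episode, so a large $n$ gives the MC return. These conventions are assumptions for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for an episode with len(rewards) steps."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:
        # Bootstrap from the current estimate V(S_{t+n}).
        g += gamma ** n * values[t + n]
    return g


rewards = [1.0, 2.0, 3.0]      # R_1, R_2, R_3
values = [10.0, 20.0, 30.0]    # V(S_0), V(S_1), V(S_2)

g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)    # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g_mc = n_step_return(rewards, values, t=0, n=10, gamma=0.5)  # MC return
# g1 == 11.0, g2 == 9.5, g_mc == 2.75
```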

### $\lambda$-return

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}$$  
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\lambda} - V(S_t))$$

Can only be computed from complete episodes, very slow to compute

### Eligibility Traces

- Frequency heuristic: assign credit to the most frequent states
- Recency heuristic: assign credit to the most recent states

$$E_0(s) = 0$$
$$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbb{1}(S_t = s)$$

### TD($\lambda$)

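At each time-step $t$, compute the TD error and update every state $s$ in proportion to its eligibility trace:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
With offline updates, the sum of these updates over an episode is equivalent to the $\lambda$-return updates.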
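
A sketch of the backward view for one episode, using accumulating traces and online updates. Transitions are assumed to be `(s, r, s_next, done)` tuples, and the value of a terminal state is taken to be 0. Both conventions are assumptions for the example.

```python
def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Run backward-view TD(lambda) over one episode, updating V in place."""
    E = {s: 0.0 for s in V}  # eligibility traces, E_0(s) = 0
    for s, r, s_next, done in transitions:
        # Decay every trace, then bump the trace of the visited state.
        E = {x: gamma * lam * e for x, e in E.items()}
        E[s] += 1.0
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        # Every state is updated in proportion to its trace.
        for x in V:
            V[x] += alpha * delta * E[x]
    return V


episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]

V_lam = td_lambda_episode({"A": 0.0, "B": 0.0}, episode,
                          alpha=0.5, gamma=1.0, lam=0.5)
V_td0 = td_lambda_episode({"A": 0.0, "B": 0.0}, episode,
                          alpha=0.5, gamma=1.0, lam=0.0)
# V_lam == {"A": 0.25, "B": 0.5}, V_td0 == {"A": 0.0, "B": 0.5}
```

With $\lambda = 0.5$ the final reward is partly credited to the earlier state A through its trace, while $\lambda = 0$ reduces to TD(0) and leaves A unchanged in this episode.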

## Model-Free Control

On-Policy Learning: learn about policy $\pi$ from experience sampled from $\pi$.  
Off-Policy Learning: learn about policy $\pi$ from experience sampled from $\mu$.

### Monte-Carlo Policy Iteration

- Policy evaluation: $Q \approx q_\pi$
- $\epsilon$-greedy policy improvement

$\epsilon$-greedy policy:
- choose greedy action with probability $1 - \epsilon$
- choose random action with probability $\epsilon$

$$
\pi(a|s) = 
\begin{cases}
    \frac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \text{arg}\max_{a' \in \mathcal{A}} Q(s, a')\\
    \frac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$
where $m = |\mathcal{A}|$ is the number of actions.
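
A minimal sketch of the $\epsilon$-greedy distribution and of sampling from it, for one state whose action values are given as a list. The function names are hypothetical, and ties are broken in favour of the lowest action index.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy in one state."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw an action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(probs)), weights=probs)[0]


probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.3)
# probs is approximately [0.1, 0.8, 0.1]
```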

### Greedy in the Limit with Infinite Exploration (GLIE)

A GLIE policy satisfies:

$$\lim_{k \to \infty} N_k(s,a) = \infty$$
$$\lim_{k \to \infty} \pi_k(a|s) = \mathbb{1}(a = \text{arg}\max_{a' \in \mathcal{A}} Q(s, a'))$$  
$\epsilon$-greedy policy with $\epsilon = \frac{1}{k}$ is GLIE

### GLIE MC-Control

At the end of each episode $k$ sampled following policy $\pi \leftarrow \epsilon\text{-greedy}(Q)$, with $\epsilon = \frac{1}{k}$:  
For each state and action $S_t$ and $A_t$ in episode $k$:  

$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q(S_t, A_t))$$  

As $k \to \infty$, $Q(s, a) \to q_*(s, a)$.

### SARSA(0)

Same technique as MC control with an $\epsilon$-greedy policy.  
Replace MC update by TD(0) update.  

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$$

SARSA converges to the optimal policy if:

- policy $\pi_t(a|s)$ is GLIE
- step-size $\alpha_t$ follows a Robbins-Monro sequence:

$$\sum_{t=1}^\infty \alpha_t = \infty$$
$$\sum_{t=1}^\infty \alpha_t^2 < \infty$$  

In practice, might work with fixed $\alpha$ and $\epsilon$
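
A single SARSA update can be sketched as below, with `Q` stored in a dictionary keyed by `(state, action)`. The `done` flag, which treats the value after a terminal transition as 0, and the state and action labels are assumptions for the example.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """Apply one SARSA update to Q[(s, a)] in place and return the new value."""
    # On-policy target: uses the action a_next actually chosen in s_next.
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = {("A", "left"): 0.0, ("B", "right"): 2.0}
new_value = sarsa_update(Q, "A", "left", r=1.0, s_next="B", a_next="right",
                         alpha=0.1, gamma=0.5)
# target = 1 + 0.5 * 2 = 2, so Q[("A", "left")] moves from 0 to 0.2
```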

### n-step SARSA

n-step Q-return:

$$q_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1}R_{t+k} + \gamma^n Q(S_{t+n}, A_{t+n})$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t))$$

### $\lambda$-return

$$q_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}q_t^{(n)}$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^\lambda - Q(S_t, A_t))$$

### SARSA($\lambda$)

Eligibility traces for q-values:  
$$E_0(s,a) = 0$$
$$E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbb{1}(S_t = s, A_t = a)$$

For every state $S_t$ and action $A_t$ visited while following the $\epsilon$-greedy policy:  

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$  

For every state-action pair $s,a$:  

$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a)$$

### Off-Policy learning

Learn about target policy $\pi$ while following behaviour policy $\mu$. Useful for:

- Learning from observing other agents
- Reuse experience from old policies
- Learn about optimal policy while following exploratory policy

### Importance sampling

$$\mathbb{E}_{X \sim P}[f(X)] = \mathbb{E}_{X \sim Q}[\frac{P(X)}{Q(X)} f(X)]$$
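
The identity can be checked exactly on a small discrete example. In the sketch below, distributions are dictionaries mapping outcomes to probabilities. The names `target` and `proposal` stand for $P$ and $Q$ and, like the numbers, are made up for illustration.

```python
def expectation(dist, f):
    """E_{X ~ dist}[f(X)] for a discrete distribution."""
    return sum(p * f(x) for x, p in dist.items())


def importance_weighted_expectation(target, proposal, f):
    """E_{X ~ proposal}[target(X) / proposal(X) * f(X)].

    Requires proposal[x] > 0 wherever target[x] > 0.
    """
    return sum(q * (target[x] / q) * f(x) for x, q in proposal.items())


target = {0: 0.2, 1: 0.8}
proposal = {0: 0.5, 1: 0.5}


def f(x):
    return 10.0 * x


direct = expectation(target, f)                                   # 8.0
reweighted = importance_weighted_expectation(target, proposal, f)  # 8.0
```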

### Off-Policy Monte-Carlo

$$G_t^{\pi / \mu} = \frac{\prod_{k=t}^T \pi(A_k|S_k)}{\prod_{k=t}^T \mu(A_k|S_k)} G_t$$
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi / \mu} - V(S_t))$$  

The product of importance ratios over a whole episode has extremely high variance, so this rarely works in practice.
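
The importance-sampling correction multiplies the per-step ratios $\pi(A_k|S_k) / \mu(A_k|S_k)$ along the trajectory. The sketch below computes that product, assuming policies are stored as `policy[state][action]` dictionaries. The policies and trajectories are hypothetical.

```python
def importance_ratio(trajectory, pi, mu):
    """Product of pi(a|s) / mu(a|s) over the (state, action) pairs visited."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[s][a] / mu[s][a]
    return rho


pi = {"A": {0: 1.0, 1: 0.0}}   # deterministic target policy
mu = {"A": {0: 0.5, 1: 0.5}}   # uniform behaviour policy

rho_match = importance_ratio([("A", 0), ("A", 0)], pi, mu)     # 4.0
rho_mismatch = importance_ratio([("A", 0), ("A", 1)], pi, mu)  # 0.0
```

Here the weight doubles with every matching step and drops to zero on the first mismatch, which illustrates how the spread of the corrected return grows with episode length.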

### Off-Policy TD(0)

$$V(S_t) \leftarrow V(S_t) + \alpha \left(\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\right)$$  

Much lower variance than MC

### Q-Learning

Off policy control with TD(0):  

The next action $A_{t+1}$ is chosen from the behaviour policy $\mu$.  
The update moves toward an alternative successor action $A'$ chosen from the target policy $\pi$.

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t))$$

Choose the target policy $\pi$ to be greedy with respect to $Q(s, a)$ and the behaviour policy $\mu$ to be $\epsilon$-greedy with respect to $Q(s, a)$. The target then becomes a max over next actions:  

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t))$$  

$Q(s, a) \to q_*(s, a)$
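
A minimal sketch of tabular Q-learning on a made-up deterministic chain: states 0 and 1 plus a terminal state 2, with actions 0 (left) and 1 (right), and a reward of 1 for reaching the terminal state. The environment, the hyperparameters and the fixed seed are all assumptions chosen for illustration.

```python
import random


def step(s, a):
    """Hypothetical chain environment: returns (reward, next_state, done)."""
    s_next = s + 1 if a == 1 else max(s - 1, 0)
    done = s_next == 2
    return (1.0 if done else 0.0), s_next, done


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(3)]  # the terminal row stays at 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy mu: epsilon-greedy with respect to Q.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            r, s_next, done = step(s, a)
            # Target policy pi: greedy, hence the max over next actions.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
# For this chain q_* is: Q[1][1] = 1, Q[0][1] = 0.9, Q[0][0] = Q[1][0] = 0.81
```

Because this environment is deterministic, a fixed $\alpha$ is enough for the estimates to settle close to $q_*$. In stochastic environments a decaying step-size would normally be needed for convergence.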