# CSE 204 Lab 13: Reinforcement Learning

<img src="https://raw.githubusercontent.com/adimajo/polytechnique-cse204-2019-releases/master/logo.jpg" style="float: left; width: 15%" />

[CSE204-2019](https://moodle.polytechnique.fr/course/view.php?id=7862) Lab session #13

Jérémie Decock

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adimajo/polytechnique-cse204-2019-releases/blob/master/lab_session_13/lab_session_13.ipynb)

[![My Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/adimajo/polytechnique-cse204-2019-releases/master?filepath=lab_session_13%2Flab_session_13.ipynb)

[![Local](https://img.shields.io/badge/Local-Save%20As...-blue)](https://github.com/adimajo/polytechnique-cse204-2019-releases/raw/master/lab_session_13/lab_session_13.ipynb)

## Introduction

The purpose of this lab is to introduce some classic concepts used
in reinforcement learning like *Dynamic Programming*, *Bellman's Principle of Optimality*, *Bellman equations*, *value search* and *policy search*.
$
\newcommand{\vs}[1]{\mathbf{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\mathbf{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\U{V}
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
$

**Notice**: there are some differences in notations with the lecture slides, especially with the transition function: $T(\state, \action, \state') \equiv \mathcal{P}(\state' | \state, \action)$. Here we also assume that the reward only depends on the state: $r(\state) \equiv \mathcal{R}(\state, \action, \state')$.

**Notice**: this notebook requires the OpenAI *Gym* library; you can install it with `pip install gym` (the next cell does this for you if you use the Google Colab environment).

In [None]:
colab_requirements = [
    "matplotlib>=3.1.2",
    "numpy>=1.18.1",
    "nose>=1.3.7",
    "gym>=0.15.4",
]
import sys, subprocess
def run_subprocess_command(cmd):
    # run the command
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    # print the output
    for line in process.stdout:
        print(line.decode().strip())
        
if "google.colab" in sys.modules:
    for i in colab_requirements:
        run_subprocess_command("pip install " + i)

You can uncomment the following cell to install gym in MyBinder or in your local environment (remove only the `#` not the `!`).

In [None]:
#!pip install gym

In [None]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

import math
import gym
import numpy as np
import copy
import pandas as pd
import seaborn as sns

In [None]:
sns.set_context("talk")

In [None]:
#matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

## Backward Induction

*Backward Induction* is a basic *Dynamic Programming* method **[BELLMAN57]**.
Like other Dynamic Programming algorithms, it uses *Bellman's
Principle of Optimality* **[BELLMAN57]** to accelerate computation (compared
to an exhaustive search). It can be applied to problems that exhibit a compatible structure, i.e., a problem that has *overlapping subproblems* or an *optimal substructure* **[BELLMAN57]**.
This acceleration is obtained by breaking problems down into simpler subproblems in such a manner
that redundant computations are avoided by storing results.
When applicable, the method takes far less time than naïve methods that don't take advantage of the subproblem overlap (like depth-first search).

*Backward Induction* computes non-stationary policies: a new policy is computed for each time step.
Thus the number of time steps used to solve the problem is set in advance.
*Backward Induction* algorithms solve Sequential Decision Making problems defined with
discrete action and state spaces.

The *value* (or *utility*) $\U^*$ for each state $\state$ at the last time step $T$ (here $T$ denotes the horizon, not the transition function $T(\state, \action, \state')$) is
$$
\U^*_T(\state) = r(\state) \label{eq:backward-induction-last-value} \tag{1}
$$
where $r$ is the immediate reward function.

The best expected value $\U^*$ for each state $\state$ at the $t^{\text{th}}$ time step is
$$
\U^*_t(\state) = r(\state) + \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U^*_{t+1}(\state') \right]  \label{eq:backward-induction-tth-value} \tag{2}
$$
and the $t^{\text{th}}$ optimal action (or decision) $d^*_t(\state)$ among the set of
possible actions $\actionset$ is
$$
d^*_t(\state) = \arg\max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U^*_{t+1}(\state') \right]  \label{eq:backward-induction-tth-decision} \tag{3}
$$
where $T$ is the transition function.

The main idea is to compute the expected value of each state
(Eq. \ref{eq:backward-induction-tth-value}) and then to use it to select the
best action for any given state (Eq. \ref{eq:backward-induction-tth-decision}).

Eq. \ref{eq:backward-induction-tth-value} cannot be solved analytically because
the system of equations to compute $V$ contains non-linear terms (due to the
"max" operator).
As an alternative, Eq. \ref{eq:backward-induction-tth-value}
is usually computed using a Dynamic Programming method, as described in Algorithm 1.

___
### Algorithm 1: Backward Induction

**Input**:<br>
$\quad$ $mdp = \langle \stateset, \actionset, T, r \rangle$, a Markov Decision Process <br>
$\quad$ $T$, the resolution horizon (i.e. the number of time steps) <br>
**Local variables**: <br>
$\quad$ $\U^*_t ~~ \forall t \in \{1, ..., T\}$, vectors of utilities for states in $\stateset$ <br>
<br>
$\U^*_T[\state] \leftarrow r(\state) ~~ \forall \state \in \stateset$ <br>
**for all** $t \in \{T-1, T-2, ..., 1\}$ **do** <br>
$\quad$ **for all** $\state \in \stateset$ **do** <br>
$\quad\quad$ **if** $\state$ is a final state **then** <br>
$\quad\quad\quad$ $\displaystyle \U^*_t[\state] \leftarrow r(\state)$ <br>
$\quad\quad$ **else** <br>
$\quad\quad\quad$ $\displaystyle \U^*_t[\state] \leftarrow r(\state) + \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state,\action,\state') \U^*_{t+1}[\state'] \right]$ <br>
$\quad\quad$ **end if** <br>
$\quad$ **end for** <br>
**end for** <br>
<br>
**return** $\U^*_t ~~ \forall t \in \{1, ..., T\}$
___
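
Algorithm 1 can be sketched in a few lines of NumPy. The block below is a minimal illustration under stated assumptions, not the lab's reference solution: the 3-state chain MDP, the function name `backward_induction` and the variable names are all hypothetical and chosen only for this example. The horizon is called `horizon` to avoid the clash with the transition function $T$.

```python
import numpy as np

def backward_induction(transition, reward, is_final, horizon):
    """Return (values, decisions), two arrays of shape (horizon, n_states).

    Row 0 corresponds to time step t=1 and row horizon-1 to t=T.
    """
    n_states = transition.shape[0]
    values = np.zeros((horizon, n_states))
    decisions = np.zeros((horizon, n_states), dtype=int)

    values[-1] = reward                          # Eq. (1): V*_T(s) = r(s)
    for t in range(horizon - 2, -1, -1):         # t = T-1, ..., 1
        q = transition @ values[t + 1]           # q[s, a] = sum_s' T(s,a,s') V*_{t+1}(s')
        decisions[t] = q.argmax(axis=1)          # Eq. (3)
        values[t] = reward + q.max(axis=1)       # Eq. (2)
        values[t, is_final] = reward[is_final]   # final states keep V(s) = r(s)
    return values, decisions

# Hypothetical deterministic chain 0 -> 1 -> 2; action 0 = stay, action 1 = move right.
transition = np.zeros((3, 2, 3))
transition[0, 0, 0] = transition[0, 1, 1] = 1.0
transition[1, 0, 1] = transition[1, 1, 2] = 1.0
transition[2, :, 2] = 1.0
reward = np.array([0.0, 0.0, 1.0])               # reward only in the final state 2
is_final = np.array([False, False, True])

values, decisions = backward_induction(transition, reward, is_final, horizon=3)
print(values)
print(decisions)
```

With a horizon of 3, state 0 can only collect the reward by moving right at the first step, which is why the first row of `decisions` selects action 1 in state 0.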

## Value Iteration

*Value Iteration* **[BELLMAN57]** is one of the most famous Dynamic Programming algorithms to compute the optimal policy for a Markov Decision Process (MDP).
Similarly to Backward Induction, the
main idea implemented by Value Iteration is to compute the best expected value of each state and then to use
these values to select the best action from any given state.

The main difference with the Backward Induction algorithm is that Value Iteration
is used to compute stationary policies.
Indeed, the same resulting policy is used for each time step and thus there is
no assumption about the number of time steps to consider for the solution.

The expected value $\U^{\pi}$ for each state $\state$ when the agent follows a
given (stationary) policy $\pi$ is 
$$
\U^{\pi}(\state) = E \left[ \sum^{\infty}_{t=0} \discount^t r(\state_t) | \pi, \state_0 = \state \right] \label{eq:vi-value-of-s-for-pi} \tag{4}
$$
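
To make the role of the discounted sum in Eq. 4 concrete, here is a small sketch that evaluates $\sum_t \discount^t r_t$ for one finite reward sequence. The helper name `discounted_return` and the reward sequence are hypothetical, and the sketch covers a single trajectory, not the expectation over trajectories.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episode: a single reward of 1 received at time step t=2.
rewards = [0.0, 0.0, 1.0]

ret_half = discounted_return(rewards, gamma=0.5)   # 0.5**2 * 1 = 0.25
ret_one = discounted_return(rewards, gamma=1.0)    # no discounting: 1.0
print(ret_half, ret_one)
```

The smaller the discount factor, the less a delayed reward contributes to the value of the starting state.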

The optimal (stationary) policy $\pi^*$ is defined using the best expected value $\U^{\pi^*}$ and using the principle of *Maximum Expected Utility* as follows
$$
\pi^*(\state) = \arg\max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U^{\pi^*}(\state') \right]  \label{eq:vi-optimal-policy} \tag{5}
$$

Eq. \ref{eq:vi-bellman-eq} is commonly called the *Bellman equation*; it gives the best
value we can expect for any given
state (assuming the optimal policy $\pi^*$ is
followed). There are $|\stateset|$ Bellman equations, one for each state.
As for the Backward Induction method,
this system of equations cannot be solved analytically because
Bellman equations contain non-linear terms (due to the
"max" operator).  As an alternative, Eq. \ref{eq:vi-bellman-eq}
can be computed iteratively using Value Iteration, a Dynamic Programming method
described in Algorithm 2.

\begin{equation}
    \U(\state) := \U^{\pi^*}(\state) = \left\{
    \begin{array}{l l}
        r(\state)                                                                                                                                 & \quad \text{if $\state$ is a final state} \\
        \displaystyle r(\state) + \discount \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U(\state') \right]    & \quad \text{otherwise}\\
    \end{array} \right.
    \label{eq:vi-bellman-eq} \tag{6}
\end{equation}

Equation \ref{eq:vi-bellman-update} -- called *Bellman update* -- is
used in the iterative method described in Algorithm 2, to update $\U$ at each iteration.

\begin{equation}
    \U_{i+1}(\state) \leftarrow \left\{
    \begin{array}{l l}
        r(\state)                                                                                                                                   & \quad \text{if $\state$ is a final state} \\
        \displaystyle r(\state) + \discount \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U_i(\state') \right]    & \quad \text{otherwise}\\
    \end{array} \right.
    \label{eq:vi-bellman-update} \tag{7}
\end{equation}

___
### Algorithm 2: Value Iteration

**Input**:<br>
$\quad$ $mdp = \langle \stateset, \actionset, T, r \rangle$, a Markov Decision Process <br>
$\quad$ $\discount$, the discount factor <br>
$\quad$ $\epsilon$, the maximum error allowed in the utility of any state in an iteration <br>
**Local variables**: <br>
$\quad$ $\U, \U'$, old and new vectors of utilities for states in $\stateset$, initially zero <br>
$\quad$ $\delta$, the maximum change in the utility of any state in an iteration <br>
<br>
**repeat** <br>
$\quad$ $\U \leftarrow \U'$ <br>
$\quad$ $\delta \leftarrow 0$ <br>
$\quad$ **for all** $\state \in \stateset$ **do** <br>
$\quad\quad$ **if** $\state$ is a final state **then** <br>
$\quad\quad\quad$ $\displaystyle \U'[\state] \leftarrow r[\state]$ <br>
$\quad\quad$ **else** <br>
$\quad\quad\quad$ $\displaystyle \U'[\state] \leftarrow r[\state] + \discount \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state,\action,\state') \U[\state'] \right]$ <br>
$\quad\quad$ **end if** <br>
$\quad\quad$ **if** $|\U'[\state] - \U[\state]| > \delta$ **then** <br>
$\quad\quad\quad$ $\delta \leftarrow |\U'[\state] - \U[\state]|$ <br>
$\quad\quad$ **end if** <br>
$\quad$ **end for** <br>
**until** $\delta < \epsilon(1-\discount)/\discount$ <br>
<br>
**return** $\U$
___
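
The Bellman update and the stopping test of Algorithm 2 can be sketched as follows. This is a minimal illustration on a hypothetical 3-state chain MDP, not the solution expected for the exercises; the names `value_iteration_sketch`, `transition`, `reward` and `is_final` are assumptions made for this example. Unlike Algorithm 2, which returns the previous vector $\U$, the sketch returns the most recent one.

```python
import numpy as np

def value_iteration_sketch(transition, reward, is_final, gamma=0.9, epsilon=1e-6):
    """Iterate the Bellman update until the largest change is small enough."""
    v = np.zeros(transition.shape[0])
    while True:
        q = transition @ v                        # q[s, a] = sum_s' T(s,a,s') V(s')
        v_new = reward + gamma * q.max(axis=1)    # Bellman update (Eq. 7)
        v_new[is_final] = reward[is_final]        # final states: V(s) = r(s)
        delta = np.abs(v_new - v).max()           # largest change in this sweep
        v = v_new
        if delta < epsilon * (1 - gamma) / gamma:
            return v

# Hypothetical deterministic chain 0 -> 1 -> 2; action 0 = stay, action 1 = move right.
transition = np.zeros((3, 2, 3))
transition[0, 0, 0] = transition[0, 1, 1] = 1.0
transition[1, 0, 1] = transition[1, 1, 2] = 1.0
transition[2, :, 2] = 1.0
reward = np.array([0.0, 0.0, 1.0])
is_final = np.array([False, False, True])

v = value_iteration_sketch(transition, reward, is_final, gamma=0.9)
greedy = (transition @ v).argmax(axis=1)          # Eq. (5): one-step look-ahead
print(v)        # approximately [0.81, 0.9, 1.0]
print(greedy)
```

On this chain the values are the discounted distance to the rewarding state ($0.9^2$, $0.9$ and $1$), and the greedy policy moves right in states 0 and 1.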

### Convergence

The convergence of Value Iteration has been proved, but this convergence is asymptotic **[BELLMAN57]**.
However, each iteration is easy and fast to compute.
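
The stopping test used in Algorithm 2 follows from a standard bound, stated here without proof and valid for $\discount < 1$. Writing $\|\U\| = \max_{\state} |\U(\state)|$ and $\U^*$ for the fixed point of the Bellman update (this norm notation is introduced only for this remark), the update is a contraction, and the triangle inequality gives an error bound in terms of the last change:

$$
\|\U_{i+1} - \U^*\| \le \discount \, \|\U_i - \U^*\|
\quad \Longrightarrow \quad
\|\U_{i+1} - \U^*\| \le \frac{\discount}{1 - \discount} \, \|\U_{i+1} - \U_i\|
$$

Hence $\|\U_{i+1} - \U_i\| < \epsilon (1 - \discount) / \discount$ guarantees $\|\U_{i+1} - \U^*\| < \epsilon$.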

## Policy Iteration

*Policy Iteration* **[HOWARD60]** is another popular Dynamic Programming algorithm to
compute the optimal policy of an MDP. In practice, it is often faster than Value Iteration.

The Policy Iteration algorithm alternates the following two steps, starting with an initial policy $\pi_0$:
1. Policy Evaluation: given a policy $\pi_i$, compute $\U^{\pi_i}(\state) ~ \forall \state \in \stateset$, the expected value of each state when $\pi_i$ is followed.
2. Policy Improvement: compute a new policy $\pi_{i+1}$, using one-step look-ahead based on $\U^{\pi_i}$ and using the principle of *Maximum Expected Utility* as follows
$$
\pi_{i+1}(\state) = \arg\max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state, \action, \state') \U^{\pi_i}(\state') \right]  \label{eq:pi-policy-improvement} \tag{8}
$$

Algorithm 3 describes the two-step procedure.
The algorithm terminates when the *Policy Improvement* step yields no change in the policy.

___
### Algorithm 3: Policy Iteration

**Input**:<br>
$\quad$ $mdp = \langle \stateset, \actionset, T, r \rangle$, an MDP <br>
**Local variables**: <br>
$\quad$ $\U$, vector of utilities for states in $\stateset$, initially zero <br>
$\quad$ $\pi$, a policy vector indexed by state, initially random <br>
<br>
**repeat** <br>
$\quad$ $\U \leftarrow \mbox{POLICY-EVALUATION}(\pi, \U, mdp)$ <br>
$\quad$ unchanged $\leftarrow$ true <br>
$\quad$ **for all** $\mbox{state} ~ \state \in \stateset$ **do** <br>
$\quad\quad$ **if** $\displaystyle \max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state,\action,\state') \U[\state'] \right] > \sum_{\state' \in \stateset} T(\state,\pi[\state],\state') \U[\state']$ **then** <br>
$\quad\quad\quad$ $\displaystyle \pi[\state] \leftarrow \arg\max_{\action \in \actionset} \left[ \sum_{\state' \in \stateset} T(\state,\action,\state') \U[\state'] \right]$ <br>
$\quad\quad\quad$ unchanged $\leftarrow$ false <br>
$\quad\quad$ **end if** <br>
$\quad$ **end for** <br>
**until** unchanged <br>
<br>
**return** $\pi$
___
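
A compact sketch of Algorithm 3 is given below, under stated assumptions: POLICY-EVALUATION is implemented here by solving a linear system with `np.linalg.solve` (one possible choice among others), the 3-state chain MDP is hypothetical, and the small tolerance `1e-12` in the improvement test is an addition of this sketch to avoid switching actions on floating-point ties.

```python
import numpy as np

def evaluate_policy(transition, reward, is_final, policy, gamma):
    """Solve (I - gamma * T_pi) V = r for the fixed policy."""
    n_states = transition.shape[0]
    t_pi = transition[np.arange(n_states), policy]   # T(s, pi(s), s'), a copy
    t_pi[is_final] = 0.0                             # final states: V(s) = r(s)
    return np.linalg.solve(np.eye(n_states) - gamma * t_pi, reward)

def policy_iteration_sketch(transition, reward, is_final, gamma=0.9):
    n_states = transition.shape[0]
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    while True:
        v = evaluate_policy(transition, reward, is_final, policy, gamma)
        q = transition @ v                           # q[s, a] = sum_s' T(s,a,s') V(s')
        unchanged = True
        for s in range(n_states):
            if q[s].max() > q[s, policy[s]] + 1e-12: # strictly better action found
                policy[s] = q[s].argmax()
                unchanged = False
        if unchanged:
            return policy, v

# Hypothetical deterministic chain 0 -> 1 -> 2; action 0 = stay, action 1 = move right.
transition = np.zeros((3, 2, 3))
transition[0, 0, 0] = transition[0, 1, 1] = 1.0
transition[1, 0, 1] = transition[1, 1, 2] = 1.0
transition[2, :, 2] = 1.0
reward = np.array([0.0, 0.0, 1.0])
is_final = np.array([False, False, True])

policy, v = policy_iteration_sketch(transition, reward, is_final)
print(policy)   # [1 1 0]
print(v)        # approximately [0.81, 0.9, 1.0]
```

Starting from the "always stay" policy, the sketch needs three evaluation steps on this chain: the improvement first reaches state 1, then state 0, and the third pass changes nothing.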

Solving the POLICY-EVALUATION routine is much simpler than solving the standard
Bellman equations (which is what Value Iteration does).  Indeed, the action in each
state is fixed by the policy, thus the "max" operator disappears and Bellman
equations become linear.
As a result, $\U^{\pi_i}$ can be computed by solving the linear system of these
*simplified Bellman equations* (Eq. \ref{eq:pi-simplified-bellman}) for
each state.

\begin{equation}
    \U^{\pi_i}(\state) = \left\{
    \begin{array}{l l}
        r(\state)                                                                                                          & \quad \text{if $\state$ is a final state} \\
        \displaystyle r(\state) + \discount \sum_{\state' \in \stateset} T(\state, \pi_i(\state), \state') \U^{\pi_i}(\state')    & \quad \text{otherwise}\\
    \end{array} \right.
    \label{eq:pi-simplified-bellman} \tag{9}
\end{equation}
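
In matrix form, the linearity of Eq. 9 becomes explicit. The symbols below are notation introduced only for this remark: $\mathbf{v}$ and $\mathbf{r}$ are the vectors of values and rewards, and $\mathbf{T}^{\pi_i}$ is the $|\stateset| \times |\stateset|$ matrix whose row for a non-final state $\state$ contains $T(\state, \pi_i(\state), \state')$ and whose rows for final states are zero.

$$
\mathbf{v} = \mathbf{r} + \discount \, \mathbf{T}^{\pi_i} \mathbf{v}
\quad \Longleftrightarrow \quad
(\mathbf{I} - \discount \, \mathbf{T}^{\pi_i}) \, \mathbf{v} = \mathbf{r}
$$

For $\discount < 1$ the matrix $\mathbf{I} - \discount \, \mathbf{T}^{\pi_i}$ is invertible, so the system has a unique solution that a standard linear solver can compute.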

### Convergence

As the number of states and policies is finite, and as the policy is improved
at each iteration, Policy Iteration converges in a finite number of iterations (often
small in practice).
However, within each iteration, solving the
POLICY-EVALUATION routine may cost a lot (its complexity is $O(|\stateset|^3)$).

## Hands on OpenAI Gym and the FrozenLake toy problem

To focus on the algorithms, we will use standard environments provided by the OpenAI Gym framework.
OpenAI Gym provides controllable environments (https://gym.openai.com/envs/) for research in Reinforcement Learning.
We will use a simple toy problem to illustrate the properties of Dynamic Programming algorithms.

**Task:** read https://gym.openai.com/docs/ to discover Gym and get familiar with its main concepts.

In this lab, we will try to solve the FrozenLake-v0 environment (https://gym.openai.com/envs/FrozenLake-v0/).
Additional information is available [here](https://github.com/openai/gym/wiki/FrozenLake-v0).

**Notice**: this environment is *fully observable*, thus here the terms (environment) *state* and (agent) *observation* are equivalent.
This is not always the case: for example, in poker, the agent doesn't know the opponent's cards.

### Get the FrozenLake state space and action space

In [None]:
env = gym.make('FrozenLake-v0')

Possible states in FrozenLake are:

In [None]:
states = list(range(env.observation_space.n))
states

Possible actions are:

In [None]:
actions = list(range(env.action_space.n))
actions

The following dictionary may be used to understand actions:

In [None]:
action_labels = {
    0: "Move Left",
    1: "Move Down",
    2: "Move Right",
    3: "Move Up"
}

### Display functions

The next cells contain functions that can be used to display states, transitions and policies with the FrozenLake environment.

In [None]:
def states_display(state_seq, title=None, figsize=(5,5), annot=True, fmt="0.1f", linewidths=.5, square=True, cbar=False, cmap="Reds"):
    size = int(math.sqrt(len(state_seq)))
    state_array = np.array(state_seq)
    state_array = state_array.reshape(size, size)
    
    fig, ax = plt.subplots(figsize=figsize)         # Sample figsize in inches
    sns.heatmap(state_array, annot=annot, fmt=fmt, linewidths=linewidths, square=square, cbar=cbar, cmap=cmap)
    plt.title(title)
    plt.show()

In [None]:
def transition_display(state, action):
    states_display(transition_array[state,action], title="Transition probabilities for action {} ({}) in state {}".format(action, action_labels[action], state))

In [None]:
def display_policy(policy):
    actions_src = ["{}={}".format(action, action_labels[action].replace("Move ", "")) for action in actions]
    title = "Policy (" + ", ".join(actions_src) + ")"
    states_display(policy, title=title, fmt="d", cbar=False, cmap="Reds")

### Make the `is_final_array`, `reward_array` and `transition_array`

To implement Dynamic Programming algorithms, we need the transition probability (or transition function) and the reward function, both defined in `env.P`.

`env.P[S][A]` gives the list of reachable states from state S executing action A.

These reachable states are coded in a tuple defined like this: `(probability, next state, reward, is_final_state)`.
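
To illustrate this tuple format without depending on Gym, the sketch below builds a transition array from a small hand-written dictionary that mimics the layout of `env.P`. The dictionary `toy_P` is hypothetical (2 states, 1 action) and is not taken from FrozenLake.

```python
import numpy as np

# Hypothetical stand-in for env.P: toy_P[state][action] is a list of
# (probability, next state, reward, is_final_state) tuples.
toy_P = {
    0: {0: [(0.7, 0, 0.0, False), (0.3, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}

toy_transition = np.zeros((2, 1, 2))      # indexed as [state, action, next state]
for state, by_action in toy_P.items():
    for action, outcomes in by_action.items():
        for probability, next_state, reward, is_final in outcomes:
            toy_transition[state, action, next_state] += probability

# Each (state, action) pair defines a probability distribution over next states.
row_sums = toy_transition.sum(axis=2)
print(toy_transition)
print(row_sums)
```

Checking that every row sums to 1 is a cheap sanity test for a transition array built this way.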

You will not need to use `env.P` to solve exercises.
In the following cell, `is_final_array`, `reward_array` and `transition_array` are defined for convenience.

In [None]:
is_final_array = np.full(shape=len(states), fill_value=False, dtype=bool)
reward_array = np.full(shape=len(states), fill_value=-np.inf)                # -np.inf = negative infinity
transition_array = np.zeros(shape=(len(states), len(actions), len(states)))

for state in states:
    for action in actions:
        for next_state_tuple in env.P[state][action]:              # env.P[state][action] contains the next states list (a list of tuples)
            transition_probability, next_state, next_state_reward, next_state_is_final = next_state_tuple

            is_final_array[next_state] = next_state_is_final
            reward_array[next_state] = max(reward_array[next_state], next_state_reward)   # workaround: the reward is 0 when staying in state 15 (in practice this never happens, as the simulation stops on reaching state 15 or any other terminal state)
            transition_array[state, action, next_state] += transition_probability

In [None]:
def reachable_states(state, action):
    return np.nonzero(transition_array[state, action])[0]

The following plot shows the state ID corresponding to each square of the FrozenLake grid.

In [None]:
states_display(states, fmt="d", title="States ID")

The following plot shows the reward obtained in each square of the FrozenLake grid.

In [None]:
states_display(reward_array, title="Rewards")

The following plot shows whether a square is a final state or not (i.e. whether it ends the simulation or not).

In [None]:
states_display(is_final_array, fmt="d", title="Final states")

The following cells show how to display transitions with the provided `transition_display` function. The numbers displayed in the squares are the probabilities of reaching these squares from the given (`state`, `action`) pair. Colored squares are the states that may be reached from this pair (i.e. with a non-zero probability).

In [None]:
transition_display(state=0, action=0)

In [None]:
transition_display(state=6, action=0)

In [None]:
transition_display(state=6, action=1)

## Exercise 1: Implement the Value Iteration algorithm

To solve the FrozenLake-v0 problem with Dynamic Programming, we will first use the Value Iteration algorithm described in Algorithm 2.

Notice that the FrozenLake-v0 environment is non-deterministic.
To implement Value Iteration, you will need the transition probability (or the transition function) defined in `transition_array`.
- Use `reachable_states(S, A)` to get the list of reachable states from state `S` executing action `A`.
- Use `transition_array[S, A]` to get the probability of reaching each state from state `S` executing action `A`.
- Use `transition_array[S, A, S']` to get the probability of reaching state `S'` from state `S` executing action `A`.

You will also need the previously defined `is_final_array` array.
- Use `is_final_array[S]` to know whether `S` is a final state (`True`) or not (`False`).

Finally, you will need the previously defined `reward_array` array.
- Use `reward_array[S]` to get the reward obtained by the agent each time it reaches state `S`.

In the following cell, we define `expected_value` and `expected_values` functions for convenience.
The first one returns the expected value of the next state
$$\sum_{\state' \in \stateset} T(\state, \action, \state') \U(\state')$$
for a given pair $(\state, \action)$ and a given V-table (value function) $\U$.
The second one computes this expected value for all the actions in $\state$.

In [None]:
def expected_value(state, action, v_array):
    return (transition_array[state, action] * v_array).sum() # compute sum(T(s,a,s').V(s'))

In [None]:
def expected_values(state, v_array):
    return (transition_array[state] * v_array).sum(axis=1)   # compute sum(T(s,a,s').V(s')) for all the actions
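
The broadcasting used by these two helpers can be checked on a tiny example. The arrays `toy_transition` and `toy_v` below are hypothetical (2 states, 2 actions) and only serve to show the shapes involved.

```python
import numpy as np

# toy_transition[s, a, s'] for a hypothetical MDP with 2 states and 2 actions.
toy_transition = np.array([[[0.5, 0.5], [0.0, 1.0]],
                           [[1.0, 0.0], [0.2, 0.8]]])
toy_v = np.array([10.0, 20.0])            # V(s') for s' = 0, 1

# Same computation as in `expected_values` for state 0: toy_v is broadcast
# along the action axis, then the products are summed over next states.
per_action = (toy_transition[0] * toy_v).sum(axis=1)

# The matrix product gives the same quantity for every (state, action) pair.
all_pairs = toy_transition @ toy_v
print(per_action)    # [15. 20.]
print(all_pairs)
```

In state 0, action 0 leads to either state with probability 0.5, hence an expected value of 15, while action 1 always leads to state 1, hence 20.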

### Question 1: Implement the Value Iteration algorithm (compute the *value function* `v_array`)

**Note**: here we use the `states_display` function to show the evolution of the value function `v_array` over iterations.

In [None]:
stop = False

value_function_history = []
delta_history = []

def value_iteration(gamma=0.95, epsilon=0.001, display=False):
    v_array = np.zeros(len(states))   # Initial value function
    stop = False

    while not stop:
        if display:
            states_display(v_array, title="Value function", cbar=True, cmap="Reds")
        else:
            print('.', end="")
        value_function_history.append(v_array.copy())   # store a snapshot, in case v_array is later updated in place
        
        delta = 0.
        
        # TODO...
        
        delta_history.append(delta)
        
        if delta < epsilon:
            stop = True
    
    return v_array
        
v_array = value_iteration(display=True)
states_display(v_array, title="Value function", cbar=True, cmap="Reds")

### Display the evolution of the value function over iterations

In [None]:
df_v_hist = pd.DataFrame(value_function_history)
df_v_hist

Evolution of `v_array` (the estimated value of each state) over iterations (one curve per state):

In [None]:
df_v_hist.plot()
plt.title("V(s) w.r.t iteration")
plt.ylabel("V(s)")
plt.xlabel("iteration")
plt.legend(loc='upper right');

Evolution of `delta` over iterations:

In [None]:
plt.plot(delta_history)
plt.yscale("log")
plt.title(r"$\max~\delta$ w.r.t iteration")
plt.ylabel(r"$\max~\delta$")
plt.xlabel("iteration");

### Question 2: Define the greedy policy (Maximum Expected Utility)

In [None]:
def greedy_policy(state, v_array):
    
    # TODO...
    
    return policy

### Display the optimized policy

Applying the `greedy_policy` on each state gives us the policy matrix:

In [None]:
policy = [greedy_policy(state, v_array) for state in states]

The following cell gives us a graphical representation of the optimal policy we have computed. The figure in each square is the optimal action to execute in the corresponding state (0 = "move left", 1 = "move down", 2 = "move right", 3 = "move up").

In [None]:
display_policy(policy)

### Evaluate Value Iteration with Gym (single trial)

So far, we have computed the value function `v_array` and derived a greedy policy from it.
The environment is stochastic, so applying this policy several times may give different results.
To measure the performance of the policy induced by `v_array`, we should therefore run it over many episodes and count the number of successful ones.
OpenAI considers that an agent successfully solves the FrozenLake problem if it reaches a 76% success rate over the last 100 trials (or "episodes").

In [None]:
env._max_episode_steps = 1000

In [None]:
reward_list = []

NUM_EPISODES = 1000

for episode_index in range(NUM_EPISODES):
    state = env.reset()
    done = False
    #t = 0

    while not done:
        action = greedy_policy(state, v_array)
        state, reward, done, info = env.step(action)
        #t += 1

    reward_list.append(reward)
    #print("Episode finished after {} timesteps ; reward = {}".format(t, reward))

print(sum(reward_list) / NUM_EPISODES)            

env.close()

### Question 3: What do you think the discount factor $\gamma$ is for?

TODO...

### Evaluate Value Iteration for different values of $\gamma$ with confidence interval (bootstrap)

In [None]:
%%time

NUM_EPISODES = 1000

reward_list = []

for gamma in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.84, 0.9, 0.95, 0.99):
    v_array = value_iteration(gamma=gamma)
    
    for episode_index in range(NUM_EPISODES):
        state = env.reset()
        done = False

        while not done:
            action = greedy_policy(state, v_array)
            state, reward, done, info = env.step(action)

        reward_list.append({"gamma": gamma, "reward": reward})

env.close()

In [None]:
df = pd.DataFrame(reward_list)
df.tail()

In [None]:
# Plot mean reward (with its 95% confidence interval)

sns.relplot(x="gamma", y="reward", kind="line", data=df, height=6, aspect=1.5)
plt.axhline(0.76, color="red", linestyle=":", label="76% success threshold");   # 76% success threshold
plt.legend();
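
The shaded band in the plot above is a confidence interval around the mean reward; seaborn's line plots compute it by bootstrapping by default. The sketch below shows the idea with plain NumPy, as a minimal illustration that is not seaborn's exact implementation. The function name `bootstrap_mean_ci` and the outcome vector `fake_rewards` are hypothetical: the latter is made-up data, not a measured result.

```python
import numpy as np

def bootstrap_mean_ci(samples, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Draw n_resamples datasets of the same size, sampling with replacement.
    idx = rng.integers(0, len(samples), size=(n_resamples, len(samples)))
    means = samples[idx].mean(axis=1)
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(means, [tail, 100 - tail])
    return samples.mean(), lower, upper

# Made-up outcomes: 740 successful episodes out of 1000.
fake_rewards = np.array([1.0] * 740 + [0.0] * 260)
mean, lower, upper = bootstrap_mean_ci(fake_rewards)
print(mean, lower, upper)
```

With 1000 episodes the interval is a few percentage points wide, which is worth keeping in mind when comparing a success rate against a fixed threshold.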

### Display the Value Iteration optimal policy with respect to $\gamma$

In [None]:
for gamma in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.84, 0.9, 0.95, 0.99):
    print()
    print("=" * 10, "GAMMA = ", gamma, "=" * 10)
    print()
    
    v_array = value_iteration(gamma=gamma)
    
    print()
    print()
    
    policy = [greedy_policy(state, v_array) for state in states]
    display_policy(policy)

## Exercise 2: Implement the Policy Iteration algorithm (Bonus)

An alternative approach to Value Iteration (Exercise 1) is Policy Iteration (described in Algorithm 3).

**Task:** implement Policy Iteration (for the same environment). Note that as part of this task you should also implement policy evaluation. Compare the policies obtained by both approaches (they should be the same).

### Question 1: Define the (exact) Policy Evaluation function

In [None]:
def policy_evaluation(policy, gamma):
    
    # TODO...
    
    return x

### Question 2: Define the Policy Improvement function

In [None]:
def policy_iteration(gamma, initial_policy=None, policy_evaluation_function=policy_evaluation):
    
    # TODO...

    return policy

In [None]:
gamma = 0.99

policy = policy_iteration(gamma=gamma, policy_evaluation_function=policy_evaluation)

display_policy(policy)

### Evaluate Policy Iteration with Gym (single trial)

In [None]:
env._max_episode_steps = 1000

In [None]:
reward_list = []

NUM_EPISODES = 1000

for episode_index in range(NUM_EPISODES):
    state = env.reset()
    done = False
    #t = 0

    while not done:
        action = policy[state]      # Follow the computed policy
        state, reward, done, info = env.step(action)
        #t += 1

    reward_list.append(reward)
    #print("Episode finished after {} timesteps ; reward = {}".format(t, reward))

print(sum(reward_list) / NUM_EPISODES)            

env.close()

### Evaluate Policy Iteration for different $\gamma$ with confidence interval (bootstrap)

In [None]:
%%time

NUM_EPISODES = 1000

reward_list = []

for gamma in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.84, 0.9, 0.95, 0.99):
    print("gamma:", gamma)
    policy = policy_iteration(gamma=gamma)
    
    for episode_index in range(NUM_EPISODES):
        state = env.reset()
        done = False

        while not done:
            action = policy[state]
            state, reward, done, info = env.step(action)

        reward_list.append({"gamma": gamma, "reward": reward})

env.close()

In [None]:
df = pd.DataFrame(reward_list)
df.tail()

### Plot mean reward (with its 95% confidence interval)

In [None]:
sns.relplot(x="gamma", y="reward", kind="line", data=df, height=6, aspect=1.5)
plt.axhline(0.76, color="red", linestyle=":", label="76% success threshold");   # 76% success threshold
plt.legend();

## References

**[BELLMAN57]** Richard Ernest Bellman. *Dynamic Programming*. Princeton University Press, Princeton,
New Jersey, USA, 1957.

**[HOWARD60]** Ronald A. Howard. *Dynamic Programming and Markov Processes*. MIT Press, Cambridge,
Massachusetts, USA, 1960.

## Going further

In this lab we have introduced Reinforcement Learning in a very specific case where the *agent* (the algorithm) has a perfect knowledge of the environment (transition and reward functions).

This is convenient for introducing basic concepts, but we cannot expect this assumption to hold in many practical problems.
A lot of sophisticated algorithms have been developed recently; most of them have been implemented in the [OpenAI Baselines](https://openai.com/blog/openai-baselines-dqn/) library and can be used with the [OpenAI Gym](https://gym.openai.com/) benchmark library.

Also, for those who want to go further, one of the best books on reinforcement learning is freely available on the web: http://incompleteideas.net/book/RLbook2018.pdf

Examples of what can be done in RL:
- AlphaGo (movie) https://www.youtube.com/watch?v=WXuK6gekU1Y (this work had huge impact in the AI community)
- AlphaGo https://deepmind.com/research/case-studies/alphago-the-story-so-far
- AlphaZero https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go
- AlphaStar (StarCraft II) https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning
- DQN https://deepmind.com/blog/article/deep-reinforcement-learning
- Dota 2 https://openai.com/blog/openai-five/
- Robotics https://openai.com/blog/solving-rubiks-cube/
- Robotics https://openai.com/blog/learning-dexterity/
- Breakout https://www.youtube.com/watch?v=V1eYniJ0Rnk
- Walker https://youtu.be/pgaEE27nsQw
- Helicopter https://www.youtube.com/watch?v=VCdxqn0fcnE
- Energy https://deepmind.com/blog/article/deepmind-ai-reduces-google-data-centre-cooling-bill-40
- Self-driving cars