# Reinforcement Learning - Policy Gradient

<img src="https://raw.githubusercontent.com/jeremiedecock/polytechnique-inf581-2021/master/logo.jpg" style="float: left; width: 15%" />

[INF581-2021](https://moodle.polytechnique.fr/course/view.php?id=9352) Lab session #7

2019-2021 Jérémie Decock

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeremiedecock/polytechnique-inf581-2021/blob/master/lab7_rl3_reinforce_answers.ipynb)

[![My Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jeremiedecock/polytechnique-inf581-2021/master?filepath=lab7_rl3_reinforce_answers.ipynb)

[![NbViewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/jeremiedecock/polytechnique-inf581-2021/blob/master/lab7_rl3_reinforce_answers.ipynb)

[![Local](https://img.shields.io/badge/Local-Save%20As...-blue)](https://github.com/jeremiedecock/polytechnique-inf581-2021/raw/master/lab7_rl3_reinforce_answers.ipynb)

$
\newcommand{\vs}[1]{\mathbf{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\mathbf{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\U{V}
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
%
\def\E{\mathbb{E}}
%\newcommand{transition}{T(s,a,s')}
%\newcommand{transitionfunc}{\mathcal{T}^a_{ss'}}
\newcommand{transitionfunc}{P}
\newcommand{transitionfuncinst}{P(\nextstate|\state,\action)}
\newcommand{transitionfuncpi}{\mathcal{T}^{\pi_i(s)}_{ss'}}
\newcommand{rewardfunc}{r}
\newcommand{rewardfuncinst}{r(\state,\action,\nextstate)}
\newcommand{rewardfuncpi}{r(s,\pi_i(s),s')}
\newcommand{statespace}{\mathcal{S}}
\newcommand{statespaceterm}{\mathcal{S}^F}
\newcommand{statespacefull}{\mathcal{S^+}}
\newcommand{actionspace}{\mathcal{A}}
\newcommand{reward}{R}
\newcommand{statet}{S}
\newcommand{actiont}{A}
\newcommand{newstatet}{S'}
\newcommand{nextstate}{\state'}
\newcommand{newactiont}{A'}
\newcommand{stepsize}{\alpha}
\newcommand{discount}{\gamma}
\newcommand{qtable}{Q}
\newcommand{finalstate}{\state_F}
%
\newcommand{\vs}[1]{\boldsymbol{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\boldsymbol{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\vit{Value Iteration}
\def\pit{Policy Iteration}
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
\def\cstateset{\mathcal{X}} %%%
\def\x{\vs{x}}                    % TODO cstate
\def\cstate{\vs{x}}               %%%
\def\policy{\pi}
\def\piparam{\vs{\theta}}         % TODO pparam
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\caction{\vs{u}}       % action
\def\cactionset{\mathcal{U}} %%%
\def\decision{\vs{d}}       % decision
\def\randvar{\vs{\omega}}       %%%
\def\randset{\Omega}       %%%
\def\transition{T}       %%%
\def\immediatereward{r}    %%%
\def\strategichorizon{s}    %%% % TODO
\def\tacticalhorizon{k}    %%%  % TODO
\def\operationalhorizon{h}    %%%
\def\constalpha{a}    %%%
\def\U{V}              % utility function
\def\valuefunc{V}
\def\X{\mathcal{X}}
\def\meu{Maximum Expected Utility}
\def\finaltime{T}
\def\timeindex{t}
\def\iterationindex{i}
\def\decisionfunc{d}       % action
\def\mdp{\text{MDP}}
$

## Introduction

In a previous lab, we dealt with reinforcement learning in discrete state and action spaces. To do so, we used methods based on value functions and, in particular, on $Q$-function estimation. The $Q$-function was stored in a table and updated with an on-policy or an off-policy algorithm (SARSA and $Q$-Learning, respectively).

Yet, these methods do not scale to large state spaces, and especially not to continuous state spaces. To address these issues, one can either extend value-based methods by using value-function approximation or search directly in policy space. In this lab, we will explore the second solution.

More specifically, we will search in a family of parameterized policies $\pi_\theta(s, a)$ using a policy gradient method. This method performs gradient ascent in the policy space so that the total return is maximized. We will restrict our work to episodic tasks, *i.e.* tasks that have a starting state and last for a finite and fixed number of steps $H$, called the horizon.

More formally, we define an optimization criterion that we want to maximize:

$$J(\theta) = \E_{\pi_\theta}\left[\sum_{t=1}^H r(s_t,a_t)\right],$$

where $\E_{\pi_\theta}$ means $a \sim \pi_\theta(s,.)$ and $H$ is the horizon of the episode. In other words, we want to maximize the value of the starting state: $V^{\pi_\theta}(s)$. The policy gradient theorem tells us that:

$$
\nabla_\theta J(\theta) = \nabla_\theta V^{\pi_\theta}(s) = \E_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta (s,a) Q^{\pi_\theta}(s,a) \right],
$$
with

$$Q^\pi(s,a) = \E^\pi \left[\sum_{t=1}^H r(s_t,a_t)|s=s_1, a=a_1\right].$$

The policy gradient theorem is extremely powerful because it says that one does not need to know the dynamics of the system to compute the gradient, provided one can compute the $Q$-function of the current policy. Applying the policy and observing the one-step transitions is enough. Using stochastic gradient ascent and replacing $Q^{\pi_\theta}(s_t,a_t)$ by a Monte Carlo estimate $R_t = \sum_{t'=t}^H r(s_{t'},a_{t'})$ over a single trajectory, we end up with a special case of the REINFORCE algorithm (see the algorithm below).
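
As a small illustration of the Monte Carlo estimate, the returns $R_t$ for every step of one trajectory can be accumulated in a single backward pass over the rewards. The sketch below uses plain Python lists; the function name `rewards_to_go` is only illustrative and is not part of the lab materials.

```python
def rewards_to_go(rewards):
    """Return [R_1, ..., R_H] where R_t is the sum of rewards from step t to the end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each R_t reuses the sum already computed for R_{t+1}
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


# An episode of 4 steps with a reward of 1 per step
print(rewards_to_go([1.0, 1.0, 1.0, 1.0]))  # [4.0, 3.0, 2.0, 1.0]
```

With a reward of 1 per step, $R_t$ is simply the number of steps remaining in the episode, so early actions of a long episode receive the largest weight in the gradient estimate.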

---
REINFORCE with Policy Gradient theorem Algorithm
---

Initialize $\theta^0$ as random<br>
Initialize step-size $\alpha_0$<br>
$n \leftarrow 0$<br>
<b>WHILE</b> no convergence<br>
	$\quad$ Generate rollout $h_n \leftarrow \{s_1^n,a_1^n,r_1^n, \ldots, s_H^n, a_H^n, r_H^n\} \sim \pi_{\theta^n}$<br>
	$\quad$ $PG_\theta \leftarrow 0$<br>
	$\quad$ <b>FOR</b> $t=1$ to $H$<br>
		$\quad\quad$ $R_t \leftarrow \sum_{t'=t}^H r_{t'}^n$<br>
		$\quad\quad$ $PG_\theta \leftarrow PG_\theta + \nabla_\theta \log \pi_{\theta^{n}}(s_t,a_t) ~ R_t$<br>
	$\quad$ $n \leftarrow n + 1$ <br>
	$\quad$ $\theta^n \leftarrow \theta^{n-1} + \alpha_n PG_\theta$<br>
	$\quad$  update $\alpha_n$ (if step-size scheduling)<br>

<b>RETURN</b> $\theta^n$ 

**Notice**: by replacing the $Q$-function by a Monte-Carlo estimate, we get rid of the Markov assumption and this algorithm is expected to work even in non-Markovian systems. 
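
The loop above can be sketched on a deliberately tiny problem. The example below is only an illustration under stated assumptions: the task is a hypothetical stateless one in which action 1 pays a reward of 1 and action 0 pays nothing, the policy has a single scalar parameter with $\pi_\theta(1) = \sigma(\theta)$, and the step size is kept constant. All function names are made up for this sketch.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_rollout(theta, horizon, rng):
    """Hypothetical stateless task: action 1 pays 1, action 0 pays 0."""
    actions, rewards = [], []
    for _ in range(horizon):
        a = 1 if rng.random() < sigmoid(theta) else 0
        actions.append(a)
        rewards.append(float(a))
    return actions, rewards


def grad_log_pi(theta, a):
    """Derivative of log pi_theta(a) with respect to theta, for pi_theta(1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return (1.0 - p1) if a == 1 else -p1


def reinforce(theta, alpha=0.1, horizon=5, n_iterations=200, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iterations):
        actions, rewards = toy_rollout(theta, horizon, rng)
        pg = 0.0
        for t, a in enumerate(actions):
            r_t = sum(rewards[t:])            # Monte Carlo return from step t
            pg += grad_log_pi(theta, a) * r_t
        theta += alpha * pg                   # gradient ascent step
    return theta


theta_final = reinforce(theta=0.0)
print("P(action 1) after training:", sigmoid(theta_final))
```

The structure mirrors the pseudocode: one rollout per iteration, a sum of score-function terms weighted by the return from each step, then one ascent step. The stopping rule here is a fixed number of iterations rather than a convergence test.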

To focus on the algorithms, we will use standard environments provided by the OpenAI Gym suite.
OpenAI Gym provides controllable environments (https://gym.openai.com/envs/) for research in reinforcement learning.
In particular, we will try to solve the CartPole-v0 environment (https://gym.openai.com/envs/CartPole-v0/), which offers a continuous state space and a discrete action space.
The Cart Pole task consists in keeping a pole upright by moving the cart to which the pole is attached by a joint.
No friction is considered.
The task is considered solved if the pole stays upright (within 15 degrees) for 195 steps on average over 100 episodes while the cart position stays within reasonable bounds.
The state is given by $\{x,\frac{\partial x}{\partial t},\omega,\frac{\partial \omega}{\partial t}\}$ where $x$ is the position of the cart and $\omega$ is the angle between the pole and the vertical.
There are only two possible actions: {LEFT, RIGHT}.

## Set up the Python environment

**Notice**: this notebook requires the following libraries: OpenAI *Gym*, NumPy, Pandas, Seaborn and imageio.

You can install them with the following command (the next cell does this for you if you use the Google Colab environment):

```
pip install gym numpy pandas seaborn imageio
```

In [None]:
import sys, subprocess

def is_colab():
    return "google.colab" in sys.modules

In [None]:
colab_requirements = [
    "gym",
    "numpy",
    "pandas",
    "seaborn",
    "pyvirtualdisplay",
    "imageio"
]

debian_packages = [
    "xvfb",
    "x11-utils"
]

if is_colab():

    def run_subprocess_command(cmd):
        # run the command
        process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
        # print the output
        for line in process.stdout:
            print(line.decode().strip())

    for i in colab_requirements:
        run_subprocess_command("pip install " + i)

    for i in debian_packages:
        run_subprocess_command("apt install -y " + i)  # -y: do not wait for interactive confirmation

Set up a virtual display for Google Colab:

In [None]:
if "google.colab" in sys.modules:
    import pyvirtualdisplay

    _display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                        size=(1400, 900))
    _ = _display.start()

You can uncomment the following cell to install required packages in your local environment (remove only the `#` not the `!`).

In [None]:
#!pip install gym numpy pandas seaborn imageio

In [None]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

import math
import gym
import numpy as np
import copy
import pandas as pd
import seaborn as sns
import time

In [None]:
import imageio     # To render episodes in GIF images (otherwise there would be no render on Google Colab)
                   # C.f. https://stable-baselines.readthedocs.io/en/master/guide/examples.html#bonus-make-a-gif-of-a-trained-agent

In [None]:
# To display GIF images in the notebook

import IPython
from IPython.display import Image

In [None]:
class RenderWrapper:
    def __init__(self, env, force_gif=False):
        self.env = env
        self.force_gif = force_gif
        self.reset()

    def reset(self):
        self.images = []

    def render(self):
        if not is_colab():
            self.env.render()
            time.sleep(1./60.)

        if is_colab() or self.force_gif:
            img = self.env.render(mode='rgb_array')
            self.images.append(img)

    def make_gif(self, filename="render"):
        if is_colab() or self.force_gif:
            imageio.mimsave(filename + '.gif', [np.array(img) for i, img in enumerate(self.images) if i%2 == 0], fps=29)
            return Image(open(filename + '.gif','rb').read())

    @classmethod
    def register(cls, env, force_gif=False):
        env.render_wrapper = cls(env, force_gif=force_gif)

In [None]:
sns.set_context("talk")

## Exercise 1: Hands on Cart Pole

As in the previous lab, open the `test_envs.py` file (lab materials) and run it to get used to the basic functions on Cart Pole. Identify the action selection command in the code. Notice that it is the same as before. Some more functions are available for multidimensional continuous states.

The most important command is still `env.step(action)`. It applies the selected action to the environment and returns an observation (the next state), a reward, a flag that is set to `True` if the episode has terminated, and some info.

Try to use a different policy (for instance, a constant action) to understand the role of that command. Although the Cart Pole has easy dynamics that can be computed analytically, we will solve this problem with Policy Gradient based Reinforcement learning.

In [None]:
# Print some info about the environment

env = gym.make('CartPole-v0')
print("State space dimension is:", env.observation_space.shape[0])
print("State upper bounds:", env.observation_space.high)
print("State lower bounds:", env.observation_space.low)
print("Actions are: {" + ", ".join([str(a) for a in range(env.action_space.n)]) + "}")
env.close()

In [None]:
env = gym.make('CartPole-v0')
RenderWrapper.register(env, force_gif=True)

for i_episode in range(5):
    observation = env.reset()

    for t in range(200):
        env.render_wrapper.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

env.close()

env.render_wrapper.make_gif("ex1")

## Exercise 2: Implement a sigmoid policy

As the number of actions is $2$, one can see the problem of controlling the Cart Pole as a binary classification problem.
Binary classification can be easily solved thanks to logistic regression which transforms the classification problem into a regression problem using the sigmoid function defined by:

$$\sigma(x) = \frac{1}{1+e^{-x}}.$$
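
One practical detail when coding this function: evaluating $e^{-x}$ directly can overflow for large negative $x$ (Python's `math.exp` raises `OverflowError`, and `np.exp` returns `inf` with a runtime warning). A common workaround, shown here as a minimal sketch with the hypothetical name `stable_sigmoid`, is to pick the algebraically equivalent form whose exponent is never positive.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates a non-positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z lies in (0, 1) and cannot overflow
    return z / (1.0 + z)


print(stable_sigmoid(0.0))      # 0.5
print(stable_sigmoid(-1000.0))  # 0.0
print(stable_sigmoid(1000.0))   # 1.0
```

Both branches compute the same function, since $\frac{1}{1+e^{-x}} = \frac{e^{x}}{1+e^{x}}$.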

**To do**:
- Build a policy that selects the *RIGHT* action with probability $\sigma(\theta^\top s)$, where $\theta$ is the parameter vector and $s$ is the 4-dimensional state vector.
- As a measure of performance, compute the average return (sum of rewards) over 100 episodes.

## Exercise 3: Compute $\nabla_\theta \log \pi_\theta (s,a)$

Verify that, for a sigmoid policy:
- $\nabla_\theta \log \pi_\theta (s,\text{RIGHT}) = \pi_\theta (s, \text{LEFT}) \times s$
- $\nabla_\theta \log \pi_\theta (s,\text{LEFT}) = - \pi_\theta (s, \text{RIGHT}) \times s$ 
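
Both identities follow from $\sigma'(z) = \sigma(z)(1-\sigma(z))$ together with $\pi_\theta(s,\text{LEFT}) = 1 - \sigma(\theta^\top s)$. They can also be checked numerically. The sketch below compares the closed-form gradient with central finite differences of $\log \pi_\theta$; the test values for `theta` and `s` are arbitrary, and the helper names are illustrative only.

```python
import math

LEFT, RIGHT = 0, 1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_pi(theta, s, action):
    """log pi_theta(s, action) for the policy pi_theta(s, RIGHT) = sigmoid(theta . s)."""
    p_right = sigmoid(sum(th * x for th, x in zip(theta, s)))
    return math.log(p_right if action == RIGHT else 1.0 - p_right)


def analytic_grad(theta, s, action):
    """Closed-form gradient given by the two identities above."""
    p_right = sigmoid(sum(th * x for th, x in zip(theta, s)))
    coeff = (1.0 - p_right) if action == RIGHT else -p_right
    return [coeff * x for x in s]


def numeric_grad(theta, s, action, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((log_pi(plus, s, action) - log_pi(minus, s, action)) / (2 * eps))
    return grad


theta = [0.1, -0.4, 0.7, 0.2]   # arbitrary test values
s = [0.02, -0.3, 0.05, 0.6]

for action in (LEFT, RIGHT):
    exact = analytic_grad(theta, s, action)
    approx = numeric_grad(theta, s, action)
    print(action, max(abs(e - a) for e, a in zip(exact, approx)))
```

The printed differences should be close to zero for both actions, up to finite-difference and rounding error.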

## Exercise 4: Implement REINFORCE

To implement REINFORCE, you will need to follow these steps:
- Run rollouts with the current policy $\pi_\theta(s,a)$ with a fixed horizon $H$. Since the goal is to attain an average return of 195, the horizon should be larger than 195 steps (say 250 for instance).
- Compute the policy gradient $\sum_{t=1}^H \nabla_\theta \log \pi_\theta(s_t,a_t) R_t$.
- Update parameters with gradient ascent.
- Verify whether the new policy meets the success requirement (average return $> 195$).

In [None]:
ENV_NAME = "CartPole-v0"
EPISODE_DURATION = 300
ALPHA_INIT = 0.1
SCORE = 195.0
TEST_TIME = 100
LEFT = 0
RIGHT = 1

VERBOSE = True

In [None]:
# Compute policy parameterisation
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

In [None]:
# Return policy
def get_policy(s, theta):

    p_right = sigmoid(np.dot(s, np.transpose(theta)))
    pi = [1-p_right, p_right]
    return pi

In [None]:
# Draw an action according to current policy
def act_with_policy(s, theta):
    p_right = get_policy(s, theta)[RIGHT]
    r = np.random.rand()
    if r < p_right:
        return RIGHT
    else:
        return LEFT

In [None]:
# Generate an episode
def gen_rollout(env, theta, max_episode_length=EPISODE_DURATION, render=False):
    s_t = env.reset()
    episode_states = []
    episode_actions = []
    episode_rewards = []
    episode_states.append(s_t)

    for t in range(max_episode_length):

        if render:
            env.render_wrapper.render()
        a_t = act_with_policy(s_t, theta)
        s_t, r_t, done, info = env.step(a_t)
        episode_states.append(s_t)
        episode_actions.append(a_t)
        episode_rewards.append(r_t)
        if done:
            break

    return episode_states, episode_actions, episode_rewards

In [None]:
def test_policy(env, theta, score = SCORE, num_episodes = TEST_TIME , max_episode_length=EPISODE_DURATION, render=False):
    num_success = 0
    average_return = 0

    for i_episode in range(num_episodes):
        _, _, episode_rewards = gen_rollout(env, theta, max_episode_length, render)

        total_rewards = sum(episode_rewards)

        if total_rewards > score:
            num_success+=1

        average_return += (1.0 / num_episodes) * total_rewards

        if render:
            print("Test Episode {0}: Total Reward = {1} - Success = {2}".format(i_episode,total_rewards,total_rewards>score))

    success = average_return > score

    return success, num_success, average_return

In [None]:
# Returns Policy Gradient for a given episode
def compute_PG(episode_states, episode_actions, episode_rewards, theta):

    H = len(episode_rewards)
    PG = 0

    for t in range(H):

        pi = get_policy(episode_states[t], theta)
        a_t = episode_actions[t]
        # Monte Carlo return from step t to the end of the episode
        R_t = sum(episode_rewards[t:])
        # Gradient of log pi_theta(s_t, a_t) for the sigmoid policy
        if a_t == LEFT:
            g_theta_log_pi = - pi[RIGHT] * episode_states[t]
        else:
            g_theta_log_pi = pi[LEFT] * episode_states[t]

        PG += g_theta_log_pi * R_t

    return PG

In [None]:
# Train until average return is larger than SCORE
def train(env, theta_init, max_episode_length = EPISODE_DURATION, alpha_init = ALPHA_INIT):

    theta = np.array(theta_init, dtype=float)  # work on a copy so that theta_init is not modified in place
    i_episode = 0
    average_returns = []

    success, _, R = test_policy(env, theta)
    average_returns.append(R)

    # Train until success
    while (not success):

        # Rollout
        episode_states, episode_actions, episode_rewards = gen_rollout(env, theta, max_episode_length)

        # Schedule step size
        #alpha = alpha_init
        alpha = alpha_init / (1 + i_episode)

        # Compute gradient
        PG = compute_PG(episode_states, episode_actions, episode_rewards, theta)

        # Do gradient ascent
        theta += alpha * PG

        # Test new policy
        success,_,R = test_policy(env, theta, render=False)

        # Monitoring
        average_returns.append(R)

        i_episode += 1

        if VERBOSE:
            print("Episode {0}, average return: {1}".format(i_episode, R))

    return theta, i_episode, average_returns
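
The training loop uses the step-size schedule $\alpha_n = \alpha_0 / (1 + n)$. To get a feel for how quickly it decays, here is a minimal sketch that evaluates it for a few iteration counts; the helper name `step_size` is hypothetical and only mirrors the expression used in `train`.

```python
def step_size(alpha_init, i_episode):
    """Step size alpha_init / (1 + i_episode), the same expression as in the training loop."""
    return alpha_init / (1 + i_episode)


for i in (0, 1, 9, 99):
    print(i, step_size(0.1, i))
```

With $\alpha_0 = 0.1$ the step size is already down to about $0.001$ after 99 updates. The sum of these step sizes diverges while the sum of their squares converges, which is the classical condition for stochastic approximation schemes, but in practice such a fast decay can slow learning, which is presumably why the constant alternative is left as a commented line in `train`.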

In [None]:
#np.random.seed(1234)
env = gym.make(ENV_NAME)
RenderWrapper.register(env, force_gif=True)
#env.seed(1234)

In [None]:
dim = env.observation_space.shape[0]

# Init parameters to random
theta_init = np.random.randn(1, dim)

# Train agent
theta, i, average_returns = train(env, theta_init)

print("Solved after {} iterations".format(i))

In [None]:
# Test final policy
test_policy(env, theta, num_episodes=10, render=True)
env.render_wrapper.make_gif("ex4")

In [None]:
# Show training curve
plt.plot(range(len(average_returns)),average_returns)
plt.title("Average return over 100 test episodes")
plt.xlabel("Training iterations")
plt.ylabel("Average return")

plt.show()

env.close()

In [None]:
#state = np.array([0.1, 0.1, 0.1, 0.9])
#theta = np.array([0.1, 0.9, 0.1, 0.9])
#print(get_policy(state, theta))
#print(type(get_policy(state, theta)))
#
#action_list = [act_with_policy(state, theta) for _ in range(2000)]
##print(action_list)
#plt.hist(action_list)
#plt.show()