<a href="https://colab.research.google.com/github/onlyabhilash/reinforcement_learning_course_materials/blob/main/exercises/templates/ex12/PolicyGradients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 12) Policy Gradients

In this exercise we will have a look at policy gradients.
The theory of policy gradients applies to function approximators that decide which action to choose.
The function approximators we met so far were employed to estimate the (action) value function.
Since their task is to judge the quality of the current situation, they are often referred to as "critics".
In contrast, we can also use a function approximator to choose an action directly; we call these "actors".
Why should we do that if we made it work with nothing more than a critic?
Because this finally allows us to make use of continuous action spaces! Eureka!

In this exercise we will use a new `gym` environment `LunarLanderContinuous-v2`.
To run this environment please make sure to have `Box2D` installed: `pip install Box2D`.

![](https://images.squarespace-cdn.com/content/v1/59e0d6f0197aea1a0abc8016/1507938542206-41S6K9T97YETKEHP0PQF/ke17ZwdGBToddI8pDm48kMR1yAHb8bPoH1-OdajP2rZZw-zPPgdn4jUwVcJE1ZvWQUxwkmyExglNqGp0IvTJZUJFbgE-7XRK3dMEBRBhUpyDg3tXaPHS4cFkn9Bnm-emI0BDr_E-XKAFKqWrx68ZVlLyhCgVi_FJvVMH7mHrc18/lunar_lander_success_example.gif?format=500w)

Source: https://www.billyvreeland.com/portfolio/2017/1013/solving-openai-gym-nm4yz

The main task is to land the LunarLander in the landing zone.
An accident-free landing is defined by both legs coming into ground contact with moderate velocity.
We are dealing with a continuous state and action space as defined below.
Please note that the control functions for the main and side engines contain a dead zone in which the engines are inactive.
The reward mainly depends on whether the landing procedure is successful (+100) or not (-100).
Firing the main engine gives a small negative reward.
The problem is solved if a return of at least 200 is reached.
For more information see https://gym.openai.com/envs/LunarLanderContinuous-v2/.




\begin{align}
\text{state}&=
\begin{bmatrix}
p_x\\
p_y\\
v_x\\
v_y\\
\varphi\\
20 \, \omega\\
1 \text{ if left leg has ground contact, else } 0\\
1 \text{ if right leg has ground contact, else } 0\\
\end{bmatrix}
\\
\text{action}&=
\begin{bmatrix}
\text{main engine: } [-1, 0] \rightarrow \text{off}, ]0, 1] \rightarrow [50 \, \%, 100 \, \%] \text{ of available power}\\
\text{side engines: } [-1, -0.5] \rightarrow [50 \, \%, 100 \, \%] \text{ of available right engine power}, [0.5, 1] \rightarrow [50 \, \%, 100 \, \%] \text{ of available left engine power}\\
\end{bmatrix}
\end{align}
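
The dead zones in the action definition above can be made concrete with a small helper. The following is a minimal sketch, not the environment's actual code: `engine_power` is a hypothetical function, the assignment of negative commands to the right engine follows the definition above, and the linear scaling between 50 % and 100 % is an assumption read off the given intervals.

```python
def engine_power(action):
    """Map a raw action [main, side] to (main, left, right) power fractions.

    Hypothetical helper for illustration; it mirrors the intervals stated
    in the action definition and assumes linear scaling inside them.
    """
    main_cmd, side_cmd = float(action[0]), float(action[1])

    # Clip both commands to the valid action range [-1, 1]
    main_cmd = max(-1.0, min(1.0, main_cmd))
    side_cmd = max(-1.0, min(1.0, side_cmd))

    # Main engine: off for commands in [-1, 0], otherwise 50 % to 100 %
    main = 0.5 + 0.5 * main_cmd if main_cmd > 0.0 else 0.0

    # Side engines: inactive for |command| < 0.5, otherwise 50 % to 100 %
    left = right = 0.0
    if side_cmd <= -0.5:
        right = abs(side_cmd)
    elif side_cmd >= 0.5:
        left = side_cmd

    return main, left, right


# A command inside both dead zones leaves all engines off
print(engine_power([-0.3, 0.2]))
```

Under these assumptions, small exploratory actions around zero have no effect on the side engines, which is worth keeping in mind when choosing the initial variance of the policy.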

In [None]:
import numpy as np
import pandas as pd
import gym
import random
import time
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
plt.style.use('seaborn-talk')
from mpl_toolkits.mplot3d import Axes3D
from collections import deque
from sklearn.kernel_approximation import RBFSampler
import sklearn.pipeline
import sklearn.preprocessing

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Lambda, Input
from tensorflow.keras.optimizers import SGD

In [None]:
env = gym.make('LunarLanderContinuous-v2')

state = env.reset()
while True:
    env.render()
    state, reward, done, _ = env.step(env.action_space.sample())

    if done:
        break

env.close()

## 1) Monte Carlo Policy Gradient
Write a REINFORCE algorithm.

Execute the following cell to fit the featurizer using `RBFSampler`, as in the previous exercises.

In [None]:
state_array = []
state = env.reset()

for i in tqdm(range(1000)):
    state = env.reset()
    while True:
        state, reward, done, _ = env.step(env.action_space.sample())
        state_array.append(state)

        if done:
            break

state_array = np.array(state_array)

featurizer = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.pipeline.FeatureUnion([
    ("rbf0", RBFSampler(gamma=5.0, n_components = 1000)),
    ("rbf1", RBFSampler(gamma=2.0, n_components = 1000)),
    ("rbf2", RBFSampler(gamma=1.0, n_components = 1000)),
    ("rbf3", RBFSampler(gamma=0.5, n_components = 1000)),
    ]),
    sklearn.preprocessing.StandardScaler()
)

_ = featurizer.fit(state_array)


Use the following cell to define the function approximators for the policy.
As seen in Algo. 12.1, we need to calculate $\nabla_\theta \mathrm{ln}\,\pi(u_k | x_k, \theta)$.
Here, $\pi$ is defined as the multivariate normal distribution:
\begin{align}
\pi(u_k | x_k, \theta) & = \frac{\mathrm{exp} \left( {-\frac{1}{2} (u_k - \mu_\theta(x_k))^\mathrm{T} \mathbf{\Sigma}^{-1}_\theta(x_k) (u_k - \mu_\theta(x_k))} \right)}{\sqrt{(2\pi)^p \mathrm{det}(\mathbf{\Sigma}_\theta(x_k))}},\\
\text{with}\hspace{1em} p & = \mathrm{dim}(u_k).
\end{align}

Extend `loglikelyhoodGaussian` such that it returns $\mathrm{ln}\,\pi(u_k | x_k, \theta)$!
Use the `TensorFlow` equivalents of the NumPy functions (e.g. `tf.linalg.inv()`).
`TensorFlow` functions are differentiable and can therefore be used to calculate $\nabla_\theta \mathrm{ln}\,\pi(u_k | x_k, \theta)$.
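
Taking the logarithm of the density above gives $\mathrm{ln}\,\pi = -\frac{1}{2}\left[(u_k - \mu_\theta)^\mathrm{T} \mathbf{\Sigma}^{-1}_\theta (u_k - \mu_\theta) + p \, \mathrm{ln}(2\pi) + \mathrm{ln}\,\mathrm{det}(\mathbf{\Sigma}_\theta)\right]$. The following NumPy sketch evaluates this expression and may serve as a numerical cross-check. `log_gaussian_density` is a hypothetical helper name; since NumPy is not differentiable by `TensorFlow`, it cannot replace the `TensorFlow` implementation asked for above.

```python
import numpy as np


def log_gaussian_density(u, mu, sigma):
    """Return ln N(u | mu, sigma) for a p-dimensional action u.

    Hypothetical reference helper; sigma is the (p, p) covariance matrix.
    """
    u = np.asarray(u, dtype=float).reshape(-1)
    mu = np.asarray(mu, dtype=float).reshape(-1)
    sigma = np.asarray(sigma, dtype=float)
    p = u.shape[0]
    diff = u - mu

    # Quadratic form (u - mu)^T Sigma^{-1} (u - mu), computed via a linear solve
    quad = diff @ np.linalg.solve(sigma, diff)

    # ln det(Sigma); slogdet is numerically safer than log(det(...))
    _, logdet = np.linalg.slogdet(sigma)

    return -0.5 * (quad + p * np.log(2.0 * np.pi) + logdet)


# Example: 2D action with a small diagonal covariance
value = log_gaussian_density([0.1, -0.2], [0.0, 0.0], np.diag([0.01, 0.1]))
print(value)
```

Comparing the output of such a reference with the `TensorFlow` version for a few random inputs is a cheap way to catch sign or shape errors before training.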


In [None]:
state = np.reshape(env.reset(), (1, -1))
feature_state = featurizer.transform(state)
input_dim = feature_state.shape[1]
action_space_dim = len(env.action_space.sample())


# define policy
def create_policy():
    input_layer = Input(shape=(input_dim,))  # shape passed as a tuple

    hidden1 = Dense(400, activation='linear')(input_layer)
    hidden1 = tf.keras.layers.LeakyReLU(alpha=0.1)(hidden1)

    hidden2_mu = Dense(400, activation='linear')(hidden1)
    hidden2_mu = tf.keras.layers.LeakyReLU(alpha=0.1)(hidden2_mu)

    hidden2_sigma = Dense(400, activation='linear')(hidden1)
    hidden2_sigma = tf.keras.layers.LeakyReLU(alpha=0.1)(hidden2_sigma)

    mu_out = Dense(action_space_dim, activation='linear')(hidden2_mu)
    mu_out = Lambda(lambda x: tf.reshape(tf.clip_by_value(x, -1, 1), (-1, 1)))(mu_out)
    sigma_out = Dense(action_space_dim, activation='softplus')(hidden2_sigma)
    sigma_out = Lambda(lambda x: tf.linalg.diag(tf.clip_by_value(tf.reshape((x @ np.array([[0.01, 0], [0, 0.1]])), [-1]), 1e-4, 1)))(sigma_out)

    policy = Model(inputs=input_layer, outputs=[mu_out, sigma_out])

    return policy

policy = create_policy()
theta = policy.get_weights()

# Regularization; downscaling of the network parameters to prevent divergence
for i in range(len(theta)):
    theta[i] = theta[i] * 0.4

policy.set_weights(theta)

@tf.function
def loglikelyhoodGaussian(x, _mu, _sigma):
    x = tf.cast(x, tf.float32)

    # YOUR CODE HERE
    raise NotImplementedError()


Use the following template to write a REINFORCE algorithm.
This time the Adam (adaptive moment estimation) optimizer is used, which is an enhanced SGD optimizer.
For more information, see https://arxiv.org/abs/1412.6980.
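
REINFORCE weights the log-probability of each action with the discounted return that followed it, so the returns of an episode have to be computed once the episode has ended. The following is a minimal sketch of that step only, assuming `rewards[k]` is the reward received after the $k$-th action; `discounted_returns` is a hypothetical helper, and the index convention may differ from the one used in Algo. 12.1.

```python
def discounted_returns(rewards, gamma):
    """Return the list G with G[k] = sum_i gamma**i * rewards[k + i]."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so that each return reuses the one after it
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * g
        returns[k] = g
    return returns


print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

The backward recursion needs a single pass over the episode, whereas evaluating every sum separately would scale quadratically with the episode length.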

In [None]:
env.close()
alpha_policy = 1e-5
gamma = 0.99
nb_episodes = 5000

env = gym.make('LunarLanderContinuous-v2')
return_history = []

optimizer = tf.optimizers.Adam(learning_rate=alpha_policy)

for j in tqdm(range(nb_episodes)):
    k = 0
    accumulated_rewards = 0

    states = []
    actions = []
    rewards = []
    probs_log = []

    state = env.reset()
    state = state.reshape(1, -1)

    with tf.GradientTape() as tape:
        while True:
            # env.render()
            k += 1

            # YOUR CODE HERE
            raise NotImplementedError()


Plot the learning curve of the training process!

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Execution

Use `deterministic` to choose between deterministic execution (applying $\mu$ directly) or taking the stochastic action by sampling from the normal distribution.

In [None]:
env.close()
env = gym.make('LunarLanderContinuous-v2')

deterministic = True

for j in tqdm(range(10)):

    state = env.reset()
    accumulated_rewards = 0
    k = 0  # step counter for the episode length printed below

    while True:
        env.render()
        k += 1

        ### STEP)
        feat_state = featurizer.transform(np.reshape(state, (1, -1)))
        mu, sigma = policy(feat_state)
        print(mu)
        if deterministic:
            action = np.reshape(mu.numpy(),(-1,))
        else:
            action = np.random.multivariate_normal(mean=np.squeeze(mu.numpy()), cov=sigma.numpy())

        next_state, reward, done, _ = env.step(action)

        accumulated_rewards += reward

        state = next_state

        if done:
            if j % 1 == 0:
                print(f"Episode {j}, Length: {k}")
                print(f"Return {accumulated_rewards}")
                print()
            break

    env.close()



## 2) Actor-Critic with TD(0) Targets

Write an actor-critic (AC) algorithm to land the lander in the landing zone :)


Use the following cell to create two function approximators: one to estimate the state values (critic) and one to decide on the actions to take (actor).

In [None]:
state = np.reshape(env.reset(), (1, -1))
input_dim = len(featurizer.transform(state)[0])
action_space_dim = len(env.action_space.sample())

# define critic
def create_critic():
    critic = Sequential()
    critic.add(Dense(400, activation='linear', input_dim=input_dim))
    critic.add(tf.keras.layers.LeakyReLU(alpha=0.1))
    critic.add(Dense(400, activation='linear'))
    critic.add(tf.keras.layers.LeakyReLU(alpha=0.1))
    critic.add(Dense(1, activation='linear'))

    return critic

critic = create_critic()
w = critic.get_weights()
for i in range(len(w)):
    w[i] = w[i] * 0.2
critic.set_weights(w)

# define actor
def create_actor():
    input_layer = Input(shape=(input_dim,))  # shape passed as a tuple

    hidden1 = Dense(400, activation='linear')(input_layer)
    hidden1 = tf.keras.layers.LeakyReLU(alpha=0.1)(hidden1)

    hidden2 = Dense(400, activation='linear')(hidden1)
    hidden2 = tf.keras.layers.LeakyReLU(alpha=0.1)(hidden2)

    mu = Dense(action_space_dim, activation='linear')(hidden2)
    mu = Lambda(lambda x: tf.reshape(tf.clip_by_value(x, -1, 1), (-1, 1)))(mu)

    sigma = Dense(action_space_dim, activation='softplus')(hidden2)
    sigma = Lambda(lambda x: tf.linalg.diag(tf.clip_by_value(tf.reshape(x @ np.array([[0.01,   0], [   0, 0.1]]), [-1]), 1e-4, 1)))(sigma)

    actor = Model(inputs=input_layer, outputs=[mu, sigma])

    return actor

actor = create_actor()
theta = actor.get_weights()
for i in range(len(theta)):
    theta[i] = theta[i] * 0.4
actor.set_weights(theta)


Use the following template to write an AC algorithm.
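
With TD(0) targets, both the critic and the actor are driven by the one-step TD error $\delta = r + \gamma \, \hat{v}(x') - \hat{v}(x)$, where the bootstrap term is dropped in terminal states. The following is a minimal sketch of that quantity using plain floats; `td_error` and its argument names are hypothetical, and in the actual exercise the value estimates would come from the critic network.

```python
def td_error(reward, value, next_value, done, gamma):
    """One-step TD error for a transition (x, u, r, x').

    value and next_value are the critic estimates v(x) and v(x').
    """
    # Terminal states have no successor, so the target is the reward alone
    target = reward if done else reward + gamma * next_value
    return target - value


# Non-terminal transition: 1.0 + 0.9 * 2.0 - 0.5
print(td_error(1.0, 0.5, 2.0, False, 0.9))
# Terminal transition: 1.0 - 0.5
print(td_error(1.0, 0.5, 2.0, True, 0.9))
```

In the standard one-step actor-critic scheme the actor update is additionally scaled by the accumulated discount, which is presumably the role of the variable `I` initialized to 1 in the template.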

In [None]:
env.close()

alpha_critic = 1e-4
alpha_actor = 1e-5
gamma = 0.99
nb_episodes = 2000


env = gym.make('LunarLanderContinuous-v2')

return_history = []

for j in tqdm(range(nb_episodes)):
    k = 0
    accumulated_rewards = 0

    state = env.reset()
    I = 1

    while True:
        # env.render()
        k += 1

        with tf.GradientTape(persistent=True) as tape:
            # persistent means the tape is not deleted after the first gradient has been computed

            # YOUR CODE HERE
            raise NotImplementedError()

        if done:
            if j % 250 == 0:
                plt.plot(return_history, label='Return')
                plt.plot(pd.Series(return_history, name='reward_history').rolling(10).mean(), label='MA')
                plt.xlabel('episode')
                plt.ylabel('return')
                plt.grid(True)
                _=plt.legend()
                plt.show()
            return_history.append(accumulated_rewards)
            break

    env.close()


Plot the learning curve of the training process!


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Execution

Use `deterministic` to choose between deterministic execution (applying $\mu$ directly) or taking the stochastic action by sampling from the normal distribution.

In [None]:
env.close()
env = gym.make('LunarLanderContinuous-v2')

deterministic = True

for j in tqdm(range(10)):

    state = env.reset()
    accumulated_rewards = 0
    k = 0  # step counter for the episode length printed below

    while True:
        env.render()
        k += 1

        ### STEP)
        feat_state = featurizer.transform(np.reshape(state, (1, -1)))
        mu, sigma = actor(feat_state)

        if deterministic:
            action = np.reshape(mu.numpy(), (-1,))
        else:
            action = np.random.multivariate_normal(mean=np.squeeze(mu.numpy()), cov=sigma.numpy())

        next_state, reward, done, _ = env.step(action)

        accumulated_rewards += reward

        state = next_state

        if done:
            if j % 1 == 0:
                print(f"Episode {j}, Length: {k}")
                print(f"Return {accumulated_rewards}")
                print()
            break

    env.close()
