<a href="https://colab.research.google.com/github/onlyabhilash/reinforcement_learning_course_materials/blob/main/exercises/templates/ex06/n_StepMethods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6: n-Step Methods

In this exercise we will have a look at n-step methods. This class of reinforcement learning algorithms generalizes the previously discussed Monte Carlo and TD(0) methods and includes them as special cases. The environment we will be dealing with is a little more typical for control research: the inverted pendulum.

![](https://miro.medium.com/max/1000/1*TNo3x9zDi1lVOH_3ncG7Aw.gif)

To implement this environment, we will make use of the gym library. Please install the gym library within your preferred Python environment using:

```pip install gym```

In [None]:
import numpy as np
import gym
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
plt.style.use('seaborn-talk')

Check if the installation and import work by executing the following cell. A window with an animation of the pendulum should open, display some random actions, and close automatically.

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped # removes a built-in time limit of k_T = 200, we want to determine the time limit ourselves

state = env.reset()
for _ in range(300):
    env.render()
    state, reward, done, _ = env.step(env.action_space.sample()) # take a random action
env.close()

The goal of this environment is to bring the pendulum into the upper neutral position, where the angle $\theta = 0$ and the angular velocity $\frac{\text{d}}{\text{d}t}\theta=\omega=0$. The reward function already encodes this goal and does not need further specification. For further information about the environment you may refer to the code and documentation of OpenAI's `gym`:

[Documentation of the gym pendulum](https://github.com/openai/gym/wiki/Pendulum-v0)

[Pendulum environment in the gym Github repository](https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py)

## 1) Discretization of Action and State Space

Unlike the racetrack environment, the inverted pendulum comes with a continuous action and state space. Although it is possible to handle systems with these characteristics, we have not yet learned how to deal with them. For now, we only know how to implement agents for discrete action and state spaces. Accordingly, we will represent the inverted pendulum within a discrete state / action space. For this, a discretization is necessary.

The pendulum has three state variables relating to the momentary angular position $\theta$:
\begin{align*}
x=\begin{bmatrix}
\text{cos}(\theta)\\
\text{sin}(\theta)\\
\frac{\text{d}}{\text{d}t}\theta
\end{bmatrix}
\in
\begin{bmatrix}
[-1, 1]\\
[-1, 1]\\
[-8 \, \frac{1}{\text{s}}, 8 \, \frac{1}{\text{s}}]
\end{bmatrix},
\end{align*}

and one input variable which relates to the torque applied at the axis of rotation:

$u = T \in [-2 \, \text{N}\cdot\text{m}, 2 \, \text{N}\cdot\text{m}]$
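The first two state variables both encode the same angle $\theta$. As a minimal sketch of that relationship (the helper name `angle_from_observation` is hypothetical and not part of `gym`), the angle can be recovered from its cosine and sine with the two-argument arctangent:

```python
import math

def angle_from_observation(cos_theta, sin_theta):
    """Recover theta in (-pi, pi] from its cosine and sine (hypothetical helper)."""
    return math.atan2(sin_theta, cos_theta)

# Upright position: theta = 0, so (cos, sin) = (1, 0)
theta_up = angle_from_observation(1.0, 0.0)
# Hanging position: theta = pi, so (cos, sin) = (-1, 0)
theta_down = angle_from_observation(-1.0, 0.0)
print(theta_up, theta_down)
```

This is only meant to illustrate that the cosine and sine entries are not independent of each other; the exercise itself discretizes all three state variables separately.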

After the discretization, we want the system to be defined on sets of non-negative integers:

\begin{align*}
x_d=
\text{discretize_state}(x)
\in
\begin{bmatrix}
\{0,1,2,...,d_{\theta}-1\}\\
\{0,1,2,...,d_{\theta}-1\}\\
\{0,1,2,...,d_{\omega}-1\}
\end{bmatrix},
\end{align*}


$
u_d=
\text{discretize_action}(u)
\in
\{0,1,2,...,d_{T}-1\}.
$

Since the action is selected within the discrete action space, we need to map it back to a continuous torque before passing it to the environment:

$
u=
\text{continualize_action}(u_d):
\{0,1,2,...,d_{T}-1\} \rightarrow [-2 \, \text{N}\cdot\text{m}, 2 \, \text{N}\cdot\text{m}]
.
$

Write the functions `discretize_state` and `continualize_action`, such that a discrete RL agent can be applied. (Please note that all I/O of `gym` consists of numpy arrays.) Write the functions in such a way that the numbers of discretization intervals $d_\theta, d_\omega, d_T$ are parameters that can be changed for different tests. The discretization intervals should be uniformly distributed over their respective value ranges.

A parametrization of $d_\theta = d_\omega = d_T = 15$ can be used to yield satisfactory results in this exercise.
However, does it make a difference if the number of discretization intervals is odd or even? If yes, what should be preferred for the given environment?
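To explore the odd / even question, it can help to look at a uniform grid directly. The following is a minimal sketch, assuming that each discrete index is represented by one of $d$ equally spaced grid points that include both interval bounds; the helper names `uniform_grid` and `nearest_index` are hypothetical and this is not the reference solution.

```python
import numpy as np

def uniform_grid(low, high, d):
    """Return d equally spaced representative values covering [low, high]."""
    return np.linspace(low, high, d)

def nearest_index(value, low, high, d):
    """Map a continuous value to the index of the nearest grid point."""
    scaled = (value - low) / (high - low) * (d - 1)
    # clip so that values slightly outside the range still yield a valid index
    return int(np.clip(np.rint(scaled), 0, d - 1))

# Torque range of the pendulum with an odd and an even number of grid points
odd_grid = uniform_grid(-2.0, 2.0, 15)
even_grid = uniform_grid(-2.0, 2.0, 14)

# Does the grid contain a point at (numerically) zero torque?
odd_has_zero = bool(np.isclose(odd_grid, 0.0).any())
even_has_zero = bool(np.isclose(even_grid, 0.0).any())
print(odd_has_zero, even_has_zero)
```

Under this assumption, only one of the two grids offers a representative value at exactly zero, which is worth keeping in mind when thinking about the target state $\theta = 0$, $\omega = 0$ and about applying no torque.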

## Solution 1)

YOUR ANSWER HERE

In [None]:
d_T = 15
d_theta = 15
d_omega = 15

def discretize_state(states):

    # YOUR CODE HERE
    raise NotImplementedError()


def continualize_action(disc_action):

    # YOUR CODE HERE
    raise NotImplementedError()

Use the following cell for debugging:

In [None]:
env = gym.make('Pendulum-v0')
state = env.reset()
for _ in range(5):
    disc_action = np.random.choice(d_T)  # sample uniformly from all d_T discrete actions
    cont_action = continualize_action(disc_action)
    print("discrete action: {}, continuous action: {}".format(disc_action, cont_action))

    state, reward, done, _ = env.step(cont_action) # take a random action
    disc_state = discretize_state(state)
    print("discrete state: {}, continuous state: {}".format(disc_state, state))

env.close()

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 2) n-Step Sarsa

Write an on-policy n-step Sarsa control algorithm for the inverted pendulum from scratch. This time, no code template is given.

Use the following parameters: $\alpha=0.1, \gamma=0.9, \varepsilon=0.1, n=10$ with 2000 episodes of 500 time steps each.

![](nStepSARSA_Algo.png)
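The core of the update is the n-step return: the discounted sum of the next $n$ rewards plus a discounted bootstrap value. The following is a minimal sketch of that computation in isolation, not a full agent; the function names `n_step_return` and `sarsa_update` are hypothetical, and the bootstrap value is assumed to be $Q(S_{\tau+n}, A_{\tau+n})$ (or zero if the episode ended inside the window).

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of rewards plus a discounted bootstrap value.

    rewards         : [R_{tau+1}, ..., R_{tau+n}]
    bootstrap_value : Q(S_{tau+n}, A_{tau+n}), or 0.0 if the episode ended
    """
    g = bootstrap_value
    # accumulate from the last reward backwards: G = R + gamma * G
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sarsa_update(q_old, g, alpha):
    """Move the old action value a fraction alpha towards the target g."""
    return q_old + alpha * (g - q_old)

# Toy numbers (made up for illustration): two rewards of 1, bootstrap value 4
g = n_step_return([1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
q_new = sarsa_update(0.0, g, alpha=0.1)
print(g, q_new)
```

With the toy numbers, the return is $1 + 0.5 \cdot 1 + 0.5^2 \cdot 4 = 2.5$. In the full algorithm this target is computed for the state-action pair visited $n$ steps earlier.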

YOUR ANSWER HERE

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped

alpha = 0.1 # learning rate
gamma = 0.9 # discount factor
epsilon = 0.1 # epsilon greedy parameter
n = 10 # steps between updates

nb_episodes = 2000 # number of episodes
nb_steps = 500 # length of episodes

action_values = np.zeros([d_theta, d_theta, d_omega, d_T])
pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)  # int is necessary for indexing

cumulative_reward_history = [] # we can use this to figure out how well the learning worked

for j in tqdm(range(nb_episodes), position=0, leave=True):

    # YOUR CODE HERE
    raise NotImplementedError()
pi_learned = np.copy(pi) # save pi in cache under different name for later

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Greedy Execution

Test the learned policy by pure greedy execution.

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped

nb_steps = 200

state = env.reset() # initialize x_0
disc_state = tuple(discretize_state(state)) # use tuple indexing
disc_action = pi_learned[disc_state]

for k in range(nb_steps):

    cont_action = continualize_action(disc_action)
    env.render() # comment out for faster execution
    state, reward, done, _ = env.step(cont_action)
    disc_state = tuple(discretize_state(state))

    if done:
        break

    disc_action = pi_learned[disc_state] # exploitative action

env.close()

## 3) Tree Backups

Although n-step Sarsa is a very powerful algorithm, it still needs to be trained on-policy. This is not a problem in simulations, but it might be quite dangerous if used on physical systems. Therefore, we also need an off-policy solution.

Use the policy learned in task (2) as a behavior policy when implementing n-step Sarsa with tree backups and compare their learning behavior. Be aware that execution may be time consuming.

Use the following parameters: $\alpha=0.1, \gamma=0.9, \varepsilon=0.1, n=5$ with 10 000 episodes of 500 time steps each.

What can we say about the training process? What can we say about the resulting learned policy? Did the agent learn a good policy? Why? Why not?

![](nStepTreeBackup_Algo.png)
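The tree-backup return replaces sampled actions by expectations under the target policy $\pi$. The sketch below computes this return for a given window of experience. It assumes that no terminal state occurs inside the window, and the function name and argument layout are hypothetical choices made for illustration only.

```python
def tree_backup_return(rewards, q_values, pi_probs, actions, gamma):
    """n-step tree-backup return, assuming no terminal state inside the window.

    All arguments are lists of length n; entry k refers to time step t+k+1:
      rewards[k]  : reward R_{t+k+1}
      q_values[k] : list with Q(S_{t+k+1}, a) for every action a
      pi_probs[k] : list with pi(a | S_{t+k+1}) for every action a
      actions[k]  : action A_{t+k+1} taken by the behavior policy
                    (ignored for the last entry, where all actions are averaged)
    """
    last = len(rewards) - 1
    # Leaf of the tree: full expectation over all actions under pi
    g = rewards[last] + gamma * sum(
        p * q for p, q in zip(pi_probs[last], q_values[last])
    )
    # Walk backwards towards time t
    for k in range(last - 1, -1, -1):
        taken = actions[k]
        # Actions that were not taken contribute their current estimates
        off_path = sum(
            p * q
            for a, (p, q) in enumerate(zip(pi_probs[k], q_values[k]))
            if a != taken
        )
        # The taken action contributes the return from deeper in the tree
        g = rewards[k] + gamma * off_path + gamma * pi_probs[k][taken] * g
    return g

# Toy example (made-up numbers): two actions, greedy target policy picks action 1
q = [[1.0, 3.0], [2.0, 4.0]]
greedy = [[0.0, 1.0], [0.0, 1.0]]
g_on_path = tree_backup_return([1.0, 2.0], q, greedy, [1, 0], gamma=0.5)
g_off_path = tree_backup_return([1.0, 2.0], q, greedy, [0, 0], gamma=0.5)
print(g_on_path, g_off_path)
```

In the toy example, the second call uses a behavior action that the greedy target policy would never choose. Its probability under $\pi$ is zero, so everything deeper in the tree is cut off and the return falls back to the current action-value estimates. This effect may be relevant when interpreting the training results with a greedy target policy.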

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped

alpha = 0.1 # learning rate
gamma = 0.9 # discount factor
epsilon = 0.1 # epsilon greedy parameter
n = 5 # steps between updates

nb_episodes = 10000 # number of episodes
nb_steps = 500 # length of episodes

action_values = -999 * np.ones([d_theta, d_theta, d_omega, d_T])
behavior_policy = np.copy(pi_learned) # pi_learned should be the learned policy from (2), make sure it is active
pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)

cumulative_reward_history = [] # we can use this to figure out how well the learning worked

for j in tqdm(range(nb_episodes), position=0, leave=True):

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Greedy Execution

Test the learned policy by pure greedy execution.

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped

nb_steps = 200


state = env.reset() # initialize x_0
disc_state = tuple(discretize_state(state)) # use tuple indexing
disc_action = pi[disc_state]

for k in range(nb_steps):

    cont_action = continualize_action(disc_action)
    env.render() # comment out for faster execution
    state, reward, done, _ = env.step(cont_action)
    disc_state = tuple(discretize_state(state))

    if done:
        break

    disc_action = pi[disc_state] # exploitative action

env.close()

## 4) Comprehensive: n-Step $Q(\sigma)$ Hyperparameter Optimization

The $Q(\sigma)$ algorithm allows for even more flexibility. As in n-step Sarsa, we can again choose the number $n$ of past steps to consider for an update. Moreover, we can choose the parameter $\sigma$ to change the weighting between sampled and expected state transitions. Accordingly, a properly tuned $Q(\sigma)$ agent is very flexible and thus powerful.

Unfortunately, with great power comes great responsibility: namely, the responsibility to choose the parameters well. What is a "properly tuned" agent? This is a very basic question of RL, and we now want to investigate it by means of a (small) hyperparameter optimization that uses simple grid search.

Write an off-policy epsilon-greedy $Q(\sigma)$ ($\pi$ greedy and $b$ $\varepsilon$-greedy) algorithm to control the inverted pendulum. The algorithm should be formulated as a function, such that we can pass $n$ and $\sigma$ to it and get a policy and a reward history returned. For simplicity, we define $\sigma=\text{const.}$ during one training process.

Carry out the training process for $n \in \{0,2,4,...,10\}$ and for $\sigma \in \{0,0.2,0.4,...,1\}$. Evaluate the training process by determining the best reward history (how can we define what is a good reward history?).

Parameters: $\alpha=0.1, \gamma=0.9, \varepsilon=0.1$ with 2 000 episodes of 500 time steps each.
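One possible way to compare reward histories, shown here only as a sketch, is to average the cumulative reward over the last few episodes of each run. This is an assumption, not the only sensible criterion; the function name `score_history`, the `window` parameter, and the numbers below are made-up placeholders rather than results.

```python
def score_history(reward_history, window=100):
    """Mean cumulative reward over the last `window` episodes (hypothetical metric)."""
    tail = reward_history[-window:]
    return sum(tail) / len(tail)

# Placeholder histories keyed by (n, sigma); the values are invented for illustration
results = {
    (2, 0.0): [-900.0, -700.0, -500.0, -400.0],
    (4, 0.5): [-950.0, -600.0, -300.0, -200.0],
}

# Pick the parameter pair whose history scores best (rewards are negative, so max is best)
best = max(results, key=lambda key: score_history(results[key], window=2))
print(best)
```

Other criteria, such as the speed of improvement or the variance across episodes, may be equally valid and can lead to a different choice of parameters.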

![](QSigma_1.png)
![](QSigma_2.png)
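The role of $\sigma$ can be seen in the return computation itself. The sketch below uses one common recursive formulation of the $Q(\sigma)$ return, in which each step blends the importance-sampling ratio $\rho$ (weight $\sigma$) with the target-policy probability of the taken action (weight $1-\sigma$). The function name, the argument layout, and the assumption that no terminal state lies inside the window are choices made for illustration.

```python
def q_sigma_return(rewards, q_values, pi_probs, actions, rhos, sigma, gamma):
    """n-step Q(sigma) return with constant sigma, no terminal state in the window.

    All lists have length n; entry k refers to time step t+k+1:
      rewards[k]  : reward R_{t+k+1}
      q_values[k] : list with Q(S_{t+k+1}, a) for every action a
      pi_probs[k] : list with pi(a | S_{t+k+1}) for every action a
      actions[k]  : action A_{t+k+1} taken by the behavior policy b
      rhos[k]     : importance ratio pi(A|S) / b(A|S) at step t+k+1
    """
    last = len(rewards) - 1
    # The recursion starts from the action value at the end of the window
    g = q_values[last][actions[last]]
    for k in range(last, -1, -1):
        a = actions[k]
        # Expected value of the state under the target policy
        v_bar = sum(p * q for p, q in zip(pi_probs[k], q_values[k]))
        # sigma = 1: fully sampled (importance sampling), sigma = 0: fully expected
        weight = sigma * rhos[k] + (1.0 - sigma) * pi_probs[k][a]
        g = rewards[k] + gamma * weight * (g - q_values[k][a]) + gamma * v_bar
    return g

# Toy example (made-up numbers): two actions, greedy target policy picks action 1
q = [[1.0, 3.0], [2.0, 4.0]]
greedy = [[0.0, 1.0], [0.0, 1.0]]
args = ([1.0, 2.0], q, greedy, [1, 1], [2.0, 2.0])
g_sigma0 = q_sigma_return(*args, sigma=0.0, gamma=0.5)
g_sigma_half = q_sigma_return(*args, sigma=0.5, gamma=0.5)
g_sigma1 = q_sigma_return(*args, sigma=1.0, gamma=0.5)
print(g_sigma0, g_sigma_half, g_sigma1)
```

In this toy example, the return for $\sigma = 0$ coincides with a tree-backup return, while larger values of $\sigma$ give more weight to the sampled trajectory through the ratio $\rho$.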

## Solution 4)

This task will take some time, both in coding and in running the optimization.

In [None]:
def Q_sigma(n, sigma):
    env = gym.make('Pendulum-v0')
    env = env.unwrapped

    action_values = np.zeros([d_theta, d_theta, d_omega, d_T])
    pi = np.zeros([d_theta, d_theta, d_omega], dtype=int) # greedy target policy; the behavior policy b is epsilon-greedy

    alpha = 0.1
    gamma = 0.9
    epsilon = 0.1
    # sigma is an argument of this function
    nb_steps = 500
    nb_episodes = 2000

    reward_history = []

    for j in tqdm(range(nb_episodes), position=0, leave=True):

        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
# install joblib with
# >>> pip install joblib

from joblib import Parallel, delayed

ns = np.arange(6) * 2
sigmas = np.linspace(0, 1, 6)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
env = gym.make('Pendulum-v0')
env = env.unwrapped

nb_steps = 500


state = env.reset() # initialize x_0
disc_state = tuple(discretize_state(state)) # use tuple indexing
disc_action = pi[disc_state]

for k in range(nb_steps):

    cont_action = continualize_action(disc_action)
    env.render() # comment out for faster execution
    state, reward, done, _ = env.step(cont_action)
    disc_state = tuple(discretize_state(state))

    if done:
        break

    disc_action = pi[disc_state] # exploitative action

env.close()