<a href="https://colab.research.google.com/github/onlyabhilash/reinforcement_learning_course_materials/blob/main/exercises/templates/ex09/FunctionApproximatorPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 9) Function Approximators in Prediction

Up to now, we have used tabular methods to store data. These methods are simple to implement and use, but they are inefficient in several respects:

- Tabular methods need a lot of storage capacity. Saving one number per possible state (or state-action combination) is very expensive for systems with a large discrete problem space. It becomes impossible for continuous problem spaces.
- Tabular methods are unable to generalize. Every update to the table only contains information about one specific state, which means that we only learn about states we have seen and not about the states that are "near" them. We could address this issue by extending the training time or by lowering our expectations of the outcome.

Alternatively, we can make use of function approximators!
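To put the storage argument into numbers, the following sketch counts the entries of a table over a uniformly discretized state space. The helper `table_entries` and the chosen resolutions are illustrative assumptions and not part of the exercise.

```python
def table_entries(bins_per_dim, state_dims, n_actions=1):
    """Number of table entries for a uniformly discretized state space."""
    return bins_per_dim ** state_dims * n_actions

# the table grows exponentially with the number of state dimensions
for bins in (10, 100, 1000):
    print(bins, table_entries(bins, state_dims=2), table_entries(bins, state_dims=6))
```

A linear function approximator, in contrast, only stores one weight per feature, independent of how finely the state space is resolved.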

For our investigations, we will have a look at the MountainCar environment from OpenAI's `gym`.

This system has a continuous two-dimensional state space and a discrete one-dimensional action space.

The MountainCar is comparable to the pendulum in that a successful policy must be able to perform a swing-up movement. Since the car has only limited ability to accelerate uphill, it has to build up velocity by accelerating downhill.
In contrast to the pendulum scenario, the MountainCar episode terminates upon reaching the goal on the mountaintop to the right. Every time step yields a reward of $r_{k+1}=-1$, such that it is most beneficial to end an episode as fast as possible.
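As a small illustration of this reward structure, the sketch below computes the discounted return of an episode in which every step is rewarded with $-1$. The function name `discounted_return` and the episode lengths are assumptions chosen for illustration.

```python
def discounted_return(n_steps, gamma):
    """Return of an episode with n_steps steps and a reward of -1 per step."""
    return sum(-(gamma ** i) for i in range(n_steps))

# shorter episodes yield a larger (less negative) return
for n in (50, 100, 200):
    print(n, discounted_return(n, gamma=0.9), discounted_return(n, gamma=1.0))
```

For $\gamma < 1$ the return equals $-(1-\gamma^N)/(1-\gamma)$ for an episode of $N$ steps, so with $\gamma = 0.9$ the values of all states are bounded below by $-10$.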

For this exercise we want to concentrate on the evaluation of an existing policy.

![](https://marcinbogdanski.github.io/rl-sketchpad/Deep_Q_Network/assets/mountaincar.gif)

(Source of GIF: https://marcinbogdanski.github.io/rl-sketchpad/Deep_Q_Network/1010_DQN_ClassicControl.html)

Please make sure to have `TensorFlow` installed:

`pip install tensorflow`

In [None]:
import numpy as np
import gym
import random
import time
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
plt.style.use('seaborn-talk')
from mpl_toolkits.mplot3d import Axes3D

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

Test the environment:

In [None]:
env = gym.make('MountainCar-v0')
state = env.reset()

while True:
    #env.render()
    state, reward, done, _ = env.step(env.action_space.sample())

    if done:
        break

env.close()

## 1) Linear Function Approximation

The next cell contains a simple policy for the swing-up of the MountainCar. We want to predict the value function with the use of a linear function approximator of the form:

$\hat{v}(\mathbf{x}_k)=\mathbf{w}^\text{T} \tilde{\mathbf{x}}_k$.

Herein, the weight vector is denoted by $\mathbf{w}$. The feature vector $\tilde{\mathbf{x}}_k$ is derived from the state vector $\mathbf{x}_k$:

$\tilde{\mathbf{x}}_k = f (\mathbf{x}_k)$

The state vector $\mathbf{x}_k$ consists of the (horizontal) position and the velocity.

Write a Semi-Gradient TD(0) prediction algorithm that learns the weights of this linear value function approximator.
Make use of a `feature` function that accepts the state vector as an input and returns a feature vector that is derived from the state. The feature vector should be equal to zero ($\tilde{\mathbf{x}}_T = \mathbf{0}$) if the finish line has been passed (this happens if the position is greater than $0.5$). Can you find a feature definition that enables a good value estimation?

Hint:
It appears that the chances of successfully reaching the finish line rise as the car's energy increases. You may want to look at the [MountainCar source code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py) to acquire expert knowledge for your feature definition.
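For orientation, a single semi-gradient TD(0) step for a linear approximator can be sketched as follows. This is a minimal sketch, not the reference solution: the function name `td0_update` and the toy feature vectors are assumptions, and the episode loop around it is left out on purpose.

```python
import numpy as np

def td0_update(w, feat, feat_next, reward, alpha, gamma):
    """One semi-gradient TD(0) step for v_hat(x) = w^T feat(x).

    The gradient of v_hat with respect to w is the feature vector itself;
    the target reward + gamma * v_hat(x') is treated as a constant.
    """
    td_error = reward + gamma * w @ feat_next - w @ feat
    return w + alpha * td_error * feat

# toy usage with made-up feature vectors (not MountainCar features)
w = np.zeros(2)
w = td0_update(w, feat=np.array([1.0, 0.5]), feat_next=np.zeros(2),
               reward=-1.0, alpha=0.1, gamma=0.9)
print(w)
```

Because the feature vector of a terminal state is zero, the bootstrapped term vanishes there automatically, which is why no special case for the end of an episode is needed in this sketch.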

In [None]:
def policy(state):
    # 0: left, 1: idle, 2: right

    pos, vel = state

    # fixed policy, do not change
    action = 2*int(vel > 0)

    return action

def feature(state):
    pos, vel = state

    # YOUR CODE HERE
    raise NotImplementedError()

    return feature_vec


In [None]:
### Semi-Gradient TD(0)
alpha = 0.1
gamma = 0.9
nb_episodes = 500

env = gym.make('MountainCar-v0')

state = env.reset()
feat_state = feature(state)
w = np.zeros([len(feat_state)])
visited_states = []

for j in tqdm(range(nb_episodes)):

    # YOUR CODE HERE
    raise NotImplementedError()

Evaluate the resulting parameter vector by investigating the performance of the value function approximator on the whole state space, preferably in a plot. Which parts of the estimation seem accurate, and which do not?
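One way to evaluate a linear approximator on the whole state space is to compute its output on a regular grid over position and velocity. The sketch below assumes a placeholder feature (`example_feature`) and made-up weights (`w_example`); both are hypothetical and only stand in for your own `feature` function and learned `w`.

```python
import numpy as np

def example_feature(state):
    # hypothetical placeholder feature: bias, position, velocity
    pos, vel = state
    return np.array([1.0, pos, vel])

def value_grid(w, feature_fn, resolution=50):
    """Evaluate w^T feature_fn(state) on a grid over the MountainCar state space."""
    pos_vec = np.linspace(-1.2, 0.6, resolution)
    vel_vec = np.linspace(-0.07, 0.07, resolution)
    pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)
    value_mat = np.zeros_like(pos_mat)
    for idx in np.ndindex(pos_mat.shape):
        value_mat[idx] = w @ feature_fn((pos_mat[idx], vel_mat[idx]))
    return pos_mat, vel_mat, value_mat

w_example = np.array([-5.0, 1.0, 10.0])  # made-up weights, not a learned result
pos_mat, vel_mat, value_mat = value_grid(w_example, example_feature)
print(value_mat.shape)
```

The three returned matrices can be passed directly to `plt.pcolormesh(pos_mat, vel_mat, value_mat)` or to a 3D surface plot. Overlaying the visited states helps to judge where the estimate is backed by data.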

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 2) Recursive Least Squares TD

In the previous task, we computed the parameters of the value function iteratively, using old parameters and new observations to calculate new parameters. This works, but we can be more data-efficient. Recursive Least Squares Temporal Difference Learning (RLS-TD) determines new parameters on the basis of new AND old observations, such that the resulting parameters are optimally fitted while taking past experiences into account. This method does not use a step size but only a forgetting factor $\lambda \in [0,1]$, which defines the impact of past experiences.

Write an RLS-TD algorithm to solve the prediction problem. Check the stability of your code for the forgetting factors $\lambda \in \{0.9, 0.99, 1\}$. As this algorithm contains a lot of matrix multiplications, make sure that vectors are represented with the correct dimensions. For example, one could use `row_vector.reshape(-1, 1)` to turn a row vector into a column vector, or use the postfix `.T` to transpose an array (note that this has no effect on one-dimensional arrays).
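The following short NumPy sketch illustrates the remark on array dimensions. The variable names are chosen for illustration only.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])   # one-dimensional array, shape (3,)
print(v.T.shape)                # .T leaves a 1-D array unchanged

col = v.reshape(-1, 1)          # column vector, shape (3, 1)
row = col.T                     # row vector, shape (1, 3)
print(col.shape, row.shape)

print((row @ col).shape)        # inner product as a (1, 1) matrix
print((col @ row).shape)        # outer product as a (3, 3) matrix
```

Keeping all vectors as explicit two-dimensional column vectors avoids accidental broadcasting when inner and outer products are mixed.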

The feature definition from task (1) can be reused here as long as $\tilde{\mathbf{x}}_T = \mathbf{0}$ holds.
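As a reference point, a single RLS-TD update can be sketched as below. This follows one common formulation of RLS-TD with a forgetting factor; the notation and the exact arrangement of the update may differ from the lecture, and the function name `rls_td_update` as well as the one-hot toy features are assumptions.

```python
import numpy as np

def rls_td_update(w, P, x, x_next, reward, gamma, lam):
    """One RLS-TD step; w, x, x_next have shape (n, 1), P has shape (n, n)."""
    d = x - gamma * x_next                      # feature difference
    k = P @ x / (lam + (d.T @ P @ x).item())    # gain vector, shape (n, 1)
    w = w + k * (reward - (d.T @ w).item())     # correct w by the TD error
    P = (P - k @ d.T @ P) / lam                 # update the inverse matrix estimate
    return w, P

# toy usage with made-up one-hot features for a two-state chain
gamma, lam = 0.9, 1.0
w, P = np.zeros((2, 1)), np.eye(2)
x0 = np.array([[1.0], [0.0]])
x1 = np.array([[0.0], [1.0]])
x_term = np.zeros((2, 1))                       # terminal feature is zero
for _ in range(100):
    w, P = rls_td_update(w, P, x0, x1, -1.0, gamma, lam)
    w, P = rls_td_update(w, P, x1, x_term, -1.0, gamma, lam)
print(w.ravel())
```

With $\lambda = 1$ and the initialization `P = np.eye(n)`, `w = 0`, this recursion reproduces the batch least-squares solution with an identity matrix added as regularization, which is a convenient way to check an implementation.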

In [None]:
def feature(state):
    pos, vel = state
    win = int(pos > 0.5)
    # YOUR CODE HERE
    raise NotImplementedError()

    return feature_vec

In [None]:
gamma = 0.9
_lambda = 1 # named with a leading underscore because "lambda" is a reserved keyword in Python
nb_episodes = 500

env = gym.make('MountainCar-v0')
state = env.reset()
feat_state = feature(state)
feat_dims = len(feat_state)

P = np.eye(feat_dims)
w = np.zeros((feat_dims, 1))  # column vector
ident_mat = np.eye(feat_dims)  # identity matrix

for j in tqdm(range(nb_episodes)):

    # YOUR CODE HERE
    raise NotImplementedError()

To simplify the comparison with your results from task (1), visualize the results from this task as well.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 3) Nonlinear Function Approximation with Artificial Neural Networks

Maybe we can achieve a more precise value estimation if we use a nonlinear function approximator. The dynamics of the environment are nonlinear, so it would be reasonable to expect a better result. We will use an artificial neural network as our function approximator without feature engineering.

Neural networks tend to learn more reliably if the input variables are normalized. In this case, we will use min-max normalization. Write the function `normalize` that normalizes the MountainCar state:

\begin{align}
\text{state}&\in
\begin{bmatrix}
[-1.2, 0.6] \\
[-0.07, 0.07]
\end{bmatrix}
\\
\text{normalize(state)}&\in
\begin{bmatrix}
[-1, 1] \\
[-1, 1]
\end{bmatrix}
\end{align}
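In general, min-max normalization maps an interval $[x_\text{min}, x_\text{max}]$ linearly onto $[-1, 1]$ via $2\,(x - x_\text{min})/(x_\text{max} - x_\text{min}) - 1$. The sketch below shows this mapping in a generic form; the helper name `minmax_scale` and its signature are assumptions and it is not meant as the reference solution.

```python
import numpy as np

def minmax_scale(x, low, high):
    """Map x from [low, high] linearly onto [-1, 1], element-wise."""
    x = np.asarray(x, dtype=float)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    return 2.0 * (x - low) / (high - low) - 1.0

low = np.array([-1.2, -0.07])    # lower bounds of position and velocity
high = np.array([0.6, 0.07])     # upper bounds of position and velocity
print(minmax_scale([-1.2, 0.07], low, high))     # a single state
print(minmax_scale([[-0.3, 0.0]], low, high))    # a batch containing one state
```

Note that the training code in this task passes the output of `normalize` directly to the Keras model, which expects an input with a batch dimension, so a single state has to end up with shape `(1, 2)`.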

`TensorFlow` can be difficult to understand if you are new to it, and using it efficiently in reinforcement learning looks different from its use in "traditional" supervised learning. That is why we have prepared the code for the network training this time. You only need to write a proper `normalize` function here, but of course feel free to explore the learning algorithm (e.g. experiment with different ANN topologies) as a preparation for the upcoming exercises. A more detailed explanation of the usage of `TensorFlow` in this task will be presented in the corresponding tutorial video.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

def normalize(state):
    # YOUR CODE HERE
    raise NotImplementedError()
    return normed_state

In [None]:
alpha = 0.0001
gamma = 0.9
nb_episodes = 1000

env = gym.make('MountainCar-v0')

# define ANN topology
model = Sequential()
model.add(Dense(16, activation='relu', input_dim=2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='linear'))

opt = SGD(learning_rate=alpha)
mse = tf.keras.losses.MeanSquaredError()

errors = []
visited_states = []

for j in tqdm(range(nb_episodes)):
    state = env.reset()
    norm_state = normalize(state)

    while True:
        #env.render()
        action = policy(state)
        next_state, reward, done, _ = env.step(action)

        # calculate the target (the value we want to estimate)
        norm_next_state = normalize(next_state)
        next_value = model(norm_next_state)
        if not done:
            target = reward + gamma * next_value
        else:
            target = np.array([[reward]])

        # calculate the loss (the prediction error)
        with tf.GradientTape() as tape:
            prediction = model(norm_state)
            loss = mse(target, prediction)

        # calculate and apply the gradients dMSE/dw
        gradients = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(gradients, model.trainable_variables))

        norm_state = norm_next_state
        state = next_state

        errors.append(loss.numpy())
        visited_states.append(state)

        if done:
            break

    #env.close()
visited_states = np.vstack(visited_states)

This is what the `relu` activation looks like:

In [None]:
x = np.linspace(-1, 1, 1000)
y = np.maximum(0, x)
plt.figure(figsize=(3, 3))
plt.plot(x, y)
plt.grid(True)
plt.xlim([-1, 1])
plt.ylim([-0.1, 1])
plt.xlabel(r"$x$")
_ = plt.ylabel(r"$\mathrm{relu}(x)$")

Try to visualize the results from this task as well.

In [None]:
resolution = 100
pos_vec = np.linspace(-1.2, 0.6, resolution)
vel_vec = np.linspace(-0.07, 0.07, resolution)
pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)  # shapes: (100, 100), (100, 100)

value_mat = model(normalize(
                    np.column_stack([pos_mat.ravel(), vel_mat.ravel()])))\
                 .numpy().reshape(pos_mat.shape)

# Plot
plt.pcolormesh(pos_mat, vel_mat, value_mat)
plt.xlabel("position")
plt.ylabel("velocity")
cbar = plt.colorbar()
cbar.set_label("V")

_ = plt.scatter(visited_states[:, 0], visited_states[:, 1], color="red", s=0.5)

In [None]:
visited_values = model(normalize(visited_states)).numpy()

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_surface(pos_mat, vel_mat, value_mat, cmap="viridis")
ax.scatter(visited_states[:, 0], visited_states[:, 1], np.squeeze(visited_values), color="red")
ax.set_xlabel('\n\nposition')
ax.set_ylabel('\n\nvelocity')
ax.set_zlabel('V')
ax.view_init(30, -135)