# Exercise 11) Eligibility Traces

In this exercise we explore whether eligibility traces can improve the learning behavior in the MountainCar environment. The basic idea behind eligibility traces is that, much like in n-step learning, past actions contributed to the current situation. In contrast to n-step learning, however, intuition suggests that recent decisions affect the present situation more strongly than decisions made a long time ago. It may therefore help to introduce a forgetting factor $\lambda$ that decreases the assumed influence of actions over time.
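The effect of the forgetting factor can be made concrete with a few numbers. The following is a minimal sketch, assuming the accumulating trace used later in this exercise, where each elapsed step multiplies an old contribution by $\gamma \lambda$. The values of `gamma` and `lam` are chosen only for illustration.

```python
import numpy as np

# Illustrative values (assumptions), not tuned for MountainCar
gamma = 1.0
lam = 0.5

# age = number of time steps since the action was taken (0 = current step)
ages = np.arange(5)

# a contribution made `age` steps ago has been scaled by gamma * lambda
# once per elapsed step
credit = (gamma * lam) ** ages

print(credit)
```

With these values the assumed influence halves with every step into the past. For $\lambda = 0$ only the current step receives credit, and for $\gamma \lambda = 1$ all past steps count equally.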

Once again we consider the MountainCar environment with a discrete action space.

![](SarsaGridworld.png)

In [None]:
import numpy as np
import gym
import random
import time
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
plt.style.use('seaborn-talk')
from mpl_toolkits.mplot3d import Axes3D
from collections import deque
from sklearn.kernel_approximation import RBFSampler
import sklearn.pipeline
import sklearn.preprocessing

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

## 0) Preprocessing Gradients

For many applications, directly using TensorFlow's `apply_gradients` function is not sufficient, because the gradients must be processed before they are applied to the ANN. For example, in the case of Sarsa($\lambda$) we want to compute

\begin{align}
\mathbf{z}_\text{new} = \gamma \lambda \mathbf{z}_\text{old} + \nabla_\mathbf{w}\hat{q}(\mathbf{x}_{k},u_{k},\mathbf{w})
\end{align}

before applying the gradient

\begin{align}
\mathbf{w}_\text{new} = \mathbf{w}_\text{old} + \alpha \delta \mathbf{z}_\text{new}.
\end{align}

Thus, two things are needed. First, we have to calculate the gradient $\nabla_\mathbf{w}\hat{q}(\mathbf{x}_{k},u_{k},\mathbf{w})$. Second, we have to accumulate it into the trace $\mathbf{z}$ as shown above before updating the weights.

(Recall that in the last exercise we were applying the gradient
\begin{align}
\mathbf{w}_\text{new} = \mathbf{w}_\text{old} + \alpha \delta \nabla_\mathbf{w}\hat{q}(\mathbf{x}_{k},u_{k},\mathbf{w}),
\end{align}
which we were able to compute easily using a quadratic loss function (mean squared error).) 

Note that we do not use a quadratic loss function this time! This is due to the difference in the handling of gradients between exercises 10 and 11, as presented in [this PDF](TF_Gradients.pdf).
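The two update equations for $\mathbf{z}$ and $\mathbf{w}$ can be sketched with plain NumPy. This is only an illustration under assumptions: the function name `trace_update` is hypothetical, and the small arrays are toy stand-ins for the per-layer lists that `model.get_weights()` and `tape.gradient(...)` would provide.

```python
import numpy as np

def trace_update(w, z, grads, alpha, delta, gamma, lam):
    """One Sarsa(lambda) step on lists of arrays (one entry per weight tensor)."""
    # z_new = gamma * lambda * z_old + gradient of q_hat
    z_new = [gamma * lam * z_i + g_i for z_i, g_i in zip(z, grads)]
    # w_new = w_old + alpha * delta * z_new
    w_new = [w_i + alpha * delta * z_i for w_i, z_i in zip(w, z_new)]
    return w_new, z_new

# Toy stand-ins (assumptions) for the weights and gradients of a tiny network
w = [np.zeros((2, 3)), np.zeros(3)]
z = [np.zeros_like(w_i) for w_i in w]
grads = [np.ones((2, 3)), np.ones(3)]

# two consecutive steps with the same gradient and TD error
w, z = trace_update(w, z, grads, alpha=0.1, delta=2.0, gamma=1.0, lam=0.5)
w, z = trace_update(w, z, grads, alpha=0.1, delta=2.0, gamma=1.0, lam=0.5)
```

Because the weights and gradients are lists of differently shaped arrays, the update is carried out element by element. After the second step the trace holds the new gradient plus half of the previous one.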

Take a look at the code snippet below and try to understand what it does. The code is only an excerpt, so it is not executable, but this exemplary application should give good orientation on how to tackle task (1).

In [None]:
### example: how to apply gradients manually ###

# define a mean loss function
mean = tf.keras.backend.mean

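# suppose that gradients are given by
# (the "loss" is simply q_hat(x_k, u_k, w): the mean of a single scalar is the scalar itself,
# so tape.gradient returns the gradient of q_hat with respect to the weights)
with tf.GradientTape() as tape:
    action_values = model(norm_state)
    loss = mean(action_values[0][action])
gradients = tape.gradient(loss, model.trainable_variables)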

### the following code replaces the opt.apply_gradients(zip(gradients, model.trainable_variables)) command ###

# get the weights from the model, this is a list of arrays so we cannot perform calculations on it,
# we have to carry out calculations per list element
w = model.get_weights()

# go over each element
for i in range(len(w)):
    w[i] += alpha * delta * gradients[i] 
    # gradients as computed per tape.gradient(...) have the same structure as model.get_weights()
    
# put the freshly updated weights back into the model
model.set_weights(w)

## 1) Sarsa($\lambda$) with Neural Networks

Write a Sarsa($\lambda$) algorithm to modify the learning behavior of an ANN function approximator. Test it for different values of $\lambda$. How sensitive is the process to the choice of $\lambda$?

In [None]:
def plot_surface(model, input_dim):
    resolution = 100
    pos_vec = np.linspace(-1.2, 0.6, resolution)
    vel_vec = np.linspace(-0.07, 0.07, resolution)

    pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)
    state_tensor = np.zeros([resolution, resolution, input_dim])

    for pos_idx, pos in enumerate(tqdm(pos_vec)):
        for vel_idx, vel in enumerate(vel_vec):
            state_tensor[vel_idx, pos_idx] = featurize(np.array([pos, vel]))

    q_mat = model(state_tensor)
    q_maxes = np.reshape(np.max(q_mat, axis=2), (resolution, resolution))
    

    # Plot
    # create one figure with a single 3D axes (plt.subplots() would add an unused 2D axes)
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.plot_surface(pos_mat, vel_mat, -q_maxes, cmap="viridis")
    ax.set_xlabel('\n\nposition')
    ax.set_ylabel('\n\nvelocity')
    ax.set_zlabel(r'$-V_\mathrm{greedy}$', labelpad=12)
    ax.view_init(50, -135)
    plt.show()

In [None]:
env = gym.make('MountainCar-v0')
observation_examples = np.array([env.observation_space.sample() for x in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(observation_examples)

featurizer = sklearn.pipeline.FeatureUnion([
    ("rbf0", RBFSampler(gamma=5.0, n_components = 100)),
    ("rbf1", RBFSampler(gamma=2.0, n_components = 100)),
    ("rbf2", RBFSampler(gamma=1.0, n_components = 100)),
    ("rbf3", RBFSampler(gamma=0.5, n_components = 100)),
    ])
featurizer.fit(scaler.transform(observation_examples))


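def featurize(state):
    try:
        scaled = scaler.transform([state])
    except Exception:
        # show the offending state, then re-raise instead of failing later
        # with an unrelated NameError on `scaled`
        print(state)
        raise
    featurized = featurizer.transform(scaled)
    return featurized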

In [None]:
alpha = 0.0001
gamma = 1
_lambda = 0.1
epsilon = 0.15
nb_episodes = 300

env = gym.make('MountainCar-v0')

state = env.reset()
norm_state = featurize(state)
input_dim = len(norm_state[0])


# define ANN topology
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=input_dim))
model.add(Dense(64, activation='relu'))
model.add(Dense(3, activation='linear'))

mean = tf.keras.backend.mean
opt = SGD(learning_rate=alpha)

needed_steps_lambda = []

for j in tqdm(range(nb_episodes)):
    k = 0
    rewards = 0

    # initialize z to zero; 
    # needs to be done in a loop because get_weights and gradients are lists 
    # of arrays that preserve the structure of the ANN
    z = model.get_weights()
    for i in range(len(z)):
        z[i] = z[i] * 0

    state = env.reset()
    norm_state = featurize(state)

    action_values = np.squeeze(model(norm_state).numpy())

    # choose the initial action epsilon-greedily
    if epsilon < np.random.rand(1):
        action = np.argmax(action_values)
    else:
        action = random.choice(range(3))

    while True:
        
        # YOUR CODE HERE
        raise NotImplementedError()

### Greedy Execution

In [None]:
env = gym.make('MountainCar-v0')
state = env.reset()

k = 0

while True:
    env.render()
    k += 1
    
    norm_state = featurize(state)
    action_values = np.squeeze(model(norm_state).numpy())
    action = np.argmax(action_values)
    state, reward, done, _ = env.step(action)
    
    if done:
        break

print(k)
env.close()

## 2) True Online Sarsa($\lambda$)

Just as LSPI in exercise 10, there is a powerful Sarsa($\lambda$) variant that employs a linear approximator. This algorithm is called True Online Sarsa($\lambda$). Let's see whether it reacts differently to the choice of $\lambda$.
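For orientation, a single update step of True Online Sarsa($\lambda$) can be sketched as follows. This is a minimal sketch that follows the standard textbook formulation (Sutton and Barto) for linear approximators with column feature vectors; the function name `true_online_sarsa_step` and its argument names are assumptions and not part of the exercise template.

```python
import numpy as np

def true_online_sarsa_step(w, z, x, x_next, reward, q_old, alpha, gamma, lam):
    """One update with column vectors w, z, x, x_next of shape (n, 1)."""
    q = (w.T @ x).item()            # q_hat of the current state-action pair
    q_next = (w.T @ x_next).item()  # q_hat of the next state-action pair
    delta = reward + gamma * q_next - q

    # dutch-style trace update
    z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z.T @ x).item()) * x
    # weight update with the correction terms involving q_old
    w = w + alpha * (delta + q - q_old) * z - alpha * (q - q_old) * x

    # the returned q_next serves as q_old in the following step
    return w, z, q_next

# Toy example (assumed numbers) with two features
w = np.array([[0.5], [-0.2]])
z = np.zeros_like(w)
x = np.array([[1.0], [0.0]])
x_next = np.array([[0.0], [1.0]])

w_new, z_new, q_old_new = true_online_sarsa_step(
    w, z, x, x_next, reward=-1.0, q_old=0.0, alpha=0.1, gamma=1.0, lam=0.0
)
```

For $\lambda = 0$ the trace equals the current feature vector and the correction terms cancel, so the step reduces to ordinary one-step Sarsa. For a terminal next state, `x_next` is expected to be a zero vector, so that its estimated value is zero.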

In [None]:
observation_examples = np.array([env.observation_space.sample() for x in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(observation_examples)

featurizer = sklearn.pipeline.FeatureUnion([
    ("rbf0", RBFSampler(gamma=5.0, n_components = 100)),
    ("rbf1", RBFSampler(gamma=2.0, n_components = 100)),
    ("rbf2", RBFSampler(gamma=1.0, n_components = 100)),
    ("rbf3", RBFSampler(gamma=0.5, n_components = 100)),
    ])
featurizer.fit(scaler.transform(observation_examples))


def featurize(state, action):
    action_vec = np.zeros([3, 1])
    action_vec[action] = 1
    
    win = 0
    if state[0] > 0.5:
        win = 1
    
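    try:
        scaled = scaler.transform([state])
    except Exception:
        # show the offending state, then re-raise instead of failing later
        # with an unrelated NameError on `scaled`
        print(state)
        raise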
    featurized = featurizer.transform(scaled)
    featurized = np.reshape(featurized, (-1, 1)) # make column vector
    
    featurized = np.append(featurized, np.array([[1]]), axis = 0)
    
    featurized_vec = np.array([])
    featurized_vec = np.expand_dims(featurized_vec, axis=-1)
    for a in action_vec:
        if a == 1:
            featurized_vec = np.append(featurized_vec, featurized, axis = 0)
        elif a == 0:
            featurized_vec = np.append(featurized_vec, np.zeros([len(featurized), 1]), axis = 0)        
    
    return featurized_vec * (1 - win) # all features are zero once the goal is reached, so q_hat = 0 there


def policy(state, w, n, epsilon):
    feat_states = np.zeros([len(w), n, 1])
    q_value = np.zeros([n])

    for i in range(n):    
        feat_state = featurize(state, i)
        feat_states[:, i] = feat_state
        q_value[i] = np.transpose(feat_state) @ w
            
    if epsilon < np.random.rand(1):
        action = np.argmax(q_value)
    else:
        action = random.choice(range(n))
        
    return feat_states[:, action], action

In [None]:
def plot_surface_LSPI(w, input_dim):
    resolution = 100
    pos_vec = np.linspace(-1.2, 0.6, resolution)
    vel_vec = np.linspace(-0.07, 0.07, resolution)

    pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)
    value_tensor = np.zeros([resolution, resolution])

    for pos_idx, pos in enumerate(tqdm(pos_vec)):
        for vel_idx, vel in enumerate(vel_vec):
            feat_state, _ =  policy(np.array([pos, vel]), w, env.action_space.n, 0)
            value_tensor[vel_idx, pos_idx] = np.transpose(feat_state) @ w

    # Plot
    # create one figure with a single 3D axes (plt.subplots() would add an unused 2D axes)
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.plot_surface(pos_mat, vel_mat, -value_tensor, cmap="viridis")
    ax.set_xlabel('\n\nposition')
    ax.set_ylabel('\n\nvelocity')
    ax.set_zlabel(r'$-V_\mathrm{greedy}$', labelpad=12)
    ax.view_init(50, -135)
    plt.show()

In [None]:
alpha = 0.01
gamma = 1
_lambda = 0.25 # leading underscore because lambda is a reserved keyword in Python
epsilon = 0.15
nb_episodes = 300

env = gym.make('MountainCar-v0')
state = env.reset()
feat_state = featurize(state, 0)
feat_dims = len(feat_state)

w = np.zeros(feat_dims)
w = np.expand_dims(w, axis=-1)


for j in tqdm(range(nb_episodes)):
    k = 0
    
    state = env.reset()
    feat_state, action = policy(state, w, env.action_space.n, epsilon)
    
    q_old = 0
    z = np.zeros_like(feat_state)
    
    while True:
        
        # YOUR CODE HERE
        raise NotImplementedError()

### Greedy Execution

In [None]:
env.close()
env = gym.make('MountainCar-v0')
state = env.reset()

k = 0

while True:
    env.render()
    k += 1
    
    _, action = policy(state, w, env.action_space.n, 0)
    state, reward, done, _ = env.step(action)
    
    if done:
        break

print(k)
env.close()

## 3) Optional: Comparison

Compare the algorithms from exercise 11 to the ones from exercise 10. For which values of $\lambda$ and $\alpha$ do they learn the fastest? Have we achieved a major breakthrough?

This is covered within the corresponding tutorial video.
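When comparing runs, the number of steps per episode is usually noisy, so a smoothed curve is easier to read. The following is a minimal sketch, assuming the episode lengths of each run were collected in a list; the helper `moving_average` is hypothetical and the numbers are made-up placeholders, not results.

```python
import numpy as np

def moving_average(values, window):
    """Average over a sliding window; returns len(values) - window + 1 entries."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# made-up placeholder data (not measured results)
steps_per_episode = [200, 200, 180, 160, 150, 140]

smoothed = moving_average(steps_per_episode, window=3)
print(smoothed)
```

Plotting one such smoothed curve per combination of $\lambda$ and $\alpha$ in the same figure shows which setting reduces the episode length the fastest.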