# Reinforcement Learning

![Reinforcement learning diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/300px-Reinforcement_learning_diagram.svg.png)

## Introduction

Reinforcement learning is a form of machine learning in which an agent interacts with an environment, observes the effects of its actions, and collects rewards.

The goal of reinforcement learning is to learn an optimal policy, so that, given a state, the agent can decide what to do next.

In this exercise we will look into two fundamental algorithms that are capable of solving Markov decision processes (MDPs), namely [Monte Carlo Tree Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) and [Q-Learning](https://en.wikipedia.org/wiki/Q-learning) (optional).

## Objectives

By the time you complete this lab, you should know:

- The relevant pieces of a reinforcement learning system
- The basics of *[gym](https://gym.openai.com/envs/#classic_control)* to conduct your own RL experiments
- How Monte Carlo evaluation works
- How Monte Carlo Tree Search works
- The advantages of MCTS vs. MC evaluation

## MDP

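A Markov decision process (MDP) is a 4-tuple $(S,A,P_{a},R_{a})$, where $S$ is the set of states, $A$ is the set of actions, $P_{a}(s,s')$ is the probability that taking action $a$ in state $s$ leads to state $s'$, and $R_{a}(s,s')$ is the immediate reward received for that transition.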

![MDP](mdp.png "MDP")
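To make the 4-tuple concrete, here is a minimal sketch of a tiny MDP written out as plain Python dictionaries. The two states, two actions, and all probabilities and rewards are hypothetical and only serve as an illustration; they are not the frozen lake.

```python
# Hypothetical two-state MDP, written out as the 4-tuple (S, A, P, R).
S = ["ice", "goal"]
A = ["stay", "move"]

# P[a][s][s1]: probability of landing in s1 after taking action a in state s
P = {
    "stay": {"ice": {"ice": 1.0, "goal": 0.0}, "goal": {"ice": 0.0, "goal": 1.0}},
    "move": {"ice": {"ice": 0.2, "goal": 0.8}, "goal": {"ice": 0.0, "goal": 1.0}},
}

# R[a][s][s1]: immediate reward received for that transition
R = {
    "stay": {"ice": {"ice": 0.0, "goal": 0.0}, "goal": {"ice": 0.0, "goal": 0.0}},
    "move": {"ice": {"ice": 0.0, "goal": 1.0}, "goal": {"ice": 0.0, "goal": 0.0}},
}


def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(P[a][s][s1] * R[a][s][s1] for s1 in S)


# Every row of P has to be a probability distribution over successor states
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-12

print(expected_reward("ice", "move"))
```

A tabular representation like this only works for small state spaces, which is the case for the grid worlds used in this lab.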

## Problem

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. (However, the ice is slippery, so you won't always move in the direction you intend.)

## Setup

To begin, we need to install the required Python package dependencies.



In [None]:
#!pip install --quiet gym

### Imports and Helper Functions

#### Imports

In [None]:
# Python imports
import random
import heapq
import collections
import math

# Reinforcement Learning environments
import gym
# Scientific computing
import numpy as np
# Plotting library
import matplotlib.pyplot as plt
import matplotlib.cm as cm


#### Helper Functions

In [None]:
# Define the default figure size
plt.rcParams['figure.figsize'] = [16, 4]

def create_numerical_map(env):
    """Convert the string map of the environment to a numerical version"""
    # start (S) and frozen (F) tiles are both plain ice
    tile_codes = {'S': 2, 'G': 1, 'F': 2, 'H': 3}
    desc = env.unwrapped.desc
    n_cols = desc.shape[1]
    numerical_map = np.zeros(desc.shape)
    for i, row in enumerate(desc):
        for j, col in enumerate(row):
            numerical_map[i, j] = tile_codes.get(col.decode('UTF-8'), 0)
    # mark the agent's current position; states are numbered row by row
    numerical_map[env.unwrapped.s // n_cols, env.unwrapped.s % n_cols] = 0
    return numerical_map


def visualize_env(env):
    """Plot the environment"""
    fig, ax = plt.subplots()
    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title('The frozen Lake')
    i = ax.imshow(create_numerical_map(env), cmap=cm.jet)
    plt.show()
    print('the position is blue, holes are red, ice is yellow and the goal is teal')

def visualize_policy(env, policy, ax=None, title=None):
    """Plot the policy in the environment"""
    if ax is None:
        ax = plt.gca()
    font_size = 10 if env.observation_space.n > 16 else 20
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1]+j
            if policy[s] == 0:
                ax.annotate("L", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 1:
                ax.annotate("D", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 2:
                ax.annotate("R", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 3:
                ax.annotate("U", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            j += 1
        i += 1

    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    if title is None:
        ax.set_title('Policy for the Frozen Lake')
    else:
        ax.set_title(title)
    ax.imshow(create_numerical_map(env), cmap=cm.jet)
    return


def visualize_v(env, v, ax=None, title=None):
    """Plot value function values in the environment"""
    if ax is None:
        ax = plt.gca()
    font_size = 10 if env.observation_space.n > 16 else 20
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1]+j
            ax.annotate("{:.2f}".format(v[s]), xy=(j, i), xytext=(j, i), ha="center",
                        va="center", size=font_size, color="white")
            j += 1
        i += 1

    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    if title is None:
        ax.set_title('State Value Function for the Frozen Lake')
    else:
        ax.set_title(title)
    ax.imshow(create_numerical_map(env), cmap=cm.jet)
    return


def compute_v_from_q(env, q):
    """Compute the v function given the q function, maximizing over the actions of a given state."""
    # q has one row per state, so the maximum over axis 1 is the state value
    return np.max(q, axis=1)

def compute_policy_from_q(env, q):
    """Compute the policy function given the q function, finding the action that yields the maximum of a given state."""
    # greedy policy: index of the best action in each state
    return np.argmax(q, axis=1)

#### Deterministic Environments

In [None]:
# register variants of the frozen lake without execution uncertainty i.e. deterministic environments
from gym.envs.registration import register

register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78,  # optimum = .8196
)

register(
    id='FrozenLakeNotSlippery8x8-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '8x8', 'is_slippery': False},
    max_episode_steps=200,
    reward_threshold=0.99,  # optimum = 1
)

#### Policy Evaluation

In [None]:
def evaluate_episode(env, policy, discount_factor):
    """Evaluates a policy by running it until termination and collect its reward"""
    state = env.reset()
    total_return = 0
    step = 0
    while True:
        state, reward, done, _ = env.step(int(policy[state]))
        # Calculate the total
        total_return += (discount_factor ** step * reward)
        step += 1
        if done:
            break
    return total_return


def evaluate_policy(env, policy, discount_factor=0.95, number_episodes=1000):
    """ Evaluates a policy by running it n times"""
    return np.mean([evaluate_episode(env, policy, discount_factor) for _ in range(number_episodes)])

#### Policy and Value Iteration Parameters

In [None]:
# Set parameters
max_iterations = 1000
num_episodes = 100
discount_factor = 0.95

### Environment

In [None]:
# Deterministic environments
env_name = 'FrozenLakeNotSlippery-v0'
#env_name = 'FrozenLakeNotSlippery8x8-v0'

# Stochastic environments
#env_name = 'FrozenLake-v0'
#env_name = 'FrozenLake8x8-v0'

Create the environment with the previously selected name.

In [None]:
env = gym.make(env_name)
print('Generated the frozen lake with config: ' + env_name)
env.reset()
visualize_env(env)
env.unwrapped.s = 4
visualize_env(env)

#### Understanding the Environment (Object)

**TASK :**
Analyze the environment object and figure out its *observation space* and *action space* as well as its *reward range*.

What is the size of the observation space?

In [None]:
env.observation_space

What is the size of the action space?

In [None]:
env.action_space

What is the range of rewards?

In [None]:
env.reward_range

### Uncertainty in Execution
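The slippery ice can be pictured as a random choice among neighbouring directions. The following is a minimal sketch, assuming the slip model commonly described for gym's FrozenLake: the executed direction is the intended one or one of its two perpendicular directions, each with equal probability. The helper names are hypothetical and not part of gym.

```python
import random

# Action encoding used throughout this notebook
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_outcomes(action):
    """Directions that may actually be executed for an intended action
    (assumption: the intended direction and its two perpendicular ones)."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_executed_action(action, rng=random):
    """Draw the executed direction uniformly from the possible outcomes."""
    return rng.choice(slippery_outcomes(action))


# Intending to go right may result in moving down, right or up
print(slippery_outcomes(RIGHT))
print(sample_executed_action(RIGHT, random.Random(0)))
```

In the deterministic `FrozenLakeNotSlippery` variants registered above, the executed direction always equals the intended one.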

In [None]:
actions = {0:"left",
           1:"down",
           2:"right",
           3:"up"}

s = env.reset()
print("the initial state is: {}".format(s))
visualize_env(env)

# The agent should go right
print("executing action 2, should go right")
s1, r, d, _ = env.step(2)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go left
print("executing action 0, should go left")
s1, r, d, _ = env.step(0)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go down
print("executing action 1, should go down")
s1, r, d, _ = env.step(1)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go up
print("executing action 3, should go up")
s1, r, d, _ = env.step(3)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)


## Monte Carlo Evaluator/Search
* Simulate trajectories through the MDP from the current state $s_t$
* Apply model-free RL to simulated episodes

![Monte Carlo Evaluator/Search](./img/monte_carlo_search.png)

### Monte Carlo Estimate
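###  $\hat{V}(s)=\frac{1}{K}\sum_{k=1}^{K}{G_t^{(k)}}$

Here $G_t^{(k)}$ is the discounted return of the $k$-th simulated episode that starts in state $s$.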
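The estimate is just an average over sampled discounted returns. Here is a minimal, gym-free sketch; the `rollout` function is a hypothetical stand-in for a simulated episode (three zero-reward steps followed by a reward of 1 with probability 0.5), so the true value is $0.5 \cdot 0.95^3$.

```python
import random


def discounted_return(rewards, discount_factor):
    """G = sum_i discount_factor**i * r_i for the rewards of one episode."""
    return sum(discount_factor ** i * r for i, r in enumerate(rewards))


def rollout(rng, length=3, p_success=0.5):
    """Hypothetical episode: `length` zero-reward steps, then reward 1
    with probability p_success (otherwise 0)."""
    final_reward = 1.0 if rng.random() < p_success else 0.0
    return [0.0] * length + [final_reward]


def monte_carlo_estimate(k, discount_factor=0.95, seed=0):
    """Average the discounted returns of k simulated episodes."""
    rng = random.Random(seed)
    returns = [discounted_return(rollout(rng), discount_factor) for _ in range(k)]
    return sum(returns) / k


print(monte_carlo_estimate(10000))  # close to 0.5 * 0.95**3
```

The more episodes are averaged, the closer the estimate gets to the true value, which is why the number of simulations is a central parameter in the experiments below.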


In [None]:
class MCE:
    def __init__(self, env, state = 0, iterations = 1000, discount_factor = 0.95):
        # maximum length of evaluation
        self.number_iterations = iterations
        # discount factor for future rewards
        self.discount_factor = discount_factor
        # environment
        self.env = env
        # initial state
        self.state = state
        self.env.unwrapped.s = self.state
        #visualize_env(self.env)
    
    def evaluate_v(self):
        # determine v
        v_avg = 0.0
        v_max = 0.0
        for i in range(self.number_iterations):
            v = self.simulate(random.randint(0,self.env.action_space.n-1))
            if v > v_max:
                v_max = v
            v_avg += v
        v_avg /= self.number_iterations
        return v_avg, v_max
    

    def evaluate_q(self, action):
        # determine q
        q_avg = 0.0
        q_max = 0.0
        for i in range(self.number_iterations):
            q = self.simulate(action)
            if q > q_max:
                q_max = q
            q_avg += q
        q_avg /= self.number_iterations
        return q_avg, q_max
    
    def best_action(self):
        actions_q = np.zeros(self.env.action_space.n)
        actions_visits = np.zeros(self.env.action_space.n)
        for i in range(self.number_iterations):
            action = random.randint(0,self.env.action_space.n-1)
            actions_q[action] += self.simulate(action)
            actions_visits[action] += 1
        actions_q = np.divide(actions_q, actions_visits, out=np.zeros_like(actions_q), where=actions_visits!=0)
        return np.argmax(actions_q)
    
    def simulate(self, action):
        self.env.reset()
        self.env.unwrapped.s = self.state
        done = False
        depth = 0
        g = 0
        state, r, done, _ = self.env.step(action)
        g += r*self.discount_factor**depth
        depth +=1
        while not done:
            action = random.randint(0,self.env.action_space.n-1)
            state, r, done, _ = self.env.step(action)
            g += r*self.discount_factor**depth
            depth +=1
        return g

In [None]:
mc_evaluation = MCE(env,0,discount_factor=0.95, iterations=1000)
print("avg V(s):\t {0:.3f}, max V(s):\t {1:.3f}".format(*mc_evaluation.evaluate_v()))

for key, val in actions.items():
    print("avg Q(s,{2}):\t {0:.3f}, max Q(s,{2}):\t {1:.3f}".format(*mc_evaluation.evaluate_q(key), val))

## Monte Carlo Tree Search
* Simulate trajectories through the MDP from the current state $s_t$ building a tree
* Apply model-free RL to simulated episodes

### In-Tree and Out-of-Tree
* Selection Policy (improves): select actions maximizing action values
* Simulation Policy (fixed): select actions randomly

### Balance Exploration and Exploitation

### $UCT(s,a) = \hat{Q}(s,a)+c\sqrt{\frac{\ln{N(s)}}{N(s,a)}}$
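A small worked example shows how the exploration term shifts the choice. The sketch below evaluates the formula for three actions with hypothetical statistics; it returns infinity for unvisited actions, which is one possible way to force them to be tried first.

```python
import math


def uct(q_value, parent_visits, child_visits, c=5.0):
    """UCT score of an action; unvisited actions score +inf so they are tried first."""
    if child_visits == 0:
        return math.inf
    return q_value + c * math.sqrt(math.log(parent_visits) / child_visits)


# Hypothetical statistics (Q estimate, visit count) at a node visited 100 times
children = {"left": (0.40, 60), "down": (0.35, 30), "right": (0.10, 10)}

scores = {a: uct(q, 100, n) for a, (q, n) in children.items()}
best = max(scores, key=scores.get)

# Without the exploration bonus (c = 0) the choice is purely greedy
greedy = max(children, key=lambda a: uct(children[a][0], 100, children[a][1], c=0.0))

print(best, greedy)
```

With `c = 5` the rarely visited action wins despite its low value estimate, whereas `c = 0` picks the action with the highest estimate. A large `c` therefore favours exploration, a small `c` exploitation.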

### Phases
* Selection
* Expansion
* Simulation
* Update
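In the update phase, each node on the path keeps a running average of the returns that passed through it. The sketch below isolates that incremental mean for a single node (ignoring discounting along the path) and checks it against the ordinary average; the function name and the returns are hypothetical.

```python
def running_mean_update(mean, visits, g):
    """Fold one new return g into a mean over `visits` previous returns."""
    visits += 1
    mean = (mean * (visits - 1) + g) / visits
    return mean, visits


# Hypothetical returns from four simulations through the same node
returns = [0.0, 1.0, 0.0, 0.5]

mean, visits = 0.0, 0
for g in returns:
    mean, visits = running_mean_update(mean, visits, g)

# The incremental result matches the batch average
assert abs(mean - sum(returns) / len(returns)) < 1e-12
print(mean, visits)
```

The incremental form avoids storing every return in the node; only the current mean and the visit count are needed.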

In [None]:
class Node:
    def __init__(self, state=0, action=-1, done=False, parent=None):
        # current state of the environment
        self.state = state
        # number of trajectories that passed through this node
        self.visits = 0
        # average v value that results from starting in this node
        self.v_value = 0
        # action that led to this node
        self.action = action
        # untried actions (i.e. the actions that have not been explored)
        self.untried_actions = [0, 1, 2, 3]
        # parent node pointer
        self.parent = parent
        # children node pointers
        self.children = []
        # flag that indicates that the node is terminal (e.g. the environment is in a terminal state)
        self.done = done
        
    def uct(self, c = 5):
        """Calculate the UCT value for a given child node (i.e. the value from executing a in s)"""
        # if the node has not been visited return a high UCT score, forcing expansion
        if self.visits == 0:
            return 1000
        # if the node has been visited calculate it using the UCB formula
        return self.v_value + c* math.sqrt(math.log(self.parent.visits)/self.visits)
    
    def best_child(self):
        """Return the best child based on the maximum UCT value."""
        uct_values = []
        for child in self.children:
            uct_values.append(child.uct())
        uct_index = np.argmax(uct_values)
        return self.children[uct_index]
    
    def max_action_value(self):
        """Return the child with the highest action value."""
        v_values = []
        for child in self.children:
            v_values.append(child.v_value)
        v_values_index = np.argmax(v_values)
        return self.children[v_values_index]
    
    def max_visits(self):
        """Return the child with the highest visit count."""
        visits = []
        for child in self.children:
            visits.append(child.visits)
        visits_index = np.argmax(visits)
        return self.children[visits_index]
    
    def str(self):
        if not self.parent:
            return "s:{}, N(s):{}, \ta: {}, \tQ(s, a):{:.3f}, parent:{}".format(self.state, self.visits, "none", self.v_value, self.parent)
        else:
            return "s:{}, N(s):{}, \ta: {}, \tQ(s, a):{:.3f}, parent:{}".format(self.state, self.visits, actions[self.action], self.v_value, self.parent)

In [None]:
class MCTS:
    def __init__(self, env, state = 0, iterations = 1000, discount_factor = 0.95):
        # maximum number of simulations
        self.number_iterations = iterations
        # discount factor for future rewards
        self.discount_factor = discount_factor
        # environment
        self.env = env
        # initial state
        self.state = state
        self.env.unwrapped.s = self.state
        #visualize_env(self.env)
    
    def select(self, node):
        # if the node has no untried actions left, choose the best child using UCB1
        while len(node.untried_actions) == 0:
            node = node.best_child()
        return node
    
    def expand(self, node):
        # expand the node with a random action
        if not node.done:
            action = np.random.choice(node.untried_actions)
            node.untried_actions.remove(action)

            self.env.reset()
            self.env.unwrapped.s = node.state
            state, r, done, _ = self.env.step(action)
            child = Node(state, action, done, node)
            node.children.append(child)
            return child, r
        else:
            self.env.reset()
            self.env.unwrapped.s = node.parent.state
            state, r, done, _ = self.env.step(node.action)
            return node, r
    
    def simulate(self, node):
        """Monte Carlo Evaluator"""
        self.env.reset()
        self.env.unwrapped.s = node.state
        done = False
        depth = 0
        g = 0
        action = random.randint(0,self.env.action_space.n-1)
        state, r, done, _ = self.env.step(action)
        g += r*self.discount_factor**depth
        depth +=1
        while not done:
            action = random.randint(0,self.env.action_space.n-1)
            state, r, done, _ = self.env.step(action)
            g += r*self.discount_factor**depth
            depth +=1
        return g
       
    def update(self,node,g):
        depth = 0
        while node.parent:
            node.visits += 1
            node.v_value = (node.v_value*(node.visits-1)+g*self.discount_factor**depth)/node.visits
            node = node.parent
            depth += 1
        node.visits += 1
        node.v_value = (node.v_value*(node.visits-1)+g*self.discount_factor**depth)/node.visits
            
    def best_action(self, root):
        for i in range(self.number_iterations):
            self.env.reset()
            self.env.unwrapped.s = root.state
            node = self.select(root)
            child, r = self.expand(node)
            if not child.done:
                g = self.simulate(child)
            else:
                g = r
            self.update(child, g)
        return root.max_action_value().action

In [None]:
env = gym.make(env_name)
sim = gym.make(env_name)
env.reset()
sim.reset()
# set initial state
state = 0
env.unwrapped.s = state
mcts = MCTS(sim, state, iterations = 1000)

root_node = Node(state)
action = mcts.best_action(root_node)
print(root_node.str())

print(root_node.children[0].str())
print(root_node.children[1].str())
print(root_node.children[2].str())
print(root_node.children[3].str())
print("the best action is action {}, {}".format(action, actions[action]))
print(env.step(action))
visualize_env(env)

In [None]:
def plan_mce(iterations, output = False):
    env = gym.make(env_name)
    sim = gym.make(env_name)
    env.reset()
    sim.reset()
    # set initial state
    state = 0
    # initialize the Monte Carlo Evaluator
    mce = MCE(sim, state, iterations = iterations)
    done = False
    steps = 0
    while not done:
        mce.state = state
        action = mce.best_action()
        steps += 1
        # take one step in the environment
        state, r, done, _ = env.step(action)
    if output:
        visualize_env(env)
        print("reached state: {}, after {}".format(state, steps))
        
    return steps, r

In [None]:
def plan_mcts(iterations, output = False):
    env = gym.make(env_name)
    sim = gym.make(env_name)
    env.reset()
    sim.reset()
    # set initial state
    state = 0
    # initialize the Monte Carlo Tree Search
    mcts = MCTS(sim, state, iterations = iterations)
    done = False
    steps = 0
    while not done:
        root_node = Node(state)
        action = mcts.best_action(root_node)
        steps += 1
        # take one step in the environment
        state, r, done, _ = env.step(action)
    if output:
        visualize_env(env)
        print("reached state: {}, after {}".format(state, steps))
        
    return steps, r

## Comparison of MCE and MCTS
* MCTS requires fewer iterations to reach the goal state
* Due to the uniform action exploration in the plan_mce function, the estimates for all actions are less skewed than they are for MCTS, so MCE reaches the goal more frequently (but more slowly)

In [None]:
# specify evaluation points
iterations = [20 , 50, 100, 200, 500]
# specify number of runs for each evaluation point
runs = 50

# initialize empty metrics
avg_mce_steps = []
avg_mcts_steps = []
avg_mce_r = []
avg_mcts_r = []

for it in iterations:
    # reset counters
    mce_steps = 0
    mcts_steps = 0
    mce_rs = 0
    mcts_rs = 0
    for i in range(runs):
        # solve MDP with MCE
        mce_step, mce_r = plan_mce(it, False)
        # solve MDP with MCTS
        mcts_step, mcts_r = plan_mcts(it, False)
        # increment counters
        mce_steps += mce_step
        mcts_steps += mcts_step
        mce_rs += mce_r
        mcts_rs += mcts_r

    # aggregate values
    avg_mce_steps.append(mce_steps/runs)
    avg_mcts_steps.append(mcts_steps/runs)
    avg_mce_r.append(mce_rs/runs)
    avg_mcts_r.append(mcts_rs/runs)
    
fig, ax = plt.subplots(1, 2)
# Plot the average episode length
ax[0].plot(iterations, avg_mce_steps, color="red", label='MCE')
ax[0].plot(iterations, avg_mcts_steps, color="blue", label='MCTS')
ax[0].set(xlabel='#Simulations', ylabel='Steps', title='Average Episode Length')
ax[0].grid()
ax[0].legend()

# Plot the average episode reward
ax[1].plot(iterations, avg_mce_r, color="red", label='MCE')
ax[1].plot(iterations, avg_mcts_r, color="blue", label='MCTS')
ax[1].set(xlabel='#Simulations', ylabel='Reward', title='Average Episode Reward')
ax[1].grid()
ax[1].legend();

## Q-Learning

![Q-Learning](q_learning.png "Q-Learning")

### Temporal Difference Error
### $\delta_t = \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} - \underbrace{Q(s_{t}, a_{t})}_{\text{estimate of optimal current value}}$

### Temporal Difference Update
### $Q^{new}(s_{t},a_{t}) \leftarrow \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \underbrace{\delta_t}_\text{temporal difference error}$
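The two formulas can be checked by hand on a tiny table. This is a minimal sketch with a hypothetical Q table of two states and two actions, stored as nested lists; the function names mirror the formulas and are not the agent's final implementation.

```python
def td_error(q, s, a, s1, r, discount_factor):
    """delta = r + gamma * max_a' Q(s1, a') - Q(s, a)"""
    return r + discount_factor * max(q[s1]) - q[s][a]


def td_update(q, s, a, s1, r, learning_rate, discount_factor):
    """Q(s, a) <- Q(s, a) + alpha * delta, applied in place."""
    q[s][a] += learning_rate * td_error(q, s, a, s1, r, discount_factor)
    return q


# Hypothetical Q table: rows are states, columns are actions
q = [[0.0, 0.0],
     [0.5, 1.0]]

delta = td_error(q, s=0, a=1, s1=1, r=0.0, discount_factor=0.95)
td_update(q, s=0, a=1, s1=1, r=0.0, learning_rate=0.1, discount_factor=0.95)

# delta = 0 + 0.95 * 1.0 - 0 = 0.95, so Q(0, 1) moves from 0 to 0.1 * 0.95
print(delta, q[0][1])
```

Even though the transition itself yields no reward, the value of the successor state propagates backwards through the update.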

### Transition Tuple
For ease of use we define a transition tuple that allows us to combine all the relevant information from one state to another.

In [None]:
# p = priority (only needed for (prioritized) experience replay)
# s = state
# a = action
# s1 = successor state
# r = reward
# td_e = temporal difference error
Transition = collections.namedtuple('Transition', ('p', 's', 'a', 's1', 'r', 'td_e'))
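A short usage sketch for the `Transition` tuple, with hypothetical values. It shows that fields can be read by name or position and that tuples are compared field by field, starting with `p`, which is what lets a heap order transitions by priority.

```python
import collections
import heapq

Transition = collections.namedtuple('Transition', ('p', 's', 'a', 's1', 'r', 'td_e'))

# Hypothetical transitions; p is set to the negative TD error
t_small = Transition(p=-0.1, s=0, a=2, s1=1, r=0.0, td_e=0.1)
t_large = Transition(p=-0.8, s=1, a=1, s1=5, r=0.0, td_e=0.8)

# Fields are accessible by name and by position
assert t_small.s1 == t_small[3]

# Tuples compare field by field, so the heap keeps the lowest p on top
heap = []
heapq.heappush(heap, t_small)
heapq.heappush(heap, t_large)
print(heap[0].td_e)
```

Because `p` is the negative TD error, the transition with the largest error ends up at the top of the heap.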

### Replay Memory and Prioritized Experience Replay (optional)

Experience Replay and prioritization of specific experiences are common techniques to make the training more data efficient.

* [Paper - Experience Replay, 1992](https://link.springer.com/content/pdf/10.1007%2FBF00992699.pdf)
* [Paper - Prioritized Experience Replay, 2015](https://arxiv.org/abs/1511.05952)
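The idea of prioritization can be sketched with `heapq` alone. The example below is a simplification under stated assumptions: the priority is the negative absolute TD error and the batch is simply the top entries, whereas the paper samples stochastically. The transition names and errors are hypothetical.

```python
import heapq

# Each entry is (priority, transition name); priority = -|TD error|,
# so transitions with a large error sort first
memory = []
for name, td_e in [("t0", 0.1), ("t1", -0.9), ("t2", 0.4), ("t3", 0.0)]:
    heapq.heappush(memory, (-abs(td_e), name))

# Replay the two transitions with the largest absolute TD error
batch = heapq.nsmallest(2, memory)
print([name for _, name in batch])
```

Replaying the transitions with the largest error first concentrates the updates on the parts of the Q table that are currently most wrong.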

In [None]:
class ReplayMemory():
    def __init__(self, config):
        # transitions memory
        self.transitions = []
        # size of the memory
        self.memory_size = config.memory_size
        # size of the batches
        self.batch_size = config.batch_size
        # flag for prioritized experience replay
        self.prioritized = config.prioritized
        
    def push(self, transition):
        # if the memory is full, drop the last heap entry first; it is a leaf,
        # so the heap invariant still holds (note: it is not necessarily the
        # entry with the lowest priority)
        if len(self.transitions) >= self.memory_size:
            del self.transitions[-1]
        heapq.heappush(self.transitions, transition)
    
    def replay(self, batch_size):
        if self.prioritized:
            # the transitions with the smallest p, i.e. the highest priority
            return heapq.nsmallest(batch_size, self.transitions)
        else:
            # uniform sampling without replacement
            return random.sample(self.transitions, batch_size)
    
    def __len__(self):
        return len(self.transitions)    

### The Q Agent
So far we have only defined simple functions such as
```python
def function_name(arg1, arg2):
    # compute something with arg1 and arg2 and return something
    if arg2 > 0:
        something = other_function(arg1) - arg2
    else:
        something = arg1
    return something
```
However, for more complex tasks it is advisable to write object-oriented code using classes. Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.



Hence we create a class **QAgent** that incorporates all the methods needed for Q-Learning.

```python
class QAgent:
    
    def __init__(self): # constructor method that gets called when the object is being created
        ...
        
    def td_error(self): # Temporal Difference Error
        ...
        
    def td_update(self): # Temporal Difference Update
        ...
    
    def train(self, env): # Train the agent
        ...
```

**TASK :**
Add the missing formulas for the TD-error and the TD-update.

In [None]:
class QAgent:
    def __init__(self, config):
        # Maximum length of training
        self.training_length = config.training_length
        # Maximum length of an episode 
        self.episode_length = config.episode_length
        # TD error update step size
        self.learning_rate = config.learning_rate
        # Discount factor for future rewards
        self.discount_factor = config.discount_factor
        # Enabling experience replay
        self.replay_memory_enabled = True if config.config_replay_memory else False
        # Initialize the replay memory of the agent
        if self.replay_memory_enabled:
            self.replay_memory = ReplayMemory(config.config_replay_memory)
        
    def td_error(self, q, s, a, s1, r):
        # TASK: return the TD-Error
        # Calculates the temporal difference error given the current model and transition
        td_e = r + self.discount_factor*np.max(q[s1, :]) - q[s, a]
        return td_e
    
    def td_update(self, q, t):
        # TASK: return the update for the q value
        # Calculates the adjusted action value (q) given the td error from a single transition
        q = q + self.learning_rate * t.td_e
        return q

    def td_replay(self, q, q_target):
        # Use the replay memory to run additional updates
        if len(self.replay_memory) >= self.replay_memory.batch_size:
            for t in self.replay_memory.replay(self.replay_memory.batch_size):
                # Recalculate the temporal difference error for this transition
                td_e = self.td_error(q_target, t.s, t.a, t.s1, t.r)
                # Create an updated transition tuple
                updated_t = Transition(-td_e, t.s, t.a, t.s1, t.r, td_e)
                # Save the transition in replay memory
                self.replay_memory.push(updated_t)
                # Update model / q table
                q[t.s, t.a] = self.td_update(q_target[t.s, t.a], updated_t)
        return q
    
    def epsilon_greedy_noise(self, env, s, episode):
        # Greedy action on Q values perturbed by Gaussian noise that decays with the episode
        epsilon = np.random.randn(1, env.action_space.n)*(1./(episode+1))
        a = np.argmax(self.q_target[s, :] + epsilon)
        return a, epsilon
    
    def epsilon_greedy_linear(self, env, s, episode):
        # Random action with probability epsilon, which decays linearly over the training
        epsilon = (1-(episode+1)/self.training_length)
        if epsilon > np.random.rand():
            a = np.random.randint(env.action_space.n)
        else:
            a = np.argmax(self.q_target[s, :])
        return a, epsilon
    
    def train(self, env):
        # Initialize the model / q table with zeros/random
        self.q = np.zeros([env.observation_space.n, env.action_space.n])
        # Create a target model / q table
        self.q_target = self.q
        
        ### METRICS
        # create lists to contain various metrics that should be tracked during the training process
        self.metrics = {
            'return': np.zeros(self.training_length),
            'q_avg': np.zeros(self.training_length),
            'epsilon': np.zeros(self.training_length),
            'td_error': np.zeros(self.training_length)
        }
        
        for episode in range(self.training_length):
            # Reset the environment and retrieve the initial state
            s = env.reset()
            # Set the 'done' flag to false
            d = False
            # Set the step of the episode to 0
            step = 0
            # Start the Q-Learning algorithm
            while step < self.episode_length:
                # Derive action from current policy (epsilon_greedy noise)
                epsilon = np.random.randn(1, env.action_space.n)*(1./(episode+1))
                a = np.argmax(self.q_target[s, :] + epsilon)

                # Execute the action and generate a successor state as well as receive an immediate reward
                s1, r, d, _ = env.step(a)

                # Calculate the temporal difference error
                td_e = self.td_error(self.q_target, s, a, s1, r)

                # Create a transition tuple
                transition = Transition(-(td_e+0.001), s, a, s1, r, td_e)       

                # Save the transition in replay memory
                if self.replay_memory_enabled:
                    self.replay_memory.push(transition)

                # Update model / q table
                self.q[s, a] = self.td_update(self.q_target[s, a], transition)

                # Assign the current state the value of the successor state
                s = s1

                # Increment the step
                step += 1

                ### METRICS
                # Accumulate the episode return (step was already incremented, so the first reward is undiscounted)
                self.metrics['return'][episode] += self.discount_factor**(step-1)*r
                # Track the temporal difference error
                self.metrics['td_error'][episode] += td_e
                # Track the max epsilon values
                self.metrics['epsilon'][episode] += np.max(epsilon)
                # Track the average q values
                self.metrics['q_avg'][episode] = np.average(self.q)


                # If we reached a terminal state or the maximum episode length, finish the episode
                if d or step == self.episode_length:

                    # At the end of the episode update the target model with the current model

                    # If experience replay is enabled replay the experience collected so far
                    if self.replay_memory_enabled:
                        self.q_target = self.td_replay(self.q, self.q_target)
                    else:
                        self.q_target = self.q

                    ### METRICS
                    # epsilon and td_error are accumulated per step, so average them;
                    # q_avg already holds the latest table average
                    self.metrics['epsilon'][episode] /= step
                    self.metrics['td_error'][episode] /= step
                    break

### Configuration Tuple
For ease of use we define a configuration tuple that allows us to combine all the relevant configuration into one object.

In [None]:
ConfigQAgent = collections.namedtuple('ConfigQAgent', ('learning_rate',
                                                       'training_length',
                                                       'episode_length',
                                                       'discount_factor',
                                                       'config_replay_memory')
                                     )

ConfigReplayMemory = collections.namedtuple('ConfigReplayMemory', ('memory_size', 
                                                                   'batch_size', 
                                                                   'prioritized')
                                           )

### Configure and Train the Agent


In [None]:
# Agent 1
config_replay_memory = ConfigReplayMemory(500, 50, False)
config_q_agent = ConfigQAgent(0.1, 400, 50, discount_factor, None)

q_agent = QAgent(config_q_agent)
q_agent.train(env)
policy = compute_policy_from_q(env, q_agent.q_target)     

### Visualize Metrics

In [None]:
print('Average return: {:.2f}'.format(evaluate_policy(env, policy, q_agent.discount_factor, 1000)))
print("Score over time: " + str(sum(q_agent.metrics['return'])/q_agent.training_length))

fig, axi = plt.subplots(1, 2)
visualize_policy(env, policy, axi[0])
visualize_v(env, compute_v_from_q(env, q_agent.q_target),axi[1])

fig, ax = plt.subplots(1, 4)
# Plot the return over time
ax[0].plot(range(q_agent.training_length), q_agent.metrics['return'], ".")
ax[0].set(xlabel='episode', ylabel='reward', title='Return')
ax[0].grid()

# Plot the Q value over time
ax[1].plot(range(q_agent.training_length), q_agent.metrics['q_avg'], ".")
ax[1].set(xlabel='episode', ylabel='Q Value', title='Average Q Value')
ax[1].grid()

# Plot the epsilon over time
ax[2].plot(range(q_agent.training_length), q_agent.metrics['epsilon'], ".")
ax[2].set(xlabel='episode', ylabel='epsilon', title='Epsilon')
ax[2].grid()

# Plot the td error over time
ax[3].plot(range(q_agent.training_length), q_agent.metrics['td_error'], ".")
ax[3].set(xlabel='episode', ylabel='TD Error', title='TD Error')
ax[3].grid()   

### Evaluate different Hyperparameters (optional)

In [None]:
def evaluate_training_episodes(number_evaluation_points, number_evaluations):
    ### Evaluate different numbers of training episodes
    training_lengths = np.linspace(1, 501, number_evaluation_points, dtype = int)
    returns = np.zeros(number_evaluation_points)
    for i in range(number_evaluation_points):
        config_q_agent = ConfigQAgent(0.1, training_lengths[i], 100, discount_factor, None)
        for j in range(number_evaluations):
            q_agent = QAgent(config_q_agent)
            q_agent.train(env)
            returns[i] += np.max(q_agent.metrics['return'])
        returns[i] /= number_evaluations

    fig, ax = plt.subplots()
    ax.set(xlabel='#Episodes', ylabel='Max. Return', title='#Episodes vs. Return')
    ax.plot(training_lengths, returns, '-o');
    
evaluate_training_episodes(10, 10)
