# Reinforcement Learning

![Reinforcement learning diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/300px-Reinforcement_learning_diagram.svg.png)

## Introduction

Reinforcement learning is a form of machine learning in which an agent interacts with an environment, observes the effects of its actions, and collects rewards.

The goal of reinforcement learning is to learn an optimal policy, so that given a state an agent is able to decide what it should do next.
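
The interaction loop described above can be sketched in a few lines. This is only an illustration: `CorridorEnv`, `run_episode` and `always_right` are hypothetical names that do not appear in the rest of the lab, and the environment is a made-up corridor rather than the frozen lake.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return the collected (undiscounted) reward and the step count."""
    state = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(state)                  # the policy maps a state to an action
        state, reward, done = env.step(action)  # the environment answers with an observation and a reward
        total_reward += reward
        if done:
            return total_reward, step
    return total_reward, max_steps


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return 1


total, steps = run_episode(CorridorEnv(length=5), always_right)
print(total, steps)
```

Here the policy is hand-written; the algorithms in this lab compute such a state-to-action mapping automatically.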

In this exercise we will look at one of the fundamental algorithms capable of solving Markov decision processes (MDPs), namely [Policy Iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Policy_iteration).

By the time you complete this lab, you should know:

- The relevant pieces for a reinforcement learning system
- The basics of *[gym](https://gym.openai.com/envs/#classic_control)* to conduct your own RL experiments
- Why Policy Iteration can be slower than Value Iteration
- How value and policy iteration differ from Q-Learning
- How Q-Learning converges towards a stable policy
    - Some optional extensions to Q-Learning

## MDP

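A Markov decision process (MDP) is a 4-tuple $(S,A,P_{a},R_{a})$, where $S$ is the set of states, $A$ is the set of actions, $P_{a}(s,s')$ is the probability that taking action $a$ in state $s$ leads to state $s'$, and $R_{a}(s,s')$ is the immediate reward received for that transition.

![MDP](mdp.png "MDP")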
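
As a rough illustration, a small MDP can be written down as plain Python data. The two-state MDP below is hypothetical. Its nested layout, `P[s][a] -> [(probability, next_state, reward, done), ...]`, mirrors the `env.env.P` structure that the code later in this lab iterates over. The helper names `check_mdp` and `expected_reward` are made up for this sketch.

```python
# Hypothetical two-state MDP: P[s][a] -> [(probability, next_state, reward, done), ...]
P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                         # action 0: stay in state 0
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],    # action 1: reach state 1 with p = 0.8
    },
    1: {
        0: [(1.0, 1, 0.0, True)],                          # state 1 is terminal and absorbs
        1: [(1.0, 1, 0.0, True)],
    },
}


def check_mdp(P):
    """Verify that the outgoing probabilities of every (state, action) pair sum to 1."""
    for s, by_action in P.items():
        for a, transitions in by_action.items():
            total = sum(p for p, _, _, _ in transitions)
            assert abs(total - 1.0) < 1e-9, (s, a, total)
    return True


def expected_reward(P, s, a):
    """Expected immediate reward for taking action a in state s."""
    return sum(p * r for p, _, r, _ in P[s][a])


print(check_mdp(P), expected_reward(P, 0, 1))
```

The sets $S$ and $A$ are the dictionary keys, $P_{a}$ is the first entry of each tuple, and $R_{a}$ is the third.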

## Problem

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. (However, the ice is slippery, so you won't always move in the direction you intend.)

## Setup

To begin, we'll need to install the required Python packages.



In [None]:
#!pip install --quiet gym stabel

### Imports and Helper Functions

#### Imports

In [None]:
# Python imports
import random
import heapq
import collections

# Reinforcement Learning environments
import gym
# Scientific computing
import numpy as np
# Plotting library
import matplotlib.pyplot as plt
import matplotlib.cm as cm


#### Helper Functions

In [None]:
# Define the default figure size
plt.rcParams['figure.figsize'] = [16, 4]

def create_numerical_map(env):
    """Convert the string map of the environment to a numerical version"""
    # Numeric code per tile type: start, goal, frozen surface, hole
    tile_codes = {'S': 0, 'G': 1, 'F': 2, 'H': 3}
    numerical_map = np.zeros(env.env.desc.shape)
    for i, row in enumerate(env.env.desc):
        for j, col in enumerate(row):
            # Unknown tiles keep the default value 0
            numerical_map[i, j] = tile_codes.get(col.decode('UTF-8'), 0)
    return numerical_map


def visualize_env(env):
    """Plot the environment"""
    fig, ax = plt.subplots()
    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title('The frozen Lake')
    i = ax.imshow(create_numerical_map(env), cmap=cm.jet)
    plt.show()
    print('the start is blue, holes are red, ice is yellow and the goal is teal')

def visualize_policy(env, policy, ax=None, title=None):
    """Plot the policy in the environment"""
    if ax is None:
        ax = plt.gca()
    font_size = 10 if env.observation_space.n > 16 else 20
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1] + j  # row-major state index: row * number of columns + column
            if policy[s] == 0:
                ax.annotate("L", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 1:
                ax.annotate("D", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 2:
                ax.annotate("R", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            elif policy[s] == 3:
                ax.annotate("U", xy=(j, i), xytext=(j, i), ha="center",
                            va="center", size=font_size, color="white")
            j += 1
        i += 1

    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    if title is None:
        ax.set_title('Policy for the Frozen Lake')
    else:
        ax.set_title(title)
    ax.imshow(create_numerical_map(env), cmap=cm.jet)
    return


def visualize_v(env, v, ax=None, title=None):
    """Plot value function values in the environment"""
    if ax is None:
        ax = plt.gca()
    font_size = 10 if env.observation_space.n > 16 else 20
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1] + j  # row-major state index: row * number of columns + column
            ax.annotate("{:.2f}".format(v[s]), xy=(j, i), xytext=(j, i), ha="center",
                        va="center", size=font_size, color="white")
            j += 1
        i += 1

    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    if title is None:
        ax.set_title('State Value Function for the Frozen Lake')
    else:
        ax.set_title(title)
    ax.imshow(create_numerical_map(env), cmap=cm.jet)
    return


def compute_v_from_q(env, q):
    """Compute the v function given the q function, maximizing over the actions of a given state."""
    v = np.zeros(env.observation_space.n)
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1] + j  # row-major state index: row * number of columns + column
            v[s] = np.max(q[s, :])
            j += 1
        i += 1
    return v

def compute_policy_from_q(env, q):
    """Compute the policy function given the q function, finding the action that yields the maximum of a given state."""
    policy = np.zeros(env.observation_space.n)
    i = 0
    for row in env.env.desc:
        j = 0
        for col in row:
            s = i * env.env.desc.shape[1] + j  # row-major state index: row * number of columns + column
            policy[s] = np.argmax(q[s, :])
            j += 1
        i += 1
    return policy

#### Frozen Lake Environments

![Frozen Lake](frozen_lake.gif "Frozen Lake")

In [None]:
# Register variants of the frozen lake without execution uncertainty, i.e. deterministic environments
from gym.envs.registration import register

register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78,  # optimum = .8196
)

register(
    id='FrozenLakeNotSlippery8x8-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '8x8', 'is_slippery': False},
    max_episode_steps=200,
    reward_threshold=0.99,  # optimum = 1
)

#### Policy Evaluation

In [None]:
def evaluate_episode(env, policy, discount_factor):
    """Evaluate a policy by running it until termination and collecting its discounted return"""
    state = env.reset()
    total_return = 0
    step = 0
    while True:
        state, reward, done, _ = env.step(int(policy[state]))
        # Accumulate the discounted return
        total_return += (discount_factor ** step * reward)
        step += 1
        if done:
            break
    return total_return


def evaluate_policy(env, policy, discount_factor=0.95, number_episodes=1000):
    """Evaluate a policy by averaging its discounted return over number_episodes episodes"""
    return np.mean([evaluate_episode(env, policy, discount_factor) for _ in range(number_episodes)])
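
To make the quantity accumulated by `evaluate_episode` concrete, here is a minimal sketch of the discounted return of a single reward sequence. The function name `discounted_return` is hypothetical, and the reward sequence assumes gym's default 4x4 map, where the shortest path from start to goal takes six moves.

```python
def discounted_return(rewards, discount_factor):
    """Sum of discount_factor**t * reward_t over one episode."""
    return sum(discount_factor ** t * r for t, r in enumerate(rewards))


# Six moves along the shortest path; only the last one is rewarded
rewards = [0, 0, 0, 0, 0, 1]
print(discounted_return(rewards, 0.95))
```

With a discount factor of 0.95 this evaluates to $0.95^5 \approx 0.774$, so even a perfect policy on the deterministic 4x4 lake has an average discounted return below 1.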

#### Policy and Value Iteration Parameters

In [None]:
# Set parameters
max_iterations = 1000
discount_factor = 0.95

### Environment

In [None]:
# Deterministic environments
env_name = 'FrozenLakeNotSlippery-v0'
#env_name = 'FrozenLakeNotSlippery8x8-v0'

# Stochastic environments
#env_name = 'FrozenLake-v0'
#env_name = 'FrozenLake8x8-v0'

Create the environment with the previously selected name.

In [None]:
env = gym.make(env_name)
env.render()

S : starting point, safe  
F : frozen surface, safe  
H : hole, fall to your doom  
G : goal, where the frisbee is located  

In [None]:
print('Generated the frozen lake with config: ' + env_name)
visualize_env(env)

#### Understanding the Environment (Object)

**TASK :**
Analyze the environment object and figure out its *observation space* and *action space* as well as its *reward range*.

What is the size of the observation space?

In [None]:
env.observation_space

What is the size of the action space?

In [None]:
env.action_space

What is the range of rewards?

In [None]:
env.reward_range

The episode ends when you reach the goal or fall in a hole.  
You receive a reward of 1 if you reach the goal, and zero otherwise.
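
The reward and termination rule can be sketched independently of gym. The layout below is assumed to be gym's default 4x4 map; check `env.env.desc` for the map actually in use. `LAKE` and `tile_outcome` are hypothetical names for this illustration.

```python
# Assumed layout of the default 4x4 map (S = start, F = frozen, H = hole, G = goal)
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]


def tile_outcome(state, lake=LAKE):
    """Return (reward, done) for arriving at the given state index."""
    n_cols = len(lake[0])
    row, col = divmod(state, n_cols)   # states are numbered row by row
    tile = lake[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "GH"                # both the goal and the holes end the episode
    return reward, done


print(tile_outcome(0), tile_outcome(5), tile_outcome(15))
```

State 0 is the start (no reward, episode continues), state 5 is a hole (no reward, episode ends), and state 15 is the goal (reward 1, episode ends).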

### Uncertainty in Execution

In [None]:
s = env.reset()
print("the initial state is: {}".format(s))
env.render()

# The agent should go down
print("executing action 1, should go down")
s1, r, d, _ = env.step(1)
print("new state is: {} done: {}".format(s1, d))
env.render()

# The agent should go up
print("executing action 3, should go up")
s1, r, d, _ = env.step(3)
print("new state is: {} done: {}".format(s1, d))
env.render()

# The agent should go right
print("executing action 2, should go right")
s1, r, d, _ = env.step(2)
print("new state is: {} done: {}".format(s1, d))
env.render()

# The agent should go left
print("executing action 0, should go left")
s1, r, d, _ = env.step(0)
print("new state is: {} done: {}".format(s1, d))
env.render()
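
In the deterministic environments every action moves the agent as intended. The sketch below illustrates what changes in the slippery variants. It assumes the model used by gym's FrozenLake, in which the intended direction and the two perpendicular directions are each executed with probability 1/3. For simplicity it ignores holes and the goal. The helper names `move` and `slippery_outcomes` are hypothetical.

```python
# Action encoding used throughout this lab
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, direction, n_rows=4, n_cols=4):
    """Deterministic move on the grid; walking into the border leaves the state unchanged."""
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    return row * n_cols + col


def slippery_outcomes(state, action):
    """Map each reachable next state to its probability for one intended action."""
    outcomes = {}
    # Assumption: intended direction plus its two perpendicular neighbours, 1/3 each
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        next_state = move(state, direction)
        outcomes[next_state] = outcomes.get(next_state, 0.0) + 1.0 / 3.0
    return outcomes


print(slippery_outcomes(0, DOWN))
print(slippery_outcomes(0, LEFT))
```

From the start state, intending to go down can also slide the agent right or leave it in place, which is why repeating the cell above on a stochastic environment gives varying results.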

## Policy Evaluation

In [None]:
def policy_evaluation(policy, env, discount_factor, mode):
    """ Iteratively evaluate the value function under the given policy"""
    # Initialize the state value function
    v = np.zeros(env.observation_space.n)
    iteration = 0
    while True:
        iteration += 1
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            if mode == "policy":
                v[s] = evaluate_action(s, v, prev_v, policy, env, discount_factor)
            elif mode == "optimal_policy":
                v[s] = evaluate_max_action(s, v, prev_v, env, discount_factor)
        if (np.sum((np.fabs(prev_v - v))) <= 1e-4):
            break
    return v, iteration

## Policy $\pi$

In [None]:
def evaluate_action(s, v, prev_v, policy, env, discount_factor):
    # Retrieve the action under the current policy
    a = policy[s]
    expected_reward = 0
    expected_discounted_return = 0
    # Calculate the expected reward and the expected discounted return | p = probability
    for p, s1, r, _ in env.env.P[s][a]:
        ### TASK: define the expected_reward and the expected_discounted_return
        expected_reward += 
        expected_discounted_return += 
    # Calculate the V-Value
    return expected_reward + expected_discounted_return

Solution: 

```python
expected_reward += p*r
expected_discounted_return += discount_factor*p*prev_v[s1]
```
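
As a worked example of this backup, the sketch below applies the same two sums to a single, hypothetical list of transitions. The function name `backup` and the numbers are made up for illustration; the tuple layout `(probability, next_state, reward, done)` follows `env.env.P`.

```python
def backup(transitions, prev_v, discount_factor):
    """One Bellman expectation backup for a single (state, action) pair."""
    expected_reward = sum(p * r for p, _, r, _ in transitions)
    expected_discounted_return = sum(
        discount_factor * p * prev_v[s1] for p, s1, _, _ in transitions
    )
    return expected_reward + expected_discounted_return


# Hypothetical transitions: three equally likely successors, one of them rewarded
transitions = [(1 / 3, 0, 0.0, False), (1 / 3, 1, 0.0, False), (1 / 3, 2, 1.0, True)]
prev_v = [0.0, 0.5, 0.0]
print(backup(transitions, prev_v, 0.95))
```

The result is $\tfrac{1}{3} \cdot 1 + 0.95 \cdot \tfrac{1}{3} \cdot 0.5 \approx 0.49$: the first term is the expected reward, the second the discounted value of the successor states.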

## Optimal Policy $\pi^*$

In [None]:
def evaluate_max_action(s, v, prev_v, env, discount_factor):
    # Initialize the action value function
    q = np.zeros([env.observation_space.n, env.action_space.n])
    # Iterate over each action
    for a in range(env.action_space.n):
        expected_reward = 0
        expected_discounted_return = 0
        # Calculate the expected reward and the expected discounted return | p = probability
        for p, s1, r, _ in env.env.P[s][a]:
            ### TASK: define the expected_reward and the expected_discounted_return
            expected_reward += 
            expected_discounted_return += 
        # Calculate the Q-Value
        q[s, a] = expected_reward + expected_discounted_return
    ### TASK: define the value function with respect to q
    # Choose the max Q-Value over all actions
    return np.max(q[s, :])

Solution: 

```python
expected_reward += p*r
expected_discounted_return += discount_factor*p*prev_v[s1]
```
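
Repeating the max backup until the values stop changing is what `policy_evaluation` does in `"optimal_policy"` mode. A minimal pure-Python sketch on a hypothetical two-state MDP follows; the names `q_value` and `max_backup`, the MDP itself and the fixed number of sweeps are assumptions made for this illustration.

```python
def q_value(P, s, a, prev_v, discount_factor):
    """Expected reward plus discounted value of the successor states."""
    return sum(p * (r + discount_factor * prev_v[s1]) for p, s1, r, _ in P[s][a])


def max_backup(P, s, prev_v, discount_factor):
    """Bellman optimality backup: the best Q-value over all actions in state s."""
    return max(q_value(P, s, a, prev_v, discount_factor) for a in P[s])


# Hypothetical MDP: in state 0, action 0 stays put, action 1 reaches the goal with p = 0.8
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

v = [0.0, 0.0]
for sweep in range(200):
    # Synchronous sweep: every state is updated from the previous value estimates
    v = [max_backup(P, s, v, 0.95) for s in P]
print(v)
```

For state 0 the values approach the fixed point of $v = 0.8 + 0.2 \cdot 0.95 \cdot v$, i.e. $0.8 / 0.81 \approx 0.988$, while the terminal state keeps the value 0.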

## Policy Improvement

In [None]:
def policy_improvement(v, policy, env, discount_factor):
    """ Improve the policy given a value-function """
    # Initialize the policy
    policy = np.zeros(env.observation_space.n)
    # Initialize the action value function
    q = np.zeros([env.observation_space.n, env.action_space.n])
    for s in range(env.observation_space.n):
        for a in range(env.action_space.n):
            q[s,a] = np.sum([p * (r + discount_factor * v[s1]) for p, s1, r, _ in  env.env.P[s][a]])
        policy[s] = np.argmax(q[s,:])
    return policy
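
The same greedy step can be sketched without NumPy or gym. The three-state chain and the name `greedy_policy` below are hypothetical; the transition layout again follows `env.env.P`.

```python
def greedy_policy(P, v, discount_factor):
    """Pick, for every state, the action with the highest one-step lookahead value."""
    policy = {}
    for s in P:
        q = {a: sum(p * (r + discount_factor * v[s1]) for p, s1, r, _ in P[s][a])
             for a in P[s]}
        policy[s] = max(q, key=q.get)   # ties go to the first action, like np.argmax
    return policy


# Hypothetical chain: state 2 is the goal, action 1 moves right, action 0 stays
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

print(greedy_policy(P, [0.0, 0.0, 0.0], 0.95))
print(greedy_policy(P, [0.0, 1.0, 0.0], 0.95))
```

With an all-zero value function only state 1 sees the reward directly, so state 0 still prefers action 0. Once state 1 has a positive value, the greedy policy in state 0 switches to moving right. This is why improvement has to alternate with evaluation.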

## Policy Iteration
![Policy Iteration](policy_iteration.png "Policy Iteration")

### Algorithm

**TASK :**
Add the missing steps for the policy iteration algorithm.

In [None]:
def policy_iteration(env, discount_factor, max_iterations):
    """ Policy-Iteration algorithm """
    # Initialize the policy
    policy = np.zeros(env.observation_space.n)
    for i in range(max_iterations):
        # TASK: evaluate the current policy
        v, iteration = 
        # TASK: define the new policy
        new_policy = 
        if (np.all(policy == new_policy)):
            print('Policy-Iteration converged at iteration #{:d}'.format(i))
            break
        # Plot the current policy
        title_p = 'Policy Improvement #{:d}'.format((i+1))
        title_v = '#Policy Evaluations {:d}'.format(iteration)
        fig, ax = plt.subplots(1,2)
        visualize_v(env, v, ax[0], title_v)
        visualize_policy(env, new_policy, ax[1], title_p)
        policy = new_policy
    return policy, v

Solution: 

```python
# TASK: evaluate the current policy
v, iteration = policy_evaluation(policy, env, discount_factor, "policy")
# TASK: define the new policy
new_policy = policy_improvement(v, policy, env, discount_factor)
```
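
For comparison, here is a compact, self-contained sketch of the whole loop on a hypothetical three-state chain. It is not the lab's implementation: it uses plain dictionaries instead of NumPy arrays and omits the plotting, and the names `evaluate`, `improve` and `policy_iteration_sketch` are made up for this illustration.

```python
def evaluate(P, policy, discount_factor, tol=1e-8):
    """Iterative policy evaluation for a fixed policy."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {s: sum(p * (r + discount_factor * v[s1])
                        for p, s1, r, _ in P[s][policy[s]])
                 for s in P}
        if sum(abs(new_v[s] - v[s]) for s in P) <= tol:
            return new_v
        v = new_v


def improve(P, v, discount_factor):
    """Greedy policy with respect to the value function v."""
    policy = {}
    for s in P:
        q = {a: sum(p * (r + discount_factor * v[s1]) for p, s1, r, _ in P[s][a])
             for a in P[s]}
        policy[s] = max(q, key=q.get)
    return policy


def policy_iteration_sketch(P, discount_factor, max_iterations=100):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: 0 for s in P}   # start with action 0 everywhere
    v = {s: 0.0 for s in P}
    for i in range(max_iterations):
        v = evaluate(P, policy, discount_factor)
        new_policy = improve(P, v, discount_factor)
        if new_policy == policy:
            return policy, v, i
        policy = new_policy
    return policy, v, max_iterations


# Hypothetical chain: state 2 is the goal, action 1 moves right, action 0 stays
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

policy, v, iterations = policy_iteration_sketch(P, 0.95)
print(policy, v, iterations)
```

Each improvement step propagates the goal reward one state further back along the chain, so the policy stabilises after two improvements.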

Run the algorithm and evaluate the result.

In [None]:
# Determine the optimal value function and policy given the model of the environment
policy_opt, v_opt = policy_iteration(env, discount_factor, max_iterations)

num_episodes = 100

# Evaluate the found value function and policy given the model of the environment
policy_return = evaluate_policy(env, policy_opt, discount_factor, num_episodes)
print('Average return of the policy: {:.2f}'.format(policy_return))