# Reinforcement Learning &ndash; Monte Carlo Tree Search

![Reinforcement learning diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/300px-Reinforcement_learning_diagram.svg.png)

## Introduction

Reinforcement learning is a form of machine learning in which an agent interacts with an environment, observes the effects of its actions, and collects rewards.

The goal of reinforcement learning is to learn an optimal policy, so that, given a state, the agent can decide what to do next.

In this exercise we will look into two fundamental algorithms that are capable of solving Markov decision processes (MDPs), namely [Monte Carlo Tree Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) and [Q-Learning](https://en.wikipedia.org/wiki/Q-learning) (optional).

## Objectives

By the time you complete this lab, you should know:

- The relevant pieces for a reinforcement learning system
- The basics of *[gym](https://gym.openai.com/envs/#classic_control)* to conduct your own RL experiments
- How Monte Carlo evaluation works
- How Monte Carlo Tree Search works
- The advantages of MCTS vs. MC evaluation

## MDP

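A Markov decision process is a 4-tuple $(S,A,P_{a},R_{a})$, where $S$ is the set of states, $A$ is the set of actions, $P_{a}(s,s')$ is the probability that action $a$ in state $s$ leads to state $s'$, and $R_{a}(s,s')$ is the immediate reward received for that transition.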

![MDP](mdp.png "MDP")
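
To make the tuple more concrete, here is a minimal sketch of a two-state MDP written as plain Python dictionaries. The state names, action names, probabilities and rewards are invented for illustration and are not taken from the frozen lake; `sample_transition` is a hypothetical helper.

```python
import random

# Hypothetical two-state MDP; all names and numbers are illustrative assumptions.
S = ["ice", "goal"]   # set of states
A = ["stay", "walk"]  # set of actions
# P[a][s] maps each successor state s' to the probability P_a(s, s')
P = {
    "stay": {"ice": {"ice": 1.0}, "goal": {"goal": 1.0}},
    "walk": {"ice": {"ice": 0.2, "goal": 0.8}, "goal": {"goal": 1.0}},
}
# R[a][(s, s')] is the immediate reward R_a(s, s'); missing entries mean 0
R = {"stay": {}, "walk": {("ice", "goal"): 1.0}}


def sample_transition(s, a, rng=random):
    """Draw a successor state s' ~ P_a(s, .) and return it with its reward."""
    successors = list(P[a][s])
    weights = [P[a][s][s2] for s2 in successors]
    s2 = rng.choices(successors, weights=weights)[0]
    return s2, R[a].get((s, s2), 0.0)


print(sample_transition("ice", "walk"))
```

Each row `P[a][s]` has to sum to 1, which is an easy sanity check when you write down your own model.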

## Problem

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. (However, the ice is slippery, so you won't always move in the direction you intend.)

## Setup

To begin, we'll need to install all the required Python package dependencies.



In [None]:
#!pip install --quiet gym

### Imports and Helper Functions

#### Imports

In [None]:
# Python imports
import random
import heapq
import collections
import math

# Reinforcement Learning environments
import gym
# Scientific computing
import numpy as np
# Plotting library
import matplotlib.pyplot as plt
import matplotlib.cm as cm


#### Helper Functions

In [None]:
# Define the default figure size
plt.rcParams['figure.figsize'] = [16, 4]

def create_numerical_map(env):
    """Convert the string map of the environment to a numerical version"""
    desc = env.unwrapped.desc
    # start ('S') and frozen ('F') tiles are both ice, 'G' is the goal, 'H' is a hole
    codes = {'S': 2, 'F': 2, 'G': 1, 'H': 3}
    numerical_map = np.zeros(desc.shape)
    for i, row in enumerate(desc):
        for j, col in enumerate(row):
            numerical_map[i, j] = codes.get(col.decode('UTF-8'), 0)
    # mark the current position of the agent; states are numbered row by row
    n_cols = desc.shape[1]
    numerical_map[env.unwrapped.s // n_cols, env.unwrapped.s % n_cols] = 0
    return numerical_map


def visualize_env(env):
    """Plot the environment"""
    fig, ax = plt.subplots()
    # Hide grid lines
    ax.grid(False)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title('The Frozen Lake')
    ax.imshow(create_numerical_map(env), cmap=cm.jet)
    plt.show()
    print('The position is blue, holes are red, ice is yellow and the goal is teal.')

#### Deterministic Environments

In [None]:
# register variants of the frozen lake without execution uncertainty i.e. deterministic environments
from gym.envs.registration import register

register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78,  # optimum = .8196
)

register(
    id='FrozenLakeNotSlippery8x8-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '8x8', 'is_slippery': False},
    max_episode_steps=200,
    reward_threshold=0.99,  # optimum = 1
)

### Environment

In [None]:
# Deterministic environments
env_name = 'FrozenLakeNotSlippery-v0'
#env_name = 'FrozenLakeNotSlippery8x8-v0'

# Stochastic environments
#env_name = 'FrozenLake-v0'
#env_name = 'FrozenLake8x8-v0'

Create the environment with the previously selected name:

In [None]:
env = gym.make(env_name)
print('Generated the frozen lake with config: ' + env_name)
env.reset()
visualize_env(env)
env.unwrapped.s = 4
visualize_env(env)

#### Understanding the Environment (Object)

**TASK :**
Analyze the environment object and figure out its *observation space* and *action space* as well as its *reward range*.

What is the size of the observation space?

In [None]:
env.observation_space

What is the size of the action space?

In [None]:
env.action_space

What is the range of rewards?

In [None]:
env.reward_range

### Uncertainty in Execution

In [None]:
actions = {0:"left ",
           1:"down ",
           2:"right",
           3:"up   "}

s = env.reset()
print("the initial state is: {}".format(s))
visualize_env(env)

# The agent should go right
print("executing action 2, should go right")
s1, r, d, _ = env.step(2)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go left
print("executing action 0, should go left")
s1, r, d, _ = env.step(0)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go down
print("executing action 1, should go down")
s1, r, d, _ = env.step(1)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)

# The agent should go up
print("executing action 3, should go up")
s1, r, d, _ = env.step(3)
print("new state is: {} done: {}".format(s1, d))
visualize_env(env)
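
The cell above only shows surprising moves when one of the stochastic environments is selected. As a rough sketch of what happens there, we assume (based on our reading of gym's `FrozenLakeEnv` with `is_slippery=True`) that the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The helper names below are hypothetical and not part of gym.

```python
import random

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def slippery_outcomes(action):
    """Return the three directions a slippery step may take (assumed equally likely)."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def executed_direction(action, rng=random):
    """Sample the direction that is actually executed on slippery ice."""
    return rng.choice(slippery_outcomes(action))


for a, name in ACTION_NAMES.items():
    print(name, "->", [ACTION_NAMES[b] for b in slippery_outcomes(a)])
```

Under this assumption the agent never slides backwards: intending to go right can result in down, right or up, but not left.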


## Monte Carlo Evaluator/Search
* Simulate trajectories through the MDP from the current state $s_t$
* Apply model-free RL to simulated episodes

![Monte Carlo Evaluator/Search](./img/monte_carlo_search.png)

### Monte Carlo Estimate
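### $\hat{V}(s)=\frac{1}{K}\sum_{k=1}^{K}{G_t^{(k)}}$

Here $G_t^{(k)}$ is the discounted return of the $k$-th simulated episode that starts in state $s$ at time $t$, and $K$ is the number of simulated episodes.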
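
The estimate can be sketched in a few lines of plain Python, independent of gym. The functions `discounted_return`, `monte_carlo_estimate` and `coin_flip_episode` are hypothetical names chosen for this illustration; the discounting starts at exponent 0 for the first reward, which matches the convention used in the `simulate` methods of this notebook.

```python
import random


def discounted_return(rewards, discount_factor):
    """Sum the rewards of one episode, discounting the i-th reward by discount_factor**i."""
    return sum(r * discount_factor**i for i, r in enumerate(rewards))


def monte_carlo_estimate(sample_rewards, k, discount_factor=0.8):
    """Average the discounted returns of k simulated episodes."""
    returns = [discounted_return(sample_rewards(), discount_factor) for _ in range(k)]
    return sum(returns) / k


def coin_flip_episode(rng=random):
    """Hypothetical episode: two steps without reward, then reward 1 with probability 0.5."""
    return [0.0, 0.0, 1.0 if rng.random() < 0.5 else 0.0]


# The estimate fluctuates around the expected return 0.5 * 0.8**2 = 0.32
print(monte_carlo_estimate(coin_flip_episode, k=1000))
```

Increasing `k` reduces the variance of the estimate, which is the reason why the evaluator below runs many simulations per decision.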


In [None]:
class MCS:
    def __init__(self, env, state = 0, iterations = 1000, discount_factor = 0.8):
        # number of simulated episodes per decision
        self.number_iterations = iterations
        # discount factor for future rewards
        self.discount_factor = discount_factor
        # environment
        self.env = env
        # initial state
        self.state = state
        self.env.unwrapped.s = self.state
        self.actions_q = np.zeros(self.env.action_space.n)
        self.actions_q_max = np.zeros(self.env.action_space.n)
        self.actions_visits = np.zeros(self.env.action_space.n)
        #visualize_env(self.env)
    
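    def best_action(self):
        # reset the statistics, so that repeated calls (e.g. from a new state) start fresh
        self.actions_q = np.zeros(self.env.action_space.n)
        self.actions_q_max = np.zeros(self.env.action_space.n)
        self.actions_visits = np.zeros(self.env.action_space.n)
        for i in range(self.number_iterations):
            action = self.random_action()
            g = self.simulate(action)
            self.actions_q[action] += g
            if g > self.actions_q_max[action]:
                self.actions_q_max[action] = g
            self.actions_visits[action] += 1
        # average the accumulated returns; actions that were never tried keep a value of 0
        self.actions_q = np.divide(self.actions_q, self.actions_visits, out=np.zeros_like(self.actions_q), where=self.actions_visits!=0)
        return np.argmax(self.actions_q)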
    
    def random_action(self):
        return random.randint(0,self.env.action_space.n-1)
    
    def simulate(self, action):
        self.env.reset()
        self.env.unwrapped.s = self.state
        done = False
        depth = 0
        g = 0
        state, r, done, _ = self.env.step(action)
        g += r*self.discount_factor**depth
        depth +=1
        while not done:
            action = self.random_action()
            state, r, done, _ = self.env.step(action)
            g += r*self.discount_factor**depth
            depth +=1
        return g

In [None]:
env = gym.make(env_name)
sim = gym.make(env_name)
env.reset()
sim.reset()
# set initial state
state = 0
env.unwrapped.s = state
mcs = MCS(sim, state, iterations=10000)
visualize_env(env)
action = mcs.best_action()
print("the best action is action {}, {}".format(action, actions[action]))
print(env.step(action))
visualize_env(env)

print("avg V(s):\t {0:.3f}, max V(s):\t {1:.3f}".format(np.mean(mcs.actions_q), np.max(mcs.actions_q_max)))

for key, val in actions.items():
    print("avg Q(s,{2}):\t {0:.3f}, max Q(s,{2}):\t {1:.3f}".format(mcs.actions_q[key], mcs.actions_q_max[key], val))

## Monte Carlo Tree Search
* Simulate trajectories through the MDP from the current state $s_t$ building a tree
* Apply model-free RL to simulated episodes

### In-Tree and Out-of-Tree
* Selection Policy (improves): select actions maximizing action values
* Simulation Policy (fixed): select actions randomly

### Balance Exploration and Exploitation

### $UCT(s,a) = \hat{Q}(s,a)+c\sqrt{\frac{\ln{N(s)}}{N(s,a)}}$
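
As a small worked example of the formula, the sketch below scores two hypothetical children of a node that has been visited 100 times. The function name `uct_score` and the numbers are made up for illustration; $c = 0.7$ mirrors the default used in the `Node.uct` method further down.

```python
import math


def uct_score(q_value, parent_visits, child_visits, c=0.7):
    """UCT(s,a) = Q(s,a) + c * sqrt(ln N(s) / N(s,a))."""
    return q_value + c * math.sqrt(math.log(parent_visits) / child_visits)


# A frequently visited child with a good estimate ...
well_explored = uct_score(q_value=0.5, parent_visits=100, child_visits=90)
# ... and a rarely visited child with a worse estimate
rarely_tried = uct_score(q_value=0.3, parent_visits=100, child_visits=10)
print(well_explored, rarely_tried)
```

With these numbers the rarely tried child receives the higher score, because its exploration bonus outweighs its lower value estimate. A larger $c$ shifts the balance further towards exploration.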

### Phases
* Selection
* Expansion
* Simulation
* Update
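
The four phases can be sketched on a toy problem that needs no gym installation. Everything in this block is an assumption made for illustration: a chain of states 0 to 3 in which reaching state 3 pays a reward of 1 and stepping left from state 0 ends the episode in a hole. The names `step`, `TreeNode`, `rollout` and `mcts` are hypothetical; this is not the implementation used in the cells below.

```python
import math
import random

GOAL = 3  # toy chain: states 0..3, reaching state 3 pays reward 1


def step(state, action):
    """Toy model: action 0 moves left, action 1 moves right.
    Stepping left from state 0 falls into a hole (terminal, reward 0)."""
    nxt = state + (1 if action == 1 else -1)
    if nxt < 0:
        return state, 0.0, True
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False


class TreeNode:
    def __init__(self, state, parent=None, action=None, reward=0.0, done=False):
        self.state, self.parent, self.action = state, parent, action
        self.reward, self.done = reward, done  # reward received when entering this node
        self.children = []
        self.untried = [] if done else [0, 1]
        self.visits = 0
        self.value = 0.0  # average return of taking `action` in the parent state


def uct(node, c=0.7):
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def rollout(state, gamma, rng):
    """Simulation phase: follow a uniformly random policy until termination."""
    g, discount, done = 0.0, 1.0, False
    while not done:
        state, r, done = step(state, rng.choice([0, 1]))
        g += discount * r
        discount *= gamma
    return g


def mcts(root_state, iterations=200, gamma=0.8, seed=0):
    rng = random.Random(seed)
    root = TreeNode(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT
        while not node.untried and node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: add one child for a randomly chosen untried action
        if node.untried:
            action = node.untried.pop(rng.randrange(len(node.untried)))
            state, r, done = step(node.state, action)
            child = TreeNode(state, node, action, r, done)
            node.children.append(child)
            node = child
        # 3. Simulation: estimate the return from the selected or new node
        g = 0.0 if node.done else rollout(node.state, gamma, rng)
        # 4. Update: back up the return, adding edge rewards and discounting
        while node is not None:
            if node.parent is not None:
                g = node.reward + gamma * g
            node.visits += 1
            node.value += (g - node.value) / node.visits
            node = node.parent
    return max(root.children, key=lambda n: n.value).action


print(mcts(0))
```

From state 0 the left action ends in the hole with a return of 0, so the search should prefer the right action (1) as soon as a single simulated trajectory through the right child reaches the goal.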

In [None]:
class Node:
    def __init__(self, state=0, action=-1, done=False, parent=None):
        # current state of the environment
        self.state = state
        # number of trajectories that passed through this node
        self.visits = 0
        # average v value that results from starting in this node
        self.v_value = 0
        # action that led to this node
        self.action = action
        # untried actions (i.e. the actions that have not been explored)
        self.untried_actions = [0, 1, 2, 3]
        # parent node pointer
        self.parent = parent
        # children node pointers
        self.children = []
        # flag that indicates that the node is terminal (e.g. the environment is in a terminal state)
        self.done = done
        
    def uct(self, c = 0.7):
        """Calculate the UCT value for a given child node (i.e. the value from executing a in s)"""
        # if the node has not been visited return a high UCT score, forcing expansion
        if self.visits == 0:
            return 100
        # if the node has been visited calculate it using the UCB formula
        return self.v_value + c* math.sqrt(math.log(self.parent.visits)/self.visits)
    
    def best_child(self):
        """Return the best child based on the maximum UCT value."""
        uct_values = []
        for child in self.children:
            uct_values.append(child.uct())
        uct_index = np.argmax(uct_values)
        return self.children[uct_index]
    
    def max_action_value(self):
        """Return the child with the highest action value."""
        v_values = []
        for child in self.children:
            v_values.append(child.v_value)
        v_values_index = np.argmax(v_values)
        return self.children[v_values_index]
    
    def max_visits(self):
        """Return the child with the highest visit count."""
        visits = []
        for child in self.children:
            visits.append(child.visits)
        visits_index = np.argmax(visits)
        return self.children[visits_index]
    
    def str(self):
        if not self.parent:
            return "s:{}, N(s,{}):{}, \tV(s):{:.3f}, parent:{}".format(self.state, "none ", self.visits, self.v_value, self.parent)
        else:
            return "s:{}, N(s,{}):{}, \tQ(s,a):{:.3f}, parent:{}".format(self.state, actions[self.action], self.visits, self.v_value,  self.parent)

In [None]:
class MCTS:
    def __init__(self, env, state = 0, iterations = 1000, discount_factor = 0.8):
        # maximum number of simulations
        self.number_iterations = iterations
        # discount factor for future rewards
        self.discount_factor = discount_factor
        # environment
        self.env = env
        # initial state
        self.state = state
        self.env.unwrapped.s = self.state
        #visualize_env(self.env)
    
    def select(self, node):
        # if the node has no untried actions left, choose the best child using UCB1
        while len(node.untried_actions) == 0:
            node = node.best_child()
        return node
    
    def expand(self, node):
        # expand the node with a random action
        if not node.done:
            action = np.random.choice(node.untried_actions)
            node.untried_actions.remove(action)

            self.env.reset()
            self.env.unwrapped.s = node.state
            state, r, done, _ = self.env.step(action)
            child = Node(state, action, done, node)
            node.children.append(child)
            return child, r
        else:
            self.env.reset()
            self.env.unwrapped.s = node.parent.state
            state, r, done, _ = self.env.step(node.action)
            return node, r
    
    def simulate(self, node):
        """Monte Carlo Evaluator"""
        self.env.reset()
        self.env.unwrapped.s = node.state
        done = False
        depth = 0
        g = 0
        # follow the random simulation policy until the episode terminates
        while not done:
            action = self.random_action()
            state, r, done, _ = self.env.step(action)
            g += r*self.discount_factor**depth
            depth +=1
        return g
       
    def update(self,node,g):
        depth = 0
        while node.parent:
            node.visits += 1
            node.v_value = (node.v_value*(node.visits-1)+g*self.discount_factor**depth)/node.visits
            node = node.parent
            depth += 1
        node.visits += 1
        node.v_value = (node.v_value*(node.visits-1)+g*self.discount_factor**depth)/node.visits
            
    def best_action(self, root):
        for i in range(self.number_iterations):
            self.env.reset()
            self.env.unwrapped.s = root.state
            node = self.select(root)
            child, r = self.expand(node)
            if not child.done:
                g = self.simulate(child)
            else:
                g = r
            self.update(child, g)
        return root.max_action_value().action
    
    def random_action(self):
        return random.randint(0,self.env.action_space.n-1)

In [None]:
env = gym.make(env_name)
sim = gym.make(env_name)
env.reset()
sim.reset()
# set initial state
state = 0
env.unwrapped.s = state
mcts = MCTS(sim, state, iterations = 10000)
visualize_env(env)
root_node = Node(state)
action = mcts.best_action(root_node)
print(root_node.str())

print(root_node.children[0].str())
print(root_node.children[1].str())
print(root_node.children[2].str())
print(root_node.children[3].str())
print("the best action is action {}, {}".format(action, actions[action]))
print(env.step(action))
visualize_env(env)

In [None]:
def plan_mcs(iterations, state = 0, output = False):
    env = gym.make(env_name)
    sim = gym.make(env_name)
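    env.reset()
    sim.reset()
    # reset() puts the agent on the start tile, so move it to the requested state
    env.unwrapped.s = state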
    # initialize the Monte Carlo Evaluator
    mcs = MCS(sim, state, iterations = iterations)
    done = False
    steps = 0
    while not done:
        mcs.state = state
        action = mcs.best_action()
        steps += 1
        # take one step in the environment
        state, r, done, _ = env.step(action)
    if output:
        visualize_env(env)
        print("reached state: {}, after {} steps".format(state, steps))
    
    if done and r != 1:
        r = -1
    return steps, r

In [None]:
def plan_mcts(iterations, state = 0, output = False):
    env = gym.make(env_name)
    sim = gym.make(env_name)
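    env.reset()
    sim.reset()
    # reset() puts the agent on the start tile, so move it to the requested state
    env.unwrapped.s = state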
    # initialize the Monte Carlo Tree Search
    mcts = MCTS(sim, state, iterations = iterations)
    done = False
    steps = 0
    while not done:
        root_node = Node(state)
        action = mcts.best_action(root_node)
        steps += 1
        # take one step in the environment
        state, r, done, _ = env.step(action)
    if output:
        visualize_env(env)
        print("reached state: {}, after {} steps".format(state, steps))
    if done and r != 1:
        r = -1
    return steps, r

## Comparison of MCS and MCTS
* MCTS requires fewer iterations to reach the goal state
* Due to the uniform action exploration of MCS (used in the `plan_mcs` function), its estimates for all actions are less skewed than they are for MCTS, so it reaches the goal more frequently (but more slowly)
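
The comparison reports the mean and standard deviation of the episode length and of the reward per step over several runs. As a sketch of how these metrics are formed, the hypothetical helper `summarize_runs` below aggregates `(steps, reward)` pairs with the standard library only. The example outcomes are invented, with a reward of -1 standing for an episode that ended in a hole, as in the planning functions above.

```python
import statistics


def summarize_runs(results):
    """Aggregate a list of (steps, reward) pairs, one pair per run."""
    steps = [s for s, _ in results]
    step_rewards = [r / s for s, r in results]
    return {
        "avg_steps": statistics.mean(steps),
        "std_steps": statistics.pstdev(steps),
        "avg_step_reward": statistics.mean(step_rewards),
        "std_step_reward": statistics.pstdev(step_rewards),
    }


# Hypothetical runs: goal after 6 steps, hole after 4 steps, goal after 8 steps
print(summarize_runs([(6, 1), (4, -1), (8, 1)]))
```

Dividing the reward by the episode length rewards short successful episodes and penalizes quick failures, so both the success rate and the speed show up in a single number.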

In [None]:
# specify evaluation points
iterations = [25, 50, 100, 200, 400]
#iterations = [200, 400, 800, 1600]
# specify number of runs for each evaluation point
runs = 10

# initialize empty metrics
avg_mcs_steps = []
std_mcs_steps = []
avg_mcts_steps = []
std_mcts_steps = []
avg_mcs_rs = []
std_mcs_rs = []
avg_mcts_rs = []
std_mcts_rs = []

mcs_steps = []
mcts_steps = []
mcs_rs = []
mcts_rs = []

start_state = 1

for it in iterations:
    # reset counters
    mcs_steps = []
    mcts_steps = []
    mcs_rs = []
    mcts_rs = []
    for i in range(runs):
        # solve MDP with MCS
        mcs_step, mcs_r = plan_mcs(it, start_state, False)
        # solve MDP with MCTS
        mcts_step, mcts_r = plan_mcts(it, start_state, False)
        # track metrics counters
        mcs_steps = np.append(mcs_steps,mcs_step)
        mcts_steps = np.append(mcts_steps,mcts_step)
        # reward per step of this run (divide by the scalar step count, not by the array of all runs)
        mcs_rs = np.append(mcs_rs,mcs_r/mcs_step)
        mcts_rs = np.append(mcts_rs,mcts_r/mcts_step)
        
    # aggregate values
    avg_mcs_steps = np.append(avg_mcs_steps,np.mean(mcs_steps))
    std_mcs_steps = np.append(std_mcs_steps, np.std(mcs_steps))
    avg_mcts_steps = np.append(avg_mcts_steps,np.mean(mcts_steps))
    std_mcts_steps = np.append(std_mcts_steps,np.std(mcts_steps))
    avg_mcs_rs = np.append(avg_mcs_rs,np.mean(mcs_rs))
    std_mcs_rs = np.append(std_mcs_rs,np.std(mcs_rs))
    avg_mcts_rs = np.append(avg_mcts_rs,np.mean(mcts_rs))
    std_mcts_rs = np.append(std_mcts_rs,np.std(mcts_rs))
    
fig, ax = plt.subplots(1, 2)
# Plot the average episode length
ax[0].errorbar(iterations, avg_mcs_steps, yerr=std_mcs_steps, color="red", label='MCS')
ax[0].errorbar(iterations, avg_mcts_steps, yerr=std_mcts_steps, color="blue", label='MCTS')
ax[0].set(xlabel='#Simulations', ylabel='Steps', title='Average Episode Length')
ax[0].grid()
ax[0].legend()

# Plot the average episode reward
ax[1].errorbar(iterations, avg_mcs_rs, yerr=std_mcs_rs, color="red", label='MCS')
ax[1].errorbar(iterations, avg_mcts_rs, yerr=std_mcts_rs, color="blue", label='MCTS')
ax[1].set(xlabel='#Simulations', ylabel='Reward', title='Average Step Reward')
ax[1].grid()
ax[1].legend();