# Machine Learning Summer School - London 2019
## Reinforcement Learning Tutorial

### Author: [Katja Hofmann](https://www.microsoft.com/en-us/research/people/kahofman/)

This tutorial uses the [MineRL package](http://minerl.io/) to illustrate how a Reinforcement Learning (RL) agent can learn to interact with the popular video game [Minecraft](https://www.minecraft.net/en-us/). MineRL was developed by a team led by [William H. Guss](http://wguss.ml/) and [Brandon Houghton](https://github.com/brandonhoughton) for the NeurIPS 2019 MineRL competition, hosted by AICrowd and sponsored by Microsoft. MineRL is based on [Project Malmo](https://www.microsoft.com/en-us/research/project/project-malmo/), developed at [Microsoft Research](https://www.microsoft.com/en-us/research/theme/game-intelligence/). This tutorial uses the deep learning framework [chainer](https://chainer.org/) to implement RL algorithms.

**Further reading:**
- [MineRL Competition at AICrowd](https://www.aicrowd.com/challenges/neurips-2019-minerl-competition)
- [Guss et al. 2019: The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors](https://arxiv.org/abs/1904.10079)

### Setup

If this is the first time you run the tutorial, install the latest MineRL package as shown below. 
For details and prerequisites, see http://minerl.io/docs/tutorials/getting_started.html

In [None]:
# uncomment to install minerl
# check recommended version - currently: 0.1.9
#!pip install --upgrade minerl==0.1.9
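# pip takes space-separated package names; cv2 is provided by opencv-python,
# pylab ships with matplotlib, and logging is part of the standard library
#!pip install --upgrade chainer opencv-python matplotlib numpy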

Import required packages:

In [None]:
# environments
import gym
import minerl

# chainer
import chainer.functions as F
import chainer.links as L
from chainer import initializers
from chainer import serializers
from chainer import optimizers, Chain, Variable

# visualization
%matplotlib nbagg
import matplotlib.pyplot as plt
import matplotlib.animation as anim
import matplotlib.gridspec as gridspec
import matplotlib.image as mpimg

import pylab
from IPython import display

# get DEBUG logging from MineRL while Minecraft starts up
import sys
import logging
logger = logging.getLogger("minerl")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

# utilities
import random
import numpy as np
import cv2

### Test

Test your MineRL installation by instantiating an environment, and creating a first agent to navigate this environment. The example below is based on the [MineRL Tutorial](http://minerl.io/docs/tutorials/first_agent.html).

First, we instantiate a MineRL gym environment. This will take a couple of minutes, as Minecraft is started in the background. Debug output will be generated while Minecraft is starting. The generated output can usually be ignored, but can be useful in case something goes wrong.

In [None]:
# Create a MineRL environment - be patient
nav_env = gym.make('MineRLNavigateDense-v0')

Now we're ready for our first interaction with the environment. Initially, actions are hard-coded to move in the direction indicated by the compass, as shown in the [MineRL Tutorial](http://minerl.io/docs/tutorials/first_agent.html).
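
The hard-coded policy is essentially a proportional controller on the compass reading: the camera turn is the compass angle scaled by a small gain, while the agent keeps walking, jumping and attacking. A minimal sketch of that rule, using a plain dictionary as a stand-in for the MineRL action (the helper name `compass_policy` and the `gain` parameter are illustrative and not part of MineRL):

```python
def compass_policy(compass_angle, gain=0.03):
    """Return a dict shaped like the navigation action used in this tutorial.

    This is a stand-in only; in the notebook the dict comes from
    nav_env.action_space.noop().
    """
    return {
        'camera': [0, gain * compass_angle],  # turn proportionally to the angle
        'forward': 1,
        'back': 0,
        'jump': 1,
        'attack': 1,
    }

action = compass_policy(45.0)
print(action['camera'])
```

A larger gain turns the agent faster but can overshoot the target direction; the value 0.03 is the one used in the cell that follows.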

In [None]:
# test interaction with the environment - setup

obs, _ = nav_env.reset() # this may take up to a minute
done = False

from time import time
start = time()
net_reward = 0
stepcount = 0
maxsteps = 1000

# prepare visuals
fig = pylab.figure(figsize=(10, 5))
gs = gridspec.GridSpec(1, 2)
ax1 = pylab.subplot(gs[0, 0])
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
imgplot = ax1.imshow(obs['pov'])
ax2 = pylab.subplot(gs[0, 1])
rewards = [0]
line, = ax2.plot(range(len(rewards)), rewards)
pylab.show()


In [None]:
# interact with the environment, results are updated in the plot above
while not done:
    action = nav_env.action_space.noop()
    action['camera'] = [0, 0.03*obs['compassAngle']]
    action['back'] = 0
    action['forward'] = 1
    action['jump'] = 1
    action['attack'] = 1

    obs, reward, done, info = nav_env.step(action)
    rewards.append(reward)

    if stepcount % 50 == 0:
        imgplot.set_data(obs['pov'])
        line.set_data(range(len(rewards)), rewards)
        ax2.set_xlim(0, len(rewards))
        ax2.set_ylim(min(rewards), max(rewards))
        fig.canvas.draw()

    stepcount += 1
    if stepcount >= maxsteps:
        break

print("Time taken for %d steps: %.1f seconds" % (stepcount, time() - start))
print("Total reward: ", sum(rewards))

In [None]:
# See the last action taken
action

### RL Experiment Setup

In this section, we set up a number of components that make it easy to run and visualize RL experiments in Jupyter Notebooks. The main abstractions are:
- Environment: defines an interactive control task. We assume environments implement the [OpenAI gym interface](https://gym.openai.com/). In addition to the MineRL environment introduced above, we will also implement a toy task, SimpleRooms, to illustrate typical environment functionality.
- Agent: interacts with an environment by receiving observations and rewards, and taking actions. We will implement several agents throughout this tutorial, starting from a random agent and moving to a Deep Q-Network agent that implements Q-Learning.
- Experiment: connects agents and environments, and collects and reports results.
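
Before looking at the actual classes, the way these three pieces fit together can be sketched with standard-library Python only. Everything in this block (`ToyEnv`, `UniformAgent`, `run`) is a hypothetical stand-in for illustration, not the tutorial's implementation:

```python
import random

class ToyEnv:
    """Hypothetical one-state task: action 1 pays 1, action 0 pays 0."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the single observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return 0, reward, done, {}

class UniformAgent:
    """Hypothetical agent that ignores its inputs and acts at random."""

    def __init__(self, num_actions):
        self.num_actions = num_actions

    def step(self, obs, reward, done, info):
        return random.randrange(self.num_actions)

def run(env, agent, num_steps):
    """Experiment loop: observations go to the agent, actions to the env."""
    obs, reward, done = env.reset(), 0.0, False
    total = 0.0
    for _ in range(num_steps):
        action = agent.step(obs, reward, done, None)
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            obs = env.reset()
    return total

random.seed(0)
print(run(ToyEnv(), UniformAgent(2), 100))
```

The agent sees the reward for its previous action only on the next call to `step`, which is the same calling convention the `Experiment` class in this tutorial uses.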

In [None]:
# environment interface

class Environment(object):

    def reset(self):
        raise NotImplementedError('Inheriting classes must override reset.')

    def actions(self):
        raise NotImplementedError('Inheriting classes must override actions.')

    def step(self, action):
        raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):
    
    def __init__(self, actions):
        self.actions = actions
        self.n = len(actions)

In [None]:
class SimpleRooms(Environment):
    """Define a simple 4-room environment with 16 states
       actions: 0 - north, 1 - east, 2 - west, 3 - south"""

    def __init__(self):
        super(SimpleRooms, self).__init__()

        # define state and action space
        self.S = range(16)
        self.action_space = ActionSpace(range(4))

        # define reward structure
        self.R = [0] * len(self.S)
        self.R[random.choice(self.S)] = 1

        # define transitions
        self.P = {}
        self.P[0] = [1, 4]
        self.P[1] = [0, 2, 5]
        self.P[2] = [1, 3, 6]
        self.P[3] = [2, 7]
        self.P[4] = [0, 5, 8]
        self.P[5] = [1, 4]
        self.P[6] = [2, 7]
        self.P[7] = [3, 6, 11]
        self.P[8] = [4, 9, 12]
        self.P[9] = [8, 13]
        self.P[10] = [11, 14]
        self.P[11] = [7, 10, 15]
        self.P[12] = [8, 13]
        self.P[13] = [9, 12, 14]
        self.P[14] = [10, 13, 15]
        self.P[15] = [11, 14]

        self.max_trajectory_length = 50
        self.tolerance = 0.1
        self._rendered_maze = self._render_maze()

    def step(self, action):
        s_prev = self.s
        self.s = self.single_step(self.s, action)
        reward = self.single_reward(self.s, s_prev, self.R)
        self.nstep += 1
        self.is_reset = False

        if (reward < -1. * (self.tolerance) or reward > self.tolerance) or self.nstep == self.max_trajectory_length:
            self.reset()

        return (self._convert_state(self.s), reward, self.is_reset, '')

    def single_step(self, s, a):
        if a < 0 or a > 3:
            raise ValueError('Unknown action', a)
        if a == 0 and (s-4 in self.P[s]):
            s -= 4
        elif a == 1 and (s+1 in self.P[s]):
            s += 1
        elif a == 2 and (s-1 in self.P[s]):
            s -= 1
        elif a == 3 and (s+4 in self.P[s]):
            s += 4
        return s

    def single_reward(self, s, s_prev, rewards):
        if s == s_prev:
            return 0
        return rewards[s]

    def reset(self):
        self.nstep = 0
        self.s = random.choice(self.S)
        # disallow spawning in a reward state
        while (self.R[self.s] < -1. * (self.tolerance) or self.R[self.s] > self.tolerance):
            self.s = random.choice(self.S)
        self.is_reset = True
        return self._convert_state(self.s)

    def _convert_state(self, s):
        converted = np.zeros(len(self.S), dtype=np.float32)
        converted[s] = 1
        return converted

    def _get_render_coords(self, s):
        return ((s // 4) * 4, (s % 4) * 4)

    def _render_maze(self):
        # draw background and grid lines
        maze = np.zeros((17, 17))
        for x in range(0, 17, 4):
            maze[x, :] = 0.5
        for y in range(0, 17, 4):
            maze[:, y] = 0.5

        # draw reward and transitions
        for s in range(16):
            if self.R[s] != 0:
                x, y = self._get_render_coords(s)
                maze[x+1:x+4, y+1:y+4] = self.R[s]
            if self.single_step(s, 0) == s:
                x, y = self._get_render_coords(s)
                maze[x, y:y+5] = -1
            if self.single_step(s, 1) == s:
                x, y = self._get_render_coords(s)
                maze[x:x+5, y+4] = -1
            if self.single_step(s, 2) == s:
                x, y = self._get_render_coords(s)
                maze[x:x+5, y] = -1
            if self.single_step(s, 3) == s:
                x, y = self._get_render_coords(s)
                maze[x+4, y:y+5] = -1
        return maze

    def render(self, mode = 'rgb_array'):
        assert mode == 'rgb_array', 'Unknown mode: %s' % mode
        img = np.array(self._rendered_maze, copy=True)

        # draw current agent location
        x, y = self._get_render_coords(self.s)
        img[x+1:x+4, y+1:y+4] = 2.0
        return img


In [None]:
class Agent(object):
    '''Agent base class'''

    def __init__(self, actions):
        self.actions = actions
        self.num_actions = len(actions)

    def step(self, obs, reward, done, info):
        raise NotImplementedError

class RandomAgent(Agent):
    '''Agent that samples actions uniformly at random'''

    def __init__(self, actions):
        super(RandomAgent, self).__init__(actions)
    
    def step(self, obs, reward, done, info):
        self.current_loss = random.random()
        return random.randint(0, self.num_actions-1)

class DefaultAgent(Agent):
    '''Agent that always takes a default action'''
    def __init__(self, actions, default_action):
        super(DefaultAgent, self).__init__(actions)
        self.default_action = default_action
    
    def step(self, obs, reward, done, info):
        self.current_loss = random.random()
        return self.default_action

In [None]:
class Experiment(object):

    def __init__(self, env, agent, normobs=False):
        
        self.env = env
        self.agent = agent

        self.epoch_losses = [0]
        self.rolling_average = np.array([0])
        self.windowsize = 100
        self.normalize_observations = normobs

        # prepare visuals
        self.fig = pylab.figure(figsize=(10, 5))
        gs = gridspec.GridSpec(2, 2)
        self.ax = pylab.subplot(gs[:, 0])
        self.ax.title.set_text('Current frame')
        self.ax.xaxis.set_visible(False)
        self.ax.yaxis.set_visible(False)
        self.ax1 = pylab.subplot(gs[0, 1])
        self.ax1.title.set_text('Rolling average reward')
        self.ax2 = pylab.subplot(gs[1, 1])
        self.ax2.title.set_text('Average loss')
        
        self.line, = self.ax1.plot(range(len(self.rolling_average)), self.rolling_average)
        self.line2, = self.ax2.plot(range(len(self.epoch_losses)), self.epoch_losses)
        self.imgplot = self.ax.imshow(np.random.random((64,64)), interpolation='none', cmap='viridis')
        self.first_render = True

        pylab.show()

    def run(self, num_steps, display_frequency):
        self.display_frequency = display_frequency
        observation = self.env.reset()
        self.update_display()
        steps = 0
        done = False
        reward = .0
        rewards = np.array([])
        losses = []

        while steps < num_steps:
            steps += 1
            if self.normalize_observations:
                observation = (observation / 255.).astype(np.float32, copy=False)
            action = self.agent.step(observation, reward, done, None)
            observation, reward, done, _ = self.env.step(action)
            losses.append(self.agent.current_loss)

            if done:
                observation = self.env.reset()

            rewards = np.append(rewards, reward)
            self.rolling_average = np.append(self.rolling_average,
                                        sum(rewards[-self.windowsize:])/len(rewards[-self.windowsize:]))

            if steps % self.display_frequency == 0:
                self.epoch_losses = np.append(self.epoch_losses, np.mean(losses))
                self.update_display()
                losses = []
      
    def update_display(self):
        self.imgplot.set_data(self.env.render(mode='rgb_array'))
        self.line.set_data(range(len(self.rolling_average)), self.rolling_average)
        self.ax1.set_xlim(0, max(100, len(self.rolling_average)))
        self.ax1.set_ylim(min(self.rolling_average)-0.01, max(self.rolling_average)+0.01 * 1.1)

        self.line2.set_data(range(0, len(self.epoch_losses)), self.epoch_losses)
        self.ax2.set_xlim(0, max(100, len(self.epoch_losses)))
        self.ax2.set_ylim(min(min(self.epoch_losses), 1e-5), max(self.epoch_losses)+0.01 * 1.1)
        self.fig.canvas.draw()


### Experiment 1: Random Agent on Simple Rooms

We are ready to set up a first simple experiment. This illustrates how an experiment connects environment and agent. We'll run a random agent on the SimpleRooms environment.

In [None]:
# experiment setup
simple_env = SimpleRooms()
random_agent = RandomAgent(simple_env.action_space.actions)
experiment = Experiment(simple_env, random_agent)

In [None]:
# run the experiment for 1000 steps
experiment.run(1000, 20)

### Experiment 2: Random Agent on MineRL-NavDense

Now for the real thing: our first experiment with the MineRL environment. To keep the example easy to follow and to make the task learnable within a couple of minutes, the environment wrapper below simplifies the action space, the observation space, and the reward signal.
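
The simplified reward depends only on the chosen discrete action and the current compass angle. As a standalone sketch of the rule used in the wrapper (the function name `shaped_reward` is illustrative; the thresholds and values are copied from the wrapper code):

```python
def shaped_reward(action, compass_angle):
    """Simplified reward used in place of the environment's own reward.

    action: 0 = move forward, 1 = turn towards the compass, 2 = move back
    """
    if action == 0:
        # moving forward pays most when the compass angle is below 1
        return 0.1 if compass_angle < 1 else 0.01
    # any other action is penalised
    return -0.1

for a, angle in [(0, 0.5), (0, 30.0), (1, 30.0), (2, 0.0)]:
    print(a, angle, shaped_reward(a, angle))
```

Under this rule only moving forward earns a positive reward, which gives a learning agent a strong and immediate signal compared with the original dense navigation reward.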

In [None]:
class DiscreteMinecraftEnvWrapper(Environment):
    '''Wrap a MineRL environment to discretize actions - assumes a Nav environment'''

    def __init__(self, env):

        self.env = env
        # define action space
        self.action_space = ActionSpace(range(3))

    def reset(self):
        self.obs, _ = self.env.reset()
        self.steps_this_episode = 0
        return self._convert_obs(self.obs)

    def step(self, action):
        self.steps_this_episode += 1
        self.obs, self.reward, self.done, self.info = self.env.step(self._convert_action(action))
        # simplify reward signal
        if action == 0:
            # use the wrapper's own latest observation, not a notebook-level global
            if self.obs['compassAngle'] < 1:
                self.reward = .1
            else:
                self.reward = .01
        else:
            self.reward = -.1
        return self._convert_obs(self.obs), self.reward, self.done, self.info

    def _convert_obs(self, obs):
        '''Extract visuals'''
        # constructs obs of size 3 x 3 x 3 + 1 = 28
        low_res = cv2.resize(obs['pov'], dsize=(3, 3), interpolation=cv2.INTER_NEAREST)
        return np.float32(np.hstack([low_res.flatten(), obs['compassAngle']]))

    def _convert_action(self, action):
        base_action =  self.env.action_space.noop()
        base_action['jump'] = 1
        base_action['attack'] = 1

        if action == 0:
            # move forward
            base_action['forward'] = 1
        elif action == 1:
            # turn towards the compass direction
            base_action['camera'] = [0, 0.03 * self.obs['compassAngle']]
        elif action == 2:
            # move back
            base_action['back'] = 1
        else:
            raise NotImplementedError('Action %d is not implemented.' % action)

        return base_action

    def render(self, mode):
        return self.obs['pov']

In [None]:
# experiment setup
wrapped_env = DiscreteMinecraftEnvWrapper(nav_env)
random_agent = RandomAgent(wrapped_env.action_space.actions)
# default_agent = DefaultAgent(wrapped_env.action_space.actions, 0)
experiment = Experiment(wrapped_env, random_agent)

In [None]:
# run the experiment for 500 steps
experiment.run(500, 20)

In [None]:
# when done using the MineRL environment, close it down - this will stop the Minecraft client
# (Experiment 4 below reuses nav_env, so only close it here if you stop at this point)
# nav_env.close()

### RL: DQN Agent

We are ready to implement our reinforcement learning agent. The code below implements the DQN agent by [Mnih et al. 2015](https://www.nature.com/articles/nature14236/), but instead of a convolutional network we will use 2 fully connected layers (to allow running experiments in reasonable time without GPU).
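
For reference, the quantities computed in the model update further down can be written as follows. The symbols are chosen here to mirror the variable names in the code: $Q$ is the model network, $\hat{Q}$ the target network, $r$ the reward, $d$ the terminal flag (1 if the episode ended, 0 otherwise) and $\gamma$ the discount factor.

$$
y = r + \gamma\,(1 - d)\,\max_{a'} \hat{Q}(s', a'), \qquad \delta = y - Q(s, a)
$$

$$
\ell(\delta) =
\begin{cases}
\delta^2 & \text{if } |\delta| \le 1 \\
|\delta| & \text{otherwise}
\end{cases}
$$

The loss that is minimized is the average of $\ell(\delta)$ over the sampled minibatch.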

The QLearningAgent class lays out the required components: model network, target network, explorer, replay memory, and optimizer. The components are implemented in turn below.
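
The target network does not learn by gradient descent; in this tutorial's code it slowly tracks the model network at a rate controlled by `tau`. A minimal sketch of that tracking rule on plain Python lists, with made-up weights (the helper `soft_update` is illustrative; the agent applies the same formula to Chainer parameters):

```python
def soft_update(target, model, tau):
    """Move each target weight a fraction tau towards the model weight."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target, model)]

model_weights = [1.0, -2.0, 0.5]   # made-up numbers for illustration
target_weights = [0.0, 0.0, 0.0]

for _ in range(1000):
    target_weights = soft_update(target_weights, model_weights, tau=0.001)

# with fixed model weights, the remaining gap shrinks by (1 - tau) per update
print(target_weights)
```

A small `tau` keeps the Q targets nearly constant between updates, which tends to stabilize learning at the cost of the targets lagging behind the model.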

In [None]:
class QLearningAgent(Agent):
    """Q-Learning agent with function approximation."""

    def __init__(self, actions, obs_size, **kwargs):
        super(QLearningAgent, self).__init__(actions)

        self.obs_size = obs_size
        self.tau = kwargs.get('tau', .0001)
        
        self.model_network = QNetwork(self.obs_size, self.num_actions, kwargs.get('nhidden', 512))
        self.target_network = QNetwork(self.obs_size, self.num_actions, kwargs.get('nhidden', 512))
        self.target_network.copyparams(self.model_network)

        self.explorer = EpsilonGreedyExplorer(kwargs.get('epsilon', .1), self.num_actions, self.model_network)

        self.memory = ReplayMemory(self.obs_size, kwargs.get('mem_size', 100))
        self.optimizer = self.init_optimizer(self.model_network, kwargs.get('learning_rate', .01))

        self.gamma = kwargs.get('gamma', .99)
        self.minibatch_size = kwargs.get('minibatch_size', 32)
        self.epoch_length = kwargs.get('epoch_length', 100)
        
        self.step_counter = 0
        self.current_loss = .0

Note: this is the solution version - students should implement the explorer and model update.

In [None]:
class EpsilonGreedyExplorer(object):
    """Implements an epsilon greedy exploration policy"""
    
    def __init__(self, epsilon, num_actions, model):
        self.epsilon = epsilon
        self.num_actions = num_actions
        self.model = model

    def next_action(self, state):

        if random.random() < self.epsilon:
            # explore
            return random.randint(0, self.num_actions-1)

        # exploit
        Q = self.model(state)
        action_index = Q.data.argmax()
        return action_index

In [None]:
def step(self, obs, reward, done, info):

    if self.step_counter > 0:
        self.memory.observe(self.prev_obs, self.prev_action, reward, done)

    action = self.explorer.next_action(
                Variable(obs.reshape(1, obs.shape[0])))

    # start training after 1 epoch
    if self.step_counter > self.epoch_length:
        self.current_loss = self.update_model()

    self.step_counter += 1
    self.prev_action = action
    self.prev_obs = obs

    # decay epsilon after each epoch
    if self.step_counter % self.epoch_length == 0:
        self.explorer.epsilon = max(0.05, self.explorer.epsilon * .95)

    return action

QLearningAgent.step = step

In [None]:
class QNetwork(Chain):
    """The neural network architecture as a Chainer Chain - here: single hidden layer"""

    def __init__(self, obs_size, num_actions, nhidden):
        """Initialize weights"""
        # use LeCunUniform weight initialization for weights
        self.initializer = initializers.LeCunUniform()
        self.bias_initializer = initializers.Uniform(1e-4)

        super(QNetwork, self).__init__(
            feature_layer = L.Linear(obs_size, nhidden,
                                initialW = self.initializer,
                                initial_bias = self.bias_initializer),
            action_values = L.Linear(nhidden, num_actions, 
                                initialW=self.initializer,
                                initial_bias = self.bias_initializer)
        )

    def __call__(self, x):
        """implements forward pass"""
        h = F.relu(self.feature_layer(x))
        return self.action_values(h)

In [None]:
def update_model(self):
    (s, action, reward, s_next, is_terminal) = self.memory.sample_minibatch(self.minibatch_size)

    # compute Q targets (max_a' Q_hat(s_next, a'))
    Q_hat = self.target_network(s_next)
    Q_hat_max = F.max(Q_hat, axis=1, keepdims=True)
    y = (1-is_terminal)*self.gamma*Q_hat_max + reward

    # compute Q(s, action)
    Q = self.model_network(s)
    Q_subset = F.reshape(F.select_item(Q, action), (self.minibatch_size, 1))

    # compute a Huber-style clipped loss: quadratic for |error| <= 1, linear otherwise
    error = y - Q_subset
    loss_clipped = abs(error) * (abs(error.data) > 1) + (error**2) * (abs(error.data) <= 1)
    loss = F.sum(loss_clipped) / self.minibatch_size

    # perform model update
    self.model_network.zerograds() ## zero out the accumulated gradients in all network parameters
    loss.backward()
    self.optimizer.update()
    
    # target network tracks the model
    for dst, src in zip(self.target_network.params(), self.model_network.params()):
        dst.data = self.tau * src.data + (1 - self.tau) * dst.data

    return loss.data

QLearningAgent.update_model = update_model

In [None]:
def init_optimizer(self, model, learning_rate):

    optimizer = optimizers.SGD(learning_rate)
    # optimizer = optimizers.Adam(alpha=learning_rate)
    # optimizer = optimizers.AdaGrad(learning_rate)
    # optimizer = optimizers.RMSpropGraves(learning_rate, 0.95, self.momentum, 1e-2)

    optimizer.setup(model)
    return optimizer

QLearningAgent.init_optimizer = init_optimizer

In [None]:
class ReplayMemory(object):
    """Implements basic replay memory"""

    def __init__(self, observation_size, max_size):
        self.observation_size = observation_size
        self.num_observed = 0
        self.max_size = max_size
        self.samples = {
                 'obs'      : np.zeros(self.max_size * 1 * self.observation_size,
                                       dtype=np.float32).reshape(self.max_size, 1, self.observation_size),
                 'action'   : np.zeros(self.max_size * 1, dtype=np.int16).reshape(self.max_size, 1),
                 'reward'   : np.zeros(self.max_size * 1).reshape(self.max_size, 1),
                 'terminal' : np.zeros(self.max_size * 1, dtype=np.int16).reshape(self.max_size, 1),
               }

    def observe(self, state, action, reward, done):
        index = self.num_observed % self.max_size
        self.samples['obs'][index, :] = state
        self.samples['action'][index, :] = action
        self.samples['reward'][index, :] = reward
        self.samples['terminal'][index, :] = done
        
        self.num_observed += 1
        
    def sample_minibatch(self, minibatch_size):
        max_index = min(self.num_observed, self.max_size) - 1
        sampled_indices = np.random.randint(max_index, size=minibatch_size)
        
        s      = Variable(np.asarray(self.samples['obs'][sampled_indices, :], dtype=np.float32))
        s_next = Variable(np.asarray(self.samples['obs'][sampled_indices+1, :], dtype=np.float32))

        a      = Variable(self.samples['action'][sampled_indices].reshape(minibatch_size))
        r      = self.samples['reward'][sampled_indices].reshape((minibatch_size, 1))
        done   = self.samples['terminal'][sampled_indices].reshape((minibatch_size, 1))

        return (s, a, r, s_next, done)

### Experiment 3: DQN on SimpleRooms

It's time to test your DQN implementation. Are you ready? The experiment on the SimpleRooms task below is a good test case: the task can be learned in fewer than 5000 steps. If your learning curve stays flat, something is wrong.
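
How much the agent still explores during this run depends on the epsilon schedule in `QLearningAgent.step`, which multiplies epsilon by 0.95 once per epoch and never lets it drop below 0.05. A small standalone sketch of that schedule (the function `epsilon_after` is illustrative and not part of the tutorial code; its defaults match the settings used in this experiment):

```python
def epsilon_after(steps, epsilon=1.0, epoch_length=100, decay=0.95, floor=0.05):
    """Exploration rate after `steps` agent steps, decayed once per epoch."""
    for _ in range(steps // epoch_length):
        epsilon = max(floor, epsilon * decay)
    return epsilon

for steps in (0, 1000, 5000, 10000):
    print(steps, round(epsilon_after(steps), 3))
```

With these settings epsilon is still roughly 0.08 after 5000 steps, so a small share of actions remains random at the end of the run; this is worth keeping in mind when reading the rolling average reward.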

In [None]:
simple_env = SimpleRooms()

simple_q_agent = QLearningAgent(
    simple_env.action_space.actions,
    16, # observation size
    nhidden = 512,
    epsilon = 1.,
    mem_size = 10000,
    learning_rate = .5,
    tau = .001,
    minibatch_size = 32,
    epoch_length = 100)
simple_q_experiment = Experiment(simple_env, simple_q_agent)

In [None]:
simple_q_experiment.run(5000, 10)

### Experiment 4: DQN Agent on MineRL Navigation

Now we're ready to test our DQN agent on our discretized Minecraft navigation task. Again, if everything is implemented correctly, the reward should go up in fewer than 3000 training steps.

In [None]:
wrapped_env = DiscreteMinecraftEnvWrapper(nav_env)
minerl_q_agent = QLearningAgent(
    wrapped_env.action_space.actions,
    28, # observation size
    nhidden = 512,
    epsilon = 1.,
    mem_size = 10000,
    learning_rate = .5,
    tau = .001,
    minibatch_size = 32,
    epoch_length = 100)
minerl_q_experiment = Experiment(wrapped_env, minerl_q_agent)

In [None]:
minerl_q_experiment.run(3000, 50)

In [None]:
nav_env.close()

### Conclusion and Next Steps

## TO DO

- Update / clean up reward visualization
- Update SimpleRooms rendering
- Remove solution code from student version
- Complete text / instructions
- Add conclusion - next steps: additional experiments to run
- Get feedback