# Learning about Reinforcement Learning #

<a href="https://colab.research.google.com/github/deepmind/educational/blob/master/colabs/introductory/reinforcement_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



> <p><small><small><b>Copyright 2020 DeepMind Technologies Limited.</b></p>
> <p><small><small> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at </p>
> <p><small><small> <a href="https://www.apache.org/licenses/LICENSE-2.0">https://www.apache.org/licenses/LICENSE-2.0</a> </p>
> <p><small><small> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. </p>



**Aim**

Reinforcement learning has been successful in solving challenging problems, such as the games of Go and StarCraft. It has significant potential for many real-world problems, such as drug discovery, global warming, or even discovering new scientific theories in physics, mathematics, and other fundamental sciences.


This tutorial will go through an example of applying Reinforcement Learning to solve a problem and familiarise you with basic concepts used to formalise it. We will load an environment and an agent, specify a reward function, and then train a neural network to perform the task!

**Disclaimer**

This code is intended for educational purposes and, in the name of readability for a non-technical audience, does not always follow best practices for software engineering.


**Links to resources**
- [What is Colab?](https://colab.sandbox.google.com/notebooks/intro.ipynb) If you have never used Colab before, get started here!

## Reinforcement Learning


In Reinforcement Learning we want to train an **agent** to maximise the total **reward** it receives within a fixed duration of interacting with an **environment**. 

The following diagram illustrates the interaction between the agent and environment.
<center>
<img src="https://storage.googleapis.com/dm-educational/assets/reinforcement_learning/rl_loop_illustrated.png" width="500" />
</center>
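
To make the loop in the diagram concrete, here is a minimal sketch in plain Python. The `CoinFlipEnvironment` class and the `run_episode` function are hypothetical names made up for this illustration (they are not part of Gym or Acme), and the "agent" is just a function that always picks action 1.

```python
import random


class CoinFlipEnvironment:
  """Hypothetical toy environment: guess a coin flip on every step."""

  def __init__(self, episode_length=10, seed=0):
    self._episode_length = episode_length
    self._rng = random.Random(seed)
    self._steps = 0

  def reset(self):
    self._steps = 0
    return 0  # A dummy observation; this toy task has nothing to observe.

  def step(self, action):
    coin = self._rng.randint(0, 1)
    reward = 1.0 if action == coin else 0.0
    self._steps += 1
    done = self._steps >= self._episode_length
    return 0, reward, done


def run_episode(environment, policy):
  """Runs one episode and returns the total reward collected."""
  observation = environment.reset()
  total_reward = 0.0
  done = False
  while not done:
    action = policy(observation)                          # The agent acts.
    observation, reward, done = environment.step(action)  # The environment responds.
    total_reward += reward
  return total_reward


total = run_episode(CoinFlipEnvironment(), policy=lambda observation: 1)
print('Total reward:', total)
```

The environments used later in this tutorial follow the same pattern of observing, acting, and receiving a reward, just with richer observations and a slightly different interface.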

In this tutorial we will focus on simple environments to familiarise you with the process of training an agent. The simplest environment, shown below, is CartPole. This environment consists of a pole attached to a cart via a hinge. The agent needs to move the cart to the left or to the right in order to prevent the pole from falling over - this can be learned in just a few minutes.

<center>
<img src="https://storage.googleapis.com/dm-educational/assets/reinforcement_learning/42135683-dde5c6f0-7d13-11e8-90b1-8770df3e40cf.gif" width="500" />
</center>



# Pre-requisites #

Before we start, we'll have to set up a few things. This colab will involve running a number of *cells*, each containing some code. If you hover over the brackets at the top left of the cell below, they should turn into a play button, which you can use to start or stop running the code in that cell.

Click the play button on each of the next three cells to install the software packages, import Python modules, and implement some functions that we'll use under the hood. (You should see it change to a "stop" icon while it's running.)

This should only take around 30 seconds.

In [None]:
#@title Install software packages  {'form-width':'30%'}
%reset -f
!apt-get update
!apt-get install -y xvfb python-opengl ffmpeg
!pip install gym
!pip install imageio
!pip install PILLOW
!pip install pyglet
!pip install pyvirtualdisplay

!pip install dm-acme
!pip install dm-acme[reverb,tf,envs]

from IPython.display import clear_output
clear_output()

In [None]:
#@title Import python libraries {'form-width':'30%'}

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import PIL.Image
import pyvirtualdisplay
import numpy as np

import gym
import dm_env
import reverb
import sonnet as snt
import tensorflow as tf

from acme import environment_loop
from acme.tf import networks
from acme.adders import reverb as adders
from acme.agents.tf import actors
from acme.datasets import reverb as datasets
from acme.wrappers import atari_wrapper, gym_wrapper
from acme import specs
from acme import wrappers
from acme.agents.tf import dqn
from acme.agents import agent
from acme.tf import utils
from acme.utils import loggers

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.rcdefaults()
plt.xkcd()

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()


In [None]:
#@title Set up utilities {'form-width':'30%'}


def step_agent_in_environment(env, agent=None, num_episodes=3):
  """Steps an agent in an environment."""
  frames = []
  actions = []

  for n in range(num_episodes):
    timestep = env.reset()
    while not timestep.last():
      frames.append(env.render(mode='rgb_array'))
      if callable(agent):
        action = agent(timestep.observation)
      else:
        action = agent.select_action(timestep.observation)
      actions.append(action)
      timestep = env.step(action)

  return frames, actions


def show_video(frames):
  """Show video."""
  video_filename = 'imageio.mp4'
  # Write video:
  with imageio.get_writer(video_filename, fps=60) as video:
    for frame in frames:
      video.append_data(frame)
  # Read video and show it:
  with open(video_filename, 'rb') as video_file:
    b64 = base64.b64encode(video_file.read())
  tag = """
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>""".format(b64.decode())

  return IPython.display.HTML(tag)


print('All set!')

## Environments

Next, let's load the RL environment we want to use!

Environments in RL represent the task or problem that we are trying to solve. There are many types of environments, such as computer games, simulated robotics settings, etc. Some of these can take several hours or days for an agent to learn! 

First, pick [CartPole](https://github.com/openai/gym/wiki/CartPole-v0) from the drop-down menu, as it is the fastest environment to train on. Later you can choose another environment to play with!

Running this cell will show you an image of what the environment looks like.

In [None]:
#@title Load an environment
environment_name = 'CartPole'  #@param ['MountainCar', 'CartPole', 'Atari']

if 'CartPole' in environment_name:
  environment_train = gym_wrapper.GymWrapper(gym.make('CartPole-v0'))
  environment_train = wrappers.SinglePrecisionWrapper(environment_train)
  environment = gym_wrapper.GymWrapper(gym.make('CartPole-v0'))
  environment = wrappers.SinglePrecisionWrapper(environment)
  # Just for visualisation / evaluation, we'll set different angle limits
  environment.env.theta_threshold_radians = 10.0
elif 'MountainCar' in environment_name:
  environment_train = gym_wrapper.GymWrapper(gym.make('MountainCar-v0'))
  environment_train = wrappers.SinglePrecisionWrapper(environment_train)
  environment = environment_train
elif 'Atari' in environment_name:
  environment_train = gym_wrapper.GymAtariAdapter(gym.make('Pong-v0'))
  environment_train = atari_wrapper.AtariWrapper(environment_train)
  environment_train = wrappers.SinglePrecisionWrapper(environment_train)
  environment = environment_train
else:
  raise ValueError('Unknown environment: {}.'.format(environment_name))

# Random Agent

Before we train an agent, we will start with a *random agent*. The random agent does not take environment observations into account when choosing to take an action and just chooses actions randomly from the set of all possible actions. You can generate random actions using the code below and see the behaviour of the agent:


In [None]:
action_space = environment.action_space

def int_random_action(state):
  # state is unused for random agent
  return action_space.sample()


output = environment.reset()
print('random action:', int_random_action(None))
print('random action:', int_random_action(None))
print('random action:', int_random_action(None))
print('random action:', int_random_action(None))

frames, actions = step_agent_in_environment(
    env=environment, agent=int_random_action, num_episodes=5)

print('actions = {}'.format(actions))

show_video(frames)

# Custom Agent

<center>
<img src="https://storage.googleapis.com/dm-educational/assets/reinforcement_learning/random_policy.png" width="500" />
</center>

Let's now try and manually code a better behaviour into the agent. The code below defines a way to run a custom set of actions. A value of 0 for the action means the cart is pushed left, and a value of 1 means the cart is pushed right.
Currently, the code is set to a constant zero, which pushes the cart left all the time (you'll see this if you run it and play the video!).

You can modify the function to specify what action should be taken depending on the observation.
One thing you could try is to use an ***if statement***, which in Python looks like the following.

```
if [some_condition]:
  action = [some_value]
else:
  action = [some_other_value]
```
Looking at the code below, you have access to the `cart_position`, `cart_velocity`, `pole_angle`, and `pole_velocity_at_tip`.
Can you think of what conditions you might use to set the actions based on these?
**Ask for help if you're not sure!**

In [None]:
def custom_action_for_cartpole(state):
  # for cartpole only:
  cart_position = state[0]
  cart_velocity = state[1]
  pole_angle = state[2]
  pole_velocity_at_tip = state[3]

  # Instead of making the action 0 (in cartpole: go left), try to come up with
  # a better behavior.
  action = 0

  return action


output = environment.reset()

frames, actions = step_agent_in_environment(
    env=environment, agent=custom_action_for_cartpole, num_episodes=5)

show_video(frames)
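
If you get stuck, here is one possible starting point, offered as a sketch rather than the best answer. It assumes that a positive `pole_angle` means the pole is leaning to the right, and pushes the cart towards the side the pole is leaning to. The function name `lean_following_action` is made up for this example.

```python
def lean_following_action(state):
  """Pushes the cart towards the side the pole is leaning to."""
  pole_angle = state[2]  # Assumed convention: positive means leaning right.
  if pole_angle > 0:
    return 1  # Push right.
  else:
    return 0  # Push left.


# Try the rule on two hand-written states.
print(lean_following_action([0.0, 0.0, 0.05, 0.0]))   # Pole leans right.
print(lean_following_action([0.0, 0.0, -0.05, 0.0]))  # Pole leans left.
```

To watch it in action, you could pass `lean_following_action` as the `agent` argument of `step_agent_in_environment`, in the same way as `custom_action_for_cartpole` above. It ignores the other three values in the state, so there is room to improve on it.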

# How to train your agent? #

Getting an agent to solve a task by using random actions is clearly not optimal. 

Scientists around the world have invented many methods for solving the reinforcement learning problem, each with different strengths and weaknesses. One of the most famous modern methods is **Deep Q-Networks (DQN)**, first proposed by scientists at DeepMind in 2013. It was the first RL method to solve Atari games from images.

Here we will use this method to train our agent. You don't need to worry about all the details but if you are interested you can later look at how changing some aspects of the code makes your agent learn faster or slower.

## Set up the agent #

Hit run in the next cell to set up the agent. This will set up a *Q-network*, an *optimizer* that trains the network, a *replay buffer* which stores all of the experience obtained by the agent, and a DQN agent that puts these pieces together.


In [None]:
#@title Agent setup  {'form-width':'30%'}

def setup_agent(
    environment,
    learning_rate,
    batch_size=64,
    max_replay_size=1000,
    logger=None,
):
  """Sets up the agent before training."""

  environment_spec = specs.make_environment_spec(environment)

  network = snt.Sequential([
      lambda x: tf.cast(x, tf.float32),
      snt.Flatten(),
      snt.nets.MLP([100, environment_spec.actions.num_values])
  ])

  # Construct the agent.
  agent = dqn.DQN(
      environment_spec=environment_spec,
      learning_rate=learning_rate,
      batch_size=batch_size,
      max_replay_size=max_replay_size,
      network=network,
      checkpoint=False,
      logger=logger,
  )

  return agent
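
To give a feel for what a *replay buffer* does, here is a deliberately simplified sketch. It is not how Acme stores experience (the agent above uses a far more capable system), and the `SimpleReplayBuffer` class is a hypothetical name made up for this illustration. The idea it shows is the same, though: keep the most recent experiences up to some maximum size, and hand back random batches of them to learn from.

```python
import collections
import random


class SimpleReplayBuffer:
  """Illustrative fixed-size store of past transitions (not Acme's)."""

  def __init__(self, max_size):
    # A deque with maxlen silently drops the oldest item once it is full.
    self._storage = collections.deque(maxlen=max_size)

  def add(self, observation, action, reward, next_observation):
    self._storage.append((observation, action, reward, next_observation))

  def sample(self, batch_size):
    """Returns a random batch of stored transitions."""
    return random.sample(list(self._storage), batch_size)

  def __len__(self):
    return len(self._storage)


replay = SimpleReplayBuffer(max_size=3)
for step in range(5):
  replay.add(observation=step, action=0, reward=1.0, next_observation=step + 1)

print(len(replay))       # Only the 3 most recent transitions are kept.
print(replay.sample(2))  # A random pair of stored transitions.
```

This is why `max_replay_size` matters: a small buffer forgets old experience quickly, while a large one remembers more but needs more memory.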

## How to evaluate success? #
<center>
<img src="https://storage.googleapis.com/dm-educational/assets/reinforcement_learning/rl_question.png" width="200" />
</center>


Before we start training, how can we tell if our agents are any good? To answer that, we need some measure of 'goodness', which is usually related to the total rewards obtained over an episode, which is called the **return**. We can also talk about **loss** instead of reward: these are effectively just the opposite of one another. So we want to maximise the reward or minimise the loss.

For example, in CartPole the agent gets a positive reward of $1$ per timestep. The episode ends if the agent drops the pole or the cart moves too far from the centre. So in order to get as much reward as possible, the agent just needs to make the episode last as long as possible!
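
As a small worked example of the return, the sketch below adds up the rewards from two made-up CartPole episodes. The `episode_return` helper and the episode lengths are invented for illustration.

```python
def episode_return(rewards):
  """Sums the per-timestep rewards collected during one episode."""
  return sum(rewards)


# In CartPole every timestep yields a reward of 1, so an episode that
# lasts 25 timesteps has a return of 25.
short_episode = [1.0] * 25
long_episode = [1.0] * 200

print(episode_return(short_episode))
print(episode_return(long_episode))
```

The longer the agent keeps the pole up, the more rewards get added together, and the higher the return.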



## Set up training #

Hit run in the next cell to set up the training loop. This defines code that takes the agent we set up in the previous step, runs it for a number of training episodes (updating the agent along the way), and prints the performance every few episodes.


In [None]:
#@title Training loop  {'form-width':'30%'}

def train(environment, agent, num_training_episodes, log_every=10):
  """Train the agent via the DQN algorithm"""

  min_actor_steps_before_learning = 1000
  num_actor_steps_per_iteration = 1
  num_learner_steps_per_iteration = 1
  all_returns = []

  learner_steps_taken = 0
  actor_steps_taken = 0
  for episode in range(num_training_episodes):

    timestep = environment.reset()
    agent.observe_first(timestep)
    episode_return = 0

    while not timestep.last():
      # Get an action from the agent and step in the environment.
      action = agent.select_action(timestep.observation)
      next_timestep = environment.step(action)

      # Record the transition.
      agent.observe(action=action, next_timestep=next_timestep)

      # Book-keeping.
      episode_return += next_timestep.reward
      actor_steps_taken += 1
      timestep = next_timestep

      # See if we have some learning to do.
      if (actor_steps_taken >= min_actor_steps_before_learning and
          actor_steps_taken % num_actor_steps_per_iteration == 0):
        # Learn.
        for learner_step in range(num_learner_steps_per_iteration):
          agent.update()
        learner_steps_taken += num_learner_steps_per_iteration

    # Log quantities.
    if episode % log_every == 0 or episode == num_training_episodes - 1:
      print(f'Episode: {episode} | Return: {episode_return} | '
            f'Learner steps: {learner_steps_taken} | '
            f'Actor steps: {actor_steps_taken}')
    all_returns.append(episode_return)

  return all_returns

## Train the agent!

We need to specify the `hyperparameters` of our experiment to start the training. These refer to different parameters, like how many training episodes (how long) we'll train the agent for, how fast it learns, how much memory it keeps to store previous experiences, etc.

**On your first try, leave them at the default values** (num_training_episodes = 200 and learning_rate = 3e-4) and hit run. Afterwards, come back and change the hyperparameters to make your agent learn faster and better!

In [None]:
#@title Train the agent, using some specific hyperparameters

num_training_episodes = 200  # @param {type:"integer"}
learning_rate = 3e-4  # @param {type:"number"}

# Other parameters
batch_size = 64
max_replay_size = 100000

# Set how often to print logs
log_every = 10

# Set up the agent

class NoOpLogger(object):
  """Avoids logging from Acme."""

  def write(self, data):
    pass

agent_logger = NoOpLogger()

agent = setup_agent(
    environment_train,
    learning_rate,
    batch_size=batch_size,
    max_replay_size=max_replay_size,
    logger=agent_logger)

# Use the training environment to train the agent
returns = train(environment_train, agent, num_training_episodes, log_every)

## Evaluate the agent
Now that we have a trained agent, we need to *evaluate* its performance. We can do this in two ways:

 (1) quantitatively: plotting the *training curve*, which shows the performance of the agent (in terms of the total reward obtained) as a function of training time; and

 (2) qualitatively: using the trained agent to interact with the environment and generate a video.

Run the two cells below to see each of these analyses.

In [None]:
#@title Plot the training curve {'form-width':'30%'}

plt.figure(figsize=(10, 5))
plt.plot(range(0, num_training_episodes), returns)
plt.grid(True)
plt.xlabel('Episodes', fontsize=15)
plt.ylabel('Total reward', fontsize=15)
plt.tick_params(labelsize=15)
plt.locator_params(nbins=10)
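
Training curves in RL tend to be quite noisy, so it can help to smooth them before judging progress. The sketch below shows one simple way to do that, a moving average. The `moving_average` helper and the example numbers are made up for illustration.

```python
def moving_average(values, window=10):
  """Averages each value with up to `window - 1` values before it."""
  smoothed = []
  for i in range(len(values)):
    start = max(0, i - window + 1)
    chunk = values[start:i + 1]
    smoothed.append(sum(chunk) / len(chunk))
  return smoothed


example_returns = [10, 30, 20, 40, 100, 60]
print(moving_average(example_returns, window=3))
```

With the variables from the cells above, `plt.plot(moving_average(returns))` should draw a smoothed version of the training curve.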

In [None]:
#@title Show video of the trained agent's behaviour {'form-width':'30%'}

frames, actions = step_agent_in_environment(
    env=environment, agent=agent, num_episodes=5)

show_video(frames)

## Analysis and post-mortem (what went wrong?)

In this environment, by the end of training the agent should be able to reach a maximum reward of $200$ in an episode. Based on the plot above, and the printouts during training, did it get there?

If not, perhaps we can try training again with some different hyperparameters?
Try going back and increasing the number of training episodes (`num_training_episodes`) to $400$, which means the agent will learn for longer.
You could also (separately) try increasing the *learning rate* from 3e-4 (i.e. $3 \times 10^{-4}$) to 1e-3 (i.e. $1 \times 10^{-3}$): this quantity controls how much the agent changes on each training step.
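
To build some intuition for the learning rate, here is a toy sketch that has nothing to do with DQN itself. It tries to find the lowest point of the curve $f(x) = x^2$ by repeatedly taking a step downhill, with the size of each step scaled by the learning rate. The function name and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, num_steps=20, start=5.0):
  """Minimises f(x) = x**2 by repeatedly stepping against the gradient."""
  x = start
  for _ in range(num_steps):
    gradient = 2 * x                  # Derivative of x**2.
    x = x - learning_rate * gradient  # Larger learning rate, larger step.
  return x


for learning_rate in [0.01, 0.1, 1.1]:
  final_x = gradient_descent(learning_rate)
  print(f'learning_rate={learning_rate}: x ends at {final_x:.3f}')
```

The lowest point is at $x = 0$. In this toy problem the smallest learning rate gets there slowly, the middle one gets much closer in the same number of steps, and the largest one overshoots so badly that it ends up further away than where it started. Training an agent involves a similar trade-off, although the values that work well will be different.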

## Additional activities ##
Now that you've used reinforcement learning to train an agent to perform the CartPole task, try giving the task a go yourself and see how you do! You can try controlling CartPole via keyboard at this link: https://fluxml.ai/experiments/cartPole/

If you have any spare time, go back and change the hyperparameters to make your agent learn faster! You can also choose a different environment (e.g. Atari or MountainCar), and run everything again!

Note that Atari might take quite a long time to train - see for yourself!



