
# Code-Driven Introduction to Reinforcement Learning

Welcome to the first practice of our course. Today's practice is inspired by an exercise from the book [Reinforcement Learning](https://rl-book.com/?utm_source=winder&utm_medium=notebook&utm_campaign=rl), by Prof. Dr. Phil Winder. 

In this notebook you will be investigating the fundamentals of reinforcement learning (RL).

## Prerequisites

This practice was developed as a Jupyter notebook. You can use a local installation on your machine or an online host such as [Binder](https://mybinder.org/), [Google Colaboratory](https://colab.research.google.com/) or similar. No significant computing power is needed to run this practice.

We expect that you are familiar with Python and Jupyter notebooks.


### Expectations

The practice uses a very simple and intuitive example that we already saw in the lecture. We chose it because it is easy to follow, and understanding is the main goal.

More complex examples will be covered later in the course. You can see some [industrial examples here](https://rl-book.com/applications/?utm_source=winder&utm_medium=notebook&utm_campaign=rl).

> We do not expect you to understand every detail presented in this notebook.
> Today we focus on learning the rationale behind Reinforcement Learning.

## The Internals of Reinforcement Learning

Before we dive in, let's refresh some key terms and components:

- **An agent**: An application that is able to observe state and suggest an action. It also receives feedback that tells it whether the action was good or not.
- **An environment**: This is where the agent acts. It accepts an action, which alters its internal state, and produces a new observation of that state.

In nearly all of the examples we see in the literature, these two entities are implemented independently. The agent is an RL algorithm and the environment is either real life or a simulation.


The agent and the environment interact through an interface. You have some control over what goes into that interface and a large amount of effort is typically spent improving the quality of the data that flows through it. You need representations of:

- **State**: This is the observation of the environment. You often get to choose what to "show" to the agent. Simplifying the state speeds up learning and helps prevent overfitting, but it often pays to include as much as you can.
- **Action**: Your agent must suggest an action. This mutates the environment in some way. So-called "options" or "null-actions" allow you to do nothing, if that's what you want to do.
- **Reward**: You use the reward to fine-tune your action choices.
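
The interface described above can be sketched as a loop in which the agent picks an action and the environment answers with a new state and a reward. The following is only an illustration: `CorridorEnv`, its five-cell layout, the `run_episode` helper and the always-move-right agent are hypothetical stand-ins, not the environment built later in this notebook.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor with cells 0..4; the goal is the last cell."""

    def __init__(self, length: int = 5):
        self.length = length
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position  # the state shown to the agent

    def step(self, action: int):
        # action is -1 (left) or +1 (right); clamp to stay inside the corridor
        self.position = min(max(self.position + action, 0), self.length - 1)
        reward = -1  # constant penalty per step
        done = self.position == self.length - 1
        return self.position, reward, done


def run_episode(env, choose_action, max_steps: int = 100):
    """Run one episode and return the list of (state, action, reward) tuples."""
    history = []
    state = env.reset()
    for _ in range(max_steps):
        action = choose_action(state)                # agent suggests an action
        next_state, reward, done = env.step(action)  # environment responds
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history


history = run_episode(CorridorEnv(), choose_action=lambda state: +1)
print(history)
```

Everything the agent learns from is in `history`: which state it saw, what it did, and what reward came back.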

### Creating a "GridWorld" Environment

Let's start creating your first environment: a simulation of a simple grid-based "world". Many real-life implementations begin with a simulation of the real world, because it's much easier to iterate and improve your design with a software stub of real life.

The goal of this environment is to define a "world" in which a "robot" can move. The so-called world is actually a series of cells inside a 2-dimensional box. The agent can move north, east, south, or west, which moves the robot between the cells. The agent's objective is to reach a goal cell. There is a reward of -1 for every step, to promote reaching the goal as fast as possible.
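
Because every step costs -1, the best total reward an agent can achieve is minus the length of the shortest path. A small sketch of that arithmetic, assuming a 5x5 grid with the start in the top-left cell `(0, 4)` and the goal in the bottom-right cell `(4, 0)`, which is the layout the cells further down set up; `best_possible_return` is a hypothetical helper name.

```python
def best_possible_return(start, goal, step_reward=-1):
    """Total reward of a shortest path when moves are limited to N/E/S/W."""
    manhattan_steps = abs(goal[0] - start[0]) + abs(goal[1] - start[1])
    return manhattan_steps * step_reward

print(best_possible_return(start=(0, 4), goal=(4, 0)))  # -8
```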

#### Imports and Definitions

First let's import a few libraries (to enable autocompletion in later cells) and define a few important types. The first is the de facto definition of a "point" object, with x and y coordinates, and the second is a direction enumeration. These are used to define the position of the agent in the environment and the direction of movement for an action, respectively. Note that we assume that east moves in the positive x direction and north moves in the positive y direction.

In [1]:
from collections import defaultdict, namedtuple
from enum import Enum
from typing import Tuple, List
import random
from IPython.display import clear_output

Point = namedtuple('Point', ['x', 'y'])
class Direction(Enum):
  NORTH = "⬆"
  EAST = "⮕"
  SOUTH = "⬇"
  WEST = "⬅"

  @classmethod
  def values(cls):
    return [v for v in cls]

#### The Environment Class

Now, we create a Python class that represents the environment. The first function in the class is the initialisation function in which we can specify the width and height of the environment.

After that, we define a helper parameter which encodes the possible actions and then we reset the state of the environment with a `reset` function.

In [2]:
class SimpleGridWorld(object):
  def __init__(self, width: int = 5, height: int = 5, debug: bool = False):
    self.width = width
    self.height = height
    self.debug = debug
    self.action_space = [d for d in Direction]
    self.reset()

#### The Reset Function

Many environments have an implicit "reset", whereby the environment's state is moved away from the goal state. In this implementation, reset takes the agent back to the north-western (top-left) corner, `(0, height - 1)`, but this isn't strictly necessary. Many real-life environments reset randomly or have no reset at all.

We also set the goal, which is located in the south-eastern corner of the environment.

In [3]:
class SimpleGridWorld(SimpleGridWorld):
  def reset(self):
    self.cur_pos = Point(x=0, y=(self.height - 1))
    self.goal = Point(x=(self.width - 1), y=0)
    # If debug, print state
    if self.debug:
      print(self)
    return self.cur_pos, 0, False
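
As noted above, many environments reset to a random state rather than a fixed one. A minimal sketch of that idea, written as a standalone function so it can be run on its own; `random_start` and its arguments are hypothetical and not part of the `SimpleGridWorld` class.

```python
import random
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

def random_start(width: int, height: int, goal: Point, rng: random.Random) -> Point:
    """Pick a uniformly random cell that is not the goal."""
    candidates = [
        Point(x, y)
        for x in range(width)
        for y in range(height)
        if Point(x, y) != goal
    ]
    return rng.choice(candidates)

rng = random.Random(0)  # fixed seed so repeated runs give the same start
start = random_start(width=5, height=5, goal=Point(4, 0), rng=rng)
print(start)
```

Inside the class, `reset` could assign such a point to `self.cur_pos` instead of the fixed corner.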

#### Taking a Step

Recall three of the key components we defined earlier: the state, the action, and the reward. The environment's step function accepts an action, then produces a new state and reward.

The large amount of code is a consequence of the direction implementation. You can refactor this to use fewer lines of code with some clever indexing. However, we'll keep this level of verbosity to make it easier to see what is going on. In essence, every direction moves the current position by one square. You can see the code incrementing or decrementing the x or y coordinates, and then clipping the position so that it stays inside the grid.

The second part of the function is testing to see if the agent is at the goal. If it is, then it signals that it is at a terminal state.

In [4]:
class SimpleGridWorld(SimpleGridWorld):
  def step(self, action: Direction):
    # Depending on the action, mutate the environment state
    if action == Direction.NORTH:
      self.cur_pos = Point(self.cur_pos.x, self.cur_pos.y + 1)
    elif action == Direction.EAST:
      self.cur_pos = Point(self.cur_pos.x + 1, self.cur_pos.y)
    elif action == Direction.SOUTH:
      self.cur_pos = Point(self.cur_pos.x, self.cur_pos.y - 1)
    elif action == Direction.WEST:
      self.cur_pos = Point(self.cur_pos.x - 1, self.cur_pos.y)
    # Check if out of bounds
    if self.cur_pos.x >= self.width:
      self.cur_pos = Point(self.width - 1, self.cur_pos.y)
    if self.cur_pos.y >= self.height:
      self.cur_pos = Point(self.cur_pos.x, self.height - 1)
    if self.cur_pos.x < 0:
      self.cur_pos = Point(0, self.cur_pos.y)
    if self.cur_pos.y < 0:
      self.cur_pos = Point(self.cur_pos.x, 0)

    # If at goal, terminate
    is_terminal = self.cur_pos == self.goal

    # Constant -1 reward to promote speed-to-goal
    reward = -1

    # If debug, print state
    if self.debug:
      print(self)

    return self.cur_pos, reward, is_terminal
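
The "clever indexing" mentioned above could look like the following sketch: each direction maps to an `(dx, dy)` offset and the result is clamped to the grid. `Move`, `DELTAS` and `move` are hypothetical names; `Move` stands in for the `Direction` enumeration so that the block runs by itself without overwriting anything defined earlier.

```python
from collections import namedtuple
from enum import Enum

Point = namedtuple("Point", ["x", "y"])

class Move(Enum):  # stand-in for Direction, with plain-text values
    NORTH = "N"
    EAST = "E"
    SOUTH = "S"
    WEST = "W"

# Offset applied to (x, y) for each move; north is +y and east is +x
DELTAS = {
    Move.NORTH: (0, 1),
    Move.EAST: (1, 0),
    Move.SOUTH: (0, -1),
    Move.WEST: (-1, 0),
}

def move(pos: Point, action: Move, width: int, height: int) -> Point:
    """Apply the offset, then clamp the result so it stays on the grid."""
    dx, dy = DELTAS[action]
    x = min(max(pos.x + dx, 0), width - 1)
    y = min(max(pos.y + dy, 0), height - 1)
    return Point(x, y)

print(move(Point(0, 4), Move.SOUTH, width=5, height=5))  # Point(x=0, y=3)
print(move(Point(0, 4), Move.WEST, width=5, height=5))   # Point(x=0, y=4)
```

The second call shows the clamping: moving west from the left edge leaves the position unchanged, as in the verbose version.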

#### Visualisation

And finally, as in all of data science, it is vitally important that you are able to visualise the behaviour and performance of your agent. The first step in this process is being able to visualise the agent within the environment. The next function does this by printing a textual grid, with an `x` at the agent's location, an `o` at the goal, an `@` if the agent is on top of the goal, and a `_` otherwise.

In [5]:
class SimpleGridWorld(SimpleGridWorld):
  def __repr__(self):
    res = ""
    for y in reversed(range(self.height)):
      for x in range(self.width):
        if self.goal.x == x and self.goal.y == y:
          if self.cur_pos.x == x and self.cur_pos.y == y:
            res += "@"
          else:
            res += "o"
          continue
        if self.cur_pos.x == x and self.cur_pos.y == y:
          res += "x"
        else:
          res += "_"
      res += "\n"
    return res

### Running the Environment

To run the environment you need to instantiate the class (the constructor calls `reset`, which places the agent at the start), then perform a series of actions to move the agent. For now, let's move it manually to make sure it is working, visualising the agent at each step. We also print the result of a step (the new state, reward, and terminal flag) for completeness.

In [6]:
s = SimpleGridWorld(debug=True)
print("☝ This shows a simple visualisation of the environment state.\n")
s.step(Direction.SOUTH)
print(s.step(Direction.SOUTH), "⬅ This displays the state and reward from the environment 𝐀𝐅𝐓𝐄𝐑 moving.\n")
s.step(Direction.SOUTH)
s.step(Direction.SOUTH)
s.step(Direction.EAST)
s.step(Direction.EAST)
s.step(Direction.EAST)
s.step(Direction.EAST)

x____
_____
_____
_____
____o

☝ This shows a simple visualisation of the environment state.

_____
x____
_____
_____
____o

_____
_____
x____
_____
____o

(Point(x=0, y=2), -1, False) ⬅ This displays the state and reward from the environment 𝐀𝐅𝐓𝐄𝐑 moving.

_____
_____
_____
x____
____o

_____
_____
_____
_____
x___o

_____
_____
_____
_____
_x__o

_____
_____
_____
_____
__x_o

_____
_____
_____
_____
___xo

_____
_____
_____
_____
____@



(Point(x=4, y=0), -1, True)

#### Key Takeaways

There are a few key lessons that you should commit to memory:

- The **state** is an observation of the environment, which contains everything outside of the agent. For example, the agent's current position within the environment. In real-world applications this could be the time of day, the weather, data from a video camera, literally anything.
- The **reward** fully specifies the optimal solution to the problem. In real life this might be profit or the number of new customers.
- Every **action** mutates the state of the environment. This may or may not be observable.

## A Simple Reinforcement Learning Solution

The simple algorithm we're about to implement operates by sampling the environment. In general, the idea is that if you can sample the environment enough times, you can begin to build a picture of the output, given any input. We can use this idea in RL. If we capture enough _trajectories_, where a trajectory is one full pass through an environment, then we can see which states are advantageous.

To begin, we create a class that is capable of generating trajectories. We pass in the environment, and the `run` function then repeatedly performs a step in the environment using a random action. Each step is stored in a list, which is returned to the user.

> Note that this algorithm is known as the Monte Carlo method, which we will introduce in the next lecture.

In [7]:
class MonteCarloGeneration(object):
  def __init__(self, env: object, max_steps: int = 1000, debug: bool = False):
    self.env = env
    self.max_steps = max_steps
    self.debug = debug

  def run(self) -> List:
    buffer = []
    n_steps = 0 # Keep track of the number of steps so I can bail out if it takes too long
    state, _, _ = self.env.reset() # Reset environment back to start
    terminal = False
    while not terminal: # Run until terminal state
      action = random.choice(self.env.action_space) # Random action. Try replacing this with Direction.EAST
      next_state, reward, terminal = self.env.step(action) # Take action in environment
      buffer.append((state, action, reward)) # Store the result
      state = next_state # Ready for the next step
      n_steps += 1
      if n_steps >= self.max_steps:
        if self.debug:
          print("Terminated early due to large number of steps")
        terminal = True # Bail out if we've been working for too long
    return buffer

#### Visualising Trajectories

As before, it's vitally important to visualise as much as possible, to gain an intuition into your problem. A simple first step is to view the agent's movement and trajectory. Here we severely limit the amount of exploration to save reams of output. Depending on your random seed you will see the agent stumbling around.

In [8]:
env = SimpleGridWorld(debug=True) # Instantiate the environment
generator = MonteCarloGeneration(env=env, max_steps=10, debug=True) # Instantiate the generation
trajectory = generator.run() # Generate a trajectory
print([t[1].value for t in trajectory]) # Print chosen actions
print(f"total reward: {sum([t[2] for t in trajectory])}") # Print final reward

x____
_____
_____
_____
____o

x____
_____
_____
_____
____o

_____
x____
_____
_____
____o

_____
_____
x____
_____
____o

_____
_____
_x___
_____
____o

_____
_____
__x__
_____
____o

_____
_____
___x_
_____
____o

_____
_____
_____
___x_
____o

_____
_____
___x_
_____
____o

_____
_____
_____
___x_
____o

_____
_____
___x_
_____
____o

_____
___x_
_____
_____
____o

Terminated early due to large number of steps
['⬇', '⬇', '⮕', '⮕', '⮕', '⬇', '⬆', '⬇', '⬆', '⬆']
total reward: -10
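
Because the actions are drawn with `random.choice`, the trajectory changes from run to run. One way to make a run repeatable is to seed Python's `random` module first. The sketch below illustrates this in isolation; the seed value 42, the `ACTIONS` list and the helper `random_actions` are arbitrary choices for illustration.

```python
import random

ACTIONS = ["N", "E", "S", "W"]

def random_actions(n_steps: int, seed: int):
    """Draw n_steps random actions after seeding the global generator."""
    random.seed(seed)
    return [random.choice(ACTIONS) for _ in range(n_steps)]

first = random_actions(10, seed=42)
second = random_actions(10, seed=42)
print(first == second)  # True: same seed, same sequence
```

Calling `random.seed(...)` before `generator.run()` should have the same effect on the cell above, since the generator draws from the same global random module.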


### Quantifying Value

There's an important quantity called the _action value function_. In summary, it is a measure of the value of taking a particular action in a particular state, given all the experience so far. In other words, you can look at the previous trajectories, find out which of them led to the highest values and look to use them again.

To generate an estimate of this value, generate a full trajectory, then look at how far away the agent is from the terminal state at each step.

So this means we need a class to generate a full trajectory, from start to termination. That code is below. First we create a new class that accepts the generator from before; we'll use this later to generate the full trajectory.

Then we create two fields to retain the experience observed by the agent. The first accumulates the return observed for each state-action pair, which is effectively the distance to the goal. The second records the number of times the agent has visited that pair.

Then we create a helper function to return a key for the dictionary (a.k.a. map) and an action value function to calculate the value of taking each action in each state. This is simply the average value over all visits.

In [9]:
class MonteCarloExperiment(object):
  def __init__(self, generator: MonteCarloGeneration):
    self.generator = generator
    self.values = defaultdict(float)
    self.counts = defaultdict(float)

  def _to_key(self, state, action):
    return (state, action)
  
  def action_value(self, state, action) -> float:
    key = self._to_key(state, action)
    if self.counts[key] > 0:
      return self.values[key] / self.counts[key]
    else:
      return 0.0
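
In symbols, the estimate returned by `action_value` can be written as below. The notation is an assumption made here for illustration and does not appear in the original notebook: $N(s, a)$ is the visit count stored in `counts`, $G_i(s, a)$ is the total reward collected from the $i$-th visit to the pair $(s, a)$ until the end of that episode, and the sum over all visits is what `values` accumulates.

$$
Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i(s, a), \qquad Q(s, a) = 0 \text{ if } N(s, a) = 0
$$

Averaging over every visit in this way is commonly called every-visit Monte Carlo estimation.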

Next we create a function to store this data after generating a full trajectory. There are several important parts of this function.

The first is that we are using reversed trajectories, i.e. we start from the end and work backwards.

The second is that we average the expected return over all visits. So this is reporting the value of an action, on average.

In [10]:
class MonteCarloExperiment(MonteCarloExperiment):
  def run_episode(self) -> None:
    trajectory = self.generator.run() # Generate a trajectory
    episode_reward = 0
    for state, action, reward in reversed(trajectory): # Starting from the terminal state
      key = self._to_key(state, action)
      episode_reward += reward  # Accumulate the return from this step to the end of the episode
      self.values[key] += episode_reward # And add this to the value of this action
      self.counts[key] += 1 # Increment counter
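
To see why the trajectory is traversed in reverse, consider a hypothetical three-step trajectory with a reward of -1 per step. Walking backwards lets a single running sum produce the return from every step onward. The state labels `s0` to `s2` and the variable names are illustrative only.

```python
# Hypothetical trajectory of (state, action, reward) tuples
toy_trajectory = [("s0", "E", -1), ("s1", "E", -1), ("s2", "E", -1)]

running_return = 0
returns = []
for state, action, reward in reversed(toy_trajectory):
    running_return += reward  # reward of this step plus everything after it
    returns.append((state, action, running_return))

for entry in reversed(returns):  # print in the original, forward order
    print(entry)
# ('s0', 'E', -3)
# ('s1', 'E', -2)
# ('s2', 'E', -1)
```

The last step before the goal has a return of -1, and each earlier step is one unit worse, which is the "distance to the goal" interpretation used in the text.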

#### Running the Trajectory Generation

Let's test this by setting some expectations. We're reporting the value of taking an action **on average**. So on average, you would expect the value of taking the `EAST` action when next to the terminal state to be -1, because the goal is right there: a single step, and therefore a single reward of -1, reaches the terminal state.

However, other directions will not be -1, because the agent will continue to stumble around.

In [11]:
env = SimpleGridWorld(debug=False) # Instantiate the environment - set debug to True to see the actual movement of the agent.
generator = MonteCarloGeneration(env=env, debug=True) # Instantiate the trajectory generator
agent = MonteCarloExperiment(generator=generator)
for i in range(4):
  agent.run_episode()
  print(f"Run {i}: ", [agent.action_value(Point(3,0), d) for d in env.action_space])

Run 0:  [0.0, -1.0, 0.0, 0.0]
Run 1:  [-3.0, -1.0, -4.0, -12.0]
Run 2:  [-3.0, -1.0, -4.0, -27.5]
Run 3:  [-3.0, -1.0, -4.0, -27.5]


So you can see from above that yes, when choosing east from the point to the west of the terminal state the expected return is -1. But notice that the agent (probably) did not observe that result straight away, because it takes time to randomly select it. (Run it a few more times to see what happens; you'll see random changes.)

#### Visualising the Value Function

The value function (which we will refer to as "state value function" from now on) is the average expected return for all actions. In general, you should see that the value increases the closer you get to the goal. But because of the random movement, especially far away from the goal, there will be a lot of noise.

Below we create a helper function to plot this.

In [12]:
def state_value_2d(env, agent):
    res = ""
    for y in reversed(range(env.height)):
      for x in range(env.width):
        if env.goal.x == x and env.goal.y == y:
          res += "   @  "
        else:
          state_value = sum([agent.action_value(Point(x,y), d) for d in env.action_space]) / len(env.action_space)
          res += f'{state_value:6.2f}'
        res += " | "
      res += "\n"
    return res

print(state_value_2d(env, agent))

-59.33 | -57.79 | -37.00 | -36.34 | -37.07 | 
-56.65 | -51.16 | -47.75 | -42.79 | -48.41 | 
-43.71 | -42.96 | -43.94 | -44.38 | -35.67 | 
-40.05 | -41.30 | -31.06 | -22.08 |  -8.62 | 
-38.38 | -30.30 | -14.50 |  -8.88 |    @   | 
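
The averaging inside `state_value_2d` can be summarised in one line. Using notation assumed here for illustration, with $\mathcal{A}$ the set of four actions and $Q(s, a)$ the value returned by `action_value`:

$$
V(s) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} Q(s, a)
$$

This matches the value of a state under an agent that picks each action with equal probability, which is how the trajectories are generated here. Pairs that have never been visited contribute 0, so early estimates can look better than they are.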



### Generating Optimal Policies

A _policy_ is a set of rules that an agent should follow. It is a strategy that works for that particular environment. You can now generate thousands of trajectories and track the expected value over time.

With enough averaging, the expected values should present a clear picture of what the optimal policy is. See if you can spot what it is.

In the code below we are instantiating all the previous code and then generating 1000 episodes. Then we print out the state value function for every position.

In [13]:
env = SimpleGridWorld() # Instantiate the environment
generator = MonteCarloGeneration(env=env) # Instantiate the trajectory generator
agent = MonteCarloExperiment(generator=generator)
for i in range(1000):
  clear_output(wait=True)
  agent.run_episode()
  print(f"Iteration: {i}")
  # print([agent.action_value(Point(0,4), d) for d in env.action_space]) # Uncomment this line to see the actual values for a particular state
  print(state_value_2d(env, agent), flush=True)
  # time.sleep(0.1) # Uncomment this line (and add `import time` at the top) if you want to see every episode

Iteration: 999
-98.85 | -98.64 | -94.76 | -89.27 | -84.37 | 
-97.69 | -95.11 | -90.74 | -87.81 | -82.06 | 
-91.29 | -88.08 | -80.15 | -73.38 | -64.66 | 
-87.33 | -82.64 | -72.66 | -57.03 | -38.75 | 
-81.87 | -77.20 | -64.83 | -40.97 |    @   | 



#### Plotting the Optimal Policy

That's right! The optimal policy is to choose the action with the highest expected return. In other words, you want to move the agent towards regions of higher value.

Let me create another helper function to visualise where the maximal actions are pointing...

In [14]:
def argmax(a):
    return max(range(len(a)), key=lambda x: a[x])

def next_best_value_2d(env, agent):
    res = ""
    for y in reversed(range(env.height)):
      for x in range(env.width):
        if env.goal.x == x and env.goal.y == y:
          res += "@"
        else:
          # Find the action that has the highest value
          loc = argmax([agent.action_value(Point(x,y), d) for d in env.action_space])
          res += f'{env.action_space[loc].value}'
        res += " | "
      res += "\n"
    return res

print(next_best_value_2d(env, agent))

⬆ | ⮕ | ⬇ | ⮕ | ⬇ | 
⬇ | ⬇ | ⬇ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | ⬇ | ⬇ | 
⬇ | ⮕ | ⮕ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | ⮕ | @ | 



And there you have it. A policy. The grid above spells out what the agent should do in each state. It should move towards regions of higher value. You can see (in general) that the arrows are all pointing towards the goal, as if by magic.

For the arrows that are not, that is a more interesting story. The problem is that the agent is still entirely random at this point. It's stumbling around until it reaches the goal. The agent started in the top left, so on average, it takes a _lot_ of stumbling to find the goal.

Therefore, for the points at the top left, furthest away from the goal, the agent will probably take many more random steps before it reaches the goal. In essence, it _doesn't matter which way the agent goes_. It will still take a long time to get there.

Subsequent Monte Carlo algorithms fix this by updating the action value function every episode and using this knowledge to choose the action. So later iterations of the agent are far more guided and intelligent.
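
One common way to use the current estimates when choosing actions is epsilon-greedy selection: with a small probability pick a random action, otherwise pick the best-looking one. This is a generic technique offered as a sketch, not necessarily the approach taken later in the course; `epsilon_greedy` is a hypothetical helper and the sample values are borrowed from the run output shown earlier.

```python
import random

def epsilon_greedy(action_values: dict, epsilon: float, rng: random.Random):
    """Return a random action with probability epsilon, else the highest-valued one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)              # explore
    return max(actions, key=action_values.get)  # exploit

rng = random.Random(0)
# Illustrative estimates for the cell just west of the goal; EAST looks best
estimates = {"NORTH": -3.0, "EAST": -1.0, "SOUTH": -4.0, "WEST": -27.5}

print(epsilon_greedy(estimates, epsilon=0.0, rng=rng))  # EAST (always greedy)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("EAST") / len(picks))  # fraction of EAST picks, roughly 0.9
```

With `epsilon=0.1` the agent mostly follows its estimates but still tries other actions now and then, so poorly explored cells keep getting visited.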

#### One Final Run

To wrap this up, let's run the whole thing one more time. We will plot the state value function and the policy for all iteration steps. Watch how it changes over time. Add a sleep to slow it down to see it changing on each step.

In [15]:
env = SimpleGridWorld() # Instantiate the environment
generator = MonteCarloGeneration(env=env) # Instantiate the trajectory generator
agent = MonteCarloExperiment(generator=generator)
for i in range(1000):
  clear_output(wait=True)
  agent.run_episode()
  print(f"Iteration: {i}")
  print(state_value_2d(env, agent))
  print(next_best_value_2d(env, agent), flush=True)
  # time.sleep(0.1) # Uncomment this line (and add `import time` at the top) if you want to see the iterations

Iteration: 999
-106.92 | -104.87 | -99.46 | -91.93 | -88.39 | 
-103.63 | -102.25 | -96.18 | -88.39 | -82.27 | 
-101.76 | -96.87 | -90.96 | -82.10 | -71.96 | 
-94.89 | -88.36 | -77.60 | -61.89 | -41.65 | 
-93.03 | -85.19 | -72.19 | -45.81 |    @   | 

⬇ | ⮕ | ⮕ | ⬇ | ⬇ | 
⬇ | ⮕ | ⮕ | ⬇ | ⬇ | 
⬇ | ⬇ | ⬇ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | ⮕ | @ | 



## Summary and Next Steps

You have now implemented your first reinforcement learning algorithm and learned a policy that successfully guides the agent to the goal position!

Now you can play around with the code to see how the learning changes when some aspects of the problem change. Here are some things that you can try:

- Increase or decrease the size of the grid
- Add other terminating states
- Change the reward to a different value
- Change the reward to produce 0 reward per step, and a positive reward for the terminating state
- Add a terminating state with a massive negative reward, to simulate a cliff
- Add a hole in the middle
- Add a wall
- See if you can add the code to use the policy derived by the agent
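
For the last item, one possible starting point is a function that follows a state-to-action lookup table. The sketch below uses a hand-written policy on a hypothetical 3x3 grid; `OFFSETS`, `follow_policy` and the policy itself are illustrative assumptions. With the notebook's agent, the table could instead be built by taking the highest-valued action in each cell.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
OFFSETS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def follow_policy(policy, start, goal, width, height, max_steps=50):
    """Follow a state -> action lookup table until the goal or the step limit."""
    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            break
        dx, dy = OFFSETS[policy[pos]]
        pos = Point(min(max(pos.x + dx, 0), width - 1),
                    min(max(pos.y + dy, 0), height - 1))
        path.append(pos)
    return path

# Hand-written policy: head east until the last column, then south
goal = Point(2, 0)
policy = {Point(x, y): ("S" if x == 2 else "E") for x in range(3) for y in range(3)}
path = follow_policy(policy, start=Point(0, 2), goal=goal, width=3, height=3)
print(path)
```

The step limit matters: a policy learned from noisy estimates can contain loops, and without `max_steps` the rollout would never end.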

### Looking Forward

We see that the algorithm was able to solve the task. However, can we say that it has accurately learned the correct value function and policy?

If not, can you propose some ideas to make the learning more efficient and accurate?
