# Temporal difference prediction and control

In this notebook, you will implement temporal difference approaches to prediction and control described in [Sutton and Barto's book, Introduction to Reinforcement Learning](http://incompleteideas.net/book/the-book-2nd.html). We will use the grid ```World``` class from the previous lectures.

### Install dependencies

In [None]:
! pip install numpy pandas



### Imports

In [None]:
import numpy as np
import random
import sys          # We use sys to get the max value of a float
import pandas as pd # We only use pandas for displaying tables nicely
pd.options.display.float_format = '{:,.3f}'.format

### ```World``` class and globals

The ```World``` is a grid represented as a two-dimensional array of characters where each character can represent free space, an obstacle, or a terminal. Each non-obstacle cell is associated with a reward that an agent gets for moving to that cell (can be 0). The size of the world is _width_ $\times$ _height_ characters.

A _state_ is a tuple $(x,y)$.

An empty world is created in the ```__init__``` method. Obstacles, rewards and terminals can then be added with ```add_obstacle``` and ```add_reward```.

To calculate the next state of an agent (that is, an agent is in some state $s = (x,y)$ and performs and action, $a$), ```get_next_state()```should be called.

In [67]:
# Globals:
ACTIONS = ("up", "down", "left", "right")

# Rewards, terminals and obstacles are characters:
REWARDS = {" ": 0, ".": 0.1, "+": 10, "-": -10}
TERMINALS = ("+", "-") # Note a terminal should also have a reward assigned
OBSTACLES = ("#")

# Discount factor
gamma = 1

# The probability of a random move:
rand_move_probability = 0

class World:
  def __init__(self, width, height):
    self.width = width
    self.height = height
    # Create an empty world where the agent can move to all cells
    self.grid = np.full((width, height), ' ', dtype='U1')
    # Wind strength for each column (default no wind)
    self.wind = np.zeros((width, height, 2), dtype=int)  # (dx, dy)

  def add_obstacle(self, start_x, start_y, end_x=None, end_y=None):
    """
    Create an obstacle in either a single cell or rectangle.
    """
    if end_x == None: end_x = start_x
    if end_y == None: end_y = start_y

    self.grid[start_x:end_x + 1, start_y:end_y + 1] = OBSTACLES[0]

  def add_reward(self, x, y, reward):
    assert reward in REWARDS, f"{reward} not in {REWARDS}"
    self.grid[x, y] = reward

  def add_terminal(self, x, y, terminal):
    assert terminal in TERMINALS, f"{terminal} not in {TERMINALS}"
    self.grid[x, y] = terminal

  def set_wind(self, x,y,dx,dy):
    self.wind[x,y] = (dx,dy)

  def is_obstacle(self, x, y):
    if x < 0 or x >= self.width or y < 0 or y >= self.height:
      return True
    else:
      return self.grid[x ,y] in OBSTACLES

  def is_terminal(self, x, y):
    return self.grid[x ,y] in TERMINALS

  def get_reward(self, x, y):
    """
    Return the reward associated with a given location
    """
    return REWARDS[self.grid[x, y]]

  def get_next_state(self, current_state, action):
    """
    Get the next state given a current state and an action. The outcome can be
    stochastic  where rand_move_probability determines the probability of
    ignoring the action and performing a random move.
    """
    assert action in ACTIONS, f"Unknown acion {action} must be one of {ACTIONS}"

    x, y = current_state

    # If our current state is a terminal, there is no next state
    if self.grid[x, y] in TERMINALS:
      return None

    # Check of a random action should be performed:
    if np.random.rand() < rand_move_probability:
      action = np.random.choice(ACTIONS)

    if action == "up":      y -= 1
    elif action == "down":  y += 1
    elif action == "left":  x -= 1
    elif action == "right": x += 1
    if not (0 <= x < self.width and 0 <= y < self.height):
      return current_state
    dx, dy = self.wind[x, y]
    x += dx
    y += dy

    # If the next state is an obstacle, stay in the current state
    return (x, y) if not self.is_obstacle(x, y) else current_state


## A simple world and a simple policy

In [None]:
world = World(2, 3)

# Since we only focus on episodic tasks, we must have a terminal state that the
# agent eventually reaches
world.add_terminal(1, 2, "+")

def equiprobable_random_policy(x, y):
  return { k:1/len(ACTIONS) for k in ACTIONS }

print(world.grid.T)

[[' ' ' ']
 [' ' ' ']
 [' ' '+']]


## Exercise: TD prediction

You should implement TD prediction for estimating $V≈v_\pi$. See page 120 of [Introduction to Reinforcement Learning](http://incompleteideas.net/book/the-book-2nd.html).


To implement TD prediction, the agent has to interact with the world for a certain number of episodes. However, unlike in the Monte Carlo case, we do not rely on complete sample runs, but instead update estimates (for prediction and control) and the policy (for control only) each time step in an episode.


Below, you can see the code for running an episode, with a TODO where you have to add your code for prediction. Also, play with the parameters ```alpha``` and ```EPISODES```, you will typically need a lot more than 10 episodes for an agent to learn anything.

In [None]:
# Global variable to keep track of current estimates
V = np.full((world.width, world.height), 0.0)

# Our step size / learing rate
alpha = 0.05 #Kan sættes ned ved mange iterationer for at tilnærme sig den rigtige værdi

# Discount factor
gamma = 0.9

# Episodes to run
EPISODES = 1000

def TD_prediction_run_episode(world, policy, start_state):
    current_state = start_state
    while not world.is_terminal(*current_state):
        # Get the possible actions and their probabilities that our policy says
        # that the agent should perform in the current state:
        possible_actions = policy(*current_state)

        # Pick a weighted random action:
        action = random.choices(population=list(possible_actions.keys()),
                                weights=possible_actions.values(), k=1)

        # Get the next state from the world
        next_state = world.get_next_state(current_state, action[0])

        # Get the reward for performing the action
        reward = world.get_reward(*next_state)

        V[current_state] = V[current_state] + alpha * (reward + gamma * V[next_state] - V[current_state])

        # Move the agent to the new state
        current_state = next_state


for episode in range(EPISODES):
    # print(f"Episode {episode + 1 }/{EPISODES}:")
    TD_prediction_run_episode(world, equiprobable_random_policy, (0, 0))

display(pd.DataFrame(V.T))

Unnamed: 0,0,1
0,3.311,3.792
1,4.321,6.139
2,6.452,0.0


## Exercise: SARSA

Implement and test SARSA with an $\epsilon$-greedy policy. See page 130 of [Introduction to Reinforcement Learning](http://incompleteideas.net/book/the-book-2nd.html) on different worlds. Make sure that it is easy to show a learnt policy (most probable action in each state).


In [69]:
# TODO: Implement your code here -- you need a Q-table to keep track of action
#       value estimates and a policy-function that returns an epsilon greedy
#       policy based on your estimates.
world2 = World(4,4)
world2.add_terminal(1, 2, "+")
Q = np.zeros((world2.width, world2.height,len(ACTIONS)))

EPISODES = 100
epsilon = 0.1
alpha = 0.05

def epsilon_greedy(state):
  x,y = state
  if np.random.rand() <= epsilon:
    return random.choice(ACTIONS)
  else:
    return ACTIONS[np.argmax(Q[x,y])]


def sarsa(world, start_state):
  for episode in range(EPISODES):
    current_state = start_state
    current_action = epsilon_greedy(current_state)

    action_to_index = {action: i for i, action in enumerate(ACTIONS)}

    while not world.is_terminal(current_state[0], current_state[1]):

      # Get the next state from the world
      next_state = world.get_next_state(current_state, current_action)
      next_action = epsilon_greedy(next_state)

      Q[current_state[0], current_state[1], action_to_index[current_action]] += alpha * (world.get_reward(*next_state) + gamma * Q[next_state[0], next_state[1], action_to_index[next_action]] - Q[current_state[0], current_state[1], action_to_index[current_action]])
      current_state = next_state
      current_action = next_action


# sarsa(world2, (0, 0))

# for x in range(world2.width):
#   for y in range(world2.height):
#     for action in ACTIONS:
#       print(f"State: {x,y} - Action: {action} - Value: {Q[x,y,ACTIONS.index(action)]}")
#     print("\n")

## Exercise: Windy Gridworld

Implement the Windy Gridworld (Example 6.5 on page 130 in the book) and test your SARSA implementation on the Windy Gridworld, first with the four actions (```up, down, left, right```) that move the agent in the cardinal directions, and then with King's moves as described in Exercise 6.9. How long does it take to learn a good policy for different values of $\alpha$ and $\epsilon$?

In [71]:
### TODO: Implement and test SARSA, first on Windy Gridworld with four actions
###       and then with King's moves

windyWorld = World(8, 8)
Q = np.zeros((windyWorld.width, windyWorld.height,len(ACTIONS)))

for y in range(windyWorld.height):
    windyWorld.set_wind(2, y, 0, -1)  # Wind pushes up

windyWorld.add_obstacle(2, 2)
windyWorld.add_terminal(4, 4, "+")

sarsa(windyWorld, (0, 0))
V = np.full((windyWorld.width, windyWorld.height),"right")

for x in range(world2.width):
  for y in range(world2.height):
    for action in ACTIONS:
      V[x,y] = ACTIONS[np.argmax(Q[x,y])]


In [72]:
display(pd.DataFrame(V.T))

Unnamed: 0,0,1,2,3,4,5,6,7
0,right,down,right,right,right,right,right,right
1,up,right,up,up,right,right,right,right
2,up,up,up,right,right,right,right,right
3,up,up,up,up,right,right,right,right
4,right,right,right,right,right,right,right,right
5,right,right,right,right,right,right,right,right
6,right,right,right,right,right,right,right,right
7,right,right,right,right,right,right,right,right
