<a href="https://colab.research.google.com/github/jarrydmartinx/deep-rl/blob/master/Policy_Gradient_Methods_Theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Policy Gradient Methods Theory

* This review of the theory of Policy Gradient methods for reinforcement learning is based largely on David Silver's fantastic UCL course on reinforcement learning.
* At the bottom there's some very old broken code that should serve as an anti-model, if anything. Don't write your agents like this in future.


In [0]:
! pip install -q plotnine

In [0]:
#@title Imports

from abc import ABCMeta, abstractmethod
import collections
import enum
import itertools

import pdb

import numpy as np
import plotnine as gg
import pandas as pd 

# Type stuff
from typing import Dict, List, Tuple

# Theory Review

## Policy-Based Methods vs. Value-Based Methods


* Value Based: Learnt Value Function but implicit policy
* Policy Based: No value function and learnt policy
* Actor-Critic: Learnt Value function + learnt policy


In some environments, it might be much more compact to represent the policy rather than the value-function (e.g. Breakout)

**Advantages:**
* Better convergence properties: you are just smoothly updating your policy, so you don't get the dramatic swings in the policy that can arise from updating the value function.
  * If you just follow the gradient of the return wrt the policy parameters, you're guaranteed to converge at least to a local optimum
* Effective in high-dimensional or continuous action-spaces: You don't have to compute the max at every step. This maximisation can be prohibitively expensive (say you have 10^10 actions, or continuous actions). 
  * With PG you just incrementally adjust the parameters of your policy in order to incrementally learn 'what the max will be', without solving a maximisation problem at every step.
* They can learn stochastic policies: necessary in e.g. rock paper scissors, aliased states. Value-based methods learn deterministic (or near deterministic) implicit policies.

**Disadvantages**
* Typically converges to a local optimum rather than a global optimum
* Naive policy based reinforcement learning can be slower, and higher variance, and less efficient than value based methods.
  * The max in value-based methods (even in Sarsa you often have a near-greedy policy with a max) is extremely aggressive: you push your policy as far as you can toward what you currently estimate the optimal policy to be.
  * PG methods update smoothly in policy space, taking small steps in the *direction* of the estimated optimal policy, which makes them
    * more stable
    * less efficient
  * Also computing the gradient naively can be very slow and high-variance (can mitigate this, e.g. in Actor Critic)
 
 
#### Convergence properties of Policy Gradients vs. Value-based
* Will we find a unique global optimum?
* In the tabular TD case, you are essentially applying the Bellman operator, which is a contraction in value-function space, so you're guaranteed to arrive at the global optimum
* With a tabular softmax policy (separate softmax parameters for each state), just following the gradient also finds the global optimum
* If you have a more general value/policy approximator, like a neural network, you can get stuck in a local optimum in either case.


![alt text](https://spinningup.openai.com/en/latest/_images/math/262538f3077a7be8ce89066abbab523575132996.svg)

## PG Objective Functions

* Goal: given policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$
* How do we measure the quality of a policy $\pi$? What is the objective function for PG methods?
* Three possible objective functions:
  * Start Value: only for episodic environments. You have a defined distribution over start states $s_1$
  $$ J_1(\theta) = V^{\pi_{\theta}} (s_1) = \mathbb{E}_{\pi_{\theta}}\big[v_1\big] $$
  * Average Value: for continuing environments
  $$ J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta} (s) $$
  * Average reward per time-step
  $$ J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta (s, a) \mathcal{R}_s^a       $$
  where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov Chain induced by $\pi_\theta$
* But don't worry because *they're just rescalings of each other*, and the **policy gradient direction is the same** for all three
* Hence all methods that maximise one maximise the others
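
As a rough illustration of the average-reward objective, the sketch below computes $d^{\pi_\theta}$ and $J_{avR}$ for a tiny tabular MDP. The arrays `P`, `R` and `pi` are made-up toy values, and `stationary_distribution` and `average_reward` are hypothetical helper names, not something from the course.

```python
import numpy as np

def stationary_distribution(P_pi: np.ndarray) -> np.ndarray:
  """Solves d = d @ P_pi subject to sum(d) = 1 by least squares."""
  n = P_pi.shape[0]
  # Stack the fixed-point equations (P_pi^T - I) d = 0 with a normalisation row.
  A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
  b = np.concatenate([np.zeros(n), [1.0]])
  d, *_ = np.linalg.lstsq(A, b, rcond=None)
  return d

def average_reward(P: np.ndarray, R: np.ndarray, pi: np.ndarray):
  """J_avR = sum_s d(s) sum_a pi(s, a) R[s, a] for a tabular MDP."""
  P_pi = np.einsum('sa,sat->st', pi, P)  # Markov chain induced by pi
  d = stationary_distribution(P_pi)
  return float(d @ np.sum(pi * R, axis=1)), d

# Toy 2-state, 2-action MDP (assumed values, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                # R[s, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])               # pi[s, a]

j_avr, d = average_reward(P, R, pi)
print(d, j_avr)
```

For this toy chain the stationary distribution works out to $(8/17, 9/17)$ and the average reward to $13/17$.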

## Policy Optimisation

* Policy based reinforcement learning is an optimisation problem. Find $\theta$ that maximises $J(\theta)$
* Some approaches do not use gradient
  * Hill climbing
  * Simplex / amoeba / Nelder Mead
  * Genetic algorithms
* Greater efficiency often possible using gradient
  * Gradient descent
  * Conjugate gradient
  * Quasi-newton
* We focus on gradient descent, many extensions possible
* And on methods that exploit sequential structure
  * We're not going to do blind optimisation like a genetic algorithm
    * GA would just let the agent run around, die, get a return value at the end and then we'd change some parameters
    * PG methods still allow us to break open the trajectory, and make use of the sequence of states and rewards to do better **by learning within the agent's lifetime**
 

### Gradient Ascent
* The objective is something like 'reward gotten out of the system'. We want to maximise that by adjusting the policy parameters, just a little, in the direction of the gradient of the objective wrt the policy parameters
$$ \Delta\theta = \alpha\nabla_\theta J(\theta)    $$
* $\nabla_\theta J(\theta)$ is the **policy gradient**
\begin{align}
\nabla_\theta J(\theta) &= 
\begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1}\\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
\end{align}




## Computing the gradient of the policy

### Numerically Computing the gradient of the policy (not the same as the PG)
* Numerical method of finite differences, works even if your policy isn't differentiable. 
  * You just do small perturbations to the policy in each parameter dimension, and then look at the value of the objective before and after the perturbation. That yields a numerical estimate of the gradient in each direction.
  * Very slow, especially in high-dimensional policy space, but it works sometimes (AIBO Soccer, UT, Peter Stone)
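
A minimal sketch of the finite-difference idea, assuming the objective can be evaluated exactly. Here the objective is the expected reward of a softmax policy on a made-up 3-armed bandit; in a real RL problem each evaluation would be a noisy rollout estimate. The function names are hypothetical, and the sketch uses central rather than one-sided differences.

```python
import numpy as np

def finite_difference_gradient(objective, theta, eps=1e-5):
  """Central-difference estimate of the gradient of objective at theta."""
  grad = np.zeros_like(theta, dtype=float)
  for k in range(theta.size):
    unit = np.zeros_like(theta, dtype=float)
    unit[k] = eps  # perturb one parameter dimension at a time
    grad[k] = (objective(theta + unit) - objective(theta - unit)) / (2 * eps)
  return grad

# Hypothetical objective: expected reward of a softmax policy over 3 arms.
arm_rewards = np.array([1.0, 2.0, 0.5])

def expected_reward(theta):
  probs = np.exp(theta - theta.max())
  probs /= probs.sum()
  return float(probs @ arm_rewards)

theta = np.zeros(3)
fd_grad = finite_difference_gradient(expected_reward, theta)
print(fd_grad)
```

Note the cost: two objective evaluations per parameter for every gradient estimate, which is why this scales badly with the dimension of the policy.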

#### Computing the Gradient of the Policy Analytically 
* Here we still have no value function, these are still monte carlo approaches
* Assume that the policy $\pi_\theta$ is differentiable whenever it is non-zero
  * Technically only has to be differentiable wherever you are picking actions???
* and assume we know the gradient $\nabla_\theta \pi_\theta(s, a)$
  * e.g. our policy is represented as a neural network we have created
* Likelihood ratios exploit the following identity
\begin{align}
  \nabla_\theta\pi_\theta(s, a) 
  &= \pi_\theta(s, a)\frac{\nabla_\theta \pi_\theta (s, a)}{\pi_\theta(s, a)} \\
  &= \pi_\theta(s, a)\nabla_\theta\log\pi_\theta (s, a)
\end{align}
  * We can multiply and divide by our policy without changing the value of the gradient
  * the gradient of the policy divided by the policy is equal to the gradient of the log of the policy
* This term $\nabla_\theta\log\pi_\theta(s,a)$ is called the score function


### What does the score look like for common policy choices?

#### Softmax Policy
* For discrete actions, a very simple way to parameterise the policy
  * A softmax policy is any policy that makes the probability of an action proportional to some exponentiated 'value'
* It's a smoothly parameterised policy that tells us how often we should take a particular action
* We choose some features, and some linear weights, and the linear combination of state-action features becomes something like a value
* This is called the linear softmax policy: the probability of taking an action a from state s is proportional to the exponentiated value
$$ \pi_\theta(s, a) \propto e^{\phi(s, a)^\top\theta}  $$
* Now we want to find the gradient of this policy, i.e. the score function:
$$ \nabla_\theta\log\pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta} \big[\phi(s, \cdot)\big] $$
  * Intuitively, it's scoring 'how much more of this feature do I have than usual'. If this feature occurs more than usual with this action, and it gets a good reward, then we want to adjust the policy to do more of that action.
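
The softmax score can be sketched in a few lines. This assumes a single fixed state whose state-action features are stacked in an array `features` of shape `[A, X]`; the function names and the random toy values are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, features):
  """Action probabilities; features[a] is phi(s, a) for a fixed state s."""
  logits = features @ theta
  logits = logits - logits.max()  # subtract the max for numerical stability
  probs = np.exp(logits)
  return probs / probs.sum()

def softmax_score(theta, features, action):
  """Score function: phi(s, a) - E_pi[phi(s, .)]."""
  probs = softmax_policy(theta, features)
  return features[action] - probs @ features

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))  # 4 actions, 3 features (toy values)
theta = rng.normal(size=3)
score = softmax_score(theta, features, action=2)
print(score)
```

A useful sanity check is that the score averages to zero under the policy itself, since $\sum_a \pi_\theta(s,a)\big(\phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]\big) = 0$.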

#### Gaussian Policy
* For continuous action spaces.
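* Just a different parameterisation of the policy. Here we usually parameterise only the mean, as a linear combination of state features $\mu(s) = \phi(s)^\top\theta$, and keep the variance $\sigma^2$ fixed, but the variance could be parameterised too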
* Policy is a Gaussian over some support, $a \sim \mathcal{N}(\mu(s), \sigma^2)$
* The score function is:
$$ \nabla_\theta\log\pi_\theta(s,a) = \frac{(a-\mu(s))\phi(s)}{\sigma^2} $$
* Intuitively, this again measures how much 'more' we're taking this action than usual (than the mean), multiplied by the features that were observed. If observing those features resulted in more reward, we adjust the policy parameters to increase the probability of taking that action again in that state.
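
A small sketch of the Gaussian score, assuming a linear mean $\mu(s) = \phi(s)^\top\theta$ and a fixed standard deviation `sigma`. The helper names and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def gaussian_log_prob(theta, phi_s, action, sigma):
  """log N(action; mu, sigma^2) with linear mean mu = phi(s)^T theta."""
  mu = phi_s @ theta
  return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(theta, phi_s, action, sigma):
  """Gradient of the log-probability wrt theta: (a - mu(s)) phi(s) / sigma^2."""
  mu = phi_s @ theta
  return (action - mu) * phi_s / sigma ** 2

phi_s = np.array([1.0, -0.5, 2.0])   # state features (toy values)
theta = np.array([0.3, 0.1, -0.2])   # mean parameters (toy values)
score = gaussian_score(theta, phi_s, action=0.7, sigma=0.5)
print(score)
```

With these numbers $\mu(s) = -0.15$, so the action is $0.85$ above the mean and the score is $3.4\,\phi(s)$.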

## Computing the Policy Gradient $\nabla_\theta J(\theta)$

### One-step MDP (a.k.a contextual bandit)
* Agent starts in state $s\sim d(s)$
* Environment terminates after one time-step with reward $r = \mathcal{R}_{s,a}$
* We use the likelihood ratio trick to compute the PG:
\begin{align}
J(\theta) &= \mathbb{E}_{\pi_\theta}[r] \\
&= \sum_{s\in \mathcal{S}}d(s)\sum_{a\in\mathcal{A}}\pi_\theta(s, a)\mathcal{R}_{s,a}\\
\nabla_{\theta} J(\theta) &=  
\sum_{s\in \mathcal{S}} d(s) \sum_{a\in\mathcal{A}} \pi_\theta (s, a) \nabla_{\theta} \log \pi_\theta (s, a) \mathcal{R}_{s,a}  \\
&= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta (s, a) r \big]
\end{align}
* Note that the whole point of using the likelihood ratio trick is to derive an expression for the gradient that is itself an expectation wrt the policy (i.e. wrt the behaviour distribution)
  * We can sample by taking actions in the environment in order to estimate this expectation
  * This expectation tells us the direction in which to shift the policy parameters
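
To make the one-step case concrete, the sketch below compares the exact gradient with a Monte Carlo average of $\nabla_\theta\log\pi_\theta(s,a)\,r$ on a made-up contextual bandit. It assumes a tabular softmax policy with one logit `theta[s, a]` per state-action pair and deterministic rewards `rewards[s, a]`; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 3, 4
d = np.array([0.5, 0.3, 0.2])                         # start-state distribution d(s)
rewards = rng.normal(size=(num_states, num_actions))  # R[s, a] (toy values)
theta = rng.normal(size=(num_states, num_actions))    # one logit per (s, a)

def policy(theta):
  z = np.exp(theta - theta.max(axis=1, keepdims=True))
  return z / z.sum(axis=1, keepdims=True)

def exact_gradient(theta):
  """sum_s d(s) sum_a pi(s, a) grad log pi(s, a) R[s, a], in closed form."""
  pi = policy(theta)
  mean_reward = np.sum(pi * rewards, axis=1, keepdims=True)  # E_a[R | s]
  return d[:, None] * pi * (rewards - mean_reward)

def sampled_gradient(theta, num_samples):
  """Monte Carlo average of grad log pi(s, a) * r over sampled (s, a)."""
  pi = policy(theta)
  grad = np.zeros_like(theta)
  for _ in range(num_samples):
    s = rng.choice(num_states, p=d)
    a = rng.choice(num_actions, p=pi[s])
    score = -pi[s].copy()
    score[a] += 1.0  # d log pi(s, a) / d theta[s, :] = onehot(a) - pi[s]
    grad[s] += score * rewards[s, a]
  return grad / num_samples

g_exact = exact_gradient(theta)
g_mc = sampled_gradient(theta, num_samples=20000)
print(np.abs(g_mc - g_exact).max())
```

The sampled estimate only needs actions drawn from the policy and the rewards they produced, which is exactly what interacting with the environment provides.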

### Policy Gradient Theorem
* We want to know which direction to shift the parameters in a multi-step MDP
* Beautifully, all we need to do is replace the immediate reward with the value function. That turns out to give us the true gradient of the policy objective function.
* For any differentiable policy $\pi_\theta(s, a)$, and for any of the policy objective functions $J = J_1, J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

\begin{align}
  \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \big[\nabla_\theta \log \pi_\theta(s, a)Q^{\pi_\theta}(s, a)\big] 
\end{align}




## Algorithms


### Monte-Carlo Policy Gradient (REINFORCE)

* Update parameters by stochastic gradient ascent, using the policy gradient theorem
* Using return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$
* Gradient step: 
$$ \Delta\theta_t = \alpha \nabla_\theta\log\pi_\theta(s_t, a_t) v_t    $$
* You get a very smooth learning curve (in contrast to value-based, which chatters and sometimes collapses)
* Problem: Slow. Very high variance, like any Monte Carlo estimate.
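
A minimal sketch of one REINFORCE pass over a recorded episode. It assumes a tabular softmax policy with logits `theta[s, a]`, and that `rewards[t]` is the reward received after taking `actions[t]` in `states[t]`. The function names and the two-step toy episode are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
  """Return v_t from each time step to the end of the episode."""
  returns = np.zeros(len(rewards))
  running = 0.0
  for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
  return returns

def softmax(logits):
  z = np.exp(logits - logits.max())
  return z / z.sum()

def reinforce_update(theta, states, actions, rewards, alpha=0.1, gamma=0.9):
  """Applies delta theta = alpha * score * v_t for every step of one episode."""
  theta = theta.copy()
  for s, a, v in zip(states, actions, discounted_returns(rewards, gamma)):
    score = -softmax(theta[s])
    score[a] += 1.0                # grad of log pi(s, a) wrt theta[s, :]
    theta[s] += alpha * score * v  # stochastic gradient ascent step
  return theta

theta = np.zeros((2, 2))  # 2 states, 2 actions, uniform initial policy
new_theta = reinforce_update(theta, states=[0, 1], actions=[1, 0],
                             rewards=[0.0, 1.0])
print(new_theta)
```

In the toy episode the returns are $v = (0.9, 1.0)$, so the logits of the actions that were taken go up in proportion to the return that followed them.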


### Actor-Critic



Desiderata: 
* Smooth learning with good convergence properties
* Be faster and lower variance than Monte-Carlo methods

To reduce variance we introduce a critic:
* The critic estimates the action-value function,
$$ Q_w(s, a) \approx Q^{\pi_\theta}(s, a) $$
* This is just the familiar problem of *policy evaluation*: how good is policy $\pi_\theta$
  * Could do it with MC policy evaluation, TD learning, or TD($\lambda$)
* we will substitute these in place of the return in our policy gradient algorithm
* Actor-critic algorithms maintain *two* sets of parameters
  * Critic: Updates action-value function parameters $w$
  * Actor: Updates policy parameters $\theta$, in direction suggested by critic
  
Actor-critic algorithms follow an approximate policy gradient
\begin{align}
\nabla_\theta J(\theta) &\approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(s, a) Q_w(s, a)\big] \\
\Delta\theta &= \alpha\nabla_\theta\log\pi_\theta(s, a)Q_w(s, a)
\end{align}

#### Szepesvári on Actor-Critic
* **Actor-Critic methods implement Generalised Policy Iteration**
  * Remember that policy-iteration works by alternating between a complete policy evaluation and a complete policy improvement step. 
  * When using sample-based methods or function approximation, exact evaluation of the policies may require infinitely many samples or might be impossible due to the restrictions of the function-approximation technique
  * Hence RL algs simulating policy iteration must change the policy based on incomplete knowledge of the value function.
  * This is GPI: there are two closely interacting processes, actor and critic, policy improvement and policy evaluation.
    * Note that the policy used to generate samples (the behaviour policy), can be different from the one that is evaluated by the critic and improved by the actor
    * This can be very useful because the critic can learn about actions not preferred by the current target policy so that the critic can improve the target policy
    
#### Problems
* Note that unlike perfect policy iteration, a GPI method may generate a policy that is substantially worse than the previous one.
  * The quality of the sequence of generated policies may oscillate or even diverge when the PE step is incomplete, regardless of whether the PI step is exact or approximate (B&T 1996)
  * In practice, they often improve in the beginning and oscillate later
  

#### Algorithm: Action-Value Actor-Critic
* Simple action-value critic using linear value function approximation, $Q_w(s, a) = \phi(s, a)^\top w$
  * Critic: Updates $w$ by linear TD(0)
  * Actor: Updates $\theta$ by policy gradient
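
The two updates above can be sketched as a single step function. This assumes a linear softmax actor and a linear critic sharing the same state-action features, with a Sarsa-style TD(0) target. The name `qac_step`, the critic step size `beta`, and the one-hot toy features are assumptions for illustration.

```python
import numpy as np

def softmax_probs(theta, phi_s):
  """phi_s has shape [A, X]: features phi(s, a) for every action in state s."""
  logits = phi_s @ theta
  z = np.exp(logits - logits.max())
  return z / z.sum()

def qac_step(theta, w, phi_s, a, r, phi_next, a_next, done,
             alpha=0.01, beta=0.1, gamma=0.9):
  """One update of the actor weights theta and the critic weights w."""
  q_sa = phi_s[a] @ w                               # Q_w(s, a)
  q_next = 0.0 if done else phi_next[a_next] @ w    # Q_w(s', a')
  td_error = r + gamma * q_next - q_sa
  score = phi_s[a] - softmax_probs(theta, phi_s) @ phi_s
  new_theta = theta + alpha * score * q_sa          # actor: policy gradient
  new_w = w + beta * td_error * phi_s[a]            # critic: linear TD(0)
  return new_theta, new_w, td_error

phi = np.eye(2)                        # single state, two actions, one-hot phi(s, a)
theta, w = np.zeros(2), np.array([1.0, 0.0])
theta2, w2, delta = qac_step(theta, w, phi, a=0, r=1.0,
                             phi_next=phi, a_next=1, done=False)
print(theta2, w2, delta)
```

In this toy call the critic already predicts the target exactly, so the TD error is zero and only the actor moves, toward the action the critic currently rates higher.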

### The Bias Variance Tradeoff in RL


* Monte Carlo methods sample the actual end-of-episode return before doing an update. This return is an unbiased estimate of the true value
* TD methods do backups; that is, they estimate the value of a particular state using estimates of the value of other states. These estimates are much lower variance, but this also introduces bias.
* Methods like TD($\lambda$) attempt to balance this tradeoff

#### Bias/Variance in Policy Gradients
* Approximating the policy gradient introduces bias
* A biased policy gradient may not find the right solution
  * e.g. if $Q_w(s, a)$ uses aliased features, can we solve the gridworld example?
* Amazingly, if we choose value function approximation carefully, we can avoid introducing bias and follow the **exact** policy gradient


### Compatible Function Approximation Theorem


If the following two conditions are satisfied:
1.  Value function approximator is **compatible** with the policy:
$$ \nabla_w Q_w(s, a) = \nabla_\theta\log \pi_\theta(s, a) $$
2. Value function parameters $w$ minimise the mean-squared error
$$ \mathcal{L}(w) = \mathbb{E}_{\pi_\theta} \big[(Q^{\pi_\theta}(s, a) - Q_w(s, a))^2\big] $$

Then the policy gradient is exact,
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log \pi_\theta(s, a) Q_w(s, a)\big]  $$


**TODO(jarryd@google.com):** Prove this theorem (slides have a proof)
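
A sketch of the standard argument, using only the two conditions above (not a full proof). If $w$ minimises $\mathcal{L}(w)$, the gradient of $\mathcal{L}$ wrt $w$ is zero; substituting the compatibility condition then shows that the approximation error is uncorrelated with the score:

\begin{align}
\nabla_w \mathcal{L}(w) &= 0 \\
\mathbb{E}_{\pi_\theta}\big[(Q^{\pi_\theta}(s, a) - Q_w(s, a))\nabla_w Q_w(s, a)\big] &= 0 \\
\mathbb{E}_{\pi_\theta}\big[(Q^{\pi_\theta}(s, a) - Q_w(s, a))\nabla_\theta\log\pi_\theta(s, a)\big] &= 0 \\
\mathbb{E}_{\pi_\theta}\big[Q^{\pi_\theta}(s, a)\nabla_\theta\log\pi_\theta(s, a)\big] &= \mathbb{E}_{\pi_\theta}\big[Q_w(s, a)\nabla_\theta\log\pi_\theta(s, a)\big]
\end{align}

The left-hand side of the last line is the policy gradient given by the policy gradient theorem, so $Q_w$ can be substituted for $Q^{\pi_\theta}$ without changing it.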

### Advantage Actor-Critic

#### Reducing variance using a baseline
* We subtract a baseline function $B(s)$ from the policy gradient
* A good baseline is the state value function $B(s) = V^{\pi_\theta}(s)$
* So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s, a)$
\begin{align}
A^{\pi_\theta}(s, a) &= Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) \\ 
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(s,a) A^{\pi_\theta}(s, a)\big]
\end{align}
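
Why subtracting a baseline leaves the expectation unchanged: a short derivation, assuming $B(s)$ depends only on the state. Using the likelihood ratio identity in reverse, the baseline term vanishes because the action probabilities always sum to one:

\begin{align}
\mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(s, a)B(s)\big]
&= \sum_{s} d^{\pi_\theta}(s)\sum_a \nabla_\theta\pi_\theta(s, a)B(s) \\
&= \sum_{s} d^{\pi_\theta}(s) B(s)\nabla_\theta\sum_a\pi_\theta(s, a) \\
&= 0
\end{align}

So the baseline can only change the variance of the estimate, never its mean.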


#### Estimating the advantage function
* You could learn both $Q^{\pi_\theta}$ and $V^{\pi_\theta}$ with two different function approximators and sets of parameters
* Better way: **the td-error $\delta^{\pi_\theta}$ is an unbiased estimator of the advantage function**
\begin{align}
\mathbb{E}_{\pi_\theta}\big[\delta^{ \pi_\theta} \mid s, a\big]
&= \mathbb{E}_{\pi_\theta}\big[r + \gamma V^{\pi_\theta}(s')\mid s, a \big]-V^{\pi_\theta}(s) \\
&= Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) \\
&= A^{\pi_\theta}(s, a)
\end{align}
* So we just use the TD error to compute the PG:
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\delta^{\pi_\theta}\big] $$
* In practice we can use an approximate TD error
$$ \delta_v = r + \gamma V_v(s') - V_v(s) $$
which only requires one set of critic parameters $v$: **we don't need to estimate $Q^{\pi_\theta}$**
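
The claim that the TD error is an unbiased estimate of the advantage can be checked numerically on a small random MDP. The sketch below assumes deterministic rewards `R[s, a]` and evaluates the policy exactly by solving the Bellman equations; the MDP itself is made-up toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s'] (toy transitions)
R = rng.normal(size=(S, A))                 # R[s, a] (toy rewards)
pi = rng.dirichlet(np.ones(A), size=S)      # pi[s, a]

# Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.sum(pi * R, axis=1)
V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
Q = R + gamma * P @ V                       # Q[s, a]
advantage = Q - V[:, None]

# Sample next states from one (s, a) pair and average the TD errors.
s, a = 1, 2
next_states = rng.choice(S, size=100000, p=P[s, a])
td_errors = R[s, a] + gamma * V[next_states] - V[s]
print(td_errors.mean(), advantage[s, a])
```

The two printed numbers should agree up to sampling noise. This only holds when the true $V^{\pi_\theta}$ is used; with an approximate $V_v$ the TD error is a biased estimate.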


## PG algorithms are on-policy

* On-policy MEANS that when you update your policy, you throw away your old samples and generate new samples
* When we deal with very complicated policy classes represented by neural networks, which we only change a bit with each gradient step, this can make on-policy learning very inefficient.
* What if we want to learn from samples generated under a (perhaps only slightly) different policy from the current one?

### Importance Sampling and the off-policy Policy Gradient
* Importance sampling is a way to estimate the expectation of some function wrt some distribution when we can't sample from that distribution
* Instead we sample from another proposal distribution
\begin{align}
\mathbb{E}_{x\sim p(x)}[f(x)]
&= \int p(x)f(x)dx \\
&= \int \frac{q(x)}{q(x)}p(x)f(x)dx \\
&= \mathbb{E}_{x\sim q(x)} \bigg[\frac{p(x)}{q(x)}f(x)\bigg]
\end{align}
* In the RL setup the relevant expectation is our objective
$$ \mathbb{E}_{\tau\sim \pi_{\theta}(\tau)}\big[r(\tau)\big]         $$
and we have samples from some $\bar{\pi}(\tau)$ instead
* Using importance sampling we can estimate this expectation:
$$ J(\theta) = \mathbb{E}_{\tau\sim\bar{\pi}(\tau)}\bigg[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}r(\tau)\bigg]  $$
* Computing the gradient and using the likelihood ratio trick yields the off-policy PG:
\begin{align}
\nabla_{\theta'} J(\theta')
&= \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\bigg[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\nabla_{\theta'}\log \pi_{\theta'}(\tau)r(\tau) \bigg] , \qquad \text{when } \theta \neq \theta'
\end{align}
* Substituting terms in for $\pi_\theta$ and $\bar{\pi}$, we can cancel out transition probability terms in the product, and all we are left with is the ratio of action probabilities under each policy
* A first order approximation for IS:
\begin{align}
J(\theta') = 
\sum_{t=1}^{T}\mathbb{E}_{s_t\sim p_\theta(s_t)}\bigg[\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}
\mathbb{E}_{a_t\sim\pi_\theta(a_t \mid s_t)} \bigg[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}r(s_t,a_t)\bigg]\bigg]
\end{align}
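where we then ignore the state-marginal ratio $p_{\theta'}(s_t)/p_{\theta}(s_t)$. Will see later in Levine course why this is reasonable.
* The full trajectory importance weight is a product of $T$ per-step action-probability ratios, so its variance can grow exponentially in $T$. That is the motivation for working with per-time-step ratios and dropping the state-marginal ratio. Revisit.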
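
A small sketch of the importance sampling identity on a discrete toy problem, followed by an illustration of why trajectory weights are troublesome. The distributions `p`, `q` and the function values `f` are made up, and the last part assumes the per-step weights are independent and identically distributed, which is a simplification of what happens along a real trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.7])  # target distribution we care about
q = np.array([0.4, 0.4, 0.2])  # proposal distribution we can sample from
f = np.array([1.0, 2.0, 3.0])  # f(x) for x in {0, 1, 2}

# Sample from q, then reweight each sample by p(x) / q(x).
x = rng.choice(3, size=200000, p=q)
is_estimate = np.mean(p[x] / q[x] * f[x])
true_value = float(p @ f)
print(is_estimate, true_value)

# Second moment of one weight under q, and of a product of T independent
# weights: it grows geometrically with the horizon T.
second_moment = float(np.sum(p ** 2 / q))
trajectory_second_moment = {T: second_moment ** T for T in (1, 10, 50)}
print(trajectory_second_moment)
```

Each weight has mean one under `q`, yet the second moment of the product blows up with `T`, so a trajectory-level estimate becomes dominated by a few rare samples.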

# Implementation

In [0]:
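from typing import Callable

Action = int
State = int
Vector = np.ndarray
# A policy maps a vector of action values to the index of the chosen action.
Policy = Callable[[Vector], Action]
TimeStep = collections.namedtuple('TimeStep', ['observation', 'reward', 'done'])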

In [0]:
class Agent(metaclass=ABCMeta):
  
  @abstractmethod
  def policy(self, timestep: TimeStep) -> Action:
    pass

  @abstractmethod
  def update(self,
             timestep: TimeStep,
             action: Action,
             new_timestep: TimeStep) -> None:
    pass

In [0]:
#@title TD-Learning Agent with Linear Function Approximation

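class TDAgentLFA(Agent):
  
  def __init__(self,
               num_actions: int,
               exploration_policy: Policy,
               alpha: float = 0.01,
               gamma: float = 0.9):
    """Builds a TD-learning agent with linear function approximation.
    
    Glossary:
      A: Size of action space.
      X: Size of feature space.
    
    Args:
      num_actions: Number of discrete actions, A.
      exploration_policy: Callable mapping a vector of A action values to
        the action to take.
      alpha: Step size of the SGD update.
      gamma: Discount factor.
    """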
    
    # Exploration Policy
    self._exploration_policy = exploration_policy
    
    # Learning parameters
    self._gamma = gamma
    self._alpha = alpha

    # Value function parameters/environment metadata
    self._num_actions = num_actions
    self._weights = None
  
  def policy(self, timestep: TimeStep) -> Action:
    """Compute the policy."""
    
    q_values = self._get_q_values(timestep.observation.flatten())

    return self._exploration_policy(q_values)
  
  def update(self, 
             timestep: TimeStep,
             last_action: Action,
             new_timestep: TimeStep) -> None:
    """
    Performs a SGD step to improve the approximation of q(., .)
    """

    x = timestep.observation.flatten()
    a = last_action
    r = new_timestep.reward
    x_ = new_timestep.observation.flatten()
    
    old_q = self._get_q_values(x, a)
    target_q = self._get_target_q_value(x_)

    target = r
    if not new_timestep.done:
      target += self._gamma * target_q
    
    td_error = target - old_q
    step = self._alpha * td_error * x
    self._weights[a] += step

  def _get_q_values(self, x: Vector, action: Action = None):
    """Computes the approximate q-values for the feature vector x.

    Returns the vector of all A action values if action is None, otherwise
    the scalar value of the given action.
    """
    
    # Lazily initialise the weights once the feature size X is known.
    if self._weights is None:
      self._weights = np.random.rand(self._num_actions, x.shape[0])  # [A, X]
    
    if action is None:
      return np.dot(self._weights, x)  # [A]
    else:
      return np.dot(self._weights[action], x)  # []

  @abstractmethod
  def _get_target_q_value(self, new_observation: Vector) -> float:
    """
    Concrete subclasses like SarsaAgent/QLearningAgent must implement this
    method to compute the next action-value Q(s',a') in on/off policy manner.
    """
    pass
  
  
#@title Sarsa Agent with LFA

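class SarsaAgentLFA(TDAgentLFA):
  """A Sarsa agent that uses linear value function approximation.

  Assumes update() is called before policy() is queried for the next step.
  """
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._action = None  # a' sampled for the last target, reused by policy().
  
  def policy(self, timestep: TimeStep) -> Action:
    """Returns the cached a' if there is one, so the target stays on-policy."""
    if self._action is None:
      return super().policy(timestep)
    action, self._action = self._action, None
    return action

  def update(self, timestep: TimeStep, last_action: Action,
             new_timestep: TimeStep) -> None:
    super().update(timestep, last_action, new_timestep)
    if new_timestep.done:
      self._action = None  # Never carry a' across an episode boundary.

  def _get_target_q_value(self, new_observation: Vector) -> float:
    """Samples a' from the exploration policy at s' and returns Q(s', a')."""
    q_values = self._get_q_values(new_observation)
    self._action = self._exploration_policy(q_values)
    return q_values[self._action]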

In [0]:
class ActorCriticAgentLFA(Agent):
  
  def __init__(self,
               num_actions: int,
               critic: TDAgentLFA, 
               behaviour_policy: Policy,
               alpha: float = 0.01,
               gamma: float = 0.9):
    
    # Value function parameters
    self._num_actions = num_actions
    
    # Critic module, a TD-learning Agent that performs policy evaluation
    self._critic = critic
    # Behaviour policy, maps the critic's q-values to an action
    self._behaviour_policy = behaviour_policy
    
    # Learning Parameters
    self._gamma = gamma
    self._alpha = alpha
    
  def policy(self, timestep:TimeStep) -> Action:
    
    q_values = self._critic._get_q_values(timestep.observation.flatten())
    
    return self._behaviour_policy(q_values)
  
  def update(self,
            timestep: TimeStep,
            action: Action,
            new_timestep: TimeStep) -> None:
    
    
    self._critic.update(timestep, action, new_timestep)

In [0]:
# Initialise states s, policy weights theta
# Sample a ~ pi_theta
# for each step in the trajectories
  # r = R(s,a), transition s' ~ P(s,a)

  # Critic computes td_error
  # Sample action a' ~ \pi_theta(s', a')
  # td_error = r + gamma Q_w(s', a') - Q_w(s, a)

  # Update policy weights
  # theta = theta + alpha * score * Q_w(s, a)

  # Update critic weights
  # w = w + beta * td_error * phi(s, a)

  # action = a'
  # state = s'

In [0]:
#@title Four Rooms.

class Tile(enum.Enum):
  EMPTY = 0
  WALL = 1
  AGENT = 2
  GOAL = 3
  
  
# class Action(enum.Enum):
#   UP = (-1, 0)
#   DOWN = (1, 0)
#   LEFT = (0, -1)
#   RIGHT = (0, 1)


MAP = """
#########
#       #
#   #   #
#   #   #
### ## ##
#   #   #
#       #
#   #   #
#########
"""


class FourRooms(object):
  
  def __init__(self, 
               map_string=MAP, 
               initial_pos=(1, 1), 
               goal_pos=(7, 7)):
    # Convert map to binary
    map_string = map_string.replace('#', str(Tile.WALL.value))
    binary_map = map_string.replace(' ', str(Tile.EMPTY.value))

    # Convert to list
    map_list = [list(x) for x in binary_map.split('\n')[1:-1]]
    
    # Convert to numpy array
    self._grid = np.array(map_list, dtype=np.int32)
    
    self._size = self._grid.shape[0]
    
    # Make sure that the grid is square.
    if self._grid.shape[1] != self._grid.shape[0]:
      raise ValueError('Map must be square, soz.')
      
    self._actions = {
        0: np.array((-1, 0)),
        1: np.array((1, 0)),
        2: np.array((0, -1)),
        3: np.array((0, 1)),
    }
    
    self._initial_pos = np.array(initial_pos)
    self._pos = self._initial_pos
    
    self._goal_pos = np.array(goal_pos)
    self._grid[goal_pos[0], goal_pos[1]] = Tile.GOAL.value
  
  def step(self, action: int) -> TimeStep:
    # TODO(jaslanides): Noisy actions?
    action_vec = self._actions[action]
    next_x, next_y = self._pos + action_vec
    next_tile = self._grid[next_x, next_y]
    
    reward = 0
    done = False
    
    if next_tile != Tile.WALL.value:  # Not a wall
      self._pos = (next_x, next_y)  # Move there :)

    if next_tile == Tile.GOAL.value:  # If it's the goal, get money get paid
      reward = 1
      done = True  # Episode ends when u get paid  
    timestep = TimeStep(observation=self._get_observation(), 
                        reward=reward,
                        done=done)

    return timestep
  
  def reset(self) -> TimeStep:
    self._pos = self._initial_pos
    # TODO(jaslanides): Randomize goals?
    return TimeStep(observation=self._get_observation(), 
                    reward=0, 
                    done=False)
      
  def _get_observation(self):
    obs = self._grid.copy()
    obs[self._pos[0], self._pos[1]] = Tile.AGENT.value
    obs = np.float32(obs)
    obs -= obs.mean()

    return obs
  
  @property
  def num_actions(self) -> int:
    return 4  # Up, down, left, right (see self._actions)
  
  @property
  def obs_shape(self) -> Tuple[int, int]:
    # Shape of the observation grid, (rows, columns).
    return self._grid.shape
  