# 1. Introduction to RL
### 1.2 States, MDPs, and Policies

In [1]:
import numpy as np
import random

## State
We define our state space, $\mathcal{S}$, as the set of grid cells that our agent can occupy. $\text{S}$ marks the starting state, $\text{F}$ marks the terminal state, which dispenses a reward, and $\text{\#}$ marks obstacles. 

The **current state**, $s$, is our agent's location in the **state space**, $s \in \mathcal{S}$.

In [2]:
grid = np.array([
    ['S', '-', '-', '#'],
    ['-', '#', '-', '-'],
    ['-', '-', '-', '#'],
    ['#', '-', '-', 'F']
])

We define our action space $\mathcal{A}$. 

In [3]:
actions = ['up', 'down', 'left', 'right']

We define our reward as $10$ for finishing in state $\text{F}$ which is located at index $(3,3)$ of our array.

In [4]:
rewards = { (3, 3): 10}

Similar to the bandit problem, we initialise a dictionary of zeroed action values, but now introduce the dimension of **state**. 

Recall that in the bandit problem, action values are given by:
$$q(a)=\mathbb{E}[R|A=a]$$

This means that the expected reward depends *only on the action taken*. 

Now that we have introduced **state**, rewards are determined not only by the action taken, but also by the *state in which the action was taken*: 
$$q(s,a) = \mathbb{E}[R| S=s, A=a]$$

If you're struggling to make sense of that, take a look at the dictionary print-out below - we've simply added another dimension. Now, action values are stored for each **state-action pair**, instead of for just actions alone.

In [5]:
Q_values = { (i, j): {a: 0.0 for a in actions} for i in range(4) for j in range(4) }
Q_values

{(0, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}}

We define a function for checking whether the state the agent is trying to move to is valid - we don't want it going out of bounds or into an obstacle.

In [6]:
def is_valid(state):
    i, j = state
    return 0 <= i < 4 and 0 <= j < 4 and grid[i, j] != '#'

We also define a function for retrieving the next state, where we adjust the indices of our array, check their validity, and then return the new state.

In [7]:
def get_next_state(state, action):
    i, j = state
    if action == 'up': next_state = (i-1, j)
    elif action == 'down': next_state = (i+1, j)
    elif action == 'left': next_state = (i, j-1)
    elif action == 'right': next_state = (i, j+1)
    
    if is_valid(next_state): return next_state
    else: return state

Finally, we define a $\text{play}$ function. This initialises our agent at the starting position $\text{S}$, i.e., $(0,0)$. Our agent will then randomly select actions and update its action-values (Q-values) as it progresses through the grid, and terminate when it reaches position $\text{F}$, i.e., $(3,3)$. 

In [8]:
def play(): 
    state = (0, 0)
    while state != (3, 3): 
        action = random.choice(actions)
        next_state = get_next_state(state, action)
        reward = rewards.get(next_state, 0)
        
        Q_values[state][action] += reward
        state = next_state

What do you think our action value dictionary is going to look like after we have finished?

In [9]:
play()

Let's have a look at what our agent has learned by checking the table of Q-values.

In [10]:
Q_values

{(0, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 10.0},
 (3, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}}

The agent *hasn't learned anything* about how to navigate the grid. It has randomly selected actions in each state until it happened upon the terminal state. Instead of improving its decision-making, it has brute-forced its way through, blindly repeating the process every time it plays.

What we *want to do* is to propagate the reward backward through the grid, ensuring that each state-action pair receives a **small update** according to how much it contributed to the final reward. 

This is where **Markov Decision Processes (MDPs)** and **Dynamic Programming** come in. 

## Markov Decision Processes (MDPs)
A **Markov Decision Process (MDP)** is a **model for sequential decision making when outcomes are uncertain**. 

We can consider our grid as a sequence of decisions - in every state we encounter, we have four possible actions to take (up, down, left, right), and we must try to take the best one.

An MDP is defined as a tuple: 
$$(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$$

Where: 
- $\mathcal{S}$ is the **state space**.
- $\mathcal{A}$ is the **action space**.
- $P(s'|s,a)$ is the **transition function** - the probability of reaching state $s'$ by taking action $a$ in state $s$. 
- $R(s,a,s')$ is the **reward function** - the reward received immediately after taking action $a$ in state $s$ and arriving in state $s'$. 
- $\gamma$ is a discount factor, which allows us to control how much future rewards matter for the current decision. 
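
As a rough illustration, the five components can be bundled into a single Python object. The `GridMDP` class, its field names and the `step` helper are hypothetical names chosen for this sketch. The grid layout, actions and reward are the ones defined earlier, and $\gamma=0.9$ is an assumed value.

```python
from dataclasses import dataclass

# Hypothetical container for the MDP tuple (S, A, P, R, gamma).
# The class and field names are illustrative, not part of any library.
@dataclass(frozen=True)
class GridMDP:
    states: tuple       # S: every non-obstacle cell
    actions: tuple      # A: the moves available in each state
    transitions: dict   # P: (s, a) -> {s': probability}
    rewards: dict       # R: s' -> reward for arriving in s'
    gamma: float        # discount factor

layout = ["S--#", "-#--", "---#", "#--F"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
states = tuple((i, j) for i in range(4) for j in range(4) if layout[i][j] != "#")

def step(state, action):
    """Move one cell; stay put if the move leaves the grid or hits an obstacle."""
    i, j = state[0] + moves[action][0], state[1] + moves[action][1]
    return (i, j) if (i, j) in states else state

mdp = GridMDP(
    states=states,
    actions=tuple(moves),
    transitions={(s, a): {step(s, a): 1.0} for s in states for a in moves},
    rewards={(3, 3): 10},
    gamma=0.9,  # assumed value
)
print(len(mdp.states), mdp.transitions[((0, 0), "right")])
```

Storing $P$ as a dictionary of distributions is one possible design choice; it keeps the same structure usable if transitions later become stochastic.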

Unlike the bandit approach we explored earlier, a Markov decision process doesn't just consider immediate rewards - it also considers future rewards and how different actions lead to different states. 

To be able to model our problem as a Markov decision process, it must fulfill two criteria: 
- It must *not* violate the Markov property.
- We need *complete* knowledge of the MDP components $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$.

### The Markov Property
We can **only model a problem as an MDP if it satisfies the Markov Property**. 

The Markov property states that **the future state depends only on the current state and action, and not on the full history of past states**. 

Our gridworld fulfills this property because: 
- If our agent is in state $(1,2)$, the only *relevant information* for deciding what to do next is what actions are available from $(1,2)$. 
- It *does not matter* how the agent got to state $(1,2)$, and so past moves have no bearing on future decisions. 
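
To make this concrete, the sketch below reaches $(1,2)$ by two different routes and checks that the possible next states are identical in both cases. The `step` and `follow` helpers and the two example routes are made up for illustration; the grid layout is the one defined above.

```python
layout = ["S--#", "-#--", "---#", "#--F"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic move; the agent stays put if the move is blocked."""
    i, j = state[0] + moves[action][0], state[1] + moves[action][1]
    if 0 <= i < 4 and 0 <= j < 4 and layout[i][j] != "#":
        return (i, j)
    return state

def follow(route, start=(0, 0)):
    """Apply a sequence of actions and return the final state."""
    state = start
    for action in route:
        state = step(state, action)
    return state

# Two hypothetical histories that both end in state (1, 2)
short_route = ["right", "right", "down"]
long_route = ["down", "down", "right", "right", "up"]
state_a, state_b = follow(short_route), follow(long_route)

# What happens next depends only on the current state and the chosen action
next_a = {a: step(state_a, a) for a in moves}
next_b = {a: step(state_b, a) for a in moves}
print(state_a, state_b, next_a == next_b)
```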

A game like blackjack would **violate** the Markov property if we defined the state as just the current hand, because: 
- The cards that can be drawn next depend on which cards have already been dealt. 
- The best decision (e.g., hit or stick) depends not just on the current hand, but also on **what cards have already been played** - information that the current hand alone does not capture.

We also need to check if we have complete knowledge of the components $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$.

We know $\mathcal{S}$, the state space, because we defined it earlier:

In [11]:
grid = np.array([
    ['S', '-', '-', '#'],
    ['-', '#', '-', '-'],
    ['-', '-', '-', '#'],
    ['#', '-', '-', 'F']
])

We know $\mathcal{A}$, the action space: 

In [12]:
actions = ['up', 'down', 'left', 'right']

We know $P(s'|s,a)$, the transition function. Moving around our grid is **deterministic** - if we take action '$\text{right}$' in state $(0,0)$, we're going to move to state $(0,1)$ with 100% certainty. In other words, for every $s \in \mathcal{S}$ and $a \in \mathcal{A}$ there is exactly one next state $s'$ for which:
$$P(s'|s,a) = 1$$

This **does not always hold**. In many environments, transitions are stochastic, meaning that the same action can lead to different outcomes with some probability. We'll explore these later. 
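
One way to picture the difference is to write the transition function so that it returns a distribution over next states. The sketch below is an illustration only: `transition_probs` and its `slip` parameter are hypothetical, and the "slippery" behaviour (the agent fails to move with probability `slip`) is an invented example of a stochastic environment, not something defined in this notebook.

```python
layout = ["S--#", "-#--", "---#", "#--F"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def intended_next_state(state, action):
    """The cell the action aims for, or the current cell if it is blocked."""
    i, j = state[0] + moves[action][0], state[1] + moves[action][1]
    if 0 <= i < 4 and 0 <= j < 4 and layout[i][j] != "#":
        return (i, j)
    return state

def transition_probs(state, action, slip=0.0):
    """Return P(s' | s, a) as a dict {next_state: probability}.

    slip is a hypothetical parameter: with probability slip the agent
    fails to move and stays where it is.
    """
    target = intended_next_state(state, action)
    probs = {target: 1.0 - slip}
    if slip > 0:
        # If the move was blocked, target == state and the two cases merge
        probs[state] = probs.get(state, 0.0) + slip
    return probs

print(transition_probs((0, 0), "right"))            # deterministic gridworld
print(transition_probs((0, 0), "right", slip=0.2))  # invented stochastic variant
```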

We know $R(s,a,s')$, the reward for taking action $a$ in state $s$ leading to the new state $s'$:

In [13]:
rewards = { (3, 3): 10}

$\gamma$ (gamma), the discount factor, is defined by us, so we know that too.
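
A quick way to build intuition for $\gamma$ is to see what a reward of $10$ is worth when it lies several steps in the future. The `discounted_value` helper is a hypothetical name, and the values of $\gamma$ below are arbitrary examples.

```python
def discounted_value(reward, gamma, steps_away):
    """Present value of a reward that arrives steps_away steps in the future."""
    return reward * gamma ** steps_away

for gamma in (0.5, 0.9, 1.0):  # example values only
    row = [round(discounted_value(10, gamma, k), 2) for k in range(5)]
    print(f"gamma={gamma}: {row}")
```

A smaller $\gamma$ makes distant rewards fade quickly, while $\gamma=1$ treats a reward five steps away the same as an immediate one.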

Since our gridworld satisfies the Markov property, and we have all five components of the MDP (states, actions, transition probabilities, rewards, and $\gamma$) we can formally model our problem as a Markov decision process.

Our goal is now to solve this MDP, meaning we must find the best action to take in each state to maximise our reward.

To solve our MDP, we turn to **Dynamic Programming**, a method that allows us to systematically compute optimal values and decisions using the **Bellman Equation**.  

## Dynamic Programming
Dynamic programming is a technique used in computer science (among other fields) to break problems down into small overlapping subproblems and solve them recursively. 

In our case, this means breaking down decision-making in our gridworld into localised state-action updates, where each Q-value depends on the expected rewards and future values of neighbouring states. 

If this doesn't quite make sense yet, consider how we previously tried to optimise our gridworld by **looking at the whole grid at once**, and using information gained at the very end to update our Q-values, resulting only in states directly adjacent to the terminal state receiving any sort of update. 

A dynamic programming approach means that we will look at **each cell in turn**, working backwards through our grid and updating each cell using information from the cells we have already updated, e.g.: 
- We receive a reward from cell/state $(3,3)$. 
- We use that value to update cell/state $(3,2)$.
- We use that value to update cell/state $(2,2)$ and $(3,1)$.
- ... 
- We update our starting cell $\text{S}$, $(0,0)$.
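
The list above can be sketched as a short loop. This is only an illustration of the idea: the `path` below is one hand-picked route from $\text{S}$ to $\text{F}$, and the factor of $0.9$ applied at each step is an assumed value for the discount factor $\gamma$.

```python
# One hand-picked route from S (0, 0) to F (3, 3)
path = [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (3, 3)]
terminal_reward = 10
gamma = 0.9  # assumed discount factor

values = {path[-1]: 0.0}   # nothing left to collect once we are at F
carried = terminal_reward  # value of the step that enters F

# Walk the path backwards, handing a fraction of the value to each earlier cell
for state in reversed(path[:-1]):
    values[state] = carried
    carried = gamma * carried

for state in path:
    print(state, round(values[state], 2))
```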

So how can we do this in practice? How do we use information from our terminal state to update our beliefs about earlier decisions? 

## The Bellman Equation 
As described earlier, we want to *propagate the reward backwards* through the grid. The process for doing this is described by the **Bellman Equation**. To understand it, we'll first run through it by hand. 

Let's revisit our state-action value function: 
$$Q(s,a)=\mathbb{E}[R|S=s, A=a]$$

And our grid: 
$$
\begin{bmatrix}
\text{S} & - & - & \# \\
- & \# & - & - \\
- & - & - & \# \\
\# & - & - & \text{F}
\end{bmatrix}
$$

And our Q-value table: 

In [14]:
Q_values = { (i, j): {a: 0.0 for a in actions} for i in range(4) for j in range(4) }
Q_values

{(0, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (0, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (2, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}}

We'll start from our terminal state, $(3,3)$. Because $(3,3)$ is our terminal state, we know that taking **any action** in this state **offers no reward**: 
$$Q((3,3), a) = 0$$

Now let's consider state $(3,2)$. Because $(3,3)$ is the goal, the reward for moving $\text{right}$ in state $(3,2)$ is $10$. This means we can update $Q((3,2), \text{right})$ as: 
$$Q((3,2), \text{right})= 10$$

In [15]:
Q_values[(3,2)]['right'] = 10
Q_values[(3,2)]

{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 10}

What about the states adjacent to $(3,2)$? 

Let's look at $Q((3,1), a)$ first. We know that we can reach state $(3,2)$ from state $(3,1)$ by taking action $\text{right}$. We want to propagate the reward received by taking $a=\text{right}$ in state $(3,2)$ backwards. It wouldn't make sense to pass this reward backwards unchanged, since $(3,1)$ is one step further from the goal than $(3,2)$, so we instead apply our **discount factor, $\gamma$**:
$$Q((3,1), \text{right}) = \gamma \cdot Q((3,2), \text{right})$$

We'll set our discount factor $\gamma=0.9$. 

In [16]:
gamma = 0.9

We have now passed *some of the value of the terminal reward* one step backwards through our trajectory. We now have a non-zero Q-value for an action in state $(3,1)$ that can inform our agent's decisions.

In [17]:
Q_values[(3,1)]['right'] = gamma * Q_values[(3,2)]['right']
Q_values[(3,1)]

{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 9.0}

Now let's step backwards again to $Q((2,1), \text{down})$: 
$$Q((2,1), \text{down}) = \gamma \cdot Q((3,1), \text{right})$$

Again, we have propagated *some of the value* of the terminal reward backwards through our grid.

In [18]:
Q_values[(2,1)]['down'] = gamma * Q_values[(3,1)]['right']
Q_values[(2,1)]

{'up': 0.0, 'down': 8.1, 'left': 0.0, 'right': 0.0}

And the same again for each step back through $(2,0)$, $(1,0)$, and $(0,0)$: 
$$Q((2,0), \text{right}) = \gamma \cdot Q((2,1), \text{down})$$
$$Q((1,0), \text{down}) = \gamma \cdot Q((2,0), \text{right})$$
$$Q((0,0), \text{down}) = \gamma \cdot Q((1,0), \text{down})$$

Remember, we're just walking backwards through our grid and copying over a fraction of the reward at each step.

In [19]:
Q_values[(2,0)]['right'] = gamma * Q_values[(2,1)]['down']
Q_values[(1,0)]['down'] = gamma * Q_values[(2,0)]['right']
Q_values[(0,0)]['down'] = gamma * Q_values[(1,0)]['down']

We can now see how the Q-values have propagated back through our grid.

In [20]:
visited_states =[(0,0), (1,0), (2,0), (2,1), (3,1), (3,2), (3,3)]

for state in visited_states:
    print(Q_values[state])

{'up': 0.0, 'down': 5.9049000000000005, 'left': 0.0, 'right': 0.0}
{'up': 0.0, 'down': 6.561, 'left': 0.0, 'right': 0.0}
{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 7.29}
{'up': 0.0, 'down': 8.1, 'left': 0.0, 'right': 0.0}
{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 9.0}
{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 10}
{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}


We can now use these learned Q-values to enforce a path for our agent to take through the grid, simply by choosing the highest action-value for each state.

Ideally, we don't want to solve this by hand for every problem. We need a more generalisable update rule - one where we don't need to pre-define an action to take in each state.

Let's look at our Q-values for state $(3,2)$. 

In [21]:
Q_values[(3,2)]

{'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 10}

Consider our update for the adjacent state:
$$Q((3,1), \text{right}) = \gamma \cdot Q((3,2), \text{right})$$

We chose the **action corresponding to the highest Q-value in the next state**, i.e., we chose:
$$\max_{a'} Q(s', a')$$

We did this for every single state we propagated backwards through. 

This allows us to derive a general update rule: 
$$Q(s,a) = \gamma \cdot \max_{a'} Q(s', a')$$

We can also consider the case where we receive **intermediate rewards**. While most states will not have any reward, i.e., $R=0$, what if, for example, state $(1,1)$ had a power-up which gave us a small reward of $R=1$? We'd want to account for this in our Q-values, and so we would simply add the value of the reward on to any state-action pairs that lead to state $(1,1)$.  

We can finalise our update rule as: 
$$Q(s,a) = R + \gamma \cdot \max_{a'} Q(s', a')$$

This is known as the **Bellman equation** and is an example of **Dynamic Programming** because it involves overlapping subproblems - $Q(s,a)$ becomes $Q(s', a')$ for the next state-action value we're trying to calculate. 
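
As a small sketch, a single application of this rule can be written as a function. `bellman_backup` is a hypothetical helper name; the rewards and Q-values passed in are the ones from the worked example above, with $\gamma=0.9$.

```python
def bellman_backup(reward, next_q_values, gamma=0.9):
    """Q(s, a) = R + gamma * max_a' Q(s', a') for one state-action pair."""
    return reward + gamma * max(next_q_values.values())

# Moving right from (3, 2) enters the terminal state: reward 10, no future value
q_terminal = {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}
q_32_right = bellman_backup(10, q_terminal)

# Moving right from (3, 1) enters (3, 2): no reward, but discounted future value
q_32 = {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': q_32_right}
q_31_right = bellman_backup(0, q_32)

print(q_32_right, q_31_right)
```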

## Return 
Because we're talking about distant rewards, it's no longer appropriate to describe our Q-values as the 'expected reward' for taking an action $a$ in state $s$. 

Let's look at our update rule: 
$$Q(s,a) = R + \gamma \cdot \max_{a'} Q(s', a')$$

We're taking the **immediate reward**, $R$, for entering the next state (if there is one), plus a portion of the **reward propagated backwards from future states**. 

Let's look at state $(3,2)$ again: 
$$\begin{align}Q((3,2), \text{right}) &= 10 + \gamma \cdot Q((3,3), a)\\&=10 + 0.9\cdot 0
\end{align}$$

The Q-value of action '$\text{right}$' in state $(3,2)$ is equal to the immediate reward $R$ plus a portion of the reward from taking the best action in the next state, i.e., one step ahead - $R_{t+1}$: 

$$Q((3,2), \text{right})= R + \gamma \cdot R_{t+1}$$

Similarly for state $(3,1)$: 
$$Q((3,1), \text{right}) = R + \gamma \cdot (R_{t+1} + \gamma\cdot R_{t+2})$$

State $(2,1)$:
$$Q((2,1), \text{down}) = R + \gamma \cdot (R_{t+1} + \gamma \cdot (R_{t+2} + \gamma \cdot R_{t+3}))$$

Do you see a pattern here? 

This is where we introduce the concept of **Return**, denoted by $G_t$. Rather than describing an *immediate reward*, $Q(s,a)$ now describes **the total accumulated (discounted) reward from a given state onwards**: 
$$G_t=R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \dots + \gamma^k R_{t+k}$$

Or: 
$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k}$$
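
A minimal sketch of this sum for a finite list of rewards. `discounted_return` is a hypothetical helper, and the reward sequence is the six-step route from $\text{S}$ to $\text{F}$ that we walked by hand, with $\gamma=0.9$.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k}, for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Six steps from S to F: only the final step pays out
rewards_along_path = [0, 0, 0, 0, 0, 10]
print(round(discounted_return(rewards_along_path, gamma=0.9), 4))
```

The result matches the value we propagated back to $Q((0,0), \text{down})$ by hand.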

Previously, we defined our Q-values (action-values) as the **expected reward for taking action $a$ in state $s$**, i.e.: 
$$Q(s,a)=\mathbb{E}[R|S=s,A=a]$$

Now that we are factoring in **expected future rewards**, we redefine our Q-values as: 
$$Q(s,a)=\mathbb{E}[G_t|S=s, A=a]$$
Equivalently:
$$Q(s,a)=\mathbb{E}[\sum_{k=0}^\infty \gamma^k R_{t+k} | S=s, A=a]$$

We can now describe our action-value function $Q(s,a)$ as the **expected return when starting in state $s$, taking action $a$, and selecting the optimal action in every state thereafter**. 

## State-Value Function
Similarly to the **action-value function**, $Q(s,a)$, which describes **the value of taking a given action in a given state**, we can also define a new concept: the **State-Value Function**, expressed as **$V(s)$**, which describes the **value of being in a given state**.

The state-value function is defined by the **maximum expected return achievable** by selecting the best possible action in that state: 

$$V(s)=\max_{a}Q(s,a)$$
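
As a brief sketch, $V(s)$ can be read straight off a Q-value dictionary. `Q_example` and `V_example` are illustrative names; the numbers are the ones we computed by hand for three states.

```python
# A few Q-values from the worked example above (by-hand values)
Q_example = {
    (3, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 9.0},
    (3, 2): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 10.0},
    (3, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
}

# V(s) = max_a Q(s, a)
V_example = {state: max(q.values()) for state, q in Q_example.items()}
print(V_example)
```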

## Back to our grid
Let's use the Bellman equation to build a solution to tackle our gridworld. We set our problem up as earlier: 

In [22]:
grid = np.array([
    ['S', '-', '-', '#'],
    ['-', '#', '-', '-'],
    ['-', '-', '-', '#'],
    ['#', '-', '-', 'F']
])

actions = ['up', 'down', 'left', 'right']

rewards = { (3, 3): 10}

Q_values = { (i, j): {a: 0.0 for a in actions} for i in range(4) for j in range(4) }

def is_valid(state):
    i, j = state
    return 0 <= i < 4 and 0 <= j < 4 and grid[i, j] != '#'

def get_next_state(state, action):
    i, j = state
    if action == 'up': next_state = (i-1, j)
    elif action == 'down': next_state = (i+1, j)
    elif action == 'left': next_state = (i, j-1)
    elif action == 'right': next_state = (i, j+1)
    
    if is_valid(next_state): return True, next_state
    else: return False, state

But we now replace the $\text{play}$ function with an $\text{update}$ function to propagate our reward from the terminal state back through the grid using the Bellman equation we just derived.

In [23]:
def update():
    # We first iterate over every state
    for state in Q_values.keys():

        # We can skip the terminal state and any obstacle
        if state == (3, 3) or grid[state] == '#':
            continue

        # We iterate over every action in each state
        for action in actions:
            # For each state, we find every next state reachable.
            # The flag is named `valid` so it does not shadow the is_valid() function.
            valid, next_state = get_next_state(state, action)
            if valid:
                reward = rewards.get(next_state, 0) # And check if they offer a reward
                
                # We update the value of the current state-action according to the 
                # maximum state-action value in the next state
                # Bellman equation: Q(s,a) = R + gamma * max Q(s', a')
                Q_values[state][action] = reward + gamma * max(Q_values[next_state].values())

We iterate over our update function $10$ times to allow our values to propagate backwards through our grid and converge. 

In [24]:
for _ in range(10):
    update()
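
The number of sweeps above is fixed by hand. A common alternative is to repeat the sweep until the values stop changing. The sketch below is a self-contained variant under that assumption: `sweep`, `step` and the tolerance `tol` are hypothetical names and values, not part of the notebook's code.

```python
layout = ["S--#", "-#--", "---#", "#--F"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
goal, goal_reward, gamma = (3, 3), 10, 0.9

cells = [(i, j) for i in range(4) for j in range(4) if layout[i][j] != "#"]
Q = {s: {a: 0.0 for a in moves} for s in cells}

def step(state, action):
    """Return the cell reached, or None if the move is blocked."""
    i, j = state[0] + moves[action][0], state[1] + moves[action][1]
    return (i, j) if (i, j) in Q else None

def sweep():
    """Apply the Bellman update to every state-action pair once.

    Returns the largest absolute change made to any Q-value.
    """
    biggest_change = 0.0
    for state in cells:
        if state == goal:
            continue
        for action in moves:
            next_state = step(state, action)
            if next_state is None:
                continue  # blocked moves keep a value of 0, as in update()
            reward = goal_reward if next_state == goal else 0
            new_value = reward + gamma * max(Q[next_state].values())
            biggest_change = max(biggest_change, abs(new_value - Q[state][action]))
            Q[state][action] = new_value
    return biggest_change

tol = 1e-9  # hypothetical tolerance
sweeps = 0
while True:
    sweeps += 1
    if sweep() < tol:
        break

print(sweeps, round(Q[(0, 0)]['down'], 2))
```

Stopping on the size of the largest change avoids having to guess the number of sweeps in advance, which matters more on larger grids.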

We can now check to see what our agent has learned by examining our Q-value dictionary.

In [25]:
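# Round every Q-value to 2 decimal places for display.
# `action_values` avoids reusing the name of the global `actions` list.
Q_values = {
    state: {action: round(value, 2) for action, value in action_values.items()}
    for state, action_values in Q_values.items()
}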
Q_values

{(0, 0): {'up': 0.0, 'down': 5.9, 'left': 0.0, 'right': 5.9},
 (0, 1): {'up': 0.0, 'down': 0.0, 'left': 5.31, 'right': 6.56},
 (0, 2): {'up': 0.0, 'down': 7.29, 'left': 5.9, 'right': 0.0},
 (0, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 0): {'up': 5.31, 'down': 6.56, 'left': 0.0, 'right': 0.0},
 (1, 1): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (1, 2): {'up': 6.56, 'down': 8.1, 'left': 0.0, 'right': 6.56},
 (1, 3): {'up': 0.0, 'down': 0.0, 'left': 7.29, 'right': 0.0},
 (2, 0): {'up': 5.9, 'down': 0.0, 'left': 0.0, 'right': 7.29},
 (2, 1): {'up': 0.0, 'down': 8.1, 'left': 6.56, 'right': 8.1},
 (2, 2): {'up': 7.29, 'down': 9.0, 'left': 7.29, 'right': 0.0},
 (2, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 0): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0},
 (3, 1): {'up': 7.29, 'down': 0.0, 'left': 0.0, 'right': 9.0},
 (3, 2): {'up': 8.1, 'down': 0.0, 'left': 8.1, 'right': 10.0},
 (3, 3): {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}}

Note that there are some values that don't seem to make a lot of sense. Why, in state $(3,2)$, would we assign a Q-value of $8.1$ to '$\text{left}$'? That's moving us in the opposite direction to our reward!

Consider how our values are updated: we take the **Q-value of the most rewarding action in the adjacent state and use that to assign a Q-value to the action we took to get there**. This means that if we have already updated our grid once and assigned $Q((3,1), \text{right})=9$, the next time we run our update rule it will be: 
$$\begin{align}Q((3,2), \text{left})&=R+\gamma \cdot Q((3,1), \text{right})\\
&= 0 + 0.9 \cdot 9 \\
&= 8.1
\end{align}$$ 

Now that we have our Q-values, how can our agent use them to navigate the grid? 

This is where the concept of **policy** comes in. 

## Policy
A policy, denoted by $\pi$, defines the strategy that an agent follows to decide which action to take in a given state. 

There are two types of policy: 
- **Deterministic** policies
- **Stochastic** policies

#### Deterministic Policies
A deterministic policy will select the same action every time in a given state:
$$\pi(s)=a$$

Consider our gridworld and the accompanying Q-values we learned using the Bellman equation. 

If our agent selected the **action with the highest Q-value** in each state **every time** it navigated the grid, we would say that it is following a deterministic policy.

The policy that selects the action with the highest Q-value in every state is known as the **optimal policy**, which is denoted by $\pi^*$: 
$$\pi^*(s)= \arg\max_{a}Q(s,a)$$

When following the optimal policy, we can retrieve the value of any state-action pair (Q-value) as:
$$Q(s,a)=\mathbb{E}[\sum_{t=0}^\infty \gamma^t R_t |S = s, A= a]$$
where $t$ indexes the time steps.

We can visualise the route our agent takes across our gridworld when following the optimal policy. 

In [29]:
action_symbols = {
    "up": "↑",
    "down": "↓",
    "left": "←",
    "right": "→"
}

def get_best_action(state):
    return max(Q_values[state], key=Q_values[state].get)

def visualise_path():
    grid_viz = grid.copy()

    state = (0, 0)
    path = []
    max_steps = 20

    for _ in range(max_steps):
        path.append(state)
        if state == (3,3):
            break

        best_action = get_best_action(state)
        _, next_state = get_next_state(state, best_action)

        if next_state == state:
            break

        state = next_state
    
    for (i, j) in path:
        if grid_viz[i, j] not in ['S', 'F']:
            grid_viz[i, j] = action_symbols[get_best_action((i, j))]

    for row in grid_viz:
        print(" ".join(row))

visualise_path()

S - - #
↓ # - -
→ ↓ - #
# → → F


#### Stochastic Policies
A stochastic policy assigns a **probability distribution over possible actions**:
$$\pi(a|s)=P(A=a|S=s)$$
that is, $\pi(a|s)$ is the probability of taking action $a$ in state $s$. 

We see from our Q-value table that some states have multiple actions with non-zero Q-values. Let's take state $(2,2)$ as an example. 
$$\begin{align}
\text{up} = 7.29\\
\text{down} = 9.0\\
\text{left} = 7.29\\
\text{right} = 0\\
\end{align}$$

We can **map these values to a probability distribution** by using an operation such as **softmax** (we can use other operations too), which maps them to the range $[0,1]$ so that they sum to $1$: 
$$P(a)=\frac{e^{Q(a)}}{\sum_{a'}e^{Q(a')}}$$

Let's see how that works. 

In [27]:
qvalues_22 = {'up': 7.29, 'down': 9.0, 'left': 7.29, 'right': 0.0}

values = np.array(list(qvalues_22.values()))
exp = np.exp(values)
probabilities = (exp / np.sum(exp)).tolist()

print({action: round(prob,3) for action, prob in zip(qvalues_22.keys(), probabilities)})

{'up': 0.133, 'down': 0.734, 'left': 0.133, 'right': 0.0}


Now our agent has a probability distribution over actions in state $(2,2)$: 
$$\begin{align}
\text{up} = 13.3\%\\
\text{down} = 73.4\%\\
\text{left} = 13.3\%\\
\text{right} = 0\%\\
\end{align}$$

In each state, our agent will sample an action from this distribution. **This will change our expected return**. The expected return will now become a **weighted sum over all possible actions**, rather than just the max-action we have seen previously. We'll explore this in more depth in another notebook.
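
As a rough sketch of both ideas, the code below samples actions from the softmax distribution for state $(2,2)$ and computes the probability-weighted sum of its Q-values. The variable names and the random seed are illustrative choices. Note that these Q-values were computed assuming the best action is taken in every later state, so the weighted sum is only a first approximation of the value under a stochastic policy.

```python
import numpy as np

qvalues_22 = {'up': 7.29, 'down': 9.0, 'left': 7.29, 'right': 0.0}
action_names = list(qvalues_22)
q = np.array([qvalues_22[a] for a in action_names])

# Softmax over the Q-values gives pi(a | s) for s = (2, 2)
policy = np.exp(q) / np.sum(np.exp(q))

# Sample a handful of actions from the policy (seed chosen arbitrarily)
rng = np.random.default_rng(0)
sampled = rng.choice(action_names, size=10, p=policy)
print(list(sampled))

# Value of the state under each kind of action selection
greedy_value = float(q.max())        # always take the best action
weighted_value = float(policy @ q)   # sum_a pi(a | s) * Q(s, a)
print(greedy_value, round(weighted_value, 2))
```

The weighted value is lower than the greedy value because some probability is spent on actions with smaller Q-values.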

# Review
In this notebook, we've introduced some new foundational concepts in reinforcement learning: 
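- **State**: the agent's location in the grid, $s \in \mathcal{S}$.
- **Expected return**: the expected discounted sum of future rewards, $\mathbb{E}[G_t]$.
- **State-action values (Q-values)**: the expected return for taking action $a$ in state $s$, $Q(s,a)$.
- **Markov Decision Processes (MDPs)**: a model for sequential decision making, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$.
- **Solving MDPs using the Bellman Equation and Dynamic Programming**: propagating rewards backwards through the grid with $Q(s,a) = R + \gamma \cdot \max_{a'} Q(s', a')$.
- **Policies**: deterministic ($\pi(s)=a$) and stochastic ($\pi(a|s)$) strategies for choosing an action in each state.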

# Next ...