# Grid Worlds

We are exploring some grid world problems through the eyes of Reinforcement Learning and Dynamic Programming. We will deal with two problems: an agent running through a maze, and a game of tag. So let's start with the maze.

## the world and the agent: states and actions


A grid world, for us, is an $n \times m$ grid, where each pair of coordinates represents a position, which we will call a **state**.

The agent is like a little robot that finds itself in this world. It will move through the world by selecting **actions** (up, down, left, right, and the diagonals if we want to include them).

The world will contain non-empty spaces:

* initial state: a state where the agent will start
* terminal states: one or more states the agent is trying to reach
* walls: spaces the agent can't enter
* traps: states the agent should avoid, since they send it back to the initial state

The goal of our agents will be to reach a terminal state as fast as possible while avoiding the traps.

Below is an example of a world.

In [1]:
from notebooks.utils.worlds import small_world_03
from grid_world.visualization.format_objects import get_world_str

print(get_world_str(small_world_03))

6                ✘ 

5 ☠  ☠     █     █ 

4          █       

3    ☠  ☠  █  █    

2                  

1    █  █  █  █    

0 ⚐                

  0  1  2  3  4  5 


To add some stochasticity we can also add "wind" to the world. At the chosen states, the wind has some probability of pushing the agent in a given direction. For instance, if the agent is at state (1, 1), which has a wind going up with probability 0.1, and selects action left, it will go to (0, 1) with probability 0.9 and to (0, 2) with probability 0.1, assuming both states are empty.

We will use $S$ to denote the set of possible states and $A$ the set of actions.

### dynamics and the world model

Another important ingredient of the world is its dynamics, which we call a "world model". Mathematically, this is a function:

$$ M_w: S \times A \to \mathbb{P}(S) $$

where $M_w$ gives, for each state-action pair $(s,a)$, a probability distribution over the states $S$ that indicates the probability of moving to each state when taking action $a$ in state $s$. This means that $M_w(s,a): S \to [0, 1]$ is also a function, and $M_w(s,a)(s_0)$ is the probability of getting to $s_0$ when taking action $a$ in state $s$. In a deterministic world these values will always be 1 or 0; in a stochastic environment they may take any value in between.
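A minimal sketch of such a world model may help make the definition concrete. It is not the implementation used in this project: the names `world_model` and `MOVES`, the `(dx, dy)` action encoding, and the rule that the wind acts after the chosen move are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

State = Tuple[int, int]

# Hypothetical action encoding: each action is a (dx, dy) displacement.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def world_model(
    state: State,
    action: str,
    wind: Optional[Dict[State, Tuple[str, float]]] = None,
    shape: Tuple[int, int] = (6, 7),
    walls: frozenset = frozenset(),
) -> Dict[State, float]:
    """Return M_w(state, action) as a dict {next_state: probability}."""
    wind = wind or {}

    def move(s: State, direction: str) -> State:
        dx, dy = MOVES[direction]
        target = (s[0] + dx, s[1] + dy)
        inside = 0 <= target[0] < shape[0] and 0 <= target[1] < shape[1]
        # Hitting a wall or the border leaves the agent where it was.
        return target if inside and target not in walls else s

    landed = move(state, action)
    distribution = {landed: 1.0}
    if state in wind:
        # Assumption: the wind of the starting state pushes the agent
        # one extra step after the chosen move, with probability p.
        wind_direction, p = wind[state]
        pushed = move(landed, wind_direction)
        distribution = {landed: 1.0 - p}
        distribution[pushed] = distribution.get(pushed, 0.0) + p
    return distribution


# The wind example from the text: at (1, 1), wind going up with probability 0.1.
print(world_model((1, 1), "left", wind={(1, 1): ("up", 0.1)}))
```

Representing $M_w(s,a)$ as a dictionary keeps the deterministic case (a single entry with probability 1) and the stochastic case in the same shape.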

## effects and rewards

Every time our agent takes an action the world sends a signal back, which we are calling the **effect**. This signal tells the agent whether it landed on a trap (-1), the terminal state (1) or an empty state (0). If the agent hits a wall it stays where it was, which is an empty state, so the effect alone won't be enough to differentiate these cases. This signal is what the agent is supposed to use to guide its behaviour, i.e. to learn whether its actions were good or not. We will denote it by

$$ E: S \rightarrow \mathbb{Z} $$

Based on this effect our agent will determine a **reward**, so if the agent receives a desired effect it should determine a larger reward than it would for an undesired one. We are treating the reward as something internal to the agent, as opposed to the effect, which is something it receives from the world (more precisely, from the training process). The idea here is that different rewards can determine different behaviours for the agent in the same world, so we are viewing this as a change in the agent itself. Technically the reward is a function from an effect to a float (we will denote this as $R_0$); however, we will usually think of the function below when we are talking about rewards:

$$ \displaylines{R: S \rightarrow \mathbb{R} \\ R(s) = R_0(E(s))}$$

Notice that $E$ is deterministic, since it only depends on the state the agent ends at, which means our reward will also be deterministic. However, taking the same action at a given state can lead to different rewards, since the world dynamics aren't necessarily deterministic.

Most of the time we will use the reward determined by the function below; we call it the "basic reward":

$$R_0(i) =    
     \begin{cases}
       -1 &\quad\text{if } i = 0 \\
       0 &\quad\text{if } i = 1 \\
       -100 &\quad\text{if } i = -1 \\
     \end{cases}
$$
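The basic reward and the composition $R(s) = R_0(E(s))$ can be sketched in a few lines. The names `basic_reward`, `make_state_reward` and `toy_effect` are hypothetical, and the trap and terminal positions in `toy_effect` are made up for the example.

```python
# Effect codes from the text: -1 = trap, 0 = empty state, 1 = terminal state.
BASIC_REWARD = {0: -1.0, 1: 0.0, -1: -100.0}


def basic_reward(effect: int) -> float:
    """R_0: map an effect to the reward the agent assigns to it."""
    return BASIC_REWARD[effect]


def make_state_reward(effect_fn, reward_fn=basic_reward):
    """Build R(s) = R_0(E(s)) from an effect function E."""
    return lambda state: reward_fn(effect_fn(state))


def toy_effect(state):
    """Hypothetical E: one trap at (1, 3) and the terminal state at (5, 6)."""
    if state == (1, 3):
        return -1
    if state == (5, 6):
        return 1
    return 0


R = make_state_reward(toy_effect)
print(R((0, 0)), R((5, 6)), R((1, 3)))  # -1.0 0.0 -100.0
```

Charging -1 for every empty state is what makes short paths attractive: each extra step before reaching the terminal state lowers the total reward collected.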

## episodes and returns

An agent's run through the world, from start to finish, is called an **episode**. We usually view episodes as a sequence of the form

$$ 
s_0 \xrightarrow[]{a_0} r_0, \; s_1 \xrightarrow[]{a_1} r_1, \; s_2  \xrightarrow[]{a_2} r_2, \; s_3 \dots \\
$$

where $s_t, a_t, r_t, s_{t+1}$ correspond respectively to the state the agent is in at time $t$, the action it chooses, the reward it gets and the new state it ends at. This is slightly different from the Sutton and Barto book, since we are using $r_t$ where they would use $r_{t+1}$ (I will say I find their notation a little weird in this regard).

The reward $r_t$ may give an immediate idea of how good an action was, but in order to actually learn paths, which involve taking multiple actions in sequence, we will need to look ahead to all the rewards we got afterwards in that episode. So in order to determine how good an episode is, or more generally how well the agent did from time $t$ onwards, we use a cumulative function of the rewards called the return:

$$\displaylines{G: \mathbb{Z} \to \mathbb{R} \\ G_t = \sum_{k = 0}^{\infty} \gamma^k r_{t+k}} $$

where $\gamma \in [0,1]$ is a parameter we must choose, and $r_t=0$ for $t>T$, with $T$ denoting the last time step of the episode. For this problem we will usually set $\gamma = 1$, so we will essentially look at the sum of all rewards we get after selecting an action.

With this in place, the goal will be for the agent to maximize the return it gets in a given episode.
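As a small illustration, the return can be computed from a finished episode's list of rewards by accumulating backwards, using $G_t = r_t + \gamma G_{t+1}$. The function name `discounted_return` is just a label chosen for this sketch.

```python
def discounted_return(rewards, gamma=1.0, t=0):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finished episode."""
    g = 0.0
    # Walk backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


# Three steps through empty states, then the terminal state (basic reward).
episode_rewards = [-1.0, -1.0, -1.0, 0.0]
print(discounted_return(episode_rewards))             # -3.0
print(discounted_return(episode_rewards, gamma=0.5))  # -1.75
print(discounted_return(episode_rewards, t=2))        # -1.0
```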

## policies and value functions

The agent's behaviour is specified by what we call a policy, usually denoted by $\pi$. The policy gives, for each state $s$, a probability distribution over the actions $A$, i.e. $\pi: S \to \mathbb{P}(A)$. In practice we just define
$$ \displaylines{\pi: S \times A \to [0,1] \\ \text{satisfying: }\sum_{a \in A}\pi(s, a) = 1 \; \; \; \forall s \in S}$$
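One simple way to hold such a policy in code is a nested dictionary, `policy[s][a] = pi(s, a)`. The sketch below is only an illustration; the helper names `uniform_policy`, `is_valid_policy` and `sample_action` are hypothetical.

```python
import random


def uniform_policy(states, actions):
    """A policy that picks every action with the same probability."""
    return {s: {a: 1.0 / len(actions) for a in actions} for s in states}


def is_valid_policy(policy, tol=1e-9):
    """Check that pi(s, .) is a probability distribution for every state."""
    return all(
        all(p >= 0.0 for p in dist.values())
        and abs(sum(dist.values()) - 1.0) < tol
        for dist in policy.values()
    )


def sample_action(policy, state, rng=random):
    """Draw an action a with probability pi(state, a)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


states = [(x, y) for x in range(2) for y in range(2)]
actions = ["up", "down", "left", "right"]
pi = uniform_policy(states, actions)
print(is_valid_policy(pi), sample_action(pi, (0, 0)) in actions)
```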

Now, given a world and a policy, there is an expected return we will get by following this policy from a given state. This is called a value function

$$ \displaylines{
v_\pi: S \to \mathbb{R} \\
v_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s )
}
$$

Here $\mathbb{E}_\pi$ denotes the expected value under policy $\pi$.

Very similar, and more useful in practice, is another value function
$$ \displaylines{
Q_\pi: S \times A \to \mathbb{R} \\
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s \land a_t = a)
}
$$ 
which gives the expected return for a state-action pair.
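Two standard relations, written here in this notebook's notation, tie these definitions together. The return unrolls one step at a time, and averaging $Q_\pi$ over the actions the policy would pick recovers $v_\pi$:

$$ \displaylines{
G_t = r_t + \gamma \, G_{t+1} \\
v_\pi(s) = \sum_{a \in A} \pi(s, a) \, Q_\pi(s, a)
}
$$

The second line is why a good estimate of $Q_\pi$ is enough to evaluate states as well: $v_\pi$ can always be rebuilt from it.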


## Solutions

The solutions we will go through in this project involve learning a good policy $\pi$. This is usually done through a process called Generalized Policy Iteration (GPI), where we simultaneously improve estimates for a $Q$ function and the policy $\pi$. Since the policy keeps changing as we go, the $Q$ function we will be updating isn't necessarily approximating a particular value function $Q_\pi$ (especially because we don't know the policy we are trying to reach in the first place); nevertheless, this is done in a smart way that lets us get to good policies. Some details are provided in the specific notebooks, but to better understand what is going on in GPI I highly recommend reading ["Reinforcement Learning: An Introduction"](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf).

The solutions we will explore can be viewed in two groups.

### Dynamic Programming

Dynamic Programming is great. It completely solves the problem, in the sense that it is capable of determining the optimal policy an agent should follow. It has a drawback though: we need to know the world model! So as great as it is, in some cases it just presupposes too much knowledge about the problem.

If we think about it, having a model of the world should indeed be enough to find an optimal solution. This doesn't mean that it is easy to find this solution, and having an algorithm capable of doing so is great. Our goal in the DP section is to explore such an algorithm.
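To give a flavour of how a known world model can be turned into a policy, here is a minimal sketch of value iteration, one standard DP algorithm (the DP notebook may well use a different one). Everything in it is an assumption made for illustration: the function signature, and the toy world, which is a four-state corridor with the terminal state at the right end and no traps.

```python
def value_iteration(states, actions, model, reward, terminal, gamma=1.0, tol=1e-9):
    """model(s, a) -> {next_state: probability}; reward(next_state) -> float."""
    v = {s: 0.0 for s in states}

    def q(s, a):
        # Expected reward plus discounted value over the possible next states.
        return sum(p * (reward(s2) + gamma * v[s2]) for s2, p in model(s, a).items())

    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                continue  # the value of a terminal state stays at 0
            best = max(q(s, a) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states if s not in terminal}
    return v, policy


# Hypothetical toy world: a corridor 0 - 1 - 2 - 3 with terminal state 3.
def corridor_model(s, a):
    return {min(max(s + a, 0), 3): 1.0}


def corridor_reward(s2):
    return 0.0 if s2 == 3 else -1.0  # basic reward: -1 per empty state


values, greedy = value_iteration([0, 1, 2, 3], [-1, 1], corridor_model, corridor_reward, {3})
print(values)  # {0: -2.0, 1: -1.0, 2: 0.0, 3: 0.0}
print(greedy)  # {0: 1, 1: 1, 2: 1}
```

Note that the algorithm never simulates an episode: it only queries `model` and `reward`, which is exactly the knowledge exploration based methods do not have.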

### Exploration based methods

There is a group of methods that I am calling "exploration" based methods. By these I mean any algorithm that relies on a process like this:

* an agent interacts with a world by taking actions and completing episodes
* it uses the information it gathers from these events to improve its policy

This is different in an essential way, as the agent is not given information on how the world works beforehand; it has to "go out there" and collect it. 

The actions the agent takes during an episode affect both the returns it gets and the data it collects (the states it will visit). This leads to a situation called the "exploration vs exploitation" dilemma, where the agent has to decide whether it wants to stay on what it believes is the best path, or go exploring for new information.

There are many solutions that fall under this group, including Monte Carlo methods and all variations of TD learning.
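As one illustrative instance of this family, the sketch below pairs $\epsilon$-greedy action selection (a common way to balance exploration and exploitation) with a single Q-learning update, which is one TD learning variant. It is not necessarily what the agents in the specific notebooks do, and the names `epsilon_greedy` and `q_learning_update` and their parameters are assumptions.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon explore, otherwise exploit the current estimates."""
    if rng.random() < epsilon:
        return rng.choice(actions)                    # explore
    return max(actions, key=lambda a: q[(state, a)])  # exploit


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=1.0, terminal=False):
    """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a)."""
    future = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * future
    q[(state, action)] += alpha * (target - q[(state, action)])


q = defaultdict(float)  # all estimates start at 0
actions = ["left", "right"]
q_learning_update(q, state=0, action="right", reward=-1.0, next_state=1, actions=actions)
print(q[(0, "right")])                               # -0.5
print(epsilon_greedy(q, 0, actions, epsilon=0.0))    # left
```

The update only uses one observed transition `(state, action, reward, next_state)`, so no world model is required.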

Our focus will always be on data efficiency, i.e. how returns evolve as episodes go by. In general we are not interested in computational efficiency.

# The tag game

So far we have discussed the problem of an agent trying to navigate a maze to find the terminal state. But we can use our world and agents to address another interesting case: a game of tag between two agents.

This may sound complicated, but it is essentially the same problem (or at least a very similar one). The main difference is that the two agents will be learning their policies at the same time. Since the agents are competing against each other, the policy one learns is trying to beat the policy of the other agent. Because of this, any change in policy for one agent will correspond to a change in the dynamics of the world for the other one. This is usually called Adversarial Reinforcement Learning (ARL).

Agent 1's goal is to catch Agent 2, while Agent 2 is trying to survive $k$ rounds. There is always a winner and a loser in each episode.

## States, Effects and Rewards in the tag game.

To make this problem feasible for our agents, we will let them use more information to select their actions; in particular, they will have access to the position of the other agent as well as their own. So for this case we will represent a state as a pair of coordinates (for consistency we will use $s^i_t$ to represent Agent $i$'s position at time $t$):
$$S^1_t = (s^1_t, s^2_t)$$ 
$$S^2_t = (s^1_{t+1}, s^2_t)$$
are the states of Agent 1 and Agent 2 at time $t$ (since Agent 2 goes second, it can see Agent 1's move before making its choice). This way of representing states doesn't actually give us the Markov property, since we would also need to include how many rounds are still remaining, but this would make the cardinality of our state space way too high.

We will also adapt the effect sent by the training process for each agent:

$$ e^1_t =
    \begin{cases}
        1 &\quad\text{if } s_{t+1}^1 = s_t^2 \text{ or } s_{t+1}^1 = s_{t+1}^2 \\
        -1 &\quad\text{if } s_{t+1}^1 \text{ is a trap or } t+1=k \\
        0 &\quad\text{otherwise}
    \end{cases}
$$

and

$$ e^2_t =
    \begin{cases}
        1 &\quad\text{if } t+1 = k \\
        -1 &\quad\text{if } s_{t+1}^2 \text{ is a trap or } s_{t+1}^1 = s_{t+1}^2 \text{ or } s_{t+2}^1 = s_{t+1}^2 \\
        0 &\quad\text{otherwise}
    \end{cases}
$$

This means that at time $t$ Agent 1 has to wait for Agent 2 to take its next action in order to receive an effect, and vice-versa.
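The two effect definitions translate almost line by line into code. In the sketch below the function and argument names are hypothetical, and one assumption is added: when several cases apply at once, the first one listed wins, which the formulas themselves do not state.

```python
def effect_agent_1(s1_next, s2_now, s2_next, t, k, traps=frozenset()):
    """e^1_t, given s^1_{t+1}, s^2_t and s^2_{t+1}."""
    if s1_next == s2_now or s1_next == s2_next:
        return 1   # Agent 1 caught Agent 2
    if s1_next in traps or t + 1 == k:
        return -1  # fell in a trap, or ran out of rounds
    return 0


def effect_agent_2(s1_next, s1_after, s2_next, t, k, traps=frozenset()):
    """e^2_t, given s^1_{t+1}, s^1_{t+2} and s^2_{t+1}."""
    if t + 1 == k:
        return 1   # survived all k rounds
    if s2_next in traps or s1_next == s2_next or s1_after == s2_next:
        return -1  # fell in a trap, or got caught
    return 0


# Agent 2 steps onto the square Agent 1 has just moved to, at t = 0 with k = 10.
print(effect_agent_1((2, 2), (2, 3), (2, 2), t=0, k=10))  # 1
print(effect_agent_2((2, 2), (2, 1), (2, 2), t=0, k=10))  # -1
```

The arguments make the waiting explicit: `effect_agent_1` needs Agent 2's next position, and `effect_agent_2` needs the position Agent 1 reaches one move later.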

We will give rewards in a very simple way, just indicating who wins the round:

$$R_0(i) = i$$

We will also need to set $\gamma < 1$, usually $0.9$, so that agents are incentivised to either shorten or prolong the episodes.
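To see why the discount matters here, consider a simplified case: with $R_0(i) = i$ rewards equal effects, and suppose an agent's only non-zero reward in an episode arrives at time $n$. Then

$$ G_0 = \sum_{k = 0}^{\infty} \gamma^k r_k = \gamma^n r_n $$

so the winner gets $G_0 = \gamma^n$, which is largest when $n$ is small, and the loser gets $G_0 = -\gamma^n$, which is closest to zero when $n$ is large. With $\gamma = 0.9$, winning at $n = 2$ is worth $0.81$, while winning at $n = 10$ is worth about $0.35$. With $\gamma = 1$ both would be worth exactly $1$, and the length of the episode would not matter to either agent.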

## Training

Training is a little more complex than for the maze. Our sequence of events looks like this:


$$ 
S^1_0 \xrightarrow[]{a^1_0} S^2_0 \xrightarrow[]{a^2_0} r^1_0, \; S^1_1  \xrightarrow[]{a^1_1} r^2_0, \; S^2_1 \xrightarrow[]{a^2_1} r^1_1, \; S^1_2 \xrightarrow[]{a^1_2} r^2_1, \; S^2_2 \dots \\
$$
                

Notice that, at time $t$, Agent 1 will only have the usual information required for making an update after Agent 2 takes an action, since only then will it know $r^1_t$ and $S^1_{t+1}$, which is its next state ($S^2_t$ is not the state it needs, as this is not a state for which it chooses an action). The same applies to Agent 2.

So the training loop for each episode is:

---
**Training Loop**


* determine $s^1_{0}$, $s^2_{0}$, $a^1_0$ and $s^1_{1}$
* Loop until $e^1_t = 1 $ or $e^2_t = 1$: 
    * Agent 2 selects and takes action $a^2_t$ for $(s^1_{t+1}, s^2_{t})$ and goes to state $s^2_{t+1}$
    * Agent 1 receives effect $e^1_t$ and updates $Q^1(S^1_t, a^1_t)$ aware of $S^1_{t+1}$.
    * Agent 1 selects and takes action $a^1_{t+1}$ and goes to state $s^1_{t+2}$
    * Agent 2 receives effect $e^2_t$ and updates $Q^2(S^2_t, a^2_t)$ aware of $S^2_{t+1}$
    * increment t

---
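The loop above can be sketched in Python as follows. This is a rough illustration under several assumptions, not the project's training code: `RandomAgent` is a hypothetical stand-in that acts at random and only records the updates it is asked to make, the world is an empty square grid with no walls or traps, and rewards equal effects as in $R_0(i) = i$.

```python
import random

MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]


class RandomAgent:
    """Hypothetical stand-in for a learning agent."""

    def __init__(self, rng):
        self.rng = rng
        self.updates = []

    def select(self, state):
        return self.rng.choice(MOVES)

    def update(self, state, action, reward, next_state):
        # A real agent would update Q(state, action) here.
        self.updates.append((state, action, reward, next_state))


def step(pos, move, size):
    x, y = pos[0] + move[0], pos[1] + move[1]
    return (x, y) if 0 <= x < size and 0 <= y < size else pos


def run_episode(agent1, agent2, start1, start2, k, size=4):
    s1, s2 = start1, start2           # s^1_0, s^2_0
    S1 = (s1, s2)                     # S^1_0
    a1 = agent1.select(S1)            # a^1_0
    s1 = step(s1, a1, size)           # s^1_1
    t = 0
    while True:
        S2 = (s1, s2)                 # S^2_t = (s^1_{t+1}, s^2_t)
        a2 = agent2.select(S2)
        s2_new = step(s2, a2, size)   # s^2_{t+1}
        # Effect e^1_t (no traps in this sketch).
        e1 = 1 if (s1 == s2 or s1 == s2_new) else (-1 if t + 1 == k else 0)
        S1_new = (s1, s2_new)         # S^1_{t+1}
        agent1.update(S1, a1, e1, S1_new)
        a1_new = agent1.select(S1_new)
        s1_new = step(s1, a1_new, size)  # s^1_{t+2}
        # Effect e^2_t (no traps in this sketch).
        e2 = 1 if t + 1 == k else (-1 if (s1 == s2_new or s1_new == s2_new) else 0)
        S2_new = (s1_new, s2_new)     # S^2_{t+1}
        agent2.update(S2, a2, e2, S2_new)
        if e1 == 1 or e2 == 1:
            return t + 1, e1, e2
        S1, a1, s1, s2 = S1_new, a1_new, s1_new, s2_new
        t += 1


rng = random.Random(0)
agent1, agent2 = RandomAgent(rng), RandomAgent(rng)
rounds, e1, e2 = run_episode(agent1, agent2, (0, 0), (3, 3), k=10)
print(rounds, e1, e2)
```

Because $e^2_t = 1$ as soon as $t + 1 = k$, the loop always stops after at most $k$ iterations, and each agent makes exactly one update per iteration.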

The setup is a little more complex but, apart from this, we can just use the same agents and methods we developed for the maze problem (except for the specialized ones).
