# Intro to Reinforcement Learning

### Prerequisites
- [Basic Python]()
- [Optimisation]()
- [Logistic regression]() (for understanding softmax function)

![](images/agent-env-loop.png)

## What is reinforcement learning?

Reinforcement learning (RL) is a way to train machines to perform tasks by rewarding them when they do well.

We affect our environment by taking actions. Then we observe the new state of the environment and may or may not receive some reward (or punishment) from the environment for taking that action.

### Why reinforcement learning?

This is how we, as humans, learn to do almost everything we know.

Reinforcement learning is the mathematical formulation for how this learning occurs.

## Problem setup

### Can we train a reinforcement learning agent to balance a pole on top of a cart?
![](images/cartpole.gif)

## Definitions

To start understanding RL, it's useful to define a few things...

### Agent
An agent is something that can interact with its environment to affect it: it has *agency* over its environment. Examples include a robot, a character in a video game, or a computer system controlling traffic light signals in a city.

### Environment
The environment is what the agent can interact with. For a robot that might be a real environment like a room full of objects. It may be a simulation or a digital environment. Or it could be something more abstract like a set of traffic lights.

### State

At any point in time, the environment has a ***state*** which defines everything in the environment (perhaps as well as the agent's own state). This could include everything from all object positions, to the temperature, to information about other agents in the environment.

At time $t$, a state $s_t$ represented completely by $k$ variables is a $k$-vector:
$s_t = \begin{bmatrix}s^1_t \\ \vdots \\ s^k_t\end{bmatrix}$

### Observation
An agent may not be able to know (or even sense) the whole state. In this case, what it can directly observe is called an **observation**.

E.g. We as humans do not know the state of our entire environment (the universe). We can only observe a small part of it at a time.

E.g. A robot might be able to perform some task better if it could hear. But if it is not equipped with a microphone, then these useful sounds would be missing from its information about the state.

In the case where the agent cannot know the complete state, but only an observation, the problem is known as **partially observable**.

At time $t$, an observation $o_t$ represented by $k$ variables (where $k$ is smaller than the number of variables that represent the complete state) is a $k$-vector:
$o_t = \begin{bmatrix}o^1_t \\ \vdots \\ o^k_t\end{bmatrix}$

### Actions
Actions are things that the agent can choose to do to affect the environment. The set of all possible actions is called the **action space**.


### Policy
You probably know that in government, politicians have to follow certain *policies* that determine how they should respond to a situation. 

In RL, a **policy** means almost exactly the same thing: a policy defines what an agent should do when it finds itself in a certain state.

Mathematically, a policy is a distribution over possible actions $a_t$, conditioned on the state $s_t$, at time $t$. We use the symbol $\pi$ to represent the policy: $\pi(a_t|s_t)$

### Reward
In RL, we train our agents by giving them *rewards* when they do the right thing. These rewards are not *actual* rewards like a bar of gold or a treat. Rather, a reward is just a number, and we write our code so that the agent tries to collect as much of it as possible. Later we will discuss how to define and codify rewards for our agents.

The reward for taking action $a_t$ in state $s_t$ at time $t$ is denoted $r(s_t, a_t)$.

### Trajectory
A trajectory is the list of states that the agent encounters, and the corresponding actions that it takes, over one episode/lifetime/game.
A trajectory until time $T$ is denoted as $\tau = (s_1, a_1, \dots , s_T, a_T)$
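
As a rough sketch, a trajectory can be stored in code as a list of (state, action) pairs. The `Step` type and the numbers below are hypothetical and purely illustrative; they are not taken from a real run.

```python
from typing import List, NamedTuple, Tuple

class Step(NamedTuple):
    state: Tuple[float, ...]   # s_t, the state vector at time t
    action: int                # a_t, the action chosen at time t

# Illustrative values only, not taken from a real episode.
trajectory: List[Step] = [
    Step(state=(0.0, 0.0, 0.02, 0.0), action=1),
    Step(state=(0.0, 0.2, 0.02, -0.3), action=0),
    Step(state=(0.004, 0.0, 0.014, 0.0), action=1),
]

T = len(trajectory)                              # number of timesteps in the episode
states = [step.state for step in trajectory]     # (s_1, ..., s_T)
actions = [step.action for step in trajectory]   # (a_1, ..., a_T)
print(T, actions)
```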

## In our case...

**Agent** = the cart

**Environment** = consists of the cart, the pole and the floor

**State** = position of cart, velocity of cart, angle of pole, rotation rate of pole

**Observation** = the state, because the environment is fully observable

**Actions** = either push the cart right (action choice = 1) or push it left (action choice = 0)

**Policy** = yet to be made, but will be a function that maps our observations to a distribution over the action space

**Reward** = this gym environment returns +1 reward every timestep that the pole is still standing upright

**Trajectory** = ([position of cart, velocity of cart, angle of pole, rotation rate of pole]@t=1, push left or right @t=1, ..., [position of cart, velocity of cart, angle of pole, rotation rate of pole]@t=T, push left or right @t=T)

The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

# OpenAI Gym

OpenAI Gym provides a standardised set of environments that we can use to train reinforcement learning agents.

In [6]:
!pip3 install gym       # bash command



In [17]:
import gym
from time import sleep
env = gym.make('CartPole-v1')         # make the cartpole environment
print(env.reset())                    # initialise the env and print its initial state
print(env.action_space.sample())      # print an example action

for step in range(100):              # for 100 steps
    action = env.action_space.sample()    # randomly sample an action to take
    obs, reward, done, info = env.step(action)   # take the action and advance one timestep (we keep stepping even after done is True)
    print('Observation:', obs, '\tReward:', reward, '\tDone?', done, '\tInfo:', info, '\tPrevious action:', action)
    env.render()                     # show the env
    sleep(0.01)                      # wait a small amount of time so we can see the env
    
env.close()                           # close the env

[-0.00051033 -0.02143847  0.0197286  -0.04010987]
0
Observation: [-0.0009391   0.1733951   0.01892641 -0.32650352] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 0.0025288   0.36824255  0.01239634 -0.61315831] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 9.89365278e-03  5.63189090e-01  1.33169726e-04 -9.01911227e-01] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 0.02115743  0.75830924 -0.01790505 -1.19455229] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 0.03632362  0.95365841 -0.0417961  -1.49279289] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 0.05539679  0.7590692  -0.07165196 -1.21344845] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 0
Observation: [ 0.07057817  0.95503889 -0.09592093 -1.5276964 ] 	Reward: 1.0 	Done? False 	Info: {} 	Previous action: 1
Observation: [ 0.08967895  1.15117835 -0.12647486 -1.84871067] 	Reward: 1.0 	Done? False 	Info: {} 



Observation: [ 0.1819734   0.77024326 -0.28015583 -1.51051004] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.19737827  0.57938442 -0.31036603 -1.31667607] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.20896595  0.7768131  -0.33669955 -1.68845965] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.22450222  0.58676208 -0.37046874 -1.51652   ] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.23623746  0.39748332 -0.40079914 -1.35830691] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.24418712  0.20891137 -0.42796528 -1.21257097] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.24836535  0.40636467 -0.4522167  -1.60405508] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.25649264  0.60327547 -0.4842978  -1.99819767] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.26855815  0.41523183 -0.52426175 -1.885

Observation: [ 0.33288437  1.76053396 -6.33478518 -3.24718838] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.36809504  1.95582603 -6.39972895 -3.55490022] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.40721157  2.15097759 -6.47082696 -3.87982817] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.45023112  2.34549305 -6.54842352 -4.22132332] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.49714098  2.15042265 -6.63284999 -4.01601901] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 0
Observation: [ 0.54014943  2.34311513 -6.71317037 -4.38828666] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.58701173  2.53341987 -6.8009361  -4.7703152 ] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.63768013  2.72037089 -6.8963424  -5.15949604] 	Reward: 0.0 	Done? True 	Info: {} 	Previous action: 1
Observation: [ 0.69208755  2.52186312 -6.99953232 -5.085

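In this demo, we choose actions at random using `env.action_space.sample()`. Later we will build models that choose the actions instead; these make up the decision-making part of the agent.

When we advance one timestep in the environment by running `env.step(action)`, the environment returns a tuple containing the current observation, the reward that we received, whether the episode has ended or not, and some additional info which we won't worry about for now. In the case of cartpole, the observation values represent the position of the cart, the velocity of the cart, the angle of the pole and the rotation rate of the pole. Note that the loop above keeps stepping after `Done?` becomes `True`, which is why the reward drops to 0.0 in the later output; normally we would call `env.reset()` at that point.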

At this point you should read through [this](http://gym.openai.com/docs/) page of the OpenAI gym docs and get a good understanding of what is happening.
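
The interaction loop can also be sketched without installing anything, using a tiny stand-in environment. `CountdownEnv` below is hypothetical and not part of gym; it only mimics the `reset()` / `step(action)` interface used in the demo above, and the loop stops as soon as `done` is true instead of continuing to step.

```python
import random

class CountdownEnv:
    """Hypothetical stand-in for a gym-style environment (not part of gym)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]                      # initial observation

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]             # next observation
        reward = 1.0                      # +1 for every step survived
        done = self.t >= self.max_steps   # episode ends after max_steps
        return obs, reward, done, {}      # same 4-tuple shape as in the demo above

env = CountdownEnv(max_steps=5)
obs = env.reset()
total_reward = 0.0
done = False
while not done:                           # stop stepping once the episode has ended
    action = random.choice([0, 1])        # random policy: push left (0) or right (1)
    obs, reward, done, info = env.step(action)
    total_reward += reward

print('Episode finished with total reward', total_reward)
```

With a real gym environment of the same vintage, the loop body would look the same; only the environment construction would change.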

# The goal of RL

Our agent does well if it maximises the expected sum of all rewards, i.e. if it maximises:

## $J = \mathbb{E}\begin{bmatrix} \sum_{t=1}^T r(s_t, a_t) \end{bmatrix}$
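
In practice the expectation is usually approximated by averaging over sampled episodes. The sketch below assumes we already have the per-step rewards from a few episodes (the numbers are made up) and estimates $J$ as the mean of their sums.

```python
# Rewards collected in three hypothetical episodes (illustrative numbers only).
episode_rewards = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Sum of rewards within each episode: sum_t r(s_t, a_t)
returns = [sum(rewards) for rewards in episode_rewards]

# Monte Carlo estimate of J: the average return over the sampled episodes
J_estimate = sum(returns) / len(returns)
print(returns, J_estimate)
```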

However, that equation is missing some detail. Our policy is what actually determines the trajectory that our agent takes by defining the action it takes in a certain state. 

#### Our first policy

In our case, we can form a simple policy by taking a weighted combination (linear combination) of the current state for each output action, and then use a [softmax](https://en.wikipedia.org/wiki/Softmax_function) function to normalise that (turn it into a probability distribution):

## $\pi (a|s) 
= \sigma(\theta o )
= \sigma \left( \begin{bmatrix} \theta_{11} & \dots & \theta_{1k} \\ \vdots & \ddots & \vdots \\ \theta_{n1} & \dots & \theta_{nk}  \end{bmatrix} \begin{bmatrix}o^1 \\ \vdots \\ o^k\end{bmatrix} \right)
= \begin{bmatrix} \pi (a_1|s) \\ \vdots \\  \pi (a_n|s)\end{bmatrix}
$
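
A minimal sketch of such a linear softmax policy in plain Python follows. The function names, the parameter values in `theta` and the observation `obs` are all hypothetical; the only assumption carried over from the text is that there are $n = 2$ actions and $k = 4$ observation variables, as in cartpole.

```python
import math
import random

def softmax(logits):
    """Turn a list of real numbers into a probability distribution."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(theta, obs):
    """Return pi(a|s) for every action: softmax of the matrix-vector product theta @ obs."""
    logits = [sum(w * o for w, o in zip(row, obs)) for row in theta]
    return softmax(logits)

obs = [0.01, -0.02, 0.03, 0.04]               # made-up observation (k = 4 variables)

# With all-zero parameters every action is equally likely.
theta_zero = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
probs = policy(theta_zero, obs)

# Made-up parameters that favour pushing right (action 1) when the pole leans right.
theta_lean = [[0.0, 0.0, -1.0, -1.0],
              [0.0, 0.0, 1.0, 1.0]]
probs_lean = policy(theta_lean, obs)

# Sample an action from the distribution defined by the policy.
action = random.choices([0, 1], weights=probs_lean)[0]
print(probs, probs_lean, action)
```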

So our goal is to maximise:

## $J 
= \mathbb{E}_{\tau \sim \pi(\tau)}\begin{bmatrix} \sum_{t=1}^T r(s_t, a_t) \end{bmatrix}
= \mathbb{E}_{\tau \sim \pi(\tau)}\begin{bmatrix} r(\tau) \end{bmatrix}
$

###### Extra detail: $\pi(\tau) = p(s_1)\prod_{t=1}^T \pi_\theta(a_t | s_t)p(s_{t+1} | s_t, a_t)$
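
To see how this factorisation is evaluated, the sketch below uses a hypothetical problem with two states and two actions. All probability tables are made up for illustration; cartpole itself has continuous states, so this is only meant to show how the product is formed.

```python
# Hypothetical tabular problem: states 0/1, actions 0/1 (all numbers made up).
p_s1 = [0.5, 0.5]                         # p(s_1): initial state distribution
pi = [[0.9, 0.1],                         # pi[s][a] = pi(a | s)
      [0.2, 0.8]]
p_next = {                                # p_next[(s, a)][s'] = p(s' | s, a)
    (0, 0): [0.7, 0.3],
    (0, 1): [0.4, 0.6],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.1, 0.9],
}

def trajectory_probability(states, actions):
    """Compute p(s_1) * prod_t pi(a_t|s_t) * p(s_{t+1}|s_t, a_t).

    `states` has one more entry than `actions` because it includes s_{T+1}.
    """
    prob = p_s1[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s][a] * p_next[(s, a)][s_next]
    return prob

# Trajectory: start in state 0, take action 0, stay in state 0, take action 1, move to state 1.
prob = trajectory_probability(states=[0, 0, 1], actions=[0, 1])
print(prob)
```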

The objective function of reinforcement learning (what we try to maximise) is the expected sum of all rewards, $J$. Its gradient with respect to the policy parameters $\theta$ is:

## $\nabla J = $



