# Reinforcement Learning

So far we have grouped Machine Learning into Supervised Learning and
Unsupervised Learning. There is a third branch of Machine Learning called
Reinforcement Learning. It is motivated by the way humans are believed to learn:
by **interacting with their environment**.

The goal of Reinforcement Learning is to learn a mapping from situations (states) to actions so as to
**maximize a numerical reward signal**.

https://www.youtube.com/watch?v=kopoLzvh5jY

## How does it work?

We define an **agent** that takes **actions**. Each action yields a **reward** and influences the **state** of the environment.

![title](sar.jpg)

## Problem Definition

Reinforcement Learning problems are commonly formalized as Markov Decision Processes (MDPs), which are specified by a **state space** *S*, an **action space** *A*, a **state transition function** *P*, a **reward function** *R* and a **discount factor** *$\gamma$*.

### State Space *S*

$S = \{s_1, s_2, ..., s_n\}$

All n possible states of the environment.

### Action Space *A*

$A = \{a_1, a_2, ..., a_m\}$

All m possible actions of the agent.

### State Transition Function *P*

$P(s', r|s, a)$

The probability distribution of entering state $s'$ and receiving reward *r* after choosing action *a* in state *s*. It defines the dynamics of the MDP.

### Reward Function

$R(s, a)$

The reward the agent receives for choosing action *a* in state *s*.

### Discount Factor

$\gamma$

The factor with which future rewards are discounted. Usually a discount factor $\gamma < 1$ is used to indicate that future reward is worth less than current reward.

## Goal

Find a policy $\pi(a|s)$ that maximizes the expected return $V_\pi(s) = E[G_t|S_t = s]$ for all states $s$, where $G_t = R_{t+1} + \gamma * R_{t+2} + ... + \gamma^{p-1} * R_{t+p}$ is the **return**.

#### Return

$G_t = R_{t+1} + \gamma * R_{t+2} + ... + \gamma^{p-1} * R_{t+p}$

The return at time t is the cumulative discounted future reward.
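As a small illustration of the formula above, the return can be computed by working backwards through a list of rewards. This is only a sketch: the function name `discounted_return` and the example rewards are assumptions made for illustration and do not appear elsewhere in this notebook.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards given as [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(g)  # 1.75
```

The backward loop uses the same recursion that the value functions exploit further below.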

#### Policy

$\pi(a|s)$

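A mapping from every state of the environment to an action. In the general (stochastic) case, $\pi(a|s)$ is the probability of choosing action *a* in state *s*; a deterministic policy assigns a single action to every state.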
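One possible way to represent such a policy in code is a table of action probabilities per state. The state names, the probabilities and the helper functions below are hypothetical and chosen only to illustrate the idea.

```python
import random

# Hypothetical stochastic policy pi(a|s): one probability per action, per state.
policy = {
    "pole_left": {"left": 0.9, "right": 0.1},
    "pole_right": {"left": 0.1, "right": 0.9},
}


def sample_action(policy, state, rng=random):
    """Draw an action according to the probabilities pi(a|state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def greedy_action(policy, state):
    """Deterministic variant: always pick the most probable action."""
    return max(policy[state], key=policy[state].get)


print(greedy_action(policy, "pole_left"))  # left
```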

#### State-Value Function

$V_\pi(s) = E[G_t|S_t=s]$

The state-value function for policy $\pi(a|s)$ is the total expected return from being in state S=s and following the policy $\pi(a|s)$.

#### Action-Value Function

$Q_\pi(s, a) = E[G_t|S_t=s, A_t=a]$

The action-value function for policy $\pi(a|s)$ is the total expected return from being in state S=s, choosing action A=a and thereafter following policy $\pi(a|s)$.

One important property of both value functions is that they satisfy **recursive relationships**. E.g.


$Q_\pi(s, a) = E[G_t|S_t=s, A_t=a] = $

$E[R_{t+1} + \gamma*G_{t+1}|S_t=s, A_t=a] = $

$E[R_{t+1}|S_t=s, A_t=a] + \gamma*E[V_\pi(S_{t+1})|S_t=s, A_t=a]$

This is called the **Bellman equation** and is central to Reinforcement Learning.
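To make the recursion concrete, here is a minimal sketch of iterative policy evaluation, which repeatedly applies the Bellman equation until the values stop changing. The two-state MDP, its transition table `P` and the always-"go" policy are made-up assumptions used only for this illustration.

```python
GAMMA = 0.9

# Hypothetical MDP: P[(state, action)] -> list of (probability, next_state, reward).
# "end" is a terminal state whose value stays 0.
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"): [(1.0, "end", 1.0)],
}

# Policy pi(a|s) that always chooses "go".
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}


def q_value(V, s, a):
    """Bellman equation: expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])


def evaluate_policy(policy, sweeps=50):
    """Estimate V_pi by sweeping over all states a fixed number of times."""
    V = {"A": 0.0, "B": 0.0, "end": 0.0}
    for _ in range(sweeps):
        for s in policy:
            V[s] = sum(prob * q_value(V, s, a) for a, prob in policy[s].items())
    return V


V = evaluate_policy(policy)
print(V)
```

Under these assumptions state B is worth 1 (the reward for reaching the end), and state A is worth $\gamma * 1 = 0.9$ because the reward arrives one step later.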

## Summary

We want to find a policy that maximizes the expected return (sum of discounted future rewards) of the agent.

Let's look at an example.

Assumptions:

- Mouse can go left, right, up and down
- If the mouse finds cheese, either in the upper right corner or in the lower left corner, the environment terminates
- The mouse prefers two blocks of cheese over one block of cheese

![](mouse_grid.png)

So what are the state space, action space, state transition function, reward function and discount factor?

## How do we find the policy?

1. Dynamic Programming: Does not learn from interaction with the environment; requires full knowledge of the dynamics and the rewards
2. Monte Carlo Methods: Learn directly from the environment, but only after the final outcome of an episode is observed
3. Temporal Difference Learning: Learn directly from the environment and update estimates from other learned estimates (bootstrapping)
4. Policy Gradients: Approximate the policy directly through a parametrized function (often a neural network, as in Deep Reinforcement Learning)
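As an illustration of the Temporal Difference idea, the sketch below uses tabular Q-learning, one common TD method, on a tiny hand-made chain environment. The environment, the hyperparameters and all function names are assumptions made for this example; they are not part of Gym or of the rest of this notebook.

```python
import random

N_STATES = 4  # states 0..3; state 3 is the goal (terminal)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical chain environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Pick the action with the highest estimate, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Explore with probability epsilon, otherwise exploit current estimates.
            if rng.random() < epsilon:
                action = rng.choice([LEFT, RIGHT])
            else:
                action = greedy(Q[state], rng)
            next_state, reward, done = step(state, action)
            # The TD target bootstraps from the current estimate of the next state.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
# Greedy action for every non-terminal state.
policy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Each update happens after a single step, without waiting for the episode to finish, which is the difference from Monte Carlo methods. In this toy chain the learned greedy policy should move right in every state.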

## Drawbacks

* Credit Assignment Problem
* Exploration vs. Exploitation

---

### Implement it in practice using OpenAI's Gym
* A handy library for learning about RL - https://gym.openai.com/

`pip install gym`

In [1]:
import gym
import time
import numpy as np

---

### Let's work on the cartpole problem
#### First we make an environment in which the agent can be trained

In [2]:
env = gym.make('CartPole-v1')

In [3]:
env.reset()
for i in range(1000):
    env.render()
    obs, reward, done, _ = env.step(env.action_space.sample()) # take a random action
    time.sleep(0.08)
    if done:
        print(f'We survived {i} steps')
        env.reset()
        break
env.close()

We survived 14 steps


#### Now we implement the agent-environment loop
* Start the process by resetting the environment
* And return an initial observation

In [4]:
initial_obs = env.reset()

In [5]:
initial_obs
# position of cart, velocity of cart, angle of pole, rotation rate of pole

array([-0.01848355,  0.01098999,  0.0157114 , -0.03951064])

\[position of cart, velocity of cart, angle of pole, rotation rate of pole\]

We also get an observation back when we take an action, in this case a `step` in a given direction: 0 for left and 1 for right

In [6]:
obs, reward, done, _ = env.step(0)  # move cart left
obs, reward, done, _ = env.step(1)  # move cart right

We can already use the `done` boolean to work out if we can stop the loop

In [7]:
obs, reward, done, _

(array([-0.02195082,  0.01055209,  0.02008294, -0.02985178]), 1.0, False, {})

And we can `sample` from the `action_space` to randomly pick an action

In [8]:
random_step = env.action_space.sample()

And `render` the environment to see what our cart is doing

**OK, but we need to build an RL agent. What next?**

First, let's try to build the simplest RL agent:
* If the pole leans left, move left
* If the pole leans right, move right

In [3]:
def simple_rl(env):
    # reset the environment and get the initial observation
    obs = env.reset()

    # loop until the episode ends (or we reach 1000 steps)
    for i in range(1000):
        # measure: is the pole angled to the left or to the right?
        # action: if the pole leans left, move the cart left; if right, move right
        if obs[2] < 0:
            action = 0
        elif obs[2] > 0:
            action = 1
        else:
            # the pole is perfectly upright
            print("omgomgomg we're amazing")
            break

        obs, reward, done, _ = env.step(action)
        env.render()
        time.sleep(0.08)  # to make the video play at a normal rate
        if done:
            print(f'iterations survived: {i}')
            break

    # always close the render window, however the loop ended
    env.close()

In [4]:
#benchmark for a dumb rl agent = 50

In [5]:
simple_rl(env)

iterations survived: 34


### Let's look at a simple evolutionary algorithm

In [6]:
parameters = np.random.rand(4) * 2 - 1

In [7]:
parameters

array([-0.22711813,  0.90011165,  0.04559917,  0.9696599 ])

In [8]:
observation = env.reset()
observation

array([-0.00533671, -0.0396626 , -0.04493139, -0.00146532])

In [9]:
np.matmul(parameters, observation)

-0.03795839522582525

In [10]:
action = 0 if np.matmul(parameters,observation) < 0 else 1
action

0
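The cells above amount to a linear policy: the four observation values are weighted by four parameters, and the sign of the weighted sum decides the direction. The sketch below restates that rule as a plain-Python function, reusing the numbers printed above. The name `linear_policy` and the `example_` variables are assumptions introduced here for illustration.

```python
def linear_policy(parameters, observation):
    """Return (score, action): action 0 (left) if the weighted sum is negative, else 1 (right)."""
    # Same computation as np.matmul for two 1-D vectors: a dot product.
    score = sum(p * o for p, o in zip(parameters, observation))
    return score, (0 if score < 0 else 1)


# Values copied from the outputs shown above.
example_parameters = [-0.22711813, 0.90011165, 0.04559917, 0.9696599]
example_observation = [-0.00533671, -0.0396626, -0.04493139, -0.00146532]

score, example_action = linear_policy(example_parameters, example_observation)
print(example_action)  # 0, matching the cell above
```

Searching for a good policy then reduces to searching for four good numbers, which is what the random search below does.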

In [11]:
def run_episode(env, parameters, range_=200, render=False):  
    observation = env.reset()
    totalreward = 0
    
    for _ in range(range_):
        action = 0 if np.matmul(parameters,observation) < 0 else 1
        observation, reward, done, info = env.step(action)
        totalreward += reward
        if render:
            env.render()
            time.sleep(0.08)
        if done:
            break
            
    env.close()
    return totalreward

In [12]:
run_episode(env, parameters, render=True)

200.0

#### Random Search

In [14]:
re_re = 400
bestparams = None  
bestreward = 0  
for i in range(1000):  
    parameters = np.random.rand(4) * 2 - 1
    reward = run_episode(env,parameters, range_=re_re)
    
    if reward > bestreward:
        bestreward = reward
        bestparams = parameters
        # considered solved if the agent survives all re_re timesteps
        if reward == re_re:
            print(f'{i} episodes required to reach a reward of {re_re}')
            break

4 episodes required to reach a reward of 400


In [15]:
bestparams

array([-0.02159105,  0.79203038,  0.7630268 ,  0.7267915 ])

In [16]:
bestreward

400.0

In [18]:
run_episode(env, bestparams, re_re, True)

400.0

### DQN

In [21]:
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

model = DQN(MlpPolicy, env, verbose=1)

In [22]:
model.learn(total_timesteps=100000)
model.save("deepq_cartpole")

# del model # remove to demonstrate saving and loading

# model = DQN.load("deepq_cartpole")

--------------------------------------
| % time spent exploring  | 76       |
| episodes                | 100      |
| mean 100 episode reward | 24       |
| steps                   | 2375     |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 200      |
| mean 100 episode reward | 80.8     |
| steps                   | 10459    |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 300      |
| mean 100 episode reward | 143      |
| steps                   | 24716    |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 400      |
| mean 100 episode reward | 128      |
| steps                   | 37472    |
--------------------------------------
--------------------------------------
| % time spent exploring 

In [23]:
obs = env.reset()
done = False
i = 0
total_rewards = []
while not done:
    i += 1
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    total_rewards.append(rewards)
    time.sleep(0.08)
    env.render()
print(f'{i} steps')

500 steps
