## Reinforcement Learning
### Understand the theory

* Practical achievements in the field
* Supervised / Unsupervised / Reinforcement
* Pavlov to Bellman
* Environment / State / Action / Reward
* Drawbacks - curse of dimensionality, credit assignment problem
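
The credit assignment problem is usually tackled by discounting: a reward counts for less the longer after an action it arrives. As a minimal sketch (the function name `discounted_returns` and the `gamma` values are illustrative assumptions, not part of this notebook), the return at each step can be computed backwards with the recursion `G_t = r_t + gamma * G_{t+1}`:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # walk backwards so each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# three steps with a reward of 1.0 each, the reward CartPole gives per surviving step
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)
```

Earlier steps collect a larger return because more future reward follows them.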

![title](SAR.png)

---

### Implement it in practice using OpenAI's Gym
* A handy library for learning about RL - https://gym.openai.com/

`pip install gym`

In [1]:
import gym

---

### Let's work on the CartPole problem
#### First we make an environment in which the agent can be trained

In [2]:
env = gym.make('CartPole-v1')

#### Now we implement the agent-environment loop
* Start the process by resetting the environment
* `reset` returns an initial observation

In [4]:
initial_obs = env.reset()
position = initial_obs[0]
velocity = initial_obs[1]
angle = initial_obs[2]
rotation = initial_obs[3]
initial_obs

array([-0.00234686, -0.02152307,  0.03581043, -0.04828385])

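### The observation array
- `position` is where the cart sits on the track (0 is the centre)
- `velocity` is how fast the cart is moving in order to try and balance the pole (negative is leftwards)
- `angle` is whether the pole is leaning left (negative) or right (positive), measured in radians from upright
- `rotation` is the pole's angular velocity, i.e. how fast it is tipping either way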
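
To make the four entries harder to mix up, one option is to wrap an observation in a named tuple. This is only a sketch: `CartPoleObs` is a hypothetical helper, not something Gym provides, and the values are copied from the output above into a plain list.

```python
from collections import namedtuple

# field names follow the variable names used in the cell above
CartPoleObs = namedtuple('CartPoleObs', ['position', 'velocity', 'angle', 'rotation'])

# the initial observation printed above, written as a plain list
initial_obs = [-0.00234686, -0.02152307, 0.03581043, -0.04828385]
named = CartPoleObs(*initial_obs)

# a positive angle means the pole is leaning right
leaning = 'right' if named.angle > 0 else 'left'
print(named.angle, leaning)
```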

Taking an action also returns an observation. Here an action is a `step` in a given direction: 0 pushes the cart left and 1 pushes it right

In [5]:
env.render()

True

In [6]:
obs, reward, done, _ = env.step(1)

In [7]:
obs

array([-0.00277732,  0.17306758,  0.03484476, -0.32945666])

In [8]:
reward

1.0

We can use the `done` boolean to decide when to stop the loop: it tells us whether the episode is over, i.e. whether we've died or not!

In [9]:
done

False
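
The four-value `step` result used here matches older Gym releases. Gym 0.26 and later, and its successor Gymnasium, return five values from `step` (`done` is split into `terminated` and `truncated`), and `reset` returns `(obs, info)`. A minimal sketch of a helper that accepts either shape, where `unpack_step` is a hypothetical name and the tuples are hand-written stand-ins rather than real environment output:

```python
def unpack_step(result):
    """Normalise a step result to (obs, reward, done, info)."""
    if len(result) == 5:
        # newer API: terminated (failed) and truncated (time limit) are separate
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    # older API, as used in this notebook
    obs, reward, done, info = result
    return obs, reward, done, info


# hand-written tuples standing in for real env.step(...) results
old_style = ([0.0, 0.1, 0.0, -0.3], 1.0, False, {})
new_style = ([0.0, 0.1, 0.0, -0.3], 1.0, False, True, {})
print(unpack_step(old_style)[2], unpack_step(new_style)[2])
```

With a helper like this, the rest of the notebook could keep writing `obs, reward, done, _ = unpack_step(env.step(action))` regardless of the installed version.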

And `sample` the `action_space` to randomly pick an action

In [10]:
random_step = env.action_space.sample()

And `render` the environment to see what our cart is doing

In [12]:
env.render()

False

**OK, but we need to build an RL agent. What next?**

First, let's try to build the simplest RL agent:
* If the pole is leaning left, move left
* If the pole is leaning right, move right

In [13]:
import time

In [14]:
def naive_rl(env):
    
    #setup the game
    obs = env.reset()
    
    for i in range(1000):
        # work out if my pole is on the left or on the right
        if obs[2] < 0:
            action = 0
        else:
            action = 1
        # take an according step
        obs, reward, done, _ = env.step(action)
        
        #visualise my results
        env.render()
        print(obs, reward)
        time.sleep(0.1)
        
        # find out if I died
        if done:
            print(f'iterations survived {i}')
            break

In [18]:
naive_rl(env)

[-0.01152066 -0.20595819 -0.02972786  0.32124378] 1.0
[-0.01563982 -0.40064445 -0.02330299  0.60440544] 1.0
[-0.02365271 -0.59543289 -0.01121488  0.88965831] 1.0
[-0.03556137 -0.79040089  0.00657829  1.17879481] 1.0
[-0.05136939 -0.59536497  0.03015418  0.88818127] 1.0
[-0.06327669 -0.40066496  0.04791781  0.60512801] 1.0
[-0.07128999 -0.2062447   0.06002037  0.3279148 ] 1.0
[-0.07541488 -0.0120263   0.06657866  0.05474719] 1.0
[-0.07565541  0.18208096  0.06767361 -0.21620896] 1.0
[-0.07201379  0.37617348  0.06334943 -0.48680046] 1.0
[-0.06449032  0.57034703  0.05361342 -0.75886427] 1.0
[-0.05308338  0.76469083  0.03843613 -1.03420631] 1.0
[-0.03778956  0.95928117  0.01775201 -1.3145788 ] 1.0
[-0.01860394  1.15417402 -0.00853957 -1.60165319] 1.0
[ 0.00447954  0.9591542  -0.04057263 -1.31164473] 1.0
[ 0.02366263  0.76456882 -0.06680553 -1.03193195] 1.0
[ 0.038954    0.57039607 -0.08744417 -0.76094875] 1.0
[ 0.05036193  0.37658065 -0.10266314 -0.49701224] 1.0
[ 0.05789354  0.18304474 -0.

**We can do better than that! Let's build a model which learns to move better based on training data**

* First we need some training data

In [19]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

In [20]:
def collect_training_data(env):
    
    number_of_games = 100
    last_moves = 10
    observations = []
    actions = []
    
    for i in range(number_of_games):
        game_obs = []
        game_acts = []
        obs = env.reset()

        # play one game with random actions until it ends
        for j in range(1000):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            game_obs.append(obs)
            game_acts.append(action)

            if done:
                # drop the last moves (they led to failure) and pair each
                # observation with the action taken from it
                observations += game_obs[:-(last_moves+1)]
                actions += game_acts[1:-last_moves]
                break
    
    observations = np.array(observations)
    actions = np.array(actions)
    
    return observations, actions
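
The two slices in `collect_training_data` are easy to misread, so here is a small worked example. It is a sketch with made-up string labels and `last_moves = 2` instead of 10; `game_obs[k]` is the observation received after taking `game_acts[k]`, so the action chosen from that observation is `game_acts[k + 1]`.

```python
last_moves = 2  # the notebook uses 10; 2 keeps the example short

# stand-ins for one game: game_obs[k] is the observation seen after game_acts[k]
game_obs = ['o0', 'o1', 'o2', 'o3', 'o4', 'o5']
game_acts = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5']

kept_obs = game_obs[:-(last_moves + 1)]
kept_acts = game_acts[1:-last_moves]

# each kept observation is paired with the action chosen from it
pairs = list(zip(kept_obs, kept_acts))
print(pairs)
```

Both slices have the same length, and the final moves of the game, which led to the pole falling, are left out of the training data.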

* Then a model which plays based on its predictions

In [21]:
def smart_rl(env, m):
    
    obs = env.reset()
    
    for i in range(1000):
        # start to play game
        # let my ML model tell me what to do next
        
        obs = obs.reshape(-1, 4)
        # predict returns a length-1 array, so take its only element
        action = int(m.predict(obs)[0])
        
        #take an according step
        obs, reward, done, _ = env.step(action)
        
        #visualise my results
        env.render()
        time.sleep(0.1)
        
        # find out if I died
        if done:
            print(f'iterations survived {i}')
            break

#### Now let's run the code, and measure the improvement
* Setup the gym
* Collect training data
* Train a model
* And play
* And measure
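
A single game is a noisy measurement, so one way to do the "measure" step is to average the number of steps survived over several games. The sketch below makes several assumptions: `average_survival` and `naive_policy` are hypothetical helpers, and `CoinFlipEnv` is a made-up stand-in environment (using the same four-value `step` API as this notebook) so that the example runs without Gym installed.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in environment using the notebook's 4-value step API."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        # each step ends the episode with probability 0.1, whatever the action
        done = self.rng.random() < 0.1
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def average_survival(env, policy, episodes=20, max_steps=1000):
    """Mean number of steps survived by `policy` over several episodes."""
    totals = []
    for _ in range(episodes):
        obs = env.reset()
        steps = 0
        for _ in range(max_steps):
            obs, reward, done, _info = env.step(policy(obs))
            steps += 1
            if done:
                break
        totals.append(steps)
    return sum(totals) / len(totals)


def naive_policy(obs):
    # same rule as naive_rl: push towards the side the pole leans
    return 0 if obs[2] < 0 else 1


score = average_survival(CoinFlipEnv(), naive_policy)
print(score)
```

With the real environment, the same function could be called as `average_survival(env, naive_policy)` and again with a policy that wraps the trained model, for example `lambda obs: int(m.predict(obs.reshape(-1, 4))[0])`, to compare the two agents on equal terms.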

In [22]:
X, y = collect_training_data(env)
m = RandomForestClassifier()
m.fit(X, y)

smart_rl(env, m)



iterations survived 199


#### Much improved on the original attempt! We can now try other models or start to optimise hyperparameters!