# Reinforcement Learning

## Problem Setting

<img src="ref/rl-problem.png" width="600"/>

The interaction produces an interleaved sequence of data (states, rewards) and predictions (actions):
$$
S_0, A_0, r_1, S_1, A_1, r_2, S_2, \dots
$$

NB-1: the subscript denotes the time step; it should not be confused with the superscript, which distinguishes individual states or actions. E.g. you should be comfortable with statements like $s_5 = s^3$ (the state at time step 5 is state number 3) or $a_5 = a^{\text{MOVE-RIGHT}}$.

NB-2: sometimes a different convention is used for the time step of the reward: instead of $s_0, a_0, !, r_1, s_1$, the sequence is written $s_0, a_0, r_0, !, s_1$. Here I explicitly use "!" to mark the tick of the time step.
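
To make the two indexing conventions concrete, here is a minimal sketch that stores one episode as a list of transitions. The episode itself (state names, actions, rewards) is made up for illustration, and `transitions` is a hypothetical helper, not part of any library.

```python
# Hypothetical 3-step episode; the concrete values are made up for illustration.
states = ["s^1", "s^3", "s^2", "s^3"]   # S_0 .. S_3
actions = ["RIGHT", "LEFT", "RIGHT"]     # A_0 .. A_2
rewards = [0.0, -1.0, 1.0]               # one reward per action taken


def transitions(states, actions, rewards):
    """Pair each action with the reward and the next state it produced."""
    return [
        (states[t], actions[t], rewards[t], states[t + 1])
        for t in range(len(actions))
    ]


for t, (s, a, r, s_next) in enumerate(transitions(states, actions, rewards)):
    # Convention of the sequence above: this reward is labelled r_{t+1}.
    # Alternative convention (NB-2): the same number is labelled r_t.
    print(f"t={t}: S_{t}={s}, A_{t}={a}, r_{t + 1} (or r_{t})={r}, S_{t + 1}={s_next}")
```

Whichever label is used, the stored tuple is the same; only the name of the reward's time index changes.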

## RL Essential: Exploration

A criticism:
> Deep RL is popular because it's the only area in ML where it's socially acceptable to train on the test set.
-- [A tweet](https://twitter.com/jacobandreas/status/924356906344267776)

This is because in RL the agent has no access to data before it is tested: it must collect its own data by interacting with the very environment it is evaluated on.

Exploration scheme:
<img src="ref/explore.png" width="200"/>
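
One common exploration scheme, which is not necessarily the one drawn in the figure, is epsilon-greedy with a linearly decaying epsilon. The sketch below assumes a list of action values per state; the function names and default parameters are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def linear_decay(step, start=1.0, end=0.01, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


q_values = [0.1, 0.9, 0.3]
for step in (0, 500, 1000):
    eps = linear_decay(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```

Early on the agent acts almost entirely at random; as epsilon shrinks, it relies more and more on its current value estimates.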

## Examples of RL Environments
Let's have a look at some typical tasks. Here is a [list of environments](https://gym.openai.com/envs/#classic_control) provided by OpenAI Gym.

In [1]:
import gym
env = gym.make('LunarLander-v2')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [2]:
state  = env.reset()

In [3]:
print(state)

[-0.00115395  0.94757329 -0.11689678  0.46399501  0.00134391  0.02647887
  0.          0.        ]


In [4]:
env.render()

True

In [5]:
env.close()

### A typical control flow
```python
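state = env.reset()
done = False
while not done:
    # make the decision for time step t
    # Markov assumption (MDP): the current state is all the policy needs.
    # In a POMDP, `state` is only a partial observation of the true state.
    act = policy(state)  # `policy` is a placeholder for your decision rule

    # step() is how an action is committed to the environment
    new_state, reward, done, _ = env.step(act)

    # adjust the policy on the fly here, or after the episode is done

    # move on to the next time step
    state = new_state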
```
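
The loop above can be tried without installing Gym. The sketch below uses a hypothetical toy environment, `CorridorEnv`, that only mimics the `reset()` / `step()` interface; it is an illustration, not a Gym class, and `random_policy` is a stand-in for a real policy.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right from cell 0 to the last cell."""

    def __init__(self, length=5, max_steps=100):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        # action 0 = move left, action 1 = move right; walls clip the move.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.pos == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.pos, reward, done, {}


def random_policy(state):
    """Placeholder policy that ignores the state."""
    return random.randint(0, 1)


env = CorridorEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    act = random_policy(state)
    new_state, reward, done, _ = env.step(act)
    total_reward += reward
    state = new_state
print("episode finished in", env.steps, "steps with return", total_reward)
```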
    


In [9]:
import time
import random
env = gym.make('LunarLander-v2')
state = env.reset()
done = False
while not done:
    # optionally, we can see how everything goes
    env.render()
    act = random.randint(0, 3)  # random dummy action; LunarLander-v2 has 4 discrete actions
    new_state, reward, done, _ = env.step(act) 
    state = new_state
    time.sleep(0.02)
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


### Observations and state
- POMDP
- Multimedia observations

For some games, you get the "internal" state to work with. In others, your agent must rely on itself to interpret raw observations such as images.
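
A small sketch of what partial observability means in practice. It assumes a hypothetical state made of a position and a velocity, and a sensor that reports only the position; stacking two consecutive observations is one simple way to recover the hidden part.

```python
def observe(state):
    """Hypothetical sensor: reveals the position but hides the velocity."""
    position, _velocity = state
    return position


def stacked_observation(prev_state, state):
    """Two consecutive observations; their difference reveals the motion."""
    return (observe(prev_state), observe(state))


# (position, velocity) pairs; each object moved by its velocity in the last step.
prev_right, now_right = (2.0, +1.0), (3.0, +1.0)
prev_left, now_left = (4.0, -1.0), (3.0, -1.0)

# A single observation cannot tell the two situations apart ...
print(observe(now_right), observe(now_left))
# ... but a pair of consecutive observations can.
print(stacked_observation(prev_right, now_right))
print(stacked_observation(prev_left, now_left))
```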

# Main Families of RL Algorithms

## Decision Making

- Value Iteration
    - Q-learning (DQN)
    
    Action = $\arg\max_a Q(\text{state}, a)$

- Policy Iteration
    - Policy Gradient 
    
    Action ~ $\pi_\theta(\cdot \mid \text{state})$
    
    $\theta \leftarrow \theta + \alpha \Delta\theta$, with $\Delta\theta$ an estimate of the gradient of the expected return with respect to $\theta$
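
The two ways of choosing an action can be sketched in a few lines. This assumes a list of action values for the value-based case and a list of unnormalised action preferences (logits) for the policy-based case, with a softmax policy as one possible parameterisation of $\pi_\theta$; all function names are illustrative.

```python
import math
import random


def greedy_action(q_row):
    """Value-based choice: the index of the largest action value."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Policy-based choice: draw an action from pi(. | state)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


print(greedy_action([0.2, 1.5, -0.3]))
print(softmax([1.0, 2.0, 3.0]))
print(sample_action([1.0, 2.0, 3.0]))
```

The greedy choice is deterministic given $Q$, while the sampled choice stays stochastic, which gives a policy-gradient agent some exploration for free.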

## Exploration in the space

- Monte Carlo Methods
    - To account for variables difficult to model

- Temporal Difference Learning
    - "Look into the future"
    

In the figure below, "O" marks an evaluation with the CURRENT version of the value estimate.
<img src="ref/td.png" width="600">

- Planning
    - Monte Carlo Tree Search (AlphaGo)
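
The difference between Monte Carlo and temporal-difference learning shows up in the learning target. The sketch below is a minimal illustration with hypothetical helper names: Monte Carlo waits for the episode to finish and uses the full discounted return, while TD(0) updates after a single step by bootstrapping from the current value estimate of the next state.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td0_update(v, state, reward, next_state, done, lr, gamma):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    bootstrap = 0.0 if done else v[next_state]
    target = reward + gamma * bootstrap
    v[state] += lr * (target - v[state])
    return v


# rewards[t] is the reward received after the action taken at time step t.
print(monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.5))
print(td0_update({0: 0.0, 1: 2.0}, state=0, reward=1.0, next_state=1,
                 done=False, lr=0.5, gamma=0.5))
```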

# Q-Learning (TD)

## A Trivial "learn Q, then improve" scheme

Example from [Sutton and Barto, 2018, "Reinforcement Learning, An Introduction" 2nd ed.]

<img src="ref/simple-grid-problem.png" width="600">

We start from the table below. It shows $V$ rather than $Q$, but the two are convertible, at least in this simple, deterministic environment.
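
For reference, the conversion can be written as follows, assuming a deterministic environment in which taking action $a$ in state $s$ yields reward $r(s, a)$ and leads to a single next state $s'$, with discount factor $\gamma$:

$$
V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a), \qquad
Q^\pi(s, a) = r(s, a) + \gamma\, V^\pi(s')
$$

The first identity holds for any policy $\pi$; the second relies on the transition being deterministic, otherwise it becomes an expectation over $s'$.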
<img src="ref/q-step000.png" width="600">

The simple policy is to pick one of the 4 actions uniformly at RANDOM at EACH state. After 1 step:

<img src="ref/q-step001.png" width="600">

<img src="ref/q-step002.png" width="600">

<img src="ref/q-step003.png" width="600">


<img src="ref/q-step010.png" width="600">

<img src="ref/q-step999.png" width="600">
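
The sequence of tables can be reproduced with a few lines of iterative policy evaluation. The sketch below assumes the figures follow the 4x4 gridworld of Sutton and Barto (Example 4.1): terminal states in the top-left and bottom-right corners, a reward of -1 on every move, no discounting, and moves off the grid leaving the state unchanged. If the figures use a different setup, the numbers will differ.

```python
N = 4                                        # the grid is N x N
TERMINALS = {0, N * N - 1}                   # top-left and bottom-right corners
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def next_state(state, move):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(state, N)
    new_row, new_col = row + move[0], col + move[1]
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return state


def evaluate_random_policy(sweeps, gamma=1.0, reward=-1.0):
    """Run `sweeps` synchronous Bellman backups under the uniform random policy."""
    v = [0.0] * (N * N)
    for _ in range(sweeps):
        new_v = list(v)
        for s in range(N * N):
            if s in TERMINALS:
                continue
            new_v[s] = sum(
                0.25 * (reward + gamma * v[next_state(s, m)]) for m in MOVES
            )
        v = new_v
    return v


for k in (0, 1, 2, 3, 10, 1000):
    values = evaluate_random_policy(k)
    print(f"k = {k}")
    for row in range(N):
        print("  ".join(f"{values[row * N + col]:6.1f}" for col in range(N)))
```

Each sweep replaces every value by the average, over the four random moves, of the immediate reward plus the previous estimate of the state reached.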


| Q        | $\pi$           |
|:-------------:|:-------------:| 
| <img src="ref/q-step999.png" width="100%">|<img src="ref/q-policy-opt.png" width="100%">| 


| Q        | $\pi$           |
|:-------------:|:-------------:| 
| <img src="ref/q-step000.png" width="100%">|<img src="ref/q-policy-000.png" width="100%">| 

| Q        | $\pi$           |
|:-------------:|:-------------:| 
| <img src="ref/q-step001.png" width="100%">|<img src="ref/q-policy-001.png" width="100%">| 

| Q        | $\pi$           |
|:-------------:|:-------------:| 
| <img src="ref/q-step002.png" width="100%">|<img src="ref/q-policy-002.png" width="100%">| 
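
The policy column can be derived from the value column by a one-step lookahead: in each state, pick every action that leads to the best neighbouring value. The sketch below hard-codes the converged values of the random policy, assuming the same 4x4 gridworld of Sutton and Barto (Example 4.1) with a reward of -1 per move; the helper names are illustrative.

```python
N = 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Assumed converged values of the random policy (Sutton and Barto, Example 4.1).
V = [
    0, -14, -20, -22,
    -14, -18, -20, -20,
    -20, -20, -18, -14,
    -22, -20, -14, 0,
]


def next_state(state, move):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(state, N)
    new_row, new_col = row + move[0], col + move[1]
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return state


def greedy_actions(state, gamma=1.0, reward=-1.0):
    """All actions whose one-step lookahead value is maximal (ties are kept)."""
    q = {
        name: reward + gamma * V[next_state(state, move)]
        for name, move in MOVES.items()
    }
    best = max(q.values())
    return {name for name, value in q.items() if value == best}


for state in (1, 3, 5):
    print(state, sorted(greedy_actions(state)))
```

Keeping ties matters here: several states have two equally good directions, and any of them is an optimal move.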

## Lab

Play with a grid-game "[FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/)".
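
A tabular Q-learning solution would repeatedly apply the following update after every step, where $\alpha$ is the learning rate and $\gamma$ the discount factor:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$

The bracketed term is the temporal-difference error: the gap between the bootstrapped target and the current estimate.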

In [None]:
import torch
import random

In [None]:
# Let's test shallow water first.
from gym.envs.registration import register

register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=2000,
    reward_threshold=0.78, # optimum = .8196
)

In [None]:
# A Demo solution -- please try your own method before playing with this piece of code.

slippery = True
env = gym.make('FrozenLake-v0') if slippery \
    else gym.make('FrozenLakeNotSlippery-v0')

######### HYPER PARAMETERS ########
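lr = 0.1  # learning rate of the Q update
gamma = 0.9  # discount factor: how far into the future we look
Q = torch.zeros(env.observation_space.n, env.action_space.n)  # one row per state, one column per action
rs_longterm = 0  # sum of episode returns within the current reporting window
report_ever_n_episodes = 5000
n_episodes = 150000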

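# ! Exploration vs Exploitation
randomness = 1.0  # probability of taking a random action; starts fully random
epsilon = 0.01  # floor for that probability
explore_steps = n_episodes * 2  # environment steps (not episodes) over which it decays
d_random = (randomness-epsilon)/explore_steps  # linear decrement applied after every step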

######### LEARNING ########
for ep in range(n_episodes):
    done = False
    state = env.reset()
    rs = 0
    while not done:
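        # Exploit (greedy action) with probability 1 - randomness, and always
        # in the final reporting window; otherwise explore with a random action.
        if random.random()>randomness or ep > n_episodes - report_ever_n_episodes:
            _, a = Q[state].max(dim=0)
            a = a.item()
        else:
            a = random.randint(0, env.action_space.n-1)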
            
        new_state, reward, done, _ = env.step(a)
        if ep == n_episodes-1:
            print(Q[state])
            print(a, Q[state].max(dim=0))
            env.render()


        rs += reward
        # TD update: move Q[state, a] toward reward + gamma * max over a' of Q[new_state, a']
        Q[state, a] +=  lr*(reward + gamma*Q[new_state].max() - Q[state, a])
        state = new_state
        randomness = max(epsilon, randomness - d_random)
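    # add this episode's return first, so that each report covers a full window
    rs_longterm += rs
    if (ep+1) % report_ever_n_episodes == 0:
        # average return over the last window, then the greedy action per state
        print(ep, rs_longterm/report_ever_n_episodes)
        print(Q.max(dim=1)[1])
        rs_longterm = 0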

In [None]:
# LET'S CHECK Q!
Q
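
A raw table of numbers is hard to read, so it can help to render the greedy action of each state as an arrow on the 4x4 map. The sketch below is a hypothetical helper working on a plain nested list; with the tensor from the lab, `Q.tolist()` should give a suitable input. It relies on FrozenLake's action encoding in Gym (0 = left, 1 = down, 2 = right, 3 = up), and the example table is made up for illustration.

```python
# FrozenLake action encoding in Gym: 0 = left, 1 = down, 2 = right, 3 = up.
ARROWS = ["<", "v", ">", "^"]


def policy_grid(q_table, n_cols=4):
    """Render the greedy action of every state as an arrow, row by row."""
    greedy = [max(range(len(row)), key=lambda a: row[a]) for row in q_table]
    return [
        " ".join(ARROWS[a] for a in greedy[start:start + n_cols])
        for start in range(0, len(greedy), n_cols)
    ]


# Hypothetical Q-table for illustration only (16 states, 4 actions).
example_q = [[0.0, 0.1, 0.5, 0.0]] * 8 + [[0.0, 0.7, 0.2, 0.1]] * 8
print("\n".join(policy_grid(example_q)))
```

Rows that are entirely zero, such as holes and the goal where no action is ever updated, show the first action by default, so their arrows carry no meaning.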