# MSE MachLe -- Reinforcement Learning
**Helmut Grabner, Autumn Term, 2020**

### Objectives:
+ Apply (classical) Q-learning on the (classical) cartpole example
+ [Bonus] use other environments!

### Questions:
+ What is the influence of the different parameters, such as learning rate, exploration rate, discount factor, etc.?
+ How many episodes are needed?

### Credits
+ Heavily inspired by: https://github.com/RJBrooker/Q-learning-demo-Cartpole-V1/blob/master/cartpole.ipynb

### Additional Resources
+ https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
+ https://gym.openai.com/docs/

In [None]:
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import gym
import math
from typing import Tuple
from sklearn.preprocessing import KBinsDiscretizer


In [None]:
#if you're using colab you may need to install xvfb to display the environment
!apt-get install xvfb
!pip install pyvirtualdisplay

# Set up a virtual display for rendering OpenAI gym environments.
import pyvirtualdisplay
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

## Environment

Here we use the well-known CartPole gym environment. Our agent has to learn how to balance a pole on a cart. The agent observes the position and velocity of the cart and the angle and angular velocity of the pole. In every step, it can push the cart to the left or to the right.

In [None]:
env_name = 'CartPole-v1'
env = gym.make(env_name).env

Let's take a look at the initial state. The observation corresponds to: [position of the cart, velocity of the cart, angle of the pole, angular velocity of the pole]

In [None]:
time_step = env.reset()
print(time_step)

If we act and push the cart to the right, we see the new state. Additionally, the step returns a reward of one, because we were able to keep the pole within the allowed angle for one step.

In [None]:
action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)

## Visualise Environment
Visualise the environment/simulation -- with a **RANDOM** policy (i.e., moving left or right at random)

In [None]:
#monitor
from gym import wrappers
env = wrappers.Monitor(env, "./gym-results", force=True)

#reset environment
env.reset()
for t in range(500):
  #choose random action
  action = env.action_space.sample()
  #apply action, get new state, reward, etc.
  observation, reward, done, info = env.step(action)
  if done:
    print("Episode finished after {} timesteps".format(t+1))
    break
env.close()

In [None]:
import io
import base64
from IPython.display import HTML

video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))

## Q-learning

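Convert CartPole's continuous state space into a discrete one. Only the pole angle and the pole's angular velocity are binned; the cart position and cart velocity are ignored.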

In [None]:
n_bins = ( 6 , 12 )
lower_bounds = [ env.observation_space.low[2], -math.radians(50) ]
upper_bounds = [ env.observation_space.high[2], math.radians(50) ]

def discretizer( _ , __ , angle, pole_velocity ) -> Tuple[int,...]:
    est = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    est.fit([lower_bounds, upper_bounds ])
    return tuple(map(int,est.transform([[angle, pole_velocity]])[0]))
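To make the uniform binning easier to follow, here is a minimal pure-Python sketch of the same idea. The helper names `to_bin` and `discretize` and the angle bound of 0.419 rad are assumptions for illustration only; the notebook reads the bound from `env.observation_space` and relies on `KBinsDiscretizer`, whose handling of values exactly on a bin edge may differ slightly.

```python
import math

def to_bin(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a continuous value to one of n_bins equal-width bins on [low, high]."""
    # Fraction of the way from low to high (may fall outside [0, 1]).
    fraction = (value - low) / (high - low)
    index = int(math.floor(fraction * n_bins))
    # Clip so that out-of-range inputs land in the first or last bin.
    return max(0, min(n_bins - 1, index))

# Assumed bounds for illustration: pole angle about +/-0.419 rad,
# pole angular velocity +/-50 degrees per second (as in the cell above).
ANGLE_BOUND = 0.419
VELOCITY_BOUND = math.radians(50)

def discretize(angle: float, pole_velocity: float) -> tuple:
    """Return (angle_bin, velocity_bin) using 6 and 12 bins respectively."""
    return (
        to_bin(angle, -ANGLE_BOUND, ANGLE_BOUND, 6),
        to_bin(pole_velocity, -VELOCITY_BOUND, VELOCITY_BOUND, 12),
    )

print(discretize(0.0, 0.0))   # upright pole at rest
print(discretize(-1.0, 5.0))  # far outside the bounds, clipped to the edge bins
```

Because the bins are fixed, a version like this avoids re-fitting a discretizer on every call, which may matter when the function is called once per time step.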


Initialise the Q value table with zeros. Interpret the table!

In [None]:
Q_table = np.zeros(n_bins + (env.action_space.n,))
Q_table.shape
#print (Q_table)

Create a policy function that uses the Q-table and greedily selects the action with the highest Q-value.

In [None]:
def policy( state : tuple ):

    ### INSERT YOUR CODE HERE ###


    ### END YOUR CODE HERE ###
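As a hint for the exercise, the following sketch shows greedy action selection on a small hand-made table. The dictionary `toy_q` and the function `greedy_action` are hypothetical and not part of the notebook; on the NumPy `Q_table`, `np.argmax` over the action axis could play the same role.

```python
from typing import Dict, List, Tuple

# Hypothetical toy Q-table: one list of Q-values (one per action) per discrete state.
toy_q: Dict[Tuple[int, int], List[float]] = {
    (3, 6): [0.2, 0.7],  # action 1 (push right) currently looks better
    (2, 5): [1.5, 0.4],  # action 0 (push left) currently looks better
}

def greedy_action(q: Dict[Tuple[int, int], List[float]], state: Tuple[int, int]) -> int:
    """Return the index of the action with the largest Q-value in this state."""
    values = q[state]
    return max(range(len(values)), key=lambda a: values[a])

print(greedy_action(toy_q, (3, 6)))
print(greedy_action(toy_q, (2, 5)))
```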


Update function: compute the newly learned Q-value from the reward and the best Q-value of the next state.

In [None]:
def new_Q_value( reward : float ,  new_state : tuple , discount_factor=1 ) -> float:
    # Temporal difference for updating the Q-value of a state-action pair
    
    ### INSERT YOUR CODE HERE ###


    ### END YOUR CODE HERE ###
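As a hint, the standard Q-learning target is the reward plus the discounted value of the best action in the next state; the training loop then blends this target with the old value using the learning rate. The sketch below works through one such update with made-up numbers. The function names `td_target` and `blended_update` and all values are assumptions for illustration, not the notebook's solution.

```python
from typing import List

def td_target(reward: float, next_values: List[float], discount_factor: float = 1.0) -> float:
    """Reward plus the discounted value of the best action in the next state."""
    return reward + discount_factor * max(next_values)

def blended_update(old_value: float, target: float, lr: float) -> float:
    """Move the old Q-value a fraction lr of the way towards the target."""
    return (1 - lr) * old_value + lr * target

# Hypothetical numbers for illustration only.
next_values = [2.0, 3.0]  # Q-values of the two actions in the next state
target = td_target(reward=1.0, next_values=next_values, discount_factor=0.9)
new_value = blended_update(old_value=1.0, target=target, lr=0.5)
print(target, new_value)
```

With `lr = 1` the old value is overwritten by the target; with a small `lr` the table changes slowly, which is one way to reason about the learning-rate question above.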

Decay of the learning rate and the exploration rate

In [None]:
def learning_rate(n : int , min_rate=0.01 ) -> float  :
    # Decaying learning rate
    return max(min_rate, min(1.0, 1.0 - math.log10((n + 1) / 25)))

def exploration_rate(n : int, min_rate= 0.1 ) -> float :
    #Decaying exploration rate
    return max(min_rate, min(1, 1.0 - math.log10((n  + 1) / 25)))
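To get a feeling for these schedules, the sketch below evaluates the same formula at a few episode indices. The function name `decayed_rate` and the chosen episodes are assumptions for illustration. Both rates stay at 1 for the first 25 episodes, then fall logarithmically until they hit their floor (around episode 200 for a floor of 0.1, around episode 245 for a floor of 0.01).

```python
import math

def decayed_rate(n: int, min_rate: float) -> float:
    """Same formula as the schedules above: 1 - log10((n + 1) / 25), clipped to [min_rate, 1]."""
    return max(min_rate, min(1.0, 1.0 - math.log10((n + 1) / 25)))

# Print episode index, learning rate (floor 0.01) and exploration rate (floor 0.1).
for episode in (0, 24, 49, 99, 199, 249):
    print(episode, round(decayed_rate(episode, 0.01), 3), round(decayed_rate(episode, 0.1), 3))
```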

## Training

Make yourself familiar with the code. Spot the policy, the environment update, etc. Play around with different settings...

In [None]:
n_episodes = 300
n_maxtimesteps = 10000
timesteps = []

discount_factor = 1

for e in range(n_episodes):
    
    # Discretize state into buckets
    current_state, done = discretizer(*env.reset()), False
    
    for t in range(n_maxtimesteps):
    
        # policy action 
        action = policy(current_state) # exploit
        
        # insert random action
        if np.random.random() < exploration_rate(e) : 
            action = env.action_space.sample() # explore 
         
        # step the environment
        obs, reward, done, _ = env.step(action)
        new_state = discretizer(*obs)
        
        # Update Q-Table
        lr = learning_rate(e)
        learnt_value = new_Q_value(reward , new_state, discount_factor)

        old_value = Q_table[current_state][action]
        Q_table[current_state][action] = (1-lr)*old_value + lr*learnt_value
        
        current_state = new_state

        if done:
            print("Episode {} finished after {} timesteps".format(e, t+1))
            timesteps.append(t+1)
            break

    # close environment
    env.close()

plt.plot(timesteps);
plt.xlabel("episodes");
plt.ylabel("timesteps of successful balancing");

Check the learned Q-table

In [None]:
print (Q_table)

Now, run the learned policy...

In [None]:
env.reset()
env = wrappers.Monitor(env, "./gym-results", force=True)

#reset environment
current_state, done = discretizer(*env.reset()), False

for t in range(n_maxtimesteps): 
  # policy action 
  action = policy(current_state) # exploit
    
  # step the environment
  obs, reward, done, _ = env.step(action)
  current_state = discretizer(*obs)

  env.render()

  if done:
      print("Finished after {} timesteps".format(t+1))
      break
env.close()

Render the video for the last episode -- compare it to the previous video...

In [None]:
import io
import base64
from IPython.display import HTML

video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))