<a href="https://colab.research.google.com/github/lin3372/252_ML.tutorial/blob/main/RL/03_Hands_on_RL_Tabular_SARSA_220801.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Hands-On Reinforcement Learning Course: Part 5 - Deep Q learning](https://medium.com/towards-data-science/hands-on-reinforcement-learning-course-part-5-bdb2e7fa243c)

by [Pau Labarta Bajo](https://pau-labarta-bajo.medium.com/?source=post_page-----269b50e39d08--------------------------------), Feb 10, 2022.  [[github repo for this lesson]](https://github.com/Paulescu/hands-on-rl)

Summarized and Revised by Ivan H.P. Lin

Previous RL courses:
1. Part 1: Introduction to Reinforcement Learning - [datSci](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08), [Ivan's colab](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08)
2. Part 2: Tabular Q-learning - [datSci](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08), [Ivan's colab](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08)
3. Part 3: Tabular SARSA - [datSci](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08), [Ivan's colab](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08)
4. Part 4: Linear Q-learning [datSci](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-4-55da5eae851f) [Ivan's colab](https://drive.google.com/file/d/1Q1CWVOjmH46Gf_Xb_u4Ht13U55LsQina/view?usp=sharing)

5. 👉🏻 Part 5: Deep Q-learning (today) [datSci](https://medium.com/towards-data-science/hands-on-reinforcement-learning-course-part-5-bdb2e7fa243c) [Ivan's colab](https://medium.com/towards-data-science/hands-on-reinforcement-learning-course-part-5-bdb2e7fa243c)

In part 4 we built an okay-ish agent for the Cart Pole environment. We used **parametric $Q$ learning** with a linear model.

Today we will replace this linear model with a neural network.

And we will kick-ass-solve the Cart Pole environment 😎

Today’s lesson is a bit longer, as it includes a mini crash course on training neural network models. Unless you are an expert in deep learning, I highly recommend you do not skip it.

All the code for this lesson is in [this Github repo](https://github.com/Paulescu/hands-on-rl). Git clone it to follow along with today’s problem.

And if you like the course, please give it a ⭐ in Github!

<img src="https://i.makeagif.com/media/9-30-2015/3TntUH.gif" width="60%">

# 0. Contents

1. Let’s go deep!
2. Hands-on Introduction to Deep Learning
3. Deep Q learning to solve the Cart Pole
4. Recap ✨
5. Homework 📚
6. What’s next? ❤️

# 1. Let’s go deep! 🕹️

In the previous lesson, we used this linear parameterization to represent the optimal $q$ function.

<figure><center>
<img src="https://miro.medium.com/max/711/0*ziABdoI655gxHn3H.jpeg" width="60%">
<figcaption>linear q function (Image by the author)</figcaption>
</center></figure>

The success (or failure) of a **parametric $Q$-learning** agent strongly depends on the parameterization we use to approximate the optimal $q$ value function.

Linear models are conceptually simple, fast to train, and fast to run. However, they are not very flexible. When the relationship between inputs and outputs is non-linear, a linear model cannot capture it, no matter how long you train it.

And this is where neural networks enter the game.

Neural network models are the most powerful function approximators we have. They are extremely flexible and can be used to uncover complex patterns between the input features and the target labels.
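To make the flexibility gap concrete, here is a minimal sketch (not part of the original lesson) built around XOR, a classic target that no linear model can fit exactly. The weights of the tiny network are set by hand for illustration, and the names `relu` and `tiny_network` are hypothetical.

```python
def relu(z: float) -> float:
    """Rectified linear unit: the non-linearity between the two layers."""
    return max(0.0, z)


def tiny_network(x1: float, x2: float) -> float:
    """Two hidden ReLU units followed by a linear output layer."""
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2      # linear read-out of the hidden units


# XOR truth table: the output is 1 only when exactly one input is 1.
xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

for (x1, x2), target in xor_table.items():
    print(f"x=({x1}, {x2})  network={tiny_network(x1, x2):.1f}  target={target}")
```

A linear model $w_1 x_1 + w_2 x_2 + b$ cannot reproduce this table: for any such model the outputs at $(0,0)$ and $(1,1)$ add up to the same number as the outputs at $(0,1)$ and $(1,0)$, while XOR needs $0$ on one side and $2$ on the other. One hidden layer with a non-linearity is enough to fix that.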


  > **The Universal Approximation Theorem** 📘 is a mathematical result that essentially says

  > Neural networks are as flexible as you want them to be. If you design a sufficiently large neural network (i.e. with enough parameters), there exists a setting of its parameters that gives an accurate mapping between the input features and the target values. Finding that setting through training is a separate problem.

<img src="https://miro.medium.com/max/1465/1*Z_0zd9ld4CaSptmAlrYcdw.png" width="50%">

Today we are going to replace the linear model from part 4 with the simplest neural network architecture out there: a **feed-forward neural network**.

<figure><center>
<img src="https://miro.medium.com/max/1539/1*Ibtr60PwTe51yyJqUqh8Eg.jpeg" width="50%">
<figcaption>Feed-forward neural network (Image by the author)</figcaption>
</center></figure>

Later in the course, we will use other neural networks to deal with more complex state spaces (e.g. **convolutional neural networks**).

Let us warm up our deep learning mastery with the following imitation learning problem I created for today.

There is quite a lot to cover, so arm yourself with deep focus.




### display setup for colab
Reference: Stack Overflow, [How to render OpenAI gym in google Colab?](https://stackoverflow.com/questions/50107530/how-to-render-openai-gym-in-google-colab)

In [1]:
!apt-get install x11-utils > /dev/null 2>&1 
!pip install pyglet > /dev/null 2>&1 
!apt-get install -y xvfb python-opengl > /dev/null 2>&1

In [2]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1

In [3]:
# then import all your libraries, including *matplotlib* & *ipythondisplay*:

import gym
import numpy as np
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay

In [4]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(400, 300))
display.start()

<pyvirtualdisplay.display.Display at 0x7f756cd45150>

### download util files from github

In [5]:
import os
user = "Paulescu"
repo = "hands-on-rl"
src_dir = "03_cart_pole/src/"
pyfiles = ["agent_memory.py", "config.py", "loops.py", "model_factory.py", "optimize_hyperparameters.py",
           "q_agent.py", "random_agent.py", "supervised_ml.py", "utils.py", "viz.py"]
# Note: the upstream "viz.py" has an error in get_action(). I fixed it and put the fixed
# copy in my repository, so viz.py needs to be downloaded from my GitHub repository
# (see the commented-out block below).

curr_dir=os.getcwd()
os.makedirs('src', exist_ok=True)

os.chdir('src')

for f_rl in pyfiles:
  url = f"https://raw.githubusercontent.com/{user}/{repo}/main/{src_dir}/{f_rl}"
  !wget --no-cache --backups=1 {url}

##########################   
#user = "lin3372"
#repo = "252_ML.tutorial"
#src_dir = "RL/src/02_mountain_car/src"
#pyfiles = ["viz.py"]

#for f_rl in pyfiles:
#  url = f"https://raw.githubusercontent.com/{user}/{repo}/main/{src_dir}/{f_rl}"
#  !wget --no-cache --backups=1 {url}
########################## 

os.chdir(curr_dir)

--2022-08-11 09:56:20--  https://raw.githubusercontent.com/Paulescu/hands-on-rl/main/03_cart_pole/src//agent_memory.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /Paulescu/hands-on-rl/main/03_cart_pole/src/agent_memory.py [following]
--2022-08-11 09:56:20--  https://raw.githubusercontent.com/Paulescu/hands-on-rl/main/03_cart_pole/src/agent_memory.py
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 581 [text/plain]
Failed to rename agent_memory.py to agent_memory.py.1: (2) No such file or directory
Saving to: ‘agent_memory.py’


2022-08-11 09:56:21 (39.2 MB/s) - ‘agent_memory.py’ saved [581/581]

--2022-08-11 09:56:21--  https://raw.githubusercontent.com/Paulescu/hands-o

# 2. Intro to Deep Learning: learning the optimal policy from labeled data

[05_crash_course_on_neural_nets.ipynb](https://github.com/Paulescu/hands-on-rl/blob/main/03_cart_pole/notebooks/05_crash_course_on_neural_nets.ipynb)

Skip this lecture if you think you are a Deep Learning master

👉 Let's train a few neural network models in a supervised setting.

👉 In Supervised Machine Learning you have input features and target values. And the goal is to find the right mapping between input features and target values.

👉 Neural networks are usually highly-parametric models that are able to fit complex patterns between the input features and the target.

👉 The type of neural network we will use is a Multi Layer Perceptron (MLP). MLPs are stacks of linear models, interleaved with activation functions.
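As a rough illustration of "stacks of linear models, interleaved with activation functions", here is a minimal pure-Python forward pass. It is a sketch under assumptions, not the model built in the course repo: the layer sizes (4 inputs, 16 hidden units, 2 outputs), the ReLU activation, the random untrained weights, and all function names are chosen only for illustration.

```python
import random


def linear(x, weights, biases):
    """One linear layer: out[j] = sum_i weights[j][i] * x[i] + biases[j]."""
    return [sum(w * x_i for w, x_i in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise activation applied between linear layers."""
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(x, layers):
    """Linear -> ReLU for every layer except the last, which stays linear."""
    for i, (weights, biases) in enumerate(layers):
        x = linear(x, weights, biases)
        if i < len(layers) - 1:
            x = relu(x)
    return x


rng = random.Random(0)
layers = [init_layer(4, 16, rng), init_layer(16, 2, rng)]

state = [0.01, -0.02, 0.03, 0.04]        # a made-up 4-number Cart Pole state
scores = mlp_forward(state, layers)      # one score per action
action = scores.index(max(scores))       # pick the highest-scoring action
print(scores, action)
```

Training consists of adjusting the weights and biases so that the scores match the labels; the forward pass itself does not change.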

In [6]:
%load_ext autoreload
%autoreload 2
%pylab inline
%config InlineBackend.figure_format = 'svg'

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


### Environment 🌎

In [7]:
import gym
env = gym.make('CartPole-v1')

## 1. Data
Steps to generate the training and test data we need to build and evaluate the neural network.

### 1.1 Download the agent parameters from Google Drive 📩

In [8]:
from src.supervised_ml import download_agent_parameters

path_to_agent_data = download_agent_parameters()
print(f'path_to_agent_data={path_to_agent_data}')

FileNotFoundError: ignored

# 2. Environment, actions, states, rewards 👉🏽
[notebooks/00_environment.ipynb](https://github.com/Paulescu/hands-on-rl/blob/main/03_cart_pole/notebooks/00_environment.ipynb)



The state is represented by 4 numbers:

* The cart position $x$, in $[-2.4, 2.4]$.
* The cart velocity $v$.
* The pole angle $θ$ with respect to the vertical, from $-12$ to $12$ degrees (from $-0.21$ to $0.21$ radians).
* The pole angular velocity $ω$. This is the rate of change of $θ$.
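The degree-to-radian conversion quoted for the pole angle can be checked in a couple of lines. This is just the arithmetic, with variable names made up for the example.

```python
import math

max_angle_degrees = 12
max_angle_radians = math.radians(max_angle_degrees)  # degrees * pi / 180

print(f"{max_angle_degrees} degrees = {max_angle_radians:.4f} radians")
# prints: 12 degrees = 0.2094 radians
```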

<figure><center>
<img src="https://miro.medium.com/max/461/1*VZcW5MKAkp2P2uqZ_-o53w.jpeg" width="60%">
<figcaption>Cart pole state vector (Image by the author)</figcaption>
</center></figure>

An episode terminates when either:

* the cart goes beyond the limits: $x > 2.4$ or $x < -2.4$
* the pole is too far from the vertical: $θ > 12$ degrees or $θ < -12$ degrees.
* or the episode reaches the maximum number of steps, $500$. In this case, the agent perfectly solved the episode.
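The three termination conditions above can be written as a small check. This is a paraphrase of the list for illustration, not the environment's actual source code, and the constant and function names are hypothetical.

```python
import math

X_LIMIT = 2.4                    # cart position limit
ANGLE_LIMIT = math.radians(12)   # pole angle limit, converted to radians
MAX_STEPS = 500                  # maximum number of steps per episode


def episode_is_over(x: float, theta: float, steps_taken: int) -> bool:
    """Return True if any of the three termination conditions holds."""
    cart_out_of_bounds = x > X_LIMIT or x < -X_LIMIT
    pole_too_tilted = theta > ANGLE_LIMIT or theta < -ANGLE_LIMIT
    reached_step_cap = steps_taken >= MAX_STEPS
    return cart_out_of_bounds or pole_too_tilted or reached_step_cap


print(episode_is_over(x=0.0, theta=0.05, steps_taken=10))   # False: still balancing
print(episode_is_over(x=2.5, theta=0.0, steps_taken=10))    # True: cart left the track
print(episode_is_over(x=0.0, theta=0.0, steps_taken=500))   # True: step cap reached
```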

The starting state is sampled uniformly at random from the interval $[-0.05, 0.05]$ for each of the 4 state components. Sometimes, the starting position is so close to balance that the episode is easy. Other times, the starting position is so off-balance that the episode is much harder to solve, and sometimes even impossible.
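A minimal sketch of that sampling step, assuming each of the 4 components is drawn independently and uniformly. The function name and the seed are made up for the example, and this is not the environment's own reset code.

```python
import random


def sample_initial_state(rng: random.Random) -> list:
    """Draw x, v, theta, omega independently from [-0.05, 0.05]."""
    return [rng.uniform(-0.05, 0.05) for _ in range(4)]


rng = random.Random(42)  # fixed seed so the example is reproducible
for _ in range(3):
    x, v, theta, omega = sample_initial_state(rng)
    print(f"x={x:+.3f}  v={v:+.3f}  theta={theta:+.3f}  omega={omega:+.3f}")
```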

What about the **cart velocity $v$** and the pole **angular velocity $ω$**? Are these values bounded too?

👉 Before you solve a Reinforcement Learning problem you need to define:
* the actions
* the states of the world
* the rewards

👉 We are using the ```CartPole-v1``` environment from [OpenAI's gym](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

👉 ```CartPole-v1``` is not an extremely difficult environment. However, it is complex enough to force us to level up our game. The tools we will use to solve it are really powerful.

👉Let's explore it!

### Load the environment 

In [None]:
%load_ext autoreload
%autoreload 2
%pylab inline
%config InlineBackend.figure_format = 'svg'

from matplotlib import pyplot as plt
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


In [None]:
## Load the environment 
import gym
env = gym.make('CartPole-v1')

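If you look at the OpenAI gym internal variables ```env.observation_space.low``` and ```env.observation_space.high``` you will see that the bounds for the cart velocity and the pole angular velocity are astronomically large (about $\pm 3.4 \times 10^{38}$), so in practice these 2 quantities are unbounded. The reported position and angle bounds are also wider than the termination limits listed above.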

In [None]:
## The state consists of 4 numbers:
x_min, v_min, angle_min, angular_v_min = env.observation_space.low
x_max, v_max, angle_max, angular_v_max = env.observation_space.high

print(f'Cart position from {x_min:.2f} to {x_max: .2f}')
print(f'Cart velocity from {v_min:.2E} to {v_max: .2E}')
print(f'Angle from {angle_min:.2f} to {angle_max: .2f}')
print(f'Angular velocity from {angular_v_min:.2E} to {angular_v_max: .2E}')

Cart position from -4.80 to  4.80
Cart velocity from -3.40E+38 to  3.40E+38
Angle from -0.42 to  0.42
Angular velocity from -3.40E+38 to  3.40E+38
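The numbers in this output can be reproduced with a bit of arithmetic. The sketch below assumes (my reading, not stated in the lesson) that the velocity bounds are the largest finite 32-bit float and that the position and angle bounds are twice the termination thresholds.

```python
import math

# Largest finite IEEE 754 single-precision (float32) value.
float32_max = (2 - 2 ** -23) * 2.0 ** 127
print(f"largest float32     = {float32_max:.2E}")

x_threshold = 2.4                    # termination limit for the cart position
theta_threshold = math.radians(12)   # termination limit for the pole angle

print(f"2 * x_threshold     = {2 * x_threshold:.2f}")
print(f"2 * theta_threshold = {2 * theta_threshold:.2f}")
```

Under that assumption, an observation that triggers termination still falls inside the declared observation space.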


In [None]:
# env.reset()
# frame = env.render(mode='rgb_array')

# fig, ax = plt.subplots(figsize=(8, 6))
# ax.axes.yaxis.set_visible(False)
# min_x = env.observation_space.low[0]
# max_x = env.observation_space.high[0]
# ax.imshow(frame, extent=[min_x, max_x, 0, 8])

### **actions**

What about the actions our agent can perform?

* 0 : Push the cart to the left.
* 1 : Push the cart to the right.
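With only two discrete actions, the simplest possible agent just picks one at random. The class below is a hypothetical sketch of such a baseline; it is not the `random_agent.py` from the course repo, and the `get_action` signature is an assumption.

```python
import random

# Action encoding used by the environment, as listed above.
ACTIONS = {0: "push cart to the left", 1: "push cart to the right"}


class RandomAgentSketch:
    """Baseline agent that ignores the state and acts uniformly at random."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    def get_action(self, state) -> int:
        # The state is accepted for interface compatibility but not used.
        return self._rng.choice(list(ACTIONS))


agent = RandomAgentSketch(seed=0)
for _ in range(5):
    action = agent.get_action(state=[0.0, 0.0, 0.0, 0.0])
    print(action, "->", ACTIONS[action])
```

Any learning agent should comfortably beat this baseline, which makes it a useful sanity check.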

### **rewards**

The **reward** is +1 for every step taken. This means that the longer the agent keeps the pole standing, the higher the cumulative reward.
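Because every step pays +1, the cumulative reward of an episode is simply the number of steps the agent survived. A tiny sketch with made-up episode lengths:

```python
def episode_return(rewards):
    """Cumulative (undiscounted) reward collected over one episode."""
    return sum(rewards)


short_episode = [1.0] * 23      # pole fell after 23 steps
perfect_episode = [1.0] * 500   # agent survived until the 500-step cap

print(episode_return(short_episode))    # 23.0
print(episode_return(perfect_episode))  # 500.0
```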