<center><h1>AI in Web Development</h1></center>

---

<center><h2>Lesson 04</h2></center>


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snsie/ai-webdev/blob/main/lessons/lesson-04/lesson-04.ipynb)

This notebook is based on [this example](https://colab.research.google.com/github/Curt-Park/rainbow-is-all-you-need/blob/master/01.dqn.ipynb)
and [this example](https://colab.research.google.com/github/ehennis/ReinforcementLearning/blob/master/05-DQN.ipynb#scrollTo=DPWjJiOZ2uVd)  


<h1 align="center">Reinforcement Learning (RL)</h1>

<center><img src="https://github.com/snsie/ai-webdev/blob/main/images/what-is-reinforcement-learning.png?raw=true" width='320px'/></center>


###Environment
  * The stage that contains the simulation

<br/>

###Agent
  * The entity making decisions
  * Can be represented as a neural network

<br/>

###States
* Set of observations that the agent receives from the environment
* example: agent's position

<br/>

###Actions
* Set of activities that can be performed by the agent
* example: move right, move left

<br/>

###Rewards
* Provides agents feedback about their performance
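
The five terms above fit together in a simple loop: the agent observes a state, picks an action, and the environment answers with a new state and a reward. A minimal sketch of that loop follows. `LineWorld`, `run_episode`, and both policies are hypothetical names made up for illustration and are not part of gym or of this lesson's later code.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk along positions 0..4, reward at the right end."""

    def __init__(self, size: int = 5):
        self.size = size
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position  # the state is simply the agent's position

    def step(self, action: int):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.size - 1)
        done = self.position == self.size - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env: LineWorld, policy, max_steps: int = 100) -> float:
    """Run one episode and return the sum of rewards the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent decides
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state: int) -> int:
    return random.choice([0, 1])


def always_right(state: int) -> int:
    return 1


print(run_episode(LineWorld(), always_right))  # 1.0
```

Swapping `always_right` for `random_policy` shows why learning matters: the random agent may or may not reach the goal within `max_steps`.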


<center><h1>Cart Pole Example</h1></center>

---

<center><h4>the cart's goal: balance the pole</h4></center>


[Gym Docs](https://www.gymlibrary.ml/environments/classic_control/cart_pole/)

|Bad Cart          |  Good Cart  |
|:-------------------------:|:-------------------------:|
| <img src='https://github.com/snsie/ai-webdev/blob/main/images/cartpole-initial.gif?raw=true' width="300"/>  | <img src='https://github.com/snsie/ai-webdev/blob/main/images/cartpole-trained.gif?raw=true' width="300"/> |

<center><h3><u>Actions</u></h3></center>

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

<br/>
<br/>

<center><h3><u>States</u></h3></center>

| Num | Observation           | Min                 | Max               |
|-----|-----------------------|:-------------------|:-----------------|
| 0   | Cart Position         | -4.8                | 4.8               |
| 1   | Cart Velocity         | -Inf                | Inf               |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf                | Inf               |
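
The bounds in the table describe the observation space, not the point at which an episode ends. In gym's `cartpole.py` (as of gym 0.21), the episode terminates once the cart position leaves ±2.4 or the pole angle leaves ±12°, which is half of each bound listed above. The sketch below unpacks an observation and applies that check. `is_out_of_bounds` is a hypothetical helper written for illustration, not a gym function.

```python
import math

# Termination thresholds as defined in gym's cartpole.py (gym 0.21).
X_THRESHOLD = 2.4                          # cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360   # 12 degrees, in radians


def is_out_of_bounds(obs) -> bool:
    """Return True if this observation would end a CartPole episode."""
    cart_position, _cart_velocity, pole_angle, _pole_angular_velocity = obs
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD


print(is_out_of_bounds([0.0, 0.0, 0.0, 0.0]))  # False: upright and centered
print(is_out_of_bounds([0.0, 0.0, 0.3, 0.0]))  # True: pole tilted past 12 degrees
```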

<br/>

[Cart Pole Python File](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

RL Algorithm to be used in this example:

<center><h1>Q-learning</h1></center>

<br/>

<h4>
\begin{align}
Q(s,a) = r(s,a) + \gamma \cdot \max_{a'} Q(s',a')
\end{align}
</h4>

<br/>

###$Q(s,a)$ = Q-value of taking action $a$ in state $s$

###$r(s,a)$ = reward for the current action

###$\gamma$ = discount factor that scales $\max_{a'}Q(s',a')$

###$\max_{a'}Q(s',a')$ = maximum Q-value predicted in the next state $s'$
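
As a small worked example of the equation above, the sketch below computes the right-hand side for a single transition. The function name `q_target` and the `done` flag are assumptions added for illustration. The `done` flag reflects the common convention that a terminal state has no next state, so only the reward remains.

```python
def q_target(reward: float, next_q_values, gamma: float = 0.99, done: bool = False) -> float:
    """Right-hand side of the Q-learning equation for one transition."""
    if done:
        # No next state to look ahead to, so the target is just the reward.
        return reward
    return reward + gamma * max(next_q_values)


# Reward 1.0 now; the next state offers Q-values 0.5 (left) and 2.0 (right).
target = q_target(1.0, [0.5, 2.0], gamma=0.9)
print(round(target, 3))  # 1.0 + 0.9 * 2.0, approximately 2.8
```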

<br/>

---

<h3>Long term interpretation of Q:</h3>
<h3>
$$
Q(s_t,a_t) = r(t) + \gamma \cdot r(t+1) + \gamma^2 \cdot r(t+2) + \gamma^3 \cdot r(t+3) + \dots
$$
</h3>
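
The sum above can be evaluated for a finite list of rewards by working backwards, since each term satisfies $G_t = r(t) + \gamma \cdot G_{t+1}$. The sketch below does exactly that. `discounted_return` is a hypothetical helper name and the reward values are made up.

```python
def discounted_return(rewards, gamma: float) -> float:
    """Compute r(t) + gamma * r(t+1) + gamma^2 * r(t+2) + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r(t) + gamma * G_{t+1}
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller $\gamma$ makes the agent short-sighted because later rewards shrink quickly. A $\gamma$ close to 1 makes distant rewards count almost as much as immediate ones.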

**Reference**  
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). *Human-level control through deep reinforcement learning*. Nature, 518(7540), 529-533.

In [32]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !apt install python-opengl
    !apt install ffmpeg
    !apt install xvfb
    !pip install PyVirtualDisplay==3.0
    !pip install gym==0.21.0
    from pyvirtualdisplay import Display
    
    # Start virtual display
    dis = Display(visible=0, size=(400, 400))
    dis.start()



In [33]:
import os
from typing import Dict, List, Tuple

import gym
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import HTML, clear_output, display
import base64
import glob
import io

## Network

We are going to use a simple network architecture: three fully connected layers, with a ReLU non-linearity after each of the first two.
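
One way such a network might look in PyTorch is sketched below. The class name `Network`, the hidden width of 128, and the input and output sizes (4 observations and 2 actions, matching the CartPole tables) are assumptions for illustration, not the lesson's reference solution.

```python
import torch
import torch.nn as nn


class Network(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Three fully connected layers with ReLU after the first two.
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


net = Network(in_dim=4, out_dim=2)
q_values = net(torch.zeros(1, 4))  # a batch containing one CartPole-sized state
print(q_values.shape)  # torch.Size([1, 2])
```

The last layer has no activation because Q-values are unbounded real numbers and can be negative.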

In [34]:
### Create a neural network class 


## Replay buffer

Typically, people implement replay buffers with one of the following three data structures:

  - collections.deque
  - list
  - numpy.ndarray
  
**deque** is very easy to handle once you initialize its maximum length (e.g. `deque(maxlen=buffer_size)`). However, indexing into a deque gets terribly slow as it grows, because it is [internally a doubly linked list](https://wiki.python.org/moin/TimeComplexity#collections.deque). On the other hand, **list** is an array, so it is relatively faster than deque when you sample batches at every step. Its amortized cost of *Get Item* is [O(1)](https://wiki.python.org/moin/TimeComplexity#list).

Last but not least, consider **numpy.ndarray**. It is even faster than list because it is [a homogeneous array of fixed-size items](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray), so it benefits from [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference). A list, by contrast, is an array of pointers to objects, even when all of them are of the same type.

Here, we are going to implement a replay buffer using numpy.ndarray.


Reference: [OpenAI spinning-up](https://github.com/openai/spinningup/blob/master/spinup/algos/sac/sac.py#L10)
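
A minimal sketch of a numpy-backed replay buffer follows, loosely in the spirit of the referenced implementation. The class layout, field names, and method names (`store`, `sample_batch`) are assumptions for illustration. The buffer preallocates fixed-size arrays and overwrites the oldest transition once it is full.

```python
from typing import Dict

import numpy as np


class ReplayBuffer:
    """Fixed-size circular buffer of (obs, act, rew, next_obs, done) transitions."""

    def __init__(self, obs_dim: int, size: int, batch_size: int = 32):
        self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts_buf = np.zeros(size, dtype=np.int64)
        self.rews_buf = np.zeros(size, dtype=np.float32)
        self.done_buf = np.zeros(size, dtype=np.float32)
        self.max_size, self.batch_size = size, batch_size
        self.ptr, self.size = 0, 0  # next write index, number of stored items

    def store(self, obs, act, rew, next_obs, done) -> None:
        self.obs_buf[self.ptr] = obs
        self.next_obs_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size  # wrap around when full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self) -> Dict[str, np.ndarray]:
        # Requires at least batch_size stored transitions.
        idxs = np.random.choice(self.size, size=self.batch_size, replace=False)
        return dict(
            obs=self.obs_buf[idxs],
            next_obs=self.next_obs_buf[idxs],
            acts=self.acts_buf[idxs],
            rews=self.rews_buf[idxs],
            done=self.done_buf[idxs],
        )

    def __len__(self) -> int:
        return self.size


buffer = ReplayBuffer(obs_dim=4, size=1000, batch_size=2)
for i in range(5):
    buffer.store(np.full(4, i), act=i % 2, rew=1.0, next_obs=np.full(4, i + 1), done=False)
batch = buffer.sample_batch()
print(batch["obs"].shape)  # (2, 4)
```

Sampling is a single fancy-indexing operation per array, which is where the contiguous layout of `numpy.ndarray` pays off.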

In [35]:
#Create ReplayBuffer Class


## DQN Agent

Here is a summary of the `DQNAgent` class.

| Method           | Note                                                 |
| ---              | ---                                                  |
|select_action     | select an action from the input state.               |
|step              | take an action and return the response of the env.   |
|compute_dqn_loss  | return the DQN loss.                                 |
|update_model      | update the model by gradient descent.                |
|target_hard_update| hard update from the local model to the target model.|
|train             | train the agent for num_frames frames.               |
|test              | test the agent (1 episode).                          |
|plot              | plot the training progress.                          |
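
To make two of the rows above concrete, here is a hedged sketch of what `compute_dqn_loss` and `target_hard_update` could look like as standalone functions. The signatures, the batch keys (`obs`, `next_obs`, `acts`, `rews`, `done`), and the choice of the smooth L1 (Huber) loss are assumptions for illustration. The lesson's own class may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def compute_dqn_loss(dqn: nn.Module, dqn_target: nn.Module, batch: dict,
                     gamma: float = 0.99) -> torch.Tensor:
    state = torch.as_tensor(batch["obs"], dtype=torch.float32)
    next_state = torch.as_tensor(batch["next_obs"], dtype=torch.float32)
    action = torch.as_tensor(batch["acts"], dtype=torch.int64).reshape(-1, 1)
    reward = torch.as_tensor(batch["rews"], dtype=torch.float32).reshape(-1, 1)
    done = torch.as_tensor(batch["done"], dtype=torch.float32).reshape(-1, 1)

    # Q(s, a) for the actions that were actually taken.
    curr_q = dqn(state).gather(1, action)
    # max over a' of Q_target(s', a'); no gradient flows into the target network.
    with torch.no_grad():
        next_q = dqn_target(next_state).max(dim=1, keepdim=True)[0]
    # r + gamma * max Q(s', a'), with the look-ahead dropped on terminal steps.
    target = reward + gamma * next_q * (1 - done)
    return F.smooth_l1_loss(curr_q, target)


def target_hard_update(dqn: nn.Module, dqn_target: nn.Module) -> None:
    """Copy every weight from the local model into the target model."""
    dqn_target.load_state_dict(dqn.state_dict())


# Tiny stand-in networks: 4 state features in, 2 action values out.
dqn, dqn_target = nn.Linear(4, 2), nn.Linear(4, 2)
target_hard_update(dqn, dqn_target)
batch = {
    "obs": [[0.0, 0.0, 0.0, 0.0]], "next_obs": [[0.1, 0.0, 0.0, 0.0]],
    "acts": [1], "rews": [1.0], "done": [0.0],
}
loss = compute_dqn_loss(dqn, dqn_target, batch)
print(loss.item())
```

Keeping a separate target network that is only refreshed occasionally stops the regression target from shifting on every gradient step.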


In [36]:
# Create DQNAgent class 


#Create the environment

In [37]:
# set the environment


#Set parameters for experiment

In [38]:
# create params



In [39]:
#Initialize the agent using the DQN class


#Train

In [40]:
# train the agent


#Test the network and export video of agent's actions

In [41]:
# create the video_folder var


In [42]:
## create the functions to display the test video in colab
