# Week 06 Notes - Deep Reinforcement Learning <a class="tocSkip">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Deep-RL-for-Database-Optimization" data-toc-modified-id="Deep-RL-for-Database-Optimization-1">Deep RL for Database Optimization</a></span><ul class="toc-item"><li><span><a href="#Notes" data-toc-modified-id="Notes-1.1">Notes</a></span></li><li><span><a href="#Take-Aways" data-toc-modified-id="Take-Aways-1.2">Take Aways</a></span></li><li><span><a href="#Learning-Resources" data-toc-modified-id="Learning-Resources-1.3">Learning Resources</a></span></li></ul></li><li><span><a href="#Deep-Q-Learning-Pong-Tutorial" data-toc-modified-id="Deep-Q-Learning-Pong-Tutorial-2">Deep Q Learning Pong Tutorial</a></span><ul class="toc-item"><li><span><a href="#Notes" data-toc-modified-id="Notes-2.1">Notes</a></span></li><li><span><a href="#Learning-Resources" data-toc-modified-id="Learning-Resources-2.2">Learning Resources</a></span></li></ul></li><li><span><a href="#Prioritized-Experience-Replay-(PER)" data-toc-modified-id="Prioritized-Experience-Replay-(PER)-3">Prioritized Experience Replay (PER)</a></span><ul class="toc-item"><li><span><a href="#Theory" data-toc-modified-id="Theory-3.1">Theory</a></span></li><li><span><a href="#Priority-$p_t$" data-toc-modified-id="Priority-$p_t$-3.2">Priority $p_t$</a></span></li><li><span><a href="#Probability-$P(i)$" data-toc-modified-id="Probability-$P(i)$-3.3">Probability $P(i)$</a></span></li><li><span><a href="#Importance-Sampling-Weights-(IS)" data-toc-modified-id="Importance-Sampling-Weights-(IS)-3.4">Importance Sampling Weights (IS)</a></span></li><li><span><a href="#Google-DeepMind-Paper" data-toc-modified-id="Google-DeepMind-Paper-3.5">Google DeepMind Paper</a></span></li><li><span><a href="#Implementation" data-toc-modified-id="Implementation-3.6">Implementation</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-3.7">Summary</a></span></li><li><span><a href="#More-Papers-(Intermediate)" data-toc-modified-id="More-Papers-(Intermediate)-3.8">More Papers 
(Intermediate)</a></span></li></ul></li><li><span><a href="#Dueling-DQN" data-toc-modified-id="Dueling-DQN-4">Dueling DQN</a></span></li><li><span><a href="#Neural-Networks-Study-Guide" data-toc-modified-id="Neural-Networks-Study-Guide-5">Neural Networks Study Guide</a></span></li><li><span><a href="#Quiz:-Neural-Networks" data-toc-modified-id="Quiz:-Neural-Networks-6">Quiz: Neural Networks</a></span><ul class="toc-item"><li><span><a href="#Question-1" data-toc-modified-id="Question-1-6.1">Question 1</a></span></li><li><span><a href="#Question-2" data-toc-modified-id="Question-2-6.2">Question 2</a></span></li><li><span><a href="#Question-3" data-toc-modified-id="Question-3-6.3">Question 3</a></span></li><li><span><a href="#Question-4" data-toc-modified-id="Question-4-6.4">Question 4</a></span></li><li><span><a href="#Question-5" data-toc-modified-id="Question-5-6.5">Question 5</a></span></li></ul></li><li><span><a href="#Reading-Assignments-(DQN-Improvements)" data-toc-modified-id="Reading-Assignments-(DQN-Improvements)-7">Reading Assignments (DQN Improvements)</a></span></li><li><span><a href="#Homework-Assignment:-Deep-Q-Learning" data-toc-modified-id="Homework-Assignment:-Deep-Q-Learning-8">Homework Assignment: Deep Q Learning</a></span></li></ul></div>

## Deep RL for Database Optimization

**Video Description**:

We can use deep reinforcement learning to optimize a SQL database, and in this video we'll optimize the ordering of a series of SQL queries such that it involves the minimum possible memory/computation footprint. Deep RL involves using a neural network to approximate reinforcement learning functions, like the Q (quality) function. After we frame our database as a Markov Decision Process, I'll use Python to build a Deep Q Network to optimize SQL queries. Enjoy!


### Notes

SQL:

$ SQL\ Statement \rightarrow Parsing \rightarrow (Parse\ Tree) \rightarrow Binding \rightarrow (Algebrized\ Tree) \rightarrow Query\ Optimization \rightarrow (Execution\ Plan) \rightarrow Query\ Execution \rightarrow Query\ Results $


### Take Aways

- Deep Reinforcement Learning uses a neural network to approximate reinforcement learning functions such as the Q function
- We can assess the quality (Q) of state-action pairs by computing a Q table
- Deep Q Learning approximates the mapping from state-action pairs to Q values with a neural network instead of storing every entry of this table


### Learning Resources

- [Youtube Video](https://www.youtube.com/watch?v=Rw3ewEXOKC8)
- [Code Link: SQL Database Optimization](https://github.com/llSourcell/SQL_Database_Optimization)
- [RISELab: SQL Query Optimization Meets Deep Reinforcement Learning](https://rise.cs.berkeley.edu/blog/sql-query-optimization-meets-deep-reinforcement-learning/)
- [MLDB: Machine Learning Database](https://mldb.ai/)
- [Microsoft: Machine Learning Services in SQL Server 2017](https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-2017)
- [Towards Data Science: Machine Learning in your Database](https://towardsdatascience.com/machine-learning-in-your-database-the-case-for-and-against-bigquery-ml-4f2309282fda)
- [Quora: Which database is best for machine learning](https://www.quora.com/Which-database-is-best-for-machine-learning)

## Deep Q Learning Pong Tutorial

**Video Description**:

Learn how to build an AI that plays Pong like a boss, with Deep Q Learning. Discover how neural networks can learn to play challenging video games at superhuman levels by looking at raw pixels.


### Notes

**How to Play Pong Like a BOSS with Deep Q Learning**

- If we can write code which masters complex video games we can use that same code to master complex real-life problems


**Taking Q DEEP**

- In vanilla Q Learning, we are storing a huge lookup table whose size is the number of possible states times the number of possible actions
- PROBLEM: possible game states are astronomically large!
- Bellman Equation remains the same
- SOLUTION: replace the Q lookup table with a neural network, which approximates a function for Q of state and action
- BEFORE: we use the most recent value calculation and a learning rate to update our Q table
- NOW: Update the Q Network with stochastic gradient descent (SGD) and Back propagation
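
For contrast with the network-based update, here is a minimal sketch of the tabular "BEFORE" update. The values of `ALPHA` and `GAMMA`, the number of actions, and the state names are assumptions chosen only for illustration.

```python
from collections import defaultdict

ALPHA = 0.1        # learning rate (assumed value)
GAMMA = 0.99       # discount factor (assumed value)
NUM_ACTIONS = 2    # assumed size of the action space

# Lookup table: one row per state, one column per action.
q_table = defaultdict(lambda: [0.0] * NUM_ACTIONS)

def q_update(state, action, reward, next_state, done):
    """Move Q(s, a) a small step towards the Bellman target."""
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (target - q_table[state][action])

# One terminal transition with reward 1 nudges Q("s0", 1) from 0 towards 1.
q_update("s0", 1, 1.0, "s1", done=True)
```

The table grows with every new state it sees, which is exactly the scaling problem the neural network is meant to remove.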


**Over-simplified Q Learning Algorithm**

1. Initialize Q(s, a) randomly
2. Interact with the environment to obtain (s, a, r, s')
3. Calculate loss: $ L\ =\ (Q_{s,a} - r)^2 $ if the episode has ended,  
   otherwise: $ L\ =\ (Q_{s,a}\ -\ (r\ +\ \gamma \max_{a'} Q_{s',a'}))^2  $
4. Update Q(s, a) using SGD, minimizing the loss function
5. Repeat steps 2 - 4 until converged
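
The loss in step 3 can be sketched for a single transition as follows. This is an illustrative helper, not code from the tutorial repository; `td_loss` and the value of `GAMMA` are assumptions.

```python
GAMMA = 0.99  # discount factor (assumed value)

def td_loss(q_sa, reward, next_q_values, done, gamma=GAMMA):
    """Squared TD error for one transition (s, a, r, s')."""
    if done:
        # Terminal transition: there is no future value to bootstrap from.
        target = reward
    else:
        # Non-terminal: bootstrap from the best action value in s'.
        target = reward + gamma * max(next_q_values)
    return (q_sa - target) ** 2

terminal_loss = td_loss(q_sa=0.5, reward=1.0, next_q_values=[0.0, 0.0], done=True)
running_loss = td_loss(q_sa=0.5, reward=0.0, next_q_values=[0.2, 1.0], done=False)
```

In a real network the loss is averaged over a mini-batch and minimized with SGD; the scalar version only shows how the two cases differ.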


**Explore vs. Exploit**

- To reach an optimal policy, we need to balance exploration with exploitation
- Continue to use Epsilon greedy method
- Start with epsilon = 1 -> taking all random actions
- Gradually taper to lower value over a fixed number of game frames
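
A minimal sketch of a linearly decaying epsilon with epsilon-greedy action selection. The final value `EPS_FINAL` and the decay length `EPS_DECAY_FRAMES` are assumptions; the notes only say that epsilon starts at 1 and tapers over a fixed number of frames.

```python
import random

EPS_START = 1.0             # all actions random at the start
EPS_FINAL = 0.02            # assumed floor value
EPS_DECAY_FRAMES = 100_000  # assumed length of the decay

def epsilon_at(frame):
    """Linearly interpolate from EPS_START to EPS_FINAL, then stay flat."""
    fraction = min(frame / EPS_DECAY_FRAMES, 1.0)
    return EPS_START + fraction * (EPS_FINAL - EPS_START)

def select_action(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the function always returns the greedy action, which is handy when evaluating a trained agent.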


**Replay Buffer**

- SGD optimization requires Independent and Identically Distributed training data
- Our state transitions are highly correlated
- Store a long list of (s, a, r, s') transitions
- Randomly sample batches to train on from the buffer
- New data pushes the oldest data out once the buffer is full
- Randomly sampling from a long list breaks the correlation that comes from using transitions that occurred right next to each other, which allows the model to converge
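
A replay buffer along these lines can be sketched with a bounded `deque`. The class name and the extra `done` field are assumptions added for illustration; the notes only list (s, a, r, s').

```python
import random
from collections import deque, namedtuple

# `done` is an added field (assumption) so terminal transitions can be told apart.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

After five pushes into a buffer of capacity three, only the transitions starting in states 2, 3 and 4 remain.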


**Target Network**

- Loss function, $ L\ =\ (Q_{s,a}\ -\ (r\ +\ \gamma \max_{a'} Q_{s',a'}))^2  $
- We're updating Q(s, a) and Q(s', a') in the same step, so the target moves every time the network is updated
- We'll compute Q(s', a') with a separate network (target_network)
- Copy the weights from main to target at a fixed interval
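
The periodic copy can be sketched without any deep learning library by treating the weights as a plain dictionary. `SYNC_EVERY` and the fake "SGD update" are assumptions used only to make the example runnable; in PyTorch the copy is typically done with `target.load_state_dict(online.state_dict())`.

```python
import copy

SYNC_EVERY = 1000  # assumed sync interval in training steps

online_weights = {"w": [0.0, 0.0]}
target_weights = copy.deepcopy(online_weights)  # independent copy

for step in range(1, 2501):
    online_weights["w"][0] += 0.001  # stand-in for one SGD update
    if step % SYNC_EVERY == 0:
        # The target network only changes here, so targets stay fixed in between.
        target_weights = copy.deepcopy(online_weights)
```

After 2500 steps the target still holds the weights from step 2000, while the online weights have moved on.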


**Predicting Motion**

- Markov Property: The Past Doesn't Matter Baby!
- But when there's motion it does matter
- Solution: stack several recent frames together as input
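
Frame stacking can be sketched with a fixed-length `deque`. The class name `FrameStack` and the default of four frames are assumptions; the frames here are plain strings so the example stays self-contained.

```python
from collections import deque

class FrameStack:
    """Keep the most recent frames so the network can infer motion."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame at episode start.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # The oldest frame falls off the left side automatically.
        self.frames.append(new_frame)
        return list(self.frames)

stack = FrameStack(num_frames=4)
start = stack.reset("f0")
after_one = stack.step("f1")
```

The stacked observation restores the Markov property in practice: ball direction and speed are visible from the difference between frames.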


**Network Architecture**

- How do we take a large number of pixels from the screen and pick out which objects are important to achieving results in a video game? The same neural network used in cutting edge image recognition is perfect for applying reinforcement learning to screen pixels
- Use 3 layers of convolutions with each one passing through a ReLU activation
- Input
- Conv2D -> ReLU (x3)
- Fully connected (512) -> ReLU
- Actions: the output layer produces a value for each action
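
To see how the convolutions shrink the image before the fully connected layer, the sketch below computes the feature-map sizes. The 84x84 input and the kernel/stride/channel values are assumptions taken from the commonly used DQN configuration; the notes themselves only specify three convolutions and a 512-unit layer.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1

# (out_channels, kernel, stride) per layer -- assumed values
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # assumed input height and width in pixels
sizes = []
for channels, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# Number of inputs to the fully connected (512) layer.
flat_features = layers[-1][0] * size * size
```

Under these assumptions the feature maps go 84 -> 20 -> 9 -> 7, so the fully connected layer receives 64 * 7 * 7 = 3136 values.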


**Deep Q Network Algorithm**

1. Initialize Q(s,a) and $\hat{Q}(s,a)$ (target network) with random weights
2. With probability epsilon, select a random action $a$, otherwise $ a = \arg\max_a Q_{s,a} $
3. Execute action $a$ in the game, observe reward r, next state s'
4. Store transition (s, a, r, s') in the replay buffer
5. Sample a random mini-batch of transitions from the replay buffer
6. For every transition in the mini-batch, calculate target $ y = r $ if the episode is over, otherwise $ y = r\ +\ \gamma \max_{a'} \hat{Q}_{s',a'} $
7. Calculate loss: $ L\ =\ (Q_{s,a}\ -\ y)^2  $
8. Update Q(s, a) using SGD, minimizing the loss function
9. Every N steps copy weights from Q to $\hat{Q}$
10. Repeat from step 2 until converged

Deep Learning: use either PyTorch (easier) or Tensorflow (great for production, steeper learning curve)


**Simple then Expand**

- Reinforcement Learning started out with simple applications of the Bellman Equation
- It gradually evolved enhancements and workarounds when the basic approach performed poorly on tasks
- A basic implementation of Q Learning can only handle fairly simple tasks, so we're going to start out with Pong
- After learning the Deep Q enhancements, we can try out more complex games like Doom


**Wrappers**

- In OpenAI Gym, Wrappers are a layer of code that takes observations (raw pixels) from the environment and processes them before they enter the neural network
- A layer of code around OpenAI gym
- Transforms observations before passing them to the network
- Transforms actions before passing them to the environment
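
The wrapper idea can be sketched without importing Gym. `DummyEnv` is a hypothetical stand-in for a real environment and `ScaleObservation` is an illustrative name; a real Gym wrapper would subclass `gym.ObservationWrapper` and override `observation`.

```python
class DummyEnv:
    """Hypothetical environment that returns raw pixel values in 0..255."""

    def reset(self):
        return [0, 128, 255]

    def step(self, action):
        return [255, 128, 0], 1.0, False, {}

class ScaleObservation:
    """Wrapper that rescales pixels to 0..1 before the network sees them."""

    def __init__(self, env):
        self.env = env

    def observation(self, obs):
        return [pixel / 255.0 for pixel in obs]

    def reset(self):
        return self.observation(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self.observation(obs), reward, done, info

env = ScaleObservation(DummyEnv())
first_obs = env.reset()
next_obs, reward, done, info = env.step(0)
```

Because each wrapper exposes the same `reset`/`step` interface as the environment, wrappers can be chained, for example scaling followed by frame stacking.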


**How to Run Deep Q Pong**

- ```dqn_basic.py --cuda```
- ```tensorboard --logdir runs```
- Training took ~2 hours on an NVIDIA GTX 1080 Ti GPU

### Learning Resources

- [Youtube Video: Deep Q Learning Pong Tutorial](https://www.youtube.com/watch?v=pST6caY3mu8)
- [Code Link: DQN Pong](https://github.com/colinskow/move37/tree/master/dqn)
- [Siraj: Image Recognition Tutorial](https://www.youtube.com/watch?v=cAICT4Al5Ow)

## Prioritized Experience Replay (PER)

This section mainly draws on Thomas Simonini's blog. Sources:

- Thomas Simonini's blog about Improvements in Deep Q Learning
- Prioritized Experience Replay by Google DeepMind
- SLM Lab@School of AI Github
- OpenAI Github
- Patrick Emami's blog
- Jaromiru's blog about Let’s Make a DQN


### Theory

- Some experiences may be more important than others for our training, but might occur less frequently
- Because we sample the batch uniformly (selecting the experiences randomly) these rich experiences that occur rarely have practically no chance to be selected.


### Priority $p_t$

- We want to prioritize experiences where there is a big difference between our prediction and the TD target, since it means that we have a lot to learn from them.
- Define Priority $p_t$ as:

$$ \large p_t = |\delta_{t}| + e $$

$|\delta_{t}|$: Magnitude of our TD error  
$e$: Small positive constant that ensures no experience has zero probability of being sampled

- TD Error: $|\delta_t|\ = |Q(s,a)\ - T(s)|$ where $T(s)\ =\ r + \gamma Q(s', \arg\max_a Q(s',a))$
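
A minimal sketch of the priority computation for one transition. The values of `GAMMA` and `EPSILON` and the helper names are assumptions for illustration.

```python
GAMMA = 0.99    # discount factor (assumed value)
EPSILON = 0.01  # the small constant e (assumed value)

def td_target(reward, next_q_values, gamma=GAMMA):
    """T(s) = r + gamma * Q(s', argmax_a Q(s', a))."""
    best_action = max(range(len(next_q_values)), key=lambda a: next_q_values[a])
    return reward + gamma * next_q_values[best_action]

def priority(q_sa, target, eps=EPSILON):
    """p_t = |delta_t| + e, so a zero TD error still gets a non-zero priority."""
    return abs(q_sa - target) + eps

target = td_target(reward=1.0, next_q_values=[0.0, 2.0])
surprising = priority(q_sa=0.98, target=target)
well_predicted = priority(q_sa=target, target=target)
```

The transition whose prediction is far from the target gets a priority of about 2.01, while the perfectly predicted one keeps only the constant 0.01.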


### Probability $P(i)$

Priority is translated to probability of being chosen for replay.

A sample _i_ has a probability of being picked during the experience replay determined by a formula:

$$ \large P(i) = \frac{p_i^a}{\sum_k p_k^a}$$

where
- $p_i$ is the priority value of experience $i$
- $\sum_k p_k^a$ is the sum over all priority values in the replay buffer (each raised to the power $a$), which normalizes $P(i)$
- $a$ is a hyperparameter used to reintroduce some randomness in the experience selection for the replay buffer.
    - If $a\ = 0 \rightarrow $ pure uniform randomness
    - If $a\ = 1 \rightarrow $ sampling is fully proportional to priority, which strongly favors the experiences with the highest priorities

- $P(i)$ is the probability and $p_i$ is the priority; we define this modified probability so that experiences with higher priorities are picked more often
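
The effect of the exponent $a$ can be checked with a small sketch. The priority values are made up for illustration, and the linear scan is an assumption made for brevity; larger buffers usually rely on a sum tree for efficient sampling.

```python
import random

def sampling_probabilities(priorities, a):
    """P(i) = p_i^a / sum_k p_k^a."""
    scaled = [p ** a for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

priorities = [1.0, 2.0, 4.0, 1.0]  # made-up priorities

uniform = sampling_probabilities(priorities, a=0.0)
proportional = sampling_probabilities(priorities, a=1.0)

# Draw two buffer indices according to the proportional distribution.
indices = random.choices(range(len(priorities)), weights=proportional, k=2)
```

With $a = 0$ every experience gets probability 0.25; with $a = 1$ the experience with priority 4 is sampled half of the time.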


### Importance Sampling Weights (IS)

- Samples that have high priority are likely to be used for training many times in comparison with low priority experiences (=bias)
- Therefore, we will update our weights with only a small portion of experiences that we consider to be really interesting
- To correct this bias, we use importance sampling weights (IS) that will **adjust the updating by reducing the weights** of the often seen samples

$$ \large W_i\ =\ \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^b $$

where
- $N$ is the replay buffer size
- $P(i)$ is the sampling probability
- $b$ controls how much the IS weights affect learning
- $b$ starts close to 0 at the beginning of learning and is annealed up to 1 over the duration of training because **these weights are more important in the end of learning when our Q values begin to converge**
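
A sketch of the importance sampling weights and a linear schedule for $b$. Dividing by the largest weight follows the normalization described in the PER paper; the starting value `b_start=0.4` and the function names are assumptions.

```python
def importance_weights(probabilities, buffer_size, b):
    """W_i = (1/N * 1/P(i))^b, scaled so the largest weight is 1."""
    raw = [(1.0 / (buffer_size * p)) ** b for p in probabilities]
    max_w = max(raw)
    return [w / max_w for w in raw]

def beta_at(step, total_steps, b_start=0.4):
    """Anneal b linearly from b_start (assumed) up to 1."""
    fraction = min(step / total_steps, 1.0)
    return b_start + fraction * (1.0 - b_start)

probs = [0.125, 0.25, 0.5, 0.125]
full_correction = importance_weights(probs, buffer_size=4, b=1.0)
no_correction = importance_weights(probs, buffer_size=4, b=0.0)
```

With $b = 1$ the most frequently sampled experience (probability 0.5) gets the smallest weight, 0.25; with $b = 0$ all weights are 1 and no correction is applied. Each weight then multiplies the loss of its sample before the gradient step.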


### Google DeepMind Paper

Define the priority $p_i$, sample transitions with probability $P(i)$, and update with the importance sampling weight $w_i$.

![Google DeepMind PER Paper Algorithm 1](imgs/move_37_google_deepmind_per_paper_algorithm1.jpg)

<br/>

### Implementation

- TODO: Add relevant code parts!

- Priority Part ([SLM Lab](https://github.com/kengz/SLM-Lab/blob/master/slm_lab/agent/memory/prioritized.py))
- Priority Part ([OpenAI](https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py))
- Probability and IS Part ([OpenAI](https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py))


### Summary

Uniform sampling from replay memory is not an efficient way to learn. With a clever prioritization scheme that labels the experiences in replay memory, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias; hence, weighted importance sampling must be employed to correct for it. Experiments on the Arcade Learning Environment show that prioritized sampling with Double DQN significantly outperforms the previous state-of-the-art Atari results.


### More Papers (Intermediate)

- [Distributed Prioritized Experience Replay ICLR 2018](https://arxiv.org/abs/1803.00933)
- [A Deeper Look at Planning as Learning from Replay (Richard Sutton 2015)](http://proceedings.mlr.press/v37/vanseijen15.pdf)

## Dueling DQN


**Additional Resources**:

- [Medium: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets](https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682)

## Neural Networks Study Guide

[Neural Networks Study Guide](https://www.theschool.ai/wp-content/uploads/2018/10/Move-37-Week-6-Study-Guide.pdf)

## Quiz: Neural Networks

Test your knowledge on FFNN, RNN, LSTM, and GRU architectures

### Question 1


- [] 
- [] 
- [] 
- [] 
- [] 


**Explanation**:



### Question 2


- [] 
- [] 
- [] 
- [] 
- [] 


**Explanation**:


### Question 3


- [] 
- [] 
- [] 
- [] 
- [] 


**Explanation**:



### Question 4

- [] 
- [] 
- [] 
- [] 
- [] 


**Explanation**:



### Question 5



- [] 
- [] 
- [] 
- [] 
- [] 



**Explanation**:



## Reading Assignments (DQN Improvements)

3 pivotal papers on Deep Q Learning:

- [Paper 1 – Playing Atari with Deep Reinforcement Learning](https://arxiv.org/pdf/1312.5602.pdf)

- [Paper 2 – Episodic Memory with Deep Q Networks](https://www.ijcai.org/proceedings/2018/0337.pdf)

- [Paper 3 – Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.pdf)


**Additional Resources**

- [Medium: Deep Reinforcement Learning: Deep Q Learning and Policy Gradients (Towards AGI)](https://medium.com/deep-math-machine-learning-ai/ch-13-deep-reinforcement-learning-deep-q-learning-and-policy-gradients-towards-agi-a2a0b611617e)

## Homework Assignment: Deep Q Learning

This week's homework assignment is to build a Deep Q Network that solves the [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) environment in OpenAI’s Gym. Use this [repository](https://github.com/AndersonJo/dqn-pytorch) as a helpful guide. Train it and test it. If your algorithm successfully learns how to beat the environment, you’ve completed the assignment. Good luck!

See [homework 06 notebook](homework06/homework06.ipynb)