# Week 08 Notes - Policy Gradient Methods <a class="tocSkip">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Policy-Gradients-Math-Primer" data-toc-modified-id="Policy-Gradients-Math-Primer-1">Policy Gradients Math Primer</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#A-Short-Intro-to-Entropy,-Cross-Entropy-and-KL-Divergence" data-toc-modified-id="A-Short-Intro-to-Entropy,-Cross-Entropy-and-KL-Divergence-1.0.1">A Short Intro to Entropy, Cross-Entropy and KL-Divergence</a></span></li><li><span><a href="#Softmax-Output-Function" data-toc-modified-id="Softmax-Output-Function-1.0.2">Softmax Output Function</a></span></li></ul></li></ul></li><li><span><a href="#Policy-Gradients-Math-Quiz" data-toc-modified-id="Policy-Gradients-Math-Quiz-2">Policy Gradients Math Quiz</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Question-1" data-toc-modified-id="Question-1-2.0.1">Question 1</a></span></li><li><span><a href="#Question-2" data-toc-modified-id="Question-2-2.0.2">Question 2</a></span></li><li><span><a href="#Question-3" data-toc-modified-id="Question-3-2.0.3">Question 3</a></span></li><li><span><a href="#Question-4" data-toc-modified-id="Question-4-2.0.4">Question 4</a></span></li><li><span><a href="#Question-5" data-toc-modified-id="Question-5-2.0.5">Question 5</a></span></li><li><span><a href="#Question-6" data-toc-modified-id="Question-6-2.0.6">Question 6</a></span></li><li><span><a href="#Question-7" data-toc-modified-id="Question-7-2.0.7">Question 7</a></span></li><li><span><a href="#Question-8" data-toc-modified-id="Question-8-2.0.8">Question 8</a></span></li><li><span><a href="#Question-9" data-toc-modified-id="Question-9-2.0.9">Question 9</a></span></li></ul></li></ul></li><li><span><a href="#Policy-Gradients-Methods-Tutorial" data-toc-modified-id="Policy-Gradients-Methods-Tutorial-3">Policy Gradients Methods Tutorial</a></span></li><li><span><a href="#Policy-Gradient-Methods-(REINFORCE)" data-toc-modified-id="Policy-Gradient-Methods-(REINFORCE)-4">Policy 
Gradient Methods (REINFORCE)</a></span></li><li><span><a href="#Evolved-Policy-Gradients" data-toc-modified-id="Evolved-Policy-Gradients-5">Evolved Policy Gradients</a></span></li><li><span><a href="#Policy-Gradients-Study-Guide" data-toc-modified-id="Policy-Gradients-Study-Guide-6">Policy Gradients Study Guide</a></span></li><li><span><a href="#Policy-Gradients-Quiz" data-toc-modified-id="Policy-Gradients-Quiz-7">Policy Gradients Quiz</a></span></li><li><span><a href="#Homework-Assignment-(Monte-Carlo-Policy-Gradients)" data-toc-modified-id="Homework-Assignment-(Monte-Carlo-Policy-Gradients)-8">Homework Assignment (Monte Carlo Policy Gradients)</a></span></li><li><span><a href="#Artificial-Curiosity" data-toc-modified-id="Artificial-Curiosity-9">Artificial Curiosity</a></span></li></ul></div>

# Policy Gradients Math Primer

[Logarithms Refresher](https://www.mathsisfun.com/algebra/logarithms.html)

### A Short Intro to Entropy, Cross-Entropy and KL-Divergence

- [Youtube Video](https://www.youtube.com/watch?v=ErfnhcEV1O8)
- [Paper: A Mathematical Theory of Communication](https://pure.mpg.de/rest/items/item_2383164/component/file_2383163/content)

**Video Description**:

Entropy, Cross-Entropy and KL-Divergence are often used in Machine Learning, in particular for training classifiers. In this short video, you will understand where they come from and why we use them in ML.


**Notes**:

- Cross entropy is commonly used as a cost function when training classifiers
- Entropy measures the average amount of information that you get from one sample drawn from a given probability distribution $p$. It tells you how unpredictable that probability distribution is.

$$ \large H(p)\ =\ -\sum_i p_i log_2(p_i) \\ $$

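- Cross entropy is the average message length (in bits) when samples drawn from the true distribution $p$ are encoded with a code designed for the predicted distribution $q$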

$$ \large H(p, q)\ =\ -\sum_i p_i log_2(q_i) \\ $$

where:
- $p\ =\ $ true distribution
- $q\ =\ $ predicted distribution

- If our predictions are perfect, meaning the predicted distribution is equal to the true distribution, then cross-entropy is equal to entropy
- If the distributions differ, then the cross-entropy will be greater than the entropy by some number of bits. This amount is called the relative entropy or more commonly, the Kullback-Leibler Divergence (KL Divergence)

$$ Cross\ Entropy\ =\ Entropy\ +\ KL\ Divergence $$

- KL Divergence is equal to the cross entropy minus the entropy

$$ D_{KL}(p\ ||\ q) = H(p,\ q)\ - H(p) $$
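The three quantities above can be checked numerically. The following is a minimal sketch in plain Python; the distributions `p` and `q` and the function names are made up for illustration and do not come from the video.

```python
import math

def entropy(p):
    """Average information (in bits) per sample drawn from p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average message length (in bits) when samples from p are coded for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for using q instead of p: H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # hypothetical true distribution
q = [0.25, 0.25, 0.5]   # hypothetical predicted distribution

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # 1.75
print(kl_divergence(p, q))  # 0.25
```

When `q` equals `p`, `cross_entropy(p, q)` equals `entropy(p)` and the KL divergence is zero, which matches the "perfect predictions" case described above.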

- Train an image classifier to detect some animals: Cat, Dog, Fox, Cow, Red Panda, Bear or Dolphin (7 possibilities)
- Classifier outputs an estimated probability for each class, the predicted probability distribution
- Since this is a Supervised Learning problem, we know the true distribution
- Use the cross-entropy between these two distributions as a cost function, called a cross-entropy loss or log loss

- Cross Entropy Loss: uses the natural logarithm rather than the binary logarithm

$$ H(p,\ q)\ =\ -\sum_i p_i log(q_i) $$
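As a rough illustration of the loss for the seven-animal classifier, here is a minimal sketch. The predicted probabilities and the choice of "Fox" as the true label are invented for the example, and `cross_entropy_loss` is a hypothetical helper name.

```python
import math

classes = ["Cat", "Dog", "Fox", "Cow", "Red Panda", "Bear", "Dolphin"]

# Hypothetical classifier output for one image (the values sum to 1).
predicted = [0.02, 0.30, 0.45, 0.05, 0.15, 0.02, 0.01]

# Assume the image really shows a fox, so the true distribution is one-hot.
true = [1.0 if name == "Fox" else 0.0 for name in classes]

def cross_entropy_loss(p, q):
    """Cross-entropy H(p, q) using the natural logarithm."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

loss = cross_entropy_loss(true, predicted)
print(loss)  # equals -log(0.45), the probability given to the true class
```

With a one-hot true distribution, every term except the true class vanishes, so the loss is simply the negative log of the probability the classifier assigned to the correct answer.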


### Softmax Output Function

- [Youtube Video: Neural Networks for Machine Learning](https://www.youtube.com/watch?v=mlaLLQofmR8)

**Video Description**:

Lecture from the course Neural Networks for Machine Learning, as taught by Geoffrey Hinton (University of Toronto) on Coursera in 2012. 


**Notes**:

- Softmax output function forces the outputs of a neural network to sum to one so that they can represent a probability distribution across discrete mutually exclusive alternatives

_Problems with squared error_

- The squared error measure has some drawbacks:
    - If the desired output is 1 and the actual output is 0.000000001 there is almost no gradient for a logistic unit to fix up the error
    - If we try to assign probabilities to mutually exclusive class labels, we know that the outputs should sum to 1, but we are depriving the network of this knowledge
- Is there a different cost function that works better?
    - Yes: Force the outputs to represent a probability distribution across discrete alternatives via use of Softmax Function
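The first drawback can be made concrete with a small numerical sketch. It assumes a single logistic unit with squared error $\frac{1}{2}(t - y)^2$ and compares its gradient with the gradient of the cross-entropy cost covered later in these notes; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_grad(z, t):
    """dE/dz for E = 0.5 * (t - y)^2 with y = sigmoid(z)."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

def cross_entropy_grad(z, t):
    """dC/dz for C = -(t*log(y) + (1-t)*log(1-y)) with y = sigmoid(z)."""
    return sigmoid(z) - t

z = math.log(1e-9)  # a logit for which the output y is about 0.000000001
t = 1.0             # desired output

print(squared_error_grad(z, t))  # tiny: roughly -1e-9
print(cross_entropy_grad(z, t))  # large: roughly -1
```

Even though the unit is almost maximally wrong, the squared-error gradient is scaled by $y(1 - y)$ and nearly vanishes, while the cross-entropy gradient stays close to $-1$.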

_Softmax_

- The output units in a softmax group use a non-local non-linearity
- They each receive some total input they've accumulated from the layer below (the logit, $z_i$) and provide an output $y_i$ that depends on the $z$'s accumulated by their rivals as well as their own.
- Output of the ith neuron:

$$ \large y_i\ =\ \frac{e^{z_i}}{\sum_{j\in group} e^{z_j}} $$

- The bottom line of the equation is the sum of the top line over all possibilities, so adding the outputs over all possibilities gives 1: the sum of all $y_i$'s must come to one
- The $y_i$'s must lie between 0 and 1
- So we force the $y_i$'s to represent a probability distribution over mutually exclusive alternatives just by using that soft max equation

- Softmax Derivative:

$$ \large \frac{\partial y_i}{\partial z_i}\ =\ y_i(1-y_i) $$
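A minimal sketch of softmax and its derivative in plain Python follows. Two things here go beyond the lecture notes and are assumptions of this sketch: subtracting the maximum logit before exponentiating (a common numerical-stability trick that does not change the result), and the off-diagonal terms $\partial y_i / \partial z_j = -y_i y_j$ for $i \neq j$, which are the standard companion to the diagonal formula above.

```python
import math

def softmax(z):
    """Softmax of a list of logits, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = dy_i/dz_j = y_i * (1[i == j] - y_j)."""
    y = softmax(z)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

z = [1.0, 2.0, 0.5]      # hypothetical logits
y = softmax(z)
J = softmax_jacobian(z)

print(y, sum(y))          # outputs lie in (0, 1) and sum to 1
print(J[0][0], y[0] * (1 - y[0]))  # diagonal matches y_i * (1 - y_i)
```

Each row of the Jacobian sums to zero: raising one logit increases that unit's output only at the expense of its rivals.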


_Cross Entropy: the right cost function to use with softmax_

- The right cost function is the negative log probability of the right answer.

$$ \large C\ =\ -\sum_{j} t_j log(y_j) $$

where $t_j$ is the target value

- C has a very big gradient when the target value is 1 and the output is almost 0
    - A value of 0.000001 is much better than 0.000000001
    - The steepness of dC/dy exactly balances the flatness of dy/dz
    
$$ \large \frac{\partial C}{\partial z_i}\ =\ \sum_{j} \frac{\partial C}{\partial y_j}\ \frac{\partial y_j}{\partial z_i}\ =\ y_i\ -\ t_i $$
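The result $y_i - t_i$ can be sanity-checked against a finite-difference estimate. This is a sketch only; the logits, the one-hot target, and the helper names are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, t):
    """C = -sum_j t_j * log(y_j), where y = softmax(z)."""
    y = softmax(z)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y) if tj > 0)

def analytic_grad(z, t):
    """dC/dz_i = y_i - t_i."""
    return [yi - ti for yi, ti in zip(softmax(z), t)]

def numeric_grad(z, t, eps=1e-6):
    """Central finite-difference estimate of dC/dz."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((cross_entropy(up, t) - cross_entropy(down, t)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # hypothetical logits
t = [0.0, 1.0, 0.0]    # hypothetical one-hot target

print(analytic_grad(z, t))
print(numeric_grad(z, t))   # agrees with the analytic gradient
```

The two gradients agree to within the finite-difference error, which supports the claim that the steepness of $dC/dy$ cancels the flatness of $dy/dz$.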

# Policy Gradients Math Quiz

This is a quiz to test your understanding of the Policy Gradients Math Primer lesson. Make sure you are confident on the material before you attempt it!

### Question 1

Check all of the following which are true about logarithms.

    - [ ] Log is a function which outputs the value of tensors to the console
    - [x] A log is how many of the base number we need to multiply to get the given value
    - [ ] log3(x) is the cubed root of x cubed
    - [x] log base 2 of 8 means 2 to what power equals 8
    - [x] A negative result of a log means we are dividing instead of multiplying
    - [ ] The result of a log is always positive


### Question 2

What is $ log_2(64) $?

    - [ ] 2
    - [ ] 4
    - [x] 6
    - [ ] 8

**Explanation**: $2 ^ 6 = 64$, therefore the log base 2 of 64 is 6.


### Question 3

What is $ log_e(1) $?

    - [ ] - infinity
    - [ ] -1
    - [x] 0
    - [ ] 1
    - [ ] infinity

**Explanation**: $e^0 = 1$, in fact any positive number to the power of zero equals one.


### Question 4

Check all that are true about entropy.

    - [ ] Entropy is the tendency of neural networks to move from a state of order to chaos
    - [x] Entropy is a measure of uncertainty in a probability distribution
    - [ ] Entropy is an objective scientific measure of how bad Siraj's hair gets on a bad hair day
    - [ ] Entropy goes up as an RL algorithm becomes more certain of which actions to take
    - [x] Entropy goes down as an RL algorithm becomes more certain of which actions to take


### Question 5

Friday night you have a $60\%$ chance of going out for food with buddies, $30\%$ chance of staying home to study, and $10\%$ chance of landing a really hot date. What is the (base e) entropy of your Friday night probability distribution?

- [x] $- (0.6 * log(0.6) + 0.3 * log(0.3) + 0.1 * log(0.1)) = 0.8979$
- [ ] $\sqrt{0.6 ^ 2 + 0.3 ^ 2 + 0.1 ^2} = 0.6782$
- [ ] $(0.6 * log(0.6) + 0.3 * log(0.3) + 0.1 * log(0.1)) ^ 2 = 0.8063$
- [ ] $\sqrt{0.4 ^ 2 + 0.7 ^ 2 + 0.9 ^2} = 1.2083$
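The arithmetic in the checked answer can be verified with a few lines of Python. This is just a sanity check of the formula, with `p` holding the three probabilities from the question.

```python
import math

p = [0.6, 0.3, 0.1]  # food with buddies, study at home, hot date

# Entropy in nats (natural logarithm, i.e. base e).
entropy_nats = -sum(pi * math.log(pi) for pi in p)

print(round(entropy_nats, 4))  # 0.8979
```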


### Question 6

Check all that are true about cross-entropy.

    - [x] Cross entropy measures how much a probability distribution varies from the expected value
    - [ ] Cross entropy is the same as entropy, but moving in the perpendicular direction
    - [ ] Cross entropy is calculated as the mean squared difference between two probability distributions
    - [x] Cross entropy is the negative sum of the expected probabilities multiplied by the log of their respective actual probabilities
    - [ ] The field of machine learning mostly uses log base 2 to calculate cross entropy


### Question 7

Your local weatherman predicts an $80\%$ chance of sun and $20\%$ chance of rain for tomorrow. Based on that report you leave your umbrella at home and get soaked when it rains. What loss function are you going to back-propagate into the neurons of the weatherman’s brain so he’ll be more right next time?

- [ ] $ - (1.0 * log_e(0.8)) = 0.22$
- [ ] $\sqrt{e^{0.8}} = 1.49$
- [x] $- (1.0 * log_e(0.2)) = 1.61$
- [ ] $\sqrt{e^{0.2}} = 1.11$
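
**Explanation**: A sketch of the arithmetic, assuming the outcomes are ordered (sun, rain). Since it rained, the true distribution is $p = (0, 1)$, while the forecast was $q = (0.8, 0.2)$. The cross-entropy loss then reduces to the negative log of the probability the weatherman assigned to rain:

$$ H(p,\ q)\ =\ -\sum_i p_i log_e(q_i)\ =\ -(0 * log_e(0.8)\ +\ 1 * log_e(0.2))\ \approx\ 1.61 $$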


### Question 8

Check all that are true about the softmax function.

    - [ ] Outputs the exact inverse of the hardmin function.
    - [x] Squeezes arbitrary numbers into probabilities between 0 and 1, all adding up to 1
    - [ ] Softens the maximum values in a matrix to reduce outliers
    - [x] Softmax is the exponential of each logit, divided by the sum of the exponentials of all logits
    - [ ] Softmax is the square of each logit, divided by the square of all logits
    - [ ] In psychology, softmax is when you want to do your best but don't want to try too hard


### Question 9

What is the softmax of the array ```[1.0, 2.0, 3.0, 4.0]```?

- [ ] ```[0.002, 0.016, 0.117, 0.865]```
- [ ] ```[0.004, 0.031, 0.234, 1.730]```
- [ ] ```[0.1, 0.2, 0.3, 0.4]```
- [x] ```[0.032, 0.087, 0.237, 0.644]```

**Explanation**: If you are having trouble, look up "softmax function" on Wikipedia and scroll down to the Python code. Paste the formula into a workbook to calculate the softmax of the array.
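For reference, the calculation can also be done directly with the standard library. This is a minimal sketch written for this quiz, not the code from the Wikipedia article.

```python
import math

def softmax(z):
    """Exponential of each logit divided by the sum of all exponentials."""
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

result = [round(y, 3) for y in softmax([1.0, 2.0, 3.0, 4.0])]
print(result)  # [0.032, 0.087, 0.237, 0.644]
```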

# Policy Gradients Methods Tutorial

**Video Description:**

Dive deeper into deep reinforcement learning and learn how to improve upon Q learning with policy gradient methods!

**Notes**


**Learning Resources**

- [Youtube Video](https://www.youtube.com/watch?v=0c3r5EWeBvo)
- [Code Link: CartPole PGN](https://github.com/colinskow/move37/tree/master/pg)
- [Math is Fun: Introduction to Logarithms](https://www.mathsisfun.com/algebra/logarithms.html)

# Policy Gradient Methods (REINFORCE)

# Evolved Policy Gradients

# Policy Gradients Study Guide

[Policy Gradients Study Guide](docs/Policy-Gradients-Study-Guide.pdf)


**Policy Gradient Methods**

- Use gradient ascent to adjust toward a policy with greater reward
- Are model-free
- Are 'on policy'
- Are a form of policy search
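
To make "gradient ascent toward a policy with greater reward" concrete, here is a minimal REINFORCE-style sketch. It rests on several assumptions that are not in the study guide: a three-armed bandit (one-step episodes), a softmax policy over per-arm logits, no baseline, and invented reward means, noise level, learning rate and seed. `train_bandit` is a hypothetical name.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(mean_rewards, steps=5000, lr=0.1, noise=0.1, seed=0):
    """Likelihood-ratio policy gradient on a bandit with a softmax policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)          # policy parameters (logits)
    for _ in range(steps):
        probs = softmax(theta)
        # On-policy: sample the action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(mean_rewards[action], noise)
        # d log pi(action) / d theta_k = 1[k == action] - probs[k]
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi   # gradient ascent step
    return softmax(theta)

final_policy = train_bandit([-1.0, 0.0, 1.0])  # hypothetical reward means
print(final_policy)  # most probability should end up on the last arm
```

The sketch is model-free (it only sees sampled rewards) and on-policy (actions come from the policy being improved), which matches the properties listed above. Full REINFORCE applies the same update to every step of a multi-step episode, weighted by the return.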


**Main types of policy gradient methods**:

- [source](http://www.scholarpedia.org/article/Policy_gradient_methods)
- Finite Difference Methods
- Likelihood ratio methods (REINFORCE)
- Natural policy gradients


**How do Policy Gradients compare to other methods?**

- They are often preferred to DQN, in part because they optimize the policy end-to-end [source](http://karpathy.github.io/2016/05/31/rl/)
- They are commonly used as an actor for actor-critic methods [source](https://www.quora.com/What-is-the-difference-between-policy-gradient-methods-and-actor-critic-methods)
- Policy Gradients are compared to value based methods: [source](https://www.youtube.com/watch?v=KHZVXao4qXs)

_Advantages_

- Better convergence properties (guarantees local convergence)
- Effective in high dimensional/continuous action spaces
- Can learn stochastic policies

_Disadvantages_

- Tend to converge to a local rather than a global optimum
- Evaluating the policy can be very inefficient


**Extra Facts**

- _AlphaGo_ uses policy gradients in combination with Monte Carlo Tree Search [source](http://karpathy.github.io/2016/05/31/rl/)
- REINFORCE was the first policy gradient method, introduced in 1992 [source](https://www.quora.com/What-is-the-difference-between-policy-gradient-methods-and-actor-critic-methods)
- REINFORCE is sometimes called Monte-Carlo Policy Gradient [source](https://www.youtube.com/watch?v=KHZVXao4qXs)
- Policy Gradients can be used as an actor in Actor-Critic [source](https://www.quora.com/What-is-the-difference-between-policy-gradient-methods-and-actor-critic-methods)
- The original DQN authors prefer Policy Gradients [source](http://karpathy.github.io/2016/05/31/rl/)

For an extensive list of methods based on Policy Gradients, see this [blog post](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html).


# Policy Gradients Quiz

# Homework Assignment (Monte Carlo Policy Gradients)

See Thomas Simonini’s example [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb): replicate it but for a different environment — Lunar Lander!
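
One building block any Monte Carlo policy gradient solution is likely to need is the discounted return for each step of an episode. The sketch below is a hypothetical helper and is not claimed to match the linked notebook; the function name, the default `gamma`, and the optional normalization (a common variance-reduction step) are assumptions.

```python
def discounted_returns(rewards, gamma=0.99, normalize=True):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    if normalize and len(returns) > 1:
        mean = sum(returns) / len(returns)
        var = sum((g - mean) ** 2 for g in returns) / len(returns)
        std = var ** 0.5
        returns = [(g - mean) / (std + 1e-8) for g in returns]
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5, normalize=False))
# [1.75, 1.5, 1.0]
```

In a REINFORCE update, each step's log-probability gradient is weighted by the corresponding return, so actions followed by higher returns become more likely.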

# Artificial Curiosity