# Week 08 Notes - Policy Gradient Methods <a class="tocSkip">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Policy-Gradients-Math-Primer" data-toc-modified-id="Policy-Gradients-Math-Primer-1">Policy Gradients Math Primer</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#A-Short-Intro-to-Entropy,-Cross-Entropy-and-KL-Divergence" data-toc-modified-id="A-Short-Intro-to-Entropy,-Cross-Entropy-and-KL-Divergence-1.0.1">A Short Intro to Entropy, Cross-Entropy and KL-Divergence</a></span></li><li><span><a href="#Softmax-Output-Function" data-toc-modified-id="Softmax-Output-Function-1.0.2">Softmax Output Function</a></span></li></ul></li></ul></li><li><span><a href="#Policy-Gradients-Math-Quiz" data-toc-modified-id="Policy-Gradients-Math-Quiz-2">Policy Gradients Math Quiz</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Question-1" data-toc-modified-id="Question-1-2.0.1">Question 1</a></span></li><li><span><a href="#Question-2" data-toc-modified-id="Question-2-2.0.2">Question 2</a></span></li><li><span><a href="#Question-3" data-toc-modified-id="Question-3-2.0.3">Question 3</a></span></li><li><span><a href="#Question-4" data-toc-modified-id="Question-4-2.0.4">Question 4</a></span></li><li><span><a href="#Question-5" data-toc-modified-id="Question-5-2.0.5">Question 5</a></span></li><li><span><a href="#Question-6" data-toc-modified-id="Question-6-2.0.6">Question 6</a></span></li><li><span><a href="#Question-7" data-toc-modified-id="Question-7-2.0.7">Question 7</a></span></li><li><span><a href="#Question-8" data-toc-modified-id="Question-8-2.0.8">Question 8</a></span></li><li><span><a href="#Question-9" data-toc-modified-id="Question-9-2.0.9">Question 9</a></span></li></ul></li></ul></li><li><span><a href="#Policy-Gradients-Methods-Tutorial" data-toc-modified-id="Policy-Gradients-Methods-Tutorial-3">Policy Gradients Methods Tutorial</a></span></li><li><span><a href="#Policy-Gradient-Methods-(REINFORCE)" data-toc-modified-id="Policy-Gradient-Methods-(REINFORCE)-4">Policy 
Gradient Methods (REINFORCE)</a></span></li><li><span><a href="#Reading-Assignment-(Evolved-Policy-Gradients)" data-toc-modified-id="Reading-Assignment-(Evolved-Policy-Gradients)-5">Reading Assignment (Evolved Policy Gradients)</a></span></li><li><span><a href="#Homework-Assignment-(Monte-Carlo-Policy-Gradients)" data-toc-modified-id="Homework-Assignment-(Monte-Carlo-Policy-Gradients)-6">Homework Assignment (Monte Carlo Policy Gradients)</a></span></li><li><span><a href="#Artificial-Curiosity" data-toc-modified-id="Artificial-Curiosity-7">Artificial Curiosity</a></span></li></ul></div>

# Policy Gradients Math Primer

[Logarithms Refresher](https://www.mathsisfun.com/algebra/logarithms.html)

### A Short Intro to Entropy, Cross-Entropy and KL-Divergence

- [YouTube Video](https://www.youtube.com/watch?v=ErfnhcEV1O8)
- [Paper: A Mathematical Theory of Communication](https://pure.mpg.de/rest/items/item_2383164/component/file_2383163/content)

**Video Description**:

Entropy, Cross-Entropy and KL-Divergence are often used in Machine Learning, in particular for training classifiers. In this short video, you will understand where they come from and why we use them in ML.


**Notes**:

- Cross-entropy is commonly used as a cost function when training classifiers
- Entropy measures the average amount of information that you get from one sample drawn from a given probability distribution $p$. It tells you how unpredictable that probability distribution is.

$$ \large H(p)\ =\ -\sum_i p_i \log_2(p_i) $$

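- Cross-entropy is the average message length (in bits) when events drawn from the true distribution $p$ are encoded with a code optimized for the predicted distribution $q$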

$$ \large H(p, q)\ =\ -\sum_i p_i \log_2(q_i) $$

where:
- $p\ =\ $ true distribution
- $q\ =\ $ predicted distribution

- If our predictions are perfect, meaning the predicted distribution is equal to the true distribution, then cross-entropy is equal to entropy
- If the distributions differ, then the cross-entropy will be greater than the entropy by some number of bits. This amount is called the relative entropy or more commonly, the Kullback-Leibler Divergence (KL Divergence)

$$ \text{Cross-Entropy}\ =\ \text{Entropy}\ +\ \text{KL Divergence} $$

- KL Divergence is equal to the cross entropy minus the entropy

$$ D_{KL}(p\ ||\ q) = H(p,\ q)\ - H(p) $$
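
The three formulas above can be checked numerically. The following is a minimal sketch using only the standard library; the two distributions `p` and `q` are made-up values chosen so the arithmetic is easy to follow, and the function names are illustrative rather than taken from the video.

```python
import math

def entropy(p):
    """Entropy H(p) in bits; terms with p_i == 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence computed as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]  # hypothetical true distribution
q = [0.25, 0.25, 0.5]  # hypothetical predicted distribution

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # 1.75
print(kl_divergence(p, q))  # 0.25
print(kl_divergence(p, p))  # 0.0 (perfect predictions: no extra bits)
```

With these values, the mismatched prediction costs 0.25 extra bits per sample on average, which is exactly the KL divergence.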

- Train an image classifier to detect some animals: Cat, Dog, Fox, Cow, Red Panda, Bear or Dolphin (7 possibilities)
- Classifier outputs an estimated probability for each class, the predicted probability distribution
- Since this is a Supervised Learning problem, we know the true distribution
- Use the cross-entropy between these two distributions as a cost function, called a cross-entropy loss or log loss

- Cross Entropy Loss: uses the natural logarithm rather than the binary logarithm

$$ H(p,\ q)\ =\ -\sum_i p_i \log(q_i) $$
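
As a rough illustration of the cross-entropy loss for the seven-animal classifier, the sketch below assumes a one-hot true distribution (the image is a Fox) and two invented predicted distributions. The probabilities and the helper name `cross_entropy_loss` are assumptions made for the example.

```python
import math

CLASSES = ["Cat", "Dog", "Fox", "Cow", "Red Panda", "Bear", "Dolphin"]

def cross_entropy_loss(p_true, q_pred):
    """Cross-entropy with the natural logarithm (units: nats)."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

# One-hot true distribution: the image shows a Fox.
p_true = [0, 0, 1, 0, 0, 0, 0]

# Hypothetical classifier outputs; each list sums to 1.
confident = [0.05, 0.10, 0.60, 0.05, 0.10, 0.05, 0.05]
unsure = [0.15, 0.15, 0.10, 0.15, 0.15, 0.15, 0.15]

loss_confident = cross_entropy_loss(p_true, confident)
loss_unsure = cross_entropy_loss(p_true, unsure)

print(f"confident: {loss_confident:.4f}")
print(f"unsure:    {loss_unsure:.4f}")
```

Because the true distribution is one-hot, the sum collapses to the negative log of the probability assigned to the correct class, so the loss grows as that probability shrinks.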


### Softmax Output Function

- [YouTube Video: Neural Networks for Machine Learning](https://www.youtube.com/watch?v=mlaLLQofmR8)

**Video Description**:

Lecture from the course Neural Networks for Machine Learning, as taught by Geoffrey Hinton (University of Toronto) on Coursera in 2012. 


**Notes**:

- Softmax output function forces the outputs of a neural network to sum to one so that they can represent a probability distribution across discrete mutually exclusive alternatives

_Problems with squared error_

- The squared error measure has some drawbacks:
    - If the desired output is 1 and the actual output is 0.000000001 there is almost no gradient for a logistic unit to fix up the error
    - If we try to assign probabilities to mutually exclusive class labels, we know that the outputs should sum to 1, but we are depriving the network of this knowledge
- Is there a different cost function that works better?
    - Yes: Force the outputs to represent a probability distribution across discrete alternatives via use of Softmax Function
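
The vanishing gradient described above can be seen in a small numeric sketch. It assumes a single logistic unit with squared error $E = \frac{1}{2}(y - t)^2$, and for comparison the cross-entropy cost for the same unit, whose gradient with respect to the logit is the standard result $y - t$. Function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_grad(z, t):
    """dE/dz for E = 0.5 * (y - t)**2 with y = sigmoid(z)."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

def cross_entropy_grad(z, t):
    """dC/dz for C = -(t*log(y) + (1 - t)*log(1 - y)) with y = sigmoid(z)."""
    return sigmoid(z) - t

t = 1.0                            # desired output
z = math.log(1e-9 / (1.0 - 1e-9))  # logit chosen so that y is about 1e-9

se_grad = squared_error_grad(z, t)
ce_grad = cross_entropy_grad(z, t)

print(f"output y            = {sigmoid(z):.3e}")
print(f"squared-error grad  = {se_grad:.3e}")
print(f"cross-entropy grad  = {ce_grad:.3e}")
```

The squared-error gradient is scaled by $y(1 - y)$, which is tiny when the unit is saturated, while the cross-entropy gradient stays close to $-1$ for the same badly wrong output.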

_Softmax_

- The output units in a softmax group use a non-local non-linearity
- They each receive some total input they've accumulated from the layer below (the logit, $z_i$) and provide an output $y_i$ that depends on the $z$'s accumulated by their rivals as well as their own.
- Output of the ith neuron:

$$ \large y_i\ =\ \frac{e^{z_i}}{\sum_{j\in group} e^{z_j}} $$

- The denominator is the sum of the numerator over all units in the group, so the $y_i$'s must sum to 1
- Each $y_i$ must lie between 0 and 1
- So the softmax equation alone forces the $y_i$'s to represent a probability distribution over mutually exclusive alternatives

- Softmax Derivative:

$$ \large \frac{\partial y_i}{\partial z_i}\ =\ y_i(1-y_i) $$
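
A minimal sketch of the softmax function and a finite-difference check of the derivative above. The logits are arbitrary example values, and subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of the lecture notes; it leaves the result unchanged.

```python
import math

def softmax(z):
    """Softmax over a list of logits."""
    m = max(z)  # shift by the max logit for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]  # hypothetical logits
y = softmax(z)
print(y, sum(y))

# Check dy_i/dz_i = y_i * (1 - y_i) with a central finite difference.
i, eps = 0, 1e-6
z_plus = list(z)
z_plus[i] += eps
z_minus = list(z)
z_minus[i] -= eps

numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)
analytic = y[i] * (1 - y[i])
print(numeric, analytic)
```

The outputs sum to 1 and each lies strictly between 0 and 1, and the numeric and analytic derivatives should agree to several decimal places.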


_Cross Entropy: the right cost function to use with softmax_

- The right cost function is the negative log probability of the right answer.

$$ \large C\ =\ -\sum_{j} t_j \log(y_j) $$

where $t_j$ is the target value

- C has a very big gradient when the target value is 1 and the output is almost 0
    - For a target of 1, an output of 0.000001 is much better than 0.000000001, and $C$ reflects that difference
    - The steepness of dC/dy exactly balances the flatness of dy/dz
    
$$ \large \frac{\partial C}{\partial z_i}\ =\ \sum_{j} \frac{\partial C}{\partial y_j}\ \frac{\partial y_j}{\partial z_i}\ =\ y_i\ -\ t_i $$
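
The result $\partial C / \partial z_i = y_i - t_i$ can be verified numerically. The sketch below assumes a one-hot target and arbitrary example logits, and compares the analytic gradient with a central finite difference of the cost; all names are illustrative.

```python
import math

def softmax(z):
    m = max(z)  # shift by the max logit for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cost(z, t):
    """Cross-entropy C = -sum_j t_j * log(y_j) with y = softmax(z)."""
    y = softmax(z)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y) if tj > 0)

z = [2.0, 1.0, 0.1]  # hypothetical logits
t = [0.0, 1.0, 0.0]  # one-hot target: class 1 is the right answer

y = softmax(z)
analytic = [yi - ti for yi, ti in zip(y, t)]

# Central finite difference of C with respect to each logit.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_plus[i] += eps
    z_minus = list(z)
    z_minus[i] -= eps
    numeric.append((cost(z_plus, t) - cost(z_minus, t)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```

The gradient components sum to zero because both $y$ and $t$ sum to 1, and the component for the correct class is negative, so gradient descent raises that logit.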

# Policy Gradients Math Quiz

### Question 1

Check all of the following which are true about logarithms.

- [] 


**Explanation**:


### Question 2

- [] 

**Explanation**:


### Question 3

- [] 

**Explanation**:


### Question 4

- [] 

**Explanation**:


### Question 5

- [] 

**Explanation**:


### Question 6

- [] 

**Explanation**:


### Question 7

- [] 

**Explanation**:


### Question 8

- [] 

**Explanation**:


### Question 9

- [] 

**Explanation**:




# Policy Gradients Methods Tutorial

# Policy Gradient Methods (REINFORCE)

# Reading Assignment (Evolved Policy Gradients)

# Homework Assignment (Monte Carlo Policy Gradients)

# Artificial Curiosity