# Table of contents
1. [What is Machine Learning](#Machine Learning)
2. [Difference between Classification and Regression](#Classification and Regression)
3. [Neural Network](#Neural Network)
4. [The simplest neural network](#The simplest neural network)
5. [Gradient Descent](#Gradient Descent)
    1. [Gradient Descent: Math](#Gradient Descent: Math)
    2. [Gradient Descent: The Code](#Gradient Descent: The Code)
    3. [Gradient Descent: Implementing gradient descent](#Gradient Descent: Implementing gradient descent)
6. [Multilayer Perceptrons](#Multilayer Perceptrons)
7. [Backpropagation](#Backpropagation)

---
# 1. What is Machine Learning <a name="Machine Learning"></a>

___Machine learning___ is a broaden notion of building ___computational artifacts___ that learn over time based on experience. 

It is also called such a ___computational statistics___.

___Supervised learning___ is that it takes examples of inputs and outputs. Now, given a new input, predict its output.

___
# 2. Difference between Classification and Regression <a name="Classification and Regression"></a>

Two typical Supervised Learning is ___Classification___ and ___Regression___. Only concern about the difference between them is output, rather than input.

$$ \text{Classification} \rightarrow \text{Discrete output} $$
$$ \text{Regression} \rightarrow \text{Continuous output} $$

## Classification

### Definition: 

Classification takes some kind of inputs and mapping it to some ___discrete label___.
		
$$ X → \text{True or Flase} $$
e.g.)
* Binary test : True or False (take a picture of people, and find male or female)
* Trinary test : Three options
Whatever input would be, if output is discrete, it will be classification.

### Terms of Classification Learning:

* ___Instances___ (input) : those are vectors of values or of attributes that define whatever our input space is. e.g.) pictures and all the pixels that make up pictures, credit score for examples how much money I make,
* ___Concept___ (function) : this function maps instances to some kind of outputs. e.g.) binary classification (true or false)
* ___Target Concept___ (answer) : the thing we try to find.
* Hypothesis (hypothesis class or all possible functions) : all the functions we're willing to think about.
* ___Sample___ (training set) : this is a set of all of our inputs paired with a label which is the correct output.
* ___Candidate___ : the concept that we think might be the target concept.
* ___Testing Set___ : this correct the function found on the training set.

## Regression

### Definition:

This is more about continuous valued function which mapping continuous inputs to outputs.

___
# 3. Neural Network <a name="Neural Network"></a>

<img src="Figures/hq-perceptron.png" width=800>

### Perceptron

Data are fed into a network of interconnected nodes. These individual nodes are called [perceptrons](https://en.wikipedia.org/wiki/Perceptron), or artificial neurons, and they are the basic unit of a neural network.

<img src="Figures/hq-new-plot-perceptron-combine-v2.png" width=800>

When we initialize a neural network, we don't know what information will be most important in making a decision. It's up to the neural network to _learn for itself_ which data is most important and adjust how it considers that data.

It does this with something called __weights__.

### Weights

When input comes into a perceptron, it gets multiplied by a weight value that is assigned to this particular input. 

These weights start out as random values, and as the neural network network learns more about what kind of input data leads to the input data, the network adjusts the weights based on any errors in categorization that results from the previous weights. This is called __training__ the neural network.

A higher weight means the neural network considers that input more important than other inputs, and lower weight means that the data is considered less important.

### Summing the Input Data

Each input to a perceptron has an associated weight that represents its importance. These weights are determined during the learning process of a neural network, called _training_. In the next step, the weighted input data are summed to produce a single value, that will help determine the final output.

<img src="Figures/perceptron-graphics.001.jpg" width=800>

let's say that the weights are: $w_{\text{grades}}=-1, w_{\text{test}}=−0.2$. You don't have to be concerned with the actual values, but their relative values are important. $w_{\text{grades}}$ is 5 times larger than $w_{\text{test}}$, which means the neural network considers grades input 5 times more important than test in determining whether a student will be accepted into a university.
 
The perceptron applies these weights to the inputs and sums them in a process known as __linear combination__. 
$$
w_{\text{grades}} \cdot x_{\text{grade}} + w_{\text{text}} \cdot x_{\text{test}} = -1 \cdot x_{\text{grades}} - 0.2 \cdot x_{\text{test}}
$$

Simply, this case would be expressed as the linear combination succinctly as: 
$$
\sum_{i=1}^{m}w_i \cdot x_i
$$

### Calculating the Output with an Activation Function

Finally, the result of the perceptron's summation is turned into an output signal! This is done by feeding the linear combination into an __activation function__.

Activation functions are functions that decide, given the inputs into the node, what should be the node's output? Because it's the activation function that decides the actual output, we often refer to the outputs of a layer as its "activations".

One of the simplest activation functions is the __Heaviside step function__. This function returns a __0__ if the linear combination is less than 0. It returns a 1 if the linear combination is positive or equal to zero. The [Heaviside step](https://en.wikipedia.org/wiki/Heaviside_step_function) function is shown below, where $h$ is the calculated linear combination:

<img src="Figures/heaviside-step-graph-2.png" width=500>
<img src="Figures/heaviside-step-function-2.gif" width=150>

It's easiest to see this with an example in two dimensions. In the following graph, imagine any points along the line or in the shaded area represent all the possible inputs to our node. Also imagine that the value along the y-axis is the result of performing the linear combination on these inputs and the appropriate weights. It's this result that gets passed to the activation function.

<img src="Figures/example-before-bias.png" width=500>

Now, we certainly want more than one possible grade/test combination to result in acceptance, so we need to adjust the results passed to our activation function so it activates – that is, returns 1 – for more inputs. Specifically, we need to find a way so all the scores we’d like to consider acceptable for admissions produce values greater than or equal to zero when linearly combined with the weights into our node.

One way to get our function to return 1 for more inputs is to add a value to the results of our linear combination, called a __bias__.

For example, the following diagram shows the previous hypothetical function with an added bias of +3. The blue shaded area shows all the values that now activate the function. But notice that these are produced with the same inputs as the values shown shaded in grey – just adjusted higher by adding the bias term:

<img src="Figures/example-after-bias.png" width=500>

Of course, with neural networks we won't know in advance what values to pick for biases. That’s ok, because just like the weights, the bias can also be updated and changed by the neural network during training. So after adding a bias, we now have a complete perceptron formula:

<img src="Figures/perceptron-equation-2.gif" width=400>

This formula returns 1 if the input $(x_1, x_2, \cdots, x_m)$ belongs to the accepted-to-university category or returns 0 if it doesn't. The input is made up of one or more real numbers, each one represented by $x_i$, where $m$ is the number of inputs.

Then the neural network starts to learn! Initially, the weights $(w_i)$ and bias $(b)$ are assigned a random value, and then they are updated using a learning algorithm like gradient descent. The weights and biases change so that the next training example is more accurately categorized, and patterns in data are "learned" by the neural network.

---
### Quiz: What are the weights and bias for the AND perceptron?

Set the weights (_weight1_, _weight2_) and bias _bias_ to the correct values that calculate AND operation as shown above.
In this case, there are two inputs as seen in the table above (let's call the first column _input1_ and the second column _input2_), and based on the perceptron formula, we can calculate the output.

First, the linear combination will be the sum of the weighted inputs: _linear_combination = weight1*input1 + weight2*input2_ then we can put this value into the biased Heaviside step function, which will give us our output (0 or 1):

<img src="Figures/perceptron-formula.gif" width=400>

In [1]:
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 1.0
weight2 = 1.0
bias = -2


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                  -2.0                    0          Yes
      0          1                  -1.0                    0          Yes
      1          0                  -1.0                    0          Yes
      1          1                   0.0                    1          Yes


---
### Quiz: What are two ways to go from an AND perceptron to an OR perceptron?

The OR perceptron is very similar to an AND perceptron. In the image below, the OR perceptron has the same line as the AND perceptron, except the line is shifted down. What can you do to the weights and/or bias to achieve this? Use the following AND perceptron to create an OR Perceptron.

<img src="Figures/hq-new-and-or-percep.png" width=700>

__Answer:__
1. Increased the weights
2. Decrease the magnitude of the bias

---
### Quiz: NOT Perceptron

Unlike the other perceptrons we looked at, the NOT operations only cares about one input. The operation returns a __0__ if the input is __1__ and a __1__ if it's a __0__. The other inputs to the perceptron are ignored.

In this quiz, you'll set the weights (_weight1_, _weight2_) and bias bias to the values that calculate the NOT operation on the second input and ignores the first input.

In [2]:
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = -1.0
weight2 = -2.0
bias = 1


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                   1.0                    1          Yes
      0          1                  -1.0                    0          Yes
      1          0                   0.0                    1          Yes
      1          1                  -2.0                    0          Yes


---
### Quiz: XOR Perceptron

<img src="Figures/hq-new-xor-table.png" width=500>

An XOR perceptron is a logic gate that outputs __0__ if the inputs are the same and __1__ if the inputs are different. Unlike previous perceptrons, this graph isn't linearly separable. To handle more complex problems like this, we can chain perceptrons together.

Let's build a neural network from the AND, NOT, and OR perceptrons to create XOR logic. Let's first go over what a neural network looks like.

<img src="Figures/legend.png" width=500>

The above neural network contains 4 perceptrons, A, B, C, and D. The input to the neural network is from the first node. The output comes out of the last node. The weights are based on the line thickness between the perceptrons. Any link between perceptrons with a low weight, like A to C, you can ignore. For perceptron C, you can ignore all input to and from it. For simplicity we wont be showing bias, but it's still in the neural network.

<img src="Figures/a-b-c-fill-nn.png" width=500>

The neural network above calculates XOR. Each perceptron is a logic operation of OR, AND, [Passthrough](https://en.wikipedia.org/wiki/Passthrough), or NOT. The Passthrough operation just passes it's input to the output. However, the perceptrons A , B, and C don't indicate their operation. In the following quiz, set the correct operations for the three perceptrons to calculate XOR.

_Note: Any line with a low weight can be ignored._

__Answer:__
A $=$ __NOT__, B $=$ __AND__, C $=$ __OR__

___
You've seen that a perceptron can solve linearly separable problems. Solving more complex problems, you use more perceptrons. You saw this by calculating AND, OR, NOT, and XOR operations using perceptrons. These operations can be used to create any computer program. With enough data and time, a neural network can solve any problem that a computer can calculate. However, __the power of a neural network isn't building it by hand, like we were doing.__

___

# 4. The simplest neural network <a name="The simplest neural network"></a>

So far you've been working with perceptrons where the output is always one or zero. The input to the output unit is passed through an activation function, $f(h)$, in this case, the step function.

<img src="Figures/heaviside-step-graph-2.png" width=500>
<img src="Figures/heaviside-step-function-2.gif" width=200>

The output unit returns the result of $f(h)$, where $h$ is the input to the output unit:

$$
h = \sum_i{w_ix_i + b}
$$

The diagram below shows a simple network. The linear combination of the weights, inputs, and bias form the input $h$, which passes through the activation function $f(h)$, giving the final output of the perceptron, labeled $y$.

<img src="Figures/simple-neuron.png" width=500>
$$
\text{Diagram of a simple neural network. Circles are units, boxes are operations.}
$$

The cool part about this architecture, and what makes neural networks possible, is that the activation function, $f(h)$ can be any function, not just the step function shown earlier.

Other activation functions you'll see are the _logistic_ (often called the _sigmoid_), $\tanh$, and softmax functions. We'll mostly be using the sigmoid function for the rest of this lesson:

$$
\text{sigmoid}(x)=1/(1+e^{−x})
$$
<img src="Figures/sigmoid.png" width=500>

The sigmoid function is bounded between 0 and 1, and as an output can be interpreted as a probability for success. It turns out, again, using a sigmoid as the activation function results in the same formulation as logistic regression.

This is where it stops being a perceptron and begins being called a neural network. In the case of simple networks like this, neural networks don't offer any advantage over general linear models such as logistic regression.

But, as you saw with the XOR perceptron, stacking units will let you model linearly inseparable data, impossible to do with regression models.

Once you start using activation functions that are continuous and differentiable, it's possible to train the network using gradient descent.

### Exercise

Below you'll use NumPy to calculate the output of a simple network with two input nodes and one output node with a sigmoid activation function. Things you'll need to do:
- Implement the sigmoid function.
- Calculate the output of the network.
The sigmoid function is
$$
\text{sigmoid}(x)=1/(1+e^{−x})
$$

For the exponential, you can use Numpy's exponential function, __np.exp__.

And the output of the network is
$$
y=f(h)=\text{sigmoid}(\sum_i{w_i x_i + b})
$$

For the weights sum, you can do a simple element-wise multiplication and sum, or use NumPy's [dot product function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html).

In [3]:
import numpy as np

def sigmoid(x):
    # TODO: Implement sigmoid function
    return 1/(1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# TODO: Calculate the output
output = sigmoid(np.dot(weights, inputs) + bias)

print('Output:')
print(output)

Output:
0.432907095035


___
# 5. Gradient Descent <a name="Gradient Descent"></a>
<!--
Learning Weights <a name="Learning Weights"></a>
-->

You've seen how you can use perceptrons for AND and XOR operations, but there we set the weights by hand. What if you want to perform an operation, such as predicting college admission, but don't know the correct weights? You'll need to learn the weights from example data, then use those weights to make the predictions.

To figure out how we're going to find these weights, start by thinking about the goal. We want the network to make predictions as close as possible to the real values. To measure this, we need a metric of how wrong the predictions are, the __error__. A common metric is the sum of the squared errors (SSE):
$$
E = \frac{1}{2} \sum_\mu \sum_j \big[y_j^\mu - \hat{y}_j^\mu \big]^2
$$

where $\hat{y}$ is the prediction and $y$ is the true value, and you take the sum over all output units $j$ and another sum over all data points $μ$. This might seem like a really complicated equation at first, but it's fairly simple once you understand the symbols and can say what's going on in words.

First, the inside sum over $j$. This variable $j$ represents the output units of the network. So this inside sum is saying for each output unit, find the difference between the true value $y$ and the predicted value from the network $\hat{y}$, then square the difference, then sum up all those squares.

Then the other sum over $\mu$ is a sum over all the data points. So, for each data point you calculate the inner sum of the squared differences for each output unit. Then you sum up those squared differences for each data point. That gives you the overall error for all the output predictions for all the data points.

The SSE is a good choice for a few reasons. The square ensures the error is always positive and larger errors are penalized more than smaller errors. Also, it makes the math nice, always a plus.

Remember that the output of a neural network, the prediction, depends on the weights

$$
\hat{y}_j^\mu = f\big(\sum_i w_{ij} x_i^\mu \big)
$$

and accordingly the error depends on the weights

$$
E = \frac{1}{2} \sum_\mu \sum_j \big[y_j^\mu - f\big(\sum_i w_{ij} x_i^\mu \big) \big]^2
$$

We want the network's prediction error to be as small as possible and the weights are the knobs we can use to make that happen. Our goal is to find weights $w_{ij}$ that minimize the squared error $E$. To do this with a neural network, typically you'd use gradient descent.

__Gradient__ is another term for rate of change or slope. With gradient descent, we take multiple small steps towards our goal. In this case, __we want to change the weights in steps that reduce the error__.

The steps taken should be in the direction that minimizes the error the most. __We can find this direction by calculating the gradient of the squared error__.

To calculate a rate of change, we turn to calculus, specifically derivatives. A derivative of a function $f(x)$ gives you another function $f′(x)$ that returns the slope of $f(x)$ at point $x$.

<img src="Figures/derivative-example.png" width=500>

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights.

Below is an example of the error of a neural network with two inputs, and accordingly, two weights. You can read this like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

<img src="Figures/gradient-descent.png" width=500>

At each step, you calculate the error and the gradient, then use those to determine how much to change each weight. Repeating this process will eventually find weights that are close to the minimum of the error function, the block dot in the middle.

Since the weights will just go where ever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. 

<img src="Figures/local-minima.png" width=500>

If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum. There are methods to avoid this, such as using [momentum](http://sebastianruder.com/optimizing-gradient-descent/index.html#momentum).

### 5.1. Gradient Descent: Math <a name="Gradient Descent: Math"></a>

We'd like to use the output to make predictions, but how do we build this network to make predctions without knowing the correct weights before hand? 

<img src="Figures/Gradient-Descent-Math1.png" width=500>

What we can do is present it with data that we know to be true, then set the model parameters, the weights to match that data.

First, we need some measure of how bad our predctions are. The bovious choice is to use the difference between the true target value, $y$, and the network output, $\hat{y}$. 

$$
E = y - \hat{y}
$$

However, if the predction is too high, this error will be negative and if their prection is too low, by the same abount the error will be positive. We'd rather treat these errors the same. 

To make both cases positive, we'll just square the error. 

$$
E = (y - \hat{y})^2
$$

One benefit of using the square rather than using absolute value is that it penalizes outliers more than small errors. Also, squaring the error makes the math nice later. 

We'd rather like to know the error for the entire dataset. So, we'll just sum up the errors for each data record denoted by the sum over $\mu$.

$$
E = \sum_\mu (y - \hat{y})^2
$$

Now we have the total error for the network over the entire dataset. 

Finally, we'll add a one half in front because it cleans up the math later. 

$$
E = \frac{1}{2} \sum_\mu (y - \hat{y})^2
$$

This formulation is typically called the sum of the squared error (SSE).

Remember that $\hat{y}$ is the linear combination of the weights and inputs passed through that activation function. We can substitute it in here, then we see that the error depends on the wiehgts, $w_i$, and the input values, $x_i$. 

$$
\begin{align}
E &= \frac{1}{2} \sum_\mu (y - \hat{y})^2 \\
  &= \frac{1}{2} \sum_\mu (y - f(\sum_i w_i x_i^\mu) )^2
\end{align}
$$

The data records are represented by the Greek letter $\mu$. We can think of the data as two tables of an array, or a matrices, whatever works for you. One contains the input data, $x$, and the other contains the targets, $y$. 

Each record is one row here, so $\mu$ equals $1$ is the first row.

<img src="Figures/Gradient-Descent-Math2.png" width=500>

Then to calcualte the total error, you're just scanning through the rows of these arrays and caculating the SSE. Then summing up all of these results.

The SSE is a measure of our network's performance. If it's high, the network is making bad predctions. If it's low, the network is making good predctions. So, we want tp make it as small as possible. 

Going forward, let's consider a simple example with only one data record to make it easier to understand how we'll minimize the error. For the simple network, the SSE is the true target, $y$, minus the predction, $\hat{y}$, and squared and divided by 2. 

$$
E = \frac{1}{2} \sum_\mu (y - \hat{y})^2
$$

Substituting for the prediction, you see the error is a function of the wiehgts. 

$$
E = \frac{1}{2} \sum_\mu (y - f(\sum_i w_i x_i^\mu) )^2
$$

The weights are the knobs we can use to alter the network's predictions, which in turn affects the overall error. Then our goal is to find weights that minimize the error. 

Here is a simple depiction of the error with one wieght. Our goals is to find the weight at the bottom of this bowl. __Starting at some random weight, we want to make a step in the direction towards the minimum. This direction is the opposite to the gradient, the slope.__ If we take many steps, always descending down a gradient. Eventually the weight will find the minimum of the error function, and this is gradient descent. 

<img src="Figures/Gradient-Descent-Math3.png" width=400>

We want to update the weight, so a new weight, $w_i$, is the old weight, $w_i$ plus the wieght step, $\Delta w_i$.

$$
w_i = w_i + \Delta w_i
$$

The weight step is proprortional to the gradient, the partial derivative of the error with respect to each weight, $w_i$. 

$$
\Delta w_i \propto -\frac{\partial E}{\partial w_i} \rightarrow \text{THE GRADIENT}
$$

We can add in an arbitrary scaling parameter that allows us to set the size of the gradient descent steps. This is called the learning rate, $\eta$.

$$
\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
$$

Calculating the gradient here requires multivariable calculus, as it is denoted by partial derivative. 

Writing out the gradient, we get the partial derivative with respect to the weights of the squared error. 

$$
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y})^2
$$

The netwrok output, $\hat{y}$, is a function of the weights. 

$$
= \frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y}(w_i))^2
$$

So, what we have here is a function of another function that depends on the weights. This requires using the chain rule to calculate the derivative. 

$$
\frac{\partial}{\partial z} p(q(z)) = \frac{\partial p}{\partial q} \frac{\partial q}{\partial z}
$$

This leads to a problem because we can set $q$ to the error, $y - \hat{y}$, and set $p$ the squared error. 

$$
q = (y - \hat{y}(w_i)) \qquad \qquad p = \frac{1}{2} q(w_i)^2
$$

And then we're taking the derivative with respect to $w_i$.

The derivative of $p$ with respect to $q$ returns the error itself, $y - \hat{y}$; the 2 in the exponential drops down and cancels out the $\frac{1}{2}$. Then we're left with the derivative of the error with respect to $w_i$. 

$$
\begin{align}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} (y - \hat{y})^2 \\
& = (y - \hat{y}) \frac{\partial}{\partial w_i} (y - \hat{y})
\end{align}
$$

The target value, $y$, doesn't depend on the weights, but $\hat{y}$ does. Using the Chain Rule again, the minus sign comes out in front and we're left with the partial derivative of $\hat{y}$.

$$
\begin{align}
= - (y & - \hat{y}) \frac{\partial \hat{y}} {\partial w_i} \\
& , \text{where} \quad \hat{y}=f(h) \quad \text{and} \quad h=\sum_i w_i x_i
\end{align}
$$

Taking the derivative of $\hat{y}$, and again using the cahin rule, we get the derivative of the activation function at $h$ times the partial derivative of the linear combination. 

$$
\begin{align}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \\
& = -(y - \hat{y})f'(h) \frac{\partial}{\partial w_i} \sum w_i x_i
\end{align}
$$

In the sum, there is only one term that depends on each weight. 

$$
\frac{\partial}{\partial w_i} \sum w_i x_i
$$

Writing this out for weight one, you see that only the first term with $x_1$ depends on weight one. 

$$
\frac{\partial}{\partial w_1}[w_1 x_1 + w_2 x_2 + \cdots + w_n x_n]
$$

Then the partial derivative of the sum with respect to weight one is just $x_1$.

$$
= x_1 + 0 + 0 + 0 + \cdots
$$

Then the partial derivative of this sum with respect to $w_i$ is just $x_i$. 

$$
\frac{\partial}{\partial w_i} \sum w_i x_i = x_i
$$

Finally, putting all this together, the gradient of the squared error with repect to $w_i$ is the negative of the error times the derivative of the activation function at $h$ times the input value $x_i$. Then the weight step is a learning rate eta times those. 

$$
\begin{align}
\frac{\partial E}{\partial w_i} & = -(y - \hat{y})f'(h) x_i \\
\Delta w_i &= \eta (y - \hat{y})f'(h) x_i
\end{align}
$$

For convenience, we can define an __error term__, $\delta$, as the error times the activation function derivative at h.

$$
\delta = (y - \hat{y})f'(h)
$$

Then we can write our weight update,

$$
w_i = w_i + \eta \delta x_i.
$$

You might be working with multiple output units. You can think of this as just stocking the architecture from the single output netwrok, but connecting the input units to the new output units. Now, the total error would include the error of each outputs sum together.

<img src="Figures/Gradient-Descent-Math4.png" width=500>

The gradient descendant can be extended to a network with multiple outputs by calculating an error term for each output unit to know with the subscript $j$. 

### Gradient Descent: The Code <a name="Gradient Descent: The Code"></a>

From before we saw that one weight update can be calculated as:

$$
\Delta w_i = \eta \delta x_i 
$$
with the error term $\delta$ as

$$
\delta = (y - \hat{y})f'(h) = (y - \hat{y})f'(\Sigma w_i x_i)
$$

Now I'll write this out in code for the case of only one output unit. We'll also be using the sigmoid as the activation function $f(h)$.

```python
# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The neural network output (y-hat)
nn_output = sigmoid(x[0]*weights[0] + x[1]*weights[1])
# or nn_output = sigmoid(np.dot(x, weights))

# output error (y - y-hat)
error = y - nn_output

# error term (lowercase delta)
error_term = error * sigmoid_prime(np.dot(x,weights))

# Gradient descent step 
del_w = [ learnrate * error_term * x[0],
                 learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
```

In [4]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

learnrate = 0.5
x = np.array([1, 2])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5])

# Calculate one gradient descent step for each weight
# TODO: Calculate output of neural network
nn_output = sigmoid(np.dot(w, x))

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate change in weights
del_w = learnrate * error * nn_output * (1 - nn_output) * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.377540668798
Amount of Error:
0.122459331202
Change in Weights:
[ 0.0143892  0.0287784]


### Gradient Descent: Implementing gradient descent <a name="Gradient Descent: Implementing gradient descent"></a>

Now we know how to update our weights,

$$
\Delta w_{ij} = \eta \delta_j x_i.
$$

As an example, Use gradient descent to train a network on [graduate school admissions data](http://www.ats.ucla.edu/stat/data/binary.csv). This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige, those with rank 4 have the lowest.

<img src="Figures/admissions-data.png" width=500>

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.d

#### recipe 1) Data cleanup

You might think there will be three input units, but we actually need to transform the data first. The __rank__ feature is categorical, the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, rank 3 is not 1.5 more than rank 2. Instead, we need to use [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics) to encode __rank__, splitting the data into four new columns encoded with ones or zeros. Rows with rank 1 have one in the rank 1 dummy column, and zeros in all other columns. Rows with rank 2 have one in the rank 2 dummy column, and zeros in all other columns. And so on.

We'll also need to standardize the GRE and GPA data, which means to scale the values such they have zero mean and a standard deviation of 1. This is necessary because _the sigmoid function squashes really small and really large inputs_. _The gradient of really small and large inputs is zero, which means that the gradient descent step will go to zero too_. Since the GRE and GPA values are fairly large, we have to be really careful about how we initialize the weights or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.

<img src="Figures/example-data.png" width=400>

Now that the data is ready, we see that there are six input features: __gre__, __gpa__, and the four __rank__ dummy variables.

#### recipe 2) Mean Square Error

Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge. To compensate for this, you'd need to use a quite small learning rate. Instead, we can just divide by the number of records in our data, $m$ to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the __mean of the square errors__ (MSE) to calculate the gradient and the result is the same as before, just averaged instead of summed.

$$
E = \frac{1}{2m}\sum_\mu (y^\mu - \hat{y}^\mu)^2
$$

Here's the general algorithm for updating the weights with gradient descent:

- Set the wieght step to zero: $\Delta w_i = 0$
- For each record in the training data:
    - Make a forward pass through the netwrok, calculating the output $\hat{y} = f(\sum_i w_i x_i)$
    - Caculate the error gradient in the output unit, $\delta = (y-\hat{y})\cdot f'(\sum_i w_i x_i)$
    - Update the weight step $\Delta w_i = \Delta w_i + \delta x_i$
- Update the weights $w_i = w_i + \eta \frac{\Delta w_i}{m}$, where $\eta$ is the learning rate and $m$ is the number of records. Here we're averaging the weight steps to help reduce any large variations in the training data.
- Repeat for $e$ epochs.

You can also update the weights on each record instead of averaging the weight steps after going through all the records.

Remember that we're using the sigmoid for the activation function, $f(h)=\frac{1}{1+e^{-h}}$, and the gradient of the sigmoid is $f'(h)=f(h)(1-f(h))$, where $h$ is the input to the output unit, $h=\sum_i w_i x_i$

#### recipe 3) Initialize wheights

irst, you'll need to initialize the weights. We want these to be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is $\frac{1}{\sqrt{n}}$ where $n$ is the number of input units. This keeps the input to the sigmoid low for increasing numbers of input units.

```Python
weights = np.random.normal(scale=1/n_features**.5, size=n_features)
```

#### recipe 4) NumPy product

NumPy provides a function that calculates the dot product of two arrays, which conveniently calculates h for us. The dot product multiplies two arrays element-wise, the first element in array 1 is multiplied by the first element in array 2, and so on. Then, each product is summed.

```Python
# input to the output layer
output_in = np.dot(weights, inputs)
```

#### recipe 5) Weight update

We can update $\Delta w$ and $w_i$ by incrementing them with __weights += ...__ which is shorthand for __weights = weights + ...__.

#### recipe 6) Efficiency tip!

You can save some calculations since we're using a sigmoid here. For the sigmoid function, $f'(h)=f(h)(1-f(h))$. That means that once you calculate $f(h)$, the activation of the output unit, you can use it to calculate the gradient for the error gradient.

#### Programming exercise

Below, you'll implement gradient descent and train the network on the admissions data. Your goal here is to train the network until you reach a minimum in the mean square error (MSE) on the training set. You need to implement:

- The network output: __output__.
- The error gradient: __error__.
- Update the weight step: __del_w +=__.
- Update the weights: __weights +=__.

<!---
```Python
# data cleanup

import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standarize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std
    
# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.ix[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

print(admissions.head())
print(data.head())
print(features.head())
print(targets.head())
```
-->

In [5]:
# implementing gradient

import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Use to same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
print("n_records: {0}, n_features: {1}".format(n_records, n_features))

last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
print("Shape of weights:", weights.shape)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # TODO: Calculate the output
        output = sigmoid(np.dot(weights, x))

        # TODO: Calculate the error
        error = y - output

        # TODO: Calculate change in weights
        del_w += error * sigmoid(output) * (1 - sigmoid(output)) * x

        # TODO: Update weights
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


n_records: 360, n_features: 6
Shape of weights: (6,)
Train loss:  0.26391603901225774
Train loss:  0.20947790846116127
Train loss:  0.20022906580324684
Train loss:  0.19791133617137951
Train loss:  0.19711212743496892
Train loss:  0.19677547089788686
Train loss:  0.19661603665721109
Train loss:  0.1965347560597135
Train loss:  0.196491103916669
Train loss:  0.19646667878024576
Prediction accuracy: 0.725


___
# 6. Multilayer Perceptrons <a name="Multilayer Perceptrons"></a>

### Derivation

Before, we were dealing with only one output node which made the code straightforward. However now that we have multiple input units and multiple hidden units, the weights between them will require two indices: $w_{ij}$, where $i$ denotes input units and $j$ are the hidden units.

For example, the following image shows our network, with its input units labeled $x_1, x_2$, and $x_3$, and its hidden nodes labeled $h_1$ and $h_2$ :
<img src="Figures/network-with-labeled-nodes.png" width=500>

The lines indicating the weights leading to $h_1$ have been colored differently from those leading to $h_2$ just to make it easier to read.

Now to index the weights, we take the input unit number for the $i$ and the hidden unit number for the $j$. That gives us $w_{11}$ for the wieght leading from $x_1$ to $h_1$, and $w_{12}$ for the weight leading from $x_1$ to $h_2$.

The following image includes all of the weights between the input layer and the hidden layer, labeled with their appropriate $w_{ij}$ indices:
<img src="Figures/network-with-labeled-weights.png" width=500>

Before, we were able to write the weights as an array, indexed as $w_i$.

But now, the weights need to be stored in a __matrix__, indexed as $w_{ij}$. Each __row__ in the matrix will correspond to the weights __leading out of a single input unit__, and each __column__ will correspond to the weights __leading in to a single hidden unit__. For our three input units and two hidden units, the weights matrix looks like this:
<img src="Figures/multilayer-diagram-weights.png" width=500>

Be sure to compare the matrix above with the diagram shown before it so you can see where the different weights in the network end up in the matrix.

To initialize these weights in Numpy, we have to provide the shape of the matrix. If __features__ is a 2D array containing the input data:

```Python
# Number of records and input units
n_records, n_inputs = features.shape
# Number of hidden units
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))
```

This creates a 2D array (i.e. a matrix) named weights_input_to_hidden with dimensions n_inputs by n_hidden. Remember how the input to a hidden unit is the sum of all the inputs multiplied by the hidden unit's weights. So for each hidden layer unit, $h_j$, we need to calculate the following:

$$
h_j = \sum_i w_{ij}x_i
$$

To do that, we now need to use [matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication). In this case, we're multiplying the inputs (a row vector here) by the weights. To do this, you take the dot (inner) product of the inputs with each column in the weights matrix. For example, to calculate the input to the first hidden unit, $j=1$, you'd take the dot product of the inputs with the first column of the weights matrix, like so:
<img src="Figures/input-times-weights.png" width=500>
$$
h_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31}
$$

And for the second hidden layer input, you calculate the dot product of the inputs with the second column. And so on and so forth.

In Numpy, you can do this for all the inputs and all the outputs at once using np.dot

```Python
hidden_inputs = np.dot(inputs, weights_input_to_hidden)
```

You could also define your weights matrix such that it has dimensions __n_hidden__ by __n_inputs__ then multiply like so where the inputs form a column vector:
<img src="Figures/inputs-matrix.png" width=300>

    Note: The weight indices have changed in the above image and no longer match up with the labels used in the earlier diagrams. That's because, in matrix notation, the row index always precedes the column index, so it would be misleading to label them the way we did in the neural net diagram. Just keep in mind that this is the same weight matrix as before, but rotated so the first column is now the first row, and the second column is now the second row. If we were to use the labels from the earlier diagram, the weights would fit into the matrix in the following locations:   
<img src="Figures/weight-label-reference.gif" width=300>
$$\text{Weight matrix shown with labels matching earlier diagrams.}$$

    Remember, the above is not a correct view of the indices, but it uses the labels from the earlier neural net diagrams to show you where each weight ends up in the matrix.
    
The important thing with matrix multiplication is that the dimensions match. For matrix multiplication to work, there has to be the same number of elements in the dot products. In the first example, there are three columns in the input vector, and three rows in the weights matrix. In the second example, there are three columns in the weights matrix and three rows in the input vector. If the dimensions don't match, you'll get this:
```Python
# Same weights and features as above, but swapped the order
hidden_inputs = np.dot(weights_input_to_hidden, features)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-1bfa0f615c45> in <module>()
----> 1 hidden_in = np.dot(weights_input_to_hidden, X)

ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)
```

The dot product can't be computed for a 3x2 matrix and 3-element array. That's because the 2 columns in the matrix don't match the number of elements in the array. Some of the dimensions that could work would be the following:
<img src="Figures/matrix-mult-3.png" width=300>

The rule is that if you're multiplying an array from the left, the array must have the same number of elements as there are rows in the matrix. And if you're multiplying the matrix from the left, the number of columns in the matrix must equal the number of elements in the array on the right.

### Making a column vector

You see above that sometimes you'll want a column vector, even though by default Numpy arrays work like row vectors. It's possible to get the transpose of an array like so __arr.T__, but for a 1D array, the transpose will return a row vector. Instead, use __arr[:,None]__ to create a column vector:

```Python
print(features)
> array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features.T)
> array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features[:, None])
> array([[ 0.49671415],
       [-0.1382643 ],
       [ 0.64768854]])
```

Alternatively, you can create arrays with two dimensions. Then, you can use __arr.T__ to get the column vector.

```Python
np.array(features, ndmin=2)
> array([[ 0.49671415, -0.1382643 ,  0.64768854]])

np.array(features, ndmin=2).T
> array([[ 0.49671415],
       [-0.1382643 ],
       [ 0.64768854]])
```

I personally prefer keeping all vectors as 1D arrays, it just works better in my head.

### Programming quiz

Below, you'll implement a forward pass through a 4x3x2 network, with sigmoid activation functions for both layers.

Things to do:
- Calculate the input to the hidden layer.
- Calculate the hidden layer output.
- Calculate the input to the output layer.
- Calculate the output of the network.

In [6]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))


# TODO: Make a forward pass through the network

hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

Hidden-layer Output:
[ 0.41492192  0.42604313  0.5002434 ]
Output-layer Output:
[ 0.49815196  0.48539772]


---
# 7. Backpropagation <a name="Backpropagation"></a>

Now we've come to the problem of how to make a multilayer neural network _learn_. Before, we saw how to update weights with gradient descent. The backpropagation algorithm is just an extension of that, using the chain rule to find the error with the respect to the weights connecting the input layer to the hidden layer (for a two layer network).

<img src="Figures/backpropagation1.png" width=500>

To update the weights to hidden layers using gradient descent, you need to know how much error each of the hidden units contributed to the final output. Since the output of a layer is determined by the weights between layers, the error resulting from units is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backwards to hidden layers.

For example, in the output layer, you have errors $\delta_k^o$ attributed to each output unit $k$. Then, the error attributed to hidden unit $j$ is the output errors, scaled by the weights between the output and hidden layers (and the gradient):

$$
\delta_j^h = \sum_k w_{jk} \delta_k^o f'(h_j)
$$

Then, the gradient descent step is the same as before, just with the new errors:

$$
\Delta w_{ij} = \eta \delta_j^h x_i
$$

, where $w_{ij}$  are the weights between the inputs and hidden layer and $x_i$ are input unit values. This form holds for however many layers there are. The weight steps are equal to the step size times the output error of the layer times the values of the inputs to that layer, 

$$
w_{pq} = \eta \delta_{output} V_{in}
$$

Here, you get the output error, $\delta_{output}$, by propagating the errors backwards from higher layers. And the input values, $V_{in}$ are the inputs to the layer, the hidden layer activations to the output unit for example.

### Working through an example

Let's walk through the steps of calculating the weight updates for a simple two layer network. Suppose there are two input values, one hidden unit, and one output unit, with sigmoid activations on the hidden and output units. The following image depicts this network. (__Note__: the input values are shown as nodes at the bottom of the image, while the networks output value is shown as $\hat{y}$ at the top. The inputs themselves do not count as a layer, which is why this is considered a two layer network.)

<img src="Figures/backprop-network.png" width=200>

Assume we're trying to fit some binary data and the target is $y=1$. We'll start with the forward pass, first calculating the input to the hidden unit

$$
h = \sum_i w_i x_i = 0.1 \times 0.4 - 0.2 \times 0.3 = =0.02
$$

and the output of the hidden unit

$$
a = f(h) = \text{sigmoid}(-0.02) = 0.495.
$$

Using this as the input to the output unit, the output of the network is

$$
\hat{y} = f(W \cdot a) = \text{sigmoid}(0.1 \times 0.495) = 0.512.
$$

With the network output, we can start the backwards pass to calculate the weight updates for both layers. Using the fact that for the sigmoid function

$$
f'(W \cdot a) = f(W \cdot a)(1 - f(W \cdot a))
$$

, the error for the output unit is

$$
\delta^o = (y -\hat{y})f'(W \cdot a) = (1 - 0.512) \times 0.512 \times (1 - 0.512) = 0.122.
$$

Now we need to calculate the error for the hidden unit with backpropagation. Here we'll scale the error from the output unit by the weight $W$ connecting it to the hidden unit. For the hidden unit error, $\delta_j^h = \sum_k w_{jk} \delta_k^o f'(h_j)$, but since we have one hidden unit and one output unit, this is much simpler.

$$
\delta^h = W \delta^o f'(h) = 0.1 \times 0.122 \times 0.495 \times (1 - 0.495) = 0.003
$$

Now that we have the errors, we can calculate the gradient descent steps. The hidden to output weight step is the learning rate, times the output unit error, times the hidden unit activation value.

$$
\Delta W = \eta \delta^o a = 0.5 \times 0.122 \times 0.495 = 0.0302
$$

Then, for the input to hidden weights $w_i$, it's the learning rate times the hidden unit error, times the input values.

$$
\Delta w_i = \eta \delta^h x_i =(0.5×0.003×0.1,0.5×0.003×0.3)=(0.00015,0.00045)
$$

From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is $0.25$, so the errors in the output layer get reduced by at least $75%$, and errors in the hidden layer are scaled down by at least $93.75%$! You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input. This is known as the __vanishing gradient__ problem. Later in the course you'll learn about other activation functions that perform better in this regard and are more commonly used in modern network architectures.

### Implementing in NumPy

For the most part you have everything you need to implement backpropagation with Numpy.

However, previously we were only dealing with error terms from one unit. Now, in the weight update, we have to consider the error for each unit in the hidden layer, $\delta_j$:

$$
\Delta w_{ij} = \eta \delta_j x_i
$$

Firstly, there will likely be a different number of input and hidden units, so trying to multiply the errors and the inputs as row vectors will throw an error

```Python
hidden_error*inputs
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-3b59121cb809> in <module>()
----> 1 hidden_error*x

ValueError: operands could not be broadcast together with shapes (3,) (6,)
```

Also, $w_{ij}$ is a matrix now, so the right side of the assignment must have the same shape as the left side. Luckily, Numpy takes care of this for us. If you multiply a row vector array with a column vector array, it will multiply the first element in the column by each element in the row vector and set that as the first row in a new 2D array. This continues for each element in the column vector, so you get a 2D array that has shape __(len(column_vector), len(row_vector))__.

```Python
hidden_error*inputs[:,None]
array([[ -8.24195994e-04,  -2.71771975e-04,   1.29713395e-03],
       [ -2.87777394e-04,  -9.48922722e-05,   4.52909055e-04],
       [  6.44605731e-04,   2.12553536e-04,  -1.01449168e-03],
       [  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00]])
```

It turns out this is exactly how we want to calculate the weight update step. As before, if you have your inputs as a 2D array with one row, you can also do __hidden_error*inputs.T__, but that won't work if __inputs__ is a 1D array.

### Backpropagation exercise

Below, you'll implement the code to calculate one backpropagation update step for two sets of weights. I wrote the forward pass, your goal is to code the backward pass.

Things to do
- Calculate the network error.
- Calculate the output layer error gradient.
- Use backpropagation to calculate the hidden layer error.
- Calculate the weight update steps.

In [7]:
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate error
error = target - output

# TODO: Calculate error gradient for output layer
del_err_output = error * output * (1 - output)

# TODO: Calculate error gradient for hidden layer
del_err_hidden = np.dot(del_err_output, weights_hidden_output) * \
                 hidden_layer_output * (1 - hidden_layer_output)

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * del_err_output * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * del_err_hidden * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

Change in weights for hidden layer to output layer:
[ 0.00804047  0.00555918]
Change in weights for input layer to hidden layer:
[[  1.77005547e-04  -5.11178506e-04]
 [  3.54011093e-05  -1.02235701e-04]
 [ -7.08022187e-05   2.04471402e-04]]


### Implementing backpropagation

Now we've seen that the error in the output layer is $\delta_k = (y_k - \hat{y}_k) f'(a_k)$, and the error in the hidden layer is

$$
\delta_j = \sum_k [w_{jk} \delta_k] f'(h_j)
$$

For now we'll only consider a simple network with one hidden layer and one output unit. Here's the general algorithm for updating the weights with backpropagation:

- Set the weight steps for each layer to zero
    - The input to hidden weights $\Delta w_{ij} = 0$
    - The hidden to output weights $\Delta W_j = 0$
    
- For each record in the training data:
    - Make a forward pass through the network, calculating the output $\hat{y}$
    - Calculate the error gradient in the output unit, $\delta^o = (y - \hat{y})f'(z)$ where $z = \sum_j W_j a_j$, the input to the output unit.
    - Propagate the errors to the hidden layer $\delta_j^h = \delta^o W_j f'(h_j)$
    - Update the wieght steps:
        - $\Delta W_j = \Delta W_j + \delta^o a_j$
        - $\Delta w_{ij} = \Delta w_{ij} + \delta_j^h a_i$
    - Update the weights, where $eta$ is the learning rate and $m$ is the number of records:
        - $W_j = W_j + \eta \frac{\Delta W_j}{m}$
        - $w_{ij} = w_{ij} + \eta \frac{\Delta_{ij}}{m}$
    - Repeat for $e$ epochs.
    
Now you're going to implement the backprop algorithm for a network trained on the graduate school admission data. You should have everything you need from the previous exercises to complete this one.

Your goals here:
- Implement the forward pass.
- Implement the backpropagation algorithm.
- Update the weights.

<!--
```Python
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standarize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std
    
# Split off random 10% of the data for testing
np.random.seed(21)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.ix[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
```
-->

In [8]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)

        output = sigmoid(np.dot(hidden_output,
                                weights_hidden_output))

        ## Backward pass ##
        # TODO: Calculate the error
        error = y - output

        # TODO: Calculate error gradient in output unit
        output_error = error * output * (1 - output)

        # TODO: propagate errors to hidden layer
        hidden_error = np.dot(output_error, weights_hidden_output) * \
                       hidden_output * (1 - hidden_output)

        # TODO: Update the change in weights
        del_w_hidden_output += output_error * hidden_output
        del_w_input_hidden += hidden_error * x[:, None]

    # TODO: Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(x, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Train loss:  0.25135725242598617
Train loss:  0.24996540718842886
Train loss:  0.24862005218904654
Train loss:  0.24731993217179746
Train loss:  0.24606380465584848
Train loss:  0.24485044179257162
Train loss:  0.2436786320186832
Train loss:  0.24254718151769536
Train loss:  0.24145491550165465
Train loss:  0.24040067932493367
Prediction accuracy: 0.725
