# Completing the feedforward: the last layer 

### Introduction

So far we have learned how to write almost the entire hypothesis function of a neural network.  There is only one step remaining, and that is adding the last layer to the network.

### Getting Started

We call the last layer of a neural network the output layer.  How many outputs should there be in the output layer?  With a classification problem, which is what we have been working with thus far, it's generally one prediction per potential outcome.  So in building a network to detect cancer, the practice would be to have one neuron that makes a prediction for positive and one neuron that makes a prediction for negative.

This makes more sense in a problem that has more than two potential outcomes.  For example, below are observations from the famous MNIST dataset.  With MNIST, images (in the form of pixels) are fed into a neural network, and the network is tasked with predicting which of the ten digits each handwritten image represents.

<img src="./mnist.png" width="30%">

This means that the neural network should output a vector with ten entries, with each entry indicating the probability of a different outcome.  

In [67]:
[.05, .62, .03, .03, .05, .03, .15, .01, .02, .01]
# 0    1    2   3    4     5   6   7     8   9  

[0.05, 0.62, 0.03, 0.03, 0.05, 0.03, 0.15, 0.01, 0.02, 0.01]

> For example, above, the neural network is saying there is a probability of .62 that an observation represents the number 1.
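
As a quick sanity check, here is a minimal sketch (the names `predictions`, `total` and `predicted_digit` are ours, chosen for illustration) confirming that the ten entries add up to one, and picking out the most likely digit with `np.argmax`:

```python
import numpy as np

# The example output vector from above, one entry per digit 0 through 9.
predictions = np.array([.05, .62, .03, .03, .05, .03, .15, .01, .02, .01])

# A valid set of probabilities should add up to one (allowing for float rounding).
total = predictions.sum()
print(np.isclose(total, 1.0))

# The index of the largest entry is the digit the network considers most likely.
predicted_digit = int(np.argmax(predictions))
print(predicted_digit)
```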

But let's build up to this.  We can start by defining the input layer.

### The input layer

Each observation in our dataset, that is, each image of a handwritten digit, has 28 pixels per row and 28 pixels per column.  And so each observation is represented by a feature vector of 784 entries, one entry per pixel.
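
To make the flattening concrete, here is a small sketch.  The `image` array is a made-up placeholder of zeros rather than a real MNIST digit, and `reshape` is just one way to flatten it:

```python
import numpy as np

# A placeholder 28x28 grayscale image (all zeros), standing in for a real digit.
image = np.zeros((28, 28))

# Flatten the grid row by row into a single feature vector.
features = image.reshape(-1)
print(features.shape)  # (784,)
```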

So this means that the first layer should have neurons that accept a feature vector with $28\times28 = 784$ entries.  Let's initialize a weight matrix that has 784 rows and 20 columns, one column per neuron.

In [82]:
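import numpy as np
np.random.seed(2)  # fix the seed so the random values are reproducible
x = np.random.randn(784)  # a stand-in for one flattened 28x28 image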

In [83]:
W_1 = np.random.randn(784, 20)
b_1 = np.random.randn(20)

In [84]:
def sigmoid(value): return 1/(1 + np.exp(-value))

In [85]:
l1 = sigmoid(x.dot(W_1) + b_1)

In [86]:
l1.shape

(20,)

So we have twenty outputs from our first layer.  We don't need more than two layers for this example, so let's stick to just an input layer and an output layer to keep things simple. Let's get going.

### The last layer: the output layer

Ok, now let's think about the dimensions of the output layer.  We want to connect each of the twenty neurons from the input layer to each of the neurons in the output layer.  So this means that our weight matrix should have 20 rows, so that each output neuron receives all twenty outputs from the previous layer.

And how many neurons should we have?  Well, we want our network to output ten values, so this means that we should have 10 neurons.  Each neuron is responsible for the probability of the observation being one particular digit.

In [87]:
W_2 = np.random.randn(20, 10)
b_2 = np.random.randn(10)

So now our network looks like the following.

In [88]:
l1 = sigmoid(x.dot(W_1) + b_1)
l2 = l1.dot(W_2) + b_2

In [89]:
l2

array([-1.61553713,  0.07881626, -1.41537295,  0.62776172,  0.12679912,
        0.11159318,  0.90092195,  4.79589342, -1.77899941, -1.75327508])

The vector above contains the raw outputs of our last linear layer: ten unbounded values, some negative and some positive.
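
A quick way to confirm that the dimensions line up is to follow the shapes through both layers.  The sketch below draws its own random values and uses hypothetical names (`hidden`, `scores`), so the numbers differ from the cells above, but the shapes are the same:

```python
import numpy as np

def sigmoid(value):
    return 1 / (1 + np.exp(-value))

rng = np.random.default_rng(0)  # assumption: any seed works, only the shapes matter
x = rng.standard_normal(784)    # one flattened image
W_1, b_1 = rng.standard_normal((784, 20)), rng.standard_normal(20)
W_2, b_2 = rng.standard_normal((20, 10)), rng.standard_normal(10)

hidden = sigmoid(x.dot(W_1) + b_1)  # (784,) dot (784, 20) -> (20,)
scores = hidden.dot(W_2) + b_2      # (20,) dot (20, 10)  -> (10,)
print(hidden.shape, scores.shape)   # (20,) (10,)
```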

### Taking the exponent

Now there are a couple of goals that we would like to achieve with this output layer.  One is that we want just one number to dominate.  Remember that we have 10 outcomes, and each will be assigned a probability.  So if each of the other 9 outcomes is assigned a small probability like .03, this still only leaves $1 - 9*.03 = .73$ for the outcome we are most confident about.
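
The arithmetic above can be checked directly; the names `others` and `remaining` are just illustrative:

```python
# Nine competing outcomes, each assigned a probability of .03.
others = 9 * 0.03

# Whatever is left over goes to the outcome we are most confident about.
remaining = 1 - others
print(round(remaining, 2))  # 0.73
```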

So we can exaggerate the outputs we are confident about by taking the exponent of each value, that is, computing $e^x$.  A small number like $1.2$ is turned into:

In [91]:
np.exp(1.2)

3.3201169227365472

But a larger number like $4.8$ is transformed into:

In [92]:
np.exp(4.8)

121.51041751873485

So the exponent really exaggerates our largest number in comparison to all of the other numbers.  And let's see how it treats negative values:

In [94]:
np.exp(-3)

0.049787068367863944

It transforms the number into a small positive output, which also is appropriate for returning probabilities.  (We cannot say there is a negative probability that an outcome occurs.)
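
To put a number on how much the exponent exaggerates the gap, the sketch below compares the two example values before and after exponentiating.  The variable names are ours, chosen for illustration:

```python
import numpy as np

small, large = 1.2, 4.8

# Before exponentiating, the larger value is about 4 times the smaller one.
ratio_before = large / small

# After exponentiating, the ratio is e^(4.8 - 1.2) = e^3.6, roughly 36.6.
ratio_after = np.exp(large) / np.exp(small)
print(ratio_before, ratio_after)
```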

So let's apply the exponent function to every entry in our vector.

In [97]:
exp = np.exp(l2)
exp

array([  0.19878387,   1.0820055 ,   0.24283503,   1.87341266,
         1.13518896,   1.11805792,   2.46187179, 121.01244834,
         0.16880697,   0.17320575])

So just as we wanted, we have one entry that dominates all of the others.

In [98]:
exp[7]

121.01244834062315

### Converting to probabilities

Now does our vector above look like a vector of probabilities? 

In [99]:
exp

array([  0.19878387,   1.0820055 ,   0.24283503,   1.87341266,
         1.13518896,   1.11805792,   2.46187179, 121.01244834,
         0.16880697,   0.17320575])

Probably not.  For an entry to be a valid probability, it must be between 0 and 1, and the entire vector should add up to one.  

> In other words, the probability that our observation is any of the potential outcomes is 1.

We can transform our vector into this form by calculating the fraction that each entry is of the total.

In [101]:
sum_exp = sum(exp)
sum_exp

129.46661679732003

In [102]:
exp/sum_exp

array([0.00153541, 0.00835741, 0.00187566, 0.01447024, 0.0087682 ,
       0.00863588, 0.01901549, 0.93470001, 0.00130386, 0.00133784])

So now we are saying there is a $.93$ probability of our image being the number 7.

### Wrapping up

It may have seemed complicated, but this was our entire neural network.

In [104]:
l1 = sigmoid(x.dot(W_1) + b_1)
l2 = l1.dot(W_2) + b_2
np.exp(l2)/sum(np.exp(l2))

array([0.00153541, 0.00835741, 0.00187566, 0.01447024, 0.0087682 ,
       0.00863588, 0.01901549, 0.93470001, 0.00130386, 0.00133784])

And the last line above is the formula for the softmax function, where $x_i$ is the $i$-th entry of the vector $x$:

$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$

So with the softmax function, we can finish with a neural network that returns the probability of each outcome.
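
As a sketch, the steps above can be wrapped into a reusable function.  Subtracting the largest score before exponentiating is not part of the walkthrough above; it is a common variation that leaves the result unchanged (the shift cancels in the ratio) while avoiding overflow for large scores.  The `scores` array simply copies the `l2` values printed earlier.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores into probabilities that sum to one."""
    # Assumption: shift by the max for numerical stability; the shift cancels out.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# The raw outputs of the last linear layer, copied from the l2 cell above.
scores = np.array([-1.61553713, 0.07881626, -1.41537295, 0.62776172, 0.12679912,
                   0.11159318, 0.90092195, 4.79589342, -1.77899941, -1.75327508])

probabilities = softmax(scores)
print(probabilities.round(4))
print(int(np.argmax(probabilities)))  # 7, the digit with the largest score
```

Without the shift, a score in the hundreds or thousands would make `np.exp` overflow, even though the final probabilities are perfectly ordinary numbers.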