# The Output Layer

### Introduction

In the last lesson, we learned how to build a neural network with multiple layers.  And we saw that, through multiple layers, our neural network can make more and more abstract assessments.  This is really powerful when working with image data, where earlier layers can make concrete assessments like whether there are straight or curved lines, and later layers can combine the outputs from earlier layers to determine if the combination of curved and straight lines makes our picture likely to represent a specific number like 2 or 3.

Now how does a neural network indicate a prediction of a 2 or 3, or a number 9?  That's what we'll learn in this lesson.

### Representing image data

Before we understand how our neural network will output a prediction about an image, let's briefly see what it means to take in an image.  When we pass an image into our neural network, what we're really passing through is a list of that image's pixels.  Our image data will be black and white images of handwritten digits.

<img src="./mnist.png" width="30%">

So each image is a different observation and each individual pixel is represented as a different number, indicating how light or dark that pixel is.

So to generate some input data, let's just create a random list of 784 numbers to represent a single image.

In [74]:
import torch
torch.manual_seed(1)


x = torch.randint(100, (1, 784))

x.shape

torch.Size([1, 784])

So above we chose random integers from 0 to 99 (the upper bound passed to `torch.randint` is exclusive), and did so to produce a tensor 784 elements long.
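The number 784 is what we get from flattening a 28 by 28 grid of pixels into a single row. The sketch below assumes that image size (the lesson does not state it, but it is the usual size for MNIST-style digits) and uses a made-up `image` tensor in place of a real picture.

```python
import torch

torch.manual_seed(1)

# Hypothetical 28x28 grayscale image, with pixel values from 0 to 99
image = torch.randint(100, (28, 28))

# Flatten the grid into one row of 784 features: shape (1, 784)
flat = image.reshape(1, -1)

print(image.shape)
print(flat.shape)
```

Each row of the flattened tensor is one observation, and each column is one pixel.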

### The output layer

So we just saw what we feed into a neural network -- a vector where every pixel of an image is represented as a different value.  What do we want as the output of a neural network?  Well, for a classification problem -- which is what we have been working with thus far -- it's generally one prediction for each potential outcome.

So for our handwriting dataset, the potential outcomes are each of the digits that an image could represent.  This means that the neural network should output a tensor with ten entries.  And each entry will represent the probability of a different digit.

Let's see this below.

> Below, our neural network predicts a $.05$ probability of the picture being a 0, a $.62$ probability of the picture being a 1, and so on.

In [75]:
output = [.05, .62, .03, .03, .05, .03, .15, .01, .02, .01]
output
# 0    1    2   3    4     5   6   7     8   9  

[0.05, 0.62, 0.03, 0.03, 0.05, 0.03, 0.15, 0.01, 0.02, 0.01]
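One way to read a prediction off a vector like this is to look for the position of the largest entry. The sketch below is only an illustration: it wraps the same ten numbers in a tensor, and the names `predicted_digit` and `total` are introduced here rather than taken from the lesson.

```python
import torch

output = torch.tensor([.05, .62, .03, .03, .05, .03, .15, .01, .02, .01])

# The index of the largest probability is the predicted digit
predicted_digit = torch.argmax(output).item()

# A valid set of probabilities should add up to (approximately) one
total = output.sum().item()

print(predicted_digit)
print(round(total, 2))
```

Because the entries are ordered from digit 0 to digit 9, the index returned by `torch.argmax` is itself the digit.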

Ok, so this is what we want outputted from our neural network.  And with this in mind, let's move forward to constructing it.

### Defining our Neural Network

Ok, now it's time to build our neural network.  We know that each neuron in the first layer must take 784 features, and it's also true that we can have as many neurons as we want in that first layer.  

> One rule of thumb is to start with around half the number of features.  Here, that would be 392, but we're building a smaller neural network so we'll choose far fewer.

We'll specify 64 neurons in the first layer, and then pass these 64 outputs to the sigmoid function.  

In [121]:
import torch.nn as nn

torch.manual_seed(5)

net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10)
)

net

Sequential(
  (0): Linear(in_features=784, out_features=64, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=64, out_features=10, bias=True)
)

Then in the last layer, we'll see that each neuron takes in 64 features, and that we have 10 neurons.  We need 10 neurons in that last layer to produce the 10 outputs from the neural network.  And we need each neuron to take in 64 features, because we pass the 64 outputs from the previous layer to each of the neurons in our next layer.
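We can check these sizes by looking at the parameters of each linear layer. This is a small sketch that rebuilds the same network; note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, so the first layer's weight prints as 64 by 784 rather than 784 by 64.

```python
import torch
import torch.nn as nn

torch.manual_seed(5)

net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10)
)

# Print the weight and bias shapes of each linear layer
for layer in net:
    if isinstance(layer, nn.Linear):
        print(layer.weight.shape, layer.bias.shape)
```

Each row of a weight matrix holds the weights of one neuron, which is why the last layer has 10 rows of 64 weights.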

Let's make sure we understand what's going on under the hood by representing our neural network mathematically.  Mathematically, our neural network is performing the following operations to produce the output of 10 numbers.

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

> Confirm our rules for matrix multiplication: the inner dimensions must be equal, and the shape of the output is determined by the outer dimensions.  For example, multiplying a $1x784$ matrix by a $784x64$ matrix produces a $1x64$ result.
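The equations above can be sketched by hand with the network's own weights. This is an illustration under a couple of assumptions: the names `first`, `last`, `z1`, `a1` and `z2` are made up for this example, and because PyTorch stores each weight as `(out_features, in_features)`, we transpose it to match the shapes written in the math.

```python
import torch
import torch.nn as nn

torch.manual_seed(5)
net = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid(), nn.Linear(64, 10))

torch.manual_seed(1)
x = torch.randint(100, (1, 784)).float()

first, last = net[0], net[2]

# (1, 784) @ (784, 64) + (64,) -> (1, 64)
z1 = x @ first.weight.T + first.bias
a1 = torch.sigmoid(z1)

# (1, 64) @ (64, 10) + (10,) -> (1, 10)
z2 = a1 @ last.weight.T + last.bias

print(z2.shape)
# Compare the manual result with the network's own forward pass
print(torch.allclose(z2, net(x), atol=1e-4))
```

If the manual steps match the equations, the comparison on the last line should report that the two results agree.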

And we can confirm that we get an output tensor of length 10 by passing through our data like so:

In [122]:
x.shape

torch.Size([1, 784])

In [123]:
prediction = net(x.float())

prediction

tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]], grad_fn=<AddmmBackward>)

Ok, so we did produce 10 numbers to output from our neural network -- but there are a couple of issues with the output above.  We'll discuss, and remedy, those issues below.

### Taking the exponent

Ok, so there are a couple of goals that we would like to achieve with this output layer.  One is that we want just one prediction to dominate.  For example, if our neural network thinks that the digit is most likely the number 0, we want that first slot to have a probability of .9 to express that confidence.

But because our output has to represent 10 outcomes, and each potential outcome is assigned at least some probability, this becomes difficult.  For example, if each of the other 9 outcomes is assigned a small probability like $.03$, this would only leave $1 - 9*.03 = .73$ for the outcome we are most confident about.  And a $.73$ doesn't exactly express confidence about a particular outcome.

To fix this, we have our neural network exaggerate the outputs it is most confident about.  And it does this by taking the exponent, that is, by computing $e^x$ for each output $x$.  Let's see how this works through a few examples.  For small numbers like $1.2$, applying the exponent produces:

In [124]:
num = torch.tensor(1.2)

torch.exp(num)

tensor(3.3201)

But for larger numbers like $4.8$, applying the exponent produces:

In [125]:
larger_num = torch.tensor(4.8)

torch.exp(larger_num)

tensor(121.5104)

So the exponent really exaggerates the larger numbers -- multiplying the number $4.8$ by about 25, while not even tripling the number $1.2$.

And let's see how it treats negative values:

In [126]:
neg_num = torch.tensor(-3.)

torch.exp(neg_num)

tensor(0.0498)

So it transforms the negative number into a small positive output, which also is appropriate for returning probabilities.  (After all we cannot say there is a negative probability that an outcome occurs.)
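To see the exaggeration more directly, we can apply the exponent to all three example numbers at once and compare them. In this sketch, `scores`, `exp_scores` and `ratio` are names made up for the illustration.

```python
import torch

scores = torch.tensor([-3.0, 1.2, 4.8])
exp_scores = torch.exp(scores)

# A gap of 3.6 between two scores becomes a ratio of e^3.6 after the exponent
ratio = exp_scores[2] / exp_scores[1]

print(exp_scores)
print(ratio)
```

Before the exponent, 4.8 is four times as large as 1.2. Afterwards, it is roughly 36.6 times as large, and every entry is positive.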

So let's apply the exponent function to every output from our last layer.

> We start with the following prediction from our neural network:

In [127]:
prediction

tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]], grad_fn=<AddmmBackward>)

And applying the exponent produces the following:

In [128]:
exp = torch.exp(prediction)
exp

tensor([[0.9309, 1.2098, 1.3248, 0.9702, 1.5754, 0.9600, 2.7777, 0.8836, 0.6991,
         2.3330]], grad_fn=<ExpBackward>)

So we can see that the largest number, $2.7777$, tends to dominate the others.

### Converting to probabilities

Now does our vector above look like a vector of probabilities? 

In [129]:
exp

tensor([[0.9309, 1.2098, 1.3248, 0.9702, 1.5754, 0.9600, 2.7777, 0.8836, 0.6991,
         2.3330]], grad_fn=<ExpBackward>)

Not really.  For an entry to be a valid probability, it must be between 0 and 1, and the entire vector should add up to one.  

> In other words, there is a probability of 1, that our observation is *any* of the potential outcomes (0 through 9).

But we can transform our vector into probabilities by calculating each entry's share of the total.

> So we get the total by adding up the vector.

In [130]:
sum_exp = torch.sum(exp)
sum_exp

tensor(13.6647, grad_fn=<SumBackward0>)

And then divide each entry by that total with the following:

In [131]:
(exp/sum_exp)

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<DivBackward0>)

So now we are saying there is a $.20$ probability of our image being the number 6.  Not exactly the dominant prediction we were looking for, but it's a start.
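As a quick check, we can confirm that the result meets both requirements for probabilities and see which digit comes out on top. This sketch hard-codes the prediction printed earlier so that it runs on its own; the name `probs` is introduced here for the illustration.

```python
import torch

prediction = torch.tensor([[-0.0716, 0.1905, 0.2812, -0.0302, 0.4545,
                            -0.0408, 1.0216, -0.1237, -0.3579, 0.8472]])

exp = torch.exp(prediction)
probs = exp / torch.sum(exp)

# Every entry should fall between 0 and 1
print(bool(((probs > 0) & (probs < 1)).all()))

# The entries should add up to (approximately) one
print(probs.sum())

# The index of the largest probability is the predicted digit
print(torch.argmax(probs, dim=1))
```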

### Wrapping up

So the operation that we learned above of applying the exponent to each of our outputs, and then dividing by the sum of those exponents is called the softmax function.  And as we saw, it just looks like the following:

In [132]:
torch.exp(prediction)/torch.sum(torch.exp(prediction))

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<DivBackward0>)

Or, mathematically, it's the following:

$softmax(x) = \frac{e^x}{\sum e^x}$
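One detail worth noting: dividing by `torch.sum` of the whole tensor works here because we have a single row. With several images in a batch, each row needs its own total. Below is a minimal sketch, assuming a made-up batch of three predictions and a hypothetical helper named `softmax_rows`, compared against PyTorch's built-in `torch.softmax`.

```python
import torch

def softmax_rows(z):
    """Apply the softmax separately to each row of a 2-D tensor."""
    exp = torch.exp(z)
    # keepdim=True keeps the totals as a column so each row divides by its own sum
    return exp / exp.sum(dim=1, keepdim=True)

torch.manual_seed(0)
batch = torch.randn(3, 10)   # hypothetical outputs for 3 images

manual = softmax_rows(batch)
builtin = torch.softmax(batch, dim=1)

print(manual.sum(dim=1))                 # one total per row
print(torch.allclose(manual, builtin))   # compare with the built-in version
```

The `dim=1` argument plays the same role as in `nn.Softmax(dim = 1)`: it says that the ten entries within each row are the ones that should add up to one.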

And we can tack it onto the end of our neural network, so that we get an output of probabilities, where one number tends to be dominant with the following:

In [140]:
torch.manual_seed(5)
net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10),
    nn.Softmax(dim = 1)
)

net

Sequential(
  (0): Linear(in_features=784, out_features=64, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=64, out_features=10, bias=True)
  (3): Softmax(dim=1)
)

In [141]:
net(x.float())

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<SoftmaxBackward>)
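With the softmax in place, turning the network's output into a single predicted digit is a matter of finding the largest probability. The sketch below rebuilds the network and input so that it runs on its own; `probs` and `predicted_digit` are names made up for the illustration, and the exact values may vary with the PyTorch version since they depend on random initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(5)
net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)

torch.manual_seed(1)
x = torch.randint(100, (1, 784)).float()

# Gradients are not needed when we only want a prediction
with torch.no_grad():
    probs = net(x)

predicted_digit = probs.argmax(dim=1).item()

print(probs.sum().item())
print(predicted_digit)
```

Since this network has not been trained, the predicted digit only reflects the random starting weights.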

And if we want to represent this mathematically, this was our entire neural network.

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

$a_{1x10} = \frac{e^z}{\sum e^z}$ 

### Summary

In this lesson, we wrapped up the prediction function for a neural network.  In doing so, we saw that we want our neural network to predict a different probability for each potential output.  So in predicting which of 10 digits to output, we'd want something like the following.

In [None]:
output = [.05, .62, .03, .03, .05, .03, .15, .01, .02, .01]
output
# 0    1    2   3    4     5   6   7     8   9  

To produce an output like the above, we first need our final linear layer to have a separate neuron for each output.  Then our next goal is to turn this output into a set of probabilities where one number tends to dominate.  To get one number to dominate, we apply the exponent to each entry, like so:

In [146]:
output = torch.tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]])

exp_output = torch.exp(output)

Applying the exponent exaggerates our larger numbers, increasing them by a lot -- while leaving the smaller numbers significantly less affected.  Finally, we need to turn our outputs into probabilities.  To do this, each entry must be between 0 and 1, and the entire set of outputs should add up to one.  It turns out we can achieve both goals just by dividing each entry by the total, which turns the above set of outputs into shares of that total.

In [147]:
exp_output/torch.sum(exp_output)

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]])

We saw that this procedure is called the softmax, and represented mathematically as: $softmax(x) = \frac{e^x}{\sum e^x}$.

And, we represented all of the layers of our neural network with the following:

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

$a_{1x10} = \frac{e^z}{\sum e^z}$ 
