# The Output Layer

### Introduction

How does a neural network indicate a prediction of a 2 or 3, or a number 9?

### Representing image data

What we actually pass into the network is a list of pixel values from an image.  Our image data will be black-and-white images of handwritten digits.

<img src="./mnist.png" width="30%">

In [2]:
import torch
torch.manual_seed(1)


x = torch.randint(100, (1, 784))

x.shape

torch.Size([1, 784])
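
The 784 above comes from the image dimensions: an MNIST digit is 28 pixels by 28 pixels, and flattening that grid gives 28 × 28 = 784 values. A minimal sketch of that flattening step, using random numbers as a stand-in for a real image (the names `image` and `flat` are just for illustration):

```python
import torch

torch.manual_seed(1)

# Stand-in for one 28x28 grayscale image (random values, not real MNIST data)
image = torch.randint(100, (28, 28))

# Flatten the 28x28 grid into a single row of 784 pixel values
flat = image.reshape(1, -1)

print(flat.shape)
```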

### The output layer

Our neural network will output a probability for each of the ten potential digits, indicating how likely it is that the image shows that digit.

In [75]:
output = [.05, .62, .03, .03, .05, .03, .15, .01, .02, .01]
output
# 0    1    2   3    4     5   6   7     8   9  

[0.05, 0.62, 0.03, 0.03, 0.05, 0.03, 0.15, 0.01, 0.02, 0.01]

So above, the network is predicting:
* a $.05$ probability (5 percent) of the picture being a 0
* a $.62$ probability (62 percent) of the picture being a 1
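
To turn a list of probabilities like this into a single predicted digit, one common approach is to take the position of the largest entry. A small sketch, assuming the example probabilities above (the variable names are illustrative):

```python
import torch

# Example probabilities from above, one per digit 0 through 9
output = torch.tensor([.05, .62, .03, .03, .05, .03, .15, .01, .02, .01])

# The predicted digit is the index of the largest probability
predicted_digit = torch.argmax(output).item()

# A valid set of probabilities adds up to (approximately) one
total = output.sum().item()

print(predicted_digit)  # 1
print(round(total, 4))  # 1.0
```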

### Defining our Neural Network

Now with this, we can define our neural network:

In [5]:
import torch.nn as nn

torch.manual_seed(5)

net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10)
)

net

Sequential(
  (0): Linear(in_features=784, out_features=64, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=64, out_features=10, bias=True)
)

Let's make sure we understand what's going on under the hood by representing our neural network mathematically.  Our neural network performs the following operations to produce the output of 10 numbers.

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

> Notice that the above follows our rules of matrix multiplication: (1) the inner dimensions are equal and (2) the shape of the output of the matrix multiplication is determined by the outer dimensions.
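
The equations above can be checked by hand against the PyTorch layers. The sketch below rebuilds the same network and input, then repeats each step with explicit matrix operations. One detail to note: `nn.Linear` stores its weight with shape `(out_features, in_features)`, so we transpose it to get the $784 \times 64$ and $64 \times 10$ matrices used in the equations. The names `W1`, `b1`, `W2`, `b2` are just labels for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(5)
net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10)
)

torch.manual_seed(1)
x = torch.randint(100, (1, 784)).float()

# Transpose the stored weights so the shapes match the equations
W1, b1 = net[0].weight.T, net[0].bias   # (784, 64), (64,)
W2, b2 = net[2].weight.T, net[2].bias   # (64, 10), (10,)

z1 = x @ W1 + b1          # shape (1, 64)
a1 = torch.sigmoid(z1)    # shape (1, 64)
z2 = a1 @ W2 + b2         # shape (1, 10)

print(z2.shape)
# Compare the hand-computed result with the network's own output
print(torch.allclose(z2, net(x), atol=1e-4))
```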

And we can confirm that we get an output tensor of length 10 by passing our data through the network like so:

In [6]:
x.shape

torch.Size([1, 784])

> Our feature vector starts with a length of 784, and when we pass it through our neural network, we get a vector of length 10.

In [7]:
prediction = net(x.float())

prediction

tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]], grad_fn=<AddmmBackward>)

### Taking the exponent

We have an additional goal: have one prediction dominate.

* Because each potential output is assigned some probability, this is difficult.
* So we want our neural network to exaggerate the prediction of its most confident output.
* We do this with the exponent, which increases larger numbers a lot more than smaller numbers.

In [8]:
num = torch.tensor(1.2)

torch.exp(num)

tensor(3.3201)

> The above is called applying the exponential function.  So above, we do this by calculating $e^{1.2} = 3.3201$.

But for larger numbers like $4.8$, applying the exponent produces:

In [125]:
larger_num = torch.tensor(4.8)

torch.exp(larger_num)

tensor(121.5104)

In [126]:
neg_num = torch.tensor(-3.)

torch.exp(neg_num)

tensor(0.0498)
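
One way to see the exaggeration is to compare ratios before and after applying the exponent. In the sketch below (variable names are illustrative), $4.8$ is four times as large as $1.2$, but after the exponent the larger number is roughly 36 times the smaller one, because $e^{4.8} / e^{1.2} = e^{3.6}$.

```python
import torch

nums = torch.tensor([1.2, 4.8])
exps = torch.exp(nums)

# Ratio of the larger number to the smaller, before and after the exponent
ratio_before = (nums[1] / nums[0]).item()
ratio_after = (exps[1] / exps[0]).item()

print(ratio_before)  # about 4
print(ratio_after)   # about 36.6, which is e^(4.8 - 1.2)
```

Notice also that the exponent of a negative number, like $-3$ above, is small but still positive, which is what we need if the results are going to become probabilities.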

> We start with the following prediction from our neural network:

In [9]:
prediction

tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]], grad_fn=<AddmmBackward>)

And applying the exponent produces the following:

In [11]:
exp = torch.exp(prediction)
exp

tensor([[0.9309, 1.2098, 1.3248, 0.9702, 1.5754, 0.9600, 2.7777, 0.8836, 0.6991,
         2.3330]], grad_fn=<ExpBackward>)

So we can see that one number, $2.7777$, now stands out from most of the others.

### Converting to probabilities

Now does our vector above look like a vector of probabilities? 

In [129]:
exp

tensor([[0.9309, 1.2098, 1.3248, 0.9702, 1.5754, 0.9600, 2.7777, 0.8836, 0.6991,
         2.3330]], grad_fn=<ExpBackward>)

* Not yet: several of the entries are greater than one, and for valid probabilities we need all of these numbers to add up to one.

So we need to convert to percentages, by dividing by the total.

In [12]:
sum_exp = torch.sum(exp)
sum_exp

tensor(13.6647, grad_fn=<SumBackward0>)

And then divide each entry by that total with the following:

In [13]:
(exp/sum_exp)

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<DivBackward0>)
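
As a quick check, we can confirm that dividing by the total gives entries between 0 and 1 that add up to one, and that the largest entry stays in the same position. This sketch copies the rounded values printed above into a fresh tensor, so the numbers are approximate:

```python
import torch

# Rounded exponent values copied from the output above
exp = torch.tensor([[0.9309, 1.2098, 1.3248, 0.9702, 1.5754,
                     0.9600, 2.7777, 0.8836, 0.6991, 2.3330]])

probs = exp / torch.sum(exp)

print(probs.sum().item())          # approximately 1.0
print(torch.argmax(exp).item())    # 6
print(torch.argmax(probs).item())  # 6, dividing does not change the ordering
```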

### Wrapping up

**softmax function**: apply the exponent to each of our outputs, and then divide each result by the sum of those exponents.

In [132]:
torch.exp(prediction)/torch.sum(torch.exp(prediction))

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<DivBackward0>)

Or, mathematically, it's the following:

$softmax(x) = \frac{e^x}{\sum e^x}$
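
PyTorch also provides this operation directly as `torch.softmax`. The sketch below compares the hand-written version with the built-in one, using the rounded prediction values from earlier (`manual` and `builtin` are illustrative names):

```python
import torch

# Rounded prediction values copied from the network output above
prediction = torch.tensor([[-0.0716, 0.1905, 0.2812, -0.0302, 0.4545,
                            -0.0408, 1.0216, -0.1237, -0.3579, 0.8472]])

# Softmax written out by hand: exponent, then divide by the sum
manual = torch.exp(prediction) / torch.sum(torch.exp(prediction))

# The built-in version, normalizing across the 10 outputs (dim=1)
builtin = torch.softmax(prediction, dim=1)

print(torch.allclose(manual, builtin, atol=1e-6))  # True
```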

And we can tack it onto the end of our neural network, so that the output is a set of probabilities in which one number tends to be dominant:

In [140]:
torch.manual_seed(5)
net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10),
    nn.Softmax(dim = 1)
)

net

Sequential(
  (0): Linear(in_features=784, out_features=64, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=64, out_features=10, bias=True)
  (3): Softmax(dim=1)
)

In [141]:
net(x.float())

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]], grad_fn=<SoftmaxBackward>)
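
The `dim = 1` argument tells the softmax to normalize across the 10 outputs of each row, rather than across the rows. This matters once we pass in more than one image at a time. A sketch under the assumption of a batch of three images, using random stand-in data (the `batch` tensor is hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(5)
net = nn.Sequential(
    nn.Linear(784, 64),
    nn.Sigmoid(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)

# A hypothetical batch of 3 flattened images (random values, not real digits)
batch = torch.randint(100, (3, 784)).float()

probs = net(batch)

print(probs.shape)       # torch.Size([3, 10])
print(probs.sum(dim=1))  # each of the 3 rows sums to approximately 1
```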

And if we want to represent this mathematically, this is our entire neural network:

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

$a_{1x10} = \frac{e^z}{\sum e^z}$ 

### Summary

In this lesson, we wrapped up the prediction function for a neural network.  In doing so, we saw that we want our neural network to predict a different probability for each potential output.  So in predicting which of 10 digits to output, we'd want something like the following.

In [None]:
output = [.05, .62, .03, .03, .05, .03, .15, .01, .02, .01]
output
# 0    1    2   3    4     5   6   7     8   9  

To produce an output like the above, we first need our final linear layer to have a separate neuron for each output.  Then our next goal is to turn this output into a set of probabilities where one number tends to dominate.  To get one number to dominate, we apply the exponent to each entry, like so:

In [146]:
output = torch.tensor([[-0.0716,  0.1905,  0.2812, -0.0302,  0.4545, -0.0408,  1.0216, -0.1237,
         -0.3579,  0.8472]])

exp_output = torch.exp(output)

Applying the exponent exaggerates our larger numbers, increasing them by a lot, while leaving the smaller numbers significantly less affected.  Finally, we need to turn our outputs into probabilities.  To do this, each entry must be between 0 and 1, and the entire set of outputs must add up to one.  It turns out we can achieve both goals just by turning the above set of outputs into percentages, dividing each entry by the total.

In [147]:
exp_output/torch.sum(exp_output)

tensor([[0.0681, 0.0885, 0.0969, 0.0710, 0.1153, 0.0703, 0.2033, 0.0647, 0.0512,
         0.1707]])

We saw that this procedure is called the softmax, and that it is represented mathematically as: $softmax(x) = \frac{e^x}{\sum e^x}$.

And we represented all of the layers of our neural network with the following:

$z_{1x64} = x_{1x784} \cdot W_{784x64} + b_{1x64}$

$a_{1x64} = \sigma(z_{1x64})$

$z_{1x10} = a_{1x64} \cdot W_{64x10} + b_{1x10}$

$a_{1x10} = \frac{e^z}{\sum e^z}$ 
