# Building Layers

### Introduction

In the last lesson, we learned about building a layer of multiple neurons.

<img src="./first-layer.png" width="20%">

We did so by building a weight matrix whose dimensions describe the layer: a column for each neuron, and a row for each feature that the neuron accepts.

$z = x\cdot W  + b = \begin{bmatrix}
- & x &  -  
\end{bmatrix} \cdot \begin{bmatrix}
|  & |  \\
w_1  & w_2 \\
|   & |
\end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix} = \begin{bmatrix}
x \cdot w_1 & x \cdot w_2 \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix} = \begin{bmatrix} z_1(x) & z_2(x) \end{bmatrix}$

$A(z) = \sigma (x \cdot W + b) = \begin{bmatrix} \sigma(z_1) & \sigma(z_2) \end{bmatrix}$

In this lesson, we'll begin to see how we can take the outputs from a linear layer and activation layer, and feed them into yet another layer.

<img src="./artificial-network.png" width="50%">

If we think about our handwriting detection example, we can remember why we want to do this.  The idea is that the first layer could make more concrete determinations, like whether the picture has straight lines or curves.  And the later layers could take these earlier determinations to make more abstract assessments like whether these curves are stacked on top of each other to form an 8, as opposed to a 0.  So the information from one layer is passed to these deeper layers.  In this lesson, we'll see precisely how this works in code.

### The rules of matrix multiplication

So now it's time to understand what it means to pass outputs from one layer to another in code.  To understand this, we first need to be more comfortable working with matrices.  Let's start by representing the single layer from the example above in code.  We'll start with the weight matrix for the two neurons -- we'll add in the bias vector later on:

In [2]:
import torch

w_size = torch.tensor([1, 3, -.5, 0])
w_shape = torch.tensor([0, -.5, 3, 1.5])

W = torch.stack((w_size, w_shape), dim = 0).T
W

tensor([[ 1.0000,  0.0000],
        [ 3.0000, -0.5000],
        [-0.5000,  3.0000],
        [ 0.0000,  1.5000]])

And we have our feature vector representing a single observation:

In [36]:
# area, perimeter, number concave points, symmetry error  
x_1 = torch.tensor([[2, 4, 3, 2]]).float()

And we can start by calculating the weighted sum.

$g(x) = x\cdot W = \begin{bmatrix}
- & x &  -  
\end{bmatrix} \cdot \begin{bmatrix}
|  & |  \\
w_1  & w_2 \\
|   & |
\end{bmatrix} $

In [37]:
weighted_sum = x_1 @ W
weighted_sum

tensor([[12.5000, 10.0000]])

Now, the most important thing to understand from the matrix multiplication above is the dimensions of the inputs, and the dimensions of the output.  Let's go through this.

Above, our feature vector $x_1$ has one row and four columns. 

In [18]:
x_1.shape

torch.Size([1, 4])

And the weight matrix $W$ has 4 rows and two columns, where each column represents a neuron.

In [19]:
W.shape

torch.Size([4, 2])

Now it turns out that for matrix multiplication to work, the inner two dimensions must be equal.  We can see this above -- we have dimensions of `[1, 4]` and `[4, 2]`.  In other words, the number of *columns* of the first matrix (or vector) must equal the number of *rows* of the second matrix.

This is what we get when we multiply `x_1` by `W`.  But notice that if we place `W` first, and try to perform $W \cdot x$, then the dimensions are `[4, 2]` `[1, 4]` and the inner dimensions are no longer equal.

PyTorch, following the rules of matrix multiplication, will throw an error.

In [27]:
W @ x

RuntimeError: size mismatch, [4 x 2], [4] at /Users/distiller/project/conda/conda-bld/pytorch_1580186068235/work/aten/src/TH/generic/THTensorMath.cpp:292

So our first rule is that the inner dimensions must be equal.  Here is our second rule: the *outer dimensions* will be the dimensions of the resulting matrix.  So if we multiply $x \cdot W$, where we have dimensions of `[1, 4]` `[4, 2]`, the resulting matrix will have dimensions of `[1, 2]`: one row and two columns.

In [28]:
x_1 @ W

tensor([[12.5000, 10.0000]])

So those are our two rules for matrix multiplication:
1. The inner dimensions must be equal
2. The *outer dimensions* will be the dimensions of the resulting matrix
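
The two rules can be written down as a small helper that predicts the shape of a product before we compute it. This is only a sketch: `matmul_shape` is a hypothetical name introduced here, not a PyTorch function, and it only handles two-dimensional shapes.

```python
def matmul_shape(left, right):
    """Predict the shape of `left @ right` for two 2-D shapes.

    `matmul_shape` is a hypothetical helper, for illustration only.
    """
    left_rows, left_cols = left
    right_rows, right_cols = right
    # Rule 1: the inner dimensions must be equal.
    if left_cols != right_rows:
        raise ValueError(f"inner dimensions differ: {left_cols} vs {right_rows}")
    # Rule 2: the outer dimensions give the shape of the result.
    return (left_rows, right_cols)


print(matmul_shape((1, 4), (4, 2)))  # (1, 2)
```

Calling `matmul_shape((4, 2), (1, 4))` raises a `ValueError`, mirroring the size mismatch we saw when we placed `W` first.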

Finally, our rule for matrix *addition* is that the dimensions must match.  So when we add our bias vector $b$ to the result above, the bias vector needs the same number of entries as $x \cdot W$ has columns: two.

In [39]:
x_1 @ W

tensor([[12.5000, 10.0000]])

In [40]:
b = torch.tensor([-12, -7])

In [41]:
x_1 @ W + b

tensor([[0.5000, 3.0000]])
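
In practice, PyTorch applies the addition rule through broadcasting: a bias of shape `[2]` is stretched across a result of shape `[1, 2]`, while a bias with the wrong number of entries still fails. Here is a small sketch of both cases, where `x_row` and `b_wrong` are names introduced only for this illustration:

```python
import torch

x_row = torch.tensor([[2., 4., 3., 2.]])        # shape [1, 4]
W = torch.tensor([[1., 0.],
                  [3., -.5],
                  [-.5, 3.],
                  [0., 1.5]])                    # shape [4, 2]
b = torch.tensor([-12., -7.])                    # shape [2]

z = x_row @ W + b         # b is broadcast across the single row
print(z.shape)            # torch.Size([1, 2])

b_wrong = torch.tensor([-12., -7., 1.])          # three entries, but only two neurons
try:
    x_row @ W + b_wrong
except RuntimeError as err:
    print("addition failed:", type(err).__name__)
```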

And when we move on to the activation layer, by applying the sigmoid function, our dimensions stay the same.  This is because the sigmoid function is applied separately to each entry of $z = x \cdot W + b$.

$\sigma (x \cdot W + b) = \begin{bmatrix} \sigma(z_1) & \sigma(z_2) \end{bmatrix}$

In [42]:
z = x_1 @ W + b
z

tensor([[0.5000, 3.0000]])

In [44]:
torch.sigmoid(z)

tensor([[0.6225, 0.9526]])
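
To see why the shape is unchanged, we can write the sigmoid out by hand as $\sigma(z) = \frac{1}{1 + e^{-z}}$ and apply it entry by entry. This is a sketch for checking our understanding, not a replacement for `torch.sigmoid`; the name `manual` is ours.

```python
import torch

z = torch.tensor([[0.5, 3.0]])

# Sigmoid written out by hand: 1 / (1 + e^(-z)), applied to each entry.
manual = 1 / (1 + torch.exp(-z))

print(manual.shape == z.shape)                    # True: the shape is unchanged
print(torch.allclose(manual, torch.sigmoid(z)))   # True
```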

### Why it matters

This matters because we'll build our multiple layers by passing the output from one layer as the input to another layer.  Let's see this visually, and then we'll try to better understand it through code.

The diagram below shows a single observation's features, `[x1, x2, x3, x4, x5, x6]`, being passed to each of the four neurons in our linear layer.  The different neurons each make a different assessment, each producing an output from the sigmoid function between 0 and 1.  Those four numbers are the output from the first layer -- below, `.12`, `.83`, `.42`, and `.76`.  These outputs are packaged up in a list, and the whole set of outputs is passed to each neuron in the next layer.

<img src="./big_layers.svg" width="40%">

So to recap, all of the features from an observation are passed to all of the neurons in the first layer.  And then the set of outputs -- one from each neuron -- is passed to each of the neurons in the next layer.

Ok, now let's see this in code.

> We'll start with our observation representing the features of a single cancer cell -- this time adding features representing color.

In [19]:
# area, perimeter, number concave points, symmetry error, darkness, contrast
x_2 = torch.tensor([[2, 4, 3, 2, 3, 5]]).float()

To build a layer that takes in a feature vector of this size, we need a matrix where each column has 6 rows -- one for each feature.  So here's our layer of four neurons, each with six weights.

In [78]:
w_size = torch.tensor([1, 3, -.5, 0, .1, 1.3])
w_shape = torch.tensor([0, -.5, 3., 1.5, .5, 1.])
w_smoothness = torch.tensor([1.5, .5, 2, 1.5, .8, 1.5])
w_color = torch.tensor([.1, -.5, .3, 0, 4, -2])

# stack the four weight vectors as rows, then transpose so each neuron is a column
W_1 = torch.stack((w_size, w_shape, w_smoothness, w_color), dim = 0).T
W_1

tensor([[ 1.0000,  0.0000,  1.5000,  0.1000],
        [ 3.0000, -0.5000,  0.5000, -0.5000],
        [-0.5000,  3.0000,  2.0000,  0.3000],
        [ 0.0000,  1.5000,  1.5000,  0.0000],
        [ 0.1000,  0.5000,  0.8000,  4.0000],
        [ 1.3000,  1.0000,  1.5000, -2.0000]])

In [39]:
W_1.shape

torch.Size([6, 4])

And we'll need a bias vector of length 4, with one entry for each neuron.

In [40]:
b_1 = torch.tensor([-3, -12, -15, -4])
b_1

tensor([ -3, -12, -15,  -4])

Ok, so now let's see how this example lines up with our two rules about matrix multiplication.  First, to perform $x \cdot W$, we'll need the inner two dimensions to be equal.

In [41]:
x_2.shape, W_1.shape

(torch.Size([1, 6]), torch.Size([6, 4]))

Ok, looks good.  And from the above, we can also predict that the output will be a `[1, 4]` matrix, as those are the outer dimensions.

In [42]:
x_2 @ W_1

tensor([[19.3000, 16.5000, 23.9000,  1.1000]])

Cool.  Now let's pass our data through the first linear layer (including the bias), followed by the activation layer.

In [44]:
z = x_2 @ W_1 + b_1
z

tensor([[16.3000,  4.5000,  8.9000, -2.9000]])

In [50]:
A_1 = torch.sigmoid(z)
A_1

tensor([[1.0000, 0.9890, 0.9999, 0.0522]])

So the above is the output from the first linear layer and its activation.  We then take this output vector and pass it to each of the three neurons in the second linear layer, each of which uses its own set of weights and its own bias to make a more abstract assessment.

In [47]:
w_large_dark = torch.tensor([1, 3, -.5, 0])
w_dark_irregular = torch.tensor([0, -.5, 3., 1.5])
w_large_irregular = torch.tensor([1.5, .5, 2, 1.5])

W_2 = torch.stack((w_large_dark, w_dark_irregular, w_large_irregular), dim = 0).T
W_2

tensor([[ 1.0000,  0.0000,  1.5000],
        [ 3.0000, -0.5000,  0.5000],
        [-0.5000,  3.0000,  2.0000],
        [ 0.0000,  1.5000,  1.5000]])

In [67]:
b_2 = torch.tensor([-4, -5, -2])

And we take the output from the previous layer, $A_1$, and pass it to our linear layer, followed by our activation layer.

In [68]:
Z_2 = A_1 @ W_2 + b_2
Z_2

tensor([[-0.5329, -2.4167,  2.0725]])

In [69]:
A_2 = torch.sigmoid(Z_2)
A_2

tensor([[0.3698, 0.0819, 0.8882]])

And if we use our knowledge about matrix algebra, we'll see that once again our dimensions lined up.  That is, when we performed $A_1 \cdot W_2$, we had dimensions of `[1, 4]` `[4, 3]`, so the inner dimensions aligned, and the result has dimensions of `[1, 3]`.
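
Both layers follow the same pattern, so the two steps can be wrapped in a loop. The sketch below is one way to do that, reusing the weights and biases from this lesson. `forward`, `W_first`, `b_first`, `W_second` and `b_second` are hypothetical names introduced for this illustration.

```python
import torch

def forward(x, layers):
    """Pass `x` through each (W, b) pair, applying sigmoid after every layer.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    a = x
    for W, b in layers:
        # Rule 1: columns of the input must equal rows of the weight matrix.
        assert a.shape[1] == W.shape[0], f"{tuple(a.shape)} vs {tuple(W.shape)}"
        a = torch.sigmoid(a @ W + b)
    return a

x = torch.tensor([[2., 4., 3., 2., 3., 5.]])                  # [1, 6]
W_first = torch.tensor([[1., 3., -.5, 0., .1, 1.3],
                        [0., -.5, 3., 1.5, .5, 1.],
                        [1.5, .5, 2., 1.5, .8, 1.5],
                        [.1, -.5, .3, 0., 4., -2.]]).T          # [6, 4]
b_first = torch.tensor([-3., -12., -15., -4.])
W_second = torch.tensor([[1., 3., -.5, 0.],
                         [0., -.5, 3., 1.5],
                         [1.5, .5, 2., 1.5]]).T                 # [4, 3]
b_second = torch.tensor([-4., -5., -2.])

out = forward(x, [(W_first, b_first), (W_second, b_second)])
print(out.shape)  # torch.Size([1, 3])
```

The values in `out` should agree with `A_2` above, up to rounding.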

### Making it Predictable

Ok, so let's draw some conclusions about neural networks based on what we saw above.

> We'll use the image below as an example:

<img src="./big_layers.svg" width="30%">

Above we have a neural network, where each observation has six features -- $x_1$ through $x_6$.  So we feed in a vector x, with dimensions of `[1, 6]`.  This means that in the first layer, each neuron must have a weight vector of length 6 -- and we could have as many neurons as we want -- but we chose 4.  So our first layer has a weight matrix of dimensions `[6, 4]`.  We also know that its output will be a `[1, 4]` vector, meaning that each neuron in the next layer must have 4 weights, and we chose to have 3 neurons there.

> So notice that the number of *neurons* in each layer is flexible, while the number of weights in each neuron is determined by the input it's receiving.
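
This rule means the shape of every weight matrix follows from a single list of sizes: the number of input features, followed by the number of neurons we choose for each layer. A minimal sketch, where `weight_shapes` is a hypothetical helper introduced here:

```python
def weight_shapes(layer_sizes):
    """Given [n_features, neurons_in_layer_1, ...], list each weight matrix's shape.

    Hypothetical helper, for illustration only.
    """
    # Each layer's row count is fixed by the size of the input it receives;
    # its column count (the number of neurons) is our choice.
    return list(zip(layer_sizes[:-1], layer_sizes[1:]))


print(weight_shapes([6, 4, 3]))   # [(6, 4), (4, 3)]
```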

Now, it's time to construct our neural network using PyTorch.  We need a neural network that will start by taking in six features to match our observation.  Remember that a single observation looks like the following:

In [72]:
x_2.shape

torch.Size([1, 6])

And we can feed it into a neural network that looks like the following:

In [83]:
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(6, 4),
    nn.Sigmoid(),
    nn.Linear(4, 3)
)

net

Sequential(
  (0): Linear(in_features=6, out_features=4, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=4, out_features=3, bias=True)
)

So in the neural network above, we have four neurons in the first layer, and then the outputs from these four neurons, after the sigmoid, get passed to each of the three neurons in the second layer.

Let's try passing through our observation.

In [89]:
net(x_2)

tensor([[-0.2261, -0.3032, -0.9666]], grad_fn=<AddmmBackward>)

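So above we pass through our observation, and get the output from the three neurons in the last layer.  Because `nn.Linear` initializes its weights randomly, the exact numbers will differ from run to run.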
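
One detail worth checking for yourself: `nn.Linear` stores its weight matrix as `[out_features, in_features]`, the transpose of the `[6, 4]` layout we built by hand, and applies it as $x \cdot W^T + b$. The sketch below rebuilds the network, with its own random weights, and reproduces its output by hand. The names `first`, `second`, `a_1` and `by_hand` are ours.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(6, 4),
    nn.Sigmoid(),
    nn.Linear(4, 3)
)
x = torch.tensor([[2., 4., 3., 2., 3., 5.]])

first, second = net[0], net[2]
print(first.weight.shape)    # torch.Size([4, 6])
print(second.weight.shape)   # torch.Size([3, 4])

# Reproduce the forward pass by hand, transposing each stored weight matrix.
a_1 = torch.sigmoid(x @ first.weight.T + first.bias)
by_hand = a_1 @ second.weight.T + second.bias

print(torch.allclose(by_hand, net(x), atol=1e-5))   # True
```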

### Summary

In this lesson, we learned how to build a neural network with multiple layers.  We can imagine that the earlier layers make more concrete assessments -- like assessing specific features -- and pass these outputs to later layers to make more abstract assessments.

<img src="./big_layers.svg" width="20%">

We saw that under the hood, this works through matrix algebra.  In our example above, a single observation with six features is multiplied by the six weights of each of four neurons, producing one output per neuron.  These four outputs are passed to the sigmoid layer, still resulting in four outputs.  Those four outputs are then passed to each of the three neurons in the second layer, each with four weights, so the final output has three entries.  We saw this in PyTorch with the following:

In [91]:
net = nn.Sequential(
    nn.Linear(6, 4),
    nn.Sigmoid(),
    nn.Linear(4, 3)
)

x_2 # tensor([[2., 4., 3., 2., 3., 5.]])

net(x_2)

tensor([[-0.8253, -0.5326, -0.6160]], grad_fn=<AddmmBackward>)
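
The same two rules also cover several observations at once. If we stack observations as rows, the first dimension simply carries through as the outer dimension. The sketch below assumes a freshly initialized network, and the second and third rows of `X` are made-up values used only to show the shapes.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(6, 4), nn.Sigmoid(), nn.Linear(4, 3))

# Three observations stacked as rows (rows two and three are made-up values).
X = torch.tensor([[2., 4., 3., 2., 3., 5.],
                  [1., 3., 2., 1., 4., 2.],
                  [3., 5., 1., 2., 2., 6.]])

print(X.shape)        # torch.Size([3, 6])
print(net(X).shape)   # torch.Size([3, 3])
```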
