# 04: Matrix Notation

So far, the examples have written out an individual term for every parameter, data feature, and intermediate value in the network. This makes it clear exactly what is going on at each step, but the expressions quickly become long and clumsy, and computing them one term at a time is inefficient.

Here we show how we can represent networks with matrices, creating a much simpler (if less explicit) notation.

## Example Network and Definitions

We'll refer back to the network below throughout this section. It has an unspecified number of layers (and nodes in each layer), but we assume 2 input data features, 3 nodes in the first layer, and 1 output node for the examples.

![](../img/04_matrix_network.png)

The figure is labelled with the notation we'll now use to represent the values and parameters of whole layers, rather than of individual nodes and connections. The terms are:

**Size of the network and dataset:**

- $L$ - the number of layers in the network
- $n^{[l]}$ - the number of nodes in layer $[l]$
  - $n^{[0]}$ - the number of input features (the number of nodes $x_i$ in the input layer)
  - $n^{[L]}$ - the number of outputs (one in all the networks we've considered so far)
- $m$ - the number of data samples per gradient descent iteration (aka the batch size)

**Parameters of, and computed values in, the network:**

- $W^{[l]}$ - the weights (arrows) between all the activations in layer $[l-1]$ and layer $[l]$, dimensions: $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$ - the bias for all the nodes in layer $[l]$, dimensions: $(n^{[l]}, 1)$
- $Z^{[l]}$ - the linear predictor for all the nodes in layer $[l]$ for all the data samples, dimensions: $(n^{[l]}, m)$
- $A^{[l]}$ - the activation for all the nodes in layer $[l]$ for all the data samples, dimensions: $(n^{[l]}, m)$
  - $X = A^{[0]}$ - we can represent the input dataset as the activations of the zeroth layer
  - $\hat{y} = A^{[L]}$ - the activations of the last layer are the predictions $\hat{y}$
- $\mathcal{L}(y, \hat{y})$ - the total loss (cost) for all the data samples, dimensions: (1)


## Forward Pass

### Value for a single node and data sample: Dot products

In the network above, the value of $z_1^{[1](j)}$ (the first node in the first hidden layer) for a single data point, $(j)$, can be written as:

$$
z_1^{[1](j)} = w_{1 \rightarrow 1}^{[1]} x_1^{(j)} + w_{2 \rightarrow 1}^{[1]} x_2^{(j)} + b_1^{[1]}
$$

The first two terms on the right (multiplying the inputs by the weights) can be expressed as a _dot product_:

$$
z_1^{[1](j)} = \mathbf{w_{1}^{[1]} . x^{(j)}} + b_1^{[1]}
$$

where:

- $\mathbf{w_{1}^{[1]}}$ is a _row vector_ of all the weights to the first node in the first layer
- $\mathbf{x}^{(j)}$ is a _column vector_ of all the data feature values for data sample $(j)$
- $b_1^{[1]}$ is the bias term for the first node in the first layer

which can be expanded as follows:

$$
z_1^{[1](j)} =
\begin{bmatrix}w_{1 \rightarrow 1}^{[1]} & w_{2 \rightarrow 1}^{[1]}\end{bmatrix}
\begin{bmatrix}
x_{1}^{(j)} \\
x_{2}^{(j)} \\
\end{bmatrix}
 + b_1^{[1]}
$$

and represents the same expression as the first equation above.
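As a quick check that the two forms agree, here is a minimal NumPy sketch. The weight, feature, and bias values are made up for illustration, and the variable names (`w_1`, `x_j`, `b_1`) are just our own labels for the symbols above.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
w_1 = np.array([0.5, -1.0])  # weights into node 1 of layer 1
x_j = np.array([2.0, 3.0])   # feature values for one data sample (j)
b_1 = 0.25                   # bias for node 1 of layer 1

# Writing out every term explicitly
z_explicit = w_1[0] * x_j[0] + w_1[1] * x_j[1] + b_1

# The same value as a dot product plus the bias
z_dot = np.dot(w_1, x_j) + b_1

print(z_explicit, z_dot)
```

Both expressions give the same value; the dot product form stays the same length however many input features there are.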

### Values for all nodes in a layer for all data in a batch: Matrix multiplication

By moving from vectors to matrices we can represent the terms for a whole layer (rather than a single node) and for a whole dataset in an efficient way.

Sticking with the first layer in the network above as an example, the six weights (2 input nodes * 3 layer 1 nodes) can be represented as:

$$
\mathbf{W^{[1]}} = 
\begin{bmatrix}
w_{1 \rightarrow 1}^{[1]} & w_{2 \rightarrow 1}^{[1]} \\
w_{1 \rightarrow 2}^{[1]} & w_{2 \rightarrow 2}^{[1]} \\
w_{1 \rightarrow 3}^{[1]} & w_{2 \rightarrow 3}^{[1]}
\end{bmatrix}
$$

where:

- each _row_ corresponds to a node in the current layer (the 1st layer in this case) - there are $n^{[1]}$ rows.
- each _column_ corresponds to a node in the previous layer (the zeroth layer in this case, aka the data inputs) -  there are $n^{[0]}$ columns.
- the values are the weights between those nodes (e.g. the value at row 3, column 2 is the weight between the 2nd input, $x_2$, and the 3rd node in the first layer).

And the data as:

$$
\mathbf{X} = \mathbf{A^{[0]}} =
\begin{bmatrix} 
x_{1}^{(1)} & x_{1}^{(2)} & x_{1}^{(3)} & \dots & x_{1}^{(m)} \\
x_{2}^{(1)} & x_{2}^{(2)} & x_{2}^{(3)} & \dots & x_{2}^{(m)} \\
\end{bmatrix}
$$

where:

- each _row_ corresponds to a feature in the data (or more generally a node activation in the previous layer) - there are $n^{[0]}$ rows.
- each _column_ corresponds to a data sample - there are $m$ columns.

The linear predictor values for all the nodes in the first layer can then be expressed as:

$$
\mathbf{Z^{[1]} = W^{[1]} X + b^{[1]}}
$$

The first term, $\mathbf{W^{[1]} X}$, is a _matrix multiplication_. By the definition of matrix multiplication, and given the way we have defined $\mathbf{W^{[1]}}$ and $\mathbf{X}$, the result contains the dot product of the weights and the inputs for every node (each of the 3 nodes in the first layer) and every data sample:

$$
\mathbf{W^{[1]} X} =
\begin{bmatrix}
\mathbf{w_{1}^{[1]} . x^{(1)}} & \mathbf{w_{1}^{[1]} . x^{(2)}} & \mathbf{w_{1}^{[1]} . x^{(3)}} & \dots &  \mathbf{w_{1}^{[1]} . x^{(m)}} \\
\mathbf{w_{2}^{[1]} . x^{(1)}} & \mathbf{w_{2}^{[1]} . x^{(2)}} & \mathbf{w_{2}^{[1]} . x^{(3)}} & \dots &  \mathbf{w_{2}^{[1]} . x^{(m)}} \\
\mathbf{w_{3}^{[1]} . x^{(1)}} & \mathbf{w_{3}^{[1]} . x^{(2)}} & \mathbf{w_{3}^{[1]} . x^{(3)}} & \dots &  \mathbf{w_{3}^{[1]} . x^{(m)}}
\end{bmatrix}
$$
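The claim that each entry of $\mathbf{W^{[1]} X}$ is a dot product can be checked numerically. The sketch below uses random values and an arbitrary batch size of 5; the names `W1`, `X`, and `WX` are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 2, 3, 5  # input features, layer 1 nodes, batch size (m chosen arbitrarily)

W1 = rng.normal(size=(n1, n0))  # one row per layer 1 node, one column per input
X = rng.normal(size=(n0, m))    # one row per feature, one column per data sample

WX = W1 @ X  # matrix multiplication, shape (n1, m)

# Entry (i, j) is the dot product of row i of W1 with column j of X
for i in range(n1):
    for j in range(m):
        assert np.isclose(WX[i, j], np.dot(W1[i, :], X[:, j]))

print(WX.shape)
```

The single `@` replaces the double loop over nodes and samples.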

### Broadcasting

$\mathbf{b^{[1]}}$, the last term in $\mathbf{Z^{[1]}}$, is a _column vector_ containing the bias for each node in the layer:

$$
\mathbf{b^{[1]}} =
\begin{bmatrix} 
b_{1}^{[1]} \\
b_{2}^{[1]} \\
b_{3}^{[1]} \\
\end{bmatrix}
$$

But now there's a problem: we need to add $\mathbf{W^{[1]} X}$ and $\mathbf{b^{[1]}}$, but they have different dimensions, $(3, m)$ and $(3, 1)$ respectively.

We want to add the same bias values for each data sample. To do this we use _broadcasting_ (which might be familiar to you from `numpy`). In other words, we create a $(3, m)$ matrix by copying $\mathbf{b^{[1]}}$ $m$ times:

$$
\begin{bmatrix} 
b_{1}^{[1]} & b_{1}^{[1]} & b_{1}^{[1]} & \dots & b_{1}^{[1]} \\
b_{2}^{[1]} & b_{2}^{[1]} & b_{2}^{[1]} & \dots & b_{2}^{[1]} \\
b_{3}^{[1]} & b_{3}^{[1]} & b_{3}^{[1]} & \dots & b_{3}^{[1]}
\end{bmatrix}
$$

where each column is $\mathbf{b^{[1]}}$, and there are $m$ columns. This matrix has the same dimensions as $\mathbf{W^{[1]} X}$, so the two can now be added together.
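In `numpy` this happens automatically when a $(3, m)$ array is added to a $(3, 1)$ array. The sketch below compares the broadcast sum with an explicit copy made using `np.tile`. The values are placeholders, and `WX` stands in for the result of the matrix multiplication.

```python
import numpy as np

m = 4  # batch size, chosen arbitrarily
WX = np.arange(3 * m, dtype=float).reshape(3, m)  # stand-in for W1 @ X, shape (3, m)
b1 = np.array([[10.0], [20.0], [30.0]])           # column vector of biases, shape (3, 1)

# NumPy broadcasts b1 across the m columns
Z_broadcast = WX + b1

# Explicitly copying b1 m times gives the same result
b1_copied = np.tile(b1, (1, m))  # shape (3, m)
Z_explicit = WX + b1_copied

print(np.array_equal(Z_broadcast, Z_explicit))
```

In practice NumPy does not build the copied matrix in memory; the copies are a way of thinking about the result rather than a description of how it is computed.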

### Summary

We've now seen all the structures and operations that are needed to represent a forward pass through the network for a whole batch of data.

The general form for computing the linear predictor in any layer is:

$$
\mathbf{Z^{[l]}} = \mathbf{W^{[l]}} \mathbf{A^{[l-1]}} + \mathbf{b^{[l]}} \\
(n^{[l]}, m) = (n^{[l]}, n^{[l-1]}) \times (n^{[l-1]}, m) + (n^{[l]}, 1) \\
$$

where the second line shows the dimensions of each term.

To compute the node activations $\mathbf{A^{[l]}}$, we apply the activation function $g$ (element-wise):

$$
\mathbf{A^{[l]}} = g^{[l]}(\mathbf{Z^{[l]}})
$$

where $g^{[l]}$ is the activation function for layer $l$ (different layers may use different activation functions).
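Putting the two equations together, a forward pass is a loop over the layers. The following is a minimal sketch rather than a definitive implementation: the `forward` function, the choice of ReLU for the hidden layer and sigmoid for the output, and the random parameter values are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0.0, Z)

def forward(X, weights, biases, activations):
    """Return the linear predictors Z and activations A for every layer.

    weights[i], biases[i] and activations[i] belong to layer l = i + 1.
    """
    A = X              # A^{[0]} is the input data
    Zs, As = [], [X]
    for W, b, g in zip(weights, biases, activations):
        Z = W @ A + b  # (n_l, m) = (n_l, n_{l-1}) x (n_{l-1}, m) + broadcast (n_l, 1)
        A = g(Z)       # activation function applied element-wise
        Zs.append(Z)
        As.append(A)
    return Zs, As

# Example network: 2 input features, 3 hidden nodes, 1 output
rng = np.random.default_rng(0)
layer_sizes = [2, 3, 1]  # n^{[0]}, n^{[1]}, n^{[2]}
m = 5                    # batch size, chosen arbitrarily

weights = [rng.normal(size=(layer_sizes[l], layer_sizes[l - 1]))
           for l in range(1, len(layer_sizes))]
biases = [np.zeros((layer_sizes[l], 1)) for l in range(1, len(layer_sizes))]
X = rng.normal(size=(layer_sizes[0], m))

Zs, As = forward(X, weights, biases, [relu, sigmoid])
y_hat = As[-1]  # predictions, shape (1, m)
print(y_hat.shape)
```

Keeping every $\mathbf{Z^{[l]}}$ and $\mathbf{A^{[l]}}$ (rather than only the final output) is a design choice that makes the values available for the backward pass.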

## Backward Pass

The backward pass can be written in the same notation. The equations below are from https://www.coursera.org/learn/neural-networks-deep-learning/supplement/E79Uh/clarification-for-what-does-this-have-to-do-with-the-brain


$$
\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{1}{m} \frac{\partial \mathcal{L}}{\partial Z^{[l]}} A^{[l-1]^T} \\
\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{1}{m} \sum_{\mathrm{cols}} \frac{\partial \mathcal{L}}{\partial Z^{[l]}} \\
$$

$$
\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = A^{[L]} - Y \\
\frac{\partial \mathcal{L}}{\partial Z^{[l]}} = W^{[l+1]^T} \frac{\partial \mathcal{L}}{\partial Z^{[l+1]}} * \frac{\partial A^{[l]}}{\partial Z^{[l]}}
$$
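Below is a minimal NumPy sketch of a backward pass in this notation. It assumes sigmoid activations in every layer and a mean binary cross-entropy cost, which is the setting in which $\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = A^{[L]} - Y$ holds. The `*` in the equations is element-wise multiplication, and the weights used to step back from layer $l+1$ to layer $l$ are those of layer $l+1$, which is what makes the dimensions match. The function names and random values are our own, and a finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, Ws, bs):
    """Return [A0, A1, ..., AL], assuming sigmoid activations in every layer."""
    As = [X]
    for W, b in zip(Ws, bs):
        As.append(sigmoid(W @ As[-1] + b))
    return As

def cost(X, Y, Ws, bs):
    """Mean binary cross-entropy over the batch (an assumed choice of loss)."""
    A_L = forward(X, Ws, bs)[-1]
    return -np.mean(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L))

def backward(X, Y, Ws, bs):
    """Gradients of the cost with respect to every W and b."""
    m = X.shape[1]
    As = forward(X, Ws, bs)
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    dZ = As[-1] - Y                         # output layer: A^{[L]} - Y
    for i in reversed(range(len(Ws))):      # Ws[i] holds W^{[l]} with l = i + 1
        dWs[i] = dZ @ As[i].T / m           # (n_l, m) x (m, n_{l-1})
        dbs[i] = dZ.sum(axis=1, keepdims=True) / m  # sum over columns (samples)
        if i > 0:
            # Step back one layer; the sigmoid derivative is A * (1 - A)
            dZ = (Ws[i].T @ dZ) * As[i] * (1 - As[i])
    return dWs, dbs

rng = np.random.default_rng(0)
sizes, m = [2, 3, 1], 6  # layer sizes and batch size, chosen arbitrarily
Ws = [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
bs = [rng.normal(size=(sizes[l], 1)) for l in range(1, len(sizes))]
X = rng.normal(size=(sizes[0], m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)

dWs, dbs = backward(X, Y, Ws, bs)

# Finite-difference check on a single weight in the first layer
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_minus = [W.copy() for W in Ws]
Ws_plus[0][2, 1] += eps
Ws_minus[0][2, 1] -= eps
numeric = (cost(X, Y, Ws_plus, bs) - cost(X, Y, Ws_minus, bs)) / (2 * eps)
print(abs(numeric - dWs[0][2, 1]))
```

Each gradient has the same shape as the parameter it belongs to, so `dWs[i]` and `dbs[i]` can be subtracted directly from `Ws[i]` and `bs[i]` in a gradient descent step.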
