# Notation
___

## Reference for tricky symbols
[Tricky symbols](http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html)

[Table Generator](http://www.tablesgenerator.com/)

[Fonts in Latex](https://www.sharelatex.com/learn/Mathematical_fonts)

[Integrals, sums and limits](https://www.sharelatex.com/learn/Integrals,_sums_and_limits)

$\mathbb{R}$ = `$\mathbb{R}$`

## Model structure

### Some general rules

We use square brackets for layers, and round brackets for training examples.

Training examples are spaced out horizontally as column vectors - i.e. the second value in the matrix shape $(n_x, m)$ generally corresponds to the number of training examples.

### The general matrices

We have 6 key matrices that we'll play with, which we can divide into two groups.

The first four are representations of our features:
* $X$, representing the input to our model - the training data. This has the shape $(n_x, m)$ where $n_x$ is the number of features, and $m$ is the number of training samples.
* $Y$, representing the target outputs (labels) for our model. This may have a number of shapes, depending on the model structure we're chasing, but we can generally describe it as $(n_{x\ final}, m)$ where $n_{x\ final}$ is the number of neurons in our final (output) layer and $m$ is the number of training samples.
* $Z$, representing the inputs to a layer after they have undergone a linear transformation, shape $(n_x, m)$, where $n_x$ is the number of neurons on a given layer and $m$ is the number of samples.
* $A$, representing the activated linearised inputs, shape $(n_x, m)$, where $n_x$ is the number of neurons on a given layer and $m$ is the number of samples. This is essentially $Z$ but transformed.

The second group of two are the parameters of our model:
* $W$, representing the weights of the model. This is of shape $(n_x, n_{x-1})$, where $n_x$ and $n_{x-1}$ are the number of neurons in the current and previous layers, respectively.
* $B$, representing the bias for each layer of the model. This is of shape $(n_x, 1)$, where $n_x$ is the number of neurons on the relevant layer. The second dimension is 1 because each neuron has a single bias, which is shared across all $m$ training examples.

So given this:

$X$ = $\begin{bmatrix}
    \vdots & \vdots & \vdots &        & \vdots \\
    x^{(1)}  & x^{(2)}  & x^{(3)}  & \dots  & x^{(m)} \\
    \vdots & \vdots & \vdots &        & \vdots \\
\end{bmatrix}$ = feature vectors stacked as columns

We would have 

$Z^{[1]} = W^{[1]}X + B^{[1]}$

and 

$A^{[1]} = {\sigma}(Z^{[1]})$
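The two equations above can be sketched in NumPy to make the shapes concrete. This is a minimal illustration, not a reference implementation: the layer sizes, the random values, and the helper names `sigmoid` and `forward_layer` are assumptions made for the example, and $\sigma$ is taken to be the logistic function.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, used here as the activation sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, B, A_prev):
    """Compute Z = W A_prev + B and A = sigma(Z) for one layer."""
    Z = W @ A_prev + B  # B has shape (n, 1) and broadcasts across the m columns
    A = sigmoid(Z)
    return Z, A

rng = np.random.default_rng(0)
n_0, n_1, m = 3, 4, 5              # illustrative sizes only
X = rng.normal(size=(n_0, m))      # one training example per column
W1 = rng.normal(size=(n_1, n_0))   # (neurons in layer 1, neurons in layer 0)
B1 = np.zeros((n_1, 1))            # one bias per neuron

Z1, A1 = forward_layer(W1, B1, X)
print(Z1.shape, A1.shape)  # (4, 5) (4, 5)
```

Both $Z^{[1]}$ and $A^{[1]}$ keep one column per training example, which is why the second dimension stays $m$ throughout.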


### Some special cases

Some simple cases:

$A^{[0]} = X$, i.e. the 'activated output' of the zeroth layer is the input to the model.

$A^{[n]} = \hat{Y}$, i.e. the output of the last activation layer is the prediction.

### Backward propagation

$dz^{[l]} = da^{[l]}\times g^{[l]\prime}(z^{[l]})$, or

$dz^{[l]} = W^{[l+1]T}dz^{[l+1]}\times g^{[l]\prime}(z^{[l]})$

$dW^{[l]} = dz^{[l]}.a^{[l-1]T}$

$db^{[l]} = dz^{[l]}$

$da^{[l-1]} = W^{[l]T}.dz^{[l]}$
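As a rough check on the shapes in these formulas, here is a sketch of the backward step for one layer and a single training example. It assumes $g$ is the sigmoid and treats $\times$ as element-wise multiplication; the function name `backward_layer`, the sizes, and the random inputs are hypothetical and only there to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed activation g)."""
    return 1.0 / (1.0 + np.exp(-z))

def backward_layer(da, z, a_prev, W):
    """Backward step for one layer and one training example.

    All vectors are columns: da and z are (n, 1), a_prev is (n_prev, 1),
    and W is (n, n_prev).
    """
    s = sigmoid(z)
    dz = da * s * (1.0 - s)   # dz = da * g'(z), element-wise
    dW = dz @ a_prev.T        # (n, 1) @ (1, n_prev) -> (n, n_prev), same as W
    db = dz                   # same shape as b
    da_prev = W.T @ dz        # (n_prev, n) @ (n, 1) -> (n_prev, 1)
    return dW, db, da_prev

rng = np.random.default_rng(1)
n_prev, n = 3, 2                       # illustrative layer sizes
W = rng.normal(size=(n, n_prev))
b = rng.normal(size=(n, 1))
a_prev = rng.normal(size=(n_prev, 1))
z = W @ a_prev + b
da = rng.normal(size=(n, 1))           # stand-in for the gradient from the next layer

dW, db, da_prev = backward_layer(da, z, a_prev, W)
print(dW.shape, db.shape, da_prev.shape)  # (2, 3) (2, 1) (3, 1)
```

Each gradient has the same shape as the quantity it differentiates with respect to, which is a quick way to spot a missing transpose.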

## Binary classification
$(x,y)$ = a generic (feature, label) pair from the training set, where $x\in\mathbb{R}^{n_x}, y\in\{0,1\}$

$(x^{(1)}, y^{(1)})$ = single training example

$(x^{\{1\}}, y^{\{1\}})$ = single mini-batch extracted from a full training set

$x$ = feature vector

$X$ = $\begin{bmatrix}
    \vdots & \vdots & \vdots &        & \vdots \\
    x^{(1)}  & x^{(2)}  & x^{(3)}  & \dots  & x^{(m)} \\
    \vdots & \vdots & \vdots &        & \vdots \\
\end{bmatrix}$ = feature vectors stacked as columns

Note that we have $m$ samples as columns and $n_x$ features as rows. In Python, this comes out as `X.shape=(n_x, m)`.

$Y$ = $\begin{bmatrix} y^{(1)} & y^{(2)} & y^{(3)} & \dots & y^{(m)} \end{bmatrix}$ = labels stacked as a row vector.

In Python, this comes out as `Y.shape=(1,m)`.
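The layout of $X$ and $Y$ can be illustrated with a small NumPy sketch. The three examples and their labels below are made up purely for illustration.

```python
import numpy as np

# Three hypothetical training examples, each with n_x = 2 features.
examples = [np.array([0.5, 1.0]), np.array([1.5, -0.3]), np.array([0.0, 2.0])]
labels = [0, 1, 1]

X = np.column_stack(examples)         # each example becomes one column
Y = np.array(labels).reshape(1, -1)   # labels laid out as a single row

n_x, m = X.shape
print(X.shape, Y.shape)  # (2, 3) (1, 3)

x_1 = X[:, 0]   # the first training example, x^(1)
y_1 = Y[0, 0]   # its label, y^(1)
```

Selecting a column of `X` recovers one training example, and the matching column of `Y` gives its label.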

$m$ = number of sample pairs

$m_{train}$ = number of sample pairs in train set
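The mini-batch notation $(x^{\{t\}}, y^{\{t\}})$ can be pictured as a column-wise slice of $X$ and $Y$. The sketch below assumes consecutive, unshuffled batches; the helper name `make_mini_batches` and the data are hypothetical.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split X (n_x, m) and Y (1, m) column-wise into consecutive mini-batches."""
    m = X.shape[1]
    return [
        (X[:, start:start + batch_size], Y[:, start:start + batch_size])
        for start in range(0, m, batch_size)
    ]

X = np.arange(10).reshape(2, 5)   # n_x = 2, m = 5 (illustrative values)
Y = np.array([[0, 1, 0, 1, 1]])
batches = make_mini_batches(X, Y, batch_size=2)
print([xb.shape for xb, _ in batches])  # [(2, 2), (2, 2), (2, 1)]
```

The last mini-batch is smaller whenever $m$ is not a multiple of the batch size.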