# Week 4 Notes

## Deep L Layer Network

### Notation: L Layers Network and Hidden (L-1) Layer Networks

We don't count the input layer when deciding how many layers a neural network architecture has. Some refer to the input layer as Layer $0$. 

We usually refer to an $L$-layer network as one having $L$ layers, not counting the input layer. 

Take one off for the output layer and we get $L-1$ hidden layers, that is, layers whose values are never observed, only inferred.

### Shallow and Deep Layer Networks

Shallow and deep are relative terms. I tend to think of a network with more than three or four layers as deep; a network with one or two layers is considered shallow.

![Example Network](https://kharshit.github.io/img/deep_neural_net.png)

Networks with 101 layers have been trained successfully, such as Microsoft's ResNet-101 for computer vision problems.

## Forward Propagation In A Deep Network

Forward propagation is straightforward. We set out some notation first; the recurrence equations that propagate values from layer to layer follow from it.

### Notation

- $n^{[i]}$ is the number of neurons at layer $i$, $i=1, \ldots, L$. Note that $n^{[0]} = n_x = n$ is the number of features for each example.

- $w^{[i]}_{j, k}$ is a scalar: the $k^{th}$ weight of neuron $j$ in layer $i$ (for $k=1,\ldots,n^{[i-1]}$), that is, the weighting put on the $k^{th}$ input of layer $i$ in the linear combination of neuron $j$.

- $b^{[i]}_{j}$ is a scalar, that is the bias of the linear combination in the neuron $j$ at layer $i$.

- $z^{[i]}_{j}$ is a scalar. It is the linear combination in neuron $j$ at layer $i$: the sum of products of the weights and inputs to neuron $j$ at layer $i$, plus the bias $b^{[i]}_{j}$. The inputs are $a^{[i-1]}_{1}, \ldots, a^{[i-1]}_{n^{[i-1]}}$. That is:

- $z^{[i]}_{j} = \sum_{k=1}^{n^{[i-1]}} w^{[i]}_{j,k} \, a^{[i-1]}_{k} + b^{[i]}_{j}$

- $g^{[i]}_{j}$ is the activation function applied to the linear combination $z^{[i]}_{j}$ in neuron $j$ at layer $i$.

- $a^{[i]}_{j} = g^{[i]}_{j}(z^{[i]}_{j})$ is the scalar activated value which neuron $j$ of layer $i$ outputs; it is an input to the next layer, $i+1$. There are $n^{[i]}$ such activations.
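The definitions above can be sketched for one neuron as follows. This is a minimal illustration, not part of the original notes: the input values, weights, bias, and the choice of a sigmoid for $g^{[i]}_{j}$ are all made-up assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here only as an example choice of activation g."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: neuron j in layer i receives 3 activations from layer i-1.
a_prev = np.array([0.5, -1.0, 2.0])  # a^{[i-1]}_1, ..., a^{[i-1]}_3
w_j = np.array([0.1, 0.2, 0.3])      # w^{[i]}_{j,1}, ..., w^{[i]}_{j,3}
b_j = 0.05                           # b^{[i]}_j

z_j = float(np.dot(w_j, a_prev) + b_j)  # linear combination: 0.05 - 0.2 + 0.6 + 0.05
a_j = float(sigmoid(z_j))               # activated value passed on to layer i+1
```

Both `z_j` and `a_j` are scalars, matching the notation: one linear combination and one activation per neuron.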

#### Compressed Vector Notation

These can be combined into matrices and vectors over the neurons in each layer, to make the notation more compact. The quantities that interest us (and are important for forward and backward propagation) are the matrices $W$, $Z$, $A$, $B$ and $G$. There is one of each for every layer $i$, with $i$ ranging over the $L$ layers.

##### Weights and Bias

- $\mathbf{n} = [n^{[0]}, n^{[1]}, ...,n^{[L]}]$ is the vector giving the number of neurons in each layer of the network architecture. 

- $w^{[i]}_{j}$ is the vector of weights on layer $i$ which maps the inputs of neuron $j$ to the linear combination. The shape of $w^{[i]}_{j}$ is $(n^{[i-1]}, 1)$, that is to say it is a column vector of length $n^{[i-1]}$. There are $n^{[i]}$ such vectors.

- $W^{[i]}$ is the matrix of weight vectors for layer $i$. The $j^{th}$ row vector corresponds to $(w^{[i]}_{j})^T$. In other words, $w^{[i]}_{j, k}$ is the entry for the $j^{th}$ row and $k^{th}$ column of $W^{[i]}$. The shape of $W^{[i]}$ is $(n^{[i]}, n^{[i-1]})$.

- $B^{[i]}$ is the column vector of biases for layer $i$. It has shape $(n^{[i]},1)$. The $j^{th}$ component of $B^{[i]}$ is the bias parameter of neuron $j$ on layer $i$.
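As a sketch of how these shapes follow from the layer-size vector $\mathbf{n}$, the snippet below allocates one weight matrix and one bias vector per layer. The function name `init_params`, the small random scaling, and the example sizes are assumptions made for illustration.

```python
import numpy as np

def init_params(n, seed=0):
    """Allocate W[i] of shape (n[i], n[i-1]) and B[i] of shape (n[i], 1) for i = 1..L.

    n is the list [n0, n1, ..., nL]; the returned dicts are keyed by layer index.
    """
    rng = np.random.default_rng(seed)
    W, B = {}, {}
    for i in range(1, len(n)):
        W[i] = rng.standard_normal((n[i], n[i - 1])) * 0.01  # small random weights
        B[i] = np.zeros((n[i], 1))                           # biases start at zero
    return W, B

n = [4, 5, 3, 1]  # hypothetical: 4 features, two hidden layers, 1 output neuron
W, B = init_params(n)
shapes = {i: (W[i].shape, B[i].shape) for i in W}
```

Row $j$ of `W[i]` holds the weights of neuron $j$ in layer $i$, which is why the row count is $n^{[i]}$ and the column count is $n^{[i-1]}$.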

###  Z, G, A: With Single Example Input

For each layer $i$ there are Z, G and A vectors.

- $Z_j^{[i]}$: The $j^{th}$ component of column vector $Z^{[i]}$. This is the linear combination that goes through the activation function in neuron $j$ of layer $i$. For a single example as input, $Z^{[i]}$ has shape $(n^{[i]}, 1)$.

- $G_j^{[i]}$: The $j^{th}$ component of column vector $G^{[i]}$. This is the activation function definition in neuron $j$ of layer $i$. For a single example as input, $G^{[i]}$ has shape $(n^{[i]}, 1)$.

- $A_j^{[i]}$: The $j^{th}$ component of column vector $A^{[i]}$. This is the activated value in neuron $j$ of layer $i$. For a single example as input, $A^{[i]}$ has shape $(n^{[i]}, 1)$. It is defined by the equation $A_j^{[i]} = G_j^{[i]}(Z_j^{[i]})$.

As we will see later, setting $a^{[0]} = x$, an individual example, produces the vectorized form of recurrence relation with matrices for A, G, and Z.


###  Z, G, A: With Full Feature Matrix As Input

These work as above, except that the input is a feature matrix whose columns are individual examples and whose rows are the features. The vectors above become matrices by adding more columns, one for each example. (Here $m$ denotes the number of examples, so the feature matrix has shape $(n^{[0]}, m)$.)

- $Z_j^{[i] (k)}$: The $j^{th}$ row of the $k^{th}$ column of matrix $Z^{[i]}$. This is the linear combination that goes through the activation function in neuron $j$ of layer $i$ for the $k^{th}$ example. For the feature matrix as input, $Z^{[i]}$ has shape $(n^{[i]}, m)$.

- $G_j^{[i] (k)}$: The $j^{th}$ row of the $k^{th}$ column of matrix $G^{[i]}$. This is the activation function definition in neuron $j$ of layer $i$, applied to the $k^{th}$ example (the function itself does not depend on $k$). For the feature matrix as input, $G^{[i]}$ has shape $(n^{[i]}, m)$.

- $A_j^{[i] (k)}$: The $j^{th}$ row of the $k^{th}$ column of matrix $A^{[i]}$. This is the activated value in neuron $j$ of layer $i$ for the $k^{th}$ example. It is defined by the equation $A_j^{[i] (k)} = G_j^{[i]}(Z_j^{[i] (k)})$. For the feature matrix as input, $A^{[i]}$ has shape $(n^{[i]}, m)$.


As we will see later, setting $A^{[0]} = X$, the feature matrix, produces the vectorized form of recurrence relation with matrices for A, G, and Z.
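A minimal NumPy sketch of that recurrence, assuming $m$ examples stored as the columns of $X$. The function name `forward`, the dict-per-layer layout, and the ReLU/sigmoid activations are illustrative assumptions, not something the notes prescribe. The same function handles a single example when it is passed as a one-column matrix.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, B, G):
    """Propagate X through layers 1..L and return the list [A0, A1, ..., AL].

    W, B, G are dicts keyed by layer index: weights, biases, activation functions.
    """
    A = [X]  # A[0] = X
    for i in range(1, len(W) + 1):
        Z = W[i] @ A[i - 1] + B[i]  # B[i] has shape (n_i, 1) and broadcasts over columns
        A.append(G[i](Z))
    return A

# Hypothetical network: 3 features, 4 hidden neurons, 1 output, m = 5 examples.
rng = np.random.default_rng(0)
n, m = [3, 4, 1], 5
W = {i: rng.standard_normal((n[i], n[i - 1])) for i in (1, 2)}
B = {i: np.zeros((n[i], 1)) for i in (1, 2)}
G = {1: relu, 2: sigmoid}

X = rng.standard_normal((n[0], m))
A = forward(X, W, B, G)                # full feature matrix: A[2] has shape (1, 5)
A_single = forward(X[:, [0]], W, B, G) # first example only:  A_single[2] has shape (1, 1)
```

The first column of `A[2]` agrees with `A_single[2]`, which is the sense in which the matrix form is just the single-example form stacked column by column.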



## Getting Your Matrix Dimensions Correct

A good sanity check is to confirm the dimensions of the parameter matrices. Below we note the dimensions in unvectorized as well as vectorized implementations. Note that $m$ is the number of examples, which is distinct from $n^{[0]}$, the number of features.

### Unvectorized Dimensions
Here we list `variable:shape` for unvectorized implementations. Shape is given as a numpy style tuple (dim1, dim2).

- $w^{[l]}:(n^{[l]}, n^{[l-1]})$

- $b^{[l]}:(n^{[l]}, 1)$

- $dw:(n^{[l]}, n^{[l-1]})$

- $db:(n^{[l]}, 1)$

- $z:(n^{[l]}, 1)$

- $a:(n^{[l]}, 1)$

- $dz:(n^{[l]}, 1)$

- $da:(n^{[l]}, 1)$

### Vectorized Dimensions
Here we list `variable:shape` for vectorized implementations, where $m$ is the number of examples. Shape is given as a numpy style tuple (dim1, dim2).

- $w^{[l]}:(n^{[l]}, n^{[l-1]})$

- $b^{[l]}:(n^{[l]}, 1)$, broadcast across the $m$ columns when added

- $dw:(n^{[l]}, n^{[l-1]})$

- $db:(n^{[l]}, 1)$

- $z:(n^{[l]}, m)$

- $a:(n^{[l]}, m)$

- $dz:(n^{[l]}, m)$

- $da:(n^{[l]}, m)$
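One way to automate this sanity check is a small helper that asserts every shape, sketched below. It assumes the row convention given earlier for the weight matrix, $(n^{[l]}, n^{[l-1]})$, a bias of shape $(n^{[l]}, 1)$ that is broadcast, and $m$ examples as columns. The helper name `check_shapes` and the example sizes are hypothetical.

```python
import numpy as np

def check_shapes(n, m, W, b, Z, A):
    """Assert that every array has the shape expected for layer sizes n and m examples."""
    assert A[0].shape == (n[0], m), "A[0] (the input X) should be (n0, m)"
    for l in range(1, len(n)):
        assert W[l].shape == (n[l], n[l - 1]), f"W[{l}] has shape {W[l].shape}"
        assert b[l].shape == (n[l], 1), f"b[{l}] has shape {b[l].shape}"
        assert Z[l].shape == (n[l], m), f"Z[{l}] has shape {Z[l].shape}"
        assert A[l].shape == (n[l], m), f"A[{l}] has shape {A[l].shape}"
    return True

# Hypothetical sizes: 3 features, layers of 4 and 2 neurons, 7 examples.
n, m = [3, 4, 2], 7
rng = np.random.default_rng(0)
W = {l: rng.standard_normal((n[l], n[l - 1])) for l in range(1, len(n))}
b = {l: np.zeros((n[l], 1)) for l in range(1, len(n))}
A, Z = {0: rng.standard_normal((n[0], m))}, {}
for l in range(1, len(n)):
    Z[l] = W[l] @ A[l - 1] + b[l]  # (n_l, n_{l-1}) @ (n_{l-1}, m) + (n_l, 1)
    A[l] = np.tanh(Z[l])           # tanh is an arbitrary choice for this check

shapes_ok = check_shapes(n, m, W, b, Z, A)
```

Calling such a check once after building the network tends to catch a transposed weight matrix before it turns into a confusing broadcasting error.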

## Why Deep Representations

Deep representations tend to have a tree-like structure: the number of neurons per layer tends not to increase (and eventually decreases) until the final output vector at the final layer. There has been discussion about why such representations are so effective compared with a shallow representation that uses only one hidden layer. (In theory, a network with one hidden layer and enough neurons can approximate any continuous function on a bounded input region.)

### Building Up Local Features

Start by learning simple features like edges, then learn small local parts (like eyes, nose, lips, or phonemes such as "ca", "ta"), then more complex compositions of those (such as faces, or words like "cat"). This has some biological justification, if we believe that artificial neural networks build up features the way biological ones do.

### Reducing Parameters By Networking Neurons

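Some functions can be computed far more compactly by a deep network than by a shallow one. For example, the parity (XOR) of $n$ inputs can be computed by a deep network of depth on the order of $\log(n)$ with a modest number of neurons. A network with just one hidden layer can also learn such a function, but it needs on the order of $2^n$ neurons to do so. This is the power of having many connections across layers: they expand the possibilities much more.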
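One commonly cited instance of this gap is the parity (XOR) of $n$ bits. The sketch below is only an illustration with XOR gates rather than trained neurons (each XOR would itself take a few neurons to implement). It pairs up inputs level by level and counts how many levels are needed; the function name `parity_tree` and the example bits are assumptions.

```python
def parity_tree(bits):
    """XOR adjacent pairs level by level; return (parity, number_of_levels)."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ...; an unpaired last bit is carried up.
        nxt = [level[k] ^ level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]      # n = 8 inputs, four of them set
parity, depth = parity_tree(bits)    # 8 -> 4 -> 2 -> 1 values: three levels
gates_used = len(bits) - 1           # a binary XOR tree uses n - 1 gates
odd_patterns = 2 ** (len(bits) - 1)  # input patterns with odd parity, for comparison
```

For 8 inputs the tree needs 3 levels and 7 gates, while the number of odd-parity input patterns (a rough proxy for what a single wide layer would have to distinguish) is already 128.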

## Building Blocks of Deep Neural Networks

The forward propagation, which calculates the output of the final layer, and the backward propagation, which calculates the gradient, share some calculations. 

![BackProp and ForwardProp working with Gradient Descent](https://cdn-images-1.medium.com/max/1600/1*I_fAEeJbT_G4hGkXBJ-PcQ.gif)



### Caching Terms From The Forward Propagation

Starting with initial parameters, one can propagate forward and cache the $A$'s and $Z$'s in each layer ($W$'s and $B$'s are available via a higher scope, though one could cache them too), and then use them to calculate the gradient at these parameters.

### Using Cached Terms For The Backward Propagation
As we propagate the gradient backward (initialized with the derivative of the loss function with respect to the final-layer output), we can use the values of the $A$'s and $Z$'s from the cache ($W$'s and $B$'s are available via a higher scope, though one could cache them too) to compute the gradients of the parameters. These can then be used to update the parameters, which are then used again in the next forward propagation. 
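The caching idea can be sketched as one forward pass that stores every $A$ and $Z$, followed by one backward pass that reads them back. This is a minimal sketch under stated assumptions: ReLU hidden layers, a sigmoid output with cross-entropy cost averaged over $m$ examples, and hypothetical names `forward_with_cache` and `backward_from_cache`.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_cache(X, W, b):
    """Forward pass that caches A[l] and Z[l] for every layer l."""
    L = len(W)
    A, Z = {0: X}, {}
    for l in range(1, L + 1):
        Z[l] = W[l] @ A[l - 1] + b[l]
        A[l] = sigmoid(Z[l]) if l == L else relu(Z[l])
    return A, Z

def cost(AL, Y):
    """Cross-entropy cost averaged over the m examples (columns)."""
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

def backward_from_cache(Y, W, A, Z):
    """Backward pass that reuses the cached A and Z to get dW[l] and db[l]."""
    L, m = len(W), Y.shape[1]
    dW, db = {}, {}
    dZ = A[L] - Y  # sigmoid output with cross-entropy: dZ^{[L]} = A^{[L]} - Y
    for l in range(L, 0, -1):
        dW[l] = dZ @ A[l - 1].T / m
        db[l] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = W[l].T @ dZ
            dZ = dA_prev * (Z[l - 1] > 0)  # ReLU derivative from the cached Z
    return dW, db

# Hypothetical data: 3 features, 4 hidden neurons, 1 output, m = 6 examples.
rng = np.random.default_rng(1)
n, m = [3, 4, 1], 6
W = {l: rng.standard_normal((n[l], n[l - 1])) * 0.5 for l in (1, 2)}
b = {l: np.zeros((n[l], 1)) for l in (1, 2)}
X = rng.standard_normal((n[0], m))
Y = (rng.random((1, m)) > 0.5).astype(float)

A, Z = forward_with_cache(X, W, b)
dW, db = backward_from_cache(Y, W, A, Z)

# One gradient descent step; the new parameters feed the next forward pass.
learning_rate = 0.1
W_new = {l: W[l] - learning_rate * dW[l] for l in W}
b_new = {l: b[l] - learning_rate * db[l] for l in b}
```

Without the cache, the backward pass would have to recompute each $Z$ and $A$ from the inputs, which is the repeated work the caching avoids.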


## Forward and Backward Propagation

There is a set of recurrence relations which allow us to: 

- propagate the inputs forward through the layers, using the parameters, and get outputs at the final layer. 

- propagate the gradient backward from those outputs, via the gradient equations, and get the gradient of the loss function with respect to the parameters.

We can use the caching described earlier to speed the calculation up, but it is not strictly necessary. 

Below we list the forward and backward propagation equations, in both unvectorized and vectorized forms.

### Forward Propagation Equations

#### Unvectorized
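
A sketch of the standard recurrence for a single example, assuming the lowercase notation of the dimensions section ($z^{[l]}$ and $a^{[l]}$ are column vectors of shape $(n^{[l]}, 1)$, and the weight matrix of layer $l$, written $W^{[l]}$ here, has shape $(n^{[l]}, n^{[l-1]})$). Starting from $a^{[0]} = x$, for $l = 1, \ldots, L$:

- $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

- $a^{[l]} = g^{[l]}(z^{[l]})$, with $g^{[l]}$ applied elementwise

The network output is $a^{[L]}$.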

#### Vectorized
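
A sketch of the same recurrence over all examples at once, assuming the examples are the columns of the feature matrix $X$ and $m$ is the number of examples. Starting from $A^{[0]} = X$, for $l = 1, \ldots, L$:

- $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, where $b^{[l]}$ of shape $(n^{[l]}, 1)$ is added to every column

- $A^{[l]} = g^{[l]}(Z^{[l]})$, with $g^{[l]}$ applied elementwise

Column $k$ of $A^{[L]}$ is the network output for the $k^{th}$ example.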

### Backward Propagation Equations

#### Unvectorized
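
A sketch of the standard backward recurrence for a single example. It assumes the pass starts from $da^{[L]}$, the derivative of the loss with respect to the final activation, and writes $\odot$ for the elementwise product and $g^{[l]\prime}$ for the derivative of the activation function. For $l = L, \ldots, 1$:

- $dz^{[l]} = da^{[l]} \odot g^{[l]\prime}(z^{[l]})$

- $dW^{[l]} = dz^{[l]} (a^{[l-1]})^T$

- $db^{[l]} = dz^{[l]}$

- $da^{[l-1]} = (W^{[l]})^T dz^{[l]}$

The shapes agree with the dimensions section: $dW^{[l]}$ is an $(n^{[l]}, 1)$ column times a $(1, n^{[l-1]})$ row, which has the same shape as $W^{[l]}$.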

#### Vectorized
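
A sketch of the same backward recurrence over all $m$ examples, assuming the cost is the average of the per-example losses (hence the $\frac{1}{m}$ factors) and that $dA^{[L]}$ holds the per-example derivatives of the loss with respect to the final activations. For $l = L, \ldots, 1$:

- $dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}(Z^{[l]})$

- $dW^{[l]} = \frac{1}{m} dZ^{[l]} (A^{[l-1]})^T$

- $db^{[l]} = \frac{1}{m} \sum_{k=1}^{m} dZ^{[l](k)}$, the average of the columns of $dZ^{[l]}$

- $dA^{[l-1]} = (W^{[l]})^T dZ^{[l]}$

Here $dZ^{[l](k)}$ denotes the $k^{th}$ column of $dZ^{[l]}$, so $db^{[l]}$ keeps the shape $(n^{[l]}, 1)$.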


## Parameters and Hyper-parameters

For any representation, there is a set of numbers to be determined. These are split into two parts:

- Those which are preset before learning from data, and are tuned by repeatedly evaluating candidate presets under some search strategy and keeping the best set.

- Those which can be learned directly from the data.

### Parameters

The parameters are learned from the data for the model specified. Usually an iterative process that minimizes the model loss learns a best set of parameters, given the presets of the hyperparameters. By tuning the hyperparameters, the best generalization performance can be found.

### Hyperparameters

The hyperparameters need to be tuned by one of several methods:

* grid search: expensive but exhausts a certain space. 

* random search: might be more effective.

* Bayesian hyperparameter optimization: using a framework that balances exploring new candidates against exploiting what is already known about the hyperparameters.

They can be considered as a kind of empirical Bayes, where their distribution is learned from repeatedly setting them and then fitting the model to the data. 

The preferred way to select between hyperparameter settings is to evaluate the same loss function on data (a validation or dev set) which has not been used to learn the parameters.
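As a rough sketch of random search with a dev set, the snippet below tunes a single hyperparameter, the learning rate, for logistic regression (a one-layer network). Everything specific here is an assumption made for illustration: the synthetic data, the 150/50 split, the log-uniform sampling range, and the helper names `train` and `loss`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate, steps=200):
    """Learn the parameters W, b by gradient descent for a fixed learning rate."""
    n_x, m = X.shape
    W, b = np.zeros((1, n_x)), 0.0
    for _ in range(steps):
        A = sigmoid(W @ X + b)
        dZ = A - Y
        W -= learning_rate * (dZ @ X.T) / m
        b -= learning_rate * float(np.mean(dZ))
    return W, b

def loss(X, Y, W, b):
    """Cross-entropy loss; predictions are clipped to keep the logs finite."""
    A = np.clip(sigmoid(W @ X + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Synthetic data: 2 features, 200 examples as columns, split into train and dev.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + 0.5 * X[1:2] > 0).astype(float)
X_train, Y_train = X[:, :150], Y[:, :150]
X_dev, Y_dev = X[:, 150:], Y[:, 150:]

# Random search: sample learning rates log-uniformly between 1e-3 and 1.
candidates = 10 ** rng.uniform(-3, 0, size=8)
dev_losses = [loss(X_dev, Y_dev, *train(X_train, Y_train, lr)) for lr in candidates]
best_lr = float(candidates[int(np.argmin(dev_losses))])
```

The parameters `W` and `b` are learned on the training columns only; the dev columns are used solely to compare hyperparameter candidates, which is the separation described above.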

## What Does This Have To Do With The Brain

### Similarity With Dendrites, Soma and Axons

There is some superficial similarity between how a real neuron works and an artificial neuron. 

![Dendrites, Soma and Axons](http://oerpub.github.io/epubjs-demo-book/resources/1206_The_Neuron.jpg)

Dendrites are like the inputs to a neuron. 

The soma is like the linear combination and activation of the inputs of an artificial neuron. 

The axon is like the output of a neuron.

### Marketing

There's a *cool* aspect to re-branding neural networks with more layers as "Deep Learning". Neural nets have been around since the 1950s, and recasting them with a new name helps to freshen the image.