[View in Colaboratory](https://colab.research.google.com/github/ppant/deeplearning.ai-notes/blob/master/Summary_of_Deep_Learning_Concepts.ipynb)

# Summary of concepts in Deep Learning
---

Deep Learning is an emerging field which needs no introduction. The aim of this article is to collaboratively learn various concepts in Deep Learning in a concise manner. If you feel something can be added or updated please add a comment. I will keep adding new material to this article as well. 

## How to use this article?

Deep Learning has many parameters, hyperparameters and concepts. This article aims to give a quick refresher on some core topics, especially to those who are new to this field. Some use cases I can think of,
-	Whenever you encounter a term in Deep Learning online when surfing the web and cannot remember what it means, come here and do a quick find. 
-	If you are preparing for an interview for a ML role and want to quickly revise, this is a good place as being concise and complete was my primary motto

## Credits
A big thank you to the DeepLearning.ai team and their Deep Learning specialization on Coursera. All the material here, including notation, concepts and some diagrams, is a heavily shortened form of their excellent 5-course series.

## Notation used throughout
Refer to this section for any variables used in the article

notation | description
--- | ---
$m$ | number of training examples
$n_x$ | number of features per training example
$X$ | input matrix where each column is a training example
$Y$ | output matrix where each column is the corresponding label of the training example in $X$, i.e. $Y[0]$ is the label for $X[0]$, the 1st training example
$\hat{Y}$ | predicted labels for new test inputs
$Z$ | linear transformation of $X$
$A$	| non-linear transformation of $Z$, the result of an activation function
$W$ |	weights matrix for each feature in $X$
$x$	| features of one training example
$y$	| output label of one training example
$\hat{y}$	| predicted output label of one training example
$z$	| linear transformation of $x$
$a$	| non-linear transformation of $z$, the result of an activation function
$w$	| weights matrix for $x$
$b$	| bias matrix
$\sigma$ | sigmoid function, $\sigma(z)= \frac{1}{(1 + e^{-z} )}$ and the output lies in between (0, 1) for any value of $z$
$x_j^{(i)}$ | value of the feature j in i<sup>th</sup> training example 
$w_j^{(i)[k]}$ | value of the weight for j<sup>th</sup> hidden unit in k<sup>th</sup> layer of i<sup>th</sup> training example
$L$ | total number of layers in a deep neural net (excluding input layer)
$n^{[l]}$ |number of units in hidden layer $l$
$X^{\{t\}}$, $Y^{\{t\}}$ | $X$ and $Y$ values for t<sup>th</sup> mini-batch in mini-batch gradient descent
$J$ | cost function of the model considering all training examples
$C$	| number of classes in a multi-class classifier
$X^{(i)<t>}$ | In sequence models, represents the t<sup>th</sup> element in i<sup>th</sup> training example
$Y^{(i)<t>}$ | In sequence models, represents the output value of t<sup>th</sup> element in i<sup>th</sup> training example
$T_X^{(i)}$ | In sequence models, represents the length of input sequence in i<sup>th</sup> training example
$T_Y^{(i)}$ | In sequence models, represents the length of output sequence in i<sup>th</sup> training example



## Logistic Regression, the building block of Deep Neural Nets

It is a linear model for classification. The goal of the model is to predict probabilities of output labels for a given input.
$$z= w^T x+b$$
$$a=\sigma(z)$$
$$\hat{y} = a$$

Cross entropy loss for finding out how good the predictions are for a single training example,
$$L(\hat{y},y)= -(y \log(\hat{y}) + (1-y)\log(1-\hat{y}))$$

Cost function for all examples,
$$J= \frac{1}{m} \sum_{i=1}^mL(\hat{y}^{(i)} ,y^{(i)})$$
This $J$ is used by an optimization algorithm (like gradient descent) to find optimal values for $w$ and $b$.
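
To make the equations above concrete, here is a minimal NumPy sketch of the forward step and the cost for logistic regression. The function names and the toy data are my own choices for illustration and are not taken from the course material.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(w, b, X, Y):
    """Average cross-entropy cost J over m examples.

    X has shape (n_x, m), Y has shape (1, m), w has shape (n_x, 1).
    """
    m = X.shape[1]
    Z = w.T @ X + b                      # linear step, shape (1, m)
    A = sigmoid(Z)                       # predictions y_hat, shape (1, m)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.sum(losses) / m)

# Toy data (made up for illustration): 2 features, 3 examples
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0

# With all-zero parameters every prediction is 0.5, so J equals ln(2)
cost = logistic_cost(w, b, X, Y)
print(cost)
```

An optimizer would repeatedly adjust `w` and `b` to push this number down.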


## Shallow Neural Nets

In logistic regression $z$ and $a$ are computed to obtain prediction for each training example. In a shallow neural net, this process is repeated twice before predicting the output label. In logistic regression,

![logistic_regression_network](https://drive.google.com/uc?export=view&id=1dm9gVeDOf6FOZdaBhIHp94fkRPCP7_nN)

whereas in a shallow net,

![shallow_neural_network](https://drive.google.com/uc?export=view&id=1dFX5kBsG45RLvnAyFNl80TNdIN7UGizH)

[1] and [2] are layers in the network. Layer [1] is a hidden layer as it is neither the input nor output. Layer [1] has three (hidden) units / neurons and layer [2] has one unit. The prediction for a training example $x$, is as follows in a shallow neural net,

$$z^{[1]}= w^{[1]} x+b^{[1]}$$
$$a^{[1]} = \sigma(z^{[1]})$$
$$z^{[2]} = w^{[2]}a^{[1]} +b^{[2]}$$
$$\hat{y}=a^{[2]} =\sigma(z^{[2]})$$

This process is extended to all training examples to obtain $Z^{[1]}$, $Z^{[2]}$, $A^{[1]}$, $A^{[2]}$, $\hat{Y}$. If this process is extended to more than 2 hidden layers it is called a deep neural net!
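
The four equations above can be sketched in NumPy for all training examples at once. This is only an illustration: the layer sizes, the random initialization and the function name `shallow_forward` are assumptions of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_forward(X, W1, b1, W2, b2):
    """Forward pass of a net with one hidden layer, vectorized over m examples."""
    Z1 = W1 @ X + b1      # shape (n1, m)
    A1 = sigmoid(Z1)      # hidden layer activations
    Z2 = W2 @ A1 + b2     # shape (1, m)
    A2 = sigmoid(Z2)      # predictions Y_hat
    return A2

rng = np.random.default_rng(0)
n_x, n1, m = 2, 3, 5                       # 2 features, 3 hidden units, 5 examples
X = rng.normal(size=(n_x, m))
W1 = rng.normal(size=(n1, n_x)) * 0.01     # small random weights
b1 = np.zeros((n1, 1))
W2 = rng.normal(size=(1, n1)) * 0.01
b2 = np.zeros((1, 1))

Y_hat = shallow_forward(X, W1, b1, W2, b2)
print(Y_hat.shape)   # one prediction per training example
```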


## Activation functions

- sigmoid, $\sigma(z)= \frac{1}{1+e^{-z}}$
  - σ(z) lies in between (0, 1)
  - generally used for binary classification tasks in the last layer
- $tanh(z)= \frac{e^z  - e^{-z}}{e^z  + e^{-z}}$
  - $tanh(z)$ lies in between (-1, 1)
  - the graph is centered at 0, unlike sigmoid
- $ReLU(z) = max (0,z)$
  - both sigmoid and tanh slow down learning when $z$ is very negative or very positive, because their gradients become close to zero there
  - neural net learns much faster when compared to sigmoid or tanh 
  - generally used in the hidden layers
- $Leaky\ ReLU(z) = max(0.01z,z)$
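
For quick reference, the four activation functions listed above can be written in NumPy as follows. The `slope` parameter name is my own; 0.01 is the value used in the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # For negative z the output is slope * z instead of 0
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
print(leaky_relu(z))
```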

## Deep Neural Nets

Simply put, it is a neural network with multiple hidden layers. The number of layers $L$ and number of units in each layer are hyperparameters decided before training.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1_qm8c14Gws-k_aR1zeBzlN2O0wvP1kgt" height="250px" alt="deep neural nets" />
</center>
<caption>
  <center>
    <strong>Figure 1: </strong>
    A 4 layer, fully connected deep neural network
  </center>
</caption>

The above network has $L = 4$, $n^{[1]} = 3$, $n^{[2]} = 4$, $n^{[3]} = 3$ and $n^{[4]}=1$. $\hat{Y} = A^{[L]}$ is the result for all training examples. With $X = A^{[0]}$, the first layer is computed as,

$$Z^{[1]} = W^{[1]}  A^{[0]} + b^{[1]}$$

$$A^{[1]} = g^{[1]}( Z^{[1]})$$

Similarly, the process is repeated for layers [2], [3] and [4]
$$\hat{Y}= A^{[L=4]}= g^{[4]}( Z^{[4]})$$

Here $g^{[l]}$ is the activation function used in layer $l$. When implemented with numpy arrays, all computations are performed across training examples at once; this is called a vectorized implementation. Without vectorization, the neural net has to loop over training examples one by one to complete one epoch of training, which slows down learning.

Each training example $x^{(i)}$, is passed through the net to obtain the prediction $\hat{y}^{(i)}$ from the last layer. This step is called **forward propagation** in the entire process. $\hat{y}^{(i)}$ is compared with $y^{(i)}$ using $J$ to obtain the error in prediction. This error is passed back from layer $[L]$ to $[L-1]$ to $[L-2]$ and so on to $[1]$ to adjust $W^{[l]}$ , $b^{[l]}$ at each layer so that the next prediction causes smaller error. This step of passing back the error is called **back propagation** in the entire process. Every time error is passed back, the amount of change the system makes to the parameters $W^{[l]}$, $b^{[l]}$ is governed by a hyperparameter called learning rate, $\alpha$.


### Dimensionality checks

These formulae can help debug [dimensions](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.shape.html) of various matrices while implementing deep neural nets
- $W^{[l]}.shape = (n^{[l]}, n^{[l-1]})$
- $b^{[l]}.shape = (n^{[l]}, 1)$
- $A^{[l]}.shape = Z^{[l]}.shape = (n^{[l]}, m)$
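
As a rough sketch, the shape rules above can be turned into assertions inside a vectorized forward pass. The layer sizes follow Figure 1; the input size $n_x = 5$, the ReLU hidden layers, the sigmoid output and all function names are assumptions made only for this example.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """layer_dims = [n_x, n1, ..., nL]. Returns a dict with W1, b1, ..., WL, bL."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.normal(size=(layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params, num_layers):
    """Forward propagation with ReLU in hidden layers and sigmoid in the last layer."""
    A = X
    m = X.shape[1]
    for l in range(1, num_layers + 1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        assert W.shape[1] == A.shape[0]        # W is (n_l, n_{l-1})
        assert b.shape == (W.shape[0], 1)      # b is (n_l, 1)
        Z = W @ A + b
        A = np.maximum(0, Z) if l < num_layers else 1.0 / (1.0 + np.exp(-Z))
        assert A.shape == (W.shape[0], m)      # A and Z are (n_l, m)
    return A

layer_dims = [5, 3, 4, 3, 1]    # n_x = 5 is assumed, the rest follows Figure 1
m = 7
X = np.random.default_rng(1).normal(size=(layer_dims[0], m))
params = init_params(layer_dims)
Y_hat = forward(X, params, num_layers=4)
print(Y_hat.shape)
```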


### Hyperparameters to choose

$W^{[l]}$, $b^{[l]}$ are parameters of the neural net and are learned during the training phase. Hyperparameters are manually set by the developer before training.

- learning rate alpha, $\alpha$ - the rate at which parameters are updated to bring the predictions close to actual values
- number of epochs – After training with the entire training data once, one epoch is completed. This parameter controls how many times this should be repeated.
- hidden layers, L – how many hidden layers in the Deep Neural Net (DNN)
- hidden units per layer – values for $n^{[1]}$, $n^{[2]}$, $n^{[3]}$,…, $n^{[L]}$
- activation functions – [activation function](#activation-functions) to use in each layer, $g^{[1]}$, $g^{[2]}$, $g^{[3]}$,…, $g^{[L]}$


## Optimizing Deep Neural Networks

### Data splitting – All data from same distribution

All the available labelled data is split into,

<img src="https://drive.google.com/uc?export=view&id=1IVxLaZtnxIJNmXwFAgmHdDR6gn-7xf6Z" alt="data splitting in same distribution" />

-	Train data – Majority of the data is used for training
-	Dev data – Also called validation set / data. Used for validating the model and hyperparameter tuning
-	Test data – Used for validating the final chosen model


#### Error Types

As shown in Figure 2, a DNN has a train and dev error besides the test error

- Avoidable bias – difference between human error (often used as the benchmark) and training error. Possible solutions to reduce this are:
    - Train on a bigger network (increase $L$ or $n^{[l]}$)
    - Increase number of epochs
    - Change network architecture
- Variance – difference between training error and dev error. This happens due to overfitting to training data. Possible solutions to reduce this are:
    - Train on more data
    - Regularization
    - Change network architecture
    
<center>
  <img src="https://drive.google.com/uc?export=view&id=1d438hrqAbvGFRLcXdERzSV2dJz1su2hV" alt="errors when all data is from same distribution" />
</center>
<center>
  <caption>
    <strong>Figure 2: </strong>
    Range of each error
  </caption>
</center>

### Data Splitting – Data from different distributions

Ideally train, dev and test sets should be from the same data distribution for best results. But sometimes big enough data might not be available for performing a deep learning experiment. For example, for creating a DNN to classify 100 pictures of your 2 cats, training on cat pictures from internet and testing on your 100 cat pictures may not yield good results as data distributions are different. In such situations,

<center>
  <img src="https://drive.google.com/uc?export=view&id=1VFingcLOJDHFw9RIYz8sDnK7BZ3xEUp3" height="75px" alt="split limited data sample figure" />
</center>

split the available 100 cat pictures 50-50. Mix the Train (50) pictures with internet pictures like so

<center>
  <img src="https://drive.google.com/uc?export=view&id=1xZvPvNkKEQTpEmp1wjnwZ5CMnOjAL5iM" alt="how to use the split limited data figure" />
</center>

As the train and dev data are from different distributions, comparing the training and dev errors does not clarify whether the gap is due to high variance or due to data mismatch. Hence, after mixing in your 50 cat pictures, the train data is split into train and training-dev sets.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1PXrBR1meIcWKDy_U0ygYVCNfJGtPR0OT" alt="final data after mixing all available data sources figure" />
</center>

Now, as the train and training-dev sets are from the same distribution, the root cause of the problem can be identified as bias, variance or data mismatch.

<center>
  <img src="https://drive.google.com/uc?export=view&id=13FSFXlY-ivHnPLrQgHpBbaA9yfFVfUY3" alt="range of errors when data is from different distributions figure" />
</center>
<center>
  <caption>
    <strong>Figure 3: </strong>
    Range of errors when not all data is from same distribution
  </caption>
</center>

As shown in Figure 3, as training-dev set and dev-set are from different data distributions, the difference between their errors is due to data mismatch.
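
The error ranges discussed above can be compared with a few subtractions. The sketch below assumes the usual reading of these gaps (human to train is avoidable bias, train to training-dev is variance, training-dev to dev is data mismatch); the function name and the error values are made up for illustration.

```python
def diagnose(human_err, train_err, train_dev_err, dev_err):
    """Return the three error gaps and the name of the largest one."""
    gaps = {
        "avoidable bias": train_err - human_err,
        "variance": train_dev_err - train_err,
        "data mismatch": dev_err - train_dev_err,
    }
    largest = max(gaps, key=gaps.get)
    return gaps, largest

# Hypothetical error rates: 1% human, 2% train, 3% training-dev, 10% dev
gaps, largest = diagnose(0.01, 0.02, 0.03, 0.10)
print(gaps)
print("Largest gap:", largest)
```

In this made-up case the jump from training-dev error to dev error dominates, which points to data mismatch rather than bias or variance.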


### Regularization

When the neural net overfits the training data (high variance), predictions on the unseen dev set can be poor. Regularization reduces the impact of (various) neurons in the model so that it can generalize better to unseen inputs. $\lambda$ is the hyperparameter which controls the amount of regularization used in the L1 and L2 algorithms. Here are some algorithms / ideas for regularization,

- L1 – Uses L1-norm to penalize $W$’s
- L2 – Uses L2-norm to penalize $W$’s
- Dropout – Randomly zeros (drops) some neurons from the network, thus making it simpler and helping it generalize better. $keep\_prob$ is the hyperparameter which is the probability of retaining a neuron. Different layers can have different values of $keep\_prob$ based on density of connections
- Data augmentation – transform, randomly crop and translate input training images
- Early stopping – after every epoch compute dev error and once it starts increasing, stop the training though training error continues to decrease (sign for overfitting)
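
Two of the ideas above, L2 and dropout, are easy to sketch in NumPy. Note the assumptions: the dropout shown is the "inverted" variant, which rescales the kept activations by `keep_prob`, and the L2 penalty uses the common $\frac{\lambda}{2m}$ scaling. Neither detail is spelled out in the list above, and the function names are mine.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero some units, rescale the rest to keep the expected value."""
    mask = rng.random(A.shape) < keep_prob     # True for units that are kept
    A_dropped = A * mask / keep_prob
    return A_dropped, mask

def l2_penalty(weights, lam, m):
    """L2 regularization term added to the cost J (assumed lambda / (2m) scaling)."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(0)
A = np.ones((4, 1000))
A_dropped, mask = dropout_forward(A, keep_prob=0.7, rng=rng)
print(mask.mean())        # fraction of units kept, close to keep_prob
print(A_dropped.mean())   # stays close to the original mean of 1

penalty = l2_penalty([np.ones((2, 2))], lam=0.5, m=5)
print(penalty)
```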


### Normalization

Normalize input features with varying ranges to learn faster. Normalizing sets $\mu=0$ and $\sigma^2=1$ for each feature across all training examples.

- Batch normalization – the idea of normalizing inputs is extended to all layers. $z^{[l]}$ is normalized before applying the activation function. The flow of parameters would then be,

$$X \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow{\beta^{[1]},\gamma^{[1]}} \tilde{Z}^{[1]} \to a^{[1]} = g( \tilde{Z}^{[1]}) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]}…$$

> $\tilde{Z}^{[1]}$ is the normalized $Z^{[1]}$ computed using parameters $\beta^{[1]}$, $\gamma^{[1]}$. Just like $W^{[l]}$ and $b^{[l]}$ are parameters that are learned during training, $\beta^{[l]}$ and $\gamma^{[l]}$ are too. 

> In case of mini-batch gradient descent, exponentially weighted averages of $\mu$ and $\sigma^2$ across batches are saved during training. These are used to compute $\tilde{Z}^{[l]\{t\}}$ during inference time.
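
A minimal sketch of the batch normalization step for one layer and one mini-batch is shown below. It covers only the training-time forward computation; the small constant `eps` (for numerical safety) and the function name are assumptions of this example.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z (shape (n_l, m)) per unit across the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta       # gamma and beta are learned parameters
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(3, 64))   # 3 units, mini-batch of 64
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))

Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1))   # approximately 0 for every unit
print(Z_tilde.var(axis=1))    # approximately 1 for every unit
```

The returned `mu` and `var` are what a training loop would feed into the running averages used at inference time.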


### Train faster and better

- Mini-batch gradient descent – if the training set size is huge, models learn better, but each epoch takes longer. In mini-batch gradient descent, the inputs are sliced into batches and a step is taken by gradient descent after training on a mini-batch. Mini-batch size is generally chosen in between 1 and $m$ to take advantage of both vectorization and quicker steps. Typical batch sizes are 64, 128, 256 or 512 training examples, such that each mini-batch fits in memory of CPU / GPU

- Gradient descent with Momentum – mini-batch gradient descent introduces oscillations which may slow down reaching the optimum. Momentum solves this problem by adding a moving-average-like effect and dampening the oscillations to reach the optimum faster. Momentum $\beta$ controls the size of the sliding window $\approx \frac{1}{1- \beta}$

- RMS Prop – Guides the gradient descent algorithm towards the minimum by taking longer steps in the dimensions farther away from minimum and smaller steps in the dimensions closer to minimum. $\beta_2$  and $\epsilon$ are hyperparameters for this optimization. $\epsilon$ is not so important; it is added only to avoid a division by zero error and is generally set to $10^{-8}$

- Adam – combines ideas from gradient descent with momentum and RMS prop and uses $\beta$, $\beta_2$  and $\epsilon$ as hyperparameters

- Learning rate decay – mini-batch gradient descent oscillates around the minimum. Decaying the learning rate over epochs shrinks these oscillations and helps convergence. So $\alpha$ is no longer a constant and becomes

$$\alpha=\frac{1}{1 + decay\_rate \times epoch\_number} \times \alpha_0$$
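
To show how these pieces fit together, here is a small sketch of one Adam update combined with learning rate decay of the form $\alpha = \alpha_0 / (1 + decay\_rate \times epoch\_number)$. The bias-correction terms follow the commonly published form of Adam and are not described in the list above; the function names and the toy objective $f(w) = w^2$ are my own.

```python
import numpy as np

def decayed_learning_rate(alpha0, decay_rate, epoch_number):
    """Learning rate after a given number of epochs."""
    return alpha0 / (1 + decay_rate * epoch_number)

def adam_step(w, grad, v, s, t, alpha=0.01, beta=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. v is the momentum term, s is the RMS Prop term, t starts at 1."""
    v = beta * v + (1 - beta) * grad             # moving average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2      # moving average of squared gradients
    v_hat = v / (1 - beta ** t)                  # bias correction (assumed, standard Adam)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy problem: minimize f(w) = w^2, whose gradient is 2w
w = np.array([5.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 101):
    alpha = decayed_learning_rate(alpha0=0.1, decay_rate=0.01, epoch_number=t)
    grad = 2 * w
    w, v, s = adam_step(w, grad, v, s, t, alpha=alpha)
print(w)   # has moved from 5.0 towards the minimum at 0
```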


### Hyperparameter Tuning

As there are many hyperparameters to set before training, it is important to realize that not all of them are equally important. For example, $\alpha$ is more important than $\lambda$, so fine tuning $\alpha$ first is better. Some approaches for tuning a hyperparameter are,

- Grid based search – create a table of combinations of hyperparameter 1 and 2 values. For each combination evaluate on dev set to find the best combination

- Random based search – randomly select combinations of values for hyperparameters 1 and 2. For each combination evaluate on dev set to find the best combination. After performing a random search in a broad domain of values, a more fine-grained search in the area(s) of interest using the results from coarse random search can be performed. It is important to scale the hyperparameters before selecting values uniformly at random

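- Panda vs. Caviar approach – If the model is so complex that multiple combinations cannot be tested in parallel, it is a better idea to babysit a single model (the panda approach): watch how $J$ varies with time and change hyperparameter values at runtime. When enough compute is available, many models can instead be trained in parallel with different hyperparameter values (the caviar approach).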
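
One way to read "scale the hyperparameters before selecting values uniformly at random" is to sample on a logarithmic scale. The sketch below does this for the learning rate; the range $10^{-4}$ to $1$ and the function name are assumptions chosen only for illustration.

```python
import numpy as np

def sample_learning_rates(n, low_exp=-4, high_exp=0, seed=0):
    """Sample n learning rates uniformly on a log scale between 10^low_exp and 10^high_exp."""
    rng = np.random.default_rng(seed)
    exponents = rng.uniform(low_exp, high_exp, size=n)   # uniform in the exponent
    return 10.0 ** exponents

alphas = sample_learning_rates(1000)
print(alphas.min(), alphas.max())
# Roughly half of the samples fall below 0.01, the midpoint on the log scale
print(np.mean(alphas < 1e-2))
```

Sampling $\alpha$ directly and uniformly from $[0.0001, 1]$ would instead put about 99% of the samples above 0.01, leaving the small values almost unexplored.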


### Multiclass classification

Softmax layer is used as the final layer to classify into $C$ classes. The activations from final layer $L$ are computed as,

$$a_i^{[L]} = \frac{t_i}{\sum_{j=1}^C t_j}$$

where $t_i = e^{z_i^{[L]}}$
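
A minimal NumPy sketch of the softmax activation is given below. Subtracting the column maximum before exponentiating is a common numerical-stability trick that I have added; it does not change the result. The input values are made up.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax. z has shape (C, m): C classes, m examples."""
    t = np.exp(z - z.max(axis=0, keepdims=True))   # shift for numerical stability
    return t / t.sum(axis=0, keepdims=True)

z = np.array([[2.0], [1.0], [0.1]])   # C = 3 classes, 1 example
a = softmax(z)
print(a)            # three probabilities, the largest one for the first class
print(a.sum())      # the probabilities add up to 1
```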


### Transfer Learning

Use the learned parameters from one model in another. It is done by replacing the last few layers in the original trained network. The new layers can then be trained using the new dataset of interest. This is generally applicable when features identified by initial layers of an existing model can be re-used for another task.


## Convolutional Neural Nets

A class of deep neural nets for computer vision tasks. A DNN is expected to identify the features from $X$ without the need for hand tuning them. Therefore, in computer vision tasks, images and videos are generally used as is as $X$. If the image is passed to the network as is, without feature engineering, the number of parameters to learn can be quite high depending on the image’s resolution. For example, if the input image is (width, height, RGB channels) = (1000, 1000, 3) dimensional, fully connecting it (as shown in Figure 1) to a layer with $n^{[1]} = 1000$ would imply $W.shape = (1000, 3\times10^6)$, i.e. 3 billion parameters. Training so many parameters demands a lot of training data, and hence fully connected DNNs are impractical for computer vision applications. Therefore, a new class called Convolutional Neural Nets (CNN) is studied.

It is known that earlier layers in a DNN identify simple features like edges and the later ones detect more complex shapes in a given image. The convolution operator $\ast$ from mathematics addresses both of the above needs: it identifies edges in earlier layers and shapes in the later ones, and it requires far fewer parameters than a fully connected DNN.


### Working of a convolution operation

<center>
  <img src="https://drive.google.com/uc?export=view&id=1i8dczLaLdZ6OYpQXBDrIW7qSwK-CjROb" alt="convolution operator animation" />
</center>
<center>
  <caption>
    <strong>Figure 4:</strong> Convolution operator in action
  </caption>
  <p>
    <small>
      <strong>Source:</strong> Coding exercise “Convolution model - Step by Step - v2” in the course https://www.coursera.org/learn/convolutional-neural-networks/
    </small>
  </p>
</center>

- The number of channels (the 3rd dimension) in the input layer should match the number of channels in the convolution filter


### Padding

Due to the way $\ast$ works, cells on the edges contribute less to the output layer than inner cells. To solve this problem, strip(s) of zeros are added to the input layer before the $\ast$ operation; this is called padding, $p$. There are two types of padding,

- Valid $\implies p = 0$

- Same $\implies p = \frac{f-1}{2}$ where $f$ is the dimension of the convolution filter. More on $f$ below.


### Dimensionality involving a convolution operation 

$$(n, n, \#channels) \ast (f, f, \#channels) \to \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1, \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1, \#filters\right)$$

where $n$ = dimension of input layer / image

> $f$ = dimension of convolution filter

> $p$ = amount of padding to input layer

> $s$ = stride length of convolution filter on input layer

> $\# filters$ = number of convolution filters used on input layer

For Figure 4: $n = 5, \# channels = 1, f = 3, p = 0, s = 1, \# filters = 1$
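
The output size can be checked with a naive single-channel convolution, using the standard relation $\lfloor (n + 2p - f)/s \rfloor + 1$ for each spatial dimension. This is a slow, loop-based sketch meant only to mirror the definition (like most deep learning libraries, it does not flip the filter); the function names, the toy image and the vertical-edge filter are my own choices.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """Height / width of the output for an n x n input and an f x f filter."""
    return (n + 2 * p - f) // s + 1

def conv2d_single(image, kernel, p=0, s=1):
    """Convolve one square single-channel image with one square filter."""
    f = kernel.shape[0]
    padded = np.pad(image, p)                       # zero padding on all sides
    out = conv_output_size(image.shape[0], f, p, s)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = padded[i * s:i * s + f, j * s:j * s + f]
            result[i, j] = np.sum(window * kernel)  # element-wise product, then sum
    return result

image = np.arange(25, dtype=float).reshape(5, 5)    # n = 5, values grow left to right
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)        # f = 3, vertical edge detector

out = conv2d_single(image, kernel)                  # p = 0, s = 1 as in Figure 4
print(out.shape)
print(conv_output_size(5, 3, p=1, s=1))             # "same" padding keeps the size at 5
```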


### Pooling

Another type of operator like $\ast$, which is mainly used to shrink the height and width of the input. Just like $\ast$, pooling layers also are filters which run across the input. However, they do not have any parameters to learn.

- Max Pooling – pick the max value at every position of filter on the input

- Average Pooling – pick the average value at every position of filter on the input
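
Both pooling variants can be sketched with the same sliding-window loop. The filter size `f=2`, stride `s=2`, function name and input values below are assumptions picked for a small, easy-to-check example.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Max or average pooling over a square single-channel input."""
    out = (x.shape[0] - f) // s + 1
    reduce = np.max if mode == "max" else np.mean
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return result

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 8, 9, 2],
              [1, 0, 3, 4]], dtype=float)

max_pooled = pool2d(x, mode="max")    # [[6, 5], [8, 9]]
avg_pooled = pool2d(x, mode="avg")    # [[3.5, 2.0], [4.0, 4.5]]
print(max_pooled)
print(avg_pooled)
```

Note that the function has no weights: the only choices are the hyperparameters `f`, `s` and the pooling mode.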


## Sequence Models - [ Work in Progress ]

A class of Deep Neural Networks for modelling inputs that have an ordering or exist in sequences. For example, a sequence of words is a sentence, a sequence of air pressure values (over time) is an audio / sound. 

### Natural Language Processing (NLP)

Sequence modelling techniques are widely used for processing Natural (Human) language. Since neural networks work on matrices of numbers, a simple way to encode (English) words as numbers is to one-hot encode them. In order to do that, assuming there are 10000 words in the English dictionary, every word is assigned a unique index between 1 and 10000. Then, if the word 'Aaron' is assigned index 3, its one-hot encoded vector would be,

<center>
  $
  \begin{bmatrix}
    0 \\
    0 \\
    1 \\
    0 \\
    . \\
    . \\
    . \\
    0 \\
  \end{bmatrix}_{10000\times 1}
  $
</center>

After every word is converted to its one-hot representation, the sentence is fed into a Recurrent Neural Network (RNN) as shown in Figure 5. For a given sentence, say "Aaron bikes to school everyday", $x^{<1>}$ would be the one-hot encoding of 'Aaron', $x^{<2>}$ would be the one-hot encoding of 'bikes' and so on. If the goal of the RNN is to find the part of speech of each word in the sentence, then $y^{<1>}$ would be the part of speech for 'Aaron', $y^{<2>}$ would be the part of speech for 'bikes' and so on. The training data for this task would be X = one-hot encoded words of English sentences, and Y = part of speech for each word of each sentence.
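
The encoding step can be sketched as follows. Only the index 3 for 'Aaron' comes from the example above; the other word indices are hypothetical placeholders, as are the function and variable names.

```python
import numpy as np

VOCAB_SIZE = 10000

def one_hot(index, vocab_size=VOCAB_SIZE):
    """Column vector with a 1 at the (1-based) word index and 0 elsewhere."""
    v = np.zeros((vocab_size, 1))
    v[index - 1, 0] = 1.0
    return v

# Hypothetical word indices, except 'Aaron' = 3 which follows the text
word_to_index = {"Aaron": 3, "bikes": 1021, "to": 8755, "school": 7683, "everyday": 3012}

sentence = "Aaron bikes to school everyday".split()
# Each column of X is x<t>, the one-hot vector of the t-th word
X = np.hstack([one_hot(word_to_index[word]) for word in sentence])
print(X.shape)   # (vocabulary size, number of words in the sentence)
```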

<center>
  <img src="https://drive.google.com/uc?export=view&id=1FGV5ZaPfAwsRJZJaNzbj7iO2D1Uc9MhH" alt="a recurrent neural network" />
</center>
<center>
  <caption>
    <strong>Figure 5:</strong> Recurrent Neural Network
  </caption>
</center>

### Why RNNs and not just use DNNs?

Just like CNNs are well suited for Computer Vision tasks, RNNs are designed for tasks which have a temporal nature. RNNs,


*   allow inputs and outputs to be of different lengths, i.e. $T_X \neq T_Y$, for example in the task of Machine Translation where the length of the output sentence can differ from that of the input sentence
*   can share features learned across different positions of text (in NLP)
*   have fewer parameters to compute compared to DNNs just like in CNNs

[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) blog post talks about various applications of RNNs besides NLP.



### Forward Propagation in RNNs

<center>
  <p>$a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$</p>
  <p>$\hat{y}^{<t>} = g_2(W_{ya}a^{<t>} + b_y)$</p>
</center>

where 

* $a^{<0>} = \vec{0}$, usually
* $W_{aa}, W_{ax}, W_{ya}, b_a$ and $b_y$ are parameters learned by gradient descent
* $g_1$ is usually tanh or ReLU and $g_2$ is usually sigmoid or softmax

Note that, the parameters are shared across the time steps &lt;t&gt;. The notation is however simplified to,

<center>
  <p>$a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$</p>
  <p>$\hat{y}^{<t>} = g(W_ya^{<t>} + b_y)$</p>
</center>

where

* $W_a$ is column wise concatenated matrix of $W_{aa}$ and $W_{ax}$
* $[a^{<t-1>}, x^{<t>}]$ is concatenated row wise i.e.

<center>
$
  \begin{bmatrix}
      W_{aa} & W_{ax}
  \end{bmatrix}
  \begin{bmatrix}
      a^{<t-1>} \\
      x^{<t>}
  \end{bmatrix} = W_{aa}a^{<t-1>} + W_{ax}x^{<t>}
$
</center>
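
The forward equations, and the equivalence between the full and the simplified notation, can be checked numerically. In this sketch $g_1$ is tanh and $g_2$ is softmax; the dimensions, random parameters and function name are assumptions made only for illustration.

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One time step of a basic RNN: returns the new activation and the prediction."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    z_y = W_ya @ a_t + b_y
    e = np.exp(z_y - z_y.max(axis=0, keepdims=True))   # softmax over output classes
    y_hat = e / e.sum(axis=0, keepdims=True)
    return a_t, y_hat

rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 6, 3, 5          # hidden size, input size, classes, sequence length
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                    # a<0> is the zero vector
xs = [rng.normal(size=(n_x, 1)) for _ in range(T_x)]
outputs = []
for x_t in xs:                            # the same parameters are reused at every step
    a, y_hat = rnn_step(a, x_t, W_aa, W_ax, W_ya, b_a, b_y)
    outputs.append(y_hat)

# Simplified notation: W_a [a, x] equals W_aa a + W_ax x
W_a = np.hstack([W_aa, W_ax])             # shape (n_a, n_a + n_x)
stacked = np.vstack([a, xs[0]])           # shape (n_a + n_x, 1)
same = np.allclose(W_a @ stacked, W_aa @ a + W_ax @ xs[0])
print(same)
```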

### Back Propagation in RNNs

In Figure 6, it can be seen how parameters flow to compute the loss in the forward propagation step and how the derivatives of the loss, $L$, flow back via back propagation so that gradient descent can adjust the parameters.

<center>
  <img src="https://drive.google.com/uc?export=view&id=16DR4wsumB0sguKDPWRQf5-_Bx2kC84cu" alt="recurrent neural network computational graph" />
</center>
<center>
  <caption>
    <strong>Figure 6:</strong> Recurrent Neural Network Computational Graph
  </caption>
</center>

<br />
In the form of equations, first $L^{<t>}$ is computed using cross entropy loss like before. Final loss, $L$ is a summation of losses for all $t$

<center>
  <p>$L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = - (y^{<t>}\log(\hat{y}^{<t>}) + (1-y^{<t>})\log(1-\hat{y}^{<t>}))$</p>
  <p>$L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$</p>
</center>


### Language Model

Grammatically and logically, "Ram ate an apple" is more likely than "Ram <b>an ate</b> apple". The goal of the language model is to assign a higher probability for the first sentence. In other words the language model should,

<center>
  $P(y^{<1>} = Ram,\ y^{<2>} = ate,\ y^{<3>} = an,\ y^{<4>} = apple)\ \mathbf{>}\ P(y^{<1>} = Ram,\ y^{<2>} = an,\ y^{<3>} = ate,\ y^{<4>} = apple)$
</center>

For this task, a one-to-many RNN is used. $\hat{y}^{<t>}$ are the probabilities of all words in the vocabulary being present at position $<t>$ in a sentence. $y^{<t-1>}$ is fed as input, $x^{<t>}$, making it a conditional probability, $P(y^{<2>} = ate\ |\ y^{<1>} = Ram)$

<center>
  <img src="https://drive.google.com/uc?export=view&id=181G51N_MwiB-dz9ufBsXOSrmcALILz_o" alt="one-to-many RNN for modelling language" />
</center>
<center>
  <caption>
    <strong>Figure 7:</strong> one-to-many RNN for modelling language
  </caption>
</center>

Training this on a huge corpus of text, models the sequence of words in a language. Using the conditional probabilities obtained at each $<t>$, the probability of a sentence can be found,

<center>
  $P(y^{<1>} = Ram,\ y^{<2>} = ate,\ y^{<3>} = an,\ y^{<4>} = apple) = P(y^{<1>} = Ram) \times P(y^{<2>} = ate\ |\ y^{<1>} = Ram) \times P(y^{<3>} = an\ |\ y^{<1>} = Ram, \ y^{<2>} = ate) \times P(y^{<4>} = apple\ |\ y^{<1>} = Ram, \ y^{<2>} = ate, \ y^{<3>} = an)$
</center>

<br />
$P(y^{<3>} = an\ |\ y^{<1>} = Ram, \ y^{<2>} = ate)$ is obtained from the 3<sup>rd</sup> time step and so on. As the vocabulary cannot incorporate all the words that ever appear, a special &lt;UNK&gt; token is used to represent all out-of-vocabulary words. Similarly, character-level language models can be created where each character would be $\hat{y}^{<t>}$, which mitigates the problem of &lt;UNK&gt; words impacting the accuracy. The problem with character models is that the sequences would be much longer and RNNs struggle to carry very long range dependencies. Also, character models are more computationally intensive and so are not common.
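
The chain-rule product above is simple to compute once the per-step conditional probabilities are available. In the sketch below the probabilities are invented numbers standing in for what a trained model would output; summing logarithms instead of multiplying directly is a common precaution against underflow on long sentences.

```python
import math

def sentence_probability(conditional_probs):
    """Multiply per-word conditional probabilities via a sum of logs."""
    log_prob = sum(math.log(p) for p in conditional_probs)
    return math.exp(log_prob)

# Made-up values for P(Ram), P(ate | Ram), P(an | Ram, ate), P(apple | Ram, ate, an)
good_order = [0.01, 0.2, 0.3, 0.1]
# Made-up values for the less likely ordering "Ram an ate apple"
bad_order = [0.01, 0.001, 0.002, 0.1]

p_good = sentence_probability(good_order)
p_bad = sentence_probability(bad_order)
print(p_good, p_bad)
print(p_good > p_bad)
```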


### Gated Recurrent Units (GRU)

RNNs have the problem of vanishing gradients if the sentences are long, just like deep DNNs. Long term dependencies are quite common in English, as a word at the beginning of a sentence can dictate the state of a word at the end of the sentence. To enable long term dependencies, GRUs introduce a concept called a memory cell. 

<center>
  <img src="https://drive.google.com/uc?export=view&id=1aQNuDtnrP0wZ26-AgabAeKm41uV8vSVF" alt="RNN and GRU cells" />
</center>
<center>
  <caption>
    <strong>Figure 8:</strong> RNN and GRU cells
  </caption>
</center>

<br />
<center>
  <p>$\tilde{c}^{<t>} = tanh(W_c[c^{<t-1>},\ x^{<t>}] + b_c) $</p>
  <p>$\Gamma_u = \sigma(W_u[c^{<t-1>},\ x^{<t>}] + b_u) $</p>
</center>

TODO

## Deep Learning Hyperparameters

Hyperparameter | Symbol | Common Values | Notes
--- | --- | --- | ---
regularization | $\lambda$ | | also called, “weight decay”
learning rate | $\alpha$	| 0.01 | 
keep_prob | | 0.7 | from Dropout regularization
momentum | $\beta$ | 0.9 | also used in Adam 
mini-batch size |  $t$	| 64, 128, 256, 512	| 
RMS Prop | $\beta_2$ | 0.999 | also used in Adam 
learning rate decay | | | also called decay_rate
filter size | $f^{[l]}$	|	| In CNN, size of a filter in layer $l$
stride | $s^{[l]}$ | | In CNN, stride length in layer $l$
padding |  $p^{[l]}$ | | In CNN, padding in layer $l$
\# filters | $n_c^{[l]}$ | | In CNN, number of filters used in layer $l$


			
			
