RNNs are most often associated with text processing and text generation because sentences are naturally structured as sequences of words.

The simple NNs and CNNs covered so far look only at the current input. An RNN differs in that it keeps a memory of past inputs, which makes it well suited to sequence tasks such as predicting the next word in a sentence.

### RNN History

First attempts:

- [TDNN](https://en.wikipedia.org/wiki/Time_delay_neural_network)
- [Elman networks](https://en.wikipedia.org/wiki/Recurrent_neural_network#Elman_networks_and_Jordan_networks)
- [LSTM](http://www.bioinf.jku.at/publications/older/2604.pdf)

Basic RNNs still have a big flaw, the **vanishing gradient**, which the earlier networks did not resolve. An RNN can capture relationships from several steps in the past, but if it has to reach too far back, the weight updates become very small: the gradient shrinks geometrically as the number of steps increases, so backpropagation ends up working with very small derivatives.

Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs) address the vanishing gradient problem, which lets us apply networks to data with long-range temporal dependencies.
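The geometric shrinking can be illustrated with a minimal sketch. It assumes a hypothetical scalar RNN $h_t = \tanh(w \cdot h_{t-1} + x)$ with a recurrent weight below 1 in magnitude; the function name and the values of `w`, `h0`, and `x` are illustrative choices, not part of these notes.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return |d h_t / d h_0| for t = 1..steps in the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h = h0
    grad = 1.0
    magnitudes = []
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), multiplied in at every step
        grad *= w * (1.0 - h ** 2)
        magnitudes.append(abs(grad))
    return magnitudes

magnitudes = gradient_through_time(w=0.9, steps=30)
print(magnitudes[0], magnitudes[9], magnitudes[29])
```

Each step multiplies the gradient by a factor smaller than 1 here, so the influence of an input 30 steps back is far weaker than that of the most recent one.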

### Feedforward Neural Network-Reminder

The goal is to decide what goes inside the black box (e.g. a neural network or another algorithm), but what we really want is to **find the optimal set of weights**.

There are two phases:
- Training (includes feedforward and backpropagation)
- Evaluation

During training our goal is to find the optimal set of weights that would best map the inputs to the desired outputs. In evaluation, we test our network with inputs that it has never seen before.

### Feedforward

![Feedforward from inputs to hidden layers](part-4_images/feedforward1.png)

1. Calculating the value of the hidden states.

Our first step is finding $\hat{h}$ (the hidden layer's values). We start by multiplying all inputs with their corresponding weights and summing the results, which gives $h' = \hat{x}W^1$.

![Finding h hat](part-4_images/inputs_weights.png)

After we find $h'$, we need an activation function to finalize the computation of the hidden layer's values. This is generally written as $\hat{h} = \Phi(\hat{x}W^1)$ or $\hat{h} = \Phi(h')$.

![Calculating h](part-4_images/calc_hidden_layers.png)
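The hidden-layer calculation above can be sketched in NumPy. The layer sizes, the input values, the random weights, and the choice of `tanh` for $\Phi$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.0, 2.0])            # row vector of 3 inputs (hypothetical values)
W1 = rng.normal(scale=0.5, size=(3, 4))   # weights from 3 inputs to 4 hidden neurons

h_prime = x @ W1       # h': weighted sum of the inputs for each hidden neuron
h = np.tanh(h_prime)   # h = Phi(h'), with tanh assumed as the activation

print(h.shape)
```

Column `j` of `W1` holds the weights feeding hidden neuron `j`, so a single vector-by-matrix product computes every weighted sum at once.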

Resource on activation functions: [here](https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions) and [here](https://theclevermachine.wordpress.com/tag/tanh-function/).

2. Calculating the values of the Outputs

Next, the hidden layer's values are multiplied by another set of weights. Essentially, each new layer in a neural network is calculated by a vector-by-matrix multiplication, where the vector holds the inputs to the new layer and the matrix holds the weights connecting those inputs to the next layer.

Below is an example of how the calculations would be done if we had 3 hidden layers.

![Generalized output calculation](part-4_images/generalized_calc_outputs.png)
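The same idea generalizes to any number of layers. The sketch below is one possible way to write it, assuming three hidden layers with made-up sizes, `tanh` activations, and a linear output layer; the `feedforward` helper is hypothetical.

```python
import numpy as np

def feedforward(x, weights, activation=np.tanh):
    """Propagate the row vector x through a list of weight matrices."""
    layer = x
    for W in weights[:-1]:
        layer = activation(layer @ W)   # hidden layers: multiply, then activate
    return layer @ weights[-1]          # output layer kept linear (an assumption)

rng = np.random.default_rng(1)
sizes = [3, 4, 5, 4, 2]  # inputs, three hidden layers, outputs (hypothetical sizes)
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes, sizes[1:])]

y = feedforward(np.array([0.5, -1.0, 2.0]), weights)
print(y.shape)
```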

In some applications it can be beneficial to use a softmax function (if we want all output values to be between 0 and 1, and their sum to be 1).
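A minimal softmax sketch is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a vector of raw outputs to positive values that sum to 1."""
    shifted = z - np.max(z)   # avoids overflow in exp without changing the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```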

Next comes determining the loss, that is, how far the generated output is from the real label. This is done with a loss function, which is essential for backpropagation to work.

The two error functions that are most commonly used are the **Mean Squared Error (MSE)** (usually used in regression problems) and the **cross entropy** (usually used in classification problems).
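Both error functions can be sketched in a few lines. The version below assumes a one-hot label for the cross entropy and adds a small `eps` inside the logarithm to avoid `log(0)`; the function names and sample values are illustrative.

```python
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Average of the squared differences, typical for regression."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(probs, one_hot, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    return -np.sum(one_hot * np.log(probs + eps))

mse = mean_squared_error(np.array([2.5, 0.0]), np.array([3.0, -1.0]))
ce = cross_entropy(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0]))
print(mse, ce)
```

The cross entropy here reduces to $-\ln(0.7)$, because only the probability of the correct class contributes.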

Summary:

On a simple example: 
- A single input x
- Two hidden layers with M and N neurons
- A single output

To calculate the number of multiplications needed for a single feedforward pass, we can break down the network to three steps:

Step 1: From the single input to the first hidden layer

Step 2: From the first hidden layer to the second hidden layer

Step 3: From the second hidden layer to the single output
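Assuming fully connected layers and ignoring biases and activation functions, Step 1 needs $M$ multiplications, Step 2 needs $M \cdot N$, and Step 3 needs $N$, for a total of $M + M \cdot N + N$. The short sketch below checks this count; `count_multiplications` is a hypothetical helper and the values of `M` and `N` are arbitrary.

```python
def count_multiplications(layer_sizes):
    """Count weight multiplications in one feedforward pass of a fully connected network."""
    # Each pair of adjacent layers of sizes a and b needs a * b multiplications
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

M, N = 4, 3
total = count_multiplications([1, M, N, 1])
print(total)  # 19, i.e. M + M*N + N
```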

### Backpropagation

To better picture how backpropagation works, let's look at how the weights change as the error gets smaller, since that is what we are interested in. What is happening?

![Gradient considerations](part-4_images/gradient_considerations.png)

In the example above, if the weight were on the other side of the valley, it would have to become smaller in order to reach the bottom.

![Gradient considerations 2](part-4_images/gradient_considerations2.png)

A closer look at updating the weight.

![Weight update](part-4_images/weight_update.png)
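The valley picture can be sketched with plain gradient descent on a single weight, assuming the usual update rule $W_{new} = W_{previous} - \alpha \frac{\partial E}{\partial W}$. The error function $E(w) = (w - 2)^2$, the learning rate, and the starting points are hypothetical.

```python
def gradient_descent(w, grad_fn, learning_rate=0.1, steps=50):
    """Repeatedly move w against the gradient of the error."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def error_gradient(w):
    # Derivative of the hypothetical error valley E(w) = (w - 2)^2, minimum at w = 2
    return 2.0 * (w - 2.0)

from_left = gradient_descent(-1.0, error_gradient)   # gradient is negative, so w increases
from_right = gradient_descent(5.0, error_gradient)   # gradient is positive, so w decreases
print(from_left, from_right)
```

Starting on either side of the valley, the sign of the gradient pushes the weight toward the bottom.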

As a summary, backprop involves two steps:
- calculate the partial derivative of the error $E$ with respect to each weight
- adjust the weights according to the calculated value of $\Delta W_{ij}$

where 
- the superscript $(k)$ indicates that the weight connects layer $k$ to layer $k+1$
- the updated value $\Delta W_{ij}^k$ is obtained from the gradient calculation


![Backpropagation summary](part-4_images/backpropagation_summary.png)


The chain of thought in the weight updating process is as follows:

To update the weights, we need the network error. To find the network error, we need the network output, and to find the network output we need the value of the hidden layer.
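This chain can be traced in one training step of a tiny network. The sketch below assumes one hidden layer with `tanh`, a linear output, the squared error $E = \frac{1}{2}(d - y)^2$, and a small learning rate; all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0.5, -1.0, 2.0]])          # 1 x 3 input (hypothetical values)
d = np.array([[1.0]])                     # desired output
W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
alpha = 0.01                              # learning rate

def forward(x, W1, W2):
    h = np.tanh(x @ W1)   # hidden layer values
    y = h @ W2            # network output
    return h, y

def error(y, d):
    return 0.5 * np.sum((d - y) ** 2)

# Feedforward: hidden layer -> output -> error
h, y = forward(x, W1, W2)
E_before = error(y, d)

# Backpropagation: gradients of E with respect to each weight matrix
delta_y = y - d                              # dE/dy
grad_W2 = h.T @ delta_y                      # dE/dW2
delta_h = (delta_y @ W2.T) * (1 - h ** 2)    # error passed back through tanh
grad_W1 = x.T @ delta_h                      # dE/dW1

# Weight update
W1 = W1 - alpha * grad_W1
W2 = W2 - alpha * grad_W2

_, y_new = forward(x, W1, W2)
E_after = error(y_new, d)
print(E_before, E_after)
```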

