# Recurrent NNs for sequence modeling

## 1. RNN model

We have done NLP before, looking at the data sets with bank complaints. Why are "regular" neural networks sometimes insufficient for NLP and other sequence-type problems?
- We created vectors of 0s and 1s denoting whether certain words occur in a given bank complaint, but no information about the order of the words is used!
- In sequence problems, inputs and outputs can have different lengths for different examples (i).
- A regular neural network does not share features learned across different positions in the text.
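
To make the first point concrete, here is a minimal sketch of the word-order problem. The two sentences and the helper `bag_of_words` are made up for illustration and are not taken from the bank complaints data:

```python
def bag_of_words(sentence, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the sentence."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical toy sentences: same words, different meaning.
s1 = "the bank charged the customer"
s2 = "the customer charged the bank"

vocabulary = sorted(set(s1.split()) | set(s2.split()))
v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)

# Both sentences map to the same vector, so the word order is lost.
print(vocabulary)
print(v1, v2, v1 == v2)
```

A model that only sees `v1` and `v2` has no way of telling who charged whom.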

Take the first word $x^{<1>}$, feed it into an NN layer, and try to predict $\hat y^{<1>}$. The activation is then passed on to the next time step, so previous words also have an effect on the output for words later in the sequence. The same weights $w_{ax}, w_{aa}$ and $w_{ya}$ are used at every time step.

![title](RNN_2.png)

Disadvantage: the network only uses words earlier in the sequence. E.g.:

"In Europe, people tend to use square meters instead of square feet."

The words "meters" and "feet" would have been useful to identify that, in this case, "square" is not a physical location.

$a^{<0>}$ = vector of zeros

$a^{<1>} = g(w_{aa}a^{<0>} +w_{ax}x^{<1>}+b_a )$, where $g$ is typically tanh or ReLU

$\hat y^{<1>} = g(w_{ya}a^{<1>} + b_y )$, where $g$ is a sigmoid (not the same activation function as in the line above)

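Simpler notation:

$a^{<t>} = g(w_{a}[a^{<t-1>},x^{<t>}]+b_a )$

$\hat y^{<t>} = g(w_{y}a^{<t>}+b_y )$

Here $[a^{<t-1>},x^{<t>}]$ stacks the two vectors on top of each other, the matrix $w_a=[w_{aa} \mid w_{ax}]$ places $w_{aa}$ and $w_{ax}$ side by side, and $w_y = w_{ya}$. This way $w_{a}[a^{<t-1>},x^{<t>}] = w_{aa}a^{<t-1>} + w_{ax}x^{<t>}$.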
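
A quick numerical check of the compact notation, as a minimal NumPy sketch. The sizes `n_a` and `n_x`, the random weights and the random vectors are made up for illustration, and tanh is assumed for $g$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 3  # hypothetical sizes: hidden units, input features

w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
b_a = np.zeros(n_a)

a_prev = rng.normal(size=n_a)  # any previous activation (a^{<0>} itself would be zeros)
x_t = rng.normal(size=n_x)     # stand-in for the current input

# Separate-weights form: g(w_aa a_prev + w_ax x_t + b_a)
a_separate = np.tanh(w_aa @ a_prev + w_ax @ x_t + b_a)

# Compact form: weights side by side, vectors stacked on top of each other.
w_a = np.hstack([w_aa, w_ax])            # shape (n_a, n_a + n_x)
stacked = np.concatenate([a_prev, x_t])  # shape (n_a + n_x,)
a_compact = np.tanh(w_a @ stacked + b_a)

print(np.allclose(a_separate, a_compact))
```

Both forms give the same activation; the compact one just needs a single matrix multiplication.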


"Backpropagation through time": a loss is computed at each time step (each vertical slice in the image above), and the total loss is the sum over all time steps. Backpropagation then runs from right to left, i.e. in the opposite direction of the forward pass.

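The summed loss can be sketched as follows. The notes do not fix a particular loss function, so binary cross-entropy is an assumption here (it matches a sigmoid output), and the predictions and labels are made up:

```python
import numpy as np

def step_loss(y_hat, y):
    """Binary cross-entropy for a single time step (assumed loss function)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical predictions and labels for a sequence of 4 time steps.
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
y = np.array([1.0, 0.0, 1.0, 0.0])

losses = step_loss(y_hat, y)  # one loss per time step
total_loss = losses.sum()     # the quantity that is backpropagated through time
print(losses, total_loss)
```
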
## 2. Different types of architectures

- Many-to-many --> location identifier
- Many-to-one --> text classifier (good vs. bad review)


![title](RNN_manytoone.png)

- One-to-many --> music generation

![title](RNN_onetomany.png)

- Many-to-many, but with different input and output lengths --> e.g. machine translation

![title](RNN_manytomany.png)
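
The difference between many-to-many (equal lengths) and many-to-one comes down to which outputs you keep. A minimal sketch, assuming a basic tanh RNN with a sigmoid output; `rnn_forward`, the sizes and the random weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, w_aa, w_ax, w_ya, b_a, b_y):
    """Run a basic RNN over a sequence; return one prediction per time step."""
    a = np.zeros(w_aa.shape[0])  # a^{<0>}: vector of zeros
    y_hats = []
    for x_t in xs:
        a = np.tanh(w_aa @ a + w_ax @ x_t + b_a)
        y_hats.append(sigmoid(w_ya @ a + b_y))
    return np.array(y_hats)

rng = np.random.default_rng(1)
n_a, n_x, T = 5, 3, 6  # hypothetical sizes: hidden units, features, time steps
params = dict(
    w_aa=rng.normal(size=(n_a, n_a)) * 0.1,
    w_ax=rng.normal(size=(n_a, n_x)) * 0.1,
    w_ya=rng.normal(size=(1, n_a)) * 0.1,
    b_a=np.zeros(n_a),
    b_y=np.zeros(1),
)
xs = rng.normal(size=(T, n_x))  # made-up input sequence

y_hats = rnn_forward(xs, **params)
many_to_many = y_hats     # keep the output of every time step
many_to_one = y_hats[-1]  # keep only the output of the last time step
print(many_to_many.shape, many_to_one.shape)
```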

Sources:

https://www.coursera.org/learn/nlp-sequence-models/lecture/gw1Xw/language-model-and-sequence-generation

https://www.coursera.org/learn/nlp-sequence-models/lecture/MACos/sampling-novel-sequences


## 3. Issues in sequence models

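- Vanishing gradients: the network struggles to learn long-range dependencies. E.g., a plural subject early in a sentence may require a plural verb much later in the sentence, but the gradient linking the two fades as it is propagated back through many time steps.
- Exploding gradients: you'll notice NaNs. The solution is gradient clipping. Exploding gradients are easy to detect and fix, so they tend to be less of a problem in recurrent networks than vanishing gradients.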
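
A minimal sketch of gradient clipping. Clipping by the overall norm is one common variant (clipping each element to a fixed range is another); the helper `clip_by_norm` and the gradient values are made up for illustration:

```python
import numpy as np

def clip_by_norm(gradients, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm <= max_norm:
        return gradients  # small gradients are left untouched
    scale = max_norm / total_norm
    return [g * scale for g in gradients]

# Hypothetical gradients, one of them very large.
grads = [np.array([3.0, 4.0]), np.array([1200.0])]
clipped = clip_by_norm(grads, max_norm=5.0)

clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)
```

The direction of the gradient is preserved; only its length is capped.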

### 3.1 Solution for vanishing gradients: Gated Recurrent Unit (GRU)

Add a memory cell C. A simplified GRU then looks like this:

In a GRU, $a^{<t>} = C^{<t>}$.

Next, we compute a candidate for replacing $C^{<t>}$, which is

$\tilde C ^{<t>} = \tanh(w_c[C^{<t-1>}, x^{<t>}]+b_c)$

$\Gamma_u = \sigma(w_u[C^{<t-1>}, x^{<t>}]+b_u)$
--> always between 0 and 1, and in practice often very close to either 0 or 1

$C ^{<t>} =\Gamma_u * \tilde C ^{<t>} + (1- \Gamma_u)* C^{<t-1>}$

The update gate $\Gamma_u$ is used to decide whether or not to update! If the update gate is 0, $C^{<t>}$ will not be updated and simply keeps the value $C^{<t-1>}$.

Depending on how many hidden activation values you have, $C^{<t>}, \tilde C^{<t>}$ and $\Gamma_u$ will be vectors of values, and the multiplications ($*$) in the last equation above are element-wise.
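
A minimal NumPy sketch of one step of the simplified GRU, following the three equations above. The function name, sizes and random weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_gru_step(c_prev, x_t, w_c, b_c, w_u, b_u):
    """One step of the simplified GRU (update gate only)."""
    stacked = np.concatenate([c_prev, x_t])           # [C^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(w_c @ stacked + b_c)            # candidate value
    gamma_u = sigmoid(w_u @ stacked + b_u)            # update gate, between 0 and 1
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev  # element-wise mix of new and old
    return c_t, gamma_u

rng = np.random.default_rng(2)
n_c, n_x = 4, 3  # hypothetical sizes
w_c = rng.normal(size=(n_c, n_c + n_x))
w_u = rng.normal(size=(n_c, n_c + n_x))
c_prev = rng.normal(size=n_c)
x_t = rng.normal(size=n_x)

# A very negative gate bias pushes gamma_u towards 0, so the memory cell is kept.
c_keep, gate = simple_gru_step(c_prev, x_t, w_c, np.zeros(n_c), w_u, np.full(n_c, -50.0))
print(np.allclose(c_keep, c_prev))
```

Because the old value can be carried over almost unchanged across many steps, information (and its gradient) survives over long distances.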

A full GRU adds another gate, the relevance (reset) gate $\Gamma_r$, which tells us how relevant $C^{<t-1>}$ is for computing the candidate $\tilde C ^{<t>}$. The new equations are:

$\tilde C ^{<t>} = \tanh(w_c[\Gamma_r * C^{<t-1>}, x^{<t>}]+b_c)$

$\Gamma_u = \sigma(w_u[C^{<t-1>}, x^{<t>}]+b_u)$

$C ^{<t>} =\Gamma_u * \tilde C ^{<t>} + (1- \Gamma_u)* C^{<t-1>}$

$\Gamma_r = \sigma(w_r[C^{<t-1>}, x^{<t>}]+b_r)$
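
The full GRU step can be sketched in the same way. This is a minimal illustration under made-up sizes and random weights; the dictionary `p` and the function `gru_step` are hypothetical names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    """One step of the full GRU; p is a dict of weights and biases."""
    stacked = np.concatenate([c_prev, x_t])            # [C^{<t-1>}, x^{<t>}]
    gamma_r = sigmoid(p["w_r"] @ stacked + p["b_r"])   # relevance gate
    gamma_u = sigmoid(p["w_u"] @ stacked + p["b_u"])   # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t])    # [Gamma_r * C^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(p["w_c"] @ gated + p["b_c"])     # candidate value
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev  # new memory cell C^{<t>}

rng = np.random.default_rng(3)
n_c, n_x, T = 4, 3, 5  # hypothetical sizes
p = {}
for gate in ("r", "u", "c"):
    p["w_" + gate] = rng.normal(size=(n_c, n_c + n_x)) * 0.1
    p["b_" + gate] = np.zeros(n_c)

c = np.zeros(n_c)                      # initial memory cell
for x_t in rng.normal(size=(T, n_x)):  # made-up input sequence
    c = gru_step(c, x_t, p)
print(c)
```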


https://www.coursera.org/learn/nlp-sequence-models/lecture/agZiL/gated-recurrent-unit-gru

### 3.2 Solution for vanishing gradients: Long Short Term Memory (LSTM)

In an LSTM, we no longer have $a^{<t>}=C ^{<t>}$, so the candidate and the gates are computed from $a^{<t-1>}$:

$\tilde C ^{<t>} = \tanh(w_c[a^{<t-1>}, x^{<t>}]+b_c)$

$\Gamma_u = \sigma(w_u[a^{<t-1>}, x^{<t>}]+b_u)$

$\Gamma_f = \sigma(w_f[a^{<t-1>}, x^{<t>}]+b_f)$

$\Gamma_o = \sigma(w_o[a^{<t-1>}, x^{<t>}]+b_o)$

$C ^{<t>} = \Gamma_u* \tilde C ^{<t>} + \Gamma_f* C ^{<t-1>}$

$a ^{<t>} = \Gamma_o* \tanh C ^{<t>}$

Here $\Gamma_u, \Gamma_f$ and $\Gamma_o$ are the update, forget and output gates, respectively.
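
A minimal NumPy sketch of one LSTM step, assuming that the candidate and all three gates read $[a^{<t-1>}, x^{<t>}]$ and that each gate has its own weights, as in the standard LSTM formulation. Sizes, weights and the names `lstm_step` and `p` are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; p is a dict of weights and biases."""
    stacked = np.concatenate([a_prev, x_t])           # [a^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(p["w_c"] @ stacked + p["b_c"])  # candidate value
    gamma_u = sigmoid(p["w_u"] @ stacked + p["b_u"])  # update gate
    gamma_f = sigmoid(p["w_f"] @ stacked + p["b_f"])  # forget gate
    gamma_o = sigmoid(p["w_o"] @ stacked + p["b_o"])  # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev        # new memory cell
    a_t = gamma_o * np.tanh(c_t)                      # new activation
    return a_t, c_t

rng = np.random.default_rng(4)
n_a, n_x, T = 4, 3, 5  # hypothetical sizes
p = {}
for gate in ("c", "u", "f", "o"):
    p["w_" + gate] = rng.normal(size=(n_a, n_a + n_x)) * 0.1
    p["b_" + gate] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)    # initial activation and memory cell
for x_t in rng.normal(size=(T, n_x)):  # made-up input sequence
    a, c = lstm_step(a, c, x_t, p)
print(a, c)
```

Unlike the GRU, the update and forget gates are independent here, so the cell can add new information without being forced to discard the same share of the old value.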

### 3.3 Which one to use?

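GRU: simpler, with fewer gates, which makes it cheaper to compute and easier to scale to bigger networks.

LSTM: more powerful and more flexible, thanks to its separate update, forget and output gates.

The LSTM is generally considered the default choice, even though it is more complex.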
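
To put a rough number on "simpler", the sketch below counts only the weights and biases that appear in the equations of this section (three weight matrices for the full GRU, four for the LSTM). The layer sizes are made up, and library implementations may count parameters slightly differently:

```python
def gate_params(n_a, n_x):
    """Parameters in one weight matrix of shape (n_a, n_a + n_x) plus its bias."""
    return n_a * (n_a + n_x) + n_a

def gru_params(n_a, n_x):
    return 3 * gate_params(n_a, n_x)  # w_c, w_u, w_r

def lstm_params(n_a, n_x):
    return 4 * gate_params(n_a, n_x)  # w_c, w_u, w_f, w_o

n_a, n_x = 128, 64  # hypothetical layer sizes
print(gru_params(n_a, n_x), lstm_params(n_a, n_x))  # 74112 98816
```

Under this count, a GRU layer has three quarters of the parameters of an LSTM layer of the same size.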

### 3.4 Other networks 

Bi-directional RNN: uses information from the entire sequence, both before and after the current position, not just what precedes it!
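
A minimal sketch of the bi-directional idea, assuming a basic tanh RNN in each direction: one pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the two activations are concatenated per time step. The helper names, sizes and weights are hypothetical:

```python
import numpy as np

def run_rnn(xs, w_aa, w_ax, b_a):
    """Return the activations of a basic tanh RNN for every time step."""
    a = np.zeros(w_aa.shape[0])
    activations = []
    for x_t in xs:
        a = np.tanh(w_aa @ a + w_ax @ x_t + b_a)
        activations.append(a)
    return np.array(activations)  # shape (T, n_a)

def init_params(rng, n_a, n_x):
    """Random weights for one direction (hypothetical initialisation)."""
    w_aa = rng.normal(size=(n_a, n_a)) * 0.5
    w_ax = rng.normal(size=(n_a, n_x)) * 0.5
    return w_aa, w_ax, np.zeros(n_a)

rng = np.random.default_rng(5)
n_a, n_x, T = 4, 3, 5  # hypothetical sizes
fwd_params = init_params(rng, n_a, n_x)
bwd_params = init_params(rng, n_a, n_x)
xs = rng.normal(size=(T, n_x))  # made-up input sequence

a_fwd = run_rnn(xs, *fwd_params)               # left to right
a_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # right to left, put back in time order
a_bi = np.concatenate([a_fwd, a_bwd], axis=1)  # shape (T, 2 * n_a)
print(a_bi.shape)
```

The backward half of `a_bi[0]` depends on every later input, which is exactly what the "square meters" example needs.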

Deep RNNs: several recurrent layers stacked vertically, with the activations of one layer serving as the inputs to the next.
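
A minimal sketch of stacking, again assuming basic tanh layers. The layer sizes, weights and the helper `run_layer` are made up for illustration:

```python
import numpy as np

def run_layer(inputs, w_aa, w_ax, b_a):
    """Run one recurrent tanh layer over a sequence; return all its activations."""
    a = np.zeros(w_aa.shape[0])
    outputs = []
    for x_t in inputs:
        a = np.tanh(w_aa @ a + w_ax @ x_t + b_a)
        outputs.append(a)
    return np.array(outputs)  # shape (T, n_a)

rng = np.random.default_rng(6)
T, n_x = 5, 3
layer_sizes = [6, 5, 4]         # hypothetical number of hidden units per layer
xs = rng.normal(size=(T, n_x))  # made-up input sequence

layer_input = xs
for n_a in layer_sizes:
    n_in = layer_input.shape[1]  # the layer below determines the input size
    w_aa = rng.normal(size=(n_a, n_a)) * 0.1
    w_ax = rng.normal(size=(n_a, n_in)) * 0.1
    layer_input = run_layer(layer_input, w_aa, w_ax, np.zeros(n_a))

top_activations = layer_input   # activations of the top layer, one row per time step
print(top_activations.shape)
```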