# Recurrent NNs for sequence modeling

## 1. Why sequence models?

Sequence models such as recurrent neural networks (RNNs) have driven significant progress in areas like natural language processing and speech recognition. Some examples:

- Speech recognition: the input is an audio clip and the output is a sequence of words.
- Music generation: the output y is a sequence; the input x can be nothing, the first few notes, or...
- Sentiment classification: is the review good or bad? Or how many stars does it get?
- Machine translation: for example, Google Translate.
- ...

Consider an example. Imagine we have a dataset where each observation is a sentence, and for each sentence we want a model to detect which words refer to locations. Take the sentence:

"While we were in school, mom went shopping in the square."

This is a sentence with 11 words, or put differently, a sequence with 11 elements. If this is the first sentence in our dataset, we denote it $x^{(1)} = (x^{(1)<1>},...,x^{(1)<11>})$.

The label sequence $y^{(1)} = (y^{(1)<1>},...,y^{(1)<11>})$ would then be $(0,0,0,0,1,0,0,0,0,0,1)$, where a 1 marks a location word (here "school" and "square").

Because the sentence has eleven words, $T_x^{(1)} = T_y^{(1)} = 11$. These lengths describe only this particular sentence $x^{(1)}$. Since $T_x$ and $T_y$ differ from sentence to sentence, they need the superscript $(i)$ too!
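
The notation above can be made concrete with a small sketch. The whitespace tokenizer and the hand-written `location_words` set are assumptions for illustration only; in practice the labels come from an annotated dataset.

```python
import string

sentence = "While we were in school, mom went shopping in the square."

# Naive tokenization (an assumption): split on whitespace, strip punctuation.
tokens = [w.strip(string.punctuation).lower() for w in sentence.split()]

# Hand-written labels for this one sentence: 1 marks a location word.
location_words = {"school", "square"}
y = [1 if tok in location_words else 0 for tok in tokens]

T_x, T_y = len(tokens), len(y)
print(tokens)
print(y)         # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(T_x, T_y)  # 11 11
```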

## 2. One-hot representations

- As we've seen before, you build a dictionary from the 10,000 most frequent words in the training set, or use an online dictionary of commonly used words.
- Apply the dictionary to the text to create a one-hot representation for each word.
- Each vector contains 9,999 zeros and a single 1.
- A sentence with 11 words becomes 11 one-hot vectors of 10,000 units each.
- If a word is not in your 10,000-word dictionary, you create a new vector for it, typically one shared "unknown word" entry.
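
The steps above can be sketched as follows. The eight-word `vocab`, the `<UNK>` fallback entry for out-of-dictionary words, and the helper name `one_hot` are assumptions for illustration; a real dictionary would hold about 10,000 words.

```python
import numpy as np

# Toy vocabulary (an assumption); a real one would hold about 10,000 words.
vocab = ["<UNK>", "in", "mom", "school", "shopping", "square", "the", "went"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(word_to_index))
    # Words missing from the vocabulary fall back to the <UNK> entry.
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return vec

tokens = ["mom", "went", "shopping", "in", "the", "square"]
X = np.stack([one_hot(tok, word_to_index) for tok in tokens])
print(X.shape)  # (6, 8): one row per word, one column per vocabulary entry
```

Each row of `X` sums to 1, which matches the "all zeros and a single 1" description.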

## 3. RNN model

We have done NLP before, looking at the data sets with bank complaints. Why are "regular" neural networks sometimes insufficient for NLP and sequence-type problems?
- We created vectors of 0s and 1s denoting whether certain words appear in a given bank complaint, but no information about the order of the words is used!
- In this example, inputs and outputs can have different lengths for different examples $(i)$.
- Standard neural networks do not share features learned across different positions in the text.
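
To illustrate the first point, here is a minimal sketch showing that a binary word-presence vector cannot tell apart two sentences that contain the same words in a different order. The four-word `vocab` and the two sentences are made up for this example.

```python
import numpy as np

vocab = ["bank", "customer", "refunded", "the"]

def bag_of_words(sentence, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the sentence."""
    words = set(sentence.lower().split())
    return np.array([1 if w in words else 0 for w in vocab])

a = bag_of_words("the bank refunded the customer", vocab)
b = bag_of_words("the customer refunded the bank", vocab)
print(np.array_equal(a, b))  # True: the two sentences are indistinguishable
```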

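Take the first word $x^{<1>}$, feed it into a neural network layer, and try to predict $\hat y^{<1>}$. For the second word, the layer receives $x^{<2>}$ together with the activation from the first step, so previous words also affect the output for words later in the sequence. The network uses the weights $w_{ax}$ (input to activation), $w_{aa}$ (activation to activation), and $w_{ya}$ (activation to output), which are shared across all time steps.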

![title](RNN_2.png)

Disadvantage: the network only uses words earlier in the sequence. For example:

"In Europe, people tend to use square meters instead of square feet."

The words "meters" and "feet" come after "square", yet they would have been useful for identifying that in this case "square" is not a physical location.

$a^{<0>}$ is a vector of zeros.

$a^{<1>} = g(w_{aa}a^{<0>} + w_{ax}x^{<1>} + b_a)$, where $g$ is typically tanh or ReLU.

$\hat y^{<1>} = g(w_{ya}a^{<1>} + b_y)$, where $g$ is a sigmoid.
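
A minimal NumPy sketch of this forward pass, applied at every time step, might look as follows. The sizes `n_x`, `n_a`, `T_x`, the random initialisation, and the function name `rnn_forward` are assumptions for illustration; tanh is used for the activation and a sigmoid for the output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, T_x = 8, 4, 6   # vocabulary size, hidden units, sequence length (assumed)

# Randomly initialised parameters; in practice these are learned.
w_aa = rng.normal(scale=0.1, size=(n_a, n_a))
w_ax = rng.normal(scale=0.1, size=(n_a, n_x))
w_ya = rng.normal(scale=0.1, size=(1, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X):
    """Run the recurrence over X of shape (T_x, n_x); return a and y_hat per step."""
    a = np.zeros(n_a)                                # a^{<0>} is a vector of zeros
    activations, predictions = [], []
    for x_t in X:
        a = np.tanh(w_aa @ a + w_ax @ x_t + b_a)     # a^{<t>}
        y_hat = sigmoid(w_ya @ a + b_y)              # \hat y^{<t>}
        activations.append(a)
        predictions.append(y_hat[0])
    return np.array(activations), np.array(predictions)

# One-hot inputs: row t has a single 1 at a random word index.
X = np.eye(n_x)[rng.integers(0, n_x, size=T_x)]
A, Y_hat = rnn_forward(X)
print(A.shape, Y_hat.shape)  # (6, 4) (6,)
```

Note that the same `w_aa`, `w_ax`, and `w_ya` are reused at every step of the loop; only the activation `a` changes as the sequence is read.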

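Simpler notation:

$a^{<t>} = g(w_{a}[a^{<t-1>},x^{<t>}]+b_a )$

$\hat y^{<t>} = g(w_{y}a^{<t>}+b_y )$

Here the matrix $w_a = [w_{aa} \; w_{ax}]$ places $w_{aa}$ and $w_{ax}$ side by side, and $[a^{<t-1>}, x^{<t>}]$ stacks the two vectors on top of each other, so that $w_a[a^{<t-1>}, x^{<t>}] = w_{aa}a^{<t-1>} + w_{ax}x^{<t>}$. Likewise, $w_y$ is shorthand for $w_{ya}$.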
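
The equivalence between the stacked notation and the original two-term form can be checked numerically. This is a small sketch with assumed sizes and random values; it only verifies the algebra, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 4, 8  # assumed sizes

w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=n_a)
x_t = rng.normal(size=n_x)

# w_a places w_aa and w_ax side by side: shape (n_a, n_a + n_x).
w_a = np.hstack([w_aa, w_ax])
# [a_prev, x_t] stacks the two vectors into one of length n_a + n_x.
stacked = np.concatenate([a_prev, x_t])

separate = w_aa @ a_prev + w_ax @ x_t
combined = w_a @ stacked
print(np.allclose(separate, combined))  # True
```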

"Backpropagation through time": a loss is computed at each time step (each vertical slice in the image above), and the total loss is the sum over all time steps. Backpropagation then runs from right to left through the sequence.

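As a sketch of the summed loss, assume a binary cross-entropy loss at each time step, which fits the sigmoid output used above. The choice of loss, the function name `sequence_loss`, and the placeholder predictions are assumptions for illustration.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Sum of per-time-step binary cross-entropy losses."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    per_step = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return per_step.sum(), per_step

y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1])   # labels from the example sentence
y_hat = np.full(11, 0.5)                           # placeholder predictions (made up)
total, per_step = sequence_loss(y_hat, y)
print(total)  # 11 * ln(2), about 7.62
```

The gradient of `total` with respect to the shared weights collects a contribution from every time step, which is why the backward pass has to travel back through the whole sequence.
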
## 4. Different types of architectures

- Many-to-many --> location identifier
- Many-to-one --> text classifier (good vs. bad review)


![title](RNN_manytoone.png)

- One-to-many --> music generation

![title](RNN_onetomany.png)

- Many-to-many, but with different input and output lengths --> machine translation

![title](RNN_manytomany.png)
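
The difference between the first two architectures comes down to where the output layer is applied. The sketch below uses random stand-in activations rather than a trained RNN; the sizes and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T_x, n_a = 11, 4                      # assumed sequence length and hidden size

# Stand-in for the activations a^{<1>}, ..., a^{<T_x>} of an RNN.
A = np.tanh(rng.normal(size=(T_x, n_a)))
w_ya = rng.normal(size=n_a)
b_y = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Many-to-many (T_x == T_y): one prediction per time step.
y_many_to_many = sigmoid(A @ w_ya + b_y)

# Many-to-one: a single prediction read from the last activation only.
y_many_to_one = sigmoid(A[-1] @ w_ya + b_y)

print(y_many_to_many.shape, y_many_to_one.shape)  # (11,) ()
```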