# Sequence Models

This is the fifth and final course of the deep learning specialization at [Coursera](https://www.coursera.org/specializations/deep-learning) which is moderated by [deeplearning.ai](http://deeplearning.ai/). The course is taught by Andrew Ng.

Notes adapted from: https://github.com/mbadry1/DeepLearning.ai-Summary/tree/master/5-%20Sequence%20Models#back-propagation-with-rnns

## Table of contents
* [Course summary](#course-summary)


* [1. Recurrent Neural Networks](#recurrent-neural-networks)
  * [1.1 Why sequence models](#why-sequence-models)
  * [1.2 Notation](#notation)
  * [1.3 Recurrent Neural Network Model](#recurrent-neural-network-model)
  * [1.4 Backpropagation through time](#backpropagation-through-time)
  * [1.5 Different types of RNNs](#different-types-of-rnns)
  * [1.6 Language model and sequence generation](#language-model-and-sequence-generation)
  * [1.7 Sampling novel sequences](#sampling-novel-sequences)
  * [1.8 Vanishing gradients with RNNs](#vanishing-gradients-with-rnns)
  * [1.9 Gated Recurrent Unit (GRU)](#gated-recurrent-unit-gru)
  * [1.10 Long Short Term Memory (LSTM)](#long-short-term-memory-lstm)
  * [1.11 Bidirectional RNN](#bidirectional-rnn)
  * [1.12 Deep RNNs](#deep-rnns)
  * [1.13 Back propagation with RNNs](#back-propagation-with-rnns)


* [2. Natural Language Processing &amp; Word Embeddings](#natural-language-processing--word-embeddings)
    * [2.1 Introduction to Word Embeddings](#introduction-to-word-embeddings)
     * [Word Representation](#word-representation)
     * [Using word embeddings](#using-word-embeddings)
     * [Properties of word embeddings](#properties-of-word-embeddings)
     * [Embedding matrix](#embedding-matrix)
     
    * [2.2 Learning Word Embeddings: Word2vec &amp; GloVe](#learning-word-embeddings-word2vec--glove)
     * [Learning word embeddings](#learning-word-embeddings)
     * [Word2Vec](#word2vec)
     * [Negative Sampling](#negative-sampling)
     * [GloVe word vectors](#glove-word-vectors) 
    * [2.3 Applications using Word Embeddings](#applications-using-word-embeddings)
     * [Sentiment Classification](#sentiment-classification)
     * [Debiasing word embeddings](#debiasing-word-embeddings)


* [3. Sequence models &amp; Attention mechanism](#sequence-models--attention-mechanism)
  * [3.1 Various sequence to sequence architectures](#various-sequence-to-sequence-architectures)
     * [Basic Models](#basic-models)
     * [Picking the most likely sentence](#picking-the-most-likely-sentence)
     * [Beam Search](#beam-search)
     * [Refinements to Beam Search](#refinements-to-beam-search)
     * [Error analysis in beam search](#error-analysis-in-beam-search)
     * [BLEU Score](#bleu-score)
     * [Attention Model Intuition](#attention-model-intuition)
     * [Attention Model](#attention-model)
  * [3.2 Speech recognition - Audio data](#speech-recognition---audio-data)
     * [Speech recognition](#speech-recognition)
     * [Trigger Word Detection](#trigger-word-detection)
  * [3.3 Extras](#extras)
     * [Machine translation attention model (From notebooks)](#machine-translation-attention-model-from-notebooks)


## Course summary
Here is the course summary as given on the course [link](https://www.coursera.org/learn/nlp-sequence-models):

> This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and many others. 
>
> You will:
> - Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs and LSTMs.
> - Be able to apply sequence models to natural language problems, including text synthesis. 
> - Be able to apply sequence models to audio applications, including speech recognition and music synthesis.
>
> This is the fifth and final course of the Deep Learning Specialization.


# 1. Recurrent Neural Networks

## 1.1 Why sequence models
- Sequence models like RNNs and LSTMs have greatly transformed learning on sequences in the past few years.
- Examples of sequence data in applications:
  - Speech recognition (**sequence to sequence**):
    - X: wave sequence
    - Y: text sequence
  - Music generation (**one to sequence**):
    - X: nothing or an integer
    - Y: wave sequence
  - Sentiment classification (**sequence to one**):
    - X: text sequence
    - Y: integer rating from one to five
  - DNA sequence analysis (**sequence to sequence**):
    - X: DNA sequence
    - Y: DNA Labels
  - Machine translation (**sequence to sequence**):
    - X: text sequence (in one language)
    - Y: text sequence (in other language)
  - Video activity recognition (**sequence to one**):
    - X: video frames
    - Y: label (activity)
  - Named entity recognition (**sequence to sequence**):
    - X: text sequence
    - Y: label sequence
    - Can be used by search engines to index different types of words inside a text.
- All of these problems with different input and output (sequence or not) can be addressed as supervised learning with labeled data X, Y as the training set.


## 1.2 Notation
- In this section we will discuss the notation that we will use throughout the course.
- **Motivating example**:
  - Named entity recognition example:
    - X: "Harry Potter and Hermione Granger invented a new spell."
    - Y:   1   1   0   1   1   0   0   0   0
    - Both sequences have length 9. 1 means the word is part of a name, while 0 means it is not.
- We will index the first element of x by x<sup> &lt;1&gt; </sup>, the second x<sup>&lt;2&gt;</sup> and so on.
  - x<sup>&lt;1&gt;</sup> = Harry
  - x<sup>&lt;2&gt;</sup> = Potter
- Similarly, we will index the first element of y by y<sup>&lt;1&gt;</sup>, the second y<sup>&lt;2&gt;</sup> and so on.
  - y<sup>&lt;1&gt;</sup> = 1
  - y<sup>&lt;2&gt;</sup> = 1

- T<sub>x</sub> is the size of the input sequence and T<sub>y</sub> is the size of the output sequence.
  - T<sub>x</sub> = T<sub>y</sub> = 9 in the last example although they can be different in other problems.
- x<sup>(i)&lt;t&gt;</sup> is the t-th element of the input sequence of training example i. Similarly, y<sup>(i)&lt;t&gt;</sup> is the t-th element of the output sequence of training example i.
- T<sub>x</sub><sup>(i)</sup> is the input sequence length for training example i. It can differ across examples. Similarly, T<sub>y</sub><sup>(i)</sup> is the length of the output sequence of the i-th training example.


- **Representing words**:
    - We will now work in this course with **NLP** which stands for natural language processing. One of the challenges of NLP is how can we represent a word?

    1. We need a **vocabulary** list that contains all the words in our target sets.
        - Example:
            - [a ... And   ... Harry ... Potter ... Zulu]
            - Each word will have a unique index that it can be represented with.
            - The sorting here is in alphabetical order.
        - Vocabulary sizes in modern applications range from 30,000 to 50,000. 100,000 is not uncommon. Some of the bigger companies use even a million.
        - To build a vocabulary list, you can read all the texts you have and keep the m most frequent words, or search online for a list of the m most frequent words.
    2. Create a **one-hot encoding** sequence for each word in your dataset given the vocabulary you have created.
        - While converting, what if we meet a word that's not in the vocabulary?
        - We can add a token named `<UNK>` (which stands for unknown) to the vocabulary and use its index for the one-hot vector.
    - Full example:   
        ![](Images/01.png)

- The goal is, given this representation for x, to learn a mapping to the target output y using a sequence model, as a supervised learning problem.
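The vocabulary lookup and one-hot step described above can be sketched as follows. This is a minimal illustration with a made-up eight-word vocabulary (`vocab`, `word_to_index`, and `one_hot_sequence` are hypothetical names, not from the course); a real vocabulary would hold tens of thousands of words.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
vocab = ["<UNK>", "a", "and", "harry", "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_sequence(sentence, word_to_index):
    """Return a (T_x, vocab_size) array with one one-hot row per word."""
    unk = word_to_index["<UNK>"]
    # Very naive tokenization (lowercase, drop periods, split on spaces).
    words = sentence.lower().replace(".", "").split()
    # Words missing from the vocabulary fall back to the <UNK> index.
    indices = [word_to_index.get(w, unk) for w in words]
    x = np.zeros((len(indices), len(word_to_index)))
    x[np.arange(len(indices)), indices] = 1.0
    return x

x = one_hot_sequence(
    "Harry Potter and Hermione Granger invented a new spell.", word_to_index
)
print(x.shape)  # (9, 8): T_x = 9 words, vocabulary of 8 entries
```

"Hermione" and "Granger" are not in this toy vocabulary, so both rows point at the `<UNK>` index.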


## 1.3 Recurrent Neural Network Model
- Why not use a standard network for sequence tasks? There are two problems:
  - Inputs and outputs can have different lengths in different examples.
    - This can be solved for normal NNs by padding to the maximum length, but it's not a good solution.
  - It doesn't share features learned across different positions of the text/sequence.
    - Feature sharing like in CNNs can significantly reduce the number of parameters in your model. That's what we will do in RNNs.
- A recurrent neural network has neither of these two problems.
- Let's build an RNN that solves the **named entity recognition** task:   
    ![](images/rnn_struc.png)
  - In this problem T<sub>x</sub> = T<sub>y</sub>. In other problems where they aren't equal, the RNN architecture may be different.
  - a<sup>&lt;0&gt;</sup> is usually initialized with zeros, though it is sometimes initialized randomly.
  - There are three weight matrices here: W<sub>ax</sub>, W<sub>aa</sub>, and W<sub>ya</sub> with shapes:
    - W<sub>ax</sub>: (NoOfHiddenNeurons, n<sub>x</sub>)
    - W<sub>aa</sub>: (NoOfHiddenNeurons, NoOfHiddenNeurons)
    - W<sub>ya</sub>: (n<sub>y</sub>, NoOfHiddenNeurons)
- The weight matrix W<sub>aa</sub> controls how the activation from the previous timestep is carried into the current one; this is how the RNN maintains memory of earlier elements.
- A lot of papers and books draw the same architecture this way:  
  ![](Images/03.png)
  - It's harder to interpret. It's easier to work with the unrolled version of the drawing.
    


- The parameters (weights) are shared across EVERY timestep.
- So this means that when making the prediction for x<sup>&lt;3&gt;</sup>, you get information from x<sup>&lt;2&gt;</sup> and x<sup>&lt;1&gt;</sup>...
- HOWEVER, in the discussed RNN architecture, the current output y&#770;<sup>&lt;t&gt;</sup> depends only on the current and previous inputs and activations.
- Information does NOT flow backwards (UNIDIRECTIONAL)... but words at the end of a sentence often give context to words at the beginning of the sentence.
- Solution: bidirectional networks... which we will discuss later.
- Take this example: 'He said, "Teddy Roosevelt was a great president"'. Here Teddy is a person's name, but we know that from the word **president** that comes after Teddy, not from **He** and **said** that come before it.
- So a limitation of the discussed architecture is that it cannot learn from elements later in the sequence. To address this problem we will later discuss the **Bidirectional RNN** (BRNN).


- Now let's discuss the forward propagation equations on the discussed architecture:   
    ![](Images/04.png)

  - The activation function for a is usually tanh or ReLU; for y it depends on the task, e.g. sigmoid or softmax. In the named entity recognition task we will use sigmoid because we only have two classes.
- To help us develop more complex RNN architectures, the equations above need to be simplified a bit.
- **Simplified RNN notation**:   
    ![](Images/05.png)
- W<sub>a</sub> is W<sub>aa</sub> and W<sub>ax</sub> stacked horizontally.
- [a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>] is a<sup>&lt;t-1&gt;</sup> and x<sup>&lt;t&gt;</sup> stacked vertically.
- W<sub>a</sub> shape: (NoOfHiddenNeurons, NoOfHiddenNeurons + n<sub>x</sub>)
- [a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>] shape: (NoOfHiddenNeurons + n<sub>x</sub>, 1)

- So if a has 100 dimensions and x has 10,000 dimensions, then W<sub>aa</sub> is a 100x100 matrix, W<sub>ax</sub> is a 100x10,000 matrix, and W<sub>a</sub> is a 100x10,100 matrix.
- The advantage of this notation is that we can compress the two parameter matrices into ONE term.
![](images/rnn_eqs.png)
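A single forward step, written both with the separate matrices and with the stacked notation, might look like the sketch below. The sizes and random weights are toy values chosen for illustration, and softmax is used for the output here as an assumption (the named entity example in these notes would use sigmoid instead).

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 6, 2  # hidden units, input size, output size (toy values)

# Randomly initialized parameters, shared across all timesteps.
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(a_prev, x_t):
    """One timestep: new activation from a<t-1> and x<t>, then the prediction."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_hat_t

a_prev = np.zeros((n_a, 1))          # a<0> initialized with zeros
x_t = rng.normal(size=(n_x, 1))
a_t, y_hat_t = rnn_step(a_prev, x_t)

# Simplified notation: stack the weights horizontally and the inputs vertically.
W_a = np.hstack([W_aa, W_ax])        # shape (n_a, n_a + n_x)
stacked = np.vstack([a_prev, x_t])   # shape (n_a + n_x, 1)
a_t_simplified = np.tanh(W_a @ stacked + b_a)
```

Both forms compute the same activation; the stacked form just hides the two matrix products inside one.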


### Extra (from Lazy Programmer Course)

- Imagine we have a sequence of length 5. If we unfold the network so that there are NO RECURRENT CONNECTIONS, we get a feedforward network with FIVE HIDDEN LAYERS.
- It's as if h(0) is the input, and each x(t) is just some additional control signal at each step.
- The hidden-to-hidden weights (W<sub>h</sub>) are repeated at EVERY layer... it's like a deep network with the same shared weights at each layer.
- W<sub>x</sub> is also shared between each of the x values going into each layer.

![](images/rnn_1_unroll.png)


- h(t) = f ( [W<sub>h</sub><sup>T</sup>h(t-1)] + [W<sub>x</sub><sup>T</sup>x(t)] + b<sub>h</sub> )
    - Notice that the function f can be any of the usual hidden layer nonlinearities; it's a hyperparameter just like in other neural nets. ReLU can help mitigate vanishing gradients in RNNs.

- y(t) = softmax(W<sub>o</sub><sup>T</sup>h(t) + b<sub>o</sub>)
- f = sigmoid, tanh, relu, or whatever!

## 1.4 Backpropagation through time

### ForwardProp
- Let's see how backpropagation works with the RNN architecture.
- Usually deep learning frameworks do backpropagation automatically for you. But it's useful to know how it works in RNNs.
- Here is the graph:   
  ![](Images/06.png)
  - Where w<sub>a</sub>, b<sub>a</sub>, w<sub>y</sub>, and b<sub>y</sub> are shared across each element in a sequence.
  - the 'a' values refer to the ACTIVATION values
  - Same weight parameters are used for each timestep

### Calculate Loss
- We will use the cross-entropy loss function. First we calculate the element-wise loss (the loss associated with a single prediction for a single word at a single timestep).   
- Then we calculate the OVERALL LOSS for the entire sequence. This is just the SUM of the individual loss values for each timestep.
  ![](Images/07.png)
  - Where the first equation is the loss for one timestep, and the loss for the whole sequence is given by the summation of the single-timestep losses.
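As a rough sketch of the two-stage loss for the binary named entity example, assuming a sigmoid output per timestep: the predictions in `y_hat` are invented numbers for illustration, and `sequence_loss` is a hypothetical helper name.

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Binary cross-entropy per timestep, then summed over the sequence."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    # Element-wise loss: one value per timestep.
    per_step = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    # Overall loss: sum of the per-timestep losses.
    return per_step, per_step.sum()

# Labels for "Harry Potter and Hermione Granger invented a new spell."
y = [1, 1, 0, 1, 1, 0, 0, 0, 0]
# Made-up model outputs (probability that each word is part of a name).
y_hat = [0.9, 0.8, 0.1, 0.7, 0.9, 0.2, 0.1, 0.1, 0.05]

per_step, total = sequence_loss(y_hat, y)
```

Confident correct predictions contribute little to `total`; the least confident one (0.7 for "Hermione") contributes the most.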

### Graph with losses:   
  ![](Images/08.png)
- The backpropagation here is called **backpropagation through time** because gradients flow through the activation `a` from each sequence element to the previous one, as if going backwards in time.



### Extra (from Lazy Programmer Course)
- BPTT is a mental exercise only: we'll use Theano / TensorFlow to calculate gradients and do neural network training.
- Recall: backprop is just a fancy name for computing the gradients that gradient descent needs (via the chain rule). And BPTT is just a fancy name for backprop on the unrolled RNN!
- We still use the fundamental update rule:
    - W <- W - learning_rate\*T.grad(cost, W)

Below, biases have been dropped to make things a little simpler:
![](images/rnn_1_grads.png)

**Gradients: dJ/dW<sub>o</sub> is calculated as normal, but dJ/dW<sub>h</sub> and dJ/dW<sub>x</sub> need more attention**
- The output weights occur AFTER the recurrence, so we don't need to consider time.
- But if we want to go back in time 3 steps, W<sub>h</sub> and W<sub>x</sub> occur multiple times, so we need to be careful!
- We need the product rule from calculus to handle this; the rest is just calculating cross-entropy and softmax gradients, which should be standard by now!

Product rule from calculus (used where both terms depend on Wh):
![](images/rnn_1_grads1.png)

**Recall that a weight gets updated via an error signal from whatever nodes it influences in the forward direction. The influence is shown below in green.**
- The upward arrows ONLY MATTER if that timestep's output is part of the loss. This differs depending on whether you have a target for every timestep or one for the ENTIRE sequence. 
- The LEFT-RIGHT arrows must ALWAYS be included, because these go through the hidden-to-hidden weights.
![](images/rnn_1_unroll.png)

#### Vanishing / Exploding Gradient
- Due to the chain rule, the same things will be multiplied over and over
- Result: approach 0 or infinity very quickly
- To address this, we can use **Gradient Clipping**
![](images/rnn_1_gradclip.png)

#### Truncated BPTT
- Derivatives wrt W<sub>h</sub> and W<sub>x</sub> depend on every single timestep, so the calculation takes longer and longer as the sequence grows.
- Common approximation: stop after a certain number of steps.
- The disadvantage is that you won't incorporate error from longer time spans.
- E.g. if you don't care about dependencies past 3 timesteps, truncate at 3.

## 1.5 Different types of RNNs
- So far we have seen only one RNN architecture, in which T<sub>x</sub> equals T<sub>y</sub>. In some other problems they may not be equal, so we need different architectures.
- The architecture we have described before is called **Many to Many**.
- The ideas in this section were inspired by Andrej Karpathy's [blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). This image shows all the types:   
  ![](Images/09.jpg)

#### Many to One
- In the sentiment analysis problem, X is a text while Y is an integer that ranges from 1 to 5. The RNN architecture for that is **Many to One**, as in Andrej Karpathy's image.   
  ![](Images/10.png)

#### One to Many
- A **One to Many** architecture application would be music generation.  
  ![](Images/11.png)
  - Note that starting from the second timestep we are **feeding the generated output back to the network (instead of a new x value).**

#### Encoder - Decoder (Many to Many)
- There is another interesting **Many To Many** architecture. In applications like machine translation, the input and output sequences have different lengths in most cases. So an alternative _Many To Many_ architecture that fits translation would be as follows:   
  ![](Images/12.png)
  - With this architecture, T<sub>y</sub> and T<sub>x</sub> can have different lengths.
  - There are two parts in this architecture, an encoder and a decoder. The encoder encodes the input sequence into one vector and feeds it to the decoder to generate the outputs. 
  - Encoder and decoder have different weight matrices.

#### Summary of RNN types:   
   ![](Images/12_different_types_of_rnn.jpg)
- One to one is just a standard neural net
- There is another architecture which is the **attention** architecture which we will talk about in chapter 3.



## 1.6 Language model and sequence generation
#### What is a language model
  - Let's say we are solving a speech recognition problem and someone says a sentence that can be interpreted as two different sentences:
    - The apple and **pair** salad
    - The apple and **pear** salad
  - **Pair** and **pear** sound exactly the same, so how would a speech recognition application choose between the two?
  - That's where the language model comes in. It gives a probability for each of the two sentences, and the application picks the more probable one.
- The job of a language model is to give a probability of any given sequence of words.

#### How to build language models with RNNs?
- The first thing is to get a **training set**: a large corpus of text in the target language.
- Then tokenize this training set by building the vocabulary and one-hot encoding each word.
- Add an end of sentence token `<EOS>` to the vocabulary and append it to each converted sentence. 
- Also, use the token `<UNK>` for unknown words... sometimes test data will include words that are NOT in the vocab; this is when you use the `<UNK>` token.

#### Example:
Given the sentence **"Cats average 15 hours of sleep a day. `<EOS>`"** (ignore the period, though you can make it a token if you like)
  - At training time we will use the following network.
  - a<sup>&lt;0&gt;</sup> is set to a vector of zeros. 
  - The first output y&#770;<sup>&lt;1&gt;</sup> is a list of V probabilities (it's the softmax output over the vocabulary)... and we want it to put high probability on y<sup>&lt;1&gt;</sup>, the first word in the training sentence.
  - So each step in the RNN looks at the preceding words and predicts the probability distribution of the current word... you MULTIPLY ALL OF THESE PROBABILITIES to get the sentence probability!
  - **So the RNN learns to predict one word at a time going from left to right**
    ![](Images/13.png)
  - The loss function is defined by cross-entropy loss:   
    ![](Images/14.png)
    - `i` indexes the entries of the softmax output (the words in the vocabulary) and `t` the timesteps.
- To use this model:
  1.  For predicting the chance of the **next word**, we feed the sentence to the RNN, take the final y&#770;<sup>&lt; t &gt;</sup> softmax vector, and sort it by probability.
  2.  For taking the **probability of a sentence**, we compute this:
      - p(y<sup>&lt; 1 &gt;</sup>, y<sup>&lt; 2 &gt;</sup>, y<sup>&lt; 3 &gt;</sup>) = p(y<sup>&lt; 1 &gt;</sup>) \* p(y<sup>&lt; 2 &gt;</sup> | y<sup>&lt; 1 &gt;</sup>) \* p(y<sup>&lt; 3 &gt;</sup> | y<sup>&lt; 1 &gt;</sup>, y<sup>&lt; 2 &gt;</sup>)
      - This is simply feeding the sentence into the RNN and multiplying the probabilities (outputs).
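The chain-rule product above can be sketched numerically. The three conditional probabilities below are invented placeholders standing in for the softmax outputs an RNN would assign to the actual words; summing logs is shown alongside because products of many small probabilities underflow quickly for long sentences.

```python
import math

# Hypothetical values for p(y<1>), p(y<2> | y<1>), p(y<3> | y<1>, y<2>),
# i.e. the probability the model gave to each word that actually occurred.
step_probs = [0.02, 0.1, 0.3]

# Sentence probability: product of the per-step conditional probabilities.
sentence_prob = math.prod(step_probs)

# Numerically safer equivalent: sum the log-probabilities instead.
sentence_log_prob = sum(math.log(p) for p in step_probs)
```

Comparing two candidate sentences (such as the pair/pear example) then comes down to comparing their `sentence_log_prob` values.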

![](images/rnn_nots.jpg)
- **Question:** _How does this compare with a plain bigram (first-order Markov) model? Well, for starters, in the Markov model we only calculate the probabilities as a function of t-1 (the current word depends ONLY on the previous word). In this model, all previous words are implicitly incorporated into the probabilities._

## 1.7 Sampling Novel Sequences
- Recall that a sequence model models the probability of any particular series of words: P(y<sup>&lt;1&gt;</sup>, y<sup>&lt;2&gt;</sup>, y<sup>&lt;3&gt;</sup>, y<sup>&lt;4&gt;</sup>)
- After a language model is trained, you can sample novel sequences from it to check what it has learned.
- This essentially means we sample from the probability distribution the model has learned and get random new words (creating novel sequences).
- Let's see the steps for sampling a novel sequence from a trained language model:

### Word Level Language Model
  1. Given this model:   
     ![](Images/15.png)
  2. We first pass a<sup>&lt;0&gt;</sup> = zeros vector, and x<sup>&lt;1&gt;</sup> = zeros vector.
      - We pass 0 as x<sup>&lt;1&gt;</sup> because we want the raw probability of getting any first word... So 'The' or 'This' would likely be picked (assuming our trained probability distribution is solid).
  3. Then we choose a prediction randomly from the distribution given by y&#770;<sup>&lt;1&gt;</sup>. For example it could be "The".
     - In numpy this can be implemented using: `numpy.random.choice(...)`
     - This is the line where you get a random beginning of the sentence each time you sample a novel sequence.
  4. We pass the last predicted word as the next input, together with the calculated a<sup>&lt;1&gt;</sup>.
  5. We keep repeating steps 3 & 4 for a fixed length or until we get the `<EOS>` token.
      - You could keep sampling (adding timesteps) until you get `<EOS>`, or you could just say you want sentences of length N (or a random length).
  6. You can reject any `<UNK>` token if you don't want it in your output.
      - Keep sampling until you get something other than `<UNK>`.
      
![](images/rnn_nots2.jpg)
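The sampling steps can be sketched as a loop like the one below. The weights here are random stand-ins for a trained model, and the six-token vocabulary is made up, so the output is gibberish; the point is only the control flow (zero inputs at the start, sample, feed back, stop on `<EOS>`, reject `<UNK>`). It uses NumPy's `Generator.choice`, the newer equivalent of `numpy.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<EOS>", "<UNK>", "the", "cat", "sat", "down"]  # hypothetical toy vocabulary
n_a, n_v = 8, len(vocab)

# Random stand-ins for trained parameters (assumption: a basic tanh RNN).
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_v))
W_ya = rng.normal(scale=0.5, size=(n_v, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(n_v)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(max_len=10):
    eos, unk = vocab.index("<EOS>"), vocab.index("<UNK>")
    a = np.zeros(n_a)  # a<0> = zeros
    x = np.zeros(n_v)  # x<1> = zeros
    words = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        p = softmax(W_ya @ a + b_y)
        idx = rng.choice(n_v, p=p)      # sample a word index from y_hat
        while idx == unk:               # reject <UNK> and sample again
            idx = rng.choice(n_v, p=p)
        if idx == eos:                  # stop at end of sentence
            break
        words.append(vocab[idx])
        x = np.zeros(n_v)               # feed the sampled word back as a one-hot input
        x[idx] = 1.0
    return words

words = sample_sequence(max_len=10)
```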

### Character Level Language Model
- So far we have built a word-level language model. It's also possible to implement a **character-level** language model.
- In the character-level language model, the vocabulary will contain `[a-zA-Z0-9]`, punctuation, special characters and possibly the `<EOS>` token.
- Character-level language model has some pros and cons compared to the word-level language model
  - Pros:
    1. There will be no `<UNK>` token - it can create any word.
  - Cons:
    1. The main disadvantage is that you end up with much longer sequences. 
    2. Character-level language models are not as good as word-level language models at capturing long-range dependencies, i.e. how the earlier parts of the sentence affect the later parts.
    3. Also more computationally expensive and harder to train.
- The trend Andrew has seen in NLP is that for the most part, a word-level language model is still used, but as computers get faster there are more and more applications where people are, at least in some special cases, starting to look at more character-level models. Also, they are used in specialized applications where you might need to deal with unknown words or other vocabulary words a lot. Or they are also used in more specialized applications where you have a more specialized vocabulary.

## 1.8 Vanishing gradients with RNNs
- One of the problems with naive RNNs is that they run into the **vanishing gradient** problem.
- An RNN that processes a sequence of 10,000 time steps is effectively a network 10,000 layers deep, which is very hard to optimize.
- Let's take an example. Suppose we are working with language modeling problem and there are two sequences that model tries to learn:

  - "The **cat**, which already ate ..., **was** full"
  - "The **cats**, which already ate ..., **were** full"
  - Dots represent many words in between.

- What we need to learn here is that "was" goes with "cat" and "were" goes with "cats". This is an example of **long-term dependencies** in a sequence, where what comes much later in the sequence depends on much earlier timesteps. **The naive (basic) RNN is not very good at capturing very long-term dependencies like this.** 
- Generally, in a naive RNN, timesteps that are close to one another have a greater influence on each other... so t1 has a greater influence on t3 than it does on t17.


![](images/rnn_vg.png)

### Long-term dependencies
- As we have discussed for deep neural networks, deeper networks run into the vanishing gradient problem. The same happens with RNNs on long sequences.   
  - To learn the word "was", the gradient has to be propagated back through every earlier timestep. Repeatedly multiplying by numbers smaller than one makes the gradient vanish, while repeatedly multiplying by large numbers makes it explode.
  - Therefore some of your weights may not be updated properly.
  ![](Images/16.png)   


- In the problem we described, this means it's hard for the network to carry the error at "was" all the way back to "cat". So in this case, the network won't learn to track singular/plural subjects and pick the right verb form (was/were).

- The conclusion is that basic RNNs aren't good at **long-term dependencies**.

- > In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

- The _vanishing gradients_ problem tends to be a bigger problem for RNNs than the _exploding gradients_ problem. We will discuss how to solve it in the next sections.
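A deliberately simplified numeric sketch of why repeated multiplication causes trouble: it ignores the nonlinearity and uses a scaled identity matrix as a stand-in for W<sub>aa</sub>, so it only illustrates the multiplicative effect, not a real RNN gradient. The function name and the scales 0.9 and 1.1 are arbitrary choices for illustration.

```python
import numpy as np

def backpropagated_norm(scale, steps, n=5):
    """Norm of a gradient vector after `steps` multiplications by W^T."""
    W = scale * np.eye(n)   # toy recurrent matrix (assumption: scaled identity)
    grad = np.ones(n)
    for _ in range(steps):
        grad = W.T @ grad   # one step further back in time
    return np.linalg.norm(grad)

vanished = backpropagated_norm(scale=0.9, steps=100)  # shrinks towards zero
exploded = backpropagated_norm(scale=1.1, steps=100)  # grows very large
```

After 100 steps the first norm is many orders of magnitude smaller than the starting norm and the second is many orders larger, even though the two scales differ only slightly.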

### Gradient Clipping
- Exploding gradients are easy to spot because your weight values become `NaN` (they overflow the floating-point range). One way to solve exploding gradients is to apply **gradient clipping**: if the gradient is larger than some threshold, re-scale the gradient vector so that it is not too big. The gradients are thus clipped according to some maximum value.

  ![](Images/26.png)
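Two common ways to clip are sketched below: re-scaling by the overall norm, and clipping each element to a range. Which variant the figure above shows is not stated in these notes, so treat both as illustrations; the gradient names `dWaa` and `dba` and the thresholds are hypothetical.

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Re-scale all gradients together if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # never scale up
    return {name: g * scale for name, g in grads.items()}

def clip_by_value(grads, max_value):
    """Clip every gradient element into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

grads = {"dWaa": np.array([[3.0, 4.0]]), "dba": np.array([0.0])}
norm_clipped = clip_by_norm(grads, max_norm=1.0)     # direction kept, norm reduced to 1
value_clipped = clip_by_value(grads, max_value=3.5)  # only the 4.0 entry changes
```

Norm clipping preserves the direction of the gradient, while value clipping can change it; both keep a single huge gradient from blowing up the weights.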

### Extra
  - Solutions for the Exploding gradient problem:
    - Truncated backpropagation.
      - Don't backpropagate all the way back through the sequence.
      - Not optimal: errors from far-back timesteps are not used.
    - Gradient clipping.
  - Solution for the Vanishing gradient problem:
    - Weight initialization.
      - Like He initialization.
    - Echo state networks.
    - Use LSTM/GRU networks.
      - Most popular.
      - We will discuss it next.

## 1.9 Gated Recurrent Unit (GRU)
- GRU is an RNN type that can help solve the vanishing gradient problem and can remember the long-term dependencies.

- The basic RNN unit can be visualized to be like this:   
  ![](Images/17.png)

- We will represent the GRU with a similar drawing.

- Each layer in **GRUs** has a new variable `C`, which is the memory cell. It lets the unit decide whether to memorize something or not.

- In GRUs, C<sup>\<t></sup> = a<sup>\<t></sup>

- Equations of the GRUs:   
  ![](Images/18.png)
  - The update gate is between 0 and 1
    - To understand GRUs imagine that the update gate is either 0 or 1 most of the time.
  - So we update the memory cell based on the update gate, the candidate value, and the previous cell.

- Let's take the cat sentence example and apply it to understand these equations:

  - Sentence: "The **cat**, which already ate ........................, **was** full"

  - We will suppose that U is 0 or 1 and is a bit that tells us whether the memory cell should be overwritten, here with whether the subject is singular.

  - Splitting the sentence into words and tracking the values of C and U at each position:

    - | Word    | Update gate(U)             | Cell memory (C) |
      | ------- | -------------------------- | --------------- |
      | The     | 0                          | val             |
      | cat     | 1                          | new_val         |
      | which   | 0                          | new_val         |
      | already | 0                          | new_val         |
      | ...     | 0                          | new_val         |
      | was     | 1 (I don't need it anymore)| newer_val       |
      | full    | ..                         | ..              |
- Drawing for the GRUs   
  ![](Images/19.png)
  - Drawings like those in http://colah.github.io/posts/2015-08-Understanding-LSTMs/ are popular and make it easier to understand GRUs and LSTMs. But Andrew Ng finds it better to look at the equations.
- Because the update gate U can be very close to 0 (e.g. 0.00001), the memory cell value is carried across many timesteps almost unchanged, so GRUs suffer much less from the vanishing gradient problem.
  - In the equation this makes C<sup>\<t></sup> = C<sup>\<t-1></sup> in a lot of cases.
- Shapes:
  - a<sup>\<t></sup> shape is (NoOfHiddenNeurons, 1)
  - c<sup>\<t></sup> is the same as a<sup>\<t></sup>
  - c<sup>~\<t></sup> is the same as a<sup>\<t></sup>
  - u<sup>\<t></sup> is also the same dimensions of a<sup>\<t></sup>
- The multiplications in the equations are element-wise multiplications.
- What has been described so far is the simplified GRU unit. Let's now describe the full one:
  - The full GRU contains a new gate that is used when calculating the candidate C. The gate tells you how relevant C<sup>\<t-1></sup> is for computing the candidate for C<sup>\<t></sup>.
  - Equations:   
    ![](Images/20.png)
  - Shapes are the same
- So why do we use these architectures? Why not change them, add another gate, or use the simpler GRU instead of the full GRU? How do we know they will work? Over the years researchers have experimented with many different versions of these architectures, also trying to address the vanishing gradient problem. They have found that the full GRU is one of the best RNN architectures for many different problems. You can design your own, but keep in mind that GRUs and LSTMs are the standards.
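A single step of the full GRU might be sketched as follows, assuming the usual form of the equations (update gate, relevance gate, tanh candidate, and C<sup>\<t></sup> = a<sup>\<t></sup>). Parameter names such as `W_u` and `b_r` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    """One full-GRU step; returns the new memory cell (which is also a<t>)."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])   # relevance gate
    # The candidate uses the previous cell scaled by the relevance gate.
    c_tilde = np.tanh(p["W_c"] @ np.vstack([gamma_r * c_prev, x_t]) + p["b_c"])
    # Element-wise blend of the candidate and the previous cell.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t, gamma_u

rng = np.random.default_rng(0)
n_a, n_x = 3, 4  # toy sizes
p = {f"W_{g}": rng.normal(scale=0.1, size=(n_a, n_a + n_x)) for g in "urc"}
p.update({f"b_{g}": np.zeros((n_a, 1)) for g in "urc"})

c_prev = rng.normal(size=(n_a, 1))
x_t = rng.normal(size=(n_x, 1))
c_t, gamma_u = gru_step(c_prev, x_t, p)

# With a strongly negative update-gate bias the gate is ~0 and the cell is kept.
p_closed = dict(p, b_u=np.full((n_a, 1), -20.0))
c_kept, _ = gru_step(c_prev, x_t, p_closed)
```

The second call shows the memory behaviour described above: when the update gate is near 0, `c_kept` is almost identical to `c_prev`.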

## 1.10 Long Short Term Memory (LSTM)
- LSTM is the other type of RNN that lets you account for long-term dependencies. It's more powerful and general than GRU.
- In LSTM, C<sup>\<t></sup> != a<sup>\<t></sup>
- Here are the equations of an LSTM unit:   
  ![](Images/21.png)
- In GRU we have an update gate `U`, a relevance gate `r`, and a candidate cell variable C<sup>\~\<t></sup>, while in LSTM we have an update gate `U` (sometimes called the input gate I), a forget gate `F`, an output gate `O`, and a candidate cell variable C<sup>\~\<t></sup>.
- Drawings (inspired by http://colah.github.io/posts/2015-08-Understanding-LSTMs/):    
  ![](Images/22.png)
- Some variants of LSTM include:
  - LSTM with **peephole connections**.
    - The normal LSTM with C<sup>\<t-1></sup> included as an input to every gate.
- There isn't a universally superior choice between the LSTM and its variants. One of the advantages of GRU is that it's simpler and can be used to build much bigger networks, but the LSTM is more powerful and general.
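For comparison with the GRU, one LSTM step can be sketched like this, assuming the standard form without peephole connections (separate update, forget, and output gates, with a<sup>\<t></sup> computed from the cell through the output gate). Parameter names and sizes are again placeholders for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; returns the new activation a<t> and memory cell c<t>."""
    concat = np.vstack([a_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])  # update (input) gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])  # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])  # output gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])  # candidate cell value
    c_t = gamma_u * c_tilde + gamma_f * c_prev       # separate gates, unlike the GRU
    a_t = gamma_o * np.tanh(c_t)                     # so a<t> != c<t> in general
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 3, 4  # toy sizes
p = {f"W_{g}": rng.normal(scale=0.1, size=(n_a, n_a + n_x)) for g in "ufoc"}
p.update({f"b_{g}": np.zeros((n_a, 1)) for g in "ufoc"})

a_prev = np.zeros((n_a, 1))
c_prev = rng.normal(size=(n_a, 1))
x_t = rng.normal(size=(n_x, 1))
a_t, c_t = lstm_step(a_prev, c_prev, x_t, p)
```

Because the forget and update gates are independent, the LSTM can keep the old cell value and add new information at the same time, which the GRU's single update gate cannot do.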

## 1.11 Bidirectional RNN
- There are still some ideas that let you build much more powerful sequence models. One of them is bidirectional RNNs and another is deep RNNs.
- As we saw before, here is an example of the named entity recognition task:  
  ![](Images/23.png)
- Whether **Teddy** is part of a name cannot be learned from **He** and **said**, but can be learned from the word that follows, such as **bears**.
- BiRNNs fix this issue.
- Here is the BiRNN architecture:   
  ![](Images/24.png)
- Note, that BiRNN is an **acyclic graph**.
- Part of the forward propagation goes from left to right, and part from right to left. It learns from both sides.
- To make the prediction y&#770;<sup>\<t></sup> we use the two activations that come from the left and the right.
- The blocks here can be any RNN block including the basic RNNs, LSTMs, or GRUs.
- For a lot of NLP or text processing problems, a BiRNN with LSTM appears to be commonly used.
- The disadvantage of BiRNNs is that you need the entire sequence before you can process it. For example, in live speech recognition with a BiRNN you have to wait for the speaker to stop so that you have the entire sequence before making your predictions.
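A minimal sketch of the bidirectional forward pass, assuming basic tanh RNN blocks and a sigmoid output per timestep (as in the named entity example). The weights are random and the helper names `run_direction` and `birnn_forward` are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_direction(xs, W_aa, W_ax, b_a):
    """Run a basic tanh RNN over xs in the given order; return all activations."""
    a = np.zeros((W_aa.shape[0], 1))
    activations = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        activations.append(a)
    return activations

def birnn_forward(xs, fwd, bwd, W_y, b_y):
    forward = run_direction(xs, *fwd)                # left to right
    backward = run_direction(xs[::-1], *bwd)[::-1]   # right to left, re-aligned to t
    # Each prediction uses both activations for the same timestep.
    y_hat = [sigmoid(W_y @ np.vstack([f, b]) + b_y)
             for f, b in zip(forward, backward)]
    return forward, backward, y_hat

rng = np.random.default_rng(0)
n_a, n_x, T = 4, 3, 5  # toy sizes
fwd = (rng.normal(scale=0.5, size=(n_a, n_a)),
       rng.normal(scale=0.5, size=(n_a, n_x)), np.zeros((n_a, 1)))
bwd = (rng.normal(scale=0.5, size=(n_a, n_a)),
       rng.normal(scale=0.5, size=(n_a, n_x)), np.zeros((n_a, 1)))
W_y = rng.normal(size=(1, 2 * n_a))
b_y = np.zeros((1, 1))
xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]

forward, backward, y_hat = birnn_forward(xs, fwd, bwd, W_y, b_y)
```

Changing the last input leaves `forward[0]` untouched but changes `backward[0]`, which is how the first prediction gets to see later words. It also shows why the whole sequence must be available before any prediction is made.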

## 1.12 Deep RNNs
- In a lot of cases a standard one-layer RNN will solve your problem. But in some problems it's useful to stack some RNN layers to make a deeper network.
- For example, a deep RNN with 3 layers would look like this:  
  ![](Images/25.png)
- In feed-forward deep nets, there could be 100 or even 200 layers. In deep RNNs, stacking 3 layers is already considered deep and expensive to train.
- In some cases you might see some feed-forward network layers connected after the recurrent cells.
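Stacking can be sketched by feeding the activation sequence of one layer in as the input sequence of the next. This assumes basic tanh RNN layers with random weights; `deep_rnn_forward` and the sizes are hypothetical.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """layers: list of (W_aa, W_ax, b_a) tuples, bottom layer first."""
    inputs = xs
    for W_aa, W_ax, b_a in layers:
        a = np.zeros((W_aa.shape[0], 1))
        outputs = []
        for x_t in inputs:
            a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
            outputs.append(a)
        inputs = outputs  # this layer's activations are the next layer's inputs
    return inputs         # activations of the top layer, one per timestep

rng = np.random.default_rng(0)
n_x, n_a, T = 3, 4, 6  # toy sizes
layers = []
for layer_index in range(3):
    n_in = n_x if layer_index == 0 else n_a  # only the first layer sees x
    layers.append((rng.normal(scale=0.5, size=(n_a, n_a)),
                   rng.normal(scale=0.5, size=(n_a, n_in)),
                   np.zeros((n_a, 1))))

xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]
top_activations = deep_rnn_forward(xs, layers)
```

Each layer has its own weights, shared across time within that layer, so three layers over a long sequence already means a lot of sequential computation.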

## 1.13 Back propagation with RNNs
- > In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so most deep learning engineers do not need to bother with the details of the backward pass. If however you are an expert in calculus and want to see the details of backprop in RNNs, you can work through this optional portion of the notebook.

- The quote is taken from this [notebook](https://www.coursera.org/learn/nlp-sequence-models/notebook/X20PE/building-a-recurrent-neural-network-step-by-step). If you want the details of the back propagation with programming notes look at the linked notebook.