***
# Recurrent Neural Networks

### The idea behind RNNs is to make use of sequential information

### RNNs are called <i>recurrent</i> because they perform the same task for every element of a sequence
***

***
### Output depends on previous computations so we say RNNs have memory
<img src="./images/memory.jpg" width="400px">
<BR>
***

***
### The formulas that govern the computation are as follows:
#### * $x_t$ is the input at time step $t$. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence (time steps are counted from $t=0$).
#### * $s_t$ is the hidden state at time step $t$. It’s the “memory” of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t=f(Ux_t + Ws_{t-1})$. The function $f$ is usually a nonlinearity such as tanh or ReLU. $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
#### * $o_t$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary: $o_t = \mathrm{softmax}(Vs_t)$.
<BR>
<center><img src="./images/rnn1.jpg" width="600px"></center>
<BR>
#### Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (U, V, W above) across all steps.
#### We are performing the same task at each step, just with different inputs! 
#### This greatly reduces the total number of parameters we need to learn.
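
The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy sizes, the random weights, the word indices, and the helper names `softmax` and `rnn_forward` are all assumptions made for the example, and $f$ is taken to be tanh.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Run a vanilla RNN over a list of input vectors xs."""
    s = np.zeros(W.shape[0])            # initial hidden state, all zeroes
    states, outputs = [], []
    for x in xs:                        # the same U, W, V are reused at every step
        s = np.tanh(U @ x + W @ s)      # s_t = f(U x_t + W s_{t-1})
        states.append(s)
        outputs.append(softmax(V @ s))  # o_t = softmax(V s_t)
    return states, outputs

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 3          # toy sizes chosen for illustration
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

word_ids = [0, 3, 1]                    # hypothetical word indices
xs = [np.eye(vocab_size)[i] for i in word_ids]  # one-hot input vectors
states, outputs = rnn_forward(xs, U, W, V)
```

Each entry of `outputs` is a probability vector over the toy vocabulary, and the number of parameters does not grow with the length of the sequence.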

### Training through Back-propagation with a <i>twist</i>!
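#### Because the parameters are shared by all time steps, the gradient at each output depends not only on the current step but also on the previous steps. For example, in order to calculate the gradient at $t=4$ we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT).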
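
The idea of summing gradients over time steps can be sketched as follows. This is a simplified illustration under stated assumptions: the loss is taken to be the sum of the final hidden state (no output layer), $f$ is tanh, gradients are propagated all the way back to the first step, and the names `forward` and `bptt_grad_W` are made up for the example.

```python
import numpy as np

def forward(xs, U, W):
    """Return the hidden states; states[k] is the state after k inputs."""
    s = np.zeros(W.shape[0])
    states = [s]
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states

def bptt_grad_W(xs, U, W):
    """Gradient of loss = sum(final hidden state) with respect to W."""
    states = forward(xs, U, W)
    dW = np.zeros_like(W)
    ds = np.ones(W.shape[0])               # dLoss/ds_T for loss = sum(s_T)
    for t in range(len(xs), 0, -1):        # walk backwards through time
        dz = ds * (1.0 - states[t] ** 2)   # backprop through tanh
        dW += np.outer(dz, states[t - 1])  # add this step's contribution
        ds = W.T @ dz                      # pass the gradient to the previous state
    return dW

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(4)]   # 4 toy input vectors
U = rng.normal(scale=0.5, size=(3, 2))
W = rng.normal(scale=0.5, size=(3, 3))
dW = bptt_grad_W(xs, U, W)
```

Because `W` is reused at every step, `dW` accumulates one term per time step instead of being computed once per layer.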
***

***
### The Problem of Long-Term Dependencies

#### Predict the last word in “the clouds are in the <i>sky</i>”

<BR>
<center><img src="./images/rnn2.png" width="600px"></center>
<BR>
   
#### Consider predicting the last word in the text “I grew up in <i>France</i>… I speak fluent <i>French</i>.” 
##### The issue is that we need a long memory context!
***


***
### Long Short Term Memory (LSTM) networks
#### A special kind of RNN, capable of learning long-term dependencies!
<BR>
<center><img src="./images/rnn3.png" width="800px"></center>
#### <center> A traditional RNN has only 1 neural network layer (a single tanh layer)</center>
<BR>
<BR>
<center><img src="./images/rnn4.png" width="800px"></center>
#### <center> An LSTM has 4 interacting neural network layers!</center>
<BR>
    
### Let's Watch a Video about LSTMs


[A Gentle Introduction to LSTMs](https://www.youtube.com/watch?v=WCUNPb-5EYI)
***

***
#### Gates
 * Gates optionally let information through.
 * Each gate is composed of a sigmoid neural net layer and a pointwise multiplication operation.
 * The sigmoid outputs a number between 0 and 1. 0: let nothing through; 1: let everything through
<BR>
<BR>    
<center><img src="./images/rnn7.png" width="150px"></center>
<BR>
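
A gate can be sketched as a sigmoid followed by a pointwise multiplication. The vectors below are made-up values chosen so that one component is almost fully blocked, one is halved, and one passes through almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

information = np.array([2.0, -1.0, 0.5])      # values that want to pass
gate_logits = np.array([-100.0, 0.0, 100.0])  # hypothetical sigmoid-layer inputs
gate = sigmoid(gate_logits)                   # approximately [0, 0.5, 1]
passed = gate * information                   # pointwise multiplication
# passed is approximately [0.0, -0.5, 0.5]
```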
***


### Three gates protect and control the cell state.

***
### Step 1 - Memory Step (Forget Gate)
### A sigmoid layer looks at $h_{t-1}$ and $x_t$ and outputs a number $f_t$ between 0 and 1 for each number in the cell state $C_{t-1}$.
### 1 represents “completely keep this” $\space\space\space\space\space$ 0 represents “completely get rid of this.”
<BR>
<center><img src="./images/rnn8.png" width="600px"></center>
<BR>
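
As a rough sketch of this step, the forget gate can be written as $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The names `W_f` and `b_f`, the toy sizes, and the random values are assumptions for illustration; in a trained network the weights are learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3               # toy sizes
h_prev = rng.normal(size=hidden_size)        # h_{t-1}
x_t = rng.normal(size=input_size)            # x_t
W_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

# One value in (0, 1) per entry of the cell state C_{t-1}
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```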
***

***
### Step 2 - Decide what new information we’re going to store in the cell state
### 1) A sigmoid layer called the “input gate layer” decides which values we’ll update ($i_t$).
### 2) A tanh layer creates a vector of new candidate values, $\hat{C}_t$, that could be added to the state.
<BR>
<center><img src="./images/rnn10.png" width="600px"></center>
<BR>
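
The two layers of this step can be sketched in NumPy as below. The weight names `W_i`, `b_i`, `W_C`, `b_C`, the toy sizes, and the random values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3
concat = np.concatenate([rng.normal(size=hidden_size),   # h_{t-1}
                         rng.normal(size=input_size)])   # x_t

W_i = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_C = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

i_t = sigmoid(W_i @ concat + b_i)    # input gate: how much to update each value
C_hat = np.tanh(W_C @ concat + b_C)  # candidate values, each between -1 and 1
```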
***

***
### Step 3 - Update old cell state, $C_{t-1}$, into new cell state $C_t$.
###  The previous steps already decided what to do; we just need to actually do it.
### 1) Multiply the old state by $f_t$, forgetting the things we decided to forget earlier.
### 2) Add $i_t * \hat{C}_t$: the new candidate values, scaled by how much we decided to update each state value.
### Together: $C_t = f_t * C_{t-1} + i_t * \hat{C}_t$
<BR>
<center><img src="./images/rnn11.png" width="600px"></center>
<BR>
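
A small worked example of the update, using made-up gate values so the arithmetic is easy to follow: the first entry is kept entirely, the second is replaced entirely, and the third is a half-and-half mix.

```python
import numpy as np

C_prev = np.array([2.0, 2.0, 2.0])    # old cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])       # forget gate (hypothetical values)
i_t = np.array([0.0, 1.0, 0.5])       # input gate (hypothetical values)
C_hat = np.array([0.9, -0.9, 0.4])    # candidate values (hypothetical)

C_t = f_t * C_prev + i_t * C_hat      # pointwise multiply, then add
# C_t is [2.0, -0.9, 1.2]
```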
***

***
### Step 4 - Filtered Output Based on Our Cell State
### 1) A sigmoid layer decides what parts of the cell state to output.
### 2) The cell state travels through tanh (pushes values between −1 and 1).
### 3) Multiply it by the output of the sigmoid gate, so that we only output the parts we want.
<BR>
<center><img src="./images/rnn12.png" width="600px"></center>
<BR>
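
The output step can be sketched in the same way. The names `W_o` and `b_o` and all numeric values are assumptions for illustration; `C_t` stands in for the cell state produced by Step 3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3
h_prev = rng.normal(size=hidden_size)   # h_{t-1}
x_t = rng.normal(size=input_size)       # x_t
C_t = rng.normal(size=hidden_size)      # stand-in for the updated cell state

W_o = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # what to output
h_t = o_t * np.tanh(C_t)                                  # filtered cell state
```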
    
### Why are there 2 $h_t$ outputs ???
***

# NLP Sequence Labeling

### Subtask of information extraction - named-entity recognition
### Classify named entities in text into categories such as
* Names of persons
* Organizations
* Locations
* Time expressions
* Quantities

### <U>IOB (Inside-Outside-Beginning) Format</U>
#### Example:
* I_O complained_O to_O Microsoft_I-ORG about_O Bill_I-PER Gates_I-PER
* They_O told_O me_O to_O see_O the_O mayor_O of_O New_I-LOC York_I-LOC
<BR>
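
To show how such tags can be turned back into entities, here is a small sketch that groups consecutive tagged words. The function name `parse_iob` and the `word_TAG` input convention are assumptions taken from the example above; a word tagged `I-` with the same type as the previous word is treated as a continuation of that entity.

```python
def parse_iob(tagged_sentence):
    """Group word_TAG tokens into (entity_text, entity_type) pairs."""
    entities, words, current = [], [], None
    for token in tagged_sentence.split():
        word, tag = token.rsplit("_", 1)       # split on the last underscore
        prefix, _, etype = tag.partition("-")  # "I-ORG" -> ("I", "-", "ORG")
        # Close the open entity on O, on B-, or when the entity type changes
        if tag == "O" or prefix == "B" or etype != current:
            if words:
                entities.append((" ".join(words), current))
            words, current = [], None
        if tag != "O":
            words.append(word)
            current = etype
    if words:                                  # flush an entity at sentence end
        entities.append((" ".join(words), current))
    return entities

s1 = "I_O complained_O to_O Microsoft_I-ORG about_O Bill_I-PER Gates_I-PER"
s2 = "They_O told_O me_O to_O see_O the_O mayor_O of_O New_I-LOC York_I-LOC"
print(parse_iob(s1))  # [('Microsoft', 'ORG'), ('Bill Gates', 'PER')]
print(parse_iob(s2))  # [('New York', 'LOC')]
```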

* <B>Advantages:</B> Very fine-grained annotation
* <B>Disadvantages:</B> Requires annotated data!
