# Recurrent Neural Network

## Representation

<center><img src="./assets/simplified-representation.svg"></center>

<center><img src="./assets/bottom-up-representation.svg"></center>

## Sequence Models

<center><img src="./assets/sequence-models.svg"></center>

## Feed-Forward Networks

<center><img src="./assets/feed-fordward-networks.svg"></center>

## Handling Individual Time Steps

<center><img src="./assets/individual-time-steps.svg"></center>

<center><img src="./assets/relationship-of-individual-time-steps.svg"></center>

## Neurons with Recurrence

<center><img src="./assets/neurons-with-recurrence.svg"></center>


## Recurrent Neural Networks (RNNs)

<center><img src="./assets/cell-state.svg"></center>

**RNNs** have a **state $h_t$** that is updated **at each time step** as a sequence is processed:

$$
h_{t} = f_{\mathbf{W}}(x_t, h_{t-1})
$$

> Note: The same function and set of parameters are used at every time step


**Code Intuition:**

```python
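# `RNN` stands for any trained recurrent model; it is not defined here
my_rnn = RNN()
hidden_state = [0, 0, 0, 0]

sentence = ["I", "am", "learning", "recurrent", "neural"]

# Feed one word at a time, carrying the hidden state forward
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# The prediction after the last word is the model's guess for the next word
next_word_prediction = prediction

# >>> "networks"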
```
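
The loop above can be made runnable with a toy stand-in for `RNN()`. The sketch below is an illustration only: `ToyRNN`, `VOCAB`, and `HIDDEN_SIZE` are hypothetical names, the weights are random rather than trained, and plain Python lists are used so the example has no dependencies. Because nothing is trained, the printed prediction is arbitrary; the point is how the hidden state is carried from one word to the next.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and hidden size, chosen for illustration
VOCAB = ["I", "am", "learning", "recurrent", "neural", "networks"]
HIDDEN_SIZE = 4


def one_hot(word):
    """Encode a word as a one-hot list over VOCAB."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class ToyRNN:
    """Hypothetical stand-in for `RNN()`; weights are random, not trained."""

    def __init__(self):
        self.W_xh = random_matrix(HIDDEN_SIZE, len(VOCAB))
        self.W_hh = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
        self.W_hy = random_matrix(len(VOCAB), HIDDEN_SIZE)

    def __call__(self, word, hidden_state):
        x = one_hot(word)
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        pre = [a + b for a, b in zip(matvec(self.W_hh, hidden_state),
                                     matvec(self.W_xh, x))]
        new_hidden = [math.tanh(p) for p in pre]
        # Pick the highest-scoring word as the prediction
        scores = matvec(self.W_hy, new_hidden)
        prediction = VOCAB[scores.index(max(scores))]
        return prediction, new_hidden


my_rnn = ToyRNN()
hidden_state = [0.0] * HIDDEN_SIZE
sentence = ["I", "am", "learning", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

print(prediction, hidden_state)
```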

## RNNs - Update Cell State and Output

<center><img src="./assets/update-cell-state-and-output.svg"></center>
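
Written out, and consistent with the `MyRNNCell` code intuition in this notebook, the update can be sketched as follows. The weight names $W_{xh}$, $W_{hh}$ and $W_{hy}$ follow that code; the $\tanh$ non-linearity and the absence of bias terms are assumptions carried over from it, and the figure may use slightly different notation.

$$
\begin{aligned}
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
\hat{y}_t &= W_{hy}\, h_t
\end{aligned}
$$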

## Computational Graph Across Time

Re-use the **same weight matrices** at every time step

<center><img src="./assets/computational-graph-across-time.svg"></center>

### RNNs: Forward pass

<center><img src="./assets/rnn-loss.svg"></center>

**RNNs Code Intuition**

```python
import tensorflow as tf


class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()

        # Initialize weight matrices
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])

        # Initialize hidden state to zeros (column vector)
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # x is expected to be a column vector of shape [input_dim, 1]
        # Update the hidden state (matrix products, not element-wise)
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

        # Compute the output
        output = tf.matmul(self.W_hy, self.h)

        # Return the current output and hidden state
        return output, self.h
```

**RNNs TensorFlow**

```python
tf.keras.layers.SimpleRNN(rnn_units)
```

## Sequence Modelling: Design Criteria

To model sequences, we need to:

1. Handle **variable-length** sequences
2. Track **long-term** dependencies
3. Maintain information about **order**
4. **Share parameters** across the sequence

### Encoding Language for a Neural Network

<center><img src="./assets/encoding.svg"></center>

### Embedding

Let's use the following sentence as an example: **"This morning I took my dog for a walk"** 

<center><img src="./assets/embedding.svg"></center>
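
As a rough sketch of the idea, the example sentence can be turned into numbers in two ways: a sparse one-hot vector per word, or a dense row looked up in an embedding table. The names `vocab`, `one_hot`, `embedding_table` and the size `EMBEDDING_DIM` are hypothetical, and the table is filled with random values here, whereas in practice it would be learned.

```python
import random

random.seed(0)

sentence = "This morning I took my dog for a walk"
tokens = sentence.split()

# Vocabulary: map each distinct word to an integer index
vocab = {word: index for index, word in enumerate(sorted(set(tokens)))}
indices = [vocab[word] for word in tokens]


def one_hot(index, size):
    """Sparse encoding: a vector of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Embedding table: one dense row per word (random here; learned in practice)
EMBEDDING_DIM = 3  # hypothetical size chosen for illustration
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in vocab
]

one_hot_vectors = [one_hot(i, len(vocab)) for i in indices]
embedded = [embedding_table[i] for i in indices]

print(indices)             # one integer per word
print(one_hot_vectors[0])  # sparse vector, length = vocabulary size
print(embedded[0])         # dense vector, length = EMBEDDING_DIM
```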

### Handle Variable Sequence Lengths

Examples:
- *"The food was great"*
- *"We visited a restaurant for lunch"*
- *"We are learning sequence modelling"*

### Model Long-Term Dependencies

We need information from **the distant past** to accurately predict the correct word.

Example:
- *"Spain is where I grew up, but I now live in Berlin. I speak fluent ____"*




### Capture Differences in Sequence Order

We need to be able to capture differences in sequence order which could result in differences in the overall meaning.

Example:
- *"The food was good, not bad at all"* $\neq$ *"The food was bad, not good at all"*

In this case we have two sentences with opposite semantic meaning that use the **same words** with the **same counts**, just in a **different order**.
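
A small check makes this concrete. The sketch below compares the two sentences as word counts (a bag of words, used here only as a contrast) and as ordered token lists; the `tokenize` helper is a hypothetical, deliberately simple tokenizer.

```python
from collections import Counter

a = "The food was good, not bad at all"
b = "The food was bad, not good at all"


def tokenize(text):
    """Lowercase, drop commas, and split on whitespace."""
    return text.lower().replace(",", "").split()


tokens_a, tokens_b = tokenize(a), tokenize(b)

# A representation based only on counts cannot tell the sentences apart
same_counts = Counter(tokens_a) == Counter(tokens_b)

# An order-aware representation can
same_order = tokens_a == tokens_b

print(same_counts, same_order)
```

A model that only sees counts would treat both sentences identically, which is why the sequence model has to preserve order.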

## Backpropagation Through Time (BPTT)

<center><img src="./assets/bptt.svg"></center>

## Standard RNN Gradient Flow

Computing the gradient with respect to $h_0$ involves **many factors of $W_{hh}$ + repeated gradient computation**

<center><img src="./assets/gradient-flow.svg"></center>

<center><img src="./assets/problems.svg"></center>
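
To see where the repeated factors of $W_{hh}$ come from, here is a short derivation sketch. It assumes the $\tanh$ cell used in this notebook's code intuition, $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, with $h_t^{2}$ taken element-wise; $T$ denotes the last time step.

$$
\frac{\partial h_T}{\partial h_0}
= \frac{\partial h_T}{\partial h_{T-1}} \, \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_1}{\partial h_0},
\qquad
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\left(1 - h_t^{2}\right) W_{hh}
$$

Each time step contributes one factor of $W_{hh}$ and one factor from the activation derivative. When these factors are mostly larger than 1 in magnitude the product can explode; when they are mostly smaller than 1 it shrinks toward zero.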

## Problem: Vanishing Gradients

**Why are vanishing gradients a problem?**
- Multiply many small numbers together
- Errors from time steps further back have smaller and smaller gradients
- This biases the parameters toward capturing short-term dependencies
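
The effect of multiplying many similar factors can be sketched numerically. The factors `0.5` and `1.5` below are hypothetical stand-ins for the per-step gradient factor, not values taken from a real network.

```python
def repeated_product(factor, steps):
    """Multiply `factor` by itself `steps` times, as a stand-in for a
    gradient that picks up one similar factor per time step."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result


# Columns: number of steps, shrinking factor (0.5), growing factor (1.5)
for steps in (1, 10, 50):
    print(steps, repeated_product(0.5, steps), repeated_product(1.5, steps))
```

After 50 steps the contribution with factor 0.5 is far too small to influence the update, while the one with factor 1.5 has grown by many orders of magnitude.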


## Trick #1: Activation Functions

We can choose the activation function so that its derivative does not shrink the gradient: for example, ReLU has a derivative of 1 for every positive input.
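
A quick numeric sketch compares the derivatives of three common activations. It assumes, purely for illustration, that the pre-activation value `x` is the same at every time step, so the accumulated factor is just the derivative raised to the number of steps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # at most 0.25


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # at most 1, reached only at x = 0


def d_relu(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for positive inputs


x = 1.0      # hypothetical pre-activation value
steps = 20   # hypothetical number of time steps
for name, derivative in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(name, derivative(x) ** steps)
```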

## Trick #2: Parameter Initialization

- Initialize **weights** to the identity matrix
- Initialize **biases** to zero

$$
I_n = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
$$


This helps prevent the weights from shrinking to zero.
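
One way to build intuition is a simplified recurrence with no input and zero bias, $h_t = \mathrm{ReLU}(W_{hh} h_{t-1})$. This setup, the helper names, and the comparison matrix $0.5\,I$ are assumptions made for illustration. With $W_{hh} = I$ a non-negative state passes through unchanged, whereas a smaller matrix shrinks it at every step.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def scale(matrix, factor):
    return [[factor * value for value in row] for row in matrix]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def run(W_hh, h, steps):
    """Apply h <- relu(W_hh h) repeatedly (zero bias, no input)."""
    for _ in range(steps):
        h = relu(matvec(W_hh, h))
    return h


h0 = [1.0, 2.0, 3.0]
kept = run(identity(3), h0, 100)               # identity initialization
shrunk = run(scale(identity(3), 0.5), h0, 100)  # hypothetical smaller weights

print(kept)
print(shrunk)
```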

## Trick #3: Gated Cells (most used)

Idea: use a more **complex recurrent unit with gates** to control what information is passed through.


<center><img src="./assets/gated-cell.svg"></center>

**Long Short Term Memory (LSTM)** networks rely on a gated cell to track information throughout many time steps.

## Long Short Term Memory (LSTM) Networks

In a standard RNN, repeating modules contain a **simple computation node**

<center><img src="./assets/simple-lstm.svg"></center>

### Control Information Flow

LSTM modules contain **computational blocks** that **control information flow**

<center><img src="./assets/control-information-flow.svg"></center>

LSTM cells are able to track information throughout many timesteps

Information is **added** or **removed** through structures called **gates**

<center><img src="./assets/add.svg"></center>

Gates optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.
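
A gate of this kind can be sketched in a few lines: a sigmoid maps each gate value into the range (0, 1), and pointwise multiplication scales the information accordingly. The `gate` helper and the example numbers are hypothetical; in an LSTM the gate values would come from a learned layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(information, gate_logits):
    """Scale each entry of `information` by a sigmoid value in (0, 1)."""
    return [value * sigmoid(z) for value, z in zip(information, gate_logits)]


information = [2.0, -1.0, 0.5]
# Hypothetical gate inputs: nearly open, half open, nearly closed
gate_logits = [10.0, 0.0, -10.0]

gated = gate(information, gate_logits)
print(gated)
```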

1. Forget
2. Store
3. Update
4. Output

The **output gate** controls what information is sent to the next time step

<center><img src="./assets/lstm-steps.svg"></center>
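
The four steps can be sketched as a single LSTM time step. This follows the commonly used LSTM formulation, which may differ in detail from the figure; the parameter names (`forget`, `store`, `candidate`, `output`), the sizes, and the random weights are hypothetical, and plain Python lists are used so the sketch has no dependencies.

```python
import math
import random

random.seed(0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vector):
    """Compute weights @ vector + bias using plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init(rows, cols):
    """Random weights and zero biases (untrained, for illustration)."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows


def lstm_step(x, h_prev, c_prev, params):
    z = h_prev + x  # list concatenation [h_{t-1}, x_t]
    # 1. Forget: how much of the old cell state to keep
    f = [sigmoid(v) for v in linear(*params["forget"], z)]
    # 2. Store: how much (i) of which new candidate values (g) to write
    i = [sigmoid(v) for v in linear(*params["store"], z)]
    g = [math.tanh(v) for v in linear(*params["candidate"], z)]
    # 3. Update: c_t = f * c_{t-1} + i * g (element-wise)
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # 4. Output: a filtered version of the cell state
    o = [sigmoid(v) for v in linear(*params["output"], z)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


INPUT_SIZE, HIDDEN_SIZE = 3, 2  # hypothetical sizes
params = {name: init(HIDDEN_SIZE, HIDDEN_SIZE + INPUT_SIZE)
          for name in ("forget", "store", "candidate", "output")}

h, c = [0.0] * HIDDEN_SIZE, [0.0] * HIDDEN_SIZE
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)

print(h, c)
```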

### Gradient Flow

Uninterrupted gradient flow is achieved by maintaining a separate cell state $c_t$. The actual gradient computation (taking the derivative of the loss with respect to the weights and shifting the weights in response) happens with respect to this separately maintained cell state $c_t$.

<center><img src="./assets/control-information-flow-gradient.svg"></center>
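
A short derivation sketch shows why this path is gentle on gradients. It assumes the commonly used cell-state update, with forget gate $f_t$, store (input) gate $i_t$, candidate values $g_t$, and $\odot$ denoting element-wise multiplication; these symbols are not defined elsewhere in this notebook.

$$
c_t = f_t \odot c_{t-1} + i_t \odot g_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\quad \text{(direct path, holding } h_{t-1} \text{ fixed)}
$$

Along this direct path there is no multiplication by $W_{hh}$ and no squashing non-linearity, only the forget gate. When $f_t$ stays close to 1, the gradient passes from $c_t$ to $c_{t-1}$ almost unchanged.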

## LSTMs: Key Concepts

1. Maintain a **separate cell state** from what is outputted
2. Use **gates** to control the **flow of information**
    - **Forget** gate gets rid of irrelevant information
    - **Store** relevant information from current input
    - Selectively **update** cell state
    - **Output** gate returns a filtered version of the cell state
3. Backpropagation through time with **uninterrupted gradient flow**