---

## 10.3 Understanding recurrent neural networks

In [1]:
from IPython.display import YouTubeVideo

---

### RNN

'Vanilla' recurrent nets

<!-- ![RNN](images/rnn/dprogrammer.RNN.png) -->
 
![RNN](https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/dprogrammer.RNN.png?raw=true)

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>

$$
\bbox[5px,border:2px solid red]
{
\begin{align*}
h_t &= \sigma_h(U_h \cdot x_t+V_h \cdot h_{t-1}+b_h)  \\
o_t &= \sigma_o(W_o \cdot h_t+b_o) 
\end{align*}
}
$$

$x_t$ : input vector.  
$h_t$ : hidden layer vector.  
$o_t$ : output vector.  
$b_h, b_o$ : bias vectors.  
$U_h, V_h, W_o$ : parameter matrices.  
$\sigma_h$ : activation, typically $\tanh$.  
$\sigma_o$ : activation, softmax or sigmoid, depending on your needs.  

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>
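
The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sizes, the random initialisation, the choice of softmax for $\sigma_o$ and the helper names (`rnn_step`, `softmax`) are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, U_h, V_h, b_h, W_o, b_o):
    """One time step of a vanilla RNN (hypothetical helper)."""
    h_t = np.tanh(U_h @ x_t + V_h @ h_prev + b_h)  # sigma_h = tanh
    o_t = softmax(W_o @ h_t + b_o)                 # sigma_o = softmax (assumed)
    return h_t, o_t

# Illustrative sizes and random parameters (assumptions).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
U_h = rng.normal(size=(n_hidden, n_in))
V_h = rng.normal(size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
W_o = rng.normal(size=(n_out, n_hidden))
b_o = np.zeros(n_out)

# Run over a sequence of 5 inputs, carrying the hidden state forward.
xs = rng.normal(size=(5, n_in))
h = np.zeros(n_hidden)
outputs = []
for x_t in xs:
    h, o = rnn_step(x_t, h, U_h, V_h, b_h, W_o, b_o)
    outputs.append(o)
outputs = np.stack(outputs)
print(outputs.shape)  # (5, 2)
```

The same parameters are reused at every time step; only $h_t$ changes as the sequence is consumed.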

---

### LSTM

Fully-fledged recurrent nets

<!-- ![LSTM](images/rnn/dprogrammer.LSTM.png) -->
 
![LSTM](https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/dprogrammer.LSTM.png?raw=true)

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>

$$
\bbox[5px,border:2px solid red]
{
\begin{align*}
f_t &= \sigma(W_f \cdot h_{t-1} + U_f \cdot x_t+b_f) \\
i_t &= \sigma(W_i \cdot h_{t-1} + U_i \cdot x_t+b_i) \\
o_t &= \sigma(W_o \cdot h_{t-1} + U_o \cdot x_t+b_o) \\
\tilde{C}_t &= \tanh(W_c\cdot h_{t-1} + U_c \cdot x_t+b_c) \\
C_t &= f_t \odot C_{t-1}+i_t\odot\tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t) 
\end{align*}
}
$$

$h_t$ , $C_t$ : hidden layer vectors.  
$x_t$ : input vector.  
$b_f$ , $b_i$ , $b_c$ , $b_o$ : bias vectors.  
$W_f$ , $W_i$ , $W_c$ , $W_o$ : parameter matrices.  
$U_f$ , $U_i$ , $U_c$ , $U_o$ : parameter matrices.  
$\sigma$ , $\tanh$ : activation functions.  
$\odot$ : the Hadamard (element-wise) product.

### Note

$f$ is for *forget*  
$c$ is for *carry*  
$i$ is for *input*  
$o$ is for *output*

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>


<!-- <img src="images/rnn/mit.lstm.1.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.1.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.2.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.2.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.3.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.3.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.4.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.4.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.5.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.5.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.6.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.6.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.7.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.7.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

<!-- <img src="images/rnn/mit.lstm.8.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/mit.lstm.8.png?raw=true">

<small>[Ava Soleimany, MIT 6.S191 (2021): Recurrent Neural Networks](https://www.youtube.com/watch?v=qjrad0V0uJE)</small>

#### LSTM pseudocode 


```python
memory = gated_prev_memory + gated_simple_RNN
output_t = gated_tanh(memory)
```

- `output_t`, `prev_output`: current and previous LSTM layer outputs
- `memory`: aka *carry* or *state* – the conveyor belt, allowing free flow of information from the past (if gates are open) 
- `simple_RNN`: `tanh(dot(input_t, W) + dot(prev_output, U) + b)`
- `gated_*`: each gate is `sigmoid(dot(input_t, W) + dot(prev_output, U) + b)`, with its own `W`, `U` and `b`, multiplied element-wise with the quantity it gates
- `gated_tanh(memory)`: the memory is squashed to $[-1, 1]$ and gated
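
To make the "conveyor belt" idea concrete, here is a small sketch with hand-picked gate values. In a real LSTM the gates are sigmoid outputs computed from the input and the previous output, so they never reach exactly 0 or 1; the values and variable names below are assumptions chosen to show the two extremes.

```python
import numpy as np

prev_memory = np.array([0.5, -1.0, 2.0])
simple_rnn = np.array([0.9, -0.2, 0.1])  # candidate new information (a tanh output)

# Case 1: forget gate open, input gate closed -> the past flows through untouched.
forget_gate, input_gate = np.ones(3), np.zeros(3)
memory_kept = forget_gate * prev_memory + input_gate * simple_rnn

# Case 2: forget gate closed, input gate open -> the memory is overwritten.
forget_gate, input_gate = np.zeros(3), np.ones(3)
memory_overwritten = forget_gate * prev_memory + input_gate * simple_rnn

# The output gate decides, per unit, how much of tanh(memory) is exposed.
output_gate = np.array([1.0, 0.0, 0.5])
output_t = output_gate * np.tanh(memory_kept)

print(memory_kept)
print(memory_overwritten)
print(output_t)
```

In the first case the memory is identical to `prev_memory`, which is what lets information survive over many time steps.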


#### The LSTM layer in detail

##### Memory

Also called *carry* by Chollet, or *state* by other authors.

1. A forget gate:  
  $ f_{t} = \sigma_g(W_f \cdot x_{t} + U_f \cdot h_{t - 1} + b_f)$  
  $\sigma_g$ is a sigmoid, with outputs in $[0, 1]$: a kind of smoothed gate.
   
2. An input gate:  
  $i_{t} = \sigma_g(W_i \cdot x_{t} + U_i \cdot h_{t - 1} + b_i)$  
  which we can also imagine as open or closed or in-between.

3. A simple RNN layer – to inject new information into the memory  
  $k_{t} = \sigma_h(W_k \cdot x_{t} + U_k \cdot h_{t - 1} + b_k)$  
  where $\sigma_h$ is the $\tanh$ activation.

4. Putting all this together:  
  `memory = gated_prev_memory + gated_simple_RNN`    
  $c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot k_{t}$  


##### Output


1. An output gate  
  $o_{t} = \sigma_g(W_o \cdot x_{t} + U_o \cdot h_{t - 1} + b_o)$  

2. which multiplies, or gates, the memory  
  `output_t = output_gate_{t} * tanh(memory)`  
  $h_{t} = o_{t} \odot \sigma_h(c_{t})$    
  $\sigma_h$ is a hyperbolic tangent (or, in a 'peephole' LSTM, the identity function $\sigma_h(x) = x$).
  
$\odot$ : the Hadamard (element-wise) product
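
The memory and output steps above can be combined into a single time step. The following NumPy sketch follows the notation of this subsection ($W$ applied to $x_t$, $U$ applied to $h_{t-1}$); the parameter dictionary, the helper names and the small random initialisation are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step (hypothetical helper; p maps names to arrays)."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    k_t = np.tanh(p["W_k"] @ x_t + p["U_k"] @ h_prev + p["b_k"])  # simple RNN candidate
    c_t = f_t * c_prev + i_t * k_t                                # new memory
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    h_t = o_t * np.tanh(c_t)                                      # gated output
    return h_t, c_t

def init_params(n_in, n_hidden, rng):
    # One (W, U, b) triple per gate/candidate: f, i, k, o.
    p = {}
    for name in "fiko":
        p[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p[f"U_{name}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p[f"b_{name}"] = np.zeros(n_hidden)
    return p

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = init_params(n_in, n_hidden, rng)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the layer carries two vectors from one step to the next: the output `h` and the memory `c`.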


---

### GRU

'Optimised' recurrent nets

<!-- ![GRU](images/rnn/dprogrammer.GRU.png) -->
 
![GRU](https://github.com/jchwenger/AI/blob/main/6-additional-material/images/rnn/dprogrammer.GRU.png?raw=true)

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>

$$
\bbox[5px,border:2px solid red]
{
\begin{align*}
z_t &= \sigma(W_z \cdot h_{t-1} + U_z \cdot x_t+b_z) \\
r_t &= \sigma(W_r \cdot h_{t-1} + U_r \cdot x_t+b_r) \\
\tilde{h}_t &= \tanh(W_h \cdot (r_t \odot h_{t-1}) + U_h \cdot x_t+b_h) \\
h_t &= (1-z_t)\odot h_{t-1}+z_t \odot \tilde{h}_t 
\end{align*}
}
$$


$h_t$ : hidden layer vector.  
$x_t$ : input vector.  
$b_z , b_r , b_h$ : bias vectors.  
$W_z , W_r , W_h$ : parameter matrices.  
$U_z , U_r , U_h$ : parameter matrices.  
$\sigma$ , $\tanh$ : activation functions.  
$\odot$ : the Hadamard (element-wise) product.

<small>["RNN, LSTM & GRU", dprogrammer.org](http://dprogrammer.org/rnn-lstm-gru)</small>
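
As a rough sketch, the four GRU equations translate to NumPy as below, keeping the boxed notation ($W$ applied to $h_{t-1}$, $U$ applied to $x_t$). The parameter dictionary, sizes and initialisation are assumptions for the example, and library implementations may differ in details such as bias handling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step (hypothetical helper; p maps names to arrays)."""
    z_t = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t + p["b_r"])  # reset gate
    h_tilde = np.tanh(p["W_h"] @ (r_t * h_prev) + p["U_h"] @ x_t + p["b_h"])
    # Interpolate between the previous state and the candidate.
    return (1 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for name in "zrh":
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params[f"U_{name}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    params[f"b_{name}"] = np.zeros(n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_in)):
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

Compared with the LSTM, there is a single state vector and three parameter sets instead of four, which is where the GRU's savings come from.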


---

## Lecture references

In [2]:
YouTubeVideo('qjrad0V0uJE', width=853, height=480) #  MIT 6.S191 (2021): Recurrent Neural Networks 

In [2]:
YouTubeVideo('0LixFSa7yts', width=853, height=480) #  Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 6 - Simple and LSTM RNNs 

In [3]:
YouTubeVideo('6niqTuYFZLQ', width=853, height=480) #  Lecture 10 | Recurrent Neural Networks 

In [5]:
YouTubeVideo('Keqep_PKrY8', width=853, height=480) # Lecture 8: Recurrent Neural Networks and Language Models 

In [7]:
YouTubeVideo('QuELiw8tbx8', width=853, height=480) #  Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs 