# Natural Language Processing

### From RNN to Neural Machine Translation (NMT)
<br><br>
Prof. Iacopo Masi and Prof. Stefano Faralli

In [1]:
import matplotlib.pyplot as plt
import scipy
import random
import numpy as np
import pandas as pd
pd.set_option('display.colheader_justify', 'center')

In [2]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#plt.style.use('seaborn-whitegrid')

font = {'family' : 'Times',
        'weight' : 'bold',
        'size'   : 12}

matplotlib.rc('font', **font)


# Aux functions

def plot_grid(Xs, Ys, axs=None):
    ''' Aux function to plot a grid'''
    t = np.arange(Xs.size) # define progression of int for indexing colormap
    if axs:
        axs.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        axs.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        axs.axis('scaled') # axis scaled
    else:
        plt.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        plt.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        plt.axis('scaled') # axis scaled
        
def linear_map(A, Xs, Ys):
    '''Map src points with A'''
    # [NxN,NxN] -> NxNx2 # add 3-rd axis, like adding another layer
    src = np.stack((Xs,Ys), axis=Xs.ndim)
    # flatten first two dimension
    # (NN)x2
    src_r = src.reshape(-1,src.shape[-1]) #ask reshape to keep last dimension and adjust the rest
    # 2x2 @ 2x(NN)
    dst = A @ src_r.T # 2xNN
    #(NN)x2 and then reshape as NxNx2
    dst = (dst.T).reshape(src.shape)
    # Access X and Y
    return dst[...,0], dst[...,1]


def plot_points(ax, Xs, Ys, col='red', unit=None, linestyle='solid'):
    '''Plots points'''
    ax.set_aspect('equal')
    ax.grid(True, which='both')
    ax.axhline(y=0, color='gray', linestyle="--")
    ax.axvline(x=0, color='gray',  linestyle="--")
    ax.plot(Xs, Ys, color=col)
    if unit is None:
        plotVectors(ax, [[0,1],[1,0]], ['gray']*2, alpha=1, linestyle=linestyle)
    else:
        plotVectors(ax, unit, [col]*2, alpha=1, linestyle=linestyle)

def plotVectors(ax, vecs, cols, alpha=1, linestyle='solid'):
    '''Plot set of vectors.'''
    for i in range(len(vecs)):
        x = np.concatenate([[0,0], vecs[i]])
        ax.quiver([x[0]],
                   [x[1]],
                   [x[2]],
                   [x[3]],
                   angles='xy', scale_units='xy', scale=1, color=cols[i],
                   alpha=alpha, linestyle=linestyle, linewidth=2)

## My own latex definitions

$$\def\mbf#1{\mathbf{#1}}$$
$$\def\bmf#1{\boldsymbol{#1}}$$
$$\def\bx{\mbf{x}}$$
$$\def\bxt#1{\mbf{x}_{\text{#1}}}$$
$$\def\bv{\mbf{v}}$$
$$\def\bz{\mbf{z}}$$
$$\def\bmu{\bmf{\mu}}$$
$$\def\bsigma{\bmf{\Sigma}}$$
$$\def\Rd#1{\in \mathbb{R}^{#1}}$$
$$\def\chain#1#2{\frac{\partial #1}{\partial #2}}$$
$$\def\loss{\mathcal{L}}$$
$$\def\params{\bmf{\theta}}$$


# Today's lecture
## - Recap on RNN
### - Stacked, Bidirectional RNN
### - Long Short-Term Memory Networks (LSTM)
### - Introduction to Self-Attention and Transformers

# This lecture material is taken from
📘 **Chapter 9 Jurafsky Book**

📘 **Chapter 6.3 Eisenstein Book**
- [Stanford Slide LSTM](http://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture06-fancy-rnn.pdf)
- [Stanford Lecture LSTM](https://www.youtube.com/watch?v=0LixFSa7yts&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ&index=6)
- [Stanford Notes on RNN and LSTM](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes05-LM_RNN.pdf)
- [Andrej Karpathy Lecture on LSTM](https://www.youtube.com/watch?v=yCC09vCHzF8)
- [Andrej Karpathy Slides on LSTM](http://cs231n.stanford.edu/slides/2022/lecture_10_ruohan.pdf)

Another resource with code is [[d2l.ai] Modern RNN](https://d2l.ai/chapter_recurrent-modern/index.html)

LSTM: [colah.github.io](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) | 
[Illustrated Guide to LSTM](https://www.youtube.com/watch?v=8HyCNIVRbSU)

# Deep Learning for Sequence Processing

# Stacked Bidirectional RNN

# RNN so far

<br><div align='center'><img src="figs/stacked_rnn_01.png" width='30%' ></div>

# Stacked RNN (or Multi-layer RNN)

- RNNs are already "deep" on one dimension, **the time dimension**: they unroll over many timesteps.

- We can also make them “deep” in **another dimension** (the representation dimension)! We can do so by applying multiple RNNs – this is a multi-layer RNN (stacked RNN).

- This allows the network to compute more complex representations 

- The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features. 

# Stacked RNN (or Multi-layer RNN)

The hidden states from RNN layer $i$ are the inputs to RNN layer $i+1$

<br><div align='center'><img src="figs/stacked_rnn_02.png" width='40%' ></div>
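
A minimal sketch of this stacking, assuming PyTorch's `nn.RNN` as the recurrent layer and toy sizes chosen only for illustration: the hidden states of layer 1 are fed as the input sequence of layer 2. The built-in `num_layers=2` option builds the same kind of stack internally (with its own random weights).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumptions for illustration only)
seq_len, batch, input_size, hidden_size = 5, 1, 3, 4
x = torch.randn(seq_len, batch, input_size)

# Manual stack: layer 2 reads the hidden states produced by layer 1
layer1 = nn.RNN(input_size, hidden_size)
layer2 = nn.RNN(hidden_size, hidden_size)
h1, _ = layer1(x)   # (seq_len, batch, hidden_size)
h2, _ = layer2(h1)  # (seq_len, batch, hidden_size)

# Built-in version of the same architecture (different random weights)
stacked = nn.RNN(input_size, hidden_size, num_layers=2)
out, h_n = stacked(x)  # h_n holds the last hidden state of each layer

print(h2.shape, out.shape, h_n.shape)
```

Note that only the first layer sees the raw input size; every layer above it takes `hidden_size` features as input.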

# Stacked RNN (or Multi-layer RNN)


- **Multi-layer or stacked RNNs** allow a network to compute **more complex representations**: they work better than just having one layer of high-dimensional encodings!

- High-performing RNNs are usually multi-layer BUT NOT as deep as convolutional or feed-forward networks

- In a 2017 paper, Britz et al. find that for **Neural Machine Translation**, 2 to 4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN:
    - Often 2 layers is a lot better than 1, and 3 might be a little better than 2
    -  Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)

# Bidirectional RNN: Motivation

**Task:** Sentiment Classification

We can regard the hidden state $\mbf{h}_4$ of `terribly` as a representation of the word given the prior context in this sentence. We call this a _contextual representation_.
Note that the sentence expresses `positivity`, but `terribly` is usually used as a negative word.

<div align='center'><img src="figs/bidirectional_01.png" width='25%' ></div>


# Bidirectional RNN: Motivation

**Task:** Sentiment Classification

The pooling of all hidden states may cancel the contribution of the hidden state of `exciting`, making the model less effective.

**It would be nice if we could condition terribly on what comes "in the future" (right)**
<div align='center'><img src="figs/bidirectional_01.png" width='25%' ></div>



# Bidirectional RNN: Motivation

**Task:** Sentiment Classification

**Depending on the task, we often have the whole text at hand, so we can go back from the end of the text, move to the right, and so on.**

**Bottom line: We can move in the text as we want.**

<div align='center'><img src="figs/bidirectional_01.png" width='20%' ></div>



# Bidirectional RNN: Motivation

**Task:** Sentiment Classification

**Idea:** we learn the RNN in reverse order, but what do we do with the previous RNN? 🤔

<div align='center'><img src="figs/bidirectional_02.png" width='25%' ></div>



# Bidirectional RNN

**Task:** Sentiment Classification

**Idea:** We use both of the hidden layers.
1. A set of hidden layers goes from left $\rightarrow$ right
2. Another set of hidden layers goes left $\leftarrow$ right (new)
3. **Fusion:** We somehow "pool" the final representation (usually `concat`).

# Bidirectional RNN
<br><div align='center'><img src="figs/bidirectional_03.png" width='40%' ></div>

# Bidirectional RNN

In general for each side of the RNN, you have different weights.

The representation is thus:

$$ \overrightarrow{\mbf{h}}(t) = \text{RNN}_{\text{FW}}\big(\overrightarrow{\mbf{h}}(t-1), \mbf{x}(t)\big)$$
$$ \overleftarrow{\mbf{h}}(t) = \text{RNN}_{\text{BW}}\big(\overleftarrow{\mbf{h}}(t+1), \mbf{x}(t)\big)$$
$$ \mbf{h}(t) = [\overrightarrow{\mbf{h}}(t);\overleftarrow{\mbf{h}}(t)]$$

<br><div align='center'><img src="figs/bidirectional_03.png" width='65%' ></div>
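
The three equations above can be sketched as follows. This is only an illustration under assumptions: `nn.RNN` stands in for $\text{RNN}_{\text{FW}}$ and $\text{RNN}_{\text{BW}}$, the sizes are made up, and the names `rnn_fw` and `rnn_bw` are hypothetical. The backward RNN simply reads the sequence reversed in time, and its outputs are flipped back before concatenation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumptions for illustration only)
seq_len, batch, input_size, hidden_size = 5, 1, 3, 4
x = torch.randn(seq_len, batch, input_size)

# Two independent RNNs: each direction has its own weights
rnn_fw = nn.RNN(input_size, hidden_size)
rnn_bw = nn.RNN(input_size, hidden_size)

# Forward direction: left -> right
h_fw, _ = rnn_fw(x)

# Backward direction: feed the time-reversed sequence, then flip the
# outputs back so that h_bw[t] is aligned with x[t]
h_bw_reversed, _ = rnn_bw(torch.flip(x, dims=[0]))
h_bw = torch.flip(h_bw_reversed, dims=[0])

# Fusion by concatenation along the feature axis
h = torch.cat([h_fw, h_bw], dim=-1)  # (seq_len, batch, 2 * hidden_size)

# PyTorch offers the same architecture through bidirectional=True
birnn = nn.RNN(input_size, hidden_size, bidirectional=True)
out, _ = birnn(x)

print(h.shape, out.shape)
```

In both cases the per-token representation has size `2 * hidden_size`, which is what any layer placed on top must expect.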

# Bidirectional RNN

Note: bidirectional RNNs are only applicable if you have access to the **entire input sequence**

- They are NOT applicable to Language Modeling, because in LM you only have left context available.
- If you do have entire input sequence (e.g., for any kind of encoding), **bidirectionality is powerful** _(you should use it by default!)_.

- For example, **BERT (Bidirectional Encoder Representations from Transformers)** is a powerful pretrained contextual representation system built on **bidirectionality**.
- You will learn more about transformers, including BERT, soon.

# From simple RNN to LSTM

# Remember Back Propagation Through Time (BPTT)?

# Back Propagation Through Time (BPTT)

$$\chain{\loss}{\mbf{W}_h} = \underbrace{\chain{\loss}{\loss}}_1\underbrace{\chain{\loss}{\loss_3}}_{1/T=1/3}\underbrace{\chain{\loss_3}{\mbf{y}_3}}_{\mathbb{R}^{1\times|V|}}\underbrace{\chain{\mbf{y}_3}{\mbf{h}_3}}_{\mathbb{R}^{|V|\times h}}\underbrace{\chain{\mbf{h}_3}{\mbf{W}_h}}_{\mathbb{R}^{h}\times(\mathbb{R}^{h}\times \mathbb{R}^{h})}$$
 
 <div align='center'><img src="figs/Slide58.png" width='50%' ></div>

# Why LSTM? Vanishing Gradient Problem

If the recursive chain is too long, the product of matrices $\chain{\mbf{h}_3}{\mbf{h}_2}\chain{\mbf{h}_2}{\mbf{h}_1}...$ can make the **gradient vanish**, especially when using the tanh activation function.

Think of it as multiplying $a\cdot b\cdot b\cdot b\cdot b\cdot b ...$ where $0<b<1$, but for matrices. 

**In the end the norm of the gradient becomes so small that it is numerically zero!**

$$\chain{\mbf{h}_3}{\mbf{W}_h}\Big\vert_{\text{all time steps}}=\chain{\mbf{h}_3}{\mbf{W}_h}+\chain{\mbf{h}_3}{\mbf{h}_2}\chain{\mbf{h}_2}{\mbf{W}_h}+\chain{\mbf{h}_3}{\mbf{h}_2}\chain{\mbf{h}_2}{\mbf{h}_1}\chain{\mbf{h}_1}{\mbf{W}_h}$$
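
A small numerical sketch of the "$b\cdot b\cdot b \dots$ but for matrices" intuition. It rests on simplifying assumptions: the activation derivative is ignored (for tanh it is at most 1, so it could only shrink the gradient further), and the recurrent matrix is a random one rescaled so that its largest singular value is 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 5  # hidden size (toy value)

# Random recurrent matrix, rescaled so its largest singular value is 0.5
W = rng.standard_normal((H, H))
W *= 0.5 / np.linalg.norm(W, 2)

# Random "upstream" gradient arriving at the last time step
grad = rng.standard_normal(H)

norms = []
for t in range(50):
    grad = W.T @ grad              # one step back in time
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[9], norms[49])
```

Each step multiplies the gradient norm by at most 0.5, so after 50 steps it is bounded by $0.5^{50}$ times the initial norm. With a largest singular value above 1 the same loop can instead blow up, which is the exploding-gradient case.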



# RNN to model Sequences


✅ Improvements over window-based LM 
- Can process any length input
- Computation for step $t$ can **(in theory)** use information from many steps back
- Model size is bounded with respect to context size
- Same weights applied on every timestep, so there is symmetry in how inputs are processed.

# RNN to model Sequences

❌ Remaining problems:
- Recurrent computation is slow (unroll in time)
- In practice, **difficult to access information from many steps back**

 # RNN Gradient Flow

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_01.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_02.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_03.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_04.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_05.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_06.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_07.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_08.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_09.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_10.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_11.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# RNN Gradient Flow
<br><div align='center'><img src="figs/rnn_grad_flow/rnn_grad_flow_12.png" width='60%' ></div>
    
<small>Picture from Stanford</small>

# Change RNN architecture!

# from simple RNN to Long Short-Term Memory RNN

# RNN Terminology

<div align='center'><img src="figs/rnn_category.png" width='60%' ></div>

# Long Short-Term Memory RNN (LSTM)

We have to modify the **Elman Unit** into something that can mitigate the vanishing gradients that back-propagation suffers from when the recurrence is **"plain"**:
<br><br><br>
$$
\mbf{h}_t=\tanh \Biggl(\mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right)\Biggr)
$$
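
As a minimal sketch of this "plain" recurrence in NumPy, assuming toy dimensions, random weights, and no bias (as in the formula above): the previous hidden state and the current input are stacked into one vector and pushed through a single matrix and a tanh.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3  # toy size, assuming dim(h) = dim(x) = D

# One matrix acting on the stacked vector [h_{t-1}; x_t]
W = rng.standard_normal((D, 2 * D)) * 0.1

h = np.zeros(D)                    # initial hidden state
xs = rng.standard_normal((4, D))   # a made-up sequence of 4 inputs

for x_t in xs:
    h = np.tanh(W @ np.concatenate([h, x_t]))  # Elman update

print(h)
```

The only path from one time step to the next goes through `W` and `tanh`, which is why gradients have to cross both at every step.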

# LSTM Madness 😱

<div align='center'><img src="figs/lstm_madness.png" width='60%' ></div>


# LSTM Motivation

- While gradient clipping helps with exploding gradients,  handling **vanishing gradients appears  to require a more elaborate solution.**

- LSTMs: each ordinary recurrent node is replaced by a **memory cell**. Each memory cell contains an **internal state** $\mbf{c}_t$, i.e., a node with a self-connected recurrent edge of fixed weight 1
- This ensures that the gradient can pass across many time steps without vanishing or exploding (LONG dependency).

# Why the name LSTM?

The term "long short-term memory" comes from the following intuition: 
- Simple RNNs have **long-term memory** in the form of weights. The weights change slowly during training,  encoding general knowledge about the data. 
- They also have **short-term memory** in the form of ephemeral activations, which pass from each node to successive nodes. 
- The LSTM model introduces an intermediate type of storage via **the memory cell**. **A memory cell is a composite unit,  built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes.**

# Best way to understand LSTM: a conveyor
**RNN:** compute and recur

**LSTM:** compute, _let me see what I have to add/forget from the memory_, then recur

<div align='center'><img src="figs/rnn_lstm.png" width='60%' ></div>

# Best way to understand LSTM: a conveyor

**LSTM:** have these highways (bottom lines) that forward information from the past

<br><div align='center'><img src="figs/rnn_lstm_02.png" width='60%' ></div>

# From RNN to LSTM Units

<br><div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width='60%' ></div>

<small>Taken from [colah.github.io](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)</small>

# Breaking down LSTM

The hidden states of the LSTM are now 2:
1. the usual hidden state $\mbf{h}_t$ like simple RNN
2. another state  $\mbf{c}_t$ which is called **memory cell state vector** $\leftarrow$ this is the highway!


# LSTM Unit Input

At each time step $t$, we have as input:
- $\mbf{x}_{t}$ (input from data)
- $\mbf{c}_{t-1}$ (previous memory state)
- $\mbf{h}_{t-1}$ (previous hidden state)

# LSTM Unit Output

At each time step $t$, we have to compute:
- $\mbf{c}_{t}$ (next memory state)
- $\mbf{h}_{t}$ (next hidden state)

# Breaking down LSTM

Given some **gates** ("controls"), we want to compute the next state:

$$
\begin{aligned}
\mbf{c}_t & = \operatorname{function}\Big(\mbf{c}_{t-1}, \operatorname{function}(\mbf{h}_{t-1},\mbf{x}_{t})  ;\operatorname{\mathbf{gates}}\Big)\\
\mbf{h}_t & = \operatorname{function}\Big(\mbf{c}_t;\operatorname{\mathbf{gates}}\Big)
\end{aligned}
$$

# What are the "controls"? They are Gates!

Gates ("controls") change the inductive bias of the architecture, allowing the network to control the information that flows.

$$
\begin{aligned}
\operatorname{\mathbf{gates}} & = \operatorname{function}\Big(\mbf{h}_{t-1},\mbf{x}_{t}\Big)\\
\mbf{c}_t & = \operatorname{function}\Big(\mbf{c}_{t-1}, \operatorname{function}(\mbf{h}_{t-1},\mbf{x}_{t})  ;\operatorname{\mathbf{gates}}\Big)\\
\mbf{h}_t & = \operatorname{function}\Big(\mbf{c}_t;\operatorname{\mathbf{gates}}\Big)
\end{aligned}
$$

# $\operatorname{\mathbf{gates}} = \operatorname{function}\Big(\mbf{h}_{t-1},\mbf{x}_{t}\Big)$

# What are the "controls"? They are Gates!

The LSTM "controls" the information in the memory (as a sort of RAM memory) using various **gates**.
Think **gates** as actions performed on the memory:
- **F**orget about the past in the cell memory state $\longrightarrow$ **soft binary gate**
- **I**nput new information in the cell memory state $\longrightarrow$ **soft binary gate**
- **O**utput information to retain in the next hidden state $\longrightarrow$ **soft binary gate**

Given that we are in 🇮🇹, to recall just remember **FIO**. 🫠

$$\begin{aligned}
\operatorname{\textbf{gates}} & = \operatorname{function}\Big(\mbf{h}_{t-1},\mbf{x}_{t}\Big)
\end{aligned}
$$

# Gates

Gates end with **sigmoid** and act as a **soft binary mask** to control how much information should flow.

Gates are a function of $\Big(\mbf{h}_{t-1},\mbf{x}_{t}\Big)$
<div align='center'><img src="figs/lstm_switch.png?1" width='20%' ></div>

# Gates

Gates end with sigmoid and act as a **soft binary mask** to control how much information should flow.

Gates are a function of $\Big(\mbf{h}_{t-1},\mbf{x}_{t}\Big)$
<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-gate.png" width='10%' ></div>

# Gates

A few details and terminology:
- $\sigma$ is the sigmoid function, the same one you find in logistic regression (LR). Remember that $\sigma$ gives you an output in $[0, 1]$.
    - you can think of $\sigma$ as a probability, as in LR, or, as here, as a **soft mask bounded in [0,1]**
    - $\sigma$ tells you how much information you want to make flow.
- $\odot$ is the **Hadamard product** which is element-wise multiplication of tensors.

$$
\begin{aligned}
\left(\begin{array}{l}
i \\
f \\
o 
\end{array}\right) & =\left(\begin{array}{c}
\sigma \\
\sigma \\
\sigma \\
\end{array}\right) \mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right) \\
\end{aligned}
$$
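
A tiny NumPy sketch of the "soft binary mask" idea, with made-up pre-activation values: the sigmoid squashes each entry into $[0, 1]$, and the Hadamard product $\odot$ then scales the signal entry by entry.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up gate pre-activations: strongly closed, undecided, strongly open
pre_activation = np.array([-10.0, 0.0, 10.0])
gate = sigmoid(pre_activation)

# Some signal we want to let through (or not)
signal = np.array([2.0, 2.0, 2.0])

# Hadamard (element-wise) product: the gate masks the signal softly
masked = gate * signal

print(gate)
print(masked)
```

The first entry is almost entirely blocked, the last one passes almost unchanged, and the middle one is halved, which is the "soft" part of the mask.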

# Gates

$$
\begin{aligned}
\left(\begin{array}{l}
\text{input} \\
\text{forget} \\
\text{output} 
\end{array}\right) & =\left(\begin{array}{c}
\sigma \\
\sigma \\
\sigma \\
\end{array}\right) \mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right) + \mbf{b} \\
\end{aligned}
$$

# Forget Gate

<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png" width='80%' ></div>

# Gates: alternative equation still same concept

Sometimes you can find it written as below, which is more verbose but expresses the same concept. In our case $\mbf{W}=[\mathbf{W}_{xi};\mathbf{W}_{hi};\mathbf{W}_{xf};\mathbf{W}_{hf};\mathbf{W}_{xo};\mathbf{W}_{ho}]$ is the concatenation of these matrices, and the same holds for the biases.

$$
\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),
\end{aligned}
$$

# Candidate update for $\mbf{c}_t$: $\tilde{\mbf{c}}_t$

We can compute it in a similar way to the **gates**, but with its own parameters

$\mbf{c}_t = \operatorname{function}\Big(\mbf{c}_{t-1}, \underbrace{\operatorname{function}(\mbf{h}_{t-1},\mbf{x}_{t})}_{\text{candidate update}~~\tilde{\mbf{c}}_t}; \underbrace{\operatorname{\mathbf{gates}}}_{\text{you know this!}}\Big)$

# Gates + Candidate update $\tilde{\mbf{c}}_t$

$$
\begin{aligned}
\left(\begin{array}{l}
\text{input} \\
\text{forget} \\
\text{output} \\
\tilde{\mbf{c}}_t
\end{array}\right) & =\left(\begin{array}{c}
\sigma \\
\sigma \\
\sigma \\
\tanh\\
\end{array}\right) \mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right) + \mbf{b} \\
\end{aligned}
$$

# Update the memory cell state: forget or add input


<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png" width='80%' ></div>

# Update the memory cell state: forget or add input


<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png" width='80%' ></div>

# Parametric model for LSTM Unit

The parametric model of an LSTM comprises a matrix $\mbf{W}$ of dimensionality $4D \times 2D$, where $D$ is the dimension of the hidden state $\mbf{h}$.

- $2D$ because we take as input both $\mbf{h}_{t-1}$ and $\mbf{x}_{t}$
- $4D$ because we output 4 vectors of dimensionality $D$: forget, input, output and candidate update $\tilde{\mbf{c}}_t$

The bias is optional and has dimensionality $4D$. 


This assumes dim($\mbf{h}$) = dim($\mbf{x}$), but they could differ.
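
These sizes can be checked against PyTorch's `nn.LSTM`, as a rough sketch with a toy $D$. One caveat: PyTorch does not store a single $4D \times 2D$ matrix; it keeps the input block (`weight_ih_l0`) and the recurrent block (`weight_hh_l0`) separately, plus two bias vectors. Concatenating the two blocks column-wise recovers the $\mbf{W}$ described above.

```python
import torch
import torch.nn as nn

D = 4  # toy size, assuming dim(h) = dim(x) = D
lstm_shapes = nn.LSTM(input_size=D, hidden_size=D)

W_x = lstm_shapes.weight_ih_l0   # acts on x_t,     shape (4D, D)
W_h = lstm_shapes.weight_hh_l0   # acts on h_{t-1}, shape (4D, D)

# Column-wise concatenation gives the single matrix acting on [h_{t-1}; x_t]
W = torch.cat([W_h, W_x], dim=1)  # shape (4D, 2D)

# PyTorch keeps two bias vectors of size 4D; their sum plays the role of b
b = lstm_shapes.bias_ih_l0 + lstm_shapes.bias_hh_l0

print(W.shape, b.shape)
```

When dim($\mbf{x}$) differs from dim($\mbf{h}$), only the number of columns of `W_x` changes; the $4D$ rows stay the same.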

# $\mbf{c}_t = \operatorname{function}\Big(\mbf{c}_{t-1}, \operatorname{function}(\mbf{h}_{t-1},\mbf{x}_{t})  ;\operatorname{\mathbf{controls}}\Big)$

# Update the memory cell state


$$ \mbf{c}_t = f \odot \mbf{c}_{t-1}+i \odot \tilde{\mbf{c}}_t$$

# Update the memory cell state


$$ \mbf{c}_t = \underbrace{f \odot \mbf{c}_{t-1}}_{\text{forget the past}}+\underbrace{i \odot \tilde{\mbf{c}}_t}_{\text{input new information}} $$

# Update the memory cell state: forget or add input


<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png" width='80%' ></div>

# Update the hidden state

$$
\begin{aligned}
\mbf{c}_t & =f \odot \mbf{c}_{t-1}+i \odot \tilde{\mbf{c}}_t \\
\mbf{h}_t & =o \odot \tanh \left(\mbf{c}_t\right)
\end{aligned}
$$

# Update the hidden state

<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png" width='80%' ></div>

# Putting all together

$$
\begin{aligned}
\left(\begin{array}{l}
i \\
f \\
o \\
\tilde{\mbf{c}}_t
\end{array}\right) & =\left(\begin{array}{c}
\sigma \\
\sigma \\
\sigma \\
\tanh
\end{array}\right) \mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right) + \mbf{b} \\
\mbf{c}_t & =f \odot \mbf{c}_{t-1}+i \odot \tilde{\mbf{c}}_t \\
\mbf{h}_t & =o \odot \tanh \left(\mbf{c}_t\right)
\end{aligned}
$$
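
To see the full set of equations at work, here is a sketch of a single LSTM step written by hand and compared with PyTorch's `nn.LSTMCell`. The sizes are toy values, and one detail is worth flagging: PyTorch packs the four row blocks in the order input, forget, candidate, output, which differs from the order $(i, f, o, \tilde{\mbf{c}}_t)$ used above, although the computation is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 2   # toy sizes
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)       # current input
h_prev = torch.randn(1, hidden_size)   # previous hidden state
c_prev = torch.randn(1, hidden_size)   # previous memory cell state

with torch.no_grad():
    # One affine map produces all four pre-activations (4 * hidden_size values)
    pre = (x_t @ cell.weight_ih.T + cell.bias_ih
           + h_prev @ cell.weight_hh.T + cell.bias_hh)

    # PyTorch row order: input, forget, candidate, output
    i, f, c_tilde, o = pre.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c_tilde = torch.tanh(c_tilde)

    c_t = f * c_prev + i * c_tilde     # update the memory cell state
    h_t = o * torch.tanh(c_t)          # update the hidden state

    # Reference result from the library implementation
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```

If both checks agree, the hand-written step and the library cell implement the same update, up to floating-point tolerance.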

# Putting all together

<div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width='80%' ></div>

# Simple RNN vs LSTM

\begin{align}
\mbf{h}_t=\tanh \Biggl(\mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right)\Biggr)
&& \qquad 
\begin{aligned}
\left(\begin{array}{l}
i \\
f \\
o \\
\tilde{\mbf{c}}
\end{array}\right) & =\left(\begin{array}{c}
\sigma \\
\sigma \\
\sigma \\
\tanh
\end{array}\right) \mbf{W}\left(\begin{array}{c}
\mbf{h}_{t-1} \\
\mbf{x}_t
\end{array}\right) + \mbf{b}\\
\mbf{c}_t & =f \odot \mbf{c}_{t-1}+i \odot \tilde{\mbf{c}} \\
\mbf{h}_t & =o \odot \tanh \left(\mbf{c}_t\right)
\end{aligned}
\end{align}

<small>Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997<br>   
Gers, Schmidhuber, and Cummins, 2000. Learning to Forget: Continual Prediction with LSTM</small>

# RNN vs LSTM Gradients


```python
H = 5  # dim of hidden state
T = 50 # steps of unrolling the RNN
Whh = np.random.randn(H, H)
hs, ss = {}, {}
hs[-1] = np.random.randn(H)
# forward pass of the RNN ignoring input
for t in range(T):
    ss[t] = np.dot(Whh, hs[t-1])
    hs[t] = np.maximum(0,ss[t])
# backward pass of the RNN
dhs, dss = {}, {}
dhs[T-1] = np.random.randn(H) # we inject random gradients (similar to having a loss)
for t in reversed(range(T)):
    dss[t]=(hs[t] > 0)*dhs[t]  #backprop through the non-linearity (ReLU is active where hs[t] > 0)
    dhs[t-1] = np.dot(Whh.T,dss[t]) #backprop into previous hidden state

```


<small>Code by A. Karpathy</small>

# RNN vs LSTM Gradients


```python
dhs[0]= np.dot(Whh.T, dss[1]) # where dss[1]=(hs[1] > 0)*dhs[1]
```

```python
dhs[0]= np.dot(Whh.T, (hs[1] > 0)*dhs[1]) # where dhs[1]=np.dot(Whh.T, (hs[2] > 0)*dhs[2])
```

```python
dhs[0]= np.dot(Whh.T, (hs[1] > 0)*np.dot(Whh.T, (hs[2] > 0)*dhs[2])) 
```

# RNN vs LSTM Gradients

RNN vs LSTM gradients on the input weight matrix

**Error is generated at 128th step and propagated back. No error from other steps.**
At the beginning of training. Weights sampled from Normal Distribution in (-0.1, 0.1). 

<small>[Taken from http://imgur.com/gallery/vaNahKE](http://imgur.com/gallery/vaNahKE)</small>

<div align='center'><video controls loop src="figs/RNN-vs-LSTM-gradients.mp4"/></div>


# LSTM Variants

<div align='center'><img src="figs/rnn_variants.png" width='80%' ></div>

# LSTM and Neural Architectural Search

<div align='center'><img src="figs/rnn_nas.png" width='80%' ></div>

# LSTM Grad Flow

<div align='center'><img src="figs/rnn_lstm_grad_flow.png" width='80%' ></div>

# Does this configuration remind you of another famous architecture?

# LSTM vs ResNet

<br><div align='center'><img src="figs/lstm_resnet.png" width='70%' ></div>

# How does LSTM solve vanishing gradients?

- The LSTM architecture makes it **much easier** for an RNN to **preserve information over many timesteps**
    - e.g., if the **forget gate is set to 1** for a cell dimension and **the input gate is set to 0**, then the **information of that cell is preserved indefinitely**.
    - In contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix $\mbf{W}_h$ that preserves info in the hidden state
    - In practice, you get about <u>**100 timesteps rather than about 7**</u>
- LSTMs do not guarantee that there are no vanishing/exploding gradients, but they do provide an easier way for the model to learn long-distance dependencies. 
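
The "forget gate at 1, input gate at 0" case can be sketched numerically. This is only an illustration with hand-set gate values (in a real LSTM the gates are computed from $\mbf{h}_{t-1}$ and $\mbf{x}_t$) and random candidate updates.

```python
import numpy as np

rng = np.random.default_rng(0)
c0 = np.array([0.7, -1.2, 0.3])   # made-up initial cell state

def run_cell(c, f, i, steps=100):
    """Apply c_t = f * c_{t-1} + i * c_tilde_t with fixed gates."""
    for _ in range(steps):
        c_tilde = np.tanh(rng.standard_normal(c.shape))  # random candidate
        c = f * c + i * c_tilde
    return c

# Forget gate fully open, input gate fully closed: memory is untouched
c_kept = run_cell(c0.copy(), f=np.ones(3), i=np.zeros(3))

# Forget gate slightly below 1: the memory decays geometrically
c_decayed = run_cell(c0.copy(), f=np.full(3, 0.9), i=np.zeros(3))

print(c_kept)
print(c_decayed)
```

With $f = 1$ and $i = 0$ the cell state after 100 steps equals the initial one, while with $f = 0.9$ it has shrunk by a factor of $0.9^{100}$, so how long information survives depends on how close to 1 the network learns to keep the forget gate.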

# LSTM: Real-world success

- In 2013–2015, LSTMs started achieving state-of-the-art results
    - Successful tasks include **handwriting recognition, speech recognition, machine translation, parsing, and image captioning**, as well as language models
    - LSTMs became the dominant approach for most NLP tasks
    
<small>Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf</small>

<small>Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf</small>

<small>Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf</small>

# Situation now
- Now (2019–2023), **Transformers** have become dominant for all tasks
    - For example, in WMT (a Machine Translation conference + competition):
    - In WMT **2014, there were 0 neural machine translation systems (!)**
    - In WMT **2016**, the summary report contains **“RNN” 44 times** (and these systems won)
    - In WMT **2019** **“RNN” 7 times**, **“Transformer” 105 times**
    


# Practical Application with LSTM

<small>Tutorial taken from https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html</small>

# Implementing LSTM from scratch with PyTorch

```python
import torch
import torch.nn as nn

class LSTMScratch(nn.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.num_hiddens = num_hiddens  # needed by forward() to build the initial state
        # Gaussian initialization with standard deviation sigma
        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        # (input-to-hidden weights, hidden-to-hidden weights, bias) for one gate
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xi, self.W_hi, self.b_i = triple()  # Input gate
        self.W_xf, self.W_hf, self.b_f = triple()  # Forget gate
        self.W_xo, self.W_ho, self.b_o = triple()  # Output gate
        self.W_xc, self.W_hc, self.b_c = triple()  # Input node
```

# Implementing LSTM from scratch with PyTorch

```python
# forward() is a method of LSTMScratch (shown separately here)
def forward(self, inputs, H_C=None):
    # inputs has shape (seq_len, batch_size, num_inputs)
    if H_C is None:
        # Initial state with shape: (batch_size, num_hiddens)
        H = torch.zeros((inputs.shape[1], self.num_hiddens),
                      device=inputs.device)
        C = torch.zeros((inputs.shape[1], self.num_hiddens),
                      device=inputs.device)
    else:
        H, C = H_C
    outputs = []
    for X in inputs:
        I = torch.sigmoid(torch.matmul(X, self.W_xi) +
                        torch.matmul(H, self.W_hi) + self.b_i)
        F = torch.sigmoid(torch.matmul(X, self.W_xf) +
                        torch.matmul(H, self.W_hf) + self.b_f)
        O = torch.sigmoid(torch.matmul(X, self.W_xo) +
                        torch.matmul(H, self.W_ho) + self.b_o)
        C_tilde = torch.tanh(torch.matmul(X, self.W_xc) +
                           torch.matmul(H, self.W_hc) + self.b_c)
        C = F * C + I * C_tilde
        H = O * torch.tanh(C)
        outputs.append(H)
    return outputs, (H, C)

```

# Implementing LSTM using PyTorch

- As previously, the hyperparameter `num_hiddens` dictates the number of hidden units. 

- We initialize weights following a Gaussian distribution with 0.01 standard deviation, and we set the biases to 0.

# [Torch  Text](https://pytorch.org/text/stable/index.html)

<br><div align='center'><img src="figs/torchtext.png" width='80%' ></div>


# [LSTM PyTorch API](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#lstm)

<div align='center'><img src="figs/lstm.png" width='80%' ></div>

# [LSTM PyTorch API](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#lstm)

<div align='center'><img src="figs/lstm_02.png" width='80%' ></div>

# [LSTM PyTorch API](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#lstm)

<div align='center'><img src="figs/lstm_01.png" width='80%' ></div>

# [LSTM PyTorch API](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#lstm)

<div align='center'><img src="figs/lstm_params.png" width='80%' ></div>

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f89f142b390>

In [4]:
# we define an LSTM model with
# input_size a 10D vector
# hidden_size a 20D vector
# num_layers means 2 stacked LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
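
A quick sanity check of what this 2-layer model returns, as a sketch on random data (the sequence length 7 and batch size 3 are arbitrary choices, and `lstm_two_layers` is a hypothetical name used to avoid clashing with `lstm` above):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lstm_two_layers = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)

# Random input with layout (seq_len, batch, feature)
x = torch.randn(7, 3, 10)
out, (h_n, c_n) = lstm_two_layers(x)

# out: hidden states of the TOP layer at every time step
# h_n, c_n: final hidden/cell state of EACH layer
print(out.shape, h_n.shape, c_n.shape)

# The last time step of out matches the final hidden state of the top layer
print(torch.allclose(out[-1], h_n[-1]))
```

The first axis of `h_n` and `c_n` indexes the layers, so with `num_layers=2` it has size 2 regardless of the sequence length.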

In [5]:
# we define an LSTM model with
# input_size a 3D vector
# hidden_size a 2D vector
# num_layers=1 means a single (non-stacked) LSTM layer
input_size, hidden_size, batch_size = 3, 2, 1
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1)

# Let us make up a single training sequence

### This is very useful in deep learning to see if your model digests input correctly
### Very common to test the network on random data

In [6]:
inputs = [torch.randn(1, input_size) for _ in range(5)]  # make a sequence of length 5

# Now we will do "forward pass" manually

In [7]:
# initialize the hidden state.
hidden = (torch.randn(1, batch_size, hidden_size),
          torch.randn(1, batch_size, hidden_size))
for i in inputs: # loop over sequences
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden) # i (1x3) becomes 1x1x3

# We can also let PyTorch handle everything

# Batch

Remember that in practice the input is batched. As input to the RNN you feed a 3D tensor $\mbf{X}$ of size: 

$$ \mbf{X} = (\text{seq_len, batch, feature}) $$

where:
- `seq_len` is the axis to index the sequence
- `batch` is the axis to index which sequence in the batch
- `feature` is the axis to index the feature

# Batch



$$ \mbf{X} = (\text{seq_len, batch, feature}) $$

$\mbf{X}[0][0][:]$ means the feature of the **first element** in the **first sequence**

We can do the entire sequence all at once.
1. first value `out` returned by LSTM is all of the hidden states throughout the sequence.
2. the second `hidden` is just the most recent hidden state
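
A short sketch that checks both statements on random data, assuming the default `(seq_len, batch, feature)` layout, a zero initial state, and toy sizes; `lstm_demo` is a hypothetical name used to keep this example separate from the `lstm` defined earlier. It also confirms that feeding the sequence one step at a time gives the same hidden states as feeding it all at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

seq_len, batch, feature, hidden_size = 5, 4, 3, 2   # toy sizes
lstm_demo = nn.LSTM(input_size=feature, hidden_size=hidden_size)
X = torch.randn(seq_len, batch, feature)

with torch.no_grad():
    # Whole sequence at once (initial state defaults to zeros)
    out, (h_n, c_n) = lstm_demo(X)

    # Same sequence, one time step at a time, carrying the state along
    state = None
    steps = []
    for t in range(seq_len):
        o_t, state = lstm_demo(X[t:t + 1], state)   # X[t:t+1] keeps the seq axis
        steps.append(o_t)
    out_steps = torch.cat(steps, dim=0)

print(out.shape)                                   # all hidden states
print(torch.allclose(out[-1], h_n[0]))             # last slice of out vs. final hidden state
print(torch.allclose(out, out_steps, atol=1e-6))   # step-by-step vs. all at once
```

Processing the whole tensor in one call is usually preferable in practice, since the library can then run the time loop internally.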

Let's make the training data a batch now

In [8]:
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time

# Add the extra 2nd dimension
inputs_batch = torch.cat(inputs).view(len(inputs), 1, -1)
# torch.cat(inputs) turns the 5-item list of 1x3 tensors into a 5x3 tensor
# .view(len(inputs), 1, -1) reshapes the 5x3 tensor into 5x1x3

```python
inputs_batch = torch.cat(inputs).view(len(inputs), 1, -1)
# torch.cat(inputs): turns the list of 5 tensors of shape 1x3 into a single 5x3 tensor
# .view(len(inputs), 1, -1): reshapes the 5x3 tensor to 5x1x3
```

# Note we still have a single seq in the batch

# Now we forward in the network

# LSTM Units

<br><div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width='60%' ></div>

<small>Taken from [colah.github.io](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)</small>

In [9]:
hidden = (torch.randn(1, batch_size, hidden_size),
          torch.randn(1, batch_size, hidden_size))  #clean out hidden state
out, hidden = lstm(inputs_batch, hidden)

In [10]:
print(out)
print(hidden)

tensor([[[-0.1099, -0.1292]],

        [[ 0.0916, -0.1633]],

        [[ 0.2786, -0.2308]],

        [[ 0.1764, -0.1675]],

        [[ 0.0410, -0.1014]]], grad_fn=<MkldnnRnnLayerBackward0>)
(tensor([[[ 0.0410, -0.1014]]], grad_fn=<StackBackward0>), tensor([[[ 0.0480, -0.5339]]], grad_fn=<StackBackward0>))


# All hidden states and last hidden state
```python
type(out), type(hidden)
```
{{type(out), type(hidden)}}

# All hidden states `out` shape

`out.shape`

{{out.shape}}

5 elements in the sequence, only 1 sequence, of dimensionality 2

# `hidden` contains the hidden state $\mbf{h}$ and the cell state $\mbf{c}$
```python
out[-1], hidden[0] # are the same thing
```
{{out[-1].detach().numpy(), hidden[0].detach().numpy() }}

# Let us now batchify the data
### Batch size is 10

In [11]:
batch_size = 10
inputs_batch = torch.randn(5, batch_size, 3)
hidden = (torch.randn(1, batch_size, hidden_size), 
          torch.randn(1, batch_size, hidden_size))  # clean out hidden state
out, (H, C) = lstm(inputs_batch, hidden)

# Question: what is the size now of `out`, `H` and `C`?

In [12]:
out.shape

torch.Size([5, 10, 2])

In [13]:
H.shape, C.shape

(torch.Size([1, 10, 2]), torch.Size([1, 10, 2]))
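
The two ways of running the LSTM seen above (stepping through the sequence manually, or feeding the whole sequence at once) can be checked against each other. What follows is a minimal sketch, assuming `input_size=3` and `hidden_size=2` as the printed shapes suggest; the seed and the names `step_outs` and `out_manual` are illustrative and not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch_size = 3, 2, 5, 1  # assumed toy sizes
lstm = nn.LSTM(input_size, hidden_size)

x = torch.randn(seq_len, batch_size, input_size)  # (seq_len, batch, feature)
h0 = torch.randn(1, batch_size, hidden_size)
c0 = torch.randn(1, batch_size, hidden_size)

# (a) whole sequence at once
out, (h_n, c_n) = lstm(x, (h0, c0))

# (b) one time step at a time, carrying the state forward
hidden = (h0, c0)
step_outs = []
for t in range(seq_len):
    # x[t:t+1] keeps the seq_len axis, so the slice is 1 x batch x feature
    o, hidden = lstm(x[t:t + 1], hidden)
    step_outs.append(o)
out_manual = torch.cat(step_outs, dim=0)

print(out.shape, h_n.shape, c_n.shape)
print(torch.allclose(out, out_manual, atol=1e-5))  # same hidden states both ways
print(torch.allclose(out[-1], h_n[0], atol=1e-6))  # last slice of out is h_n
```

The last check is the claim made in the comments above: for a single-layer, unidirectional LSTM the last slice of `out` is the final hidden state.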

# POS tagging with LSTM

In this section, we will use an LSTM to get part of speech tags. We will
not use Viterbi or Forward-Backward or anything like that.

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocab. 

Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

# There is NO standalone POS tagging anymore, only seq2seq modeling

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $\mbf{h}_i$. Also, assign each tag a
unique index. Then our prediction rule for $\hat{y}_i$ is

$$\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(A\mbf{h}_i + \mbf{b}))_j$$

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
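
The prediction rule can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes (`hidden_dim = 2`, three tags); `A`, `b` and `h_i` are random here, whereas in the model they come from training and from the LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_tags = 2, 3                   # assumed toy sizes, |T| = 3
A = rng.normal(size=(num_tags, hidden_dim))   # affine map: hidden space -> tag space
b = rng.normal(size=num_tags)
h_i = rng.normal(size=hidden_dim)             # hidden state at timestep i

scores = A @ h_i + b                          # one score per tag, shape (|T|,)

# numerically stable log-softmax: subtract the max before exponentiating
shifted = scores - scores.max()
log_probs = shifted - np.log(np.exp(shifted).sum())

y_hat = int(np.argmax(log_probs))             # index of the predicted tag
print(log_probs, y_hat)
```

Since log-softmax is monotone, the `argmax` of `log_probs` is the same as the `argmax` of the raw `scores`; the log-probabilities matter for the loss, not for the prediction.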

# Prepare the data

In [14]:
training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# Now to each word we assign a unique ID

In [15]:
word_to_ix = {}
for seq_x, seq_y in training_data:
    for word in seq_x:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


# Now to each label we assign a unique ID

In [16]:
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index
ix_to_tag =  { v:k for k,v in tag_to_ix.items()}

# From text to tensor

In [17]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# LSTM + Input embedding + Output projection
<br><div align='center'><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width='60%' ></div>

<small>Taken from [colah.github.io](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)</small>

In [18]:
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 3
HIDDEN_DIM = 2

# The model

In [19]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        # NOTE that we could have used LSTM Pytorch API to implement this
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

# The training

In [20]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Sanity Check

In [21]:
# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    seq_x = training_data[0][0] # input sequence of the 1st training example
    inputs = prepare_sequence(seq_x, word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores) # num. of words (5) x num. of tags (3)

tensor([[-1.4133, -0.8714, -1.0838],
        [-1.3909, -0.8884, -1.0793],
        [-1.5004, -0.8467, -1.0552],
        [-1.4326, -0.8832, -1.0560],
        [-1.6684, -0.7974, -1.0189]])


# Fitting part (training)

In [22]:
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data: # 5 words --> 3 classes
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward() # get the gradients over params
        optimizer.step() # update the params

# Let's see now the score after training (on the training data)

In [23]:
# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "The dog ate the apple". Entry i,j is the score of tag j
    # for word i. The predicted tag is the maximum scoring tag in each row.
    # The target sequence is 0 1 2 0 1, i.e. DET NN V DET NN.
    # In the run shown here the row-wise argmax is 0 1 1 0 1 (DET NN NN DET NN):
    # the third word "ate" is still tagged NN instead of V.
    print(tag_scores)

tensor([[-0.0521, -3.6669, -3.6814],
        [-3.8780, -0.3398, -1.3189],
        [-3.0486, -0.5128, -1.0391],
        [-0.0356, -4.1671, -3.9390],
        [-4.4822, -0.2243, -1.6628]])


In [24]:
print(training_data[0])
print("predicted ", " -> ".join([ix_to_tag[i.item()] for i in torch.argmax(tag_scores, dim=1)]))

(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN'])
predicted  DET -> NN -> NN -> DET -> NN


# Remember: for evaluation you have to validate on the valid/test set!

# Homework

1. Study and reproduce [NLP From Scratch: Classifying Names with a Character-Level RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial)
2. Study and reproduce [NLP From Scratch: Generating Names with a Character-Level RNN](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)

# Natural Language Processing

### Neural Machine Translation (NMT)
<br><br>
Prof. Iacopo Masi and Prof. Stefano Faralli

# Today's lecture
## - What is Machine Translation
## - Neural Machine Translation (NMT)
## - Beam Search
## - How to eval NMT

# This lecture material is taken from
📘 **Chapter 10 Jurafsky Book**

📘 **Chapter 18 Eisenstein Book**
- [Stanford Slide NMT](http://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture07-final-project.pdf)
- [Stanford Lecture NMT](https://www.youtube.com/watch?v=wzfWHP6SXxY&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ&index=8)


# [The Giant Language Model Test Room - GLTR](http://gltr.io/)

<br><div align='center'><img src="figs/gltr.png" width='40%' ></div>

# Machine Translation

Machine Translation (MT) is the task of translating a sentence $\mbf{x}$ from one language (the source language) to a sentence $\mbf{y}$ in another language (the target language).

$$ \mbf{x} \qquad \text{L'homme est né libre, et partout il est dans les fers} $$

$$ \mbf{y} \qquad  \text{Man is born free, but everywhere he is in chains}$$

# The early history of MT: 1950s
- Machine translation research began in the early 1950s on machines less
powerful than high school calculators (before the term “A.I.” was coined!)
- Concurrent with foundational work on automata, formal languages,
probabilities, and information theory
- MT heavily funded by military, but basically just simple rule-based
systems doing word substitution
- Human language is more complicated than that, and varies more across
languages!
- Little understanding of natural language syntax, semantics, pragmatics
- Problem soon appeared **intractable**

# 1990s-2010s: Statistical Machine Translation

Core idea: Learn a probabilistic model from data
- Suppose we’re translating French → English.
- We want to find best English sentence y, given French sentence x

$$ \arg\max_{y} p(y|x)$$

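- Use Bayes' rule to break this down into two components to be learned separately (the denominator $p(x)$ does not depend on $y$, so it can be dropped):
$$ \arg\max_{y} p(y|x) = \arg\max_{y} p(x|y)p(y)$$

$p(y) \longrightarrow$ Language Model: models how to write good English **(fluency)**. Learned from monolingual data.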

$p(x|y)\longrightarrow$ Translation Model: Models how words and phrases should be translated **(fidelity)**.
Learned from aligned data.
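
A toy illustration of this decomposition: given a handful of candidate translations, pick the one maximizing $p(x|y)p(y)$. All probabilities below are made up for the example; a real SMT system would estimate them from monolingual and aligned corpora.

```python
import math

# Hypothetical scores for one French input x and three English candidates y.
# p_lm[y] plays the role of p(y):   how fluent the English sentence is.
# p_tm[y] plays the role of p(x|y): how well y explains the French input.
p_lm = {"man is born free": 0.020, "man born is free": 0.001, "the man is free": 0.030}
p_tm = {"man is born free": 0.30, "man born is free": 0.35, "the man is free": 0.05}

def noisy_channel_score(y):
    # work in log space: log p(x|y) + log p(y)
    return math.log(p_tm[y]) + math.log(p_lm[y])

best = max(p_lm, key=noisy_channel_score)
print(best)
```

Note how the second candidate has the highest translation-model score but is rejected because the language model finds it not fluent.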

# Translation is not trivial to model!


# There is not a clear alignment nor one-to-one mapping
<br><div align='center'><img src="figs/machine_trans_01.png" width='60%' ></div>


# Different symbols
<br><div align='center'><img src="figs/machine_trans_03.png" width='60%' ></div>


# Google Translate over time
<br><div align='center'><img src="figs/machine_trans_02.png" width='60%' ></div>


# 1990s-2010s: Statistical Machine Translation (SMT)

SMT was a huge research field
- The best systems were extremely complex
- Hundreds of important details
    - Systems had many separately-designed subcomponents
- Lots of feature engineering
    - Need to design features to capture particular language phenomena
    - Required compiling and maintaining extra resources
    - Lots of human effort to maintain
    - Repeated effort for each language pair!

# Then came Neural MT [circa 2014]

<br><div align='center'><img src="figs/asteroidi.jpg" width='60%' ></div>


# Neural Machine Translation

Neural Machine Translation (NMT) is a way to do Machine Translation with **a single
end-to-end neural network.**

The neural network architecture is called a **sequence-to-sequence model (aka seq2seq)** and it involves **two RNNs**.

<br>
<div align='center'><img src="https://d2l.ai/_images/encoder-decoder.svg" width='60%' ></div>


# Neural Machine Translation

Here, the encoder RNN will take a **variable-length sequence** as input and transform it into a **fixed-shape hidden state**. Later, we will introduce attention mechanisms, which allow us to access encoded inputs without having to compress the entire input into a single fixed-length representation.

<br>
<div align='center'><img src="https://d2l.ai/_images/seq2seq.svg" width='60%' ></div>

# Sequence-to-sequence is versatile!

- The general notion here is an **encoder-decoder** model
    - One neural network takes input and produces a neural representation
    - Another network produces output based on that neural representation
    - If the input and output are sequences, we call it a `seq2seq` model
- Sequence-to-sequence is useful for **more than just MT**



# NLP tasks $\approx$ seq2seq:
- Summarization (long text → short text)
- Dialogue (previous utterances → next utterance)
- Parsing (input text → output parsed as sequence)
- Code generation (natural language → Python code)

<div align='center'><img src="figs/python_NLP.png" width='100%' ></div>

# NMT is a Conditional LM

The sequence-to-sequence model is an example of a **Conditional Language Model**
- **Language Model** because the decoder is predicting the next word of the target sentence $y$
- **Conditional** because its predictions are also conditioned on the source sentence $x$

NMT directly calculates $p(y|x)$, where $y$ is the target sentence and $x$ is the source sentence.
Decoding then generates likely sentences $y$ given the input $x$.

$$p(y|x) = p(y_1|x)p(y_2|y_1,x)p(y_3|y_1,y_2,x) \ldots p(y_T|y_1,y_2,\ldots,y_{T-1},x)$$
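
The factorization can be made concrete with a few numbers. In this sketch the per-step probabilities are invented; in a real system each one would be the softmax output of the decoder for the correct next token.

```python
import math

# Hypothetical per-step probabilities p(y_t | y_1..y_{t-1}, x) that a decoder
# assigned to the four tokens of one target sentence.
step_probs = [0.5, 0.8, 0.9, 0.7]

p_sentence = math.prod(step_probs)                      # product of conditionals
log_p_sentence = sum(math.log(p) for p in step_probs)   # same quantity in log space

print(p_sentence, log_p_sentence)
```

Summing logs instead of multiplying probabilities avoids numerical underflow on long sentences.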

# NMT is a Conditional LM


**Question: How to train an NMT system?**
- (Easy) Answer: Get a big parallel corpus…
- But there is now exciting work on “unsupervised NMT”, data augmentation, etc.

<div align='center'><img src="figs/nmt_conditioning.png" width='60%' ></div>



# Training NMT

<div align='center'><img src="figs/nmt_conditioning_02.png" width='60%' ></div>

# Training NMT

<div align='center'><img src="figs/nmt_conditioning_03.png" width='60%' ></div>

# Multi-layer Deep encoder-decoder Machine Translation Net
<br><div align='center'><img src="figs/nmt_deep.png" width='80%' ></div>

<small>[Sutskever et al. 2014; Luong et al. 2015]</small>
<br> <small>Slide from Stanford</small>

# How do we generate a sentence?

# Generating is also called "decoding"
# Decoding happens at inference [test] time

# Remember something already seen about decoding?

$p(w_{t-1}|w_t = \text{natural}) = 1 \cdot 0.9\cdot0.95\cdot0.65\cdot0.2 = 0.11115$

Loss is $-\log\big(p(w_{t-1}|w_t = \text{natural})\big) = -\log(0.11115)$

<div align='center'><img src="figs/hsoftmax_3g.png?2" width='35%' ></div>

# Decoding

**Important:** At inference time we do not have the labels!

1. Exhaustive search [too complex]
2. **Greedy search** (at each branch take the branch at maximum probability) [too greedy and deterministic] $\longleftarrow$
3. Beam search


# Decoding: Greedy decoding

- We saw how to generate (or "decode") the target sentence by taking `argmax` on each step of the decoder.
- This is greedy decoding (take most probable word on each step)
<br><div align='center'><img src="figs/greedy_decoding.png" width='45%' ></div>
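
Greedy decoding can be sketched as a short loop. The "decoder" below is a hypothetical lookup table (`TOY_MODEL`) standing in for the RNN: it maps the prefix generated so far to a distribution over the next token.

```python
# Hypothetical toy "decoder": prefix -> distribution over the next token.
# A real decoder would be an RNN conditioned on the source sentence.
TOY_MODEL = {
    ("<START>",): {"he": 0.6, "I": 0.4},
    ("<START>", "he"): {"hit": 0.7, "struck": 0.3},
    ("<START>", "he", "hit"): {"me": 0.55, "a": 0.45},
    ("<START>", "he", "hit", "me"): {"<END>": 1.0},
}

def next_token_probs(prefix):
    # prefixes not listed in the toy table simply end the sentence
    return TOY_MODEL.get(tuple(prefix), {"<END>": 1.0})

def greedy_decode(max_len=10):
    seq = ["<START>"]
    while len(seq) < max_len and seq[-1] != "<END>":
        probs = next_token_probs(seq)
        seq.append(max(probs, key=probs.get))  # argmax over the vocabulary
    return seq

print(greedy_decode())
```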


# Decoding: Probabilistic Decoding

- Instead of taking the `argmax`, we sample from the predicted probability distribution at each step.
- This makes the algorithm randomized, which is useful when we want to generate multiple, similar sentences.
<br><div align='center'><img src="figs/nmt_conditioning_04.png" width='35%' ></div>
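
A sketch of sampling-based decoding, again with a hypothetical lookup table in place of the decoder RNN. The only change with respect to greedy decoding is that the next token is drawn with `random.choices` instead of being the `argmax`.

```python
import random

# Hypothetical toy "decoder": prefix -> distribution over the next token.
TOY_MODEL = {
    ("<START>",): {"he": 0.6, "I": 0.4},
    ("<START>", "he"): {"hit": 0.7, "struck": 0.3},
    ("<START>", "he", "hit"): {"me": 0.55, "a": 0.45},
}

def next_token_probs(prefix):
    # prefixes not listed in the toy table simply end the sentence
    return TOY_MODEL.get(tuple(prefix), {"<END>": 1.0})

def sample_decode(rng, max_len=10):
    seq = ["<START>"]
    while len(seq) < max_len and seq[-1] != "<END>":
        probs = next_token_probs(seq)
        tokens, weights = zip(*probs.items())
        # draw the next token proportionally to its probability
        seq.append(rng.choices(tokens, weights=weights, k=1)[0])
    return seq

rng = random.Random(0)  # fixed seed only to make the demo repeatable
for _ in range(3):
    print(" ".join(sample_decode(rng)))
```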

# The problem with Greedy Decoding

Greedy decoding has no way to undo decisions! Once a decision is taken, it is final.
- Input: `il a m’entarté` $\longrightarrow$ `he hit me with a pie`

Step 1: `he __________`

Step 2: `he hit_______`

Step 3: `he hit a_____` _(wrong prediction,  no way of going back)_

**Greedy is suboptimal:** at each local step, you just choose the maximum, without seeing the entire distribution

# Decoding

**Important:** At inference time we do not have the labels!

1. **Exhaustive search** [too complex] $\longleftarrow$
2. Greedy search (at each branch take the branch at maximum probability) [too greedy] 
3. Beam search (we will cover later on) 


# Exhaustive Search

Ideally, we want to find a (length $T$) translation/decoding $y$ that maximizes:

$$ p(y|x) = p(y_1 | x) p(y_2 | y_1, x) p(y_3 | y_2,y_1, x) 
\ldots p(y_T | y_{T-1},\ldots,y_1, x) = \prod_{t=1}^T p(y_t|y_{t-1},\ldots,y_1,x)$$

We could try computing **all possible sequences y** and take globally the most likely.
- This means that on each step $t$ of the decoder, we are tracking $|V|^t$ possible partial translations, where $|V|$ is the vocabulary size
- This $\mathcal{O}(|V|^T)$ complexity is **far too expensive.**
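
On a tiny problem exhaustive search is still feasible, which makes the blow-up easy to see. The sketch below assumes a hypothetical 3-word vocabulary and a bigram-style table `COND` in place of the decoder; it enumerates all $|V|^T$ sequences with `itertools.product`.

```python
import itertools
import math

VOCAB = ["a", "b", "c"]   # hypothetical vocabulary, |V| = 3
T = 4                     # fixed output length for the example

# Hypothetical conditionals p(y_t | y_{t-1}); None marks the start of the sentence.
COND = {
    None: {"a": 0.5, "b": 0.3, "c": 0.2},
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.3, "b": 0.1, "c": 0.6},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}

def log_prob(seq):
    prev, total = None, 0.0
    for tok in seq:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

candidates = list(itertools.product(VOCAB, repeat=T))  # all |V|^T sequences
best = max(candidates, key=log_prob)
print(len(candidates), best)
```

With a realistic vocabulary of tens of thousands of words and sentences of a few dozen tokens, the same enumeration is out of reach.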

# Decoding


1. Exhaustive search [too complex]
2. Greedy search (at each branch take the branch at maximum probability) [too greedy] 
3. <u>**Beam search**</u> $\longleftarrow$ ✅

# Beam search decoding 🔦

**Core idea:** take a trade-off approach between local (greedy) and global (exhaustive). On each step of decoder, keep track of the $k$ **most probable partial translations** (which we call **hypotheses**)

A **hypothesis** $\{y_1,\ldots,y_t\}$ has a score which is its log probability:

$\text{score}\{y_1,\ldots,y_t\} = \log p(y_1,\ldots,y_t|x) = \sum_{i=1}^t  \log p(y_i|y_1,\ldots,y_{i-1},x) $

- Scores are all negative (each is the log of a probability, which is at most 1); a higher score, i.e. closer to 0, means a more probable hypothesis
- We search for high-scoring hypotheses, **tracking top-k on each step**
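
A minimal beam search sketch under stated assumptions: the model is a hypothetical table `COND` giving p(next token | previous token), whereas a real NMT decoder conditions on the whole prefix and on the source sentence. In this variant a completed hypothesis that makes it into the top $k$ is set aside and the beam shrinks accordingly; other implementations handle completed hypotheses differently.

```python
import math

# Hypothetical toy model: p(next token | previous token).
COND = {
    "<START>": {"he": 0.45, "I": 0.55},
    "he": {"hit": 0.6, "was": 0.3, "<END>": 0.1},
    "I": {"was": 0.4, "hit": 0.3, "<END>": 0.3},
    "hit": {"me": 0.8, "<END>": 0.2},
    "was": {"hit": 0.9, "<END>": 0.1},
    "me": {"<END>": 1.0},
}

def beam_search(k=2, max_len=6):
    beams = [(0.0, ["<START>"])]   # (log-probability score, tokens so far)
    completed = []
    for _ in range(max_len):
        # expand every hypothesis in the beam with every possible next token
        expansions = []
        for score, seq in beams:
            for tok, p in COND[seq[-1]].items():
                expansions.append((score + math.log(p), seq + [tok]))
        # keep only the k best-scoring expansions
        expansions.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in expansions[:k]:
            if seq[-1] == "<END>":
                completed.append((score, seq))   # finished: set it aside
            else:
                beams.append((score, seq))
        if not beams:
            break
    return sorted(completed, key=lambda c: c[0], reverse=True)

for k in (1, 2):
    score, seq = beam_search(k=k)[0]
    print(k, " ".join(seq), round(math.exp(score), 4))
```

With `k=1` beam search reduces to greedy decoding; in this toy table the greedy first choice (`I`) leads to a less probable sentence than the one found with `k=2`.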

# Beam search decoding 🔦

Beam search is not guaranteed to find optimal solution but much more efficient than exhaustive search.
To better understand how **beam search** works we use a **search tree**.

<div align='center'><img src="figs/beam_search_01.png" width='65%' ></div>



# Beam search decoding 🔦
The size $k$ of this **fixed-size memory footprint** is called the **beam width**, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower. In practice $k$ is typically around 5 to 10.

<div align='center'><img src="figs/beam_search_02.png" width='45%' ></div>



# Beam search decoding 🔦

<div align='center'><img src="figs/beam_search_03.png" width='65%' ></div>

# Beam search decoding 🔦: stopping criterion

In greedy decoding, usually we decode until the model produces an `<END>` token
 - For example: `<START> he hit me with a pie <END>`

In beam search decoding, different hypotheses may produce `<END>` tokens on
different timesteps:
 - When a hypothesis produces `<END>`, that hypothesis is complete.
 - Place it aside and continue exploring other hypotheses via beam search.

Usually we continue beam search until:
 - We reach **timestep T (where T is some pre-defined cutoff)**, OR
 - We have at least $n$ completed hypotheses (where $n$ is pre-defined cutoff)

# Beam search decoding 🔦: finishing up

We have our list of completed hypotheses.
- How to select top one?

- Each hypothesis $\{y_1,\ldots,y_T\}$ on our list has a score:

$$ \text{score}(y_1,\ldots,y_T) = \log p(y_1,\ldots,y_T|x) = \sum_{i=1}^T \log p(y_i|y_1,\ldots,y_{i-1},x)$$

### Can you spot a problem with this?

# Shorter Sentences may have higher scores!

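Every additional token adds a non-positive term $\log p(\cdot) \le 0$ to the sum, so longer hypotheses tend to score lower.

**Fix: Normalize by length. Use this to select top one instead**

$$\frac{1}{T}\sum_{i=1}^T \log p(y_i|y_1,\ldots,y_{i-1},x)$$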
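
The effect of length normalization can be seen on two completed hypotheses. The per-token probabilities below are invented for the example.

```python
import math

# Hypothetical completed hypotheses with the per-token probabilities
# the model assigned to them.
hypotheses = {
    "he hit <END>": [0.5, 0.6, 0.5],
    "he hit me with a pie <END>": [0.5, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9],
}

def raw_score(probs):
    # sum of log-probabilities: every extra token can only lower it
    return sum(math.log(p) for p in probs)

def normalized_score(probs):
    # average log-probability per token
    return raw_score(probs) / len(probs)

best_raw = max(hypotheses, key=lambda h: raw_score(hypotheses[h]))
best_norm = max(hypotheses, key=lambda h: normalized_score(hypotheses[h]))
print("raw:       ", best_raw)
print("normalized:", best_norm)
```

The raw score prefers the short, truncated hypothesis even though every token of the longer one is at least as likely; the normalized score prefers the longer one.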

# How do we evaluate Machine Translation?

**BLEU (Bilingual Evaluation Understudy)**

**BLEU** compares the machine-written translation to one or several human-written translation(s), and **computes a similarity score** based on:
- Geometric mean of **n-gram precision** (usually for 1, 2, 3 and 4-grams)
- Plus a penalty for too-short system translations

BLEU is useful but imperfect:
- There are many valid ways to translate a sentence
- So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
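
To make the two ingredients concrete, here is a simplified sentence-level BLEU sketch: clipped n-gram precision for n = 1..4, combined by geometric mean, times a brevity penalty. It is an illustration only (no smoothing, uniform weights, whitespace tokenization); established toolkits should be used for reported results.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: no smoothing, uniform n-gram weights."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # one zero precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))
    # brevity penalty against the reference length closest to the candidate length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "he hit me with a pie".split()
print(bleu("he hit me with a pie".split(), [reference]))
print(bleu("he hit me with a tart".split(), [reference]))
print(bleu("he threw a pie at me".split(), [reference]))
```

The third candidate shares most of its words with the reference but has no matching 2-gram, so without smoothing its score drops to zero: a small instance of the imperfection discussed above.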

# BLEU
<div align='center'><img src="figs/bleu.png" width='45%' ></div>


# NMT Progress

<div align='center'><img src="figs/nmt_progress.png" width='50%' ></div>


# Advantages of NMT ✅

Compared to SMT, NMT has many **advantages**: 

- Better performance
    - More fluent
    - Better use of context
    - Better use of phrase similarities
    
- A single neural network to be optimized end-to-end
    - No subcomponents to be individually optimized
    
- Requires much less human engineering effort
    - No feature engineering
    - Same method for all language pairs

# Disadvantages of NMT ❌

Compared to SMT: 

- NMT is **less interpretable**
    - Hard to debug
    - Why did this translation come up?
    
- NMT is **difficult to control**
    - For example, cannot easily specify rules or guidelines for translation
    - Safety concerns!
    - Invention of content not in source
    - Systematic gender biases

# NMT: the first big success story of NLP Deep Learning

Neural Machine Translation went from a **fringe research attempt** in 2014 to the leading
**standard method** in 2016

- 2014: First seq2seq paper published [Sutskever et al. 2014]
- 2016: Google Translate switches from SMT to NMT – and by 2018 everyone has.
<div align='center'><img src="figs/companies.png" width='50%' ></div>
- SMT systems, built by hundreds of engineers over many years, outperformed by NMT systems trained by small groups of engineers in a few months

# Using NMT to introduce Attention 🧐

# NMT is a Conditional LM


Do you see any problem with this architecture?

<div align='center'><img src="figs/attention_00.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Why attention? Sequence-to-sequence: the bottleneck problem

<div align='center'><img src="figs/attention_01.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Attention 🧐

Attention provides a **solution to the bottleneck problem.**

**Core idea:** on each step of the decoder, <u>**use direct connection to the encoder to focus
on a particular part**</u> of the source sequence

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_02.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_03.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_04.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_05.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_06.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_07.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_08.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_09.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_10.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_11.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_12.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Sequence-to-sequence with attention

<div align='center'><img src="figs/attention_13.png" width='60%' ></div>
<small>Picture from Stanford</small>

# Attention with Equations

- We have encoder hidden states $\mbf{h}_1,\ldots,\mbf{h}_N \in \mathbb{R}^h$
- **On timestep $t$**, we have decoder hidden state $\mbf{s}_t \in \mathbb{R}^h$
- We get the attention scores for this step $\mbf{e}^t$:

$$ \mbf{e}^t = [ \mbf{s}_t^{\top}\mbf{h}_1,\ldots,\mbf{s}_t^{\top}\mbf{h}_N] \in \mathbb{R}^N$$

- We take **softmax** to get the attention distribution for this step (this is a probability distribution and sums to 1)
$$ \alpha^t = \operatorname{softmax}(\mbf{e}^t) $$

- We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $\mbf{a}_t$:
 $$ \mbf{a}_t = \sum_{i=1}^N \alpha_i^t\mbf{h}_i \in \mathbb{R}^h $$
- Finally we concatenate the attention output $\mbf{a}_t$ with the decoder hidden state $\mbf{s}_t$ and proceed as in the non-attention seq2seq model: $$[\mbf{a}_t,\mbf{s}_t] \in \mathbb{R}^{2h}$$
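
The four steps above translate almost line by line into NumPy. This is a sketch with assumed toy sizes and random vectors in place of real encoder and decoder states; the variable names mirror the symbols in the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, h = 4, 3                         # assumed toy sizes: 4 source positions, hidden size 3
H = rng.normal(size=(N, h))         # encoder hidden states h_1..h_N, one per row
s_t = rng.normal(size=h)            # decoder hidden state at timestep t

e_t = H @ s_t                       # attention scores s_t^T h_i, shape (N,)

alpha_t = np.exp(e_t - e_t.max())   # softmax, shifted for numerical stability
alpha_t /= alpha_t.sum()            # attention distribution, sums to 1

a_t = alpha_t @ H                   # weighted sum of encoder states, shape (h,)
out = np.concatenate([a_t, s_t])    # [a_t, s_t], shape (2h,)

print(alpha_t, out.shape)
```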

# Attention 🧐 is great! 🦾

Attention provides **some interpretability**:
- By inspecting attention distribution, we see what the decoder was focusing on
- We get (soft) alignment for free!
- This is cool because we never explicitly trained an alignment system
- The network just learned alignment by itself

<div align='center'><img src="figs/attention_14.png" width='60%' ></div>
<small>Picture from Stanford</small>

# There are _several_ attention variants
- We have some **values** $\mbf{h}_1,\ldots, \mbf{h}_N \in \mathbb{R}^{D1}$ and a **query** $\mbf{s} \in \mathbb{R}^{D2}$
- Attention always involves:
    1. Computing the attention scores $\mbf{e} \in \mathbb{R}^{N}$
    2. Taking `softmax` to get attention distribution $\alpha$:
     $$ \alpha = \operatorname{softmax}(\mbf{e}) \in \mathbb{R}^{N}$$
    3. Using attention distribution to take weighted sum of values:
     $$\mbf{a} = \sum_{i=1}^N \alpha_i\mbf{h}_i  \in \mathbb{R}^{D1}  $$
    4. thus obtaining the **attention output** $\mbf{a}$ (sometimes called the **context vector**)

# Attention variants


There are several ways you can compute $\mbf{e} \in \mathbb{R}^{N}$ from $\mbf{h}_1,\ldots, \mbf{h}_N \in \mathbb{R}^{D1}$ and $\mbf{s} \in \mathbb{R}^{D2}$.


## Dot product attention

Basic dot-product attention: $\mbf{e}_i = \mbf{s}^{\top}\mbf{h}_i $. Note: this assumes $D1=D2$. This is the version we saw earlier.

## Bilinear attention

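Bilinear attention: $\mbf{e}_i = \mbf{s}^{\top}\mbf{W}\mbf{h}_i $. $\mbf{W} \in \mathbb{R}^{D2\times D1}$ is a weight matrix (the shape must be $D2\times D1$ for the product to be defined, since $\mbf{s} \in \mathbb{R}^{D2}$ and $\mbf{h}_i \in \mathbb{R}^{D1}$).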

## Reduced Rank

As bilinear but low rank: $\mbf{e}_i = \mbf{s}^{\top}(\mbf{U}^{\top}\mbf{V})\mbf{h}_i =(\mbf{U}\mbf{s})^{\top}(\mbf{V}\mbf{h}_i)$, for low-rank matrices $\mbf{U} \in \mathbb{R}^{k \times D2}$ and $\mbf{V} \in \mathbb{R}^{k \times D1}$ with $k \ll D1,D2$.
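
The three scoring functions can be compared side by side in NumPy. This is a sketch with random matrices and assumed toy sizes; `W` is taken of shape `(D2, D1)` so that $\mbf{s}^{\top}\mbf{W}\mbf{h}_i$ is well defined, and the dot-product case uses a separate query `q` of size `D1` because it requires equal dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D1, D2, k = 5, 4, 3, 2        # assumed toy sizes, k is the reduced rank
H = rng.normal(size=(N, D1))     # values h_1..h_N, one per row
s = rng.normal(size=D2)          # query

# Dot product: needs a query with the same size as the values
q = rng.normal(size=D1)
e_dot = H @ q                    # q^T h_i for every i, shape (N,)

# Bilinear: e_i = s^T W h_i
W = rng.normal(size=(D2, D1))
e_bilinear = H @ (W.T @ s)       # shape (N,)

# Reduced rank: W is replaced by U^T V, with U (k x D2) and V (k x D1)
U = rng.normal(size=(k, D2))
V = rng.normal(size=(k, D1))
e_low_rank = (H @ V.T) @ (U @ s)          # (V h_i)^T (U s) for every i
e_low_rank_full = H @ ((U.T @ V).T @ s)   # same scores via the explicit D2 x D1 matrix

print(e_dot.shape, e_bilinear.shape, np.allclose(e_low_rank, e_low_rank_full))
```

The low-rank form never builds the full `D2 x D1` matrix: it projects query and values into a shared `k`-dimensional space and takes dot products there.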

# Attention is a _general_ Deep Learning technique

We have seen that attention is a great way to improve the sequence-to-sequence model for Machine Translation.
- However: **You can use attention in many architectures** (not just seq2seq) and **many tasks** (not just MT)

More general definition of attention:
 - Given a set of vector **values**, and a vector **query**, attention is a technique to compute
a <u>weighted sum of the values, dependent on the query</u>

- We sometimes say that **the query attends to the values**
- For example, in the `seq2seq + attention model`, each decoder hidden state (`query`) attends to all the encoder hidden states (`values`).


# Intuition about Attention 🧐

- The weighted sum is a **selective summary** of the information contained in the values, where the query determines which values to focus on.
- Attention is a way to obtain a **fixed-size representation** of an arbitrary set of representations (the values), **dependent on some other representation** (the query).

<div align='center'><img src="figs/attention_13.png" width='60%' ></div>