# Author: Zhengxiang (Jack) Wang 
# Date: 2021-10-08
# GitHub: https://github.com/jaaack-wang 
# About: Translation, Seq2Seq, Attention for Stanford CS224N: NLP with Deep Learning | Winter 2019

# Table of Contents
- [1. Pre-Neural Machine Translation](#1)
    - [1.1 Problem defined](#1-1)
    - [1.2 Early 50s: rule-based](#1-2)
    - [1.3 1990s-2010s: statistical](#1-3)
- [2. Neural Machine Translation](#2)
    - [2.1 Problem defined](#2-1)
    - [2.2 Seq2seq Model](#2-2)
    - [2.3 Training](#2-3)
    - [2.4 Decoding](#2-4)
        - [2.4.1 Greedy decoding](#2-4-1)
        - [2.4.2 Exhaustive search decoding](#2-4-2)
        - [2.4.3 Beam search decoding](#2-4-3)
    - [2.5 Tradeoff of NMT](#2-5)
        - [2.5.1 Advantages](#2-5-1)
        - [2.5.2 Disadvantages](#2-5-2)
    - [2.6 Evaluation: BLEU](#2-6)
- [3. Attention](#3)
    - [3.1 Background](#3-1)
    - [3.2 Graphic representation](#3-2)
    - [3.3 Equations](#3-3)
    - [3.4 Benefits](#3-4)
    - [3.5 Attention as a general DL technique](#3-5)
    - [3.6 Remaining problems](#3-6)
    - [3.7 Trend](#3-7)
- [4. References](#4)

<a name='1'></a>
# 1. Pre-Neural Machine Translation

<a name='1-1'></a>
## 1.1 Problem defined

**Machine Translation (MT)** is the task of translating a sentence x from one language (the _source language_) to a sentence y in another language (the _target language_).


<a name='1-2'></a>
## 1.2 Early 50s: rule-based

Machine Translation research began in the early 1950s, mostly on Russian → English translation, motivated by the Cold War.

Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts.

<a name='1-3'></a>
## 1.3 1990s-2010s: statistical

- Translation model + language model (see the last lecture for LM)
- Parallel data is needed for building the translation model
- Parallel data is broken down into word alignments (correspondences between particular words in a translated sentence pair)
- **Problem**: some words have no counterparts
<img src='../images/8-statMTDesc.png' width='600' height='300'>
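
The split into a translation model and a language model can be written compactly. The following is a sketch that assumes the standard noisy-channel formulation of statistical MT, with $x$ the source sentence and $y$ the target sentence as in section 1.1:

$$
\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(x \mid y)\, P(y)
$$

Here $P(x \mid y)$ plays the role of the translation model (learned from parallel data) and $P(y)$ the role of the language model. The second equality follows from Bayes' rule, because $P(x)$ does not depend on $y$.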

- Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation”, Brown et al, 1993. http://www.aclweb.org/anthology/J93-2003

<img src='../images/8-statMTAlignment.png' width='600' height='300'>

<img src='../images/8-statMTAlignment2.png' width='600' height='300'>

<img src='../images/8-statMTAlignment3.png' width='600' height='300'>

<img src='../images/8-statMTAlignment4.png' width='600' height='300'>

<img src='../images/8-statMTAlignment5.png' width='600' height='300'>

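We learn $P(x, a \mid y)$, where $a$ is the alignment between the source sentence $x$ and the target sentence $y$, as a combination of many factors, including: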

- Probability of particular words aligning (also depends on position in the sentence)
- Probability of particular words having particular fertility (number of corresponding words)

<img src='../images/8-SMTDecoding.png' width='600' height='300'>

<img src='../images/8-SMT.png' width='600' height='300'>


<a name='2'></a>
# 2. Neural Machine Translation

<a name='2-1'></a>
## 2.1 Problem defined

- **Neural Machine Translation (NMT)** is a way to do Machine Translation with a single neural network
- The neural network architecture is called **sequence-to-sequence (aka seq2seq)** and it involves two RNNs.


<a name='2-2'></a>
## 2.2 Seq2seq Model

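- The seq2seq model is a **conditional language model**: a language model because the decoder predicts the next word of the target sentence y, and conditional because its predictions are also conditioned on the source sentence x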

<img src='../images/8-seq2seqModel.png' width='600' height='300'>
<img src='../images/8-seq2seqModel2.png' width='600' height='300'>
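
Written out, and assuming a target sentence $y = y_1, \dots, y_T$ with $T$ the target length, a conditional language model factorizes the translation probability with the chain rule:

$$
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
$$

Each factor is the probability of the next target word given the target words produced so far and the source sentence $x$.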

- Other applications:
    - Summarization (long text → short text)
    - Dialogue (previous utterances → next utterance)
    - Parsing (input text → output parse as sequence)
    - Code generation (natural language → Python code)


<a name='2-3'></a>
## 2.3 Training 

<img src='../images/8-NNMTtraining.png' width='600' height='300'>


<a name='2-4'></a>
## 2.4 Decoding

<a name='2-4-1'></a>
### 2.4.1 Greedy decoding
- **Greedy decoding**: generate (or “decode”) the target sentence by taking the argmax (the most probable word) at each step of the decoder. Problem: greedy decoding has no way to undo decisions!
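
A minimal sketch of greedy decoding, assuming a hypothetical `toy_next_probs` function that stands in for one decoder step (the table of probabilities is made up for illustration and is not from the lecture):

```python
import math

END = "</s>"

def toy_next_probs(prefix):
    """Hypothetical stand-in for one decoder step: P(next token | prefix)."""
    table = {
        (): {"he": 0.6, "I": 0.4},
        ("he",): {"hit": 0.55, "struck": 0.45},
        ("I",): {"was": 0.9, "am": 0.1},
    }
    # Any prefix not in the table ends the sentence with probability 1.
    return table.get(tuple(prefix), {END: 1.0})

def greedy_decode(next_probs, max_len=10):
    tokens, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_probs(tokens)
        token = max(probs, key=probs.get)  # argmax over this step only
        log_prob += math.log(probs[token])
        if token == END:
            break
        tokens.append(token)
    return tokens, log_prob

tokens, log_prob = greedy_decode(toy_next_probs)
print(tokens, round(math.exp(log_prob), 2))  # ['he', 'hit'] 0.33
```

In this toy table, greedy decoding returns `he hit` with probability 0.33, although `I was` has probability 0.4 × 0.9 = 0.36: once `he` is chosen at the first step, that choice cannot be undone.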

<a name='2-4-2'></a>
### 2.4.2 Exhaustive search decoding
- **Exhaustive search decoding**: compute the probability of every possible translation and choose the one with the highest probability. Problem: far too expensive to do exhaustively! The complexity is $O(V^T)$, where $V$ is the vocabulary size and $T$ is the length of the sequence to translate (the number of decoding time steps).
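
To get a feel for how quickly $V^T$ grows, here is a small sketch that counts candidate sequences. The vocabulary size of 50,000 and length of 20 are illustrative assumptions, not figures from the lecture:

```python
def num_candidate_sequences(vocab_size, length):
    """Number of distinct token sequences of exactly `length` tokens."""
    return vocab_size ** length

print(num_candidate_sequences(10, 3))                 # 1000
print(len(str(num_candidate_sequences(50_000, 20))))  # 94 (number of digits)
```

Even for a 20-token sentence, the count is a 94-digit number, so scoring every candidate is out of the question.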


<a name='2-4-3'></a>
### 2.4.3 Beam search decoding
- **Beam search decoding**: on each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where k is the beam size (in practice around 5 to 10). Problem: beam search is not guaranteed to find the optimal solution, but it is much more efficient and practical than exhaustive search!
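
A minimal sketch of beam search, assuming a hypothetical `toy_next_probs` decoder step with made-up probabilities. It compares raw summed log-probabilities (no length normalization) and stops when every hypothesis has ended or `max_len` is reached, which is a simplification chosen for brevity:

```python
import math

END = "</s>"

def toy_next_probs(prefix):
    """Hypothetical stand-in for one decoder step: P(next token | prefix)."""
    table = {
        (): {"he": 0.6, "I": 0.4},
        ("he",): {"hit": 0.55, "struck": 0.45},
        ("I",): {"was": 0.9, "am": 0.1},
    }
    return table.get(tuple(prefix), {END: 1.0})

def beam_search(next_probs, k=2, max_len=10):
    beams = [([], 0.0)]  # unfinished hypotheses as (tokens, log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis on the beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, p in next_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the k most probable expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == END:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

tokens, score = beam_search(toy_next_probs, k=2)
print(tokens, round(math.exp(score), 2))  # ['I', 'was'] 0.36
```

With `k=2` the sketch keeps both `he` and `I` alive after the first step and ends up with `I was` (probability 0.36). With `k=1` it keeps only the single best hypothesis at each step and returns `he hit` (probability 0.33).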


<img src='../images/8-beamSearch.png' width='600' height='300'>

Stopping condition:
<img src='../images/8-beamSearchStoppingCondition.png' width='600' height='300'>

- Problem:
<img src='../images/8-beamSearchProblem.png' width='600' height='300'>


<a name='2-5'></a>
## 2.5 Tradeoff of NMT

<a name='2-5-1'></a>
### 2.5.1 Advantages

Compared to SMT, NMT has many advantages:
- Better performance 
    - More fluent
    - Better use of context
    - Better use of phrase similarities

- A single neural network to be optimized end-to-end 
    - No subcomponents to be individually optimized

- Requires much less human engineering effort
    - No feature engineering
    - Same method for all language pairs


<a name='2-5-2'></a>
### 2.5.2 Disadvantages

Compared to SMT:
- NMT is less interpretable 
    - Hard to debug

- NMT is difficult to control
    - For example, can’t easily specify rules or guidelines for translation
    - Safety concerns!
    
    
<a name='2-6'></a>
## 2.6 Evaluation: BLEU

- Reference: [Papineni et al, 2002. BLEU: a Method for Automatic Evaluation of Machine Translation.](https://aclanthology.org/P02-1040.pdf)
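
BLEU compares a machine translation against human reference translations using clipped n-gram precisions and a brevity penalty. The sketch below is a simplified, sentence-level version with a single reference; the metric in the paper is computed over a whole corpus and supports multiple references. The names `ngrams`, `simple_bleu`, and `max_n` are chosen here for illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: an n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

sentence = "the cat sat on the mat".split()
print(simple_bleu(sentence, sentence))  # 1.0
```

Clipping is what stops a degenerate candidate such as `the the the the` from scoring a perfect unigram precision against the reference `the cat`: only one of its four `the` tokens is credited, giving 0.25.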


<a name='3'></a>
# 3. Attention

<a name='3-1'></a>
## 3.1 Background

A single encoding of the source sentence cannot capture all the important information about it; this is the information bottleneck (see the previous lecture, which introduced the vanishing gradient problem of RNNs).

<img src='../images/8-seq2seqbottleneck.png' width='600' height='300'>


- **Attention** provides a solution to the bottleneck problem.
- **Core idea**: on each step of the decoder, use a **direct connection to the encoder** to **focus on a particular part** of the source sequence


<a name='3-2'></a>
## 3.2 Graphic representation

Sometimes we take the attention output from the previous step, and also feed it into the decoder (along with the usual decoder input).

<img src='../images/8-attention.png' width='600' height='300'>
<img src='../images/8-attention2.png' width='600' height='300'>



<a name='3-3'></a>
## 3.3 Equations

<img src='../images/8-attentionEq.png' width='600' height='300'>
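
The attention computation for one decoder step can be sketched in plain Python. This assumes basic dot-product attention, where the encoder hidden states and the decoder hidden state have the same dimension; the function and variable names are chosen here for illustration:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_states):
    # Attention scores: dot product of the decoder state with each encoder state.
    scores = [sum(s * h for s, h in zip(decoder_state, h_i))
              for h_i in encoder_states]
    # Attention distribution: softmax over the scores (sums to 1).
    weights = softmax(scores)
    # Attention output: weighted sum of the encoder states.
    dim = len(decoder_state)
    output = [sum(w * h_i[d] for w, h_i in zip(weights, encoder_states))
              for d in range(dim)]
    # Concatenate the attention output with the decoder state.
    return weights, output, output + list(decoder_state)

encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # made-up hidden states
decoder_state = [2.0, 0.0]
weights, output, combined = dot_product_attention(decoder_state, encoder_states)
print([round(w, 3) for w in weights])  # [0.881, 0.119]
```

The decoder state points in the same direction as the first encoder state, so most of the attention weight lands on the first source position.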


<a name='3-4'></a>
## 3.4 Benefits

<img src='../images/8-attentionBenifits.png' width='600' height='300'>


<a name='3-5'></a>
## 3.5 Attention as a general DL technique

- References: “Deep Learning for NLP Best Practices”, Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
- References: “Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017, https://arxiv.org/pdf/1703.03906.pdf

<img src='../images/8-attentionGeneral.png' width='600' height='300'>

<img src='../images/8-attentionGeneral2.png' width='600' height='300'>

<img src='../images/8-attentionGeneral3.png' width='600' height='300'>


<a name='3-6'></a>
## 3.6 Remaining problems

- Further reading: “Has AI surpassed humans at translation? Not even close!” https://www.skynettoday.com/editorials/state_of_nmt

MT is not a solved problem. Many difficulties remain:
 
- Out-of-vocabulary words
- Domain mismatch between train and test data 
- Maintaining context over longer text
- Low-resource language pairs
- Using common sense is still hard

<img src='../images/8-MTProblem.png' width='600' height='300'>

- NMT picks up biases in training data

<img src='../images/8-MTProblem2.png' width='600' height='300'>

- Uninterpretable systems do strange things

<img src='../images/8-MTProblem3.png' width='600' height='300'>

<img src='../images/8-NMTMovesFoward.png' width='600' height='300'>


<a name='3-7'></a>
## 3.7 Trend

- Reference: [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal](http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf)

<img src='../images/8-MTOverTime.png' width='600' height='300'>

<img src='../images/8-MTOverTime2.png' width='600' height='300'>

<a name='4'></a>
# 4. References

- [Course website](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/index.html)

- [Lecture video](https://www.youtube.com/watch?v=XXtpJxZBa2c) 

- [Lecture slide](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture08-nmt.pdf)