# Attention Is All You Need

Let's try to reimplement the model described in the seminal paper
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) (which introduced the Transformer
architecture) and build ourselves a German-English translator!

## Dataset

Attention Is All You Need was presented at the 2017 [Conference on Neural Information Processing
Systems](https://neurips.cc/Conferences/2017) (NIPS, called NeurIPS since 2018) and used data from
the [machine translation task](https://www.statmt.org/wmt14/translation-task.html) of the
2014 Workshop on Statistical Machine Translation (WMT14). The dataset for the German-English
language pair consists of three parallel corpora (collections of sentences/text snippets in multiple
languages):

* The European Parliament Proceedings Parallel Corpus (Europarl) v7. Documents produced by the
European Parliament are usually translated into all 24 EU languages. Europarl is a collection of
sentences in 11 languages taken from the proceedings of the European Parliament (i.e. political
speeches).
* A parallel text corpus extracted from Common Crawl.
* News Commentary, mostly text taken from economic and political news articles.

The WMT14 task description page also lists the test set, "newstest2014", as well as the recommended
dev/validation set, "newstest2013" (the test set of WMT13).

Our first step is to download the needed archives from the website and extract the relevant
German-English files.

In [None]:
!../sh/download_input_files.sh


The News Commentary files have different line break sequences, so we homogenize them using a script.

In [None]:
%run fix_line_breaks.py \
     ../1_input/news-commentary-v9.de-en.en ../1_input/processed/news-commentary-v9.de-en.en
%run fix_line_breaks.py \
     ../1_input/news-commentary-v9.de-en.de ../1_input/processed/news-commentary-v9.de-en.de
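
The contents of `fix_line_breaks.py` are not shown here. As a rough idea of what such a
homogenization step could look like, here is a minimal sketch that rewrites Windows (`\r\n`) and
lone carriage return (`\r`) line breaks as `\n`. The function names and the choice of sequences are
assumptions for illustration, and the actual script may treat the files differently.

```python
import re

# Assumption: the offending sequences are "\r\n" and lone "\r".
# The real News Commentary files may contain other variants.
_LINE_BREAKS = re.compile("\r\n|\r")


def normalize_line_breaks(text: str) -> str:
    """Replace every line break matched above with a single newline."""
    return _LINE_BREAKS.sub("\n", text)


def fix_file(src_path: str, dst_path: str) -> None:
    # newline="" switches off Python's own newline translation, so the raw
    # sequences are visible when reading and "\n" is written unchanged.
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(normalize_line_breaks(text))
```

Opening both files with `newline=""` matters here: with the default setting, Python would already
translate line breaks on read and might write platform-specific ones on output.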

The newstest2014 files are in SGML. We'll extract the data into simple text files as well. There's a
slight issue with those files: they contain bare ampersands, which have a special meaning in SGML
and XML markup (they introduce entity references), so we need to escape them before converting the
files (they'll turn back into normal ampersands during conversion).

In [None]:
%run escape_ampersands.py \
     ../1_input/newstest2014-deen-ref.de.sgm ../1_input/processed/newstest2014.de.sgm
%run escape_ampersands.py \
     ../1_input/newstest2014-deen-ref.en.sgm ../1_input/processed/newstest2014.en.sgm
%run convert_sgm.py \
     ../1_input/processed/newstest2014.de.sgm ../1_input/processed/newstest2014.en.sgm \
     ../1_input/processed/newstest2014.de ../1_input/processed/newstest2014.en
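
To illustrate why the escaping round-trips, here is a small sketch using Python's standard
`xml.etree.ElementTree` parser. The regular expression and the `escape_ampersands` helper are
assumptions made for this example; `escape_ampersands.py` and `convert_sgm.py` may work differently.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not already start an entity reference such as
# &amp;, &#38; or &#x26; (illustrative pattern, not taken from the script).
_BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_ampersands(text: str) -> str:
    """Turn bare ampersands into &amp; so an XML parser accepts the text."""
    return _BARE_AMPERSAND.sub("&amp;", text)


raw = '<seg id="1">Research & Development</seg>'
escaped = escape_ampersands(raw)

# The parser resolves &amp; back into a plain ampersand.
segment = ET.fromstring(escaped)
print(escaped)       # <seg id="1">Research &amp; Development</seg>
print(segment.text)  # Research & Development
```

Parsing `raw` directly would fail with a `ParseError`, which is the issue the escaping step works
around.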

We now have about 4.5M German-English sentence pairs for training, and about 3K each for validation
and test.

In [None]:
!cd ../1_input && \
wc -l europarl-v7.de-en.de commoncrawl.de-en.de processed/news-commentary-v9.de-en.de && echo && \
wc -l europarl-v7.de-en.en commoncrawl.de-en.en processed/news-commentary-v9.de-en.en && echo && \
wc -l newstest2013.de && \
wc -l newstest2013.en && echo && \
wc -l processed/newstest2014.de && \
wc -l processed/newstest2014.en

## Cleaning the data

To make things easier for ourselves and the model, we will clean the data a bit.

In the test set, there is an accent character that seems to be where a space should be. We deal with
this "manually":

In [None]:
%run fix_testset.py ../1_input/processed/newstest2014.de ../tmp/newstest2014.de

Now we apply a cleaning script to all the input file pairs to do the following:

* NFKC Unicode normalization
* Replacing some characters, e.g. all sorts of quotation marks with simple double quotes
* Removing all pairs where one of the sentences is empty (no translation)
* Removing all pairs where at least one of the sentences contains characters not included in a
rather small set of standard European language characters

"Removing" here means setting the line to an empty string in both files (to keep line numbers
consistent with the original input files).

In [None]:
import multiprocessing
from pathlib import Path

from clean import clean

num_processes = multiprocessing.cpu_count()
src_base_path = Path("../1_input")
dst_base_path = Path("../2_clean")
# Each entry is a (German, English) file pair, relative to src_base_path.
for de_path, en_path in [
    ("europarl-v7.de-en.de", "europarl-v7.de-en.en"),
    ("commoncrawl.de-en.de", "commoncrawl.de-en.en"),
    ("processed/news-commentary-v9.de-en.de", "processed/news-commentary-v9.de-en.en"),
    ("newstest2013.de", "newstest2013.en"),
    ("../tmp/newstest2014.de", "processed/newstest2014.en"),
]:
    # Cleaned files are written flat into dst_base_path under the source file name.
    de_src_path = src_base_path / de_path
    de_dst_path = dst_base_path / de_src_path.name
    en_src_path = src_base_path / en_path
    en_dst_path = dst_base_path / en_src_path.name
    clean(
        str(de_src_path), str(en_src_path), str(de_dst_path), str(en_dst_path), num_processes, True
    )
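
The `clean` module itself is not shown here. The four cleaning steps listed above could be sketched
for a single sentence pair roughly as follows. The quote table, the allowed character set, the
whitespace stripping and the `clean_pair` name are all assumptions for illustration, so the real
script may well use different ones.

```python
import string
import unicodedata
from typing import Tuple

# Hypothetical replacement table: a few typographic quotes -> plain double quote.
QUOTE_MAP = str.maketrans("\u201e\u201c\u201d\u00ab\u00bb", '"' * 5)

# Hypothetical whitelist: ASCII letters, digits, punctuation, space and German letters.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " " + "äöüÄÖÜß")


def clean_pair(de: str, en: str) -> Tuple[str, str]:
    """Return the cleaned pair, or two empty strings if the pair is removed."""
    # Steps 1 and 2: NFKC normalization, then character replacement.
    de, en = (
        unicodedata.normalize("NFKC", s).translate(QUOTE_MAP).strip() for s in (de, en)
    )
    # Step 3: drop pairs where one side is empty (no translation).
    if not de or not en:
        return "", ""
    # Step 4: drop pairs containing characters outside the whitelist.
    if not set(de + en) <= ALLOWED:
        return "", ""
    return de, en
```

Returning a pair of empty strings instead of skipping the pair is what keeps the line numbers of the
output files aligned with the input files.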

The cleaned text files still have the same number of lines as before:

In [None]:
!cd ../2_clean && \
wc -l europarl-v7.de-en.de commoncrawl.de-en.de news-commentary-v9.de-en.de && echo && \
wc -l europarl-v7.de-en.en commoncrawl.de-en.en news-commentary-v9.de-en.en && echo && \
wc -l newstest2013.de && \
wc -l newstest2013.en && echo && \
wc -l newstest2014.de && \
wc -l newstest2014.en

However, when we count only non-empty lines, we see that we lost around 70K (1.5%) of the sentence
pairs in the training set:

In [None]:
!cd ../2_clean && \
../sh/count_non_empty_lines.sh \
    europarl-v7.de-en.de commoncrawl.de-en.de news-commentary-v9.de-en.de && echo && \
../sh/count_non_empty_lines.sh \
    europarl-v7.de-en.en commoncrawl.de-en.en news-commentary-v9.de-en.en && echo && \
../sh/count_non_empty_lines.sh newstest2013.de && \
../sh/count_non_empty_lines.sh newstest2013.en && echo && \
../sh/count_non_empty_lines.sh newstest2014.de && \
../sh/count_non_empty_lines.sh newstest2014.en
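
The same check can be done in Python. The sketch below walks a German and an English file in
parallel and reports how many line pairs there are and how many survived the cleaning. The
`count_pairs` helper is hypothetical, and treating whitespace-only lines as empty is an assumption
that `count_non_empty_lines.sh` may not share.

```python
from typing import Tuple


def count_pairs(de_path: str, en_path: str) -> Tuple[int, int]:
    """Return (total line pairs, pairs where both lines are non-empty)."""
    total = kept = 0
    with open(de_path, encoding="utf-8") as de_file, open(en_path, encoding="utf-8") as en_file:
        # zip stops at the shorter file; the cleaned files are expected to have equal length.
        for de_line, en_line in zip(de_file, en_file):
            total += 1
            if de_line.strip() and en_line.strip():
                kept += 1
    return total, kept
```

For example, `count_pairs("../2_clean/newstest2014.de", "../2_clean/newstest2014.en")` would give
the total and remaining pair counts for the test set in one call.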