# Attention Is All You Need

Let's try to reproduce the model described in the seminal paper
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) (which introduced the Transformer
architecture) and write ourselves a German-English translator!

## Dataset

Attention Is All You Need was presented at the [Conference on Neural Information Processing
Systems](https://neurips.cc/Conferences/2017) (NIPS, called NeurIPS since 2018) in 2017 and used
data from the [machine translation task](https://www.statmt.org/wmt14/translation-task.html) of the
2014 Workshop on Statistical Machine Translation (WMT14). The dataset for the German-English
language pair consists of three parallel corpora (collections of sentences/text snippets in multiple
languages):

* The European Parliament Proceedings Parallel Corpus (Europarl) v7. Documents produced by the
European Parliament are usually translated into all 24 EU languages. Europarl v7 is a collection of
sentences in 21 languages taken from the proceedings of the European Parliament (i.e. political
speeches).
* A parallel text corpus extracted from Common Crawl.
* News Commentary, mostly text taken from economic and political news articles.

The WMT14 task description page also lists the recommended dev/validation set, as well as the test
set, which seem to be taken from News Commentary.


Our first step is to download the needed archives from the website and extract the relevant
German-English files.

In [None]:
!../sh/download_input_files.sh


The News Commentary files have different line break sequences, so we homogenize them using a script.

In [None]:
%run fix_line_breaks.py ../0_download/news-commentary-v9.de-en.en ../tmp/news-commentary-v9.de-en.en
%run fix_line_breaks.py ../0_download/news-commentary-v9.de-en.de ../tmp/news-commentary-v9.de-en.de
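
The script itself isn't shown here, so the following is only a minimal sketch of what such a
normalization could look like. It assumes that the files mix `\r\n`, lone `\r` and `\n` as line
terminators and that all of them should become `\n`; the function names are made up for
illustration and are not taken from `fix_line_breaks.py`.

```python
def normalize_line_breaks(data: bytes) -> bytes:
    """Convert CRLF and lone CR line terminators to LF.

    Assumption: every CR in the input marks the end of a line.
    """
    # Replace CRLF first so that it does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fix_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write a normalized copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_line_breaks(data))
```

Working on raw bytes avoids any guessing about the text encoding, since the three line break
sequences look the same in ASCII and UTF-8.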

We can now concatenate the training set files into one big file per language with each line
corresponding to the line with the same number in the other file. We'll store the validation set
files under new names but otherwise use them as they are.

In [None]:
%cat ../0_download/europarl-v7.de-en.en ../0_download/commoncrawl.de-en.en ../tmp/news-commentary-v9.de-en.en > ../1_input/train.en.txt
%cat ../0_download/europarl-v7.de-en.de ../0_download/commoncrawl.de-en.de ../tmp/news-commentary-v9.de-en.de > ../1_input/train.de.txt

%cp ../0_download/newstest2013.en ../1_input/val.en.txt
%cp ../0_download/newstest2013.de ../1_input/val.de.txt
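
Since everything that follows relies on line `n` of one file being the translation of line `n` of
the other, a quick sanity check may be worthwhile. This is a sketch only; the helper names
`count_lines` and `is_aligned` are hypothetical, and equal line counts are a necessary but not a
sufficient condition for correct alignment.

```python
def count_lines(path: str) -> int:
    """Count lines in a file, splitting on LF only."""
    # Binary mode: iterating the file object splits on b"\n" and nothing else.
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def is_aligned(path_a: str, path_b: str) -> bool:
    """Return True if both files have the same number of lines."""
    return count_lines(path_a) == count_lines(path_b)
```

It could be used like `is_aligned("../1_input/train.en.txt", "../1_input/train.de.txt")`, and in
the same way for the validation files.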

The test set files are in SGML. We'll extract the data into simple text files as well. There's a
slight issue with those files: they contain unescaped ampersands, which have a special meaning in
XML (XML is a restricted form of SGML, and we process the files as XML), so we need to escape them
before converting the files (they'll turn back into normal ampersands during conversion).

In [None]:
%run escape_ampersands.py ../0_download/newstest2014-deen-ref.en.sgm ../tmp/newstest2014.en.sgm
%run escape_ampersands.py ../0_download/newstest2014-deen-ref.de.sgm ../tmp/newstest2014.de.sgm
%run convert_sgm.py ../tmp/newstest2014.en.sgm ../tmp/newstest2014.de.sgm ../1_input/test.en.txt ../1_input/test.de.txt
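
The two scripts aren't shown here, so the following is just a rough sketch of the idea. It rests on
two assumptions: that the sentences sit in `<seg>` elements (as in the WMT reference files I have
seen), and that only ampersands which don't already start one of the predefined XML entities or a
numeric character reference should be escaped. The function names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does NOT start a predefined XML entity or a numeric reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace bare ampersands with &amp; so that an XML parser accepts the text."""
    return BARE_AMPERSAND.sub("&amp;", text)


def extract_segments(xml_text: str) -> list:
    """Return the text of every <seg> element in document order."""
    root = ET.fromstring(xml_text)
    # itertext() also collects text nested in child elements, if there are any.
    return ["".join(seg.itertext()).strip() for seg in root.iter("seg")]
```

The parser turns `&amp;` back into `&` while reading, so a segment such as `AT&T` comes out of
`extract_segments(escape_bare_ampersands(...))` unchanged. Writing one segment per line to the
English and German output files then gives the same line-aligned layout as the training data.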