# Practice session 2

## Preprocessing text data

**We recommend running this notebook in [Google Colab](https://colab.research.google.com). Towards the end of the class we will use the GPUs provided there to make our model train faster. Check that your runtime type is GPU before you start training.**

**If you have a local GPU and a working Sockeye installation, feel free to run it locally. (If you have a working Sockeye installation but no GPU, you can also run everything locally, but the last part, training the model, will be very slow.)**

We have learned how to set up our environment and use Sockeye to train a sequence-to-sequence model. However, there is more to creating an actual machine translation system. Unlike sequences of integers, natural language texts are complicated and messy. We need to preprocess them before we can use them for training.

### 1. Get the data

Download the Estonian-English parallel corpus and move it into `data/`.

In [None]:
!wget "http://opus.nlpl.eu/download.php?f=Europarl/v8/moses/en-et.txt.zip"

In [None]:
!mkdir data
!mv download.php\?f\=Europarl%2Fv8%2Fmoses%2Fen-et.txt.zip data/en-et.zip

Uncompress the data, delete unnecessary files, move the files we need into `data/`.

In [None]:
!unzip data/en-et.zip
!rm LICENSE
!rm README
!rm Europarl.en-et.xml
!mv Europarl* data/

Give our files shorter names.

In [None]:
!mv data/Europarl.en-et.en data/europarl.en
!mv data/Europarl.en-et.et data/europarl.et

### 2. Look at the data

You can see that the parallel corpus is named Europarl. It is one of several corpora that are commonly used.

Let's check how many lines our files contain:

In [None]:
!wc -l data/europarl*

As you may have guessed, "Europarl" comes from "European Parliament". This corpus contains translations of parliament proceedings. It is a convenient resource, as all parliament proceedings have to be translated into all official languages of the European Union, which is great for us!

Let's see what some random sentence pairs from this corpus look like. First, let's merge the source and target files horizontally (each line of the resulting file will contain a source line and a target line, separated by a tab) and shuffle the merged lines, so that each sentence stays next to its translation:

In [None]:
!paste data/europarl.et data/europarl.en | shuf > data/shuf-europarl.both
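
For readers who prefer Python, the same merge-then-shuffle idea can be sketched as follows. This is only an illustration on a made-up three-line toy corpus, not a replacement for the command above; the function name `merge_and_shuffle` and the `seed` parameter are our own assumptions. One difference worth noting is that a fixed seed makes the shuffle reproducible, which plain `shuf` is not.

```python
import random


def merge_and_shuffle(src_lines, tgt_lines, seed=0):
    """Join parallel lines with a tab (like `paste`), then shuffle the pairs (like `shuf`)."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    merged = [f"{src}\t{tgt}" for src, tgt in zip(src_lines, tgt_lines)]
    # A private Random instance with a fixed seed gives the same order on every run.
    random.Random(seed).shuffle(merged)
    return merged


# Toy data, invented for this example.
et = ["Tere!", "See on lause.", "Aitäh."]
en = ["Hello!", "This is a sentence.", "Thank you."]
for line in merge_and_shuffle(et, en):
    print(line)
```

Because the two sides are joined before shuffling, every source sentence stays aligned with its translation.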

Now let's print out several sentence pairs in a readable format:

In [None]:
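with open('data/shuf-europarl.both', 'r', encoding='utf8') as fh:
    for i in range(5):
        # Remove only the newline: a plain strip() would also eat the tab
        # next to an empty side and make the unpacking below fail.
        et_sentence, en_sentence = fh.readline().rstrip('\n').split('\t', 1)
        print('ET: {}\nEN: {}\n'.format(et_sentence, en_sentence))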

**Question.** What kind of language do these sentences exhibit? Do the translations look good to you? If you do not speak Estonian, how did you judge that?

In this practice session, we will only use a small subset of the Europarl corpus for training. However, note that for training a real MT system, even the whole Europarl corpus would not be considered a lot of data.

Let's separate the data again, and keep 20000 lines for the training set, 1000 lines for the development set, and 500 lines for the test set.

In [None]:
!sed -n 1,20000p data/shuf-europarl.both | cut -f 1 > data/train.et
!sed -n 1,20000p data/shuf-europarl.both | cut -f 2 > data/train.en
!sed -n 20001,21000p data/shuf-europarl.both | cut -f 1 > data/dev.et
!sed -n 20001,21000p data/shuf-europarl.both | cut -f 2 > data/dev.en
!sed -n 21001,21500p data/shuf-europarl.both | cut -f 1 > data/test.et
!sed -n 21001,21500p data/shuf-europarl.both | cut -f 2 > data/test.en
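
The split above can also be expressed as list slicing. The sketch below is a hypothetical Python equivalent (the helper names `split_pairs` and `unzip_pairs` are ours), shown on toy data with much smaller sizes. The important property is that the three ranges do not overlap, so no development or test sentence is seen during training.

```python
def split_pairs(lines, n_train, n_dev, n_test):
    """Cut consecutive, non-overlapping train/dev/test slices from shuffled lines."""
    total = n_train + n_dev + n_test
    if len(lines) < total:
        raise ValueError("not enough lines for the requested split sizes")
    return {
        "train": lines[:n_train],
        "dev": lines[n_train:n_train + n_dev],
        "test": lines[n_train + n_dev:total],
    }


def unzip_pairs(lines):
    """Separate tab-joined lines into source and target lists (roughly `cut -f 1` and `cut -f 2`)."""
    src, tgt = [], []
    for line in lines:
        s, _, t = line.rstrip("\n").partition("\t")
        src.append(s)
        tgt.append(t)
    return src, tgt


# Toy stand-in for the shuffled file: 10 tab-separated pairs.
toy_lines = [f"src {i}\ttgt {i}" for i in range(10)]
toy_parts = split_pairs(toy_lines, n_train=6, n_dev=2, n_test=1)
for name, part in toy_parts.items():
    print(name, len(part))
```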

### 3. Cleaning

Now that we have looked at the data, we can start preprocessing it.

Real data is always messy. Some sentence pairs may be misaligned, some translations may be missing. It is always a good idea to take a look at your data and make sure that it is mostly OK.

We can get rid of sentence pairs that are almost definitely bad. These pairs are those where at least one side (source or target) is empty, at least one side is longer than 100 words, or one side contains at least 9 times more words than the other. You are provided with a Python script that does this. Let's apply it to our files.

If you are using Colab, don't forget to upload the script `cleaning.py` first. The script is included in the practice session materials.

In [None]:
!python cleaning.py --corpora data/train --output data/cleaned-train --src_lang et --tgt_lang en

Let's check how many sentence pairs are left:

In [None]:
!wc -l data/cleaned-train*
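
To make the three filtering criteria concrete, here is a minimal sketch of such a filter. It is not the provided `cleaning.py`: the function name `keep_pair`, splitting on whitespace to count words, and the exact handling of the boundary cases are our assumptions.

```python
def keep_pair(src, tgt, max_len=100, max_ratio=9):
    """Return True if a sentence pair passes the three cleaning rules described above."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    # Rule 1: at least one side is empty.
    if src_len == 0 or tgt_len == 0:
        return False
    # Rule 2: at least one side is longer than max_len words.
    if src_len > max_len or tgt_len > max_len:
        return False
    # Rule 3: one side has at least max_ratio times more words than the other.
    if max(src_len, tgt_len) >= max_ratio * min(src_len, tgt_len):
        return False
    return True


# Toy pairs, invented for this example.
pairs = [
    ("Tere !", "Hello !"),
    ("", "Hello !"),
    ("Jah", "yes " * 9),
]
for src, tgt in pairs:
    print(keep_pair(src, tgt), repr(src), repr(tgt.strip()))
```

Only the first toy pair is kept: the second has an empty source side, and in the third the target is 9 times longer than the source.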

### 4. Truecasing

Now we need to truecase the text.

**Question.** Truecasing has been mentioned in the lectures. What is it? Why do we need it?

We will use our own [TartuNLP truecaser](https://github.com/TartuNLP/truecaser). (If you happen to find any bugs, feel free to open an issue or a pull request.)

First, get the truecaser code:

In [None]:
!git clone https://github.com/TartuNLP/truecaser.git

Let's create a directory where we'll keep all the preprocessing models.

In [None]:
!mkdir preproc-models

Now let's train two truecasing models, one for Estonian and one for English.

In [None]:
!python truecaser/learntc.py data/cleaned-train.et preproc-models/tc-et
!python truecaser/learntc.py data/cleaned-train.en preproc-models/tc-en

Now we can apply the models:

In [None]:
!python truecaser/applytc.py preproc-models/tc-et data/cleaned-train.et > data/tc-cleaned-train.et
!python truecaser/applytc.py preproc-models/tc-en data/cleaned-train.en > data/tc-cleaned-train.en

Compare the first sentences of the English training file before and after truecasing:

In [None]:
!head -5 data/cleaned-train.en

In [None]:
!head -5 data/tc-cleaned-train.en

**Question.** Did our model make any unexpected changes in the first examples? If so, can you guess why this happened?

### 5. Subword segmentation

The last preprocessing step is subword segmentation. Words will be split into smaller parts based on character co-occurrence frequency. The most common words will remain in one piece, and rare words will be broken into several units.
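
To give an intuition for how frequency-driven splitting works, below is a toy sketch of byte-pair encoding (BPE), one well-known subword algorithm. It is an illustration only: SentencePiece supports several algorithms, the script used later in this notebook may rely on a different one, and the function names here are our own.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, always joining the most frequent adjacent pair."""
    vocab = Counter(tuple(word) for word in words)  # every word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: "the" is frequent, "then" is rare.
merges = learn_bpe(["the"] * 10 + ["then"], num_merges=2)
print(segment("the", merges))   # ['the']
print(segment("then", merges))  # ['the', 'n']
```

After only two merges the frequent word `the` is already a single unit, while the rare word `then` is still split into two pieces.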

**Question.** Why do we need subword segmentation?

**Note.** In a typical natural language processing pipeline, the first pre-processing step would be **tokenization**. Its main task is to turn a string into a list of tokens, in other words, to separate words from punctuation marks (e.g. `Hi, Mary!` $\rightarrow$ `["Hi", ",", "Mary", "!"]`). We are not doing tokenization in this tutorial because SentencePiece can handle untokenized text. One benefit of using SentencePiece for both tokenization and subword segmentation is that it is fully reversible. Our current pipeline may not be typical for other NLP tasks.
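
For illustration, a very simple regular-expression tokenizer that reproduces the example from the note could look like this. It is a sketch only (the name `simple_tokenize` is ours), and real tokenizers handle many more cases, such as abbreviations, numbers and URLs.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hi, Mary!"))    # ['Hi', ',', 'Mary', '!']
print(simple_tokenize("Hi , Mary !"))  # ['Hi', ',', 'Mary', '!']
```

The two calls return the same tokens, so the original spacing cannot be recovered from the token list alone. This is the kind of information loss that a reversible scheme avoids.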

First, install SentencePiece, one of the popular options for subword segmentation:

In [None]:
!pip install sentencepiece

Now we can learn a model for splitting our text into subwords. Note that it is common to have a joint vocabulary for source and target languages.

SentencePiece is not very straightforward to use, so we have provided you with a script that does everything you need.

If you are using Colab, don't forget to upload the script `word-pieces.py` first. The script is included in the practice session materials.

In [None]:
!python word-pieces.py -h

Let's train a model with a vocabulary of 4000 subwords. This means that the final subword vocabulary will contain 4000 units: the most frequent words get their own entry, and all other words are assembled from smaller pieces.

In [None]:
!python word-pieces.py --action train --size 4000 --corpora data/tc-cleaned-train.* --model preproc-models/sp

Now we can apply our model to the training files:

In [None]:
!python word-pieces.py --action split --corpora data/tc-cleaned-train.* --model preproc-models/sp

### 5. Repeat for dev sets

Now, repeat the preprocessing steps for the development sets. Note that you do not need to learn new models, but only to apply the ones that were learned using the training sets.

In [None]:
### YOUR CODE ###

### 6. Data preparation

If you are working in Colab, install Sockeye. Check that your runtime type is GPU before you install the GPU version.

In [None]:
!wget https://raw.githubusercontent.com/awslabs/sockeye/master/requirements/requirements.gpu-cu100.txt
!pip install sockeye --no-deps -r requirements.gpu-cu100.txt
!rm requirements.gpu-cu100.txt

There is one final preprocessing step left. It is not necessary, but with large datasets it makes life easier. If you do data preparation beforehand, you will not have to spend time preparing data for Sockeye when you start training. For big models this means that if training fails, you will know about it right away and not in a few hours. The following command will serialize data in matrix format and split it into shards, if necessary:

In [None]:
!python -m sockeye.prepare_data --source data/sp-tc-cleaned-train.et \
                                --target data/sp-tc-cleaned-train.en \
                                --shared-vocab \
                                --output prepared_data

### 7. Train a translation model

Now that we have preprocessed some texts, we are finally ready to train a translation model. It will not be very good, because we are only using 20,000 sentence pairs for training and we do not have a lot of time, but nevertheless it should learn something useful.

The following command will train a model with a 1-layer bi-LSTM encoder and a 1-layer LSTM decoder with dot attention for at most 30 epochs.

If you are using Colab, before you start training, check that you have selected runtime type 'GPU'.

In [None]:
import time
start = time.time()
!python -m sockeye.train --prepared-data prepared_data \
                         --validation-source data/sp-tc-cleaned-dev.et \
                         --validation-target data/sp-tc-cleaned-dev.en \
                         --shared-vocab \
                         --output et2en_model \
                         --encoder rnn \
                         --decoder rnn \
                         --num-embed 256 \
                         --rnn-num-hidden 512 \
                         --num-layers 1:1 \
                         --rnn-attention-type dot \
                         --max-seq-len 60 \
                         --decode-and-evaluate 500 \
                         --batch-type word \
                         --batch-size 8000 \
                         --max-num-epochs 30 \
                         --checkpoint-interval 200 \
                         --initial-learning-rate 0.002

print(time.time() - start)

You probably got a best validation BLEU below 0.05, which indicates that our model has not yet learned to translate well.

### 8. Translate something

Before we can translate a sentence, we need to preprocess it in the same way as we did the training and development sets.

First, create a file containing the lines "Tere!" and "See on lause."

In [None]:
with open('data/input.et', 'w', encoding='utf8') as fh:
    fh.write('\n'.join(["Tere!", "See on lause."]) + '\n')

Then preprocess it: truecase and split into subwords.

In [None]:
!python truecaser/applytc.py preproc-models/tc-et data/input.et > data/tc-input.et
!python word-pieces.py --action split --corpora data/tc-input.et --model preproc-models/sp

Now we can translate it. To get a readable sentence, we also need to reverse SentencePiece splitting afterwards.

In [None]:
!python -m sockeye.translate --models et2en_model --input data/sp-tc-input.et --output data/hypothesis.en
!python word-pieces.py --action restore --corpora data/hypothesis.en --model preproc-models/sp
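
Reversing the splitting is simple because SentencePiece marks every original space with the special character `▁` (U+2581). The sketch below shows the idea on a made-up segmentation. It assumes that the hypothesis file contains pieces separated by spaces, and the function name `restore` is ours; the provided script may implement this step differently.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece puts in place of a space


def restore(pieces_line):
    """Glue space-separated subword pieces back into plain text."""
    pieces = pieces_line.split()
    # Joining the pieces removes the artificial spaces between subwords;
    # turning the markers back into spaces recovers the original word boundaries.
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


print(restore("▁this ▁is ▁a ▁sen ten ce ."))  # this is a sentence.
```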

Let's see what we got:

In [None]:
with open('data/de-sp-hypothesis.en', 'r', encoding='utf8') as fh:
    print(fh.read())

You can probably see that our model generates readable English text, but it is not necessarily a translation of the input. The language model component is already OK, but the conditioning part is not working yet. You will fix it when you train a bigger baseline with more data.