<a href="https://colab.research.google.com/github/joeynmt/joeynmt/blob/master/joey_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Joey NMT Demo

In this notebook, we'll train a Transformer model for translating between simple sentences in Esperanto (*epo*) and English (*eng*). You'll have the option to choose your own languages as well.

**Important:** Before you start, set runtime type to GPU.

Author: Julia Kreutzer

## Installation

Install the PyTorch version that Joey NMT requires. You might have to restart the Colab runtime after installing Joey NMT.

In [1]:
!pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (735.4MB)
     |████████████████████████████████| 735.4MB 25kB/s 
ERROR: torchvision 0.8.1+cu101 has requirement torch==1.7.0, but you'll have torch 1.7.1+cu101 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.7.0+cu101
    Uninstalling torch-1.7.0+cu101:
      Successfully uninstalled torch-1.7.0+cu101
Successfully installed torch-1.7.1+cu101


In [2]:
!pip install joeynmt

Collecting joeynmt
  Downloading https://files.pythonhosted.org/packages/19/28/9e2df8769162c911955015a381ea76f4a7248d0fe2169a4ed8b7b5193cd6/joeynmt-1.2-py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 8.1MB/s 
Collecting torchtext==0.8.1
  Downloading https://files.pythonhosted.org/packages/13/80/046f0691b296e755ae884df3ca98033cb9afcaf287603b2b7999e94640b8/torchtext-0.8.1-cp37-cp37m-manylinux1_x86_64.whl (7.0MB)
     |████████████████████████████████| 7.0MB 13.3MB/s 


# Data Preparation

We'll use English - Esperanto translations from the Tatoeba challenge. 

If you want to use a different language pair, you'll need to replace all instances of the `eng` and `epo` language identifiers with those of your languages of choice. You can find all available languages [here](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/Data.md). Note that with more training data, data preparation and training might take much longer than in this example, which has only about 400k sentence pairs.

## Download

In [3]:
!wget https://object.pouta.csc.fi/Tatoeba-Challenge/eng-epo.tar

--2021-02-24 07:48:53--  https://object.pouta.csc.fi/Tatoeba-Challenge/eng-epo.tar
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35747840 (34M) [application/x-tar]
Saving to: ‘eng-epo.tar’


2021-02-24 07:48:56 (16.5 MB/s) - ‘eng-epo.tar’ saved [35747840/35747840]



In [4]:
!tar -xvf  'eng-epo.tar'

data/eng-epo/
data/eng-epo/train.src.gz
data/eng-epo/dev.trg
data/eng-epo/train.id.gz
data/eng-epo/test.trg
data/eng-epo/test.id
data/eng-epo/dev.src
data/eng-epo/dev.id
data/eng-epo/test.src
data/eng-epo/train.trg.gz


In [5]:
!gunzip 'data/eng-epo/train.src.gz'
!gunzip 'data/eng-epo/train.trg.gz'

We'll only use a subset of the dev and test data: the first 1000 sentence pairs of each.

In [6]:
!mv 'data/eng-epo/train.src' 'data/eng-epo/train.eng'
!head -n 1000 'data/eng-epo/dev.src' > 'data/eng-epo/dev.eng'
!head -n 1000 'data/eng-epo/test.src' > 'data/eng-epo/test.eng'
!mv 'data/eng-epo/train.trg' 'data/eng-epo/train.epo'
!head -n 1000 'data/eng-epo/dev.trg' > 'data/eng-epo/dev.epo'
!head -n 1000 'data/eng-epo/test.trg' > 'data/eng-epo/test.epo'

The data is sentence-aligned: the source and target files contain one sentence per line, and each line in one file is the translation of the same line in the other.

In [7]:
! head data/eng-epo/dev.eng


I made you something.
Twice two is four.
That hurts! Stop it!
It is too much for me. I need to slow down.
I never want to see him again.
At what hour was she born?
The traffic jam lasted one hour.
After dinner, we played cards till eleven.
I doubt if he is a lawyer.
Who's on duty today?


In [8]:
! head data/eng-epo/dev.epo

Mi faris ion por vi.
Du oble du faras kvar.
Tio suferigas min! Ĉesu!
Estas tro multe por mi. Mi devas malrapidiĝi.
Mi volas neniam plu vidi lin.
Je kioma horo ŝi naskiĝis?
La trafikmalfluo daŭris unu horon.
Post la vespermanĝo ni kartludis ĝis la dudek tria.
Mi dubas ĉu li estas advokato.
Kiu deĵoras hodiaŭ?
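
Sentence alignment is easy to break when files are truncated or filtered independently, so a quick check can be useful. The following is a minimal sketch, not part of Joey NMT; the function name `check_alignment` is made up for this illustration. It verifies that two parallel files have the same number of lines and returns a few sentence pairs for inspection.

```python
def check_alignment(src_path, trg_path, preview=3):
    """Return (line_count, first `preview` sentence pairs) for two parallel files.

    Raises ValueError if the two files differ in their number of lines.
    """
    with open(src_path, encoding="utf-8") as src:
        src_lines = [line.rstrip("\n") for line in src]
    with open(trg_path, encoding="utf-8") as trg:
        trg_lines = [line.rstrip("\n") for line in trg]
    if len(src_lines) != len(trg_lines):
        raise ValueError(
            f"{src_path} has {len(src_lines)} lines, "
            f"but {trg_path} has {len(trg_lines)}")
    # Pair up line i of the source with line i of the target.
    return len(src_lines), list(zip(src_lines, trg_lines))[:preview]
```

In this notebook it could be called as `check_alignment('data/eng-epo/dev.eng', 'data/eng-epo/dev.epo')`. Equal line counts are a necessary condition for alignment, not a proof of it, so looking at the returned pairs is still worthwhile.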


In [9]:
! wc -l data/eng-epo/*

    1000 data/eng-epo/dev.eng
    1000 data/eng-epo/dev.epo
  235834 data/eng-epo/dev.id
  235834 data/eng-epo/dev.src
  235834 data/eng-epo/dev.trg
    1000 data/eng-epo/test.eng
    1000 data/eng-epo/test.epo
   10000 data/eng-epo/test.id
   10000 data/eng-epo/test.src
   10000 data/eng-epo/test.trg
  402180 data/eng-epo/train.eng
  402180 data/eng-epo/train.epo
    1389 data/eng-epo/train.id.gz
 1547251 total


## Subword model training

We will use the `subword_nmt` library to split words into subwords (BPE) according to their frequency in the training corpus.

In [10]:
import os

In [11]:
src_lang = 'epo'
trg_lang = 'eng'
bpe_size = 4000
datadir = '/content/data/eng-epo/'
name = f'{src_lang}_{trg_lang}_bpe{bpe_size}'


train_src_file = os.path.join(datadir, f'train.{src_lang}')
train_trg_file = os.path.join(datadir, f'train.{trg_lang}')
train_joint_file = os.path.join(datadir, f'train.{src_lang}-{trg_lang}')
dev_src_file = os.path.join(datadir, f'dev.{src_lang}')
dev_trg_file = os.path.join(datadir, f'dev.{trg_lang}')
test_src_file = os.path.join(datadir, f'test.{src_lang}')
test_trg_file = os.path.join(datadir, f'test.{trg_lang}')
src_files = {'train': train_src_file, 'dev': dev_src_file, 'test': test_src_file}
trg_files = {'train': train_trg_file, 'dev': dev_trg_file, 'test': test_trg_file}


vocab_src_file = os.path.join(datadir, f'vocab.{bpe_size}.{src_lang}')
vocab_trg_file = os.path.join(datadir, f'vocab.{bpe_size}.{trg_lang}')
bpe_file = os.path.join(datadir, f'bpe.codes.{bpe_size}')

Train a BPE model with 4000 symbols for both languages jointly.

In [12]:
! cat $train_src_file $train_trg_file > $train_joint_file

! subword-nmt learn-bpe \
  --input $train_joint_file \
  -s $bpe_size \
  -o $bpe_file

This file lists the learned merge operations, one pair of character sequences per line, in the order in which they were learned. The `</w>` suffix marks the end of a word.

In [13]:
! head $bpe_file

#version: 0.2
t h
a n
o n
e r
i n
e n
s t
l a</w>
o r
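
To get an intuition for how such a merge list comes about, here is a toy sketch of BPE learning. It is a simplified illustration, not the `subword_nmt` implementation (which is more efficient and has more options); the function name `learn_bpe` is chosen only for this example. Starting from single characters, it repeatedly merges the most frequent adjacent symbol pair.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word is a tuple of symbols; the last symbol carries the end-of-word marker.
    vocab = Counter(tuple(w[:-1]) + (w[-1] + "</w>",) for w in words if w)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe(["aa", "aa", "ab"], 5)` first merges `a` with `a</w>` (seen twice), then `a` with `b</w>`, and then stops because no pairs are left.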


We apply the learned BPE merges to training, development and test data.


In [14]:
src_bpe_files = {}
trg_bpe_files = {}
for split in ['train', 'dev', 'test']:
  src_input_file = src_files[split]
  trg_input_file = trg_files[split]
  src_output_file = src_input_file.replace(split, f'{split}.{bpe_size}.bpe')
  trg_output_file = trg_input_file.replace(split, f'{split}.{bpe_size}.bpe')
  src_bpe_files[split] = src_output_file
  trg_bpe_files[split] = trg_output_file

  ! subword-nmt apply-bpe \
    -c $bpe_file \
    < $src_input_file > $src_output_file

  ! subword-nmt apply-bpe \
    -c $bpe_file \
    < $trg_input_file > $trg_output_file


The subword-split data contains `@@ ` to indicate where words were split into subwords.

In [15]:
! head data/eng-epo/dev.4000.bpe.eng

I made you some@@ thing.
T@@ w@@ ice two is f@@ our@@ .
That h@@ ur@@ t@@ s! Stop it@@ !
It is too much for me. I need to s@@ low dow@@ n.
I never want to see him again@@ .
A@@ t what h@@ our was she bor@@ n@@ ?
The traf@@ fi@@ c jam last@@ ed one h@@ our@@ .
Af@@ ter d@@ in@@ n@@ er, we played c@@ ards til@@ l el@@ ev@@ en.
I d@@ ou@@ b@@ t if he is a law@@ y@@ er.
Wh@@ o@@ 's on d@@ ut@@ y t@@ od@@ ay@@ ?


In [16]:
! head data/eng-epo/dev.4000.bpe.epo

Mi faris ion por vi.
D@@ u o@@ ble du faras kvar@@ .
Tio sufer@@ igas min@@ ! Ĉes@@ u!
Estas tro multe por mi. Mi devas mal@@ rapi@@ di@@ ĝ@@ i.
Mi volas neniam plu vidi lin.
J@@ e ki@@ om@@ a h@@ oro ŝi n@@ aski@@ ĝ@@ is?
La tra@@ fik@@ mal@@ fl@@ uo daŭ@@ ris unu hor@@ on.
Post la vesper@@ man@@ ĝo ni kart@@ lud@@ is ĝis la dudek tri@@ a.
Mi du@@ b@@ as ĉu li estas ad@@ vok@@ at@@ o.
Kiu de@@ ĵ@@ or@@ as hodi@@ aŭ@@ ?
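
Because the `@@ ` markers record exactly where words were split, the segmentation can be undone with a simple substitution. The sketch below follows the regular expression commonly used for this purpose with subword-nmt output; the function name `undo_bpe` is an assumption made for this example.

```python
import re


def undo_bpe(line):
    """Join subwords back into words by removing the `@@ ` split markers."""
    # Remove "@@ " inside the line and a dangling "@@" at the very end.
    return re.sub(r"(@@ )|(@@ ?$)", "", line)
```

Applied to the second dev sentence above, `undo_bpe("T@@ w@@ ice two is f@@ our@@ .")` gives back `"Twice two is four."`.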


## Prepare the vocabulary

From the pre-processed training data, we extract the final vocabulary for the translation model. It should contain all subwords needed for representing the source and target training data.

In [17]:
! wget https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/build_vocab.py

--2021-02-24 07:49:54--  https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2034 (2.0K) [text/plain]
Saving to: ‘build_vocab.py’


2021-02-24 07:49:54 (52.8 MB/s) - ‘build_vocab.py’ saved [2034/2034]



In [18]:
vocab_src_file = src_bpe_files['train']
vocab_trg_file = trg_bpe_files['train']
bpe_vocab_file = os.path.join(datadir, f'joint.{bpe_size}bpe.vocab')

! python build_vocab.py  \
  $vocab_src_file $vocab_trg_file \
  --output_path $bpe_vocab_file
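
Conceptually, building a joint vocabulary amounts to counting the subword tokens in both BPE-split training files. The following is a rough sketch of that idea only; it is not the code of `build_vocab.py`, whose options and output details may differ, and the function name `collect_vocab` is hypothetical.

```python
from collections import Counter


def collect_vocab(paths):
    """Return all whitespace-separated tokens in `paths`, most frequent first."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # Sort by descending frequency; break ties alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked]
```

Passing both `vocab_src_file` and `vocab_trg_file` in one list yields a single vocabulary that covers both languages, which is what the shared embeddings in the model configuration rely on.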

# Model configuration

Joey NMT reads model and training hyperparameters from a configuration file. We generate it here so that the paths are filled in at the appropriate places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the joint subword vocabulary created above.

In [19]:
# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "{datadir}/train.{bpe_size}.bpe"
    dev:   "{datadir}/dev.{bpe_size}.bpe"
    test:  "{datadir}/test.{bpe_size}.bpe"
    level: "bpe"
    lowercase: False                
    max_sent_length: 30             # Extend to longer sentences.
    src_vocab: "{vocab_src_file}"
    trg_vocab: "{vocab_trg_file}"

testing:
    beam_size: 5
    alpha: 1.0
    sacrebleu:                      # sacrebleu options
        remove_whitespace: True     # `remove_whitespace` option in sacrebleu.corpus_chrf() function (default: True)
        tokenize: "intl"            # `tokenize` option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc) 

training:
    #load_model: "models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # Alternative: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # Decrease while experimenting; around 30 epochs is enough to check whether training works at all.
    validation_freq: 2000          # Set to at least once per epoch.
    logging_freq: 200
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True       # Requires joint vocabulary.
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # Increase to 512 for larger data.
        ff_size: 1024            # Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, source_language=src_lang, target_language=trg_lang,
           datadir=datadir, vocab_src_file=bpe_vocab_file, 
           vocab_trg_file=bpe_vocab_file, bpe_size=bpe_size)
with open("transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
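
To see how the `plateau` settings in this configuration interact, the sketch below simulates the learning rate over a sequence of validation scores. It is a simplified model written for this notebook, not Joey NMT's or PyTorch's scheduler code. It assumes that the tracked score is perplexity (lower is better, as for `early_stopping_metric: "ppl"`) and that the rate is multiplied by `decrease_factor` once more than `patience` consecutive validations fail to improve; the real scheduler may differ in details such as thresholds. The function name `plateau_schedule` is hypothetical.

```python
def plateau_schedule(scores, lr=0.0003, decrease_factor=0.7, patience=5,
                     lr_min=1e-8):
    """Return the learning rate in effect after each validation score."""
    best = None
    bad_rounds = 0
    history = []
    for score in scores:
        if best is None or score < best:
            # New best score: reset the counter of non-improving validations.
            best, bad_rounds = score, 0
        else:
            bad_rounds += 1
        if bad_rounds > patience:
            # Too many validations without improvement: shrink the rate.
            lr = max(lr * decrease_factor, lr_min)
            bad_rounds = 0
        history.append(lr)
    return history
```

With the defaults taken from the configuration, the rate would first drop from 0.0003 to 0.00021 after six validations in a row without a better perplexity.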

# Training

This will take a while. The log reports the training progress; look out for the printed example translations and the BLEU evaluation scores to get an impression of the current quality.

The log is also stored in the model directory within this runtime (inspect files in the menu on the left). There you can also find a summary report of all validations. We'll also use TensorBoard to visualize the training progress on the go. This requires enabling cookies in the browser.

Colab will disconnect after 12 hours at the latest, so to make sure your progress is not lost, download the checkpoints from the model directory from time to time. You'll be able to reload them later, provided the model hyperparameters match.
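
Instead of downloading by hand, checkpoints can also be copied to a location that survives the runtime. The sketch below is one possible way to do this and rests on two assumptions: that checkpoints are named by training step, like the `1.ckpt` referenced in the configuration, and that a persistent directory (for example a mounted Google Drive folder) is available as the backup target. The function name `backup_checkpoints` is made up for this example.

```python
import os
import re
import shutil


def backup_checkpoints(model_dir, backup_dir):
    """Copy step-numbered checkpoints (e.g. 8000.ckpt) to `backup_dir`.

    Returns the copied step numbers in ascending order.
    """
    os.makedirs(backup_dir, exist_ok=True)
    steps = []
    for filename in os.listdir(model_dir):
        # Only files named <digits>.ckpt are copied; logs and other files are skipped.
        match = re.fullmatch(r"(\d+)\.ckpt", filename)
        if match:
            shutil.copy2(os.path.join(model_dir, filename), backup_dir)
            steps.append(int(match.group(1)))
    return sorted(steps)
```

The last element of the returned list is the most recent step, which is a natural candidate for `load_model` when resuming.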

In [33]:
# Load the TensorBoard notebook extension. It will be empty at first.
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [32]:
%tensorboard --logdir models/epo_eng_bpe4000_transformer/tensorboard 


In [20]:
!python -m joeynmt train transformer_epo_eng_bpe4000.yaml

2021-02-24 07:49:59,195 - INFO - root - Hello! This is Joey-NMT (version 1.2).
2021-02-24 07:49:59,253 - INFO - joeynmt.data - loading training data...
2021-02-24 07:50:04,524 - INFO - joeynmt.data - building vocabulary...
2021-02-24 07:50:04,825 - INFO - joeynmt.data - loading dev data...
2021-02-24 07:50:04,870 - INFO - joeynmt.data - loading test data...
2021-02-24 07:50:04,877 - INFO - joeynmt.data - data loaded.
2021-02-24 07:50:05.292394: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-24 07:50:07,345 - INFO - joeynmt.training - Total params: 12223488
2021-02-24 07:50:12,490 - INFO - joeynmt.helpers - cfg.name                           : epo_eng_bpe4000_transformer
2021-02-24 07:50:12,490 - INFO - joeynmt.helpers - cfg.data.src                       : epo
2021-02-24 07:50:12,490 - INFO - joeynmt.helpers - cfg.data.trg                       : eng
2021-02-24 07:50:12,490 - INFO - joeynmt.helpers - cfg.dat

## Continue training after interruption

To continue after an interruption, the configuration needs to be modified in two places:
1. `load_model` must point to the checkpoint to load.
2. `model_dir` must name a new directory, so that the original run is not overwritten.


In [21]:
ckpt_number = 8000
reload_config = config.replace(
    f'#load_model: "models/{name}_transformer/1.ckpt"', f'load_model: "models/{name}_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/{name}_transformer"', f'model_dir: "models/{name}_transformer_continued"')
with open("transformer_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
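
One caveat with `str.replace` is that it returns the string unchanged when the search text is not found, so a typo in the pattern would silently produce a configuration that starts training from scratch. A stricter helper can guard against that. This is only a sketch, and `replace_exactly_once` is a hypothetical name, not a Joey NMT function.

```python
def replace_exactly_once(text, old, new):
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(
            f"expected exactly one occurrence of {old!r}, found {count}")
    return text.replace(old, new)
```

Using it for both substitutions in the cell above would raise an error as soon as either the `load_model` or the `model_dir` line no longer matches the configuration text.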

Joey NMT then picks up training from there.

In [22]:
!python -m joeynmt train transformer_epo_eng_bpe4000_reload.yaml

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/joeynmt/__main__.py", line 3, in <module>
    from joeynmt.training import train
  File "/usr/local/lib/python3.7/dist-packages/joeynmt/training.py", line 18, in <module>
    import torch
  File "/usr/local/lib/python3.7/dist-packages/torch/__init__.py", line 190, in <module>
    from torch._C import *
RuntimeError: KeyboardInterrupt: 


## Let's Translate!

The `test` mode can be used to translate (and evaluate on) the test set specified in the configuration. We usually do this only once after we've tuned hyperparameters on the dev set.

In [23]:
!python -m joeynmt test models/epo_eng_bpe4000_transformer/config.yaml

2021-02-24 08:00:19,593 - INFO - root - Hello! This is Joey-NMT (version 1.2).
2021-02-24 08:00:19,594 - INFO - joeynmt.data - building vocabulary...
2021-02-24 08:00:19,897 - INFO - joeynmt.data - loading dev data...
2021-02-24 08:00:19,905 - INFO - joeynmt.data - loading test data...
2021-02-24 08:00:19,912 - INFO - joeynmt.data - data loaded.
2021-02-24 08:00:19,947 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-02-24 08:00:23,099 - INFO - joeynmt.prediction - Decoding on dev set (/content/data/eng-epo//dev.4000.bpe.eng)...
2021-02-24 08:00:36,302 - INFO - joeynmt.prediction -  dev bleu[intl]:   1.97 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-02-24 08:00:36,302 - INFO - joeynmt.prediction - Decoding on test set (/content/data/eng-epo//test.4000.bpe.eng)...
2021-02-24 08:00:49,635 - INFO - joeynmt.prediction - test bleu[intl]:   1.99 [Beam search decoding with beam size = 5 and alpha = 1.0]


The `translate` mode is interactive: it prompts for sentences to translate. Warning: new input must go through the same pre-processing steps that were applied before model training (here, splitting into subwords).

In [24]:
from subword_nmt import apply_bpe

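# Load the BPE merges learned during data preparation.
with open(bpe_file, "r", encoding="utf-8") as merge_file:
  bpe = apply_bpe.BPE(codes=merge_file)

def preprocess(sentence):
  # Split a raw sentence into subwords, exactly as was done for the training data.
  return bpe.process_line(sentence.strip())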

In [25]:
my_sentence = 'Esperanto, origine la Lingvo Internacia, estas la plej disvastiĝinta internacia planlingvo.'   # From https://eo.wikipedia.org/wiki/Esperanto

In [26]:
preprocess(my_sentence)

'E@@ sper@@ ant@@ o, or@@ ig@@ ine la L@@ ingv@@ o In@@ tern@@ aci@@ a, estas la plej dis@@ v@@ ast@@ iĝ@@ inta intern@@ ac@@ ia plan@@ lingv@@ o.'

In [27]:
!python -m joeynmt translate models/epo_eng_bpe4000_transformer/config.yaml

2021-02-24 08:00:51,240 - INFO - root - Hello! This is Joey-NMT (version 1.2).

Please enter a source sentence (pre-processed): 
E@@ sper@@ ant@@ o, or@@ ig@@ ine la L@@ ingv@@ o In@@ tern@@ aci@@ a, estas la plej dis@@ v@@ ast@@ iĝ@@ inta intern@@ ac@@ ia plan@@ lingv@@ o.
JoeyNMT: Ever, we have to the Lule of the Taper of the Toug.

Please enter a source sentence (pre-processed): 

Bye.
^C
