# Masakhane - Machine Translation for African Languages (Using JoeyNMT)

## Note before beginning:
### - The idea is that you should be able to make minimal changes to this in order to get SOME result for your own translation corpus. 

### - The tl;dr: Go to the **"TODO"** comments which will tell you what to update to get up and running

### - If you actually want to have a clue what you're doing, read the text and peek at the links

### - With 100 epochs, it should take around 7 hours to run in Google Colab

### - Once you've gotten a result for your language, please attach and email your notebook that generated it to masakhanetranslation@gmail.com

### - If you care enough and get a chance, doing a brief background on your language would be amazing. See examples in  [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

## Retrieve your data & make a parallel corpus

So the easiest way to get your data is to use some unix tools such as `wget` to fetch it from a url, and then `gunzip` to extract it if it's zipped. 

Parallel corpuses come in many formats. The ideal corpus comes with 2 files: `file.source` and `file.target` where `"source"` is your source language, such as `en` and `"target"` is your target language, such as `xh` (Xhosa)

Sometimes they come in a **.tmx** file a.k.a a **translation memory file**. This is an xml structure which will include the sentences in your target language and your source language in a single file. Thankfully, Jade wrote a silly `tmx2dataframe` package which converts your tmx file to a pandas dataframe. #

In [0]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffix's of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "xh"
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

In [18]:
# Downloading and unzipping our xhosa corpus
# TODO: You'll need to download & extract your own corpus here! 
! wget "http://opus.nlpl.eu/download.php?f=XhosaNavy/v1/tmx/en-xh.tmx.gz" -O en-xh.tmx.gz
! gunzip -k  en-xh.tmx.gz
! ls -lh

# This is useful if you end up having to use a tmx file,
! pip install tmx2dataframe

--2019-08-11 12:25:42--  http://opus.nlpl.eu/download.php?f=XhosaNavy/v1/tmx/en-xh.tmx.gz
Resolving opus.nlpl.eu (opus.nlpl.eu)... 193.166.25.9
Connecting to opus.nlpl.eu (opus.nlpl.eu)|193.166.25.9|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://object.pouta.csc.fi/OPUS-XhosaNavy/v1/tmx/en-xh.tmx.gz [following]
--2019-08-11 12:25:42--  https://object.pouta.csc.fi/OPUS-XhosaNavy/v1/tmx/en-xh.tmx.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.0, 86.50.254.1
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3526058 (3.4M) [application/gzip]
Saving to: ‘en-xh.tmx.gz’


2019-08-11 12:25:44 (3.12 MB/s) - ‘en-xh.tmx.gz’ saved [3526058/3526058]

gzip: en-xh.tmx already exists; do you wish to overwrite (y or n)? y
total 39M
-rw-r--r--  1 root root  34K Aug 11 10:01 bpe.codes.4000
-rw-r--r--  1 root root 131K Aug 11 10:02 dev.bpe.en
-rw-r--r--  1

In [0]:
import pandas as pd
from tmx2dataframe import tmx2dataframe

# TODO:
# If your source is a translation memory file (tmx file), then the one file contains both your target and source language. If so, set tmx_file = "your file here"
# Comment if you have 2 files, instead of the tmx file
tmx_file = "en-xh.tmx"
source_file = None
target_file = None 

# Uncomment if you have 2 files and set source_file and target_file to the path of your parallel corpus files
# tmx_file = None
# source_file = 'file.src'
# target_file = 'file.tgt'

# Read in the files so we have an appropriate python dataframe
if tmx_file is not None:
    # tmx files
    metadata, df = tmx2dataframe.read(tmx_file)
else:
    # For 2 parallel files
    df_src = pd.read("file.src", header=None, names=["source_sentence"])
    df_tgt = pd.read("file.tgt", header=None, names=["target_sentence"])
    df = pd.concat([df_src, df_tgt], axis=1)
    df["source_language"] = source_language
    df["target_language"] = target_language

# Have a peak at the data
df.head(3)

In [0]:
# This section does the split between train/test/dev for the parallel corpora then saves them as separate files
# We use 1000 dev test and 1000 test set. In practice, it's useful to use an external test set


# Do the split between dev/test/train and create parallel corpora
num_dev_patterns = 1000
num_test_patterns = 1000

# Lower case the corpora
df["source_sentence"] = df["source_sentence"].str.lower()
df["target_sentence"] = df["target_sentence"].str.lower()


devtest = df.tail(num_dev_patterns + num_test_patterns)
test = devtest.tail(num_test_patterns) # Herman
dev = devtest.head(num_dev_patterns)  # Herman: Error in original
stripped = df.drop(df.tail(num_dev_patterns + num_test_patterns).index)

stripped[["source_sentence"]].to_csv("train.en", index=False)
stripped[["target_sentence"]].to_csv("train.xh", index=False)

dev[["source_sentence"]].to_csv("dev.en", index=False)
dev[["target_sentence"]].to_csv("dev.xh", index=False)

test[["source_sentence"]].to_csv("test.en", index=False)
test[["target_sentence"]].to_csv("test.xh", index=False)




---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [22]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

fatal: destination path 'joeynmt' already exists and is not an empty directory.
Processing /content/joeynmt
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... [?25l[?25hdone
  Created wheel for joeynmt: filename=joeynmt-0.0.1-cp36-none-any.whl size=69350 sha256=2d7f527068a659fa9eb13672d62544cc8b5404053ce4c5ade54abb43114e9d09
  Stored in directory: /tmp/pip-ephem-wheel-cache-im26xdyk/wheels/db/01/db/751cc9f3e7f6faec127c43644ba250a3ea7ad200594aeda70a
Successfully built joeynmt
Installing collected packages: joeynmt
  Found existing installation: joeynmt 0.0.1
    Uninstalling joeynmt-0.0.1:
      Successfully uninstalled joeynmt-0.0.1
Successfully installed joeynmt-0.0.1


# Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization [ (Sennrich, 2015) ](https://arxiv.org/abs/1508.07909).

- It was also shown that by optimizing the umber of BPE codes we significantly improve results for low-resourced languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything. Simply running the below will be suitable. 

In [24]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import pathos.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language) # Herman! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

# Create directory, move everyone we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Xhosa Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/enxh/vocab.txt

mkdir: missing operand
Try 'mkdir --help' for more information.
cp: target 'train.xh' is not a directory
cp: target 'test.xh' is not a directory
cp: target 'dev.xh' is not a directory
cp: missing destination file operand after 'bpe.codes.4000'
Try 'cp --help' for more information.
bpe.codes.4000	dev.xh	      sample_data  test.xh	 train.xh
dev.bpe.en	en-xh.tmx     test.bpe.en  train.bpe.en  vocab.en
dev.bpe.xh	en-xh.tmx.gz  test.bpe.xh  train.bpe.xh  vocab.xh
dev.en		joeynmt       test.en	   train.en
BPE Xhosa Sentences
is@@ enzo soku@@ jika inqanawa kwelinye icala kwindawo yayo yokum@@ isa (@@ uku@@ jika kwisiphelo ukuya kwes@@ inye isiph@@ el@@ o@@ ).
"um@@ linganiselo w@@ enyawo ezint@@ andathu uling@@ ana ne ft ezint@@ and@@ athu. umlinganiselo w@@ enyawo ezint@@ andathu ing@@ um@@ v@@ o jikelele wom@@ linganiselo w@@ entamb@@ o, intambo y@@ entsimb@@ i, ubun@@ zulu b@@ amanzi kunye nes@@ and@@ i."
"xa ints@@ on@@ el@@ e@@ o yokubophelela inqanawa iling@@ ana nem@@ ay@@ ile ezil@@ i

# Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We used Transformer architecture 
- We set our dropout to reasonably high: 0.3 (recommended in  [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021))

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (we've set it at 30 just so it runs in about an hour, for testing purposes)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus Crhf4)

In [27]:


# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)
name = '%s%s' % (source_language, target_language)

config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "models/{name}_transformer/12000.ckpt" # if given, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "noam"            # Try switching from plateau to Noam scheduling
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    patience: 8
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 100 # TODO: Decrease for when playing around and checking of working. Around 30 is sufficient to check if its working at all
    validation_freq: 4000 # Decrease this for testing
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: True
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 2048
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 2048
        dropout: 0.3
""".format(name=name, source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)


name: "enxh_transformer"

data:
    src: "en"
    trg: "xh"
    train: "data/enxh/train.bpe"
    dev:   "data/enxh/dev.bpe"
    test:  "data/enxh/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/enxh/vocab.txt"
    trg_vocab: "data/enxh/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "models/enxh_transformer/12000.ckpt" # if given, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "noam"            # Try switching from plateau to Noam scheduling
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    patience: 8
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    

# Train the Model

This single line of joeynmt runs the training using the config we made above

In [0]:
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2019-08-11 13:02:03,892 Hello! This is Joey-NMT.
2019-08-11 13:02:03,898 Total params: 46273024
2019-08-11 13:02:03,899 Trainable parameters: ['decoder.layer_norm.bias', 'decoder.layer_norm.weight', 'decoder.layers.0.dec_layer_norm.bias', 'decoder.layers.0.dec_layer_norm.weight', 'decoder.layers.0.feed_forward.layer_norm.bias', 'decoder.layers.0.feed_forward.layer_norm.weight', 'decoder.layers.0.feed_forward.pwff_layer.0.bias', 'decoder.layers.0.feed_forward.pwff_layer.0.weight', 'decoder.layers.0.feed_forward.pwff_layer.3.bias', 'decoder.layers.0.feed_forward.pwff_layer.3.weight', 'decoder.layers.0.src_trg_att.k_layer.bias', 'decoder.layers.0.src_trg_att.k_layer.weight', 'decoder.layers.0.src_trg_att.output_layer.bias', 'decoder.layers.0.src_trg_att.output_layer.weight', 'decoder.layers.0.src_trg_att.q_layer.bias', 'decoder.layers.0.src_trg_att.q_layer.weight', 'decoder.layers.0.src_trg_att.v_layer.bias', 'decoder.layers.0.src_trg_att.v_layer.weight', 'decoder.layers.0.trg_trg_att.k_l

In [14]:
! cat joeynmt/models/$src$tgt_transformer/validations.txt

00016000.hyps.dev   16000.ckpt	8000.hyps      tensorboard
00016000.hyps.test  16000.hyps	best.ckpt      train.log
12000.ckpt	    4000.hyps	config.yaml    trg_vocab.txt
12000.hyps	    8000.ckpt	src_vocab.txt  validations.txt
Steps: 4000	Loss: 97521.18750	PPL: 24.69145	bleu: 0.53349	LR: 0.00020000	*
Steps: 8000	Loss: 72234.91406	PPL: 10.75160	bleu: 1.73058	LR: 0.00020000	*
Steps: 12000	Loss: 56225.00781	PPL: 6.35127	bleu: 3.27729	LR: 0.00020000	*
Steps: 16000	Loss: 44070.66797	PPL: 4.25896	bleu: 6.30347	LR: 0.00020000	*


In [15]:
! cd joeynmt; python3 -m joeynmt test models/$src$tgt_transformer/config.yaml


2019-08-11 12:22:49,070 -  dev bleu:   7.18 [Beam search decoding with beam size = 5 and alpha = 1.0]
2019-08-11 12:25:02,652 - test bleu:   1.74 [Beam search decoding with beam size = 5 and alpha = 1.0]
