# Supervised Neural Machine Translation (Using JoeyNMT)

## Note before beginning:
### - This notebook was provided for the Machine Translation course of the Master in Language Analysis and Processing at UPV/EHU. In this notebook, we used two files (from Europarl: https://www.statmt.org/europarl/), one per language (EN-FR), where the lines in the files are corresponding translations.

### - We made minimal changes to the notebook in order to get SOME result for our own translation corpus.



## Pre-process the data

We already have a data set (from Europarl, as previously stated). The format in which we will process it here requires that 
1. there are two files, one for each language (EN-FR)
2. the files are sentence-aligned, which means that each line should correspond to the same line in the other file.
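
A quick way to check the second requirement is to compare line counts before doing anything else. The sketch below is only an illustration: the helper names `count_lines` and `check_sentence_aligned` are our own and not part of any library, and equal line counts are a necessary condition for alignment, not a proof of it.

```python
def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_sentence_aligned(src_path, tgt_path):
    """Raise ValueError if the two files differ in length, else return the line count."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            "Files are not sentence-aligned: {} has {} lines, {} has {} lines".format(
                src_path, n_src, tgt_path, n_tgt
            )
        )
    return n_src
```

This does the same job as the `wc -l` calls used later in the notebook, but it fails loudly instead of relying on a visual comparison.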


In [None]:
# WE DOWNGRADE THE PYTHON VERSION BC JOEYNMT IS NOT WORKING ANYMORE WITH THE COMMANDS FOR INSTALLING IT THAT WE USED IN CLASS

!sudo apt-get update -y
!sudo apt-get install python3.8
from IPython.display import clear_output
clear_output()
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1

!sudo update-alternatives --config python3

!python3 --version

!sudo apt install python3-pip

There are 2 choices for the alternative python3 (providing /usr/bin/python3).

  Selection    Path                 Priority   Status
------------------------------------------------------------
* 0            /usr/bin/python3.10   2         auto mode
  1            /usr/bin/python3.10   2         manual mode
  2            /usr/bin/python3.8    1         manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/bin/python3.8 to provide /usr/bin/python3 (python3) in manual mode
Python 3.8.10
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  python-pip-whl python3-setuptools python3-wheel
Suggested packages:
  python-setuptools-doc
The following NEW packages will be installed:
  python-pip-whl python3-pip python3-setuptools python3-wheel
0 upgraded, 4 newly installed, 0 to remove and 25 not upgraded.
Need to get 2,389 kB of archi

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "fr" 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
!mkdir -p "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/%s-%s-%s" % (source_language, target_language, tag)

In [None]:
!echo "$gdrive_path"

/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline


In [None]:
# Install opus-tools
! pip install opustools-pkg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opustools-pkg
  Downloading opustools_pkg-0.0.52-py3-none-any.whl (80 kB)
     |████████████████████████████████| 80 kB 3.2 MB/s 
Installing collected packages: opustools-pkg
Successfully installed opustools-pkg-0.0.52


In [None]:
# TODO: specify the file paths here 
corpora_prefix = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/europarl-v7.fr-en"
source_file = corpora_prefix+"."+source_language
target_file = corpora_prefix+"."+target_language

! wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/clean-corpus-n.perl
! perl clean-corpus-n.perl $corpora_prefix $source_language $target_language $corpora_prefix".clean" 1 75

source_file = corpora_prefix+".clean."+source_language
target_file = corpora_prefix+".clean."+target_language

# They should both have the same length.
! wc -l "$source_file"
! wc -l "$target_file"

--2023-05-06 12:25:56--  https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/clean-corpus-n.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4257 (4.2K) [text/plain]
Saving to: ‘clean-corpus-n.perl’


2023-05-06 12:25:57 (41.7 MB/s) - ‘clean-corpus-n.perl’ saved [4257/4257]

clean-corpus.perl: processing /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/europarl-v7.fr-en.en & .fr to /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/europarl-v7.fr-en.clean, cutoff 1-75, ratio 9
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000).........

In [None]:
# TODO: Pre-processing! (OPTIONAL)

# If your data contains weird symbols or the like, you might want to do some cleaning and normalization.
# We don't have the code in the notebook for that, but you can use sacremoses "normalize", for example, to normalize punctuation: https://github.com/alvations/sacremoses.

# We apply tokenization to separate punctuation marks from the actual words, split words at hyphens etc.
# If your data is already tokenized, that's great! Skip this cell.
# Otherwise we can use sacremoses to do the tokenization for us. 
# We need the data to be tokenized such that it matches the global test set.

! pip install sacremoses

tok_source_file = source_file+".tok"
tok_target_file = target_file+".tok"

# Tokenize the source
! sacremoses -l "$source_language" tokenize < "$source_file" > "$tok_source_file"
# Tokenize the target
! sacremoses -l "$target_language" tokenize < "$target_file" > "$tok_target_file"

# Let's take a look at what tokenization did to the text.
! tail "$source_file"*
! tail "$target_file"*
! wc -l "$target_file"* "$source_file"*

# Change the pointers to our files such that we continue to work with the tokenized data.
source_file = tok_source_file
target_file = tok_target_file


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
     |████████████████████████████████| 880 kB 4.4 MB/s 
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 5.9 MB/s 
Collecting joblib
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
     |████████████████████████████████| 297 kB 82.4 MB/s 
Collecting regex
  Downloading regex-2023.5.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
     |████████████████████████████████| 771 kB 60.7 MB/s 
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting tqdm
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 5.8 MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done

==> /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/europarl-v7.fr-en.clean.en <==
Having said that, Parliament has reached the end of the agenda.
The Minutes of the present sitting will be subject to Parliament' s approval at the beginning of the next part-session.
Mr Manders has the floor for a procedural motion.
Mr President, I would like to take this opportunity to wish you, the Bureau and all Members, a good transition into the new year.
Ladies and gentlemen, before you leave me alone, I would like on behalf of the Bureau, once again, to thank all the Members, all the services, officials, assistants and other co-workers and, if you will allow me - although all the co-workers work for us - perhaps a special mention should go to the language services which help us to understand each other here.
I would also like, although they are absent, to mention the Commission and the Council.
I am not going to re-open the 'Millennium or not the Millennium' debate, but I 

In [None]:
# SKIPPING THIS PART BC WE HAVE OUR OWN TEST SET
# Download FLORES test set for this language pair (https://github.com/facebookresearch/flores).
# TODO: update languages codes as used in FLORES testset (eng instead of en)
#os.environ["ltrg"] = "fra"
#os.environ["lsrc"] = "eng"
#os.environ["trg"] = target_language 
#os.environ["src"] = source_language 

#! rm -fr flores101_dataset*
#! wget https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz
#! tar -xvzf flores101_dataset.tar.gz

#! mv flores101_dataset/devtest/$lsrc.devtest test.$src
#! mv flores101_dataset/devtest/$ltrg.devtest test.$trg

# If this fails it means that there is NO test set for your language. 
# Anyway, we will split your corpora into train/dev/test partitions

In [None]:
# Using our own test set downloaded from the ECHR website: https://hudoc.echr.coe.int/eng#{%22documentcollectionid2%22:[%22GRANDCHAMBER%22,%22CHAMBER%22]}
# CLEANING TEST SET
# TODO: specify the file paths here 
test_prefix = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test"
source_test = test_prefix+"."+source_language
target_test = test_prefix+"."+target_language

! wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/clean-corpus-n.perl
! perl clean-corpus-n.perl $test_prefix $source_language $target_language $test_prefix".clean" 1 75

source_test = test_prefix+".clean."+source_language
target_test = test_prefix+".clean."+target_language

# They should both have the same length.
! wc -l "$source_test"
! wc -l "$target_test"

--2023-05-06 12:48:38--  https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/clean-corpus-n.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4257 (4.2K) [text/plain]
Saving to: ‘clean-corpus-n.perl.1’


2023-05-06 12:48:38 (57.9 MB/s) - ‘clean-corpus-n.perl.1’ saved [4257/4257]

clean-corpus.perl: processing /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test.en & .fr to /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test.clean, cutoff 1-75, ratio 9

Input sentences: 340  Output sentences:  330
330 /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test.clean.en
330 /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-tes

In [None]:
# TOKENIZING TEST SET
echr_tok_test_en = source_test+".tok"
echr_tok_test_fr = target_test+".tok"

# Tokenize the source
! sacremoses -l "$source_language" tokenize < "$source_test" > "$echr_tok_test_en"
# Tokenize the target
! sacremoses -l "$target_language" tokenize < "$target_test" > "$echr_tok_test_fr"

# Let's take a look at what tokenization did to the text.
! tail "$source_test"*
! tail "$target_test"*
! wc -l "$target_test"* "$source_test"*

# Change the pointers to our files such that we continue to work with the tokenized data.
source_test = echr_tok_test_en
target_test = echr_tok_test_fr

==> /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test.clean.en <==
I offer no evidence.
Verdicts must follow.”
The applicant was duly acquitted.
On his counsel's application for a defendant's costs order, the judge refused to make an order and stated:
“That order will be refused.
There is clear evidence on the court papers.
The Crown have taken the view that they are not going to compel this witness although there is compelling evidence in respect of those matters.
It is a discretion which I am afraid I am not going to exercise in your favour.”
The applicant's attempted appeal was dismissed on 14 November 2003 as “to be appealable as a sentence, the order must be contingent upon conviction.
As the defendant's costs order only arises when a prosecution is unsuccessful, it cannot be a sentence and cannot be appealed at the Court of Appeal Criminal Division”.

==> /content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/NMT-test.clean.en.tok <

In [None]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = echr_tok_test_en
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 330 global test sentences to filter from the training/dev data.


In [None]:
import pandas as pd

source = []
target = []
skip_lines = set()  # Line numbers skipped in the source portion, so the same lines are skipped in the target portion (a set keeps lookups fast).
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df.tail(3)

Loaded data and skipped 0/1977577 lines since contained in test set.


| | source_sentence | target_sentence |
|---|---|---|
| 1977575 | Adjournment of the session | Interruption de la session |
| 1977576 | I declare the session of the European Parliame... | Je déclare interrompue la session du Parlement... |
| 1977577 | ( The sitting was closed at 10.50 a.m. ) | ( La séance est levée à 10h50 ) |


In [None]:
# TEST TO DF

source = []
target = []
#skip_lines = []  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_test) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        #if line.strip() not in en_test_sents:
        source.append(line.strip())
        #else:
            #skip_lines.append(i)             
with open(target_test) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        #if j not in skip_lines:
        target.append(line.strip())
    
#print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df_test = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df_test.tail(3)

| | source_sentence | target_sentence |
|---|---|---|
| 327 | It is a discretion which I am afraid I am not ... | Il s&apos; agit d&apos; un pouvoir discrétionn... |
| 328 | The applicant &apos;s attempted appeal was dis... | Le requérant interjeta appel et il fut débouté... |
| 329 | As the defendant &apos;s costs order only aris... | Or la décision de rembourser les dépens du déf... |


## Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, these public corpora include some number of these that need to be cleaned.

In addition we will split our data into dev/test/train and export to the filesystem.
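
To make the two steps concrete, here is a minimal pure-Python sketch of the same idea on a list of `(source, target)` pairs. It is an illustration under our own assumptions, not the code used in this notebook: the helper names are hypothetical, and the shuffle uses `random.Random` rather than pandas, so the resulting order differs from what `DataFrame.sample` would give.

```python
import random


def drop_duplicates(pairs, key):
    """Keep the first pair for each distinct key, preserving order."""
    seen = set()
    kept = []
    for pair in pairs:
        k = key(pair)
        if k not in seen:
            seen.add(k)
            kept.append(pair)
    return kept


def clean_and_split(pairs, num_dev, seed):
    """Remove conflicting translations, shuffle, and hold out num_dev pairs as dev."""
    # A source with several targets (or a target with several sources) is a conflict:
    # keep only the first occurrence, first by source, then by target.
    pairs = drop_duplicates(pairs, key=lambda p: p[0])
    pairs = drop_duplicates(pairs, key=lambda p: p[1])
    if not 0 < num_dev < len(pairs):
        raise ValueError("num_dev must be between 1 and the corpus size minus 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pairs)
    train = pairs[:-num_dev]
    dev = pairs[-num_dev:]
    return train, dev
```

Shuffling before taking the last `num_dev` pairs matters because corpora such as Europarl are stored in chronological order, so an unshuffled tail would come from a single debate.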

In [None]:
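# IN THIS NOTEBOOK WE SKIP FUZZY (NEAR-DUPLICATE) FILTERING BC FUZZYWUZZY IS NOT WORKING WITH THE PYTHON VERSION THAT WE ARE USING
# Only exact duplicates are removed below.

# drop duplicate translations (drop_duplicates() returns a new DataFrame, so the result must be assigned)
df_pp = df.drop_duplicates()

# drop conflicting translations (the same source or the same target appearing more than once)
df_pp = df_pp.drop_duplicates(subset='source_sentence')
df_pp = df_pp.drop_duplicates(subset='target_sentence')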

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=seed).reset_index(drop=True)

In [None]:
# This section splits the parallel corpus into train/dev (and writes the test set), then saves each part as separate files
# We use 1000 dev sentences and our own test set.
import csv

# TODO: if your corpus is smaller than 1000, reduce this number. With a corpus that small you might not obtain good results with NMT though :/
# Do the split between dev/train and create parallel corpora
num_dev_patterns = 1000

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # making lowercasing optional
    df_pp["source_sentence"] = df_pp["source_sentence"].str.lower()
    df_pp["target_sentence"] = df_pp["target_sentence"].str.lower()

# TODO: set to True if no FLORES test set exists for your language pair (here we write our own test set, df_test)
createTestset = True
if createTestset:
  #test = df_pp.tail(num_dev_patterns)
  test = df_test
  with open("test."+source_language, "w") as src_file, open("test."+target_language, "w") as trg_file:
    for index, row in test.iterrows():
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")
  #dev = df_pp.tail(2*num_dev_patterns).head(num_dev_patterns)
  #stripped = df_pp.drop(df_pp.tail(2*num_dev_patterns).index)
  dev = df_pp.tail(num_dev_patterns)
  stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)
else:
  dev = df_pp.tail(num_dev_patterns)
  stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in stripped.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

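# NOTE: the to_csv calls below overwrite the train/dev files written above and apply CSV quoting,
# so any sentence containing a comma ends up wrapped in quotation marks (see the output below).
# Remove these four calls to keep the plain-text files.
stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  
stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=False)  

dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)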


# TODO: Doublecheck the format below. There should be no extra quotation marks or weird characters. It should also not be empty.
! head train.*
! head dev.*
! wc {train,dev,test}.*

==> train.en <==
We can therefore conduct this debate again with those who really know the ropes and who do not believe it is enough simply to impose standards .
"The European Parliament should work hard to make an ambitious , Europeanist response worthy of our citizens ."
"The compromise at first reading , though challenging democratically , is a sensible solution for an update such as this one , and reaching it was in itself an environmental accomplishment ."
"It is at this point that I will address Mr Balkenende and say that it could have been much better than it is if your government , Mr Balkenende , had acted to resolve an issue about which we have had very lively discussions and will continue to do so ."
"First , the scope and structure of the programme ."
The new legislation should affect all the communications infrastructures and related services by means of recommendations and codes of conduct and other vehicles .
Yet today this country is preparing to host the World Conferen



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [None]:
! pip install -e git+https://github.com/joeynmt/joeynmt.git@1.5#egg=joeynmt
! rm -fr joeynmt
! git clone https://github.com/joeynmt/joeynmt.git --branch 1.5 --single-branch
! pip install torch==1.10.1+cu102 torchtext==0.11.1 -f https://download.pytorch.org/whl/torch_stable.html
! pip install setuptools==59.5.0 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining joeynmt from git+https://github.com/joeynmt/joeynmt.git@1.5#egg=joeynmt
  Cloning https://github.com/joeynmt/joeynmt.git (to revision 1.5) to ./src/joeynmt
  Running command git clone -q https://github.com/joeynmt/joeynmt.git /content/src/joeynmt
  Running command git checkout -q 092c504cb3d7b25b91cc37af4fbfe55af4faf64f
Collecting future
  Downloading future-0.18.3.tar.gz (840 kB)
     |████████████████████████████████| 840 kB 4.4 MB/s 
Collecting matplotlib
  Downloading matplotlib-3.7.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (9.2 MB)
     |████████████████████████████████| 9.2 MB 64.3 MB/s 
Collecting numpy>=1.19.5
  Downloading numpy-1.24.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
     |████████████████████████████████| 17.3 MB 70.7 MB/s 
Collecting pillow
  Downloading Pillow-9.5.0-cp38-cp38-manylinux_2_1

Cloning into 'joeynmt'...
remote: Enumerating objects: 3292, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 3292 (delta 10), reused 9 (delta 9), pack-reused 3278
Receiving objects: 100% (3292/3292), 8.10 MiB | 21.21 MiB/s, done.
Resolving deltas: 100% (2279/2279), done.
Note: switching to '092c504cb3d7b25b91cc37af4fbfe55af4faf64f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/

## Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for open-vocabulary NMT is using BPE tokenization [(Sennrich, 2015)](https://arxiv.org/abs/1508.07909).

- It was also shown that optimizing the number of BPE codes significantly improves results for low-resource languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685).

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything; simply running the cell below is enough.
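
To give an intuition for what "learning BPE codes" means, here is a toy sketch of the merge-learning loop in the spirit of (Sennrich, 2015). It is not the `subword-nmt` implementation: the function names are our own, the word frequencies are made up, and it marks word ends with `</w>`, whereas the `subword-nmt` output in this notebook marks word-internal splits with `@@`.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds ensure we only match whole symbols, not parts of longer ones.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Return the list of merges, most frequent pair first at every step."""
    # Start from characters, with an explicit end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Made-up toy frequencies, only for illustration.
toy_word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_word_freqs, num_merges=5)
```

The `-s 4000` option in the cell below plays the role of `num_merges` here: more merges produce longer, more word-like units, fewer merges produce shorter units and a smaller vocabulary.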

In [None]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

# Learn BPEs on the training data.
os.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language) 
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

# Create directory, copy everything we care about to the correct location
! mkdir -p "$data_path"
! cp train.* "$data_path"
! cp test.* "$data_path"
! cp dev.* "$data_path"
! cp bpe.codes.4000 "$data_path"
! ls "$data_path"

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Test language Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt 

100% 4000/4000 [00:26<00:00, 149.49it/s]
bpe.codes.4000	dev.en	     test.bpe.fr  train.bpe.en	train.fr
dev.bpe.en	dev.fr	     test.en	  train.bpe.fr
dev.bpe.fr	test.bpe.en  test.fr	  train.en
bpe.codes.4000	dev.en	     test.bpe.fr  train.bpe.en	train.fr
dev.bpe.en	dev.fr	     test.en	  train.bpe.fr
dev.bpe.fr	test.bpe.en  test.fr	  train.en
BPE Test language Sentences
Il y a des éléments de preuve clai@@ rs dans les pi@@ è@@ ces du dossi@@ er .
Le minis@@ t@@ ère public est d&apos; avis qu&apos; il n&apos; ent@@ end pas contra@@ indre cette personne à té@@ mo@@ ign@@ er bien qu&apos; il existe des éléments de preuve convain@@ c@@ ants sur ces questions .
Il s&apos; agit d&apos; un pouvoir dis@@ cré@@ tionn@@ aire que , je le cra@@ ins , je n&apos; exer@@ cer@@ ai pas en votre faveur . »
Le requ@@ ér@@ ant inter@@ jet@@ a appe@@ l et il fu@@ t déb@@ ou@@ té le 1@@ 4 nov@@ embre 2003 avec l&apos; expli@@ cation suiv@@ ante : « P@@ our pouvoir être con@@ t@@ est@@ ée en appe@@ l en tant q
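
The BPE output above marks word-internal splits with `@@ ` and still carries Moses-style escapes such as `&apos;`. For reading such lines by eye, a small helper like the one below can undo both. This is a sketch under our own assumptions: the function names are hypothetical, the regular expression follows the usual `@@ ` convention, and we assume `html.unescape` covers the entities that appear here. It does not detokenize, so spaces around punctuation remain.

```python
import html
import re


def undo_bpe(line):
    """Join subword units by removing the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


def unescape_entities(line):
    """Turn escapes such as &apos; and &quot; back into plain characters."""
    return html.unescape(line)


def readable(line):
    """Undo BPE splitting and entity escaping for a single line of text."""
    return unescape_entities(undo_bpe(line.strip()))
```

For example, `readable("les pi@@ è@@ ces du dossi@@ er .")` joins the pieces back into whole words while leaving the tokenized spacing untouched.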

In [None]:
# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp echr_test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

cp: cannot stat 'echr_test.*': No such file or directory
bpe.codes.4000	dev.en	     test.bpe.fr  train.bpe.en	train.fr
dev.bpe.en	dev.fr	     test.en	  train.bpe.fr
dev.bpe.fr	test.bpe.en  test.fr	  train.en


## Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We used the Transformer architecture.
- We set dropout to a reasonably high value: 0.3 (recommended in [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021)).

Things worth playing with:
- The batch size (also recommended to change for low-resource languages)
- The number of epochs (the config below sets it to 1; its comment suggests around 30 as enough to check whether training works at all)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)
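
To see why `alpha` is worth tuning, the sketch below shows how a length penalty can change which beam hypothesis wins. We assume a GNMT-style penalty, `((5 + length) / 6) ** alpha`, which we believe is what JoeyNMT's beam search applies; treat the exact formula as an assumption and check the JoeyNMT source before relying on it. The hypotheses and scores are made up.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed formula); alpha = 0 gives 1.0, i.e. no effect."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob, length, alpha):
    """Length-normalized score: total log-probability divided by the penalty."""
    return log_prob / length_penalty(length, alpha)


def pick_best(hypotheses, alpha):
    """Return the text of the highest-scoring (text, log_prob, length) hypothesis."""
    return max(hypotheses, key=lambda h: rescore(h[1], h[2], alpha))[0]


# Made-up hypotheses: (text, total log-probability, length in tokens).
toy_hyps = [("short", -4.0, 4), ("long", -5.0, 10)]
best_without_penalty = pick_best(toy_hyps, alpha=0.0)
best_with_penalty = pick_best(toy_hyps, alpha=1.0)
```

Without normalization, the shorter hypothesis wins simply because it accumulates fewer negative log-probabilities. With `alpha = 1.0`, as in the config below, the longer one is preferred in this toy case.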

In [None]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0
    sacrebleu:                      # sacrebleu options
        remove_whitespace: True     # `remove_whitespace` option in sacrebleu.corpus_chrf() function (default: True)
        tokenize: "none"            # `tokenize` option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc) 

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "noam"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 1                     # TODO: Decrease when playing around and checking that it works. Around 30 is sufficient to check if it's working at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
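
With `scheduling: "noam"`, the learning rate follows a warmup-then-decay curve. The sketch below uses the standard Noam formula with the values from the config above (`hidden_size: 256`, `learning_rate_factor: 0.5`, `learning_rate_warmup: 1000`). That JoeyNMT computes it exactly this way is our assumption, but the results agree with the `LR` column logged in `validations.txt` further down (for example 0.00098821 at step 1000 and 0.00031250 at step 10000). The function name `noam_lr` is our own.

```python
def noam_lr(step, hidden_size=256, factor=0.5, warmup=1000):
    """Noam schedule: linear warmup for `warmup` steps, then decay with 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Learning rate at a few of the validation steps.
for step in (1000, 2000, 10000):
    print("step {:>5}: lr = {:.8f}".format(step, noam_lr(step)))
```

Since the logged values follow this curve rather than staying at `learning_rate: 0.0003`, that setting appears to have no effect when the Noam scheduler is active. To change the peak learning rate, adjust `learning_rate_factor` or `learning_rate_warmup` instead.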

## Train the Model

This single joeynmt command runs the training using the config we created above.

In [None]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2023-05-06 13:16:39,436 - INFO - root - Hello! This is Joey-NMT (version 1.5).
2023-05-06 13:16:39,493 - INFO - joeynmt.data - Loading training data...
2023-05-06 13:17:40,182 - INFO - joeynmt.data - Building vocabulary...
2023-05-06 13:17:40,420 - INFO - joeynmt.data - Loading dev data...
2023-05-06 13:17:40,545 - INFO - joeynmt.data - Loading test data...
2023-05-06 13:17:40,550 - INFO - joeynmt.data - Data loaded.
2023-05-06 13:17:40,550 - INFO - joeynmt.model - Building an encoder-decoder model...
2023-05-06 13:17:41,015 - INFO - joeynmt.model - Enc-dec model built.
2023-05-06 13:17:41,026 - INFO - joeynmt.training - Total params: 12211712
2023-05-06 13:17:48,160 - INFO - joeynmt.helpers -                           cfg.name : enfr_transformer
2023-05-06 13:17:48,160 - INFO - joeynmt.helpers -                       cfg.data.src : en
2023-05-06 13:17:48,160 - INFO - joeynmt.helpers -                       cfg.data.trg : fr
2023-05-06 13:17:48,160 - INFO - joeynmt.helpers -           

In [None]:
# Copy the created models from the notebook storage to google drive for persistent storage 
! mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
! cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [None]:
# Output our validation scores (loss, perplexity, BLEU, learning rate)
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 1000	Loss: 169931.17188	PPL: 53.49845	bleu: 1.64857	LR: 0.00098821	*
Steps: 2000	Loss: 139320.68750	PPL: 26.12204	bleu: 4.10519	LR: 0.00069877	*
Steps: 3000	Loss: 125223.37500	PPL: 18.77698	bleu: 6.70591	LR: 0.00057054	*
Steps: 4000	Loss: 117516.96875	PPL: 15.67636	bleu: 7.59546	LR: 0.00049411	*
Steps: 5000	Loss: 111279.71094	PPL: 13.54588	bleu: 9.46025	LR: 0.00044194	*
Steps: 6000	Loss: 106292.13281	PPL: 12.05256	bleu: 10.87148	LR: 0.00040344	*
Steps: 7000	Loss: 102204.84375	PPL: 10.95237	bleu: 12.51629	LR: 0.00037351	*
Steps: 8000	Loss: 98163.17188	PPL: 9.96325	bleu: 13.75842	LR: 0.00034939	*
Steps: 9000	Loss: 94783.03125	PPL: 9.20496	bleu: 14.87648	LR: 0.00032940	*
Steps: 10000	Loss: 92127.85156	PPL: 8.65001	bleu: 15.92254	LR: 0.00031250	*
Steps: 11000	Loss: 89756.49219	PPL: 8.18273	bleu: 16.88392	LR: 0.00029796	*
Steps: 12000	Loss: 87553.21094	PPL: 7.77121	bleu: 17.76840	LR: 0.00028527	*
Steps: 13000	Loss: 85942.00781	PPL: 7.48345	bleu: 18.57028	LR: 0.00027408	*
Steps: 14000
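
The `Loss` and `PPL` columns above are two views of the same quantity. Assuming JoeyNMT reports perplexity as `exp(total loss / number of target tokens)`, we can back-solve the number of dev-set tokens from any row and check that the rows agree. The helper names below are our own.

```python
import math


def perplexity(total_loss, num_tokens):
    """Perplexity as the exponential of the average per-token loss."""
    return math.exp(total_loss / num_tokens)


def implied_num_tokens(total_loss, ppl):
    """Invert the formula above: tokens = total loss / ln(PPL)."""
    return total_loss / math.log(ppl)


# (step, Loss, PPL) copied from the first two rows of validations.txt above.
rows = [(1000, 169931.17188, 53.49845), (2000, 139320.68750, 26.12204)]
implied = [implied_num_tokens(loss, ppl) for _, loss, ppl in rows]
for (step, _, _), n in zip(rows, implied):
    print("step {}: implied dev tokens = {:.1f}".format(step, n))
```

Both rows imply essentially the same token count, which supports the assumed formula. It also explains why the loss values look so large: they are summed over the whole dev set, not averaged.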

In [None]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2023-05-06 15:55:53,061 - INFO - root - Hello! This is Joey-NMT (version 1.5).
2023-05-06 15:55:53,062 - INFO - joeynmt.data - Building vocabulary...
2023-05-06 15:55:53,305 - INFO - joeynmt.data - Loading dev data...
2023-05-06 15:55:53,326 - INFO - joeynmt.data - Loading test data...
2023-05-06 15:55:53,334 - INFO - joeynmt.data - Data loaded.
2023-05-06 15:55:53,366 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 3600
2023-05-06 15:55:53,367 - INFO - joeynmt.prediction - Loading model from models/enfr_transformer/42000.ckpt
2023-05-06 15:55:57,589 - INFO - joeynmt.model - Building an encoder-decoder model...
2023-05-06 15:55:57,824 - INFO - joeynmt.model - Enc-dec model built.
2023-05-06 15:55:57,893 - INFO - joeynmt.prediction - Decoding on dev set (data/enfr/dev.bpe.fr)...
2023-05-06 15:56:58,479 - INFO - joeynmt.prediction -  dev bleu[none]:  28.05 [Beam search decoding with beam size = 5 and alpha = {beam_alpha}]
2023-05-06 15:56:58,479 - INF

In [None]:
# Copy translations to Drive
!cp joeynmt/models/enfr_transformer/00042000.hyps.test "$gdrive_path"
!cp joeynmt/models/enfr_transformer/00042000.hyps.dev "$gdrive_path"