# Data Gathering and Processing

To build a Machine Translation (MT) system, you need bilingual data, i.e. source sentences and their translations. You can use public bilingual corpora/datasets or your own translation memories (TMs). However, Neural Machine Translation (NMT) requires a lot of data to train a good model, which is why most companies start by training a strong baseline model on public bilingual datasets and then fine-tune this baseline model on their TMs. In some cases, you can also fine-tune a pre-trained model directly.

The majority of public bilingual datasets are collected on OPUS: https://opus.nlpl.eu/

Most of the datasets can be used for both commercial and non-commercial uses; however, some of them have more restricted licences. So you have to double-check the licence of a dataset before using it.

On OPUS, go to “Search & download resources” and choose two languages from the drop-down lists. OPUS will then list the datasets available for this language pair. Use non-variant language codes like “en” for English and “fr” for French to get all the variants of each language. For more details about a specific dataset, click its name.

In Machine Translation, we use the “Moses” format. Try downloading “tico-19 v2020-10-28” by clicking “moses”. This downloads a *.zip file; when you extract it, the two files you care about are those whose names end with the language codes. For example, for English to French, you will have “tico-19.en-fr.en” and “tico-19.en-fr.fr”. You can open these files with any text editor. Each file has one sentence/segment per line, and its matching translation is on the same line in the other file. This is what the “Moses” file format means.
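
To make the line-by-line alignment concrete, here is a minimal Python sketch that reads two Moses-format files into sentence pairs. The function name `read_parallel` and the demo file names and sentences are made up for illustration; they are not part of OPUS or of any toolkit used in this notebook.

```python
import tempfile
from pathlib import Path


def read_parallel(source_path, target_path):
    """Return a list of (source, target) pairs from two line-aligned files."""
    with open(source_path, encoding="utf-8") as src:
        source = [line.rstrip("\n") for line in src]
    with open(target_path, encoding="utf-8") as tgt:
        target = [line.rstrip("\n") for line in tgt]
    # Line N of one file must be the translation of line N of the other.
    if len(source) != len(target):
        raise ValueError(
            f"Line counts differ: {len(source)} source vs {len(target)} target"
        )
    return list(zip(source, target))


# Tiny demonstration with made-up sentences written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    en_file = Path(tmp) / "demo.en-fr.en"
    fr_file = Path(tmp) / "demo.en-fr.fr"
    en_file.write_text("Hello.\nThank you.\n", encoding="utf-8")
    fr_file.write_text("Bonjour.\nMerci.\n", encoding="utf-8")
    pairs = read_parallel(en_file, fr_file)

for en, fr in pairs:
    print(f"{en}\t{fr}")
```

If the two files do not have the same number of lines, the alignment is broken somewhere, so the sketch raises an error instead of silently pairing the wrong sentences.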

Note that not all datasets are of the same quality. Some datasets have lower quality, especially big corpora crawled from the web. Check the provided “sample” before using a dataset. Even high-quality datasets, like those from the UN and EU, still require filtering.



In [1]:
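# Note: `!cd` runs in a temporary subshell, so it does not change the notebook's working directory; use `%cd` for that
!cd ..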
%ls

[0m[01;34mdrive[0m/  [01;34mnmt[0m/  [01;34msample_data[0m/


In [23]:
# Create a directory and clone the GitHub MT-Preparation repository
!mkdir -p /content/drive/MyDrive/Colab\ Notebook/Fictional\ Neural\ Translation/nmt
%cd /content/drive/MyDrive/Colab\ Notebook/Fictional\ Neural\ Translation/nmt
!git clone https://github.com/ymoslem/MT-Preparation.git

/content/drive/MyDrive/Colab Notebook/Fictional Neural Translation/nmt
Cloning into 'MT-Preparation'...
remote: Enumerating objects: 268, done.
remote: Counting objects: 100% (268/268), done.
remote: Compressing objects: 100% (159/159), done.
remote: Total 268 (delta 133), reused 189 (delta 97), pack-reused 0
Receiving objects: 100% (268/268), 69.06 KiB | 1.11 MiB/s, done.
Resolving deltas: 100% (133/133), done.


In [2]:
# Install the requirements
!pip3 install -r /content/drive/MyDrive/Colab\ Notebook/Fictional\ Neural\ Translation/nmt/MT-Preparation/requirements.txt



In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# Datasets

Example datasets:

* EN-AR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/ar-en.txt.zip
* EN-ES: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-es.txt.zip
* EN-FR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
* EN-RU: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-ru.txt.zip
* EN-ZH: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip
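
If you prefer to fetch one of these archives from Python, the sketch below uses only the standard library. The helper names `un_corpus_url` and `download_and_extract` are hypothetical, and the rule that the two language codes appear in alphabetical order is an assumption inferred from the five links listed above, so check the URL on OPUS if a download fails.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base address taken from the example links above.
BASE_URL = "https://object.pouta.csc.fi/OPUS-UN/v20090831/moses"


def un_corpus_url(lang_a, lang_b):
    """Build the archive URL for a language pair (codes sorted alphabetically)."""
    first, second = sorted([lang_a, lang_b])
    return f"{BASE_URL}/{first}-{second}.txt.zip"


def download_and_extract(url, dest_dir):
    """Download a zip archive into dest_dir, extract it, and list the files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(path.name for path in dest.iterdir())


print(un_corpus_url("en", "ar"))
# To actually download (requires network access), you could run:
# download_and_extract(un_corpus_url("en", "fr"), "en-fr")
```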

# Data Filtering

Filtering out low-quality segments can help improve the translation quality of the resulting MT model. Such segments include misaligned pairs, empty segments, and duplicates, among other issues.
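
To give an idea of what such rules can look like, here is a minimal sketch in plain Python. It is not the code of the `filter.py` script used in this notebook; the function name `filter_pairs` and the thresholds `max_chars` and `max_ratio` are assumptions chosen for illustration only.

```python
def filter_pairs(pairs, max_chars=200, max_ratio=3.0):
    """Return the pairs that pass a few simple quality checks, in input order."""
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # empty segment on either side
        if source == target:
            continue  # source copied to target (probably untranslated)
        if len(source) > max_chars or len(target) > max_chars:
            continue  # too long
        shorter, longer = sorted((len(source), len(target)))
        if longer > max_ratio * shorter:
            continue  # very different lengths suggest a misalignment
        if (source, target) in seen:
            continue  # exact duplicate
        seen.add((source, target))
        kept.append((source, target))
    return kept


# Made-up examples: only the first pair should survive.
demo = [
    ("Good morning.", "Bonjour."),
    ("Good morning.", "Bonjour."),
    ("", "Merci."),
    ("OPUS", "OPUS"),
    ("Yes.", "Oui, bien sûr, je serai là demain matin à huit heures."),
]
clean = filter_pairs(demo)
print(clean)
```

A length-ratio check like this is only a rough signal of misalignment, and a suitable threshold depends on the language pair.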

In [21]:
# Filter the dataset
# Arguments: source file, target file, source language, target language
!python3 /content/drive/MyDrive/Colab\ Notebooks/Fictional\ Neural\ Translation/nmt/MT-Preparation/filtering/filter.py /content/drive/MyDrive/Colab\ Notebooks/Fictional\ Neural\ Translation/nmt/en-tkn/UN.en-tkn.tkn /content/drive/MyDrive/Colab\ Notebooks/Fictional\ Neural\ Translation/nmt/en-tkn/UN.en-tkn.en tkn en

Dataframe shape (rows, columns): (1038, 2)
--- Rows with Empty Cells Deleted	--> Rows: 1034
--- Duplicates Deleted			--> Rows: 1025
--- Source-Copied Rows Deleted		--> Rows: 1025
--- Too Long Source/Target Deleted	--> Rows: 621
--- HTML Removed			--> Rows: 621
--- Rows will remain in true-cased	--> Rows: 621
--- Rows with Empty Cells Deleted	--> Rows: 621
--- Rows Shuffled			--> Rows: 621
--- Source Saved: /content/drive/MyDrive/Colab Notebooks/Fictional Neural Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn
--- Target Saved: /content/drive/MyDrive/Colab Notebooks/Fictional Neural Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en


# Tokenization / Sub-wording

To build a vocabulary for any NLP model, you have to tokenize (i.e. split) sentences into smaller units. Word-based tokenization used to be the standard approach; in this case, each word is a token. However, an MT model can only learn a limited number of vocabulary tokens due to limited hardware resources. To work around this, sub-words are used instead of whole words. At translation time, when the model sees a new word that is made of sub-words it has in its vocabulary, it can still try to translate it instead of marking the word as “unknown” or “unk”.
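
The toy sketch below illustrates why sub-words help with unseen words. It uses greedy longest-match segmentation over a tiny hand-written vocabulary, which is a simplification for illustration and not the BPE or unigram algorithm; the function name `segment` and the vocabulary are made up.

```python
def segment(word, vocab):
    """Split a word into the longest known sub-words, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known sub-word starts here: mark one character as unknown.
            pieces.append("<unk>")
            start += 1
    return pieces


# Neither "translators" nor "untranslatable" is in the vocabulary as a whole word.
vocab = {"translat", "ion", "ing", "or", "s", "un", "able"}
print(segment("translators", vocab))
print(segment("untranslatable", vocab))
```

Both words are covered by pieces the vocabulary already contains, so neither has to be replaced by a single unknown token.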

There are a few approaches to sub-wording, such as BPE and the unigram model. A popular toolkit that implements the most common approaches is [SentencePiece](https://github.com/google/sentencepiece). Note that you first have to train a sub-wording model and then apply it to your data. After translation, you will have to “desubword” or “decode” your text back using the same SentencePiece model.
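
To show what “desubwording” does to the text, here is a minimal sketch for lines stored as space-separated pieces, the format of the subworded files in this notebook. It only illustrates the reverse mapping of the “▁” word-boundary marker and does not load a SentencePiece model; the function name `desubword` is hypothetical, and in practice you would decode with the trained model.

```python
WORD_BOUNDARY = "▁"  # marker that SentencePiece puts where a new word starts


def desubword(subworded_line):
    """Join space-separated SentencePiece-style pieces back into plain text."""
    pieces = subworded_line.split()
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Drop the single leading space produced by the first word-boundary marker.
    return text[1:] if text.startswith(" ") else text


line = "▁ Á ▁ vant a ▁ tien y asse , ▁ve ▁in y e ▁ vant a ▁i ▁H rist o ▁ ties s e ."
print(desubword(line))
```

Because the pieces carry the word-boundary marker, the original spacing can be restored without any language-specific rules.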



In [22]:
!ls /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/MT-Preparation/subwording/

1-train_bpe.py	1-train_unigram.py  2-subword.py  3-desubword.py


In [29]:
# Train a SentencePiece model for subword tokenization
!python3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/MT-Preparation/subwording/1-train_unigram.py /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  require

In [31]:
!ls /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt

compute-bleu.py  en-fr	 models		 README  source.model  target.model  train.log
config.yaml	 en-tkn  MT-Preparation  run	 source.vocab  target.vocab


In [32]:
# Subword the dataset
!python3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/MT-Preparation/subwording/2-subword.py /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/source.model /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/target.model /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en

Source Model: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/source.model
Target Model: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/target.model
Source Dataset: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn
Target Dataset: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en
Done subwording the source file! Output: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword
Done subwording the target file! Output: /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword


In [33]:
# First 3 lines before subwording
!head -n 3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn && echo "-----" && head -n 3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en

25 Ar yé! şanyengolmo oronte, tyastien se, ar eques: “*Peantar, mana caruvan náven aryon oira coiviéno?” 26 Quentes senna: “Mana técina i Şanyesse? Manen hental?” 27 Hanquentasse quentes: “Alye mele i Héru Ainolya quanda endalyanen ar quanda fealyanen ar quanda poldorelyanen ar quanda sámalyanen, ar armarolya ve imle.” 28 Yésus quente senna: “Hanquentes mai; cara sie ar samuval coivie.”
	5 Ar íre ennoli caramper pa i corda, i samis írime ondor ar netyana annainen ná, quentes: 6 “Nati sine yar yétalde – aureli tuluvar yassen ondo ua lemyuva ondosse ya ua nauva hátina undu.”
Á vanta tienyasse, ve inye vanta i Hristo tiesse.
-----
21 At that time he was most happy by the Holy Spirit and said: “I praise you, Father, Lord over heaven and earth, for you have hidden these things from wise ones and from intelligent ones, and you have revealed them to babes. Yes, Father, for doing so was good in your eyes. 22 All things have been given over to me by my Father, and who the Son is nobody knows ex

In [34]:
# First 3 lines after subwording
!head -n 3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword && echo "---" && head -n 3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword

▁ 2 5 ▁Ar ▁y é ! ▁ ş an y eng ol mo ▁o ront e , ▁t ya s tien ▁se , ▁a r ▁e ques : ▁ “* P e ant ar , ▁ man a ▁car u van ▁n á ve n ▁a r yon ▁ oir a ▁co iv i éno ?” ▁ 2 6 ▁ Q ue ntes ▁s en na : ▁ “ M ana ▁ té ci na ▁i ▁ Ş an y es s e ? ▁Man en ▁h ent al ?” ▁ 2 7 ▁H anque nt asse ▁que ntes : ▁ “ Al y e ▁me le ▁i ▁Hé r u ▁A in ol ya ▁qu anda ▁en d al yan en ▁a r ▁qu anda ▁f e al yan en ▁a r ▁qu anda ▁pol do re l yan en ▁a r ▁qu anda ▁s á ma l yan en , ▁a r ▁ arm a rol ya ▁ve ▁im le . ” ▁ 2 8 ▁ Y és us ▁que nt e ▁s en na : ▁ “ H anque ntes ▁mai ; ▁car a ▁si e ▁a r ▁sa m u val ▁co iv ie . ”
▁ 5 ▁Ar ▁ í re ▁en n oli ▁car am per ▁pa ▁i ▁ cord a , ▁i ▁sa mi s ▁ í rime ▁ ond or ▁a r ▁net yan a ▁an na in en ▁n á , ▁que ntes : ▁ 6 ▁ “ N ati ▁si ne ▁y ar ▁y étal de ▁ – ▁au re li ▁t ulu va r ▁y asse n ▁on do ▁ u a ▁le m y u va ▁on do sse ▁y a ▁ u a ▁n au va ▁h á tin a ▁un du . ”
▁ Á ▁ vant a ▁ tien y asse , ▁ve ▁in y e ▁ vant a ▁i ▁H rist o ▁ ties s e .
---
▁ 2 1 ▁At ▁that ▁time ▁he ▁was ▁most ▁h app

# Data Splitting

We usually split our dataset into 3 portions:

1. training dataset - used for training the model;
2. development dataset - used to run regular validations during the training to help improve the model parameters; and
3. testing dataset - a holdout dataset used after the model finishes training to finally evaluate the model on unseen data.
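
The three portions listed above can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the code of the `train_dev_test_split.py` script used in this notebook; the function name and the optional `seed` parameter are made up.

```python
import random


def train_dev_test_split(pairs, dev_size, test_size, seed=None):
    """Split sentence pairs into (train, dev, test) lists without overlap."""
    pairs = list(pairs)
    if dev_size + test_size >= len(pairs):
        raise ValueError("dev_size + test_size must be smaller than the dataset")
    if seed is not None:
        # Shuffle reproducibly so dev and test are not just the first lines.
        random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


# Made-up data: 10 pairs, 2 held out for development and 2 for testing.
data = [(f"source {i}", f"target {i}") for i in range(10)]
train, dev, test = train_dev_test_split(data, dev_size=2, test_size=2, seed=0)
print(len(train), len(dev), len(test))
```

Source and target sentences are kept together as pairs so that the split can never separate a sentence from its translation. The size check guards against asking for more held-out segments than the dataset contains.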

In [40]:
# Split the dataset into training set, development set, and test set
# Development and test sets should be between 1000 and 5000 segments (here we pass 600 for each)
!python3 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/MT-Preparation/train_dev_split/train_dev_test_split.py 600 600 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword

Dataframe shape: (621, 2)
--- Empty Cells Deleted --> Rows: 621
--- Wrote Files
Done!
Output files
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.train
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.train
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.dev
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.dev
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.test
/content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.test


In [42]:
# Line count for the subworded train, dev, and test datasets
!wc -l /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/*.subword.*

    600 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.dev
    600 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.test
     17 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.train
    600 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.dev
    600 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.test
     17 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.train
   2434 total


In [45]:
# Check the first and last line from each dataset

# -------------------------------------------
# Change this cell to print your name
!echo -e "My name is: FirstName SecondName \n"
# -------------------------------------------

!echo "---First line---"
!head -n 1 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/*.{train,dev,test}

!echo -e "\n---Last line---"
!tail -n 1 /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/*.{train,dev,test}

My name is: FirstName SecondName 

---First line---
==> /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.train <==
▁ 2 5 ▁Wo e ▁to ▁ y ou ▁that ▁are ▁full ▁now , ▁for ▁ y ou ▁will ▁be ▁hung ry !

==> /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.tkn-filtered.tkn.subword.train <==
▁H or ro ▁l en ▁i ▁la la r ▁s í , ▁an ▁sa m u val de ▁n y é re ▁a r ▁n í ri !

==> /content/drive/MyDrive/Colab_Notebooks/Fictional_Neural_Translation/nmt/en-tkn/UN.en-tkn.en-filtered.en.subword.dev <==
▁ 3 2 ▁The ▁P har ise es ▁hear d ▁the ▁ c row d ▁mu rm uring ▁the se ▁things ▁ about ▁him , ▁and ▁the ▁chief ▁pr ies ts ▁and ▁the ▁P har ise es ▁sent ▁some ▁officers ▁to ▁seize ▁him . ▁ 3 3 ▁T herefore ▁Je sus ▁ s aid : ▁ “ I ▁will ▁ still ▁be ▁with ▁ y ou ▁a ▁short ▁time , ▁before ▁I ▁shall ▁go ▁a way ▁to ▁ [ the ▁one ] ▁who ▁sent ▁me . ▁ 3 4 ▁You ▁will ▁seek ▁me , ▁ but ▁ y ou ▁will ▁not ▁find ▁me , ▁for 