<a href="https://colab.research.google.com/github/n00blet/OpenNMT-Machine-Translation/blob/master/notebook/NeuralMachineTranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Neural Machine Translation using Seq2Seq Encoder-Decoder**


<br>



**Description**

Machine translation has been an open problem for many years. Tools such as Google Translate and DeepL keep improving, but they still fall short of human translation. Part of the difficulty is that languages are vast: the same statement can be expressed or written in more than one way. In this tutorial we will build a simple machine translation system using [OpenNMT](http://opennmt.net/).


<br>



**Prerequisites**

This article assumes that you have some basic knowledge of Python programming and neural machine translation.

<br>

**Requirements**

*  Python Virtual Environment (Recommended)

*  [TensorFlow](https://www.tensorflow.org/install) (GPU version)

*   [CUDA](https://developer.nvidia.com/cuda-downloads) compatible Nvidia Graphics Card

* [OpenNMT](https://github.com/OpenNMT/OpenNMT-tf)


*   [SentencePiece](https://github.com/google/sentencepiece#c-from-source) (installation from source recommended)


<br>




**Overview**


*   Preprocessing the data
*   Training with different hyperparameters
*   Inference or Translation

<br>


**Diving into OpenNMT**

For the tasks below, we will download a dataset from one of the machine translation research websites: the [WMT Translation Task](http://www.statmt.org/wmt16/translation-task.html) or the [OPUS Parallel Corpus](http://opus.nlpl.eu/).



<br>


**Step 1: Preprocessing**

Assuming that we have our environment ready and configured for training using GPU, the first step in this task is to preprocess the raw data.

Here we build the **source** and **target** vocabularies using [SentencePiece](https://github.com/google/sentencepiece), an unsupervised text tokenizer.

<br>

Before moving on to the next task, we split the dataset (news-commentary) into **train**, **validation** and **test** sets, for example with scikit-learn. Alternatively, we can use the validation and test sets from the data folder.

Read more about splitting data here: [Data Splitting](https://cs230-stanford.github.io/train-dev-test-split.html).
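
For a parallel corpus, the important detail is that the source and target files stay aligned line by line after the split. The sketch below uses only the Python standard library instead of scikit-learn. The function names `split_parallel` and `write_split`, the split sizes and the seed are illustrative assumptions, not part of any library.

```python
import random


def split_parallel(src_lines, tgt_lines, n_valid, n_test, seed=13):
    """Shuffle sentence pairs together and cut them into train/valid/test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_valid + n_test >= len(src_lines):
        raise ValueError("held-out sets would leave no training data")
    # Pair the lines first so that shuffling never separates a sentence
    # from its translation.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    return {
        "test": pairs[:n_test],
        "valid": pairs[n_test:n_test + n_valid],
        "train": pairs[n_test + n_valid:],
    }


def write_split(pairs, src_path, tgt_path):
    """Write one split back out as two line-aligned files."""
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for s, t in pairs:
            src.write(s.rstrip("\n") + "\n")
            tgt.write(t.rstrip("\n") + "\n")
```

A typical use would be to read news-commentary-v11.de-en.en and news-commentary-v11.de-en.de into two lists, call `split_parallel` on them, and then call `write_split` once per split.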

Now, let's train a SentencePiece model for each language in our dataset. We will use these models to generate the vocabularies and to tokenize all the files (train, validation and test).




In [0]:
spm_train --input=news-commentary-v11.de-en.en --model_prefix=english --vocab_size=32000 --character_coverage=1.0 --model_type=bpe

In [0]:
spm_train --input=news-commentary-v11.de-en.de --model_prefix=german --vocab_size=32000 --character_coverage=1.0 --model_type=bpe

<br>
After successfully running those two commands, the following four files should have been generated:

*   english.model
*   english.vocab
*   german.model
*   german.vocab
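
If you prefer to stay inside Python, the same training can probably be driven through the `sentencepiece` package. The sketch below builds the same flag string as the commands above and passes it to `SentencePieceTrainer.Train`. The helper names `spm_train_args` and `train_sentencepiece` are hypothetical, and the example assumes the `sentencepiece` Python package is installed.

```python
def spm_train_args(input_path, model_prefix, vocab_size=32000,
                   character_coverage=1.0, model_type="bpe"):
    """Build the flag string that spm_train receives on the command line."""
    return (
        f"--input={input_path} --model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} --character_coverage={character_coverage} "
        f"--model_type={model_type}"
    )


def train_sentencepiece(input_path, model_prefix, **options):
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported lazily so that spm_train_args can be used without the package.
    import sentencepiece as spm
    spm.SentencePieceTrainer.Train(
        spm_train_args(input_path, model_prefix, **options)
    )
```

For example, `train_sentencepiece("news-commentary-v11.de-en.en", "english")` should correspond to the first command above.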



<br>

# Task 1

The first task is to tokenize each sentence of the English and German datasets using the models generated by SentencePiece and to save the result to new files.

<br>
For example, *train_en_tok.txt* and *train_de_tok.txt*.

<br>

Hint: You can also use **spm_encode** to finish the task.

In [0]:
import sentencepiece as spm
def tokenize():
  #Complete the function below
  pass
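
If you get stuck, here is one possible approach, offered as a sketch rather than the reference solution. It assumes a recent `sentencepiece` release in which `SentencePieceProcessor` accepts `model_file=` and `encode` accepts `out_type=str`. The helper names `load_encoder` and `tokenize_file` are made up for this example.

```python
def load_encoder(model_path):
    """Return a function that maps a sentence to a list of subword pieces."""
    # Imported lazily so that tokenize_file can be tried with any encoder.
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor(model_file=model_path)
    return lambda sentence: sp.encode(sentence, out_type=str)


def tokenize_file(in_path, out_path, encode):
    """Tokenize in_path line by line; write space-separated pieces to out_path."""
    count = 0
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # One output line per input line keeps the corpus aligned.
            fout.write(" ".join(encode(line.strip())) + "\n")
            count += 1
    return count
```

With the models trained earlier, a call such as `tokenize_file("train.en", "train_en_tok.txt", load_encoder("english.model"))` would produce the tokenized English training file. The input name train.en is only a placeholder for whichever file holds your raw training sentences.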

# Task 2

If you look at the validation and test files, the main content is enclosed in SGML tags. The second task is to strip these tags so that only the plain sentences remain.

<br>

If you don't want to write the code yourself, you can use the existing Perl script [here](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/input-from-sgm.perl).



In [0]:
def remove_tags():
  pass
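
As a starting point, the sketch below pulls the sentences out with a regular expression. It assumes that each sentence sits inside a `<seg id="...">...</seg>` element, as is usual for WMT `.sgm` files, so check your own files before relying on it. The names `extract_segments` and `sgml_to_plain` are illustrative.

```python
import html
import re

# Matches one segment such as: <seg id="1">Some sentence.</seg>
SEG_PATTERN = re.compile(r"<seg[^>]*>(.*?)</seg>", re.DOTALL)


def extract_segments(sgml_text):
    """Return the text of every <seg> element with entities like &amp; decoded."""
    segments = []
    for raw in SEG_PATTERN.findall(sgml_text):
        # Collapse internal whitespace so each segment fits on one line.
        segments.append(html.unescape(" ".join(raw.split())))
    return segments


def sgml_to_plain(in_path, out_path):
    """Write one sentence per line to out_path; return the sentence count."""
    with open(in_path, encoding="utf-8") as fin:
        segments = extract_segments(fin.read())
    with open(out_path, "w", encoding="utf-8") as fout:
        fout.write("\n".join(segments) + "\n")
    return len(segments)
```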

Once we have tokenized all six files, let us build our configuration file for training with OpenNMT.

<br>
Let us create a *data.yml* file with the following configuration. It can be changed at any time based on your task and requirements.



```
model_dir: en_de_translation

data:
  train_features_file: train.en    #tokenized english data
  train_labels_file: train.de      #tokenized german data
  eval_features_file: valid.en     #tokenized english val data
  eval_labels_file: valid.de       #tokenized german val data
  
  source_words_vocabulary: english.vocab   #vocab generated by sp
  target_words_vocabulary: german.vocab

train:
  save_checkpoints_steps: 1000
  
  #Keep at most 10 checkpoints in storage
  keep_checkpoint_max: 10
  
  train_steps: 200000
  batch_size: 32
 
params:
  #Replace unknown target tokens with the source token that has the highest attention
  replace_unknown_target: True
  optimizer: AdamOptimizer
  learning_rate: 0.001
  

eval:
  #Perform evaluation after every 10 minutes (helps in Tensorboard loss log)
  eval_delay: 600
  save_eval_predictions: True
  

infer:
  batch_size: 32


```
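
Training on parallel data relies on each features file and its labels file having one sentence per line, in the same order. Before launching a long run, it may be worth checking this with a few lines of Python. The helper names `count_lines` and `check_parallel` are made up for this sketch.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(features_path, labels_path):
    """Raise ValueError if the two files differ in length; else return the count."""
    n_src = count_lines(features_path)
    n_tgt = count_lines(labels_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{features_path} has {n_src} lines but {labels_path} has {n_tgt}"
        )
    return n_src
```

For the configuration above, you would call `check_parallel("train.en", "train.de")` and `check_parallel("valid.en", "valid.de")`.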

<br>

Now that we have all the files ready, let's go ahead and train our model.




**Step 2: Training and Evaluating the Model**



In [0]:
onmt-main train_and_eval --model_type Transformer --auto_config --config data.yml

**Step 3: Translation and Testing the results**

In [0]:
onmt-main infer --auto_config --config data.yml --features_file test.en --predictions_file output.de

Infer takes the test file, predicts a translation for each sentence and saves the result to output.de.
The predictions might not be very accurate, as the dataset we are training on is small.
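
Because the model was trained on SentencePiece-tokenized text, the predictions are presumably sequences of subword pieces as well. SentencePiece marks the start of each word with the character "▁" (U+2581), so a line can be turned back into plain text by joining the pieces and replacing that marker with a space. The function name `detokenize` is an assumption for this sketch; `spm_decode` or the Python `decode` method should give the same result for ordinary text.

```python
def detokenize(line):
    """Undo SentencePiece tokenization for one line of space-separated pieces."""
    pieces = line.split()
    # Pieces are concatenated directly; the U+2581 marker shows where the
    # original spaces were.
    return "".join(pieces).replace("\u2581", " ").strip()
```

For example, `detokenize("▁Das ▁ist ▁ein ▁Bei spiel .")` returns `"Das ist ein Beispiel."`.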

<br>
By default, the most recent checkpoint in the model directory is used for translation. If you want to check the results from a different checkpoint, pass it with --checkpoint_path:


In [0]:
onmt-main infer --auto_config --config data.yml --features_file test.en --predictions_file output.de --checkpoint_path en_de_translation/model.ckpt-100000