# **Machine Translation**

Machine translation (MT) is the task of automatically translating text or speech from
one natural language to another (Wang et al., 2019). MT is a subfield of natural language
processing (NLP) that draws on artificial intelligence, information theory, computer science, and
statistics.

In this example, the ``OpenNMT-py`` library is used to demonstrate a **N**eural **M**achine **T**ranslation (NMT) task.
The code follows the ``quickstart`` example in the OpenNMT-py repository: [OpenNMT-py: Quickstart](https://github.com/OpenNMT/OpenNMT-py#quickstart)

## **Installation**

In [None]:
# Install OpenNMT-py 2.x
# NOTE: At the end of the installation, Colab might ask you to restart the runtime.
# In that case, just click the "RESTART RUNTIME" button.

!pip3 install git+https://github.com/OpenNMT/OpenNMT-py.git

In [None]:
# On Google Colab ONLY
# Reinstall Torch to avoid an incompatibility with CUDA 10.1

# NOTE: At the end of the installation, Colab might ask you to restart the runtime.
# In that case, just click the "RESTART RUNTIME" button.

!pip3 install --ignore-installed torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

## **Download files**

In [1]:
# Download and extract the Quickstart data files

!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz

--2021-03-26 08:58:13--  https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.206.85
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.206.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1662081 (1.6M) [application/x-gzip]
Saving to: ‘toy-ende.tar.gz’


2021-03-26 08:58:13 (3.38 MB/s) - ‘toy-ende.tar.gz’ saved [1662081/1662081]



In [2]:
# Optional: List the extracted files

!cd toy-ende/ && ls

src-test.txt   src-val.txt   tgt-train.txt
src-train.txt  tgt-test.txt  tgt-val.txt


In [3]:
# Optional: Print the first 3 lines of the source file

!head -n 3 toy-ende/src-train.txt

It is not acceptable that , with the help of the national bureaucracies , Parliament &apos;s legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
&quot; Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .


In [4]:
# Optional: Check the number of lines in the source file

!echo "Number of lines:" && wc -l toy-ende/src-train.txt

Number of lines:
10000 toy-ende/src-train.txt
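
A parallel corpus is aligned line by line, so the source and target files of each split should contain the same number of lines. The check below is a minimal sketch, assuming the `toy-ende` folder layout listed above; `count_lines` and `check_parallel` are hypothetical helper names introduced here for illustration.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path, tgt_path):
    """Return (n_src, n_tgt, aligned) for a source/target file pair."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    return n_src, n_tgt, n_src == n_tgt


if __name__ == "__main__":
    data_dir = Path("toy-ende")  # assumed location of the extracted files
    for split in ("train", "val", "test"):
        src = data_dir / f"src-{split}.txt"
        tgt = data_dir / f"tgt-{split}.txt"
        # Skip silently if the data has not been downloaded yet
        if src.exists() and tgt.exists():
            print(split, check_parallel(src, tgt))
```

If the third element of a printed tuple is `False`, the two files of that split are misaligned and should be inspected before training.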


## **Prepare data**

In [5]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano

config = '''# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example

## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

## Where the model will be saved
save_model: model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

world_size: 1
gpu_ranks: [0]

# Remove or modify these lines for bigger files
train_steps: 1000
valid_steps: 200
'''
# Write the configuration to disk
with open("toy_en_de.yaml", "w") as config_yaml:
    config_yaml.write(config)

# Look at the file content
!cat toy_en_de.yaml

# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example

## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

## Where the model will be saved
save_model: model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

world_size: 1
gpu_ranks: [0]

# Remove or modify these lines for bigger files
train_steps: 1000
valid_steps: 200


In [6]:
# Build Vocabulary

!onmt_build_vocab -config toy_en_de.yaml -n_sample -1

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-26 08:59:38,054 INFO] Counter vocab from -1 samples.
[2021-03-26 08:59:38,054 INFO] n_sample=-1: Build vocab on full datasets.
[2021-03-26 08:59:38,063 INFO] corpus_1's transforms: TransformPipe()
[2021-03-26 08:59:38,064 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2021-03-26 08:59:38,366 INFO] Counters src:24995
[2021-03-26 08:59:38,366 INFO] Counters tgt:35816
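
The `Counters src` and `Counters tgt` lines appear to report how many distinct tokens were counted on each side. The idea can be sketched with `collections.Counter`, assuming plain whitespace tokenisation (the toy corpus is already split by spaces). This is an illustration only, not OpenNMT-py's actual implementation, and `count_tokens` is a hypothetical helper.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


# Tiny in-memory sample standing in for a training file
sample = [
    "the cat sat on the mat .",
    "the dog sat .",
]
counts = count_tokens(sample)
print(len(counts))    # number of distinct tokens in the sample
print(counts["the"])  # frequency of a single token
```

To try it on the real data, pass an open file handle instead of the list, for example `count_tokens(open("toy-ende/src-train.txt", encoding="utf-8"))`.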


In [7]:
# Check if GPU is active
# If not, go to "Runtime" menu > "Change runtime type" > "GPU"

!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-39210352-32c6-3d62-accf-350493941670)


In [8]:
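# Make sure the GPU is visible to PyTorch

import torch

print(torch.cuda.is_available())
# Query the device only if CUDA is available; otherwise the call would fail
if torch.cuda.is_available():
    gpu_id = torch.cuda.current_device()
    print(torch.cuda.get_device_name(gpu_id))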

True
Tesla T4


## **Train model**

In [None]:
# Train the NMT model (takes roughly 5 minutes)

!onmt_train -config toy_en_de.yaml
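
The configuration sets `train_steps: 1000` and `valid_steps: 200`. Assuming validation runs whenever the step count is a multiple of `valid_steps`, the validation schedule can be sketched as follows; `validation_steps` is a hypothetical helper, not part of OpenNMT-py.

```python
def validation_steps(train_steps, valid_steps):
    """List the steps at which validation would run, assuming it fires
    whenever the current step is a multiple of valid_steps."""
    return list(range(valid_steps, train_steps + 1, valid_steps))


print(validation_steps(1000, 200))  # [200, 400, 600, 800, 1000]
```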

## **Translate**

In [None]:
# Translate the test set with the checkpoint saved at step 1000

!onmt_translate -model model/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

In [26]:
# Look at the first 5 translations

!head -n 5 toy-ende/pred_1000.txt

Die Aussprache ist ein
Die Aussprache ist von der Europäischen Union , die auf der USA .
Das ist jedoch für den Vorschlag der Europäischen Union .
Die Aussprache ist von der Europäischen Union .
Die Aussprache ist von der Europäischen Union , die auf der USA .
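
After only 1000 steps on 10,000 sentence pairs, the output is repetitive, which becomes easier to judge when each prediction is shown next to its source and reference. The sketch below assumes the predictions line up one-to-one with `toy-ende/src-test.txt` and `toy-ende/tgt-test.txt`; `side_by_side` is a hypothetical helper. It also unescapes entities such as `&apos;` and `&quot;`, which appear in the corpus, purely for display.

```python
import html
import os
from itertools import islice


def side_by_side(src_path, ref_path, pred_path, n=3):
    """Yield (source, reference, prediction) triples for the first n lines."""
    with open(src_path, encoding="utf-8") as src, \
         open(ref_path, encoding="utf-8") as ref, \
         open(pred_path, encoding="utf-8") as pred:
        for triple in islice(zip(src, ref, pred), n):
            # Strip newlines and turn HTML entities back into characters
            yield tuple(html.unescape(line.strip()) for line in triple)


if __name__ == "__main__":
    paths = (
        "toy-ende/src-test.txt",   # source sentences
        "toy-ende/tgt-test.txt",   # reference translations
        "toy-ende/pred_1000.txt",  # model output from the step above
    )
    # Only run when all three files are present
    if all(os.path.exists(p) for p in paths):
        for source, reference, prediction in side_by_side(*paths):
            print("SRC :", source)
            print("REF :", reference)
            print("PRED:", prediction)
            print()
```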


Copyright © 2021 IU International University of Applied Sciences