# AI4D Yorùbá Machine Translation Challenge

File name: AI4DYorubaMT.ipynb

Author: kogni7

Date: April/May 2021

## Contents
* 1 Preparation
    * 1.1 GPU
    * 1.2 Time
    * 1.3 Installation
    * 1.4 Libraries and Seed
    * 1.5 Working directory
* 2 Data
    * 2.1 Validation set
    * 2.2 Tokenization
    * 2.3 Datasets
* 3 Training
    * 3.1 Model and Parameters
    * 3.2 Train!
* 4 Prediction and Submission

This notebook uses only the datasets provided by Zindi. They contain parallel sentences in Yoruba and English, and these sentences are the only features used. The task is to translate Yoruba into English.

The file system for this project is:
* AI4DYorubaMT (root)
    * AI4DYorubaMT.ipynb (this notebook)
    * Data
        * Train.csv
        * Test.csv
        * SampleSubmission.csv
    * Submission
        * 1 - x: Submission directories, named by version number
            * submission.csv
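
The layout above can be built programmatically. The following is a minimal sketch, assuming a hypothetical helper `submission_file` (not part of the notebook) that creates the version folder if it is missing and returns the path of `submission.csv`; a temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path


def submission_file(root: Path, version: str) -> Path:
    """Return <root>/Submission/<version>/submission.csv, creating the folder."""
    target_dir = root / "Submission" / version
    # parents/exist_ok make the call safe to repeat and safe on a fresh checkout.
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / "submission.csv"


# Demo in a throwaway directory; in the notebook the root would be WD.
with tempfile.TemporaryDirectory() as tmp:
    path = submission_file(Path(tmp), "9")
    print(path.relative_to(tmp))
```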

This Jupyter notebook runs in Google Colab without special configuration, with the GPU runtime enabled.

The notebook fine-tunes a pretrained MarianMT transformer from Hugging Face (huggingface.co) that was trained on the JW300 dataset (https://huggingface.co/Helsinki-NLP/opus-mt-yo-en).

## 1 Preparation
### 1.1 GPU
Make sure the GPU matches the one shown below (Tesla T4); otherwise restart the environment.

In [1]:
!nvidia-smi

Sun May 30 19:36:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
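
The GPU check can also be automated instead of read off the table. This is a sketch only: `gpu_matches` and `query_gpu_names` are hypothetical helpers, and the expected name "Tesla T4" is taken from the output above. It relies on the `--query-gpu=name --format=csv,noheader` options of `nvidia-smi` and returns an empty string when the tool is not available.

```python
import subprocess

EXPECTED_GPU = "Tesla T4"  # taken from the nvidia-smi output above


def gpu_matches(smi_names: str, expected: str = EXPECTED_GPU) -> bool:
    """Return True if any line of the GPU name listing contains `expected`."""
    return any(expected in line for line in smi_names.splitlines())


def query_gpu_names() -> str:
    """Ask nvidia-smi for the GPU names; return '' if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return ""
    return result.stdout


if not gpu_matches(query_gpu_names()):
    print("Expected GPU not found; consider restarting the runtime.")
```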

### 1.2 Time

In [2]:
import time
start_time = time.time()

### 1.3 Installation

In [3]:
!pip install git+https://github.com/huggingface/transformers
!pip install rouge-score
!pip install sentencepiece

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-70drmrqe
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-70drmrqe
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
     |████████████████████████████████| 901kB 13.2MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e

### 1.4 Libraries and Seed

In [4]:
!python --version

SEED = 42

# Math
import numpy as np
print("Numpy Version: " + str(np.__version__))

import random
import os
os.environ['PYTHONHASHSEED'] = str(SEED)

np.random.seed(SEED + 1)

random.seed(SEED + 2)

# PyTorch
import torch
print("PyTorch Version: " + str(torch.__version__))
torch.manual_seed(SEED + 3)
torch.cuda.manual_seed_all(SEED + 4)

# Time
import time

# CSV
import pandas as pd
print("Pandas Version: " + str(pd.__version__))

# Machine Learning
import sklearn
from sklearn.model_selection import train_test_split
print("SciKit-Learn Version: " + str(sklearn.__version__))

# Transformers
import transformers
from transformers import MarianTokenizer, MarianMTModel, Seq2SeqTrainingArguments, Seq2SeqTrainer

print("Transformers Version: " + str(transformers.__version__))

# Rouge
from rouge_score import rouge_scorer

from tqdm import tqdm
import gc

Python 3.7.10
Numpy Version: 1.19.5
PyTorch Version: 1.8.1+cu101
Pandas Version: 1.1.5
SciKit-Learn Version: 0.22.2.post1
Transformers Version: 4.7.0.dev0
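
The seeding above can be wrapped in one function so it is easy to repeat before each experiment. This is a simplified variant, not the cell above: the hypothetical `seed_everything` uses a single seed for all generators, whereas the notebook uses a different offset per library. Note also that setting `PYTHONHASHSEED` inside a running interpreter only affects subprocesses started afterwards; string hashing in the current process is fixed at startup.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only seen by child processes
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass


# Reseeding reproduces the same random draw.
seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```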


### 1.5 Working directory

In [5]:
# The Version
VERSION = '9'

# for use in Google Colab
from google.colab import drive
drive.mount('/content/drive')
 
# Working Directory
WD = os.getcwd() + '/drive/My Drive/Colab Notebooks/AI4DYorubaMT'

Mounted at /content/drive


## 2 Data

In [6]:
train_csv = pd.read_csv(WD + '/Data/Train.csv')
test_csv = pd.read_csv(WD + '/Data/Test.csv')
sample_submission_csv = pd.read_csv(WD + '/Data/SampleSubmission.csv')
train_csv.head()

Unnamed: 0,ID,Yoruba,English
0,ID_AAJEQLCz,A ṣètò Ìgbìmọ̀ Tó Ń Ṣètò Ìrànwọ́ Nígbà Àjálù l...,A Disaster Relief Committee was formed to orga...
1,ID_AASNedba,"Ìrọ̀lẹ́ May 22, 2018 ni wọ́n fàṣẹ ọba mú Arákù...",Brother Solovyev was arrested on the evening o...
2,ID_AAeQrhMq,Iléeṣẹ́ Creative Commons náà,Creative Commons the Organization
3,ID_AAxlMgPP,"Pè̩lú Egypt, Morocco àti Tunisia tí wó̩n ti lo...","With Egypt, Morocco and Tunisia out of the Wor..."
4,ID_ABKuMKSx,Adájọ́ àgbà lórílẹ̀ èdè Náíjíríà (Attorney Gen...,"The Attorney General of the Federation, Justic..."


In [7]:
# Check for rows where Yoruba is equal to English (i.e. untranslated rows).
rows = []
for i in range(len(train_csv)):
    if train_csv["Yoruba"].iloc[i] == train_csv["English"].iloc[i]:
        print(i)
        rows.append(i)

# Remove these rows.
train_csv = train_csv.drop(rows)
print("Rows removed!")

# Check again.
for i in range(len(train_csv)):
    if train_csv["Yoruba"].iloc[i] == train_csv["English"].iloc[i]:
        print(i)

225
3816
5997
6157
6282
9106
Rows removed!
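
The same filter can be written without an explicit loop. The sketch below runs on a toy frame with placeholder strings (made up for illustration, not taken from Train.csv). It also avoids a subtlety of the loop above, which collects positions via `iloc` and passes them to `drop`, which expects index labels; that works only because the frame still has its default integer index.

```python
import pandas as pd

# Toy frame standing in for Train.csv; the values are placeholders.
toy = pd.DataFrame({
    "Yoruba": ["yo-sentence-1", "Creative Commons", "yo-sentence-3"],
    "English": ["en-sentence-1", "Creative Commons", "en-sentence-3"],
})

# Boolean mask of rows whose source and target are identical.
same = toy["Yoruba"] == toy["English"]
print(toy.index[same].tolist())

# Keep the remaining rows and renumber them.
cleaned = toy[~same].reset_index(drop=True)
print(len(cleaned))
```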


In [8]:
test_csv.head()

Unnamed: 0,ID,Yoruba
0,ID_AAAitMaH,"Nínú ìpè kan lẹ́yìn ìgbà náà, wọ́n sọ fún aṣoj..."
1,ID_AAKKdQwr,Nítorí kò sí nǹkan tí ọkùnrin ò lè ṣe láì náán...
2,ID_ABgAyEOp,Bí i kó pariwo. Kí ó kígbe mọ́ ẹ?
3,ID_ACFgfKQs,"Tí ó ń lé e lọ sọ́nà etí odò Akókurà, tí ó bẹ̀..."
4,ID_ACNPmlhf,Èṣúńiyì mọ̀ iṣẹ́ rẹ̀ dunjú. Màmá tirí bí ó ṣe ...


In [9]:
sample_submission_csv.head()

Unnamed: 0,ID,Label
0,ID_ABgAyEOp,0
1,ID_ACFgfKQs,0
2,ID_ACNPmlhf,0
3,ID_ACqxiSuP,0
4,ID_ADPgGOCq,0


### 2.1 Validation set

In [10]:
X_train, X_val, y_train, y_val = train_test_split(list(train_csv["Yoruba"]),
                                                  list(train_csv["English"]),
                                                  test_size=0.2, random_state=SEED)

X_test = list(test_csv["Yoruba"])

### 2.2 Tokenization

In [11]:
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-yo-en')

train_data = tokenizer(X_train, return_tensors="pt", padding=True, truncation=True)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(y_train, return_tensors="pt", padding=True, truncation=True)
train_data["labels"] = labels["input_ids"]

val_data = tokenizer(X_val, return_tensors="pt", padding=True, truncation=True)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(y_val, return_tensors="pt", padding=True, truncation=True)
val_data["labels"] = labels["input_ids"]
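
One detail of the cell above: the label tensors keep the tokenizer's padding ids, so padded positions appear to be included in the training loss. A common variant, shown here as a sketch on plain Python lists, replaces padding ids in the labels with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The helper `mask_padding` and the pad id `0` in the demo are hypothetical; the real id would come from `tokenizer.pad_token_id`.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace padding ids with IGNORE_INDEX so they do not count in the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# Two padded label rows, with 0 as a made-up pad id.
demo_labels = [[5, 7, 9, 0, 0], [3, 0, 0, 0, 0]]
print(mask_padding(demo_labels, pad_token_id=0))
```

The `compute_metrics` function in section 3.1 already maps -100 back to the pad id before decoding, so it would work unchanged with masked labels.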


### 2.3 Datasets

In [12]:
class MakeDataSet(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}

    def __len__(self):
        return len(self.data["input_ids"])

train_dataset = MakeDataSet(train_data)
val_dataset = MakeDataSet(val_data)

## 3 Training
### 3.1 Model and Parameters

In [13]:
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-yo-en')

batch_size = 8
args = Seq2SeqTrainingArguments(
        output_dir="output",
        evaluation_strategy = "steps",
        learning_rate=15e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        save_total_limit=3,
        num_train_epochs=3,
        load_best_model_at_end=True,
        save_strategy="steps",
        logging_steps=1000,
        save_steps=1000,
        predict_with_generate=True,
        seed=SEED)


In [14]:
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

def compute_metrics(eval_preds):

    preds, labels = eval_preds

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]

    scores = []
    for i in range(len(decoded_labels)):
        scores.append(scorer.score(decoded_labels[i], decoded_preds[i]))

    scores = [s['rouge1'].fmeasure for s in scores]
 
    result = {}
    result["Rouge"] = np.mean(scores)

    return result
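
To make the metric concrete, here is a hand-written approximation of the ROUGE-1 F-measure: unigram overlap between reference and prediction, turned into precision, recall, and their harmonic mean. It is a simplification, not the `rouge_score` implementation: the hypothetical `rouge1_f` only lowercases and splits on whitespace, while the library applies its own tokenization and (with `use_stemmer=True`) stemming, so the numbers can differ.

```python
from collections import Counter


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between a reference and a predicted sentence."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Precision 2/2, recall 2/3, so the F-measure is about 0.8.
print(rouge1_f("the cat sat", "the cat"))
```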

### 3.2 Train!

In [15]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Step,Training Loss,Validation Loss,Rouge
1000,0.3413,0.632457,0.401037
2000,0.2382,0.596538,0.420478
3000,0.1883,0.588299,0.430537


TrainOutput(global_step=3015, training_loss=0.25555948475700113, metrics={'train_runtime': 2804.562, 'train_samples_per_second': 8.598, 'train_steps_per_second': 1.075, 'total_flos': 4354298102218752.0, 'epoch': 3.0})
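
The step counts in this output follow from the training arguments. The sketch below reproduces the arithmetic under stated assumptions: a single device, no gradient accumulation, and roughly 8,038 training examples. That number is an estimate derived from `train_samples_per_second * train_runtime / epochs`, not a figure reported by the notebook, and `training_schedule` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def training_schedule(n_examples: int, batch_size: int, epochs: int,
                      eval_every: int) -> Tuple[int, List[int]]:
    """Return the total optimizer steps and the steps at which evaluation runs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    eval_steps = list(range(eval_every, total_steps + 1, eval_every))
    return total_steps, eval_steps


# n_examples is an estimate (see the lead-in); the other values come from section 3.1.
total, evals = training_schedule(n_examples=8038, batch_size=8, epochs=3, eval_every=1000)
print(total, evals)
```

Under these assumptions the result matches `global_step=3015` and the three evaluation rows in the table. Since `load_best_model_at_end=True` selects by validation loss unless another metric is configured, the checkpoint from step 3000 is most likely the one restored at the end.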

In [16]:
del trainer

## 4 Prediction and Submission

In [17]:
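b = 24  # batch size for prediction

model.eval()  # disable dropout for inference; the Trainer may leave the model in training mode

with torch.no_grad():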
    predictions = []

    constant = 0

    for i in tqdm(range(int(len(X_test) / b))):
        if i == int(len(X_test) / b) - 1:
            end = len(X_test)
        else:
            end = constant + b

        test_data = tokenizer(X_test[constant:end], return_tensors="pt", padding=True, truncation=True)
        test_data.to('cuda')
        generation = model.generate(**test_data)
        prediction = tokenizer.batch_decode(generation, skip_special_tokens=True)
        predictions += prediction

        del test_data, generation, prediction

        constant += b
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

100%|██████████| 284/284 [19:29<00:00,  4.12s/it]
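
The loop above runs `len(X_test) // b` iterations and lets the last one absorb the remainder, so the final batch can hold up to `2 * b - 1` sentences, and nothing is translated at all if the test set has fewer than `b` sentences. A more conventional way to slice the data is sketched below with a hypothetical `batches` generator; the tokenize/generate/decode steps would go inside the loop unchanged.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items, including the remainder."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Ten placeholder sentences in batches of four: the last batch is shorter.
demo = [f"sentence {i}" for i in range(10)]
sizes = [len(chunk) for chunk in batches(demo, 4)]
print(sizes)
```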


In [18]:
sample_submission_csv.ID = test_csv.ID
sample_submission_csv.Label = list(predictions)
sample_submission_csv.head()

Unnamed: 0,ID,Label
0,ID_AAAitMaH,"During a call that followed, representatives o..."
1,ID_AAKKdQwr,For there is nothing that a man cannot do with...
2,ID_ABgAyEOp,If he throws a noise. What would he cry to you?
3,ID_ACFgfKQs,The person who is on the way to the river Abuj...
4,ID_ACNPmlhf,She was familiar with her job. Mother has conf...


In [19]:
# Avoid empty cells: replace empty predictions with a placeholder.
# .loc avoids the chained assignment of the element-wise loop.
empty = sample_submission_csv["Label"] == ""
sample_submission_csv.loc[empty, "Label"] = "untranslated"

In [20]:
# makedirs with exist_ok=True does not fail if the directory already exists.
os.makedirs(WD + '/Submission/' + str(VERSION), exist_ok=True)

In [21]:
sample_submission_csv.to_csv(WD + '/Submission/' + str(VERSION) + '/submission.csv', index=False)

In [22]:
drive.flush_and_unmount()

In [23]:
end_time = time.time()
print("Runtime of the Notebook: {} min".format(np.round((end_time - start_time) / 60, 2)))

Runtime of the Notebook: 67.77 min
