# Neural Machine Translation using a Transformer Model

Fine-tune [MT5](https://huggingface.co/google/mt5-small) to translate English to Akuapem Twi. The dataset is available on [Zenodo-AfricanNLP](https://zenodo.org/records/4432117).


### Install all required libraries

In [None]:
# Installs all necessary libraries
!pip install torch
!pip install nlgeval
!pip install torchvision
!pip install nltk
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install sacrebleu
!pip install --upgrade --no-cache-dir gdown
!pip install transformers sentencepiece
!pip install "accelerate>=0.21.0"
!pip install transformers[torch]


### Import Libraries

In [None]:
# Imports all necessary libraries
import re
import torch
import pandas as pd
import numpy as np
import evaluate

from transformers import pipeline
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq
from transformers import AutoTokenizer, MarianTokenizer, MarianMTModel
from sklearn.model_selection import train_test_split
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
from datasets import load_dataset, Dataset, DatasetDict

## Download the dataset
Loads the English-Akuapem Twi datasets (training and test):

In [None]:
# Loads the training dataset
df = pd.read_csv("verified_data.csv")
df.head()

Unnamed: 0,English,Akuapem Twi
0,What she lacks in charisma she makes up for wi...,Nea onni ho adwempa no de adwumaden na ɛba.
1,There was nothing I could do about it.,Na biribiara nni hɛ a metumi ayɔ
2,Kwaku saw John and Abena holding hands.,Kwaku hui se John ne Abena kurakura wɛn nsa.
3,Can you stay till 2:30?,So wubetumi atena ha akosi nnɛnmienu npaamu ad...
4,You haven't got much time.,Wonni mmre


In [None]:
# Loads the test dataset
df_test = pd.read_csv("crowdsourced_testdata.csv")
df_test.head()

Unnamed: 0,English Sentence,Twi Translation
0,What is going on here?,Ɛdeɛn na ɛrekɔso wɔ aha?
1,Wake up,Sɔre
2,She comes here every Friday,Ɔba ha Fiada biara
3,Learn to be wise,Sua nyansa
4,I didn’t think you would loose your way,Mannwene da sɛ wo bɛyera


## Data Preprocessing

This section prepares the two datasets (training and testing) and transforms both into clean, structured formats. The collected datasets were mostly clean, so the pipeline did not require a lot of cleaning. The training dataset had no missing values, the test dataset had two, and both sets contained a few duplicate rows, of which only the first occurrence was kept.

#### Data Cleaning

In [None]:
# Checks for missing values in the training dataset
df.isnull().sum()

English        0
Akuapem Twi    0
dtype: int64

In [None]:
# Checks for missing values in the testing dataset
df_test.isnull().sum()

English Sentence    0
Twi Translation     2
dtype: int64

In [None]:
# Drops the two rows with missing values in the test set
df_test = df_test.dropna()

In [None]:
# Checks for duplicate values in each column in the train set
train_duplicate_counts = df.apply(lambda x: x.duplicated().sum())
print("Total duplicate values in each column:")
print(train_duplicate_counts)

Total duplicate values in each column:
English        281
Akuapem Twi    656
dtype: int64


In [None]:
# Checks for duplicate values in each column in the test set
test_duplicate_counts_test = df_test.apply(lambda x: x.duplicated().sum())
print("Total duplicate values in each column:")
print(test_duplicate_counts_test)

Total duplicate values in each column:
English Sentence    34
Twi Translation     33
dtype: int64


In [None]:
# Drops duplicate rows in the train set
df_updated = df.drop_duplicates()

In [None]:
# Drops duplicate rows in the test set
df_test_updated = df_test.drop_duplicates()

In [None]:
# Gets the total number of data points before and after dropping duplicates from the train set
total_rows_before_dropping = len(df)
print("Total number of rows before dropping duplicates:", total_rows_before_dropping)
total_rows_after_dropping = len(df_updated)
print("Total number of rows after dropping duplicates:", total_rows_after_dropping)

Total number of rows before dropping duplicates: 25420
Total number of rows after dropping duplicates: 25171


In [None]:
# Gets the total number of data points before and after dropping duplicates from the test set
total_rows_before_dropping = len(df_test)
print("Total number of rows before dropping duplicates:", total_rows_before_dropping)
total_rows_after_dropping = len(df_test_updated)
print("Total number of rows after dropping duplicates:", total_rows_after_dropping)

Total number of rows before dropping duplicates: 695
Total number of rows after dropping duplicates: 682
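
The per-column duplicate counts are larger than the number of rows actually dropped because `duplicated()` applied to a single column flags any repeated value in that column, whereas `drop_duplicates()` on the whole frame removes a row only when every column matches an earlier row. A minimal sketch on a made-up three-row frame (the sentences are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the English text but differ in translation;
# rows 0 and 2 are identical in every column.
toy = pd.DataFrame({
    "English": ["Wake up", "Wake up", "Wake up"],
    "Akuapem Twi": ["Sɔre", "Nyan", "Sɔre"],
})

# Column-wise: counts repeated values within each column independently
per_column = toy.apply(lambda col: col.duplicated().sum())

# Row-wise: a row is dropped only if all columns match an earlier row
deduplicated = toy.drop_duplicates()

print(per_column.to_dict())
print(len(toy), "->", len(deduplicated))
```

Here the English column contains two repeated values, yet only one full row is a true duplicate.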


#### Data Formatting and Train Dataset Split

In [None]:
# Converts the dataframe columns to Python lists for the train set
# (reads from df, so the duplicate rows removed in df_updated are still included)
English_phrases = df['English'].tolist()
Twi_translations = df['Akuapem Twi'].tolist()


The training dataset is split into two sets: a train set (80%) and a validation set (20%). A separate, independently gathered test dataset serves as the evaluation set, so the training data does not need to be split any further.

In [None]:
# Splits the dataset into training and validation sets
train_english, val_english, train_twi, val_twi = train_test_split(English_phrases, Twi_translations, test_size=0.2)
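
As a rough illustration of what an 80/20 split of paired sentences involves, the sketch below shuffles indices with a fixed seed and slices them. It is not scikit-learn's implementation; `split_pairs`, the seed value, and the placeholder strings are assumptions made for the example. Passing `random_state` to `train_test_split` is the usual way to get a repeatable split there.

```python
import random

def split_pairs(sources, targets, test_size=0.2, seed=42):
    """Shuffle paired sentences together and split them into train/validation parts."""
    assert len(sources) == len(targets)
    indices = list(range(len(sources)))
    random.Random(seed).shuffle(indices)       # fixed seed -> repeatable split
    n_val = round(test_size * len(indices))    # size of the validation part
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [sources[i] for i in train_idx], [sources[i] for i in val_idx],
        [targets[i] for i in train_idx], [targets[i] for i in val_idx],
    )

# Placeholder data with the same row count as the training CSV (25,420 rows)
english = [f"en-{i}" for i in range(25420)]
twi = [f"twi-{i}" for i in range(25420)]
train_en, val_en, train_tw, val_tw = split_pairs(english, twi)
print(len(train_en), len(val_en))
```

Because the same shuffled indices are applied to both lists, each English sentence stays aligned with its translation.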

#### Tokenization

The pipeline then loads the `MT5` tokenizer to process the English-Akuapem Twi language pairs.

The tokenizer converts text into a format that the model can process: it splits text into tokens (subwords or words), converts the tokens into numerical IDs, and adds any necessary special tokens. This pipeline uses the tokenizer that ships with the pre-trained `google/mt5-small` checkpoint.


In [None]:
# Loads and initializes the MT5 tokenizer
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")


In [None]:
# Defines source and target languages and the translation prefix
source_lang = "en"
target_lang = "twi"
prefix = "translate English to Twi: "

The `preprocess_function` below ensures that all input data is processed uniformly for model training and evaluation by:

* Adding a translation prefix to the source sentences;
* Tokenizing both the source and target sentences with padding and truncation; and,
* Returning the tokenized data in a format that the model can understand.

The function takes three (3) parameters:
* examples - A dictionary containing source and target language sentences;
* tokenizer - A tokenizer object from the Hugging Face Transformers library; and,
* max_length - An optional parameter that sets the maximum length of the tokenized sequences (default 128).

Each source sentence is prefixed with the translation prefix, the target sentences are collected as a list, and both are tokenized in one call. The function returns the input token IDs and attention masks, together with the target token IDs under the `labels` key.

In [None]:
# Defines the preprocessing function
def preprocess_function(examples, tokenizer, max_length=128):
    inputs = [prefix + ex for ex in examples[source_lang]]
    targets = [ex for ex in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True, padding='max_length')
    return model_inputs
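
To make the effect of `truncation=True` and `padding='max_length'` concrete, the sketch below mimics them on hand-written token IDs. The IDs, the padding ID of 0, and the helpers `pad_or_truncate` and `mask_label_padding` are illustrative assumptions, not output of the real tokenizer. It also shows a common convention in Hugging Face seq2seq training, where padded label positions are replaced with -100 so that the loss skips them. Whether this notebook would benefit from that step is worth checking, since labels padded by the tokenizer keep the ordinary padding ID.

```python
PAD_ID = 0            # assumed padding ID for this sketch
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss ignores by default

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                      # truncation (simplified: no EOS handling)
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [pad_id] * (max_length - len(ids))    # right-padding
    return ids, mask

def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions in the labels so the loss does not count them."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]

source_ids = [259, 1042, 317, 1]   # made-up IDs, ending in an EOS-like token
target_ids = [4521, 88, 1]

input_ids, attention_mask = pad_or_truncate(source_ids, max_length=6)
labels, _ = pad_or_truncate(target_ids, max_length=6)
labels = mask_label_padding(labels)

print(input_ids)
print(attention_mask)
print(labels)
```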

The `Dataset` class used in the cell below comes from the Hugging Face `datasets` library, which handles and preprocesses datasets efficiently for machine learning tasks, particularly natural language processing (NLP).

The `from_dict` class method creates a `Dataset` object from a dictionary. The keys of the dictionary are the column names, and the values are lists of data for those columns.

Variables and Data:
* `source_lang` and `target_lang` hold the strings used as column names for the source and target languages.
* `train_english` and `train_twi` are lists containing English phrases and their corresponding Twi translations for the training set.
* `val_english` and `val_twi` are lists containing English phrases and their corresponding Twi translations for the validation set.

In [None]:
# Creates datasets
train_dataset = Dataset.from_dict({source_lang: train_english, target_lang: train_twi})
val_dataset = Dataset.from_dict({source_lang: val_english, target_lang: val_twi})

The step below maps `preprocess_function` over both the training and validation datasets, tokenizing the examples in batches. This prepares the datasets for the model by converting the text into a format it can understand (e.g., token IDs and attention masks).

In [None]:
# Preprocess datasets
tokenized_train_dataset = train_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
tokenized_val_dataset = val_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
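
With `batched=True`, `map` hands the function a dictionary of column lists rather than a single row, which is why `preprocess_function` iterates over `examples[source_lang]`. The pure-Python sketch below imitates that calling convention; `map_batched` and `add_prefix` are hypothetical stand-ins and not part of the `datasets` API.

```python
def map_batched(columns, fn, batch_size=2):
    """Apply fn to slices of a column dictionary and merge the returned columns."""
    n_rows = len(next(iter(columns.values())))
    merged = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict mapping column name -> list of values
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            merged.setdefault(name, []).extend(values)
    return merged

def add_prefix(examples, prefix="translate English to Twi: "):
    # examples["en"] is a list of sentences, one entry per row in the batch
    return {"prefixed": [prefix + sentence for sentence in examples["en"]]}

data = {"en": ["Wake up", "Learn to be wise", "I am Simeon"]}
result = map_batched(data, add_prefix)
print(result["prefixed"])
```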


The code below removes the raw source and target text columns from the tokenized datasets, leaving only the columns the model consumes. This streamlines the datasets and avoids warnings about unused columns.

In [None]:
# Removes unused columns to avoid warnings
tokenized_train_dataset = tokenized_train_dataset.remove_columns([source_lang, target_lang])
tokenized_val_dataset = tokenized_val_dataset.remove_columns([source_lang, target_lang])

The tokenized datasets are then set to return PyTorch tensors, the format used for training. The `columns` argument restricts the output to the input IDs, attention masks, and labels needed for model training.

In [None]:
# Sets format for PyTorch
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

For convenient access during training and evaluation, the following code groups the tokenized training and validation datasets into a single `DatasetDict` object, `tokenized_data`.

In [None]:
# Combines datasets into a DatasetDict
tokenized_data = DatasetDict({
    "train": tokenized_train_dataset,
    "validation": tokenized_val_dataset
})

## Loading Model

The pipeline then loads a pre-trained MT5 model to be fine-tuned for translating English text into Twi. MT5 is a multilingual text-to-text model rather than a ready-made translation system, so it has to be fine-tuned on the parallel data before it can translate. The `from_pretrained` method, called with the identifier [google/mt5-small](https://huggingface.co/google/mt5-small), downloads the weights from the Hugging Face model hub, and the result is assigned to the variable `model`.

In [None]:
# Initializes the model
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")


## Loading Evaluation Metrics

The `compute_metrics` function evaluates the model during validation and testing. It decodes the predicted and ground-truth token IDs back into text and computes the BLEU score between them with sacreBLEU.

In [None]:
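# Defines compute_metrics function for evaluation
metric_bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Replaces -100 (used to pad/ignore positions) with the pad token ID so decoding does not fail
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculates BLEU score; sacreBLEU expects a list of references per prediction
    references = [[label] for label in decoded_labels]
    bleu = metric_bleu.compute(predictions=decoded_preds, references=references)

    return {"bleu": bleu["score"]}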
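
BLEU is built on clipped n-gram precision: the share of predicted n-grams that also appear in the reference, with each n-gram credited at most as often as it occurs there. The sketch below computes only that quantity on whitespace tokens. It omits the brevity penalty, the geometric mean over n-gram orders, and sacreBLEU's own tokenization, so its numbers are not comparable to the `metric_bleu` output. `ngrams` and `ngram_precision` are hypothetical helpers written for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(prediction, reference, n=1):
    """Fraction of predicted n-grams that also occur in the reference (clipped counts)."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as many times as it appears in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

reference = "Ɔba ha Fiada biara"
identical = ngram_precision("Ɔba ha Fiada biara", reference)
repeated = ngram_precision("Ɔba ha ha ha", reference)   # repeated word is clipped
no_overlap = ngram_precision("ɛɛɛ", reference)
print(identical, repeated, no_overlap)
```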


## Model Training & Evaluation of Validation Set

The cell below prepares the training setup for the sequence-to-sequence (seq2seq) model. First, it sets up a data collator to handle batching and padding of the training data. Then, it defines the training arguments, including the output directory, evaluation strategy, learning rate, number of epochs, and batch sizes. Finally, it initializes the trainer with the model, training arguments, tokenized training and validation datasets, tokenizer, metric function, and data collator.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Defines training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
)

# Initializes the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
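
For a sense of scale, the number of optimizer steps follows from the training-set size and the batch size. The arithmetic below assumes a single device and no gradient accumulation (neither is set explicitly in the arguments above) and uses 20,336 training examples, which is 80% of the 25,420 rows loaded from the training CSV.

```python
import math

num_train_examples = 20336        # 80% of the 25,420 training rows
per_device_train_batch_size = 16
num_train_epochs = 2
num_devices = 1                   # assumption: a single GPU or CPU
gradient_accumulation_steps = 1   # assumption: not overridden in the arguments

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```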

`trainer.train()` starts the training process, during which the model learns from the training data. After training, `trainer.save_model("mt5_translation_model")` saves the trained model to disk under the name `mt5_translation_model`, so it can be reused later without retraining.

In [None]:
# Trains the model
trainer.train()

# Save the model
trainer.save_model("mt5_translation_model")



Epoch,Training Loss,Validation Loss,Bleu
1,6.9602,1.556889,0.007381
2,1.223,0.851719,0.00615




## Testing the Model

The cell below converts specific columns of the test dataframe to Python lists and ensures that all elements in these lists are strings, ready for further processing.

#### Data Preprocessing - Test Set

In [None]:
# Converts dataframe columns to Python lists and ensures all entries are strings
# (reads from df_test, so the duplicates removed in df_test_updated are still included)
test_english_sentences = [str(sentence) for sentence in df_test['English Sentence'].tolist()]
actual_twi_translations = [str(translation) for translation in df_test['Twi Translation'].tolist()]


#### Tokenization - Test Set

The pipeline then prepares the test dataset for evaluation by tokenizing English sentences and their corresponding Twi translations, then formatting them into PyTorch tensors. This ensures that the dataset is appropriately processed and ready for assessment of the model's performance on unseen data.

In [None]:
# Preprocess test dataset
test_dataset = Dataset.from_dict({"en": test_english_sentences, "twi": actual_twi_translations})
tokenized_test_dataset = test_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])



#### Evaluation on Test Dataset

Evaluates the model on the preprocessed test dataset and prints the evaluation results, allowing assessment of the model's performance on unseen data.

In [None]:
# Evaluate the model
results = trainer.evaluate(eval_dataset=tokenized_test_dataset)

# Print evaluation results

print(results)



{'eval_loss': 0.6620588898658752, 'eval_bleu': 0.02147630233008602, 'eval_runtime': 24.8873, 'eval_samples_per_second': 27.926, 'eval_steps_per_second': 1.768, 'epoch': 2.0}


The code below defines a function `predict()` that takes an English sentence, a model, a tokenizer, and a device as input parameters. Inside the function:

* The input English sentence is tokenized with the provided tokenizer, and the resulting tensors are moved to the specified device.
* The model generates predictions for the tokenized input using the `generate()` method.
* The predictions are decoded with the tokenizer to obtain the corresponding Twi sentence, which is returned as the output of the function.

Before generating predictions, the code moves the model to the GPU if one is available and to the CPU otherwise.

Then, predictions are generated for each English sentence in the `test_english_sentences` list using the `predict()` function. The English sentence, predicted Twi sentence, and actual Twi translation are printed for each pair in the dataset, allowing comparison between the predicted and actual translations.

In [None]:
# Defines a function to predict Twi sentences from English sentences
def predict(eng_sentence, model, tokenizer, device):
    # Tokenizes the input English sentence with the same task prefix used during training
    inputs = tokenizer(prefix + eng_sentence, return_tensors="pt").to(device)

    # Generates predictions using the model
    with torch.no_grad():
        outputs = model.generate(**inputs)

    # Decodes the predictions to get the Twi sentence
    predicted_twi_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return predicted_twi_sentence

# Ensures the model is on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generates predictions
predicted_twi_sentences = [predict(eng, model, tokenizer, device) for eng in test_english_sentences]

# Prints English sentences, predicted Twi sentences, and actual Twi translations
for eng, twi_pred, twi_actual in zip(test_english_sentences, predicted_twi_sentences, actual_twi_translations):
    print(f"English: {eng}")
    print(f"Predicted Twi: {twi_pred}")
    print(f"Actual Twi: {twi_actual}")
    print()
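
When inspecting predictions, it can help to flag outputs that are obviously degenerate, such as ones containing MT5 sentinel tokens of the form `<extra_id_N>` or long runs of a single character. The helper below is a hypothetical diagnostic with an arbitrary run-length threshold; it is not part of the original pipeline.

```python
import re

SENTINEL = re.compile(r"<extra_id_\d+>")
LONG_RUN = re.compile(r"(.)\1{4,}")   # same character five or more times in a row (arbitrary threshold)

def looks_degenerate(text):
    """Return True if the text contains a sentinel token or a long single-character run."""
    return bool(SENTINEL.search(text) or LONG_RUN.search(text))

samples = ["<extra_id_0>.ɛɛɛɛɛɛɛɛ", "Sua nyansa", "Me ne Simeon"]
flags = [looks_degenerate(s) for s in samples]
print(flags)
```

Applied to a list such as `predicted_twi_sentences`, the share of flagged outputs gives a quick signal of whether the model has learned to produce target-language text at all.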


English: What is going on here?
Predicted Twi: <extra_id_0>.ɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛ
Actual Twi: Ɛdeɛn na  ɛrekɔso wɔ aha?

English: Wake up
Predicted Twi: <extra_id_0>.
Actual Twi: Sɔre

English: She comes here every Friday
Predicted Twi: <extra_id_0>s.ɛɛɛ
Actual Twi: Ɔba ha Fiada biara

English: Learn to be wise
Predicted Twi: <extra_id_0>.ɛɛɛɛɛ
Actual Twi: Sua nyansa

English: I didn’t think you would loose your way
Predicted Twi: <extra_id_0>.ɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛ
Actual Twi: Mannwene da sɛ wo bɛyera

English: If you like to enter more than five pair of sentences for english
Predicted Twi: <extra_id_0>.ɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛ
Actual Twi: Sɛ wo pɛ sɛ wo tintim brɔfo ne twi nsɛmfua nnum ne akyire a

English: My kids are worrying me.
Predicted Twi: <extra_id_0>.ɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛ
Actual Twi: Me mma reha m'adwene

English: I am Simeon
Predicted Twi: <extra_id_0>.ɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛɛ
Actual Twi: Me ne Simeon

English: Think about yourself
Predicted Twi: <extra_id_0>.. ɛ
Actual Twi: Dwene wo ho

English: Thi