# English to Twi Translation Model

This notebook demonstrates fine-tuning a pretrained Hugging Face model to translate English sentences into Twi, a language native to Ghana. By working with a transformer-based translation model, I aim to understand and recognize the value of machine translation, a task that presents unique linguistic challenges because of the differences in grammatical structure and vocabulary between the two languages.



In [1]:
# import libraries

import pandas as pd
from sklearn.model_selection import train_test_split


## Data Sourcing

The data is sourced from Zenodo and contains over 20,000 English sentences with their corresponding Twi translations. Processing the full dataset exceeded the computational resources available and led to numerous failed runs. For this project I therefore use a subset of the first 5,000 sentence pairs, which keeps fine-tuning manageable while still covering the translation task.

In [2]:
# Get data

!wget -O data.csv "https://zenodo.org/records/4432117/files/verified_data.csv?download=1"

--2024-10-05 12:12:10--  https://zenodo.org/records/4432117/files/verified_data.csv?download=1
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1795088 (1.7M) [text/plain]
Saving to: ‘data.csv’


2024-10-05 12:12:12 (2.31 MB/s) - ‘data.csv’ saved [1795088/1795088]



In [3]:
# load csv data

df = pd.read_csv('data.csv')
df = df.rename(columns={"English": "input", "Akuapem Twi": "target"})

df = df[0:5000]

In [4]:
df

Unnamed: 0,input,target
0,What she lacks in charisma she makes up for wi...,Nea onni ho adwempa no de adwumaden na ɛba.
1,There was nothing I could do about it.,Na biribiara nni hɛ a metumi ayɔ
2,Kwaku saw John and Abena holding hands.,Kwaku hui se John ne Abena kurakura wɛn nsa.
3,Can you stay till 2:30?,So wubetumi atena ha akosi nnɛnmienu npaamu ad...
4,You haven't got much time.,Wonni mmre
...,...,...
4995,Asamoah doesn't necessarily have to go there b...,Ɛho nhia ankasa sɛ Asamoah ankasa kɔ hɔ.
4996,She came to my defence when I was accused of p...,Ɔbaa me sukuu dan mu bere a wɔbɔɔ me sobo sɛ m...
4997,We both laughed.,Yɛn baanu nyinaa serewee.
4998,Who's your favorite painter?,Hena ne obi a ɔka nneɛma ho aduro a w'ani gye ...


### Train/validation/test split using sklearn

In [5]:
# Split the data into train, validation, and test sets (80%, 10%, 10%)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)

print(f"Train size: {len(train_df)}, Validation size: {len(val_df)}, Test size: {len(test_df)}")

Train size: 4000, Validation size: 500, Test size: 500
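
The two-stage split above first holds out 20% of the rows and then halves that holdout into validation and test sets. The sketch below reproduces the same arithmetic with the standard library only. It is a minimal illustration: the helper name `three_way_split` is hypothetical, and its shuffling will not match the row assignment produced by sklearn's `train_test_split`.

```python
import random

def three_way_split(items, holdout=0.2, seed=42):
    """Hypothetical helper: shuffle, hold out a fraction, then halve the holdout."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout)  # rows kept out of training
    n_val = n_holdout // 2                      # half of the holdout is validation

    val = shuffled[:n_val]
    test = shuffled[n_val:n_holdout]
    train = shuffled[n_holdout:]
    return train, val, test

# 5,000 row indices stand in for the 5,000 sentence pairs
train, val, test = three_way_split(range(5000))
print(f"Train size: {len(train)}, Validation size: {len(val)}, Test size: {len(test)}")
```

With 5,000 rows this gives 4,000 training, 500 validation, and 500 test examples, matching the sizes printed above.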


## Using a Pre-trained Model - MarianMTModel

I am using a pretrained model because it allows me to build on the extensive language knowledge the model has already acquired from large-scale multilingual datasets. Pretrained models like those from Hugging Face's library have learned general language patterns, structures, and representations from diverse text corpora. By starting with a model that already understands the fundamentals of language translation, I can focus on fine-tuning it for the specific task of translating English to Twi, which significantly reduces the data requirements and training time compared to training a model from scratch.

### Why I Chose MarianMTModel

I chose MarianMTModel because it is a specialized transformer-based model designed specifically for machine translation tasks. Here’s why it is ideal for this project:

- Multilingual Translation: MarianMT checkpoints cover a wide range of language pairs. The checkpoint loaded in this notebook, `Helsinki-NLP/opus-mt-en-ROMANCE`, was trained to translate English into Romance languages rather than Twi, so the Twi side is learned during fine-tuning on this limited dataset.

- Pretrained for Translation: MarianMTModel is already optimized for translation tasks, unlike general models such as BERT or GPT. This makes it more effective for sentence-level translations, capturing the linguistic differences between English and Twi.

- Efficiency: MarianMTModel is lightweight and efficient, making it suitable for fine-tuning even on a smaller subset of the dataset. Given my choice to work with 5,000 data points, its efficiency is crucial for handling the task without excessive computational overhead.

Using MarianMTModel enables me to take advantage of its pre-existing translation capabilities while fine-tuning it for the specific needs of English to Twi translation.

In [6]:
from transformers import MarianMTModel, MarianTokenizer

# Load a pre-trained MarianMT model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)


In [7]:
def tokenize_sentences(df, tokenizer, max_len=128):
    '''
    Tokenize the input and target columns

    Args:
        df -> pandas dataframe containing the input and the target
        tokenizer -> tokenizer instance used to tokenize the corpus
        max_len -> maximum number of tokens

    Returns:
        input encodings
        target encodings
    '''
    input_encodings = tokenizer(list(df['input']), padding=True, truncation=True, max_length=max_len, return_tensors='pt')
    target_encodings = tokenizer(list(df['target']), padding=True, truncation=True, max_length=max_len, return_tensors='pt')

    return input_encodings, target_encodings

train_input_encodings, train_target_encodings = tokenize_sentences(train_df, tokenizer)
val_input_encodings, val_target_encodings = tokenize_sentences(val_df, tokenizer)

In [8]:
import torch
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    """
    A custom Dataset class for handling translation tasks.

    This class inherits from torch.utils.data.Dataset and is used to prepare
    tokenized input and target data for a translation model
    """

    def __init__(self, input_encodings, target_encodings):
        '''
        class constructor

        Args:
            input_encodings (dict): A dictionary containing the tokenized input (source language) data.
            target_encodings (dict): A dictionary containing the tokenized target (translated language) data.
        '''
        self.input_encodings = input_encodings
        self.target_encodings = target_encodings

    def __len__(self):
        """ Returns the number of samples in the dataset (based on input encodings length) """
        return len(self.input_encodings['input_ids'])

    def __getitem__(self, idx):
        """ Retrieves a single sample of input and target encodings as tensors, for use by the model. """
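        # The encodings are already tensors (return_tensors='pt'), so copy them
        # with clone().detach() instead of re-wrapping them in torch.tensor()
        item = {key: val[idx].clone().detach() for key, val in self.input_encodings.items()}
        item['labels'] = self.target_encodings['input_ids'][idx].clone().detach()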
        return item

train_dataset = TranslationDataset(train_input_encodings, train_target_encodings)
val_dataset = TranslationDataset(val_input_encodings, val_target_encodings)
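
One detail the dataset class above leaves open is how padded label positions are treated. Hugging Face sequence-to-sequence models compute the loss with PyTorch's cross-entropy, which skips positions labelled -100 by default, so a common refinement is to replace pad token ids in the labels with -100. This was not done in the run recorded in this notebook; the sketch below only illustrates the idea on plain Python lists. The helper name `mask_padding` and the example ids are hypothetical, and the pad id 65000 is an assumption (in practice it would be read from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Hypothetical helper: return a copy of the labels with padding replaced by -100."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]

# Two made-up label rows; 65000 is assumed to be the pad id, 0 the end-of-sequence id
labels = [
    [12, 7, 0, 65000, 65000],
    [5, 9, 3, 4, 0],
]
masked = mask_padding(labels, pad_token_id=65000)
print(masked)
```

With the padding masked, the reported loss would reflect only real target tokens rather than being averaged together with padding positions.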

In [9]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.6223,0.833669
2,0.5204,0.698539
3,0.4507,0.657159


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[65000]], 'forced_eos_token_id': 0}


TrainOutput(global_step=1500, training_loss=0.5634977019627889, metrics={'train_runtime': 11213.1217, 'train_samples_per_second': 1.07, 'train_steps_per_second': 0.134, 'total_flos': 85805236224000.0, 'train_loss': 0.5634977019627889, 'epoch': 3.0})
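
The step count and throughput reported above follow from the training configuration. The short check below works through the arithmetic, assuming a single device and no gradient accumulation (neither is stated in the notebook); the variable names are illustrative.

```python
import math

train_size = 4000          # training examples after the split
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 11213.1217 # seconds, from the TrainOutput metrics

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_size * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples/s: {samples_per_second:.2f}, steps/s: {steps_per_second:.3f}")
print(f"runtime: {train_runtime / 3600:.1f} hours")
```

The totals agree with `global_step=1500`, `train_samples_per_second` of 1.07, and `train_steps_per_second` of 0.134, and the runtime works out to a little over three hours.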

In [10]:
test_input_encodings, test_target_encodings = tokenize_sentences(test_df, tokenizer)
test_dataset = TranslationDataset(test_input_encodings, test_target_encodings)

# Evaluate the model
eval_results = trainer.evaluate(eval_dataset=test_dataset)
print(f"Evaluation Results: {eval_results}")



Evaluation Results: {'eval_loss': 0.6341087222099304, 'eval_runtime': 126.3968, 'eval_samples_per_second': 3.956, 'eval_steps_per_second': 0.498, 'epoch': 3.0}


## Text Translation

The following code defines the translate function, which takes an English sentence (or a list of sentences) and returns a list of Twi translations. The function tokenizes the input text, uses the fine-tuned MarianMTModel to generate output tokens, and then decodes them back into readable Twi text.

In [13]:
def translate(text, trainer, tokenizer):
    '''
    Function that translates a given text to Twi

    Args:
        text -> the text to be translated
        trainer -> trainer instance that contains the model
        tokenizer -> tokenizer instance to tokenize the text

    Returns:
        Translated text (a list of strings)
    '''
    # Extract the model from the trainer
    model = trainer.model

    input_encodings = tokenizer(text, return_tensors='pt', padding=True)

    # Generate translation
    translated_tokens = model.generate(**input_encodings)

    # Decode the output
    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

    return translated_text

# Test
text_to_translate = ["Kwaku saw John and Abena holding hands."]
translated_text = translate(text_to_translate, trainer, tokenizer)
print(f"Translation: {translated_text}")

Translation: ['Kwaku sɛ John ne Abena wɔn nsa.']
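
The test sentence also appears in the dataset preview (row 2), so the output can be compared with its reference translation. A rough way to quantify the match is unigram precision and recall over whitespace-separated tokens. This is only a sketch: the function name `unigram_overlap` is hypothetical, and a proper evaluation would apply an established metric such as BLEU or chrF to the whole test set.

```python
from collections import Counter

def unigram_overlap(hypothesis, reference):
    """Hypothetical helper: count shared tokens and derive precision and recall."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    matched = sum((hyp & ref).values())  # tokens present in both, with multiplicity
    precision = matched / max(sum(hyp.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return matched, precision, recall

hypothesis = "Kwaku sɛ John ne Abena wɔn nsa."               # model output above
reference = "Kwaku hui se John ne Abena kurakura wɛn nsa."  # dataset row 2

matched, precision, recall = unigram_overlap(hypothesis, reference)
print(f"matched={matched}, precision={precision:.2f}, recall={recall:.2f}")
```

Five of the seven output tokens appear in the reference, while the reference words `hui` and `kurakura` have no counterpart in the output.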


In [14]:
trainer.save_model('./saved_model')
tokenizer.save_pretrained('./saved_model')

Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[65000]], 'forced_eos_token_id': 0}


('./saved_model/tokenizer_config.json',
 './saved_model/special_tokens_map.json',
 './saved_model/vocab.json',
 './saved_model/source.spm',
 './saved_model/target.spm',
 './saved_model/added_tokens.json')

In [15]:
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and tokenizer saved in the previous cell
model_name = "./saved_model"
tokenizer_name = "./saved_model"

model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(tokenizer_name)

def translate(text, model, tokenizer):
    '''
    Function that translates a given text to Twi

    Args:
        text -> the text to be translated
        model -> fine-tuned MarianMTModel used to generate the translation
        tokenizer -> tokenizer instance to tokenize the text

    Returns:
        Translated text (a list of strings)
    '''
    input_encodings = tokenizer(text, return_tensors='pt', padding=True)

    # Generate translation
    translated_tokens = model.generate(**input_encodings)

    # Decode the output
    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

    return translated_text

# Example translation
text_to_translate = ["We both laughed at my favorite painter"]
translated_text = translate(text_to_translate, model, tokenizer)
print(f"Translation: {translated_text}")




Translation: ["Yɛn baanu m'ani gye m'ani gye"]
