<a href="https://colab.research.google.com/github/joshsalako/yoruba/blob/main/yo_eng_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## English-to-Yoruba Neural Machine Translation using MT5

This project fine-tunes a pre-trained mT5 (Multilingual T5) model for English-to-Yoruba neural machine translation. The goal is to build on a large multilingual model to produce good-quality Yoruba translations of English text.

**Key Features:**

- **MT5 Model:** Loads the `Davlan/mt5_base_eng_yor_mt` checkpoint (an mT5-base model) together with the `google/mt5-base` tokenizer as the starting point for fine-tuning.
- **Preprocessing:** Includes normalization of Yoruba text to NFC form and filtering of single-word sentences to improve data quality.
- **Custom Dataset and Dataloader:** Implements a custom `Seq2SeqDataset` class and dataloaders for efficient handling of the translation data.
- **Fine-tuning:** Fine-tunes the model on the Menyo-20k_MT dataset, an English-Yoruba parallel corpus.
- **Custom Trainer:** Uses a custom `CustomSeq2SeqTrainer` to ensure tensor contiguity during training and saving.
- **Evaluation with BLEU:** Employs the BLEU (Bilingual Evaluation Understudy) score, a standard metric for evaluating machine translation quality.
- **Model Saving:** Saves the fine-tuned model and tokenizer for future use.

**Potential Applications:**

- **Bridging the Language Gap:** Facilitating communication and information access between English and Yoruba speakers.
- **Language Preservation:** Contributing to the digitization and accessibility of resources in the Yoruba language.
- **NLP Research:** Providing a baseline model for further research and development in English-Yoruba machine translation.

**Future Improvements:**

- **Larger Dataset:** Training on a larger and more diverse dataset to enhance translation accuracy and fluency.
- **Hyperparameter Optimization:** Exploring different hyperparameter settings to potentially improve model performance.
- **Evaluation Metrics:** Considering additional evaluation metrics, such as METEOR and ROUGE, for a more comprehensive assessment of translation quality.


# Import Packages and Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
%%capture
!pip install --upgrade datasets

In [None]:
%%capture
!pip install evaluate
!pip install sacrebleu

In [None]:
%%capture
!pip install git+https://github.com/csebuetnlp/normalizer

In [None]:
%%capture
!pip install sentencepiece
!pip install googletrans==4.0.0-rc1

In [None]:
import pandas as pd
import torch
import unicodedata
from datasets import Dataset
from transformers import MT5ForConditionalGeneration, T5Tokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, AutoTokenizer
import evaluate
import os
import re

In [None]:
bleu = evaluate.load("sacrebleu")



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Import data and preprocess

In [None]:
# Preprocessing functions
def standardize_to_NFC(text_list):
    """Normalize the text to NFC form for consistent diacritic handling."""
    return [unicodedata.normalize('NFC', text) for text in text_list]

def filter_single_word_sentence(eng_sents, yor_sents):
    """Filter out sentences that are single words in either language."""
    eng_inds = set([i for i, sent in enumerate(eng_sents) if len(sent.split()) > 1])
    yor_inds = set([i for i, sent in enumerate(yor_sents) if len(sent.split()) > 1])
    common_inds = sorted(list(eng_inds & yor_inds))

    eng_filtered = [eng_sents[i] for i in common_inds]
    yor_filtered = [yor_sents[i] for i in common_inds]

    return eng_filtered, yor_filtered

def preprocess_text(text):
    """Apply basic text cleaning and normalization to a text string."""
    text = text.lower()  # Lowercase
    text = re.sub(r"([?.!,¿])", r" \1 ", text)  # Add spaces around punctuation
    text = re.sub(r'[" "]+', " ", text)  # Collapse runs of spaces and double quotes into one space
    text = text.strip()
    return text
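
The effect of the two data-quality helpers can be checked on a toy example. The sketch below re-implements them with hypothetical names (`to_nfc`, `drop_single_word_pairs`) so that it runs on plain Python lists without pandas; the sentences are taken from outputs shown elsewhere in this notebook.

```python
import unicodedata

def to_nfc(texts):
    """Normalize each string to NFC (precomposed characters where they exist)."""
    return [unicodedata.normalize("NFC", t) for t in texts]

def drop_single_word_pairs(src, tgt):
    """Keep only pairs in which both sides contain more than one token."""
    kept = [(s, t) for s, t in zip(src, tgt)
            if len(s.split()) > 1 and len(t.split()) > 1]
    return [s for s, _ in kept], [t for _, t in kept]

# "e" + U+0300 (combining grave) renders like U+00E8 but compares unequal
# until both strings are in the same normalization form.
decomposed = "il" + "e\u0300"
precomposed = "il\u00e8"
nfc = to_nfc([decomposed])[0]
same_before = decomposed == precomposed
same_after = nfc == precomposed

# A single-word pair (here, a stray header row) is removed from both sides.
english = ["Good morning", "English", "Law is our educator."]
yoruba = ["Ojó aarò", "Yoruba", "Ofin jé olukó wa."]
eng_kept, yor_kept = drop_single_word_pairs(english, yoruba)

print(same_before, same_after)
print(eng_kept)
```

Filtering on both sides at once keeps the two lists aligned, which is the same reason the notebook's helper intersects the English and Yoruba index sets.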

In [None]:
def load_and_preprocess_data(input_dir):
    """Load the train/dev TSV files and the test CSV, then normalize and filter train/dev."""
    # Dataset paths; adjust the file names to match your data
    train_file = os.path.join(input_dir, 'train.tsv')
    val_file = os.path.join(input_dir, 'dev.tsv')
    test_file = os.path.join(input_dir, 'test.csv')

    # Load the data. `names=` is passed without `header=0`, so a header row
    # in a file is read as an ordinary data row.
    train_df = pd.read_csv(train_file, delimiter='\t', names=['English', 'Yoruba'])
    val_df = pd.read_csv(val_file, delimiter='\t', names=['English', 'Yoruba'])
    test_dataset = pd.read_csv(test_file, names=['English', 'Yoruba'])

    # Normalize Yorùbá sentences to NFC
    train_df['Yoruba'] = standardize_to_NFC(train_df['Yoruba'])
    val_df['Yoruba'] = standardize_to_NFC(val_df['Yoruba'])

    # Filter out single-word sentences
    train_en, train_yo = filter_single_word_sentence(train_df['English'], train_df['Yoruba'])
    val_en, val_yo = filter_single_word_sentence(val_df['English'], val_df['Yoruba'])

    # Rebuild pandas DataFrames from the filtered sentence lists
    train_dataset = pd.DataFrame({'English': train_en, 'Yoruba': train_yo})
    val_dataset = pd.DataFrame({'English': val_en, 'Yoruba': val_yo})

    return train_dataset, val_dataset, test_dataset

In [None]:
# 1. Load and preprocess the Menyo-20k_MT dataset
input_dir = '/content/drive/MyDrive/MachineTranslation/data'  # Specify the path to your dataset directory
train_dataset, val_dataset, test_dataset = load_and_preprocess_data(input_dir)

In [None]:
train_dataset.head()

Unnamed: 0,English,Yoruba
0,Unit 1: What is Creative Commons?,Ìdá 1: Kín ni Creative Commons?
1,This work is licensed under a Creative Commons...,Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribu...
2,"Creative Commons is a set of legal tools, a no...",Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-...
3,Creative Commons began in response to an outda...,Creative Commons bẹ̀rẹ̀ láti wá wọ̀rọ̀kọ̀ fi ṣ...
4,CC licenses are built on copyright and are des...,Àwọn àṣẹ CC jẹ mọ́ àṣẹ ẹni tí ó ní iṣẹ́-àtinúd...


In [None]:
# Rename the columns to match the expected format
train_dataset.rename(columns={'English': 'input_text', 'Yoruba': 'labels'}, inplace=True)
train_dataset.head()

Unnamed: 0,input_text,labels
0,Unit 1: What is Creative Commons?,Ìdá 1: Kín ni Creative Commons?
1,This work is licensed under a Creative Commons...,Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribu...
2,"Creative Commons is a set of legal tools, a no...",Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-...
3,Creative Commons began in response to an outda...,Creative Commons bẹ̀rẹ̀ láti wá wọ̀rọ̀kọ̀ fi ṣ...
4,CC licenses are built on copyright and are des...,Àwọn àṣẹ CC jẹ mọ́ àṣẹ ẹni tí ó ní iṣẹ́-àtinúd...


In [None]:
val_dataset.head()

Unnamed: 0,English,Yoruba
0,"We prepare the saddle, and the goat presents i...",A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹrù ìran rẹ̀ ni?
1,"You have been crowned a king, and yet you make...",A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlọ́run ni?
2,By dancing we take possession of Awà; through ...,"A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, ..."
3,We lift a saddle and the goat (kin) scowls; it...,A gbé gàárì ọmọ ewúrẹ́ ń rojú; kì í ṣe ẹrù àgù...
4,One does not share a farm boundary with a king...,A kì í bá ọba pàlà kí ọkọ́ ọba má ṣánni lẹ́sẹ̀.


In [None]:
# Rename the columns to match the expected format
val_dataset.rename(columns={'English': 'input_text', 'Yoruba': 'labels'}, inplace=True)
val_dataset.head()

Unnamed: 0,input_text,labels
0,"We prepare the saddle, and the goat presents i...",A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹrù ìran rẹ̀ ni?
1,"You have been crowned a king, and yet you make...",A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlọ́run ni?
2,By dancing we take possession of Awà; through ...,"A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, ..."
3,We lift a saddle and the goat (kin) scowls; it...,A gbé gàárì ọmọ ewúrẹ́ ń rojú; kì í ṣe ẹrù àgù...
4,One does not share a farm boundary with a king...,A kì í bá ọba pàlà kí ọkọ́ ọba má ṣánni lẹ́sẹ̀.


In [None]:
train_dataset[0:1]

Unnamed: 0,input_text,labels
0,Unit 1: What is Creative Commons?,Ìdá 1: Kín ni Creative Commons?


In [None]:
test_dataset.head()

Unnamed: 0,English,Yoruba
0,English,Yoruba
1,Her false nails! Then Labakes Small lock of ha...,
2,The dialogue between the two peony lasted for ...,
3,"What a European sees is, on television, every ...",
4,"As cases have been confirmed all over China, a...",


In [None]:
# Rename the columns to match the expected format
test_dataset.rename(columns={'English': 'input_text', 'Yoruba': 'labels'}, inplace=True)
test_dataset.head()

Unnamed: 0,input_text,labels
0,English,Yoruba
1,Her false nails! Then Labakes Small lock of ha...,
2,The dialogue between the two peony lasted for ...,
3,"What a European sees is, on television, every ...",
4,"As cases have been confirmed all over China, a...",


In [None]:
!pip show transformers

Name: transformers
Version: 4.44.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 


# Modelling and dataset creation

In [None]:
#model_name = 'Davlan/m2m100_418M-eng-yor-mt'
model_name = 'Davlan/mt5_base_eng_yor_mt'
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained(model_name).to(device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
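from normalizer import normalize
from torch.utils.data import Dataset, DataLoader  # note: shadows the earlier `datasets.Dataset` import

class Seq2SeqDataset(Dataset):
    """Tokenizes English inputs (and Yoruba labels, when present), each padded to `max_length`."""
    def __init__(self, data, tokenizer, max_length=128, has_labels=True):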
        self.input_text = data['input_text'].astype(str).apply(normalize).tolist()
        self.has_labels = has_labels
        if has_labels:
            self.labels = data['labels'].astype(str).apply(normalize).tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.input_text)

    def __getitem__(self, idx):
        input_text = self.input_text[idx]
        label_text = None
        if self.has_labels:
            label_text = self.labels[idx]

        # Tokenize the input text
        input_encodings = self.tokenizer(
            input_text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        # Tokenize the label text to get its 'input_ids' and 'attention_mask'
        label_encodings = None
        if self.has_labels:
            label_encodings = self.tokenizer(
                label_text,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt'
            )

        output = {
            'input_ids': input_encodings['input_ids'].squeeze(),
            'attention_mask': input_encodings['attention_mask'].squeeze(),
        }
        if self.has_labels:
            output['labels'] = label_encodings['input_ids'].squeeze()
        return output
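
Because the labels are padded to `max_length` and returned unchanged, padding positions contribute to the training loss. A common alternative is to replace padding ids in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; the helper names and token ids are made up for illustration, and `PAD_ID = 0` is an assumption about the tokenizer's pad id.

```python
PAD_ID = 0            # assumption: pad token id of the T5/mT5 tokenizer
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut the sequence to max_length, then right-pad it with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions so they do not count toward the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

label_ids = [259, 1042, 77, 1]   # made-up token ids
padded = pad_or_truncate(label_ids, max_length=8)
masked = mask_padding(padded)

print(padded)   # [259, 1042, 77, 1, 0, 0, 0, 0]
print(masked)   # [259, 1042, 77, 1, -100, -100, -100, -100]
```

If labels are masked this way, any code that later decodes them back to text would first need to map `-100` back to the pad id.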

In [None]:
import torch
from transformers import DataCollatorForSeq2Seq


class MyDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
    """
    Custom data collator for sequence-to-sequence models.
    """

    def __call__(self, features: list) -> dict:
        """
        Collates a list of features into a batch.

        Args:
            features (list): List of feature dictionaries.

        Returns:
            dict: Collated batch.
        """
        if not features:
            raise ValueError("Features list is empty.")

        required_keys = ["input_ids", "attention_mask", "labels"]
        for feature in features:
            if not all(key in feature for key in required_keys):
                raise ValueError("All features must contain 'input_ids', 'attention_mask', and 'labels'.")

        batch = {}
        batch["input_ids"] = torch.stack([feature["input_ids"] for feature in features])
        batch["attention_mask"] = torch.stack([feature["attention_mask"] for feature in features])

        # Labels should be processed differently for PyTorch tensors
        if isinstance(features[0]["labels"], torch.Tensor):
            batch["labels"] = torch.stack([feature["labels"] for feature in features])
        else:
            # Convert the list of lists to a PyTorch tensor
            batch["labels"] = torch.tensor([feature["labels"] for feature in features])

        return batch

In [None]:
from transformers import Trainer


class CustomSeq2SeqTrainer(Trainer):
    """
    Custom Trainer class to ensure tensors are contiguous during training.
    """

    def _ensure_contiguous_tensors(self):
        """
        Ensure all model tensors are contiguous.
        """
        for param in self.model.parameters():
            if not param.is_contiguous():
                # Work on the underlying data tensor so autograd is not involved
                param.data = param.data.contiguous()

    def save_model(self, output_dir: str = None, **kwargs) -> None:
        """
        Override save_model to ensure all model tensors are contiguous before saving.

        Args:
            output_dir (str, optional): Directory to save the model. Defaults to None.
        """
        if output_dir is None:
            output_dir = self.args.output_dir
        self._ensure_contiguous_tensors()
        super().save_model(output_dir, **kwargs)

    def training_step(self, model, inputs):
        """
        Override training_step to ensure tensors are contiguous during gradient updates.

        The signature matches transformers 4.44.2, the version installed in this notebook.

        Args:
            model: Model being trained.
            inputs: Input batch.

        Returns:
            torch.Tensor: The training loss for this step.
        """
        self._ensure_contiguous_tensors()
        return super().training_step(model, inputs)

In [None]:
# Create train, validation, and test datasets
train_dataset = Seq2SeqDataset(train_dataset, tokenizer)
val_dataset = Seq2SeqDataset(val_dataset, tokenizer)
test_dataset = Seq2SeqDataset(test_dataset, tokenizer, has_labels=False)

# Create train, validation, and test dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32)
test_dataloader = DataLoader(test_dataset, batch_size=32)

In [None]:
model.to(device)

MT5ForConditionalGeneration(
  (shared): Embedding(250112, 768)
  (encoder): MT5Stack(
    (embed_tokens): Embedding(250112, 768)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
         

In [None]:
# Create a custom optimizer using torch.optim.AdamW
custom_optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    eps=1e-8,
    weight_decay=0.01,
)

In [None]:
from transformers import Trainer, TrainingArguments
# Define the TrainingArguments for fine-tuning
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,  # effective batch size of 5 * 8 = 40 per device
    evaluation_strategy="steps",
    save_total_limit=0,
    eval_steps=50,
    save_steps=15000,
    learning_rate=1e-3,  # only used when the Trainer builds its own optimizer
    do_train=True,
    do_eval=True,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to="none",
    load_best_model_at_end=False,
    lr_scheduler_type="cosine_with_restarts",
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
)
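
The learning-rate schedule implied by `lr_scheduler_type="cosine_with_restarts"` and `warmup_steps=100` can be sketched in plain Python. This is an approximation based on one reading of the linear-warmup plus cosine-with-hard-restarts schedule, assuming a single cycle; the function name `lr_at` is hypothetical, the base rate of `1e-3` is the one given to the optimizer, and `total_steps=753` is the step count reported by the training run in this notebook.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=753, num_cycles=1):
    """Approximate learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay, restarting num_cycles times over the remaining steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)

for step in (0, 50, 100, 400, 753):
    print(step, f"{lr_at(step):.6f}")
```

Under these assumptions the rate is still ramping up at step 50 and only reaches its peak at step 100, after which it decays toward zero.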



In [None]:
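# Create a data collator for sequence-to-sequence tasks.
# MyDataCollatorForSeq2Seq overrides __call__ and only stacks the pre-padded tensors,
# so `padding`, `max_length`, and `label_pad_token_id` are stored but not applied.
data_collator = MyDataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=False,
    max_length=80,
    label_pad_token_id=tokenizer.pad_token_id,
)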

In [None]:
# Create Trainer
trainer = CustomSeq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    #train_dataset=augmented_df,
    eval_dataset=val_dataset,
    optimizers=(custom_optimizer, None),
)

In [None]:
trainer.train()

Step,Training Loss,Validation Loss
50,10.9102,1.554232
100,1.2018,1.044562
150,0.902,0.954446
200,0.8319,0.918307
250,0.7964,0.882791
300,0.7098,0.855705
350,0.6683,0.835691
400,0.6339,0.81893
450,0.6383,0.809156
500,0.6121,0.798633


TrainOutput(global_step=753, training_loss=1.3694043072413005, metrics={'train_runtime': 5043.4791, 'train_samples_per_second': 5.99, 'train_steps_per_second': 0.149, 'total_flos': 9028817371791360.0, 'train_loss': 1.3694043072413005, 'epoch': 2.991062562065541})
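
The numbers in the training output can be related to the batch settings with a little arithmetic. The sketch below assumes training ran on a single device; the variable names are illustrative, and the training-set size it derives is an estimate from the logged values, not a figure stated in the notebook.

```python
per_device_batch = 5                 # per_device_train_batch_size
grad_accum = 8                       # gradient_accumulation_steps
global_steps = 753                   # from TrainOutput
epochs_logged = 2.991062562065541    # 'epoch' from TrainOutput

# Sequences consumed per optimizer step (single device assumed)
effective_batch = per_device_batch * grad_accum

# Total sequences processed during training
sequences_seen = global_steps * effective_batch

# Rough estimate of how many training pairs remained after filtering
approx_train_size = sequences_seen / epochs_logged

print(effective_batch)            # 40
print(sequences_seen)             # 30120
print(round(approx_train_size))
```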

# Model evaluation

In [None]:
from transformers import AutoModelForSeq2SeqLM

# Correct directory paths
model_output_dir = "/content/drive/MyDrive/MachineTranslation"
tokenizer_output_dir = "/content/drive/MyDrive/MachineTranslation"

# Save the model to the specified directory
model.save_pretrained(model_output_dir)

# Save the tokenizer to the specified directory
tokenizer.save_pretrained(tokenizer_output_dir)

print(f"Model saved to {model_output_dir}")
print(f"Tokenizer saved to {tokenizer_output_dir}")


Model saved to /content/drive/MyDrive/MachineTranslation
Tokenizer saved to /content/drive/MyDrive/MachineTranslation


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Correct directory paths
model_output_dir = "/content/drive/MyDrive/MachineTranslation"
tokenizer_output_dir = "/content/drive/MyDrive/MachineTranslation"

# Load the model
translate_model = AutoModelForSeq2SeqLM.from_pretrained(model_output_dir)

# Load the tokenizer
translate_tokenizer = AutoTokenizer.from_pretrained(tokenizer_output_dir)

print("Model and tokenizer have been loaded successfully.")


You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


Model and tokenizer have been loaded successfully.


In [None]:
torch.cuda.empty_cache()  # Clear cache

In [None]:
# Define the translation function
def translate_text_to_yoruba(text):
    """Translate a single English string to Yoruba with the reloaded model."""
    model_inputs = translate_tokenizer(text, return_tensors="pt")
    # T5-style tokenizers define no beginning-of-sentence token, so `bos_token_id`
    # is None here and `forced_bos_token_id` has no effect on generation.
    gen_tokens = translate_model.generate(**model_inputs, forced_bos_token_id=translate_tokenizer.bos_token_id)
    translation = translate_tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return translation

In [None]:
english_text = input("Enter an English text: ")
yoruba_translation = translate_text_to_yoruba(english_text)
print(f"Yoruba Translation: {yoruba_translation}")

Enter an English text: Good morning




Yoruba Translation: Ojó aarò


In [None]:
from tqdm import tqdm
import evaluate

def evaluate_model(model, tokenizer, eval_dataloader, device):
    model.eval()  # Set model to evaluation mode
    model.to(device)

    predictions = []
    references = []

    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            # Move batch to device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            torch.cuda.empty_cache()  # Clear cache

            # Generate translations
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=80,  # Adjust max_length according to your data
                num_beams=5,  # Beam search for better results
                early_stopping=True
            )

            # Decode predictions
            decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)

            # Store results
            predictions.extend(decoded_preds)
            references.extend(decoded_labels)

    return predictions, references

In [None]:
predictions, references = evaluate_model(translate_model, translate_tokenizer, val_dataloader, device)

# Display some sample results
for i in range(5):  # Display first 5 samples
    print(f"Input: {val_dataset.input_text[i]}")
    print(f"Prediction: {predictions[i]}")
    print(f"Reference: {references[i]}")
    print("-" * 30)

# Load the BLEU metric from the `evaluate` library
bleu_metric = evaluate.load("bleu")

# Add predictions as raw strings, with each reference wrapped in its own list
bleu_metric.add_batch(
    predictions=predictions,
    references=[[ref] for ref in references]
)

# Calculate the BLEU score and report it on a 0-100 scale
bleu_score = bleu_metric.compute()
print(f"BLEU Score: {bleu_score['bleu'] * 100:.2f}")

Evaluating: 100%|██████████| 107/107 [13:41<00:00,  7.68s/it]


Input: We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?
Prediction: A n se adie, ewuré n gbe; o jé òkan-o-jòkan ewuré?
Reference: A di gaari silè ewuré n yoju; eru iran rè ni?
------------------------------
Input: You have been crowned a king, and yet you make good-luck charms; would you be crowned God?
Prediction: Wón ti yan o gégé bi oba, o n se òle; wón maa yan o ni Olórun?
Reference: A fi ó joba o n sawure o fé je Olórun ni?
------------------------------
Input: By dancing we take possession of Awa; through fighting we take possession of Awa; if we neither dance nor fight, but take possession of Awa anyway, is the result not the same?
Prediction: Bi a ba n ji a n ji Awa; bi a o ji a n ji Awa; bi a o ji a n ji a n ji a n ji; bi a o ji a n ji, sugbón a n ji Awa nigbakan naa?
Reference: A fijo gba Awa; a fija gba Awa; bi a o ba jo, bi a o ba ja, bi a ba ti gba Awa, ko tan bi?
------------------------------
Input: We lift a

BLEU Score: 12.29
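
To make the reported score easier to interpret, the sketch below computes a simplified corpus-level BLEU by hand: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, one reference per prediction, and no smoothing, so its values will generally not match the `evaluate` metric exactly; the function names and the toy sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(predictions, references, max_n=4):
    """Simplified corpus BLEU (single reference, whitespace tokens, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize outputs that are shorter than the references
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)

exact = corpus_bleu(["a b c d e"], ["a b c d e"])
partial = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
disjoint = corpus_bleu(["a b c d"], ["w x y z"])
print(f"{exact * 100:.2f} {partial * 100:.2f} {disjoint * 100:.2f}")
```

Because the geometric mean collapses to zero when any n-gram order has no match, short or loosely matching outputs score very low without smoothing, which is worth keeping in mind when reading single-digit or low double-digit scores.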


In [None]:
from tqdm import tqdm

def predict_model(model, tokenizer, test_dataloader, device):
    model.eval()  # Set model to evaluation mode
    model.to(device)

    predictions = []

    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Predicting"):
            # Move batch to device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            torch.cuda.empty_cache()  # Clear cache

            # Generate translations
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=80,  # Adjust max_length according to your data
                num_beams=5,  # Beam search for better results
                early_stopping=True
            )

            # Decode predictions
            decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)

            # Store results
            predictions.extend(decoded_preds)

    return predictions

In [None]:
# Translate and save
translated_text = predict_model(translate_model, translate_tokenizer, test_dataloader, device)

Predicting: 100%|██████████| 208/208 [27:08<00:00,  7.83s/it]


In [None]:
response = pd.read_csv('/content/drive/MyDrive/MachineTranslation/data/test.csv', names=['English', 'Yoruba'])
response['Yoruba'] = translated_text
response.to_csv('/content/drive/MyDrive/MachineTranslation/data/final_result.csv', sep='\t', index=False)
response.head(10)

Unnamed: 0,English,Yoruba
0,English,Gèési
1,Her false nails! Then Labakes Small lock of ha...,"Iwo èké rè! Labakes Iwo èké rè, ti o ti wa ni ..."
2,The dialogue between the two peony lasted for ...,Nnkan bii iséju marun-un to wa laaarin awon ee...
3,"What a European sees is, on television, every ...","Ohun ti ilè Yuroopu n ri ni, lori èro ayelujar..."
4,"As cases have been confirmed all over China, a...","Gégé bi òrò naa se ri lorilè-ede China, gbogbo..."
5,The Super Falcons of Nigeria have qualified fo...,Super Falcons ti Naijiiria ti kopa ninu idije ...
6,What explanation would Alamu give to them all?...,Akosilè wo ni alamu maa fun won? Akosilè naa g...
7,Education has to be part of our response as well.,Èkó gbodò jé òkan lara ohun ti a n se.
8,Law is our educator.,Ofin jé olukó wa.
9,"Today, our team has grown, and we are using th...","Lonii, awon ara wa ti pò si i, a si n lo ipa t..."
