<h1 align=center> FINETUNING OF mT5 MODEL FOR ENGLISH TO HAUSA TRANSLATION </h1>
<h4 align=center> By Muhammad Jamil Abdulhamid </h4>

<img src="translation.jpg" alt="image_caption" >
<h4 align=left>Image by <a href="https://pixabay.com/users/falarcompaulo-1769737/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=1092128">falarcompaulo</a> from <a href="https://pixabay.com//?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=1092128">Pixabay</a></h4>

## Table of Contents
<ul>
<li><a href="#intro"> 1. Introduction</a></li>
<li><a href="#dataprep">2. Data Preparation</a></li>
<li><a href="#pretrained">3. Load Pretrained</a></li>    
<li><a href="#preprocess">4. Preprocessing</a></li>
<li><a href="#training">5. Training</a></li>
<li><a href="#model">6. Model Saving</a></li>
<li><a href="#inference">7. Inference</a></li>
<li><a href="#push">8. Push to Hugging Face Hub</a></li>
</ul>

<div id='intro'></div>

## 1.0 Introduction

A language barrier can impede effective communication in learning, trade, and other aspects of life. In this project, mT5, the multilingual variant of the Text-to-Text Transfer Transformer (T5), is fine-tuned on a parallel corpus with English as the source language and Hausa as the target language. The result is a model that translates English sentences into Hausa.

### 1.1 Setup Environment

In [None]:
!nvidia-smi
!pip install transformers sentencepiece datasets accelerate -q

### 1.2 Import libraries

In [None]:
from transformers import MT5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset, Dataset

from tqdm import tqdm
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

### 1.3 Check GPU

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

<div id='dataprep'></div>

## 2.0 Data Preparation

In [None]:
import kagglehub
from pathlib import Path
local_path = Path(kagglehub.dataset_download("gigikenneth/englishhausa-corpus", "en-ha.csv"))
local_path

In [None]:
df = pd.read_csv(local_path)
df.drop(columns="Unnamed: 0", inplace=True)
df.head()

In [None]:
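# Sample 3% of the corpus (fixed seed for reproducibility), then convert it to a Hugging Face Dataset.
# preserve_index=False keeps the pandas index from being added as an extra column.
sample_data = df.sample(frac=0.03, random_state=42)
dataset = Dataset.from_pandas(sample_data, preserve_index=False)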

In [None]:
dataset

<div id='pretrained'></div>

## 3.0 Load Pretrained

In [None]:
model_name = "google/mt5-small"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

model = model.to(device)

<div id='preprocess'></div>

## 4.0 Preprocessing

In [None]:
# Preprocessing function
def preprocess(batch):
    # Prepend the task prefix to each source sentence
    inputs = ["translate English to Hausa: " + str(s) for s in batch["source_sentence"]]

    # Ensure targets are strings. Handle potential NaN values (which are floats)
    # by converting them to empty strings. Other non-string types are converted to string.
    processed_targets = []
    for s in batch["target_sentence"]:
        if isinstance(s, float) and pd.isna(s):  # Check for NaN specifically
            processed_targets.append("")
        else:
            processed_targets.append(str(s))  # Convert other types to string

    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding="max_length"
    )

    labels = tokenizer(
        processed_targets,
        max_length=128,
        truncation=True,
        padding="max_length"
    )

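    # Set padding positions in the labels to -100 so the loss ignores them
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]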
    return model_inputs
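
Because every target is padded to 128 tokens, most label positions in a short sentence are padding. Such positions are typically given the label -100, which the cross-entropy loss in Hugging Face seq2seq models skips. The toy sketch below, in plain Python, illustrates why that matters: padding the model predicts easily dilutes the average loss. The helper `mean_nll`, the token ids, and the log-probabilities are made up for illustration, and `PAD_ID = 0` is an assumption about the tokenizer's padding id.

```python
IGNORE_INDEX = -100  # label value that the loss skips
PAD_ID = 0           # assumed padding id; check tokenizer.pad_token_id


def mean_nll(token_log_probs, labels, ignore_index=None):
    """Average negative log-likelihood over label positions.

    If ignore_index is given, positions carrying that label are skipped.
    """
    kept = [
        -log_prob
        for log_prob, label in zip(token_log_probs, labels)
        if ignore_index is None or label != ignore_index
    ]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: two real target tokens followed by six padding positions.
labels = [11, 12, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID]
# Log-probability the model assigns to the correct token at each position;
# padding is assumed to be predicted with full confidence (log-prob 0.0).
log_probs = [-2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Replace padding ids with the ignore index.
masked_labels = [IGNORE_INDEX if label == PAD_ID else label for label in labels]

unmasked_loss = mean_nll(log_probs, labels)
masked_loss = mean_nll(log_probs, masked_labels, ignore_index=IGNORE_INDEX)

print(f"padding counted: {unmasked_loss:.3f}")  # 0.375
print(f"padding ignored: {masked_loss:.3f}")    # 1.500
```

With padding counted, the reported loss looks four times smaller than the loss on the real tokens, so the number says little about translation quality.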

In [None]:
# Apply preprocessing
tokenized_ds = dataset.map(preprocess, batched=True)
tokenized_ds = tokenized_ds.remove_columns(["source_sentence", "target_sentence"])
tokenized_ds.set_format(type="torch")
tokenized_ds

In [None]:
# Create dataloader
train_loader = DataLoader(
    tokenized_ds,
    batch_size=4,
    shuffle=True
    )

<div id='training'></div>

## 5.0 Training

In [None]:
# Setup
optimizer = AdamW(model.parameters(), lr=3e-4)
EPOCHS = 3

model.train()

for epoch in range(EPOCHS):
    print(f"Epoch {epoch+1}/{EPOCHS}")
    loop = tqdm(train_loader, leave=True)

    for batch in loop:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

        loop.set_description(f"loss: {loss.item():.4f}")
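
The progress bar above shows the loss of the most recent batch only, which can be noisy with a batch size of 4. A running mean over the epoch gives a steadier signal. The following is a minimal sketch in plain Python; the class name `RunningMean` is hypothetical and the batch losses are made-up values.

```python
class RunningMean:
    """Keeps the mean of all values seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Accumulate one new observation, e.g. loss.item() for a batch.
        self.total += float(value)
        self.count += 1

    @property
    def value(self):
        # Mean of the observations so far; 0.0 before any update.
        return self.total / self.count if self.count else 0.0


# Toy usage with made-up batch losses.
epoch_loss = RunningMean()
for batch_loss in [2.0, 1.0, 3.0]:
    epoch_loss.update(batch_loss)

print(f"mean loss: {epoch_loss.value:.4f}")  # mean loss: 2.0000
```

In the training loop, one option is to create a fresh `RunningMean()` at the start of each epoch, call `update(loss.item())` after every batch, and pass `value` to `loop.set_description`.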

<div id='model'></div>

## 6.0 Model Saving

In [None]:
#!mkdir /content/finetuned_model/

In [None]:
save_dir = "/content/finetuned_model/mt5-en-ha-finetuned"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
print("Model saved.")

<div id='inference'></div>

## 7.0 Inference

### 7.1 Load finetuned model

In [None]:
finetuned_model = MT5ForConditionalGeneration.from_pretrained(save_dir).to(device)
finetuned_tokenizer = T5Tokenizer.from_pretrained(save_dir)

### 7.2 Translate New Sentences

In [None]:
def translate(sentence):
    input_text = "translate English to Hausa: " + sentence

    inputs = finetuned_tokenizer.encode(input_text, return_tensors="pt").to(device)

    outputs = finetuned_model.generate(
        inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True
    )

    return finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Test translation
translate("What is your name?")
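
A single test sentence says little about overall quality. As a rough sanity check, the output of `translate` can be compared against reference translations held out from training. The sketch below computes a simple unigram-overlap F1 in plain Python. The function name `unigram_f1` is hypothetical, the strings are placeholders rather than real Hausa, and this score is only a crude stand-in for established metrics such as BLEU or chrF.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Word-overlap F1 between a model output and a reference translation."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    if not hyp_tokens or not ref_tokens:
        return 0.0

    # Count words shared by both sides, respecting repeated words.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder strings stand in for a model output and its reference.
score = unigram_f1("a b c", "a b d")
print(f"unigram F1: {score:.3f}")  # unigram F1: 0.667
```

Averaging this score over a few hundred held-out sentence pairs gives a first impression of whether fine-tuning helped, before moving to a standard metric.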

<div id='push'></div>

## 8.0 Push to Hugging Face Hub

In [None]:
#!pip install huggingface_hub -q
from huggingface_hub import notebook_login

notebook_login()

In [None]:
finetuned_model.push_to_hub("mt5-en-ha-finetuned")
finetuned_tokenizer.push_to_hub("mt5-en-ha-finetuned")