<a href="https://colab.research.google.com/github/hiterharris/Assignment-1/blob/master/MTAT_nllb_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning NLLB

## Step 1: Set Up the Environment

Kaggle’s default environment has transformers, but you might need to install other dependencies like datasets and sentencepiece.

In [None]:
!pip install -q transformers datasets sentencepiece accelerate bitsandbytes



## Step 2: Load Your Data

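Sentence-aligned data must be brought into a form that the datasets library can work with, such as a TSV or JSON file or an in-memory table.
Example: Loading Two Aligned Text Files

If your data is in two aligned text files, where each line of the source file is translated by the line with the same number in the target file, read both files and pair them up: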

In [None]:
import pandas as pd
from datasets import Dataset

# Load the first 3,000 lines of each sentence-aligned file
with open("/kaggle/input/mtat25-ted-data-test/data/TED2020.de-en.de.train", "r", encoding="utf-8") as src, open("/kaggle/input/mtat25-ted-data-test/data/TED2020.de-en.en.train", "r", encoding="utf-8") as tgt:
    src_lines = src.readlines()[:3000]
    tgt_lines = tgt.readlines()[:3000]

# The files are aligned line by line, so both lists must have the same length
assert len(src_lines) == len(tgt_lines), "source and target files are misaligned"

# Convert to pandas DataFrame
df = pd.DataFrame({"source": src_lines, "target": tgt_lines})

# Convert to Hugging Face Dataset and shuffle (fixed seed for reproducibility)
dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=42)
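
The pairing step relies on both files having the same number of lines. The following is a minimal sketch, not part of the original notebook, of a helper that reads two aligned files lazily, looks at no more than a given number of line pairs, and skips pairs where either side is empty. The name `read_parallel` and its `limit` parameter are hypothetical.

```python
from itertools import islice


def read_parallel(src_path, tgt_path, limit=None):
    """Return a list of (source, target) pairs from two line-aligned files."""
    pairs = []
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip stops at the shorter file; islice caps the number of line pairs read
        for src_line, tgt_line in islice(zip(src, tgt), limit):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if src_line and tgt_line:  # drop pairs with an empty side
                pairs.append((src_line, tgt_line))
    return pairs
```

The result could be turned into a table with `pd.DataFrame(pairs, columns=["source", "target"])` and then passed to `Dataset.from_pandas` as above.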


## Step 3: Load Pretrained NLLB Model

NLLB models are available in different sizes on Hugging Face:

    facebook/nllb-200-distilled-600M (smallest)
    facebook/nllb-200-1.3B (large)
    facebook/nllb-200-3.3B (very large)

Choose the smallest model (600M) when running on Kaggle to avoid running out of memory.
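
A rough back-of-the-envelope estimate shows why model size matters here. The sketch below assumes full-precision (fp32) weights and an AdamW-style optimizer, counting 4 bytes per parameter for weights, 4 for gradients, and 8 for the two optimizer moment buffers. The parameter counts are read off the model names and are approximate; activations and framework overhead are ignored, so real usage will be higher.

```python
# Approximate parameter counts, read off the checkpoint names (assumption)
PARAM_COUNTS = {
    "facebook/nllb-200-distilled-600M": 0.6e9,
    "facebook/nllb-200-1.3B": 1.3e9,
    "facebook/nllb-200-3.3B": 3.3e9,
}

BYTES_WEIGHTS = 4  # fp32 weights
BYTES_GRADS = 4    # fp32 gradients
BYTES_ADAM = 8     # two fp32 moment buffers per parameter


def training_memory_gb(num_params):
    """Rough lower bound on memory for full fine-tuning, in GB (10^9 bytes)."""
    return num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAM) / 1e9


for name, n in PARAM_COUNTS.items():
    weights_gb = n * BYTES_WEIGHTS / 1e9
    print(f"{name}: weights ~{weights_gb:.1f} GB, training >= {training_memory_gb(n):.1f} GB")
```

Under these assumptions, only the 600M checkpoint leaves headroom for activations on a GPU with around 16 GB of memory.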

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


## Step 4: Tokenize Data

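NLLB identifies each language with a code such as deu_Latn or eng_Latn. These codes are special tokens in the tokenizer's vocabulary, written without angle brackets, and the tokenizer inserts them itself once src_lang and tgt_lang are set, so they should not be concatenated onto the text by hand. Find the correct codes for your source and target languages from NLLB language codes.

For example, if translating German → English:

In [None]:
SRC_LANG = "deu_Latn"
TGT_LANG = "eng_Latn"

# Tell the tokenizer which language tag to insert on each side
tokenizer.src_lang = SRC_LANG
tokenizer.tgt_lang = TGT_LANG

def preprocess_function(examples):
    inputs = [text.strip() for text in examples["source"]]
    targets = [text.strip() for text in examples["target"]]

    # text_target tokenizes the targets in target-language mode and stores them under "labels".
    # No padding here: the data collator pads each batch and masks label padding with -100,
    # so padded positions do not contribute to the loss.
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs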

# Tokenize dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)


## Step 5: Set Up Training Arguments

Define the training arguments using Hugging Face’s Trainer API.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./nllb-finetuned",
    eval_strategy="epoch",  # older transformers releases call this evaluation_strategy
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1,
    push_to_hub=False,
    report_to="none"
)
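
With these settings the length of the run is easy to predict. The sketch below assumes a single GPU, no gradient accumulation, and the 3,000 sentence pairs loaded in Step 2; the variable names are illustrative only.

```python
import math

num_examples = 3000        # sentence pairs kept in Step 2
per_device_batch_size = 8  # per_device_train_batch_size
num_devices = 1            # assumption: one GPU
grad_accum_steps = 1       # assumption: no gradient accumulation
num_epochs = 1             # num_train_epochs

examples_per_step = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / examples_per_step)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 375 375
```

Because evaluation and checkpointing are both set to "epoch", a one-epoch run evaluates and saves once, at the end.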


## Step 6: Define Data Collator and Trainer

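Use DataCollatorForSeq2Seq to pad each batch to the length of its longest sequence. It pads the labels with -100 so that padded positions are ignored by the loss.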

In [None]:
from transformers import DataCollatorForSeq2Seq, Trainer

# Data collator for padding sequences
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,  # same data as training, so the eval loss is not a held-out estimate
    tokenizer=tokenizer,
    data_collator=data_collator
)
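
To make the collator's job concrete, here is a minimal pure-Python sketch of what happens to the labels in one batch: shorter sequences are right-padded to the longest one with -100, the index that PyTorch's cross-entropy loss ignores by default. This illustrates the idea only and is not the library's implementation; the function name `pad_labels` and the token IDs are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def pad_labels(batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in batch)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in batch]


batch = [[11, 12, 13, 2], [21, 2]]  # made-up token IDs
print(pad_labels(batch))
# [[11, 12, 13, 2], [21, 2, -100, -100]]
```

Padding per batch rather than to a fixed length keeps short batches short, which saves computation on datasets with many brief sentences.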


## Step 7: Start Training

Now, you can start finetuning.

In [None]:
trainer.train()


## Step 8: Save and Download Model

Once training is complete, save the model and tokenizer.

In [None]:
model.save_pretrained("nllb-finetuned-no-target-tag")
tokenizer.save_pretrained("nllb-finetuned-no-target-tag")
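
A saved checkpoint can be loaded back for a quick sanity check. The following is a hedged sketch rather than part of the original notebook: the helper name `translate` and its parameters are hypothetical, and it assumes the directory saved above exists and that German → English is the direction of interest. At generation time the target language is selected by forcing its code as the first decoder token.

```python
def translate(texts, model_dir="nllb-finetuned-no-target-tag",
              src_lang="deu_Latn", tgt_lang="eng_Latn", max_new_tokens=128):
    """Translate a list of sentences with a saved checkpoint."""
    # Imported inside the function so the helper can be defined without loading the library
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=128)
    # Force the decoder to begin with the target language code
    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Example call (requires the directory saved above):
# print(translate(["Guten Morgen!"]))
```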
