# 03 — Fine-tuning NLLB (Tagalog → Cebuano)

**Purpose:** Fine-tune the NLLB model on your aligned Tagalog–Cebuano dataset.

**Key stages:**

1. Load TSVs (train/dev).
2. Tokenize sentences. During decoding, the BOS token is forced to `ceb_Latn` so the model generates Cebuano.
3. Train with `Seq2SeqTrainer` (2 epochs, 8-sentence batches).
4. Evaluate automatically at each epoch.
5. Save model and metrics.

**Outputs:**

- `experiments/finetune/` folder containing:
  - Trained model weights.
  - Tokenizer config.
  - `metrics.json` (evaluation scores).

### Fine-tuning the NLLB model

This cell runs the `finetune.py` script to fine-tune the multilingual NLLB translation model (`facebook/nllb-200-distilled-600M`) on the prepared Tagalog–Cebuano parallel dataset.

In [1]:
!python ../src/train/finetune.py \
  --train ../data/processed/train.tsv \
  --dev ../data/processed/dev.tsv \
  --out ../experiments/finetune

Resolved codes → src: 'tgl_Latn' as 'tgl_Latn' (id=256174) | tgt: 'ceb_Latn' as 'ceb_Latn' (id=256035)
Train samples: 22,851 | Dev samples: 2,930
{'loss': 2.3604, 'grad_norm': 2.953125, 'learning_rate': 1.982849142457123e-05, 'epoch': 0.02}
{'loss': 2.2806, 'grad_norm': 3.265625, 'learning_rate': 1.9653482674133707e-05, 'epoch': 0.04}
{'loss': 2.258, 'grad_norm': 3.703125, 'learning_rate': 1.9478473923696188e-05, 'epoch': 0.05}
{'loss': 2.2059, 'grad_norm': 3.03125, 'learning_rate': 1.9303465173258665e-05, 'epoch': 0.07}
{'loss': 2.212, 'grad_norm': 3.515625, 'learning_rate': 1.9128456422821142e-05, 'epoch': 0.09}
{'loss': 2.0434, 'grad_norm': 2.5625, 'learning_rate': 1.895344767238362e-05, 'epoch': 0.11}
{'loss': 2.1088, 'grad_norm': 3.109375, 'learning_rate': 1.87784389219461e-05, 'epoch': 0.12}
{'loss': 2.0098, 'grad_norm': 3.53125, 'learning_rate': 1.8603430171508577e-05, 'epoch': 0.14}
{'loss': 2.1264, 'grad_norm': 2.703125, 'learning_rate': 1.8428421421071054e-05, 'epoch': 0.16}


`torch_dtype` is deprecated! Use `dtype` instead!

Tokenizing train:   0%|          | 0/22851 [00:00<?, ? examples/s]
Tokenizing train:   4%|▍         | 1000/22851 [00:00<00:12, 1805.91 examples/s]
Tokenizing train:   9%|▉         | 2000/22851 [00:01<00:11, 1801.19 examples/s]
Tokenizing train:  13%|█▎        | 3000/22851 [00:01<00:10, 1830.16 examples/s]
Tokenizing train:  18%|█▊        | 4000/22851 [00:02<00:10, 1848.11 examples/s]
Tokenizing train:  22%|██▏       | 5000/22851 [00:02<00:10, 1749.05 examples/s]
Tokenizing train:  26%|██▋       | 6000/22851 [00:03<00:09, 1770.00 examples/s]
Tokenizing train:  31%|███       | 7000/22851 [00:03<00:08, 1832.77 examples/s]
Tokenizing train:  35%|███▌      | 8000/22851 [00:04<00:08, 1828.50 examples/s]
Tokenizing train:  39%|███▉      | 9000/22851 [00:04<00:07, 1863.50 examples/s]
Tokenizing train:  44%|████▍     | 10000/22851 [00:05<00:06, 1855.46 examples/s]
Tokenizing train:  48%|████▊     | 11000/22851 [00:05<00:06, 1869.95 examples/s]


The script performs the following steps:

* Loads the training and development splits (`train.tsv` and `dev.tsv`) from `../data/processed/`.
* Initializes the NLLB tokenizer and model, automatically resolving the correct language codes (`tgl_Latn` → `ceb_Latn`).
* Preprocesses each sentence pair, prefixing the Tagalog inputs with the source language tag (`tgl_Latn`) for proper conditioning.
* Fine-tunes the model for 2 epochs, using mixed precision (BF16 or FP16) when the GPU supports it.
* Saves the trained model, checkpoints, and training configuration in `../experiments/finetune/`.
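
The loading step can be sketched with the standard library alone. This is a minimal illustration, not the code in `finetune.py`: it assumes each TSV has a header row with `src` and `tgt` columns (as described for `test.tsv`), and `read_pairs` is a hypothetical helper name. The two sample sentences are illustrative and do not come from the dataset.

```python
import csv
import io


def read_pairs(handle):
    """Read (src, tgt) sentence pairs from a tab-separated file with a header row."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        src = (row.get("src") or "").strip()
        tgt = (row.get("tgt") or "").strip()
        if src and tgt:  # skip rows where either side is empty
            pairs.append((src, tgt))
    return pairs


# Tiny in-memory stand-in for train.tsv (assumed layout: header row, then one pair per line).
sample = io.StringIO("src\ttgt\nMagandang umaga.\tMaayong buntag.\nSalamat.\tSalamat.\n")
pairs = read_pairs(sample)
print(len(pairs), pairs[0])
```

For a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`.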

After execution, you should see logs showing tokenization progress, GPU precision mode, and epoch-by-epoch training metrics. The final trained model will be stored in the specified `finetune` experiment folder.
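
As a sanity check on those logs, the logged learning rates can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions the notebook does not state: a base rate of `2e-5`, linear decay to zero with no warmup, logging every 50 steps, a single device, and no gradient accumulation. Under those assumptions the first logged values match steps 49, 99, and 149, one update behind each logging step, which would be consistent with the rate being recorded just before the scheduler advances.

```python
import math

TRAIN_SAMPLES = 22_851  # from the log line "Train samples: 22,851"
BATCH_SIZE = 8          # from the key stages (8-sentence batches)
EPOCHS = 2
BASE_LR = 2e-5          # assumption: inferred from the logged values, not stated in the notebook

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # 2857
total_steps = steps_per_epoch * EPOCHS                   # 5714


def linear_lr(step):
    """Learning rate after `step` scheduler updates under linear decay to zero."""
    return BASE_LR * max(0.0, 1.0 - step / total_steps)


print(steps_per_epoch, total_steps)
for step in (49, 99, 149):
    print(step, linear_lr(step))
```

If your own run logs different rates, one of the assumed settings (base rate, warmup, accumulation, or device count) probably differs.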

### Evaluate the fine-tuned model on the test set

This cell loads the **fine-tuned NLLB model** from `../experiments/finetune/`, translates the test set, and computes **BLEU** and **chrF2** with SacreBLEU.

In [2]:
!python ../src/eval/evaluate.py \
  --model_dir ../experiments/finetune \
  --test_tsv ../data/processed/test.tsv \
  --out_json ../experiments/finetune/metrics.json \
  --save_hyp ../experiments/finetune/hyp.txt

{
  "BLEU": 27.06,
  "chrF2": 49.22,
  "ref_len": 109569,
  "sys_len": 87779,
  "sacrebleu_version": "2.5.1",
  "n_samples": 2750,
  "model_dir": "../experiments/finetune",
  "codes": {
    "src": "tgl_Latn",
    "tgt": "ceb_Latn"
  },
  "decoding": {
    "beams": 5,
    "max_new_tokens": 200,
    "batch_size": 16
  }
}


`torch_dtype` is deprecated! Use `dtype` instead!


Translating:   1%|          | 1/172 [00:03<09:10,  3.22s/it]
Translating:   1%|          | 2/172 [00:06<09:01,  3.18s/it]
Translating:   2%|▏         | 3/172 [00:09<08:20,  2.96s/it]
Translating:   2%|▏         | 4/172 [00:12<08:55,  3.19s/it]
Translating:   3%|▎         | 5/172 [00:15<08:15,  2.97s/it]
Translating:   3%|▎         | 6/172 [00:19<09:29,  3.43s/it]
Translating:   4%|▍         | 7/172 [00:23<09:39,  3.51s/it]
Translating:   5%|▍         | 8/172 [00:26<09:46,  3.58s/it]
Translating:   5%|▌         | 9/172 [00:29<08:59,  3.31s/it]
Translating:   6%|▌         | 10/172 [00:33<09:19,  3.45s/it]
Translating:   6%|▋         | 11/172 [00:35<08:30,  3.17s/it]
Translating:   7%|▋         | 12/172 [00:40<09:18,  3.49s/it]
Translating:   8%|▊         | 13/172 [00:42<08:34,  3.24s/it]
Translating:   8%|▊         | 14/172 [00:45<08:14,  3.13s/it]
Translating:   9%|▊         | 15/172 [00:48<07:37,  2.92s/it]

**What happens:**

* Loads `test.tsv` (columns: `src`, `tgt`) and the tokenizer/model from `../experiments/finetune`.
* Resolves NLLB language tags and **forces BOS** to Cebuano (`ceb_Latn`) for decoding consistency.
* Translates the source sentences in **mini-batches** (default batch size = 16) using beam search (default beams = 5).
* Writes the system outputs to `../experiments/finetune/hyp.txt`.
* Scores the outputs against references with SacreBLEU.
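
Two of the reported numbers can be cross-checked by hand. The sketch below uses only values from the JSON output above. It confirms the batch count shown in the progress bar and computes the standard BLEU brevity penalty from `ref_len` and `sys_len`. It is a back-of-the-envelope check, not a replacement for SacreBLEU's own scoring.

```python
import math

n_samples, batch_size = 2750, 16    # from metrics.json
ref_len, sys_len = 109_569, 87_779  # from metrics.json

n_batches = math.ceil(n_samples / batch_size)
length_ratio = sys_len / ref_len

# BLEU's brevity penalty: 1 if the system output is at least as long as the
# reference, otherwise exp(1 - ref_len / sys_len).
brevity_penalty = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)

print(n_batches)  # 172, matching the "Translating: .../172" progress bar
print(round(length_ratio, 3), round(brevity_penalty, 3))
```

The hypotheses come out at roughly 80% of the reference length, so the brevity penalty (about 0.78) scales the BLEU score down noticeably. It may be worth inspecting `hyp.txt` for truncated or overly terse translations before comparing BLEU across runs.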