# 03 — Fine-tuning NLLB (Cebuano → Tagalog)

**Purpose:** Fine-tune the NLLB model on your aligned Cebuano–Tagalog dataset.

**Key stages:**

1. **Load TSVs** (train/dev).
2. **Tokenize** both sides, tagging the source with `ceb_Latn` and forcing `tgl_Latn` as the first decoder token (BOS).
3. **Train** with `Seq2SeqTrainer` (2 epochs, batches of 8 sentence pairs).
4. **Evaluate** automatically at the end of each epoch.
5. **Save** model and metrics.

**Outputs:**

- `experiments/finetune/` folder containing:
  - Trained model weights.
  - Tokenizer config.
  - `metrics.json` (evaluation scores).

### Fine-tuning the NLLB model

This cell runs the `finetune.py` script to fine-tune the multilingual NLLB translation model (`facebook/nllb-200-distilled-600M`) on the prepared Cebuano–Tagalog parallel dataset.

In [1]:
!python ../src/train/finetune.py \
  --train ../data/processed/train.tsv \
  --dev ../data/processed/dev.tsv \
  --out ../experiments/finetune

Resolved codes → src: 'ceb_Latn' as 'ceb_Latn' (id=256035) | tgt: 'tgl_Latn' as 'tgl_Latn' (id=256174)
Train samples: 22,851 | Dev samples: 2,930
{'loss': 2.4677, 'grad_norm': 3.28125, 'learning_rate': 1.982849142457123e-05, 'epoch': 0.02}
{'loss': 2.3679, 'grad_norm': 3.34375, 'learning_rate': 1.9653482674133707e-05, 'epoch': 0.04}
{'loss': 2.3481, 'grad_norm': 3.53125, 'learning_rate': 1.9478473923696188e-05, 'epoch': 0.05}
{'loss': 2.3069, 'grad_norm': 3.328125, 'learning_rate': 1.9303465173258665e-05, 'epoch': 0.07}
{'loss': 2.2973, 'grad_norm': 3.0, 'learning_rate': 1.9128456422821142e-05, 'epoch': 0.09}
{'loss': 2.1498, 'grad_norm': 2.703125, 'learning_rate': 1.895344767238362e-05, 'epoch': 0.11}
{'loss': 2.1989, 'grad_norm': 3.375, 'learning_rate': 1.87784389219461e-05, 'epoch': 0.12}
{'loss': 2.2096, 'grad_norm': 3.5625, 'learning_rate': 1.8603430171508577e-05, 'epoch': 0.14}
{'loss': 2.1677, 'grad_norm': 2.859375, 'learning_rate': 1.8428421421071054e-05, 'epoch': 0.16}
...

`torch_dtype` is deprecated! Use `dtype` instead!

Tokenizing train:   0%|          | 0/22851 [00:00<?, ? examples/s]
Tokenizing train:   4%|▍         | 1000/22851 [00:00<00:13, 1587.24 examples/s]
Tokenizing train:   9%|▉         | 2000/22851 [00:01<00:15, 1332.91 examples/s]
Tokenizing train:  13%|█▎        | 3000/22851 [00:02<00:13, 1458.32 examples/s]
Tokenizing train:  18%|█▊        | 4000/22851 [00:02<00:12, 1529.98 examples/s]
Tokenizing train:  22%|██▏       | 5000/22851 [00:03<00:11, 1570.80 examples/s]
Tokenizing train:  26%|██▋       | 6000/22851 [00:03<00:10, 1558.93 examples/s]
Tokenizing train:  31%|███       | 7000/22851 [00:04<00:10, 1523.22 examples/s]
Tokenizing train:  35%|███▌      | 8000/22851 [00:05<00:10, 1360.55 examples/s]
Tokenizing train:  39%|███▉      | 9000/22851 [00:06<00:10, 1359.52 examples/s]
Tokenizing train:  44%|████▍     | 10000/22851 [00:06<00:09, 1392.44 examples/s]
Tokenizing train:  48%|████▊     | 11000/22851 [00:07<00:08, 1373.86 examples/s]


The script performs the following steps:

* Loads the training and development splits (`train.tsv` and `dev.tsv`) from `../data/processed/`.
* Initializes the NLLB tokenizer and model, automatically resolving the correct language codes (`ceb_Latn` → `tgl_Latn`).
* Tokenizes each sentence pair, prefixing Cebuano inputs with the `ceb_Latn` source language tag so the model is conditioned on the input language.
* Fine-tunes the model for 2 epochs, using mixed precision (BF16 or FP16) when the GPU supports it.
* Saves the trained model, checkpoints, and training configuration in `../experiments/finetune/`.
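
The loading step can be sketched with the standard library alone. This is a minimal illustration, not the code in `finetune.py`: the helper name `load_pairs` is hypothetical, and it assumes each TSV has a header row with `src` and `tgt` columns, as described for `test.tsv` in the evaluation section.

```python
import csv
from pathlib import Path


def load_pairs(path):
    """Read a tab-separated file with `src` and `tgt` columns into (src, tgt) tuples.

    Hypothetical helper: the header names are an assumption, not taken from finetune.py.
    """
    pairs = []
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps stray quotation marks in sentences from being read as CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            src = (row.get("src") or "").strip()
            tgt = (row.get("tgt") or "").strip()
            # Skip rows where either side is empty; they carry no training signal.
            if src and tgt:
                pairs.append((src, tgt))
    return pairs
```

Under these assumptions, `load_pairs("../data/processed/train.tsv")` would return a list of Cebuano–Tagalog sentence pairs ready for tokenization.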

After execution, you should see logs showing tokenization progress, the GPU precision mode, periodic training metrics (loss, gradient norm, learning rate), and evaluation results at the end of each epoch. The final trained model is stored in `../experiments/finetune/`.
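
As a sanity check on the training logs, the logged learning rates can be reproduced with simple arithmetic. The sketch below relies on assumptions the notebook does not state outright: an initial learning rate of 2e-5, linear decay to zero with no warmup, no gradient accumulation, a single device, and one log entry every 50 optimizer steps. The sample count, batch size, and epoch count come from the notebook itself.

```python
import math

TRAIN_SAMPLES = 22_851   # from the "Train samples" log line
BATCH_SIZE = 8           # from the key stages list
EPOCHS = 2
BASE_LR = 2e-5           # assumption: inferred from the log, not stated in the notebook
LOG_EVERY = 50           # assumption: inferred from the epoch increments in the log

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step):
    """Learning rate after `step` scheduler updates under linear decay to zero."""
    return BASE_LR * (total_steps - step) / total_steps


# Observation: the logged values line up with (step - 1) scheduler updates,
# presumably because the rate is recorded before the scheduler advances.
for k in range(1, 4):
    step = k * LOG_EVERY
    print(f"step {step}: epoch {step / steps_per_epoch:.2f}, lr {linear_lr(step - 1):.6e}")
```

With these assumptions there are 2,857 steps per epoch and 5,714 in total, and the first three computed rates agree with the logged values (about 1.9828e-05, 1.9653e-05, and 1.9478e-05). If your own run logs different rates, one of the assumed settings probably differs.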

### Evaluate the fine-tuned model on the test set

This cell loads the **fine-tuned NLLB model** from `../experiments/finetune/`, translates the test set, and computes **BLEU** and **chrF2** with SacreBLEU.

In [2]:
!python ../src/eval/evaluate.py \
  --model_dir ../experiments/finetune \
  --test_tsv ../data/processed/test.tsv \
  --out_json ../experiments/finetune/metrics.json \
  --save_hyp ../experiments/finetune/hyp.txt \
  --code_src ceb_Latn --code_tgt tgl_Latn


{
  "BLEU": 29.83,
  "chrF2": 56.32,
  "ref_len": 85119,
  "sys_len": 89467,
  "sacrebleu_version": "2.5.1",
  "n_samples": 2750,
  "model_dir": "../experiments/finetune",
  "codes": {
    "src": "ceb_Latn",
    "tgt": "tgl_Latn"
  },
  "decoding": {
    "beams": 5,
    "max_new_tokens": 200,
    "batch_size": 16
  }
}
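
A few of these fields can be cross-checked by hand. The sketch below is illustrative and not part of `evaluate.py`: it recomputes the system-to-reference length ratio, the standard BLEU brevity penalty, and the number of decoding batches implied by `n_samples` and `batch_size`. All values are copied from the JSON above.

```python
import math

# Values copied from the metrics JSON above.
metrics = {"ref_len": 85119, "sys_len": 89467, "n_samples": 2750, "batch_size": 16}

# A ratio above 1.0 means the system output is longer than the references overall.
length_ratio = metrics["sys_len"] / metrics["ref_len"]

# BLEU's brevity penalty only penalizes output that is shorter than the reference.
if metrics["sys_len"] >= metrics["ref_len"]:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - metrics["ref_len"] / metrics["sys_len"])

# The final batch may be partial, hence the ceiling.
num_batches = math.ceil(metrics["n_samples"] / metrics["batch_size"])

print(f"length ratio: {length_ratio:.3f}")
print(f"brevity penalty: {brevity_penalty:.3f}")
print(f"decoding batches: {num_batches}")
```

The 172 batches match the total shown in the `Translating` progress bar. A length ratio of roughly 1.05 means the brevity penalty is 1.0, so the BLEU score is not being reduced for short output.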


`torch_dtype` is deprecated! Use `dtype` instead!


Translating:   1%|          | 1/172 [00:03<10:05,  3.54s/it]
Translating:   1%|          | 2/172 [00:06<09:32,  3.37s/it]
Translating:   2%|▏         | 3/172 [00:15<16:28,  5.85s/it]
Translating:   2%|▏         | 4/172 [00:20<15:35,  5.57s/it]
Translating:   3%|▎         | 5/172 [00:22<12:10,  4.37s/it]
Translating:   3%|▎         | 6/172 [00:26<11:36,  4.19s/it]
Translating:   4%|▍         | 7/172 [00:29<10:02,  3.65s/it]
Translating:   5%|▍         | 8/172 [00:33<10:01,  3.67s/it]
Translating:   5%|▌         | 9/172 [00:40<13:24,  4.93s/it]
Translating:   6%|▌         | 10/172 [00:44<12:05,  4.48s/it]
Translating:   6%|▋         | 11/172 [00:46<10:16,  3.83s/it]
Translating:   7%|▋         | 12/172 [00:50<10:40,  4.00s/it]
Translating:   8%|▊         | 13/172 [00:53<09:31,  3.60s/it]
Translating:   8%|▊         | 14/172 [00:56<08:46,  3.33s/it]
Translating:   9%|▊         | 15/172 [00:59<08:20,  3.19s/it]
...

**What happens:**

* Loads `test.tsv` (columns: `src`, `tgt`) and the tokenizer/model from `../experiments/finetune`.
* Resolves NLLB language tags and **forces BOS** to Tagalog (`tgl_Latn`) for decoding consistency.
* Translates the source sentences in **mini-batches** (default batch size = 16) using beam search (default beams = 5).
* Writes the system outputs to `../experiments/finetune/hyp.txt`.
* Scores the outputs against references with SacreBLEU and saves: