# 04b — Fine-tuning with Back-Translated Data

This notebook repeats the fine-tuning process from `03_finetune.ipynb`,  
but uses the merged dataset `train_plus_bt.tsv`, which combines:
- Original parallel data (`train.tsv`)
- Synthetic back-translated pairs generated in `04_pivot_bt_eval.ipynb`

Goal: Assess whether data augmentation via back-translation improves BLEU and chrF2 scores.
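
The merge itself is not shown in this notebook. A minimal sketch of how the two sources could be combined is given below, assuming both files hold two tab-separated columns (source, target) with no header row. The helper names and the `bt_pairs.tsv` file name are hypothetical and do not come from the project.

```python
def read_pairs(path):
    """Read (source, target) pairs from a two-column TSV file (assumed layout)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            # Keep only well-formed rows with two non-empty columns.
            if len(cols) == 2 and cols[0].strip() and cols[1].strip():
                pairs.append((cols[0].strip(), cols[1].strip()))
    return pairs


def merge_pairs(real, synthetic):
    """Concatenate real and synthetic pairs, dropping exact duplicates.

    Real pairs come first, so a duplicate is always kept in its original position.
    """
    seen = set()
    merged = []
    for pair in list(real) + list(synthetic):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


def write_pairs(pairs, path):
    """Write (source, target) pairs back out as a two-column TSV file."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            f.write(f"{src}\t{tgt}\n")
```

With those helpers, the merged file could be produced by calling `write_pairs(merge_pairs(read_pairs("train.tsv"), read_pairs("bt_pairs.tsv")), "train_plus_bt.tsv")`, where `bt_pairs.tsv` stands in for whatever file holds the back-translated pairs.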

### Fine-tuning with Back-Translated Data

This command runs the same `finetune.py` training script, but this time uses the augmented dataset that includes synthetic pairs generated through back-translation (BT).

In [1]:
!python ../src/train/finetune.py \
  --train ../data/processed/train_plus_bt.tsv \
  --dev ../data/processed/dev.tsv \
  --out ../experiments/finetune_bt

Resolved codes → src: 'tgl_Latn' as 'tgl_Latn' (id=256174) | tgt: 'ceb_Latn' as 'ceb_Latn' (id=256035)
Train samples: 42,851 | Dev samples: 2,930
{'loss': 2.3223, 'grad_norm': 3.421875, 'learning_rate': 1.9908530894157178e-05, 'epoch': 0.01}
{'loss': 2.2993, 'grad_norm': 3.578125, 'learning_rate': 1.9815195071868583e-05, 'epoch': 0.02}
{'loss': 2.1627, 'grad_norm': 3.5625, 'learning_rate': 1.9721859249579992e-05, 'epoch': 0.03}
{'loss': 2.1358, 'grad_norm': 3.59375, 'learning_rate': 1.9628523427291398e-05, 'epoch': 0.04}
{'loss': 2.1456, 'grad_norm': 3.515625, 'learning_rate': 1.95351876050028e-05, 'epoch': 0.05}
{'loss': 2.1357, 'grad_norm': 3.28125, 'learning_rate': 1.944185178271421e-05, 'epoch': 0.06}
{'loss': 2.1021, 'grad_norm': 3.1875, 'learning_rate': 1.9348515960425614e-05, 'epoch': 0.07}
{'loss': 2.0753, 'grad_norm': 3.15625, 'learning_rate': 1.925518013813702e-05, 'epoch': 0.07}
{'loss': 2.0777, 'grad_norm': 3.140625, 'learning_rate': 1.9161844315848422e-05, 'epoch': 0.08}
{

`torch_dtype` is deprecated! Use `dtype` instead!

Tokenizing train:   0%|          | 0/42851 [00:00<?, ? examples/s]
Tokenizing train:   2%|▏         | 1000/42851 [00:00<00:30, 1377.74 examples/s]
Tokenizing train:   5%|▍         | 2000/42851 [00:01<00:28, 1417.89 examples/s]
Tokenizing train:   7%|▋         | 3000/42851 [00:01<00:25, 1546.14 examples/s]
Tokenizing train:   9%|▉         | 4000/42851 [00:02<00:23, 1622.99 examples/s]
Tokenizing train:  12%|█▏        | 5000/42851 [00:03<00:26, 1438.99 examples/s]
Tokenizing train:  14%|█▍        | 6000/42851 [00:04<00:24, 1499.55 examples/s]
Tokenizing train:  16%|█▋        | 7000/42851 [00:04<00:23, 1540.59 examples/s]
Tokenizing train:  19%|█▊        | 8000/42851 [00:05<00:22, 1574.46 examples/s]
Tokenizing train:  21%|██        | 9000/42851 [00:05<00:20, 1621.06 examples/s]
Tokenizing train:  23%|██▎       | 10000/42851 [00:06<00:19, 1651.90 examples/s]
Tokenizing train:  26%|██▌       | 11000/42851 [00:06<00:19, 1661.91 examples/s]


As in `03_finetune.ipynb`, this cell:
- Loads the training and validation data:
  - `../data/processed/train_plus_bt.tsv` → combined training corpus with both real parallel data and synthetic (back-translated) examples.  
  - `../data/processed/dev.tsv` → validation set for monitoring performance.
- Fine-tunes the multilingual **NLLB-200 distilled 600M** model using the same hyperparameters and language codes (`tgl_Latn → ceb_Latn`).

The additional synthetic Tagalog–Cebuano examples are intended to help the model generalize to unseen patterns and reduce overfitting to the limited real data. The evaluation later in this notebook tests whether they do.
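
The loss lines printed by the training cell look like Python dict literals, so they can be parsed to summarize the loss trend instead of reading it by eye. The sketch below assumes that format holds for every log line. `parse_log_lines` is a hypothetical helper, and the sample text is copied from the first and last complete log lines shown above, plus the truncated `{` line to show that incomplete entries are skipped.

```python
import ast


def parse_log_lines(text):
    """Extract training log records (dicts with a 'loss' key) from raw cell output."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Only complete one-line dict literals are considered.
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # skip anything that is not a valid literal
        if isinstance(record, dict) and "loss" in record:
            records.append(record)
    return records


sample = """
{'loss': 2.3223, 'grad_norm': 3.421875, 'learning_rate': 1.9908530894157178e-05, 'epoch': 0.01}
{'loss': 2.0777, 'grad_norm': 3.140625, 'learning_rate': 1.9161844315848422e-05, 'epoch': 0.08}
{
"""

records = parse_log_lines(sample)
first, last = records[0], records[-1]
print(f"loss {first['loss']:.4f} -> {last['loss']:.4f} "
      f"(epoch {first['epoch']} to {last['epoch']})")
```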

This run allows comparison between:
- The **standard fine-tuning** (real parallel data only), and  
- The **BT-augmented fine-tuning**, which adds synthetic examples with the aim of improving translation quality and domain robustness.
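
One way to make that comparison concrete is to load the `metrics.json` file of each run and report the score differences. The sketch below assumes both runs write a JSON file with `BLEU` and `chrF2` keys, as the evaluation script in this notebook does. The baseline path in the comment is a guess at where the earlier run stored its metrics, not a confirmed location.

```python
import json


def load_metrics(path):
    """Load a metrics JSON file written by the evaluation script."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def metric_deltas(baseline, augmented, keys=("BLEU", "chrF2")):
    """Return augmented-minus-baseline differences for the selected metrics."""
    return {key: round(augmented[key] - baseline[key], 2) for key in keys}


# Example usage (paths are assumptions; adjust to the actual experiment folders):
# baseline = load_metrics("../experiments/finetune/metrics.json")      # hypothetical path
# augmented = load_metrics("../experiments/finetune_bt/metrics.json")
# print(metric_deltas(baseline, augmented))
```

A positive delta means the BT-augmented run scored higher on that metric. Both runs must be evaluated on the same test set with the same decoding settings for the difference to be meaningful.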

### Evaluate the fine-tuned + back-translation model on the test set

This cell runs the evaluation script on the model trained with `train_plus_bt.tsv` (real + synthetic pairs). It translates the test set and computes **BLEU** and **chrF2** using SacreBLEU.

In [2]:
!python ../src/eval/evaluate.py \
  --model_dir ../experiments/finetune_bt \
  --test_tsv ../data/processed/test.tsv \
  --out_json ../experiments/finetune_bt/metrics.json \
  --save_hyp ../experiments/finetune_bt/hyp.txt

{
  "BLEU": 27.16,
  "chrF2": 49.25,
  "ref_len": 109569,
  "sys_len": 87739,
  "sacrebleu_version": "2.5.1",
  "n_samples": 2750,
  "model_dir": "../experiments/finetune_bt",
  "codes": {
    "src": "tgl_Latn",
    "tgt": "ceb_Latn"
  },
  "decoding": {
    "beams": 5,
    "max_new_tokens": 200,
    "batch_size": 16
  }
}
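
The `sys_len` and `ref_len` fields show that the system output is noticeably shorter than the reference, which BLEU penalizes through its brevity penalty. The sketch below applies the standard BLEU brevity penalty formula to the two lengths reported above. It is a hand calculation for interpretation, not a call into SacreBLEU, and `brevity_penalty` is a helper defined here for illustration.

```python
import math


def brevity_penalty(sys_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the output is long enough, else exp(1 - r/c)."""
    if sys_len <= 0:
        return 0.0
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / sys_len)


# Lengths taken from the metrics JSON above.
sys_len, ref_len = 87739, 109569
bp = brevity_penalty(sys_len, ref_len)

print(f"length ratio (sys/ref): {sys_len / ref_len:.3f}")
print(f"brevity penalty:        {bp:.3f}")
```

With these lengths the penalty comes out at about 0.78, so part of the gap to a higher BLEU score is explained by short hypotheses rather than by n-gram precision alone. Decoding settings such as `max_new_tokens` or a length penalty are natural things to inspect if the outputs are being cut short.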


`torch_dtype` is deprecated! Use `dtype` instead!


Translating:   1%|          | 1/172 [00:03<08:55,  3.13s/it]
Translating:   1%|          | 2/172 [00:06<08:33,  3.02s/it]
Translating:   2%|▏         | 3/172 [00:08<07:55,  2.81s/it]
Translating:   2%|▏         | 4/172 [00:11<08:08,  2.91s/it]
Translating:   3%|▎         | 5/172 [00:14<07:30,  2.70s/it]
Translating:   3%|▎         | 6/172 [00:17<08:16,  2.99s/it]
Translating:   4%|▍         | 7/172 [00:20<07:58,  2.90s/it]
Translating:   5%|▍         | 8/172 [00:23<08:14,  3.02s/it]
Translating:   5%|▌         | 9/172 [00:26<07:47,  2.87s/it]
Translating:   6%|▌         | 10/172 [00:29<07:56,  2.94s/it]
Translating:   6%|▋         | 11/172 [00:31<07:32,  2.81s/it]
Translating:   7%|▋         | 12/172 [00:35<08:32,  3.20s/it]
Translating:   8%|▊         | 13/172 [00:38<07:53,  2.98s/it]
Translating:   8%|▊         | 14/172 [00:40<07:34,  2.88s/it]
Translating:   9%|▊         | 15/172 [00:43<07:02,  2.69s/it]
Translating:   9%|▉        