# 04b — Fine-tuning with Back-Translated Data

This notebook repeats the fine-tuning process from `03_finetune.ipynb`, but trains on the merged dataset `train_plus_bt.tsv`, which combines:
- Original parallel data (`train.tsv`)
- Synthetic back-translated pairs generated in `04_pivot_bt_eval.ipynb`
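
The merge itself happens outside this notebook. As a rough illustration only, the sketch below shows one way two such files could be combined. It assumes each TSV holds one `source<TAB>target` pair per line with no header, and it drops empty and exactly duplicated pairs; both the file layout and the de-duplication step are assumptions, and the helper names (`read_pairs`, `merge_pairs`, `write_pairs`) and the `bt_pairs.tsv` file name are hypothetical.

```python
from pathlib import Path


def read_pairs(path):
    """Read (source, target) pairs from a two-column TSV (assumed layout)."""
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        cols = line.split("\t")
        if len(cols) >= 2:
            pairs.append((cols[0].strip(), cols[1].strip()))
    return pairs


def merge_pairs(real, synthetic):
    """Concatenate real and synthetic pairs, skipping empties and exact duplicates."""
    seen = set()
    merged = []
    for pair in list(real) + list(synthetic):
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        merged.append(pair)
    return merged


def write_pairs(path, pairs):
    """Write pairs back out as a two-column TSV."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            f.write(f"{src}\t{tgt}\n")


# Hypothetical usage (the BT file name is an assumption):
# merged = merge_pairs(read_pairs("train.tsv"), read_pairs("bt_pairs.tsv"))
# write_pairs("train_plus_bt.tsv", merged)
```

Real pairs are listed first so that, when a synthetic pair duplicates a real one, the real copy is the one kept.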

Goal: Assess whether data augmentation via back-translation improves BLEU and chrF2 scores.

### Fine-tuning with Back-Translated Data

This command runs the same `finetune.py` training script, but this time uses the augmented dataset that includes synthetic pairs generated through back-translation (BT).

In [1]:
!python ../src/train/finetune.py \
  --train ../data/processed/train_plus_bt.tsv \
  --dev ../data/processed/dev.tsv \
  --out ../experiments/finetune_bt

Resolved codes → src: 'ceb_Latn' as 'ceb_Latn' (id=256035) | tgt: 'tgl_Latn' as 'tgl_Latn' (id=256174)
Train samples: 42,851 | Dev samples: 2,930
{'loss': 2.4145, 'grad_norm': 2.515625, 'learning_rate': 1.9908530894157178e-05, 'epoch': 0.01}
{'loss': 2.3917, 'grad_norm': 3.484375, 'learning_rate': 1.9815195071868583e-05, 'epoch': 0.02}
{'loss': 2.2988, 'grad_norm': 3.328125, 'learning_rate': 1.9721859249579992e-05, 'epoch': 0.03}
{'loss': 2.2238, 'grad_norm': 3.359375, 'learning_rate': 1.9628523427291398e-05, 'epoch': 0.04}
{'loss': 2.215, 'grad_norm': 2.59375, 'learning_rate': 1.95351876050028e-05, 'epoch': 0.05}
{'loss': 2.1846, 'grad_norm': 3.34375, 'learning_rate': 1.944185178271421e-05, 'epoch': 0.06}
{'loss': 2.1204, 'grad_norm': 3.296875, 'learning_rate': 1.9348515960425614e-05, 'epoch': 0.07}
{'loss': 2.1726, 'grad_norm': 4.0625, 'learning_rate': 1.925518013813702e-05, 'epoch': 0.07}
{'loss': 2.0889, 'grad_norm': 2.9375, 'learning_rate': 1.9161844315848422e-05, 'epoch': 0.08}
{

`torch_dtype` is deprecated! Use `dtype` instead!

Tokenizing train:  26%|██▌       | 11000/42851 [00:06<00:17, 1849.39 examples/s]


As in `03_finetune.ipynb`, this cell:
- Loads the **combined training corpus** and the validation set:
  - `../data/processed/train_plus_bt.tsv` → real parallel data plus synthetic (back-translated) examples (42,851 training pairs in total).
  - `../data/processed/dev.tsv` → validation set for monitoring performance (2,930 pairs).
- Fine-tunes the multilingual **NLLB-200 distilled 600M** model using the same hyperparameters and language codes (`ceb_Latn → tgl_Latn`).

The additional synthetic Cebuano–Tagalog examples are intended to help the model generalize to unseen patterns and to reduce overfitting to the limited real data; whether they actually do is what the evaluation below measures.
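
The training output above prints one Python-style dict per logging step, so it can be parsed to check that the run behaves as expected. The sketch below is illustrative: it copies the first four log lines from the output, parses them with `ast.literal_eval`, and checks that the loss falls while the learning rate drops by a constant amount per logging step, which is consistent with (but does not prove) a linear decay schedule. The helper name `parse_log_lines` is hypothetical.

```python
import ast

# First four log lines copied from the training output above.
log_text = """\
{'loss': 2.4145, 'grad_norm': 2.515625, 'learning_rate': 1.9908530894157178e-05, 'epoch': 0.01}
{'loss': 2.3917, 'grad_norm': 3.484375, 'learning_rate': 1.9815195071868583e-05, 'epoch': 0.02}
{'loss': 2.2988, 'grad_norm': 3.328125, 'learning_rate': 1.9721859249579992e-05, 'epoch': 0.03}
{'loss': 2.2238, 'grad_norm': 3.359375, 'learning_rate': 1.9628523427291398e-05, 'epoch': 0.04}
"""


def parse_log_lines(text):
    """Return the dicts found in text, skipping truncated or non-dict lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            rec = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        if isinstance(rec, dict) and "loss" in rec:
            records.append(rec)
    return records


records = parse_log_lines(log_text)
losses = [r["loss"] for r in records]
# Drop in learning rate between consecutive logging steps.
lr_steps = [a["learning_rate"] - b["learning_rate"]
            for a, b in zip(records, records[1:])]

print("losses:", losses)
print("lr drop per logging step:", lr_steps)
```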

This run allows a direct comparison between:
- **Standard fine-tuning** (real parallel data only), and
- **BT-augmented fine-tuning**, which adds synthetic examples with the aim of improving translation quality and domain robustness.
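
One way to make that comparison concrete is to diff the two runs' metrics files. The sketch below assumes both runs wrote a `metrics.json` with `BLEU` and `chrF2` keys, as the evaluation script does for this run. The baseline path in the usage comment is a guess at where the `03_finetune.ipynb` run stored its results, and the function names are hypothetical.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Load a metrics.json file produced by an evaluation run."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def metric_deltas(baseline, augmented, keys=("BLEU", "chrF2")):
    """Return augmented minus baseline for each metric (positive = BT run scored higher)."""
    return {k: round(augmented[k] - baseline[k], 2) for k in keys}


# Hypothetical usage (the baseline path is an assumption, not taken from the notebooks):
# baseline = load_metrics("../experiments/finetune/metrics.json")
# augmented = load_metrics("../experiments/finetune_bt/metrics.json")
# print(metric_deltas(baseline, augmented))
```

Small differences should be read with care: a single run per condition does not show whether a gap of a fraction of a BLEU point is more than noise.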

### Evaluate the fine-tuned + back-translation model on the test set

This cell runs the evaluation script on the model trained with `train_plus_bt.tsv` (real + synthetic pairs). It translates the test set and computes **BLEU** and **chrF2** using SacreBLEU.

In [2]:
!python ../src/eval/evaluate.py \
  --model_dir ../experiments/finetune_bt \
  --test_tsv ../data/processed/test.tsv \
  --out_json ../experiments/finetune_bt/metrics.json \
  --save_hyp ../experiments/finetune_bt/hyp.txt

{
  "BLEU": 30.08,
  "chrF2": 56.64,
  "ref_len": 85119,
  "sys_len": 89786,
  "sacrebleu_version": "2.5.1",
  "n_samples": 2750,
  "model_dir": "../experiments/finetune_bt",
  "codes": {
    "src": "ceb_Latn",
    "tgt": "tgl_Latn"
  },
  "decoding": {
    "beams": 5,
    "max_new_tokens": 200,
    "batch_size": 16
  }
}
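
A few of the reported numbers can be sanity-checked with simple arithmetic. The sketch below uses only values from the JSON above. It computes the system-to-reference length ratio, the standard BLEU brevity penalty (which equals 1 whenever the system output is at least as long as the reference), and the number of decoding batches implied by `n_samples` and `batch_size`.

```python
import math

# Values copied from the metrics JSON above.
ref_len, sys_len = 85119, 89786
n_samples, batch_size = 2750, 16

# Ratio above 1 means the hypotheses are longer than the references overall.
length_ratio = sys_len / ref_len

# Standard BLEU brevity penalty: only penalizes output shorter than the reference.
brevity_penalty = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)

# Number of batches needed to translate the whole test set.
n_batches = math.ceil(n_samples / batch_size)

print(f"length ratio: {length_ratio:.4f}")
print(f"brevity penalty: {brevity_penalty}")
print(f"batches: {n_batches}")
```

The hypotheses are about 5% longer than the references, so no brevity penalty applies here, and the batch count of 172 matches the total shown in the `Translating` progress bar.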


`torch_dtype` is deprecated! Use `dtype` instead!


Translating:   1%|          | 1/172 [00:03<09:24,  3.30s/it]
Translating:   9%|▉        