# 04 - Pivot (Cebuano→Waray→Tagalog) and Back-translation

Purpose: Evaluate two techniques for low-resource Cebuano→Tagalog translation: pivot translation through a related intermediate language, and back-translation, a data-augmentation method that adds synthetic pairs to the training data.

## Pivot Translation

- Translates source → intermediate (pivot) → target.
- In this notebook: Cebuano → Waray → Tagalog, using the NLLB codes `ceb_Latn` → `war_Latn` → `tgl_Latn`.
- Evaluates whether pivoting improves or hurts quality relative to translating directly.
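
The two-hop idea above can be sketched as a composition of two translation calls. This is a minimal illustration, not the code in `translate_simple.py`: the `translate(texts, src_code, tgt_code)` callable and the `fake_translate` stand-in are hypothetical, and a real run would plug in a model-backed function instead.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: translate(texts, src_code, tgt_code) -> translations
Translator = Callable[[List[str], str, str], List[str]]


def pivot_translate(
    texts: List[str],
    translate: Translator,
    src_code: str,
    pivot_code: str,
    tgt_code: str,
) -> Tuple[List[str], List[str]]:
    """Translate src -> pivot -> tgt and return (pivot_texts, final_texts)."""
    pivot_texts = translate(texts, src_code, pivot_code)
    if len(pivot_texts) != len(texts):
        raise ValueError("first hop changed the number of lines")
    final_texts = translate(pivot_texts, pivot_code, tgt_code)
    if len(final_texts) != len(texts):
        raise ValueError("second hop changed the number of lines")
    return pivot_texts, final_texts


def fake_translate(texts: List[str], src_code: str, tgt_code: str) -> List[str]:
    # Stand-in for a real model call: it only tags each line with the direction.
    return [f"[{src_code}>{tgt_code}] {t}" for t in texts]


pivot, final = pivot_translate(
    ["maayong buntag"], fake_translate, "ceb_Latn", "war_Latn", "tgl_Latn"
)
```

Checking the line count after each hop keeps a dropped or split line from silently shifting every later hypothesis against its reference.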

**Files created:**

- `test.pivot` (Cebuano→Waray intermediate).
- `hyp.pivot2tgt` (Waray→Tagalog final output).
- `metrics.json` with the BLEU and chrF2 scores.

### Pivot setup: extract test source/ref

In [1]:
from pathlib import Path
import pandas as pd

pivot_model = "facebook/nllb-200-distilled-600M"
exp = Path("../experiments/pivot"); exp.mkdir(parents=True, exist_ok=True)

DF = pd.read_csv("../data/processed/test.tsv", sep="\t", header=None, names=["src","tgt"])
DF["src"].to_csv(exp / "test.src", index=False, header=False)
DF["tgt"].to_csv(exp / "test.ref", index=False, header=False)
len(DF)

2750

This cell:
- defines the pivot model (`facebook/nllb-200-distilled-600M`) and creates `experiments/pivot/`
- loads `data/processed/test.tsv` (columns: `src`, `tgt`)
- writes:
  - `experiments/pivot/test.src`  → the Cebuano test sources (one per line)
  - `experiments/pivot/test.ref`  → the Tagalog references (one per line)

These files are the inputs/ground truth for the pivot experiment.
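
`len(DF)` is 2750 here, yet the translator in the next step reports 3,093 lines for `test.src`. One possible cause (an assumption, not confirmed by the notebook) is that some fields contain embedded line breaks, which `to_csv` writes out as extra physical lines. A hedged sketch of a writer that guarantees one sentence per line is shown below; `write_lines` and `count_lines` are hypothetical helpers, not part of the project.

```python
from pathlib import Path


def write_lines(sentences, path):
    """Write one sentence per line and return the number of lines written.

    Internal whitespace runs, including embedded newlines, are collapsed to a
    single space so that each sentence occupies exactly one physical line.
    """
    cleaned = [" ".join(str(s).split()) for s in sentences]
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)


def count_lines(path):
    """Count physical lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)
```

In the cell above, this could be used as `write_lines(DF["src"], exp / "test.src")`, followed by an assertion that `count_lines(exp / "test.src") == len(DF)`.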


### Source → Pivot translation


In [2]:
!python ../src/decode/translate_simple.py \
  --model $pivot_model \
  --src ../experiments/pivot/test.src \
  --out ../experiments/pivot/test.pivot \
  --src_code ceb_Latn \
  --tgt_code war_Latn

Loaded 3,093 lines from ..\experiments\pivot\test.src
Loading model: facebook/nllb-200-distilled-600M
Using source=ceb_Latn tag='ceb_Latn' → target=war_Latn (id=256194)
Device: cuda | batch=12 | beams=2
✅ Wrote 3093 translations → ..\experiments\pivot\test.pivot


This runs the batch translator on `test.src` and produces `test.pivot`.

- Input: `experiments/pivot/test.src` (Cebuano)
- Model: `facebook/nllb-200-distilled-600M`
- Language codes used internally: `ceb_Latn` → `war_Latn`
- Output: `experiments/pivot/test.pivot` (Waray)

### Pivot → Target translation


In [3]:
!python ../src/decode/translate_simple.py \
  --model $pivot_model \
  --src ../experiments/pivot/test.pivot \
  --out ../experiments/pivot/hyp.pivot2tgt \
  --src_code war_Latn \
  --tgt_code tgl_Latn

Loaded 3,093 lines from ..\experiments\pivot\test.pivot
Loading model: facebook/nllb-200-distilled-600M
Using source=war_Latn tag='war_Latn' → target=tgl_Latn (id=256174)
Device: cuda | batch=12 | beams=2
✅ Wrote 3093 translations → ..\experiments\pivot\hyp.pivot2tgt


This translates the pivot text into the final target and writes `hyp.pivot2tgt`.

- Input: `experiments/pivot/test.pivot`
- Model: `facebook/nllb-200-distilled-600M`
- Language codes (current config): `war_Latn` → `tgl_Latn`
- Output: `experiments/pivot/hyp.pivot2tgt`

### Score the pivot system (BLEU & chrF2)


In [4]:
!python ../src/eval/score.py \
  --ref ../experiments/pivot/test.ref \
  --hyp ../experiments/pivot/hyp.pivot2tgt \
  --out ../experiments/pivot/metrics.json

⚠️  Length mismatch: ref=2750 hyp=3093 — truncating to 2750
{
  "BLEU": 1.39,
  "chrF2": 20.3,
  "ref_len": 89797,
  "sys_len": 91058,
  "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1",
  "sacrebleu_version": "2.5.1"
}


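This evaluates the pivot pipeline output against the gold references.

- Reference: `experiments/pivot/test.ref` (gold Tagalog)
- Hypothesis: `experiments/pivot/hyp.pivot2tgt` (pivot pipeline output)
- Output metrics JSON: `experiments/pivot/metrics.json`

Caveat: the scorer reports 2,750 reference lines against 3,093 hypothesis lines and truncates to 2,750. `len(DF)` is also 2750, yet the translator loaded 3,093 lines from `test.src`, so the exported source file likely contains extra line breaks. If so, hypotheses and references are misaligned from the first extra line onward, and BLEU 1.39 / chrF2 20.3 probably reflect that misalignment rather than the quality of the pivot route. Treat these scores as unreliable until the line counts match.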
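
Because the scorer truncates on a length mismatch instead of stopping, a stricter check before scoring may be useful. The sketch below is an assumption about how such a guard could look; `load_aligned` is a hypothetical helper and is not part of `score.py`.

```python
def load_aligned(ref_path, hyp_path):
    """Load reference and hypothesis files, refusing to continue if the
    line counts differ (truncation would hide a misalignment)."""
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.rstrip("\n") for line in f]
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.rstrip("\n") for line in f]
    if len(refs) != len(hyps):
        raise ValueError(
            f"line count mismatch: ref={len(refs)} hyp={len(hyps)}; "
            "fix the export before scoring"
        )
    return refs, hyps
```

Failing loudly here means a reported score always describes sentence pairs that actually belong together.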

## Back-translation

- Uses *monolingual target text* to synthesize extra training pairs:
  - Tagalog mono text → translate backward to Cebuano.
  - Creates synthetic pairs `(Cebuano_bt, Tagalog_real)`.
- Merges them with your real training data (`train_plus_bt.tsv`).
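- The next fine-tuning round can train on this augmented data. Because the target side of every synthetic pair is genuine Tagalog, the expected benefit is mainly in output fluency.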

### Mine Tagalog Monolingual Sentences

In [5]:
!python ../src/data/mine_mono.py

✅ Wrote 20,000 lines → D:\OneDrive\Documents\My Learning Resource\University Courses\DLSU\2025-26\T1\CSC715M\assignments\mc02\data\mono\target\mono.txt
   Source lines: 24,256 | After dedupe: 24,176 | After limit: 20,000
   Length filter: 6–240 chars | Seed: 42


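This step extracts monolingual Tagalog sentences from the existing parallel training and development sets, deduplicates them, applies a 6–240 character length filter, and caps the result at 20,000 lines.  
These sentences are used for back-translation (BT) to create synthetic parallel data later.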
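
The source of `mine_mono.py` is not shown, so the following is only a sketch of what a mining step with the logged parameters (length filter 6–240 chars, limit 20,000, seed 42) might look like. The function name, the order of filtering and deduplication, and the use of seeded random sampling for the limit are all assumptions.

```python
import random


def mine_mono(lines, min_len=6, max_len=240, limit=20000, seed=42):
    """Strip, length-filter, and deduplicate lines, then cap them at `limit`.

    Assumption: when more than `limit` lines survive, a seeded random sample
    is taken so that repeated runs select the same sentences.
    """
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        if not (min_len <= len(text) <= max_len):
            continue  # too short or too long
        if text in seen:
            continue  # exact duplicate
        seen.add(text)
        unique.append(text)
    if len(unique) > limit:
        unique = random.Random(seed).sample(unique, limit)
    return unique
```

Using a local `random.Random(seed)` rather than the global generator keeps the selection reproducible regardless of what other code has drawn random numbers earlier.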

### Translate monolingual target → source

In [6]:
# Translate monolingual Tagalog -> Cebuano
!python ../src/decode/translate_simple.py \
  --model $pivot_model \
  --src ../data/mono/target/mono.txt \
  --out ../data/mono/target/mono.bt.src \
  --src_code tgl_Latn \
  --tgt_code ceb_Latn

Loaded 20,000 lines from ..\data\mono\target\mono.txt
Loading model: facebook/nllb-200-distilled-600M
Using source=tgl_Latn tag='tgl_Latn' → target=ceb_Latn (id=256035)
Device: cuda | batch=12 | beams=2
✅ Wrote 20000 translations → ..\data\mono\target\mono.bt.src


- Input: `data/mono/target/mono.txt` (Tagalog, one sentence per line, written by the mining step above)
- Translation output: `data/mono/target/mono.bt.src` (synthetic Cebuano)

### Pair synthetic source with original target

In [7]:
# Join into synthetic TSV
import pandas as pd

with open("../data/mono/target/mono.txt", encoding="utf-8") as f_tgt, \
     open("../data/mono/target/mono.bt.src", encoding="utf-8") as f_src:
    # Filter pairs jointly: dropping empty lines from each file separately
    # would shift every later pair out of alignment.
    pairs = [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt) if s.strip() and t.strip()]

synth = pd.DataFrame(pairs, columns=["src", "tgt"])
synth.to_csv("../data/processed/synth_bt.tsv", sep="\t", index=False, header=False)
print("Wrote synthetic pairs:", len(synth))

Wrote synthetic pairs: 20000



- Creates `data/processed/synth_bt.tsv` with columns: `src` (synthetic), `tgt` (original mono Tagalog)

### Merge synthetic pairs with real training data

In [8]:
# Merge with real train.tsv
train = pd.read_csv("../data/processed/train.tsv", sep="\t", header=None, names=["src","tgt"])
merged = pd.concat([train, synth]).sample(frac=1.0, random_state=42)
merged.to_csv("../data/processed/train_plus_bt.tsv", sep="\t", index=False, header=False)
print("Merged training size:", len(merged))

Merged training size: 42851



- Creates `data/processed/train_plus_bt.tsv`: the real training pairs from `train.tsv` plus the 20,000 synthetic pairs, shuffled with a fixed seed (`random_state=42`), for 42,851 pairs in total.

### Next step
- Re-run fine-tuning (in your fine-tune notebook) **pointing to** `data/processed/train_plus_bt.tsv`  
  and compare metrics vs. the baseline fine-tune.
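
A small helper can make the comparison explicit once both runs have written a `metrics.json`. This is a sketch under assumptions: `load_metrics` and `metric_deltas` are hypothetical helpers, the paths in the commented usage are placeholders, and the `BLEU` and `chrF2` keys follow the JSON printed by the scoring cell earlier in this notebook.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Read a metrics JSON file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def metric_deltas(baseline, candidate, keys=("BLEU", "chrF2")):
    """Return candidate minus baseline for each metric key."""
    return {k: round(candidate[k] - baseline[k], 2) for k in keys}


# Hypothetical paths; adjust to wherever each fine-tune run wrote its metrics.
# base = load_metrics("../experiments/finetune_baseline/metrics.json")
# aug = load_metrics("../experiments/finetune_bt/metrics.json")
# print(metric_deltas(base, aug))
```

A positive delta means the back-translation run scored higher than the baseline on that metric.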