# 01 — Data Preparation (TMX → parallel text → train/dev/test)

**Purpose:**
Prepare clean, aligned Cebuano-Tagalog parallel data as tab-separated files that can be loaded with Hugging Face Datasets for use with NLLB.

**Inputs:**

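- `data/raw/ceb-tl.tmx` → Cebuano-Tagalog translation memory (TMX), converted in step A.
- `data/raw/source/source.txt` → Cebuano sentences (written by step A).
- `data/raw/target/target.txt` → Tagalog translations (written by step A).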

**Process:**

- Filters out very short and very long lines (for quality).
- Shuffles with a fixed seed for reproducibility.
- Splits into train/dev/test at 80/10/10.
- Writes to `data/processed/` as:
  - `train.tsv`
  - `dev.tsv`
  - `test.tsv`

**Outputs:**

- Aligned pairs of sentences for training and evaluation.
- A random sample of each split, displayed via `df.sample()`, so you can visually confirm alignment.
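
The process steps above can be sketched in a few lines. This is a minimal illustration, not the contents of `make_splits.py`: the function name `filter_shuffle_split`, the length bounds `min_tokens` and `max_tokens`, and the seed value are all assumptions chosen for the example.

```python
import random

def filter_shuffle_split(pairs, min_tokens=1, max_tokens=200, seed=42):
    """Filter by whitespace-token count, shuffle deterministically, split 80/10/10.

    All default values are hypothetical, not taken from make_splits.py.
    """
    # Keep a pair only if both sides fall inside the length bounds.
    kept = [
        (src, tgt)
        for src, tgt in pairs
        if min_tokens <= len(src.split()) <= max_tokens
        and min_tokens <= len(tgt.split()) <= max_tokens
    ]
    # A dedicated Random instance makes the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(kept)

    n = len(kept)
    n_train = int(0.8 * n)  # rounded down
    n_dev = int(0.1 * n)    # rounded down
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    test = kept[n_train + n_dev:]  # remainder goes to test
    return train, dev, test
```

Rounding the train and dev sizes down and giving the remainder to test is one common convention. For 30,918 pairs it yields 24,734 / 3,091 / 3,093, the same breakdown reported in section B.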

In [9]:
import pathlib  # os and sys were imported but never used
root = pathlib.Path(".")
print("Working directory:", root.resolve())
print("\nExpected files:")
print(" - src/prepare/convert_tmx.py")
print(" - src/prepare/make_splits.py")
print(" - data/raw/ceb-tl.tmx")

Working directory: D:\OneDrive\Documents\My Learning Resource\University Courses\DLSU\2025-26\T1\CSC715M\assignments\mc02\notebooks

Expected files:
 - src/prepare/convert_tmx.py
 - src/prepare/make_splits.py
 - data/raw/ceb-tl.tmx
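
The cell above only lists the expected paths; it does not check them. A minimal sketch of an existence check follows, assuming the project root is one level above the notebook directory (the later cells use `../`). The helper name `missing_files` is hypothetical.

```python
import pathlib

EXPECTED = [
    "src/prepare/convert_tmx.py",
    "src/prepare/make_splits.py",
    "data/raw/ceb-tl.tmx",
]

def missing_files(project_root, expected=EXPECTED):
    """Return the expected relative paths that are not files under project_root."""
    root = pathlib.Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]

if __name__ == "__main__":
    # Assumption: the notebook runs from notebooks/, so ".." is the project root.
    missing = missing_files("..")
    print("Missing:", missing if missing else "none")
```

Running this before steps A and B surfaces a wrong working directory early, instead of as a failure inside the `!python` calls.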


## A) Convert TMX → `source.txt` & `target.txt`

This uses `lxml` with `recover=True` to tolerate broken XML.
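
For orientation, here is a minimal sketch of TMX pair extraction. It is not the code in `convert_tmx.py`: the function name `extract_pairs`, the whitespace normalization, and the standard-library fallback are assumptions made so the example runs without `lxml` installed. The fallback parser is strict and offers no recovery.

```python
try:
    from lxml import etree  # third-party; recover=True tolerates broken XML
    PARSER = etree.XMLParser(recover=True)
except ImportError:
    import xml.etree.ElementTree as etree  # strict stdlib fallback, no recovery
    PARSER = None

# TMX 1.4 marks the language with xml:lang; older files may use a plain "lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_bytes, src_lang="ceb", tgt_lang="tl"):
    """Return (source, target) tuples for every <tu> that has both languages."""
    root = etree.fromstring(tmx_bytes, PARSER)
    if root is None:  # recovery can yield nothing for badly damaged input
        return []
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Join nested inline markup and collapse runs of whitespace.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        # Skip units where either side is missing or empty.
        if segs.get(src_lang) and segs.get(tgt_lang):
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

Dropping units that lack one side keeps the two output files line-aligned, which is what the split step relies on.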

In [10]:
# Paths assume the notebook runs from notebooks/, one level below the project root; adjust them if yours does not
!python ../src/prepare/convert_tmx.py --tmx ../data/raw/ceb-tl.tmx --src_lang ceb --tgt_lang tl --out_dir ../data/raw

Extracted 30,919 aligned pairs
Wrote:
  ..\data\raw\source\source.txt
  ..\data\raw\target\target.txt


## B) Clean & split into train/dev/test

In [11]:
!python ../src/prepare/make_splits.py --src ../data/raw/source/source.txt --tgt ../data/raw/target/target.txt --out ../data/processed

Wrote  24734 pairs → ..\data\processed\train.tsv
Wrote   3091 pairs → ..\data\processed\dev.tsv
Wrote   3093 pairs → ..\data\processed\test.tsv

Total usable pairs: 30,918
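
The three split sizes add up to the reported total: 24,734 + 3,091 + 3,093 = 30,918. To check the written files independently of any CSV parser, a minimal sketch follows. It assumes UTF-8 files with exactly one tab per line, and the helper name `check_tsv` is hypothetical.

```python
def check_tsv(path):
    """Return (n_lines, bad_line_numbers) for a two-column TSV file.

    A line is flagged when it does not split into exactly two
    non-empty fields on the tab character.
    """
    n_lines = 0
    bad = []
    with open(path, encoding="utf-8") as fh:  # assumption: files are UTF-8
        for lineno, line in enumerate(fh, start=1):
            n_lines += 1
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not all(f.strip() for f in fields):
                bad.append(lineno)
    return n_lines, bad
```

If `n_lines` matches the "Wrote N pairs" figure for a file and `bad` is empty, the file holds one well-formed pair per line.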


## C) Peek at the splits + quick QC


In [12]:
import pandas as pd, pathlib

pp = pathlib.Path("../data/processed")
for name in ["train.tsv", "dev.tsv", "test.tsv"]:
    df = pd.read_csv(pp / name, sep="\t", header=None, names=["src","tgt"])
    print(f"{name}: {len(df):,} pairs")
    display(df.sample(min(5, len(df))))

# Quick quality checks
train = pd.read_csv(pp/"train.tsv", sep="\t", header=None, names=["src","tgt"])
avg_src_len = train["src"].str.split().map(len).mean()
avg_tgt_len = train["tgt"].str.split().map(len).mean()
ratio = (train["src"].str.len() / train["tgt"].str.len()).clip(upper=10).mean()

print(f"\nAverage token length — SRC: {avg_src_len:.1f}, TGT: {avg_tgt_len:.1f}")
print(f"Average char length ratio (src/tgt, clipped): {ratio:.2f}")


train.tsv: 22,851 pairs


| | src | tgt |
|---|---|---|
| 3119 | Busa karon dili unta ang akong ginoong hari ma... | Ngayon nga'y huwag isapuso ng aking panginoon ... |
| 10801 | Akong pagaut-uton ang tawo ug ang mananap; ako... | Aking lilipulin ang tao at ang hayop; aking li... |
| 6033 | Pamatia kini, Oh Job: Humunong ka, ug tulotimb... | Dinggin mo ito, Oh Job: Tumigil ka, at bulayin... |
| 12785 | Wala usab nila hinulsoli ang ilang mga pagpama... | At sila'y hindi nagsipagsisi sa kanilang mga p... |
| 10746 | Ug si Salomon nagbuhat niadtong dautan sa pagt... | At gumawa si Salomon ng masama sa paningin ng ... |


dev.tsv: 2,930 pairs


| | src | tgt |
|---|---|---|
| 244 | Ang mga Amorehanon, ang mga Canaanhon, ang mga... | At ang mga Amorrheo, at ang mga Cananeo, at an... |
| 1349 | ug sanglit dili man siya maka-bayad, ang iyang... | Datapuwa't palibhasa'y wala siyang sukat ibaya... |
| 1644 | Ug ang Dios nag-uban sa bata; ug nagtubo siya;... | At ang Dios ay sumabata, at siya'y lumaki; at ... |
| 580 | Karon ang nahabilin nga buhat ni Jeroboam, ug ... | Ang iba nga sa mga gawa ni Jeroboam, at ang la... |
| 2381 | Wala ba ako ibubo nimo ingon sa gatas, Ug gipa... | Hindi mo ba ako ibinuhos na parang gatas, at b... |


test.tsv: 2,750 pairs


| | src | tgt |
|---|---|---|
| 2455 | Ang Dios magahimo sa ingon niini sa mga kaaway... | Hatulan nawa ng Dios ang mga kaaway ni David, ... |
| 1439 | Dili ba si Jehova nga imong Dios maoy nagauban... | Hindi ba ang Panginoon ninyong Dios ay sumasai... |
| 632 | Busa kinahanglan magbantay kamo sa inyong pagp... | Ingatan ninyo kung paano ang inyong pakikinig:... |
| 511 | Ug gikuha ta ang tanan niyang kalungsoran niad... | At ating sinakop ang lahat niyang mga bayan na... |
| 2684 | Ug si Saul miingon: Ang Dios nagahimo niana ug... | At sinabi ni Saul, Gawing gayon ng Dios at lal... |



Average token length — SRC: 32.3, TGT: 26.6
Average char length ratio (src/tgt, clipped): 1.10
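
The row counts printed by this cell (22,851 / 2,930 / 2,750) are lower than the pair counts reported in section B (24,734 / 3,091 / 3,093). One possible explanation, not confirmed by anything in this notebook, is quote handling. By default `pd.read_csv` treats a double quote at the start of a field as opening a quoted field, so everything up to the closing quote, newlines included, can be merged into a single row. The sketch below reproduces that effect with the standard-library `csv` module, which uses the same default convention. The helper name `count_rows` and the sample text are illustrative only.

```python
import csv
import io

def count_rows(tsv_text, quoting):
    """Count the rows a tab-delimited reader sees under a given quoting mode."""
    reader = csv.reader(io.StringIO(tsv_text, newline=""),
                        delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader)

# Three physical lines; a quote opens on line 1 and closes on line 3.
sample = '"Hello\tworld\nfoo\tbar\nbaz"\tqux\n'

print(count_rows(sample, csv.QUOTE_MINIMAL))  # 1: the lines merge into one row
print(count_rows(sample, csv.QUOTE_NONE))     # 3: quotes are ordinary characters
```

If this is the cause here, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` makes pandas treat double quotes as literal text. Comparing `len(df)` with each file's raw line count would confirm or rule it out.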
