# Dataset Assembly Overview

This notebook documents the assembled dataset for the **Deep Past Challenge** (Akkadian-to-English translation).

The data pipeline merged multiple sources (HuggingFace datasets, Kaggle competition data, lexicons, and OARE sentences), deduplicated them, and produced train/val/test splits stored as Parquet files under `data/processed/`.

In [None]:
import json
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Paths
PROCESSED = Path("../data/processed")
STATS_PATH = PROCESSED / "stats.json"

# Add project root so we can import src modules
sys.path.insert(0, str(Path("..").resolve()))

In [None]:
# Load pipeline stats if available, otherwise compute from parquet
if STATS_PATH.exists():
    with open(STATS_PATH) as f:
        stats = json.load(f)
    print("Loaded stats.json")
    print(json.dumps(stats, indent=2))
else:
    print("stats.json not found; will compute from all_data.parquet")
    stats = None

In [None]:
# Load the full assembled dataset
df = pd.read_parquet(PROCESSED / "all_data.parquet")
print(f"Total rows: {len(df):,}")
print(f"Columns: {list(df.columns)}")
df.head(3)
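When `stats.json` is missing, the cell above only sets `stats = None`, so the summary has to be derived from the DataFrame itself. The following is a minimal sketch of such a fallback. The helper name `compute_basic_stats`, the keys of the returned dictionary, and the source labels in the toy frame are assumptions for illustration; they may not match the schema the pipeline writes to `stats.json`.

```python
import pandas as pd


def compute_basic_stats(frame: pd.DataFrame) -> dict:
    """Hypothetical fallback: summarize row counts per source from the data itself."""
    per_source = frame["source"].value_counts()
    return {
        "total_rows": int(len(frame)),
        # Cast to built-in types so the result can be passed to json.dumps
        "rows_per_source": {str(k): int(v) for k, v in per_source.items()},
    }


# Toy frame standing in for all_data.parquet (labels are made up)
toy = pd.DataFrame({"source": ["kaggle", "oare", "oare"]})
toy_stats = compute_basic_stats(toy)
print(toy_stats)
```

In the notebook this could be wired in as `if stats is None: stats = compute_basic_stats(df)`.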

---
## Source Inventory

Each row is tagged with a `source` field indicating where the pair originated.

In [None]:
source_counts = df["source"].value_counts()
print("Rows per source:\n")
print(source_counts.to_string())
print(f"\nTotal: {source_counts.sum():,}")

In [None]:
# Sort once so the bars and their labels share the same order
sorted_counts = source_counts.sort_values()

fig, ax = plt.subplots(figsize=(8, 4))
sorted_counts.plot.barh(ax=ax, color="steelblue")
ax.set_xlabel("Number of rows")
ax.set_title("Rows per data source")
# Write each count just to the right of the end of its bar
for i, v in enumerate(sorted_counts):
    ax.text(v + 200, i, f"{v:,}", va="center", fontsize=9)
plt.tight_layout()
plt.show()

---
## Normalization Examples

The pipeline applies `normalize_transliteration()` to every transliteration. This function:
1. Converts ASCII diacritics to Unicode (`sz` -> `š`, `s,` -> `ṣ`, `t,` -> `ṭ`)
2. Subscripts digits on syllables (`du3` -> `du₃`)
3. Lowercases determinative braces (`{D}` -> `{d}`)
4. Applies NFC Unicode normalization
5. Normalizes whitespace

Below we show 5 sample transliterations from the dataset, plus before/after normalization on a few raw strings.

In [None]:
# Show 5 example transliterations from the assembled data
samples = df.sample(5, random_state=42)[["transliteration", "translation", "source"]]
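for _, row in samples.iterrows():
    # Translations can be missing (NaN), so fall back to an empty string before slicing
    translation = row["translation"] if isinstance(row["translation"], str) else ""
    print(f"[{row['source']}]")
    print(f"  AKK: {row['transliteration'][:120]}")
    print(f"  ENG: {translation[:120]}")
    print()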

In [None]:
from src.data.normalize import normalize_transliteration

# Demonstrate normalization on raw strings
raw_examples = [
    "a-na {D}EN.LIL2 qi2-bi2-ma",
    "KISZIB szu-ta-mu-zi",
    "i-na UGU s,i-ba-at {KI}ba-bi-lim",
    "um-ma sza-lim-a-szur3-ma",
    "1 GU2 AN.NA t,a-ab",
]

print(f"{'Raw':<45} {'Normalized'}")
print("-" * 90)
for raw in raw_examples:
    normed = normalize_transliteration(raw)
    print(f"{raw:<45} {normed}")
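For readers without the `src` package at hand, the five steps listed in this section can be approximated with the standard library. This is a rough sketch under stated assumptions, not the project's implementation: the function name `normalize_sketch`, the replacement table, and the rule "subscript any digit run that directly follows a letter" are guesses at the behaviour described, and the real `normalize_transliteration()` may handle more cases or handle them differently.

```python
import re
import unicodedata

# Assumed replacement table; the real one in src.data.normalize may be larger.
ASCII_TO_UNICODE = [
    ("sz", "š"), ("SZ", "Š"),
    ("s,", "ṣ"), ("S,", "Ṣ"),
    ("t,", "ṭ"), ("T,", "Ṭ"),
]
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def normalize_sketch(text: str) -> str:
    # 1. ASCII digraphs -> Unicode letters (would also rewrite a literal "s," in prose)
    for ascii_form, unicode_form in ASCII_TO_UNICODE:
        text = text.replace(ascii_form, unicode_form)
    # 2. Subscript digit runs that directly follow a letter; a standalone "1" is untouched
    text = re.sub(
        r"(?<=[^\W\d_])\d+",
        lambda m: m.group().translate(SUBSCRIPT_DIGITS),
        text,
    )
    # 3. Lowercase whatever sits inside determinative braces
    text = re.sub(r"\{([^{}]*)\}", lambda m: "{" + m.group(1).lower() + "}", text)
    # 4. NFC normalization (composes base letter + combining mark; keeps subscripts)
    text = unicodedata.normalize("NFC", text)
    # 5. Collapse runs of whitespace and trim the ends
    return " ".join(text.split())


for raw in ["a-na {D}EN.LIL2 qi2-bi2-ma", "1 GU2 AN.NA  t,a-ab"]:
    print(f"{raw:<30} {normalize_sketch(raw)}")
```

NFC rather than NFKC matters here: NFKC would fold the subscript digits back into plain ASCII digits and undo step 2.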

---
## Dialect / Genre Distribution

Most rows come from general HuggingFace corpora without dialect metadata and carry the dialect label `unknown`. The competition data and OARE sentences are tagged as **Old Assyrian** (`old_assyrian`).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot dialect and genre counts side by side, with the count written above each bar
for ax, column in zip(axes, ["dialect", "genre"]):
    counts = df[column].value_counts()
    counts.plot.bar(ax=ax, color=["#7cafc2", "#d28445"])
    ax.set_title(f"{column.capitalize()} distribution")
    ax.set_ylabel("Rows")
    ax.tick_params(axis="x", rotation=0)
    for j, v in enumerate(counts):
        ax.text(j, v + 500, f"{v:,}", ha="center", fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Cross-tabulation: source vs dialect
print("Source x Dialect cross-tab:\n")
ct = pd.crosstab(df["source"], df["dialect"])
print(ct.to_string())
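The raw cross-tab can be easier to read with totals and per-source shares. The sketch below shows both on a toy frame; the source labels and counts are made up for illustration. Because `value_counts()` drops missing values by default, the example also fills missing dialects with an explicit `unknown` label first so that untagged rows are counted.

```python
import pandas as pd

# Toy data with made-up source labels; one row has no dialect tag at all
toy = pd.DataFrame({
    "source": ["kaggle", "kaggle", "hf_corpus", "hf_corpus", "hf_corpus"],
    "dialect": ["old_assyrian", "old_assyrian", "unknown", "unknown", None],
})

# Give untagged rows an explicit label so they are not lost from the counts
toy["dialect"] = toy["dialect"].fillna("unknown")

# Absolute counts with row and column totals
ct_counts = pd.crosstab(toy["source"], toy["dialect"], margins=True, margins_name="total")

# Share of each dialect within a source (each row sums to 1)
ct_share = pd.crosstab(toy["source"], toy["dialect"], normalize="index")

print(ct_counts.to_string())
print(ct_share.round(2).to_string())
```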

---
## Quality Tier Breakdown

- **gold** — parallel transliteration-translation pairs from curated sources
- **lexicon** — single-word or phrase-level entries from eBL dictionary / OA lexicon
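Since the two tiers differ in granularity (full pairs versus word or phrase entries), it can be useful to select them separately, for example to train on gold rows only. A minimal sketch on a toy frame follows; `select_tiers` is a hypothetical helper, not part of the pipeline, and the row contents are placeholders.

```python
import pandas as pd

# Placeholder rows; only the "quality" column mirrors the real dataset
toy = pd.DataFrame({
    "transliteration": ["translit A", "translit B", "translit C"],
    "translation": ["translation A", "translation B", "translation C"],
    "quality": ["gold", "lexicon", "gold"],
})


def select_tiers(frame: pd.DataFrame, tiers: list) -> pd.DataFrame:
    """Keep only rows whose quality tier is listed in `tiers` (hypothetical helper)."""
    return frame[frame["quality"].isin(tiers)].reset_index(drop=True)


gold_only = select_tiers(toy, ["gold"])
gold_plus_lexicon = select_tiers(toy, ["gold", "lexicon"])
print(len(gold_only), len(gold_plus_lexicon))
```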

In [None]:
quality_counts = df["quality"].value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
colors = ["#7cafc2", "#d28445"]
wedges, texts, autotexts = ax.pie(
    quality_counts,
    labels=quality_counts.index,
    autopct="%1.1f%%",
    colors=colors,
    startangle=90,
)
ax.set_title("Quality tier breakdown")
for t in autotexts:
    t.set_fontsize(11)
plt.tight_layout()
plt.show()

print(quality_counts.to_string())

---
## Dataset Statistics

This section reports the train/val/test split sizes, average sequence lengths, and the transliteration vocabulary size.

In [None]:
# Load splits
train_df = pd.read_parquet(PROCESSED / "train.parquet")
val_df = pd.read_parquet(PROCESSED / "val.parquet")
test_df = pd.read_parquet(PROCESSED / "test.parquet")
val_comp = pd.read_parquet(PROCESSED / "val_competition.parquet")

print("Split sizes:")
print(f"  train:            {len(train_df):>8,}")
print(f"  val:              {len(val_df):>8,}")
print(f"  test:             {len(test_df):>8,}")
print(f"  val_competition:  {len(val_comp):>8,} (Old Assyrian, Kaggle source only)")
print(f"  total:            {len(train_df) + len(val_df) + len(test_df):>8,}")
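Beyond the sizes, it is worth confirming that no pair appears in more than one split, since leakage would inflate validation scores. The sketch below counts shared (transliteration, translation) pairs between two frames. The helper `count_overlap` and the toy rows are illustrative assumptions; whether the pipeline already guarantees disjoint splits is not stated here.

```python
import pandas as pd


def count_overlap(a: pd.DataFrame, b: pd.DataFrame,
                  keys=("transliteration", "translation")) -> int:
    """Number of distinct key tuples that occur in both frames (hypothetical helper)."""
    key_list = list(keys)
    a_pairs = a[key_list].drop_duplicates()
    b_pairs = b[key_list].drop_duplicates()
    # An inner merge keeps only the pairs present on both sides
    return len(a_pairs.merge(b_pairs, on=key_list, how="inner"))


# Placeholder rows: the val frame repeats one pair from train, test does not
toy_train = pd.DataFrame({
    "transliteration": ["translit A", "translit B"],
    "translation": ["translation A", "translation B"],
})
toy_val = pd.DataFrame({
    "transliteration": ["translit B"],
    "translation": ["translation B"],
})
toy_test = pd.DataFrame({
    "transliteration": ["translit C"],
    "translation": ["translation C"],
})

print("train/val overlap: ", count_overlap(toy_train, toy_val))
print("train/test overlap:", count_overlap(toy_train, toy_test))
```

On the real data the same call would be `count_overlap(train_df, val_df)`, and likewise for the other split combinations.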

In [None]:
# Average lengths (whitespace-split tokens)
df["translit_len"] = df["transliteration"].str.split().str.len()
df["transl_len"] = df["translation"].fillna("").str.split().str.len()

print("Transliteration length (whitespace tokens):")
print(f"  mean:   {df['translit_len'].mean():.1f}")
print(f"  median: {df['translit_len'].median():.0f}")
print(f"  max:    {df['translit_len'].max()}")
print()
print("Translation length (whitespace tokens):")
print(f"  mean:   {df['transl_len'].mean():.1f}")
print(f"  median: {df['transl_len'].median():.0f}")
print(f"  max:    {df['transl_len'].max()}")

In [None]:
# Length distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(df["translit_len"].clip(upper=100), bins=50, color="steelblue", edgecolor="white")
axes[0].set_title("Transliteration length distribution")
axes[0].set_xlabel("Tokens (clipped at 100)")
axes[0].set_ylabel("Count")

axes[1].hist(df["transl_len"].clip(upper=100), bins=50, color="#d28445", edgecolor="white")
axes[1].set_title("Translation length distribution")
axes[1].set_xlabel("Tokens (clipped at 100)")
axes[1].set_ylabel("Count")

plt.tight_layout()
plt.show()

In [None]:
# Vocabulary size (unique whitespace-split transliteration tokens)
all_tokens = df["transliteration"].str.split().explode()
vocab_size = all_tokens.nunique()
print(f"Unique transliteration tokens (whitespace-split): {vocab_size:,}")
print(f"Total transliteration tokens: {len(all_tokens):,}")

# Most common tokens
print("\nTop 20 most frequent transliteration tokens:")
print(all_tokens.value_counts().head(20).to_string())
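Whitespace tokens are a convenient proxy, but the ByT5 models planned for training operate on UTF-8 bytes, so the byte length of each string is closer to the sequence length the model will actually see. Diacritics and subscript digits take two or three bytes each, which makes normalized transliterations longer in bytes than in characters. A minimal sketch, with `utf8_length` as a hypothetical helper name:

```python
import pandas as pd


def utf8_length(text_series: pd.Series) -> pd.Series:
    """Length of each string in UTF-8 bytes; missing values count as 0."""
    return text_series.fillna("").map(lambda s: len(s.encode("utf-8")))


# Small demonstration: a diacritic, a subscript digit, and a missing value
toy = pd.Series(["šu", "du₃", None])
byte_lens = utf8_length(toy)
print(byte_lens.tolist())
```

Applied to the notebook's data, `utf8_length(df["transliteration"]).describe()` would give a quick picture of how many rows exceed a chosen maximum input length.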

---
## Comparison to Baseline

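The original baseline model was trained on **1,561** parallel pairs from the Kaggle competition `train.csv`. After assembling data from HuggingFace (cipher-ling and phucthaiv02), the eBL dictionary, the OA lexicon, and OARE sentences, we now have **161,518** deduplicated rows, a **~103x increase** in total rows over the baseline. The train split alone (145,366 rows) is about 93x the size of the baseline. Note that the assembled total includes lexicon entries as well as full parallel pairs.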

| Metric | Baseline | Assembled |
|--------|----------|-----------|
| Total rows | 1,561 | 161,518 |
| Sources | 1 (Kaggle) | 7 |
| Dialects tagged | 1 | 2 (old_assyrian, unknown) |
| Quality tiers | 1 | 2 (gold, lexicon) |
| Train split | ~1,404 (90%) | 145,366 |
| Val split | ~157 (10%) | 8,076 |
| Competition val | -- | 88 (OA-only) |

In [None]:
# Visual comparison
BASELINE_ROWS = 1561  # parallel pairs in the Kaggle competition train.csv

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(
    ["Baseline\n(Kaggle only)", "Assembled\n(all sources)"],
    [BASELINE_ROWS, len(df)],
    color=["#d28445", "#7cafc2"],
    edgecolor="white",
    width=0.5,
)
# len(df) counts every assembled row, lexicon entries included
ax.set_ylabel("Number of rows")
ax.set_title("Dataset size: baseline vs assembled")
# {}-style format strings in bar_label require matplotlib >= 3.7
ax.bar_label(bars, fmt="{:,.0f}", fontsize=11, padding=3)
ax.set_ylim(0, len(df) * 1.15)
plt.tight_layout()
plt.show()

ratio = len(df) / BASELINE_ROWS
print(f"Assembled dataset is {ratio:.0f}x larger than the baseline.")

---

**Next steps:**
- Train ByT5-small on the full assembled dataset and compare against the 1,561-pair baseline
- Experiment with filtering to gold-quality-only vs including lexicon entries
- Evaluate on the `val_competition` split (88 Old Assyrian pairs) for competition-relevant performance
- Scale to ByT5-base/large once the data pipeline is validated