# EN-AR Dataset Discovery

## Objective
- Build a clean EN-AR discovery dataset from trusted sources only.
- Sources used here: local 25k CSV, local `ds5` CSV, and the Hugging Face datasets `ds2`, `ds3`, and `ds4`.
- Inspect Hugging Face schemas first, then select EN/AR columns before merge.

## Scope for This Notebook
- EDA only.
- No model training in this notebook.


In [50]:
# Setup: imports and constants
from __future__ import annotations

from pathlib import Path

import numpy as np
import pandas as pd
from datasets import load_dataset

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

candidate_roots = [Path.cwd(), Path.cwd().parent]
PROJECT_ROOT = next((r for r in candidate_roots if (r / "dataset").exists()), Path.cwd())

LOCAL_25K_CSV_PATH = PROJECT_ROOT / "dataset" / "The Arabic-English Sentence Bank 25k" / "arabic_english_sentences.csv"
LOCAL_DS5_CSV_PATH = PROJECT_ROOT / "dataset" / "translation-english-arabic.csv"
EDA_OUTPUT_DIR = PROJECT_ROOT / "artifacts" / "eda"
REQUIRED_COLUMNS = ["en", "ar"]
MAX_SEQ_LEN = 128

print(f"Project root: {PROJECT_ROOT}")
print(f"Local 25k CSV path: {LOCAL_25K_CSV_PATH}")
print(f"Local ds5 CSV path: {LOCAL_DS5_CSV_PATH}")
print(f"EDA output dir: {EDA_OUTPUT_DIR}")


Project root: c:\My Projects\en-ar-translation
Local 25k CSV path: c:\My Projects\en-ar-translation\dataset\The Arabic-English Sentence Bank 25k\arabic_english_sentences.csv
Local ds5 CSV path: c:\My Projects\en-ar-translation\dataset\translation-english-arabic.csv
EDA output dir: c:\My Projects\en-ar-translation\artifacts\eda


In [51]:
# Validate required local dataset files
assert LOCAL_25K_CSV_PATH.exists(), f"Local 25k CSV not found: {LOCAL_25K_CSV_PATH}"
assert LOCAL_DS5_CSV_PATH.exists(), f"Local ds5 CSV not found: {LOCAL_DS5_CSV_PATH}"
print(f"Local 25k dataset found: {LOCAL_25K_CSV_PATH}")
print(f"Local ds5 dataset found: {LOCAL_DS5_CSV_PATH}")


Local 25k dataset found: c:\My Projects\en-ar-translation\dataset\The Arabic-English Sentence Bank 25k\arabic_english_sentences.csv
Local ds5 dataset found: c:\My Projects\en-ar-translation\dataset\translation-english-arabic.csv


In [None]:
# Load ds2/ds3/ds4 from Hugging Face and ds5 from local CSV
ds2 = load_dataset("Arabic-Clip-Archive/ImageCaptions-7M-Translations-Arabic")
ds3 = load_dataset("salehalmansour/english-to-arabic-translate")
ds4 = load_dataset("ammagra/english-arabic-speech-translation")
ds5 = pd.read_csv(LOCAL_DS5_CSV_PATH, encoding="utf-8")

print("Loaded ds2 splits:", list(ds2.keys()))
print("Loaded ds3 splits:", list(ds3.keys()))
print("Loaded ds4 splits:", list(ds4.keys()))
print(f"Loaded ds5 rows: {len(ds5):,}")


Loaded ds2 splits: ['train']
Loaded ds3 splits: ['train']
Loaded ds4 splits: ['train', 'validation', 'test']
Loaded ds5 rows: 34,912


In [53]:
# Inspect ds2/ds3/ds4 (DatasetDict) and ds5 (DataFrame) schema to choose EN/AR columns
def inspect_dataset_dict(ds, name: str, preview_rows: int = 2) -> None:
    print(f"\n{name} splits: {list(ds.keys())}")
    for split_name, split_ds in ds.items():
        print(f"- {name}[{split_name}] rows: {len(split_ds):,}")
        print(f"  columns: {split_ds.column_names}")
        if len(split_ds) > 0:
            preview_df = split_ds.select(range(min(preview_rows, len(split_ds)))).to_pandas()
            display(preview_df.head(preview_rows))

inspect_dataset_dict(ds2, "ds2")
inspect_dataset_dict(ds3, "ds3")
inspect_dataset_dict(ds4, "ds4")

print("\nds5 schema:")
print("- rows:", f"{len(ds5):,}")
print("- columns:", list(ds5.columns))
display(ds5.head(2))



ds2 splits: ['train']
- ds2[train] rows: 150,000
  columns: ['caption', 'caption_sv', 'caption_multi', 'url', 'multi_language_code', 'multi_language_name', 'multiple_target_model', 'target_code', 'opus_mt_url', 'index']


Unnamed: 0,caption,caption_sv,caption_multi,url,multi_language_code,multi_language_name,multiple_target_model,target_code,opus_mt_url,index
0,sheep eating on grass outside of a barn in the...,får som äter på gräs utanför en lada mitt på e...,الخراف تأكل على العشب خارج الحظيرة في وسط حقل,http://l7.alamy.com/zooms/caadcbc0b3cd43baa411...,ar,arabic,1,ara,Helsinki-NLP/opus-mt-en-ar,450001
1,man in the cinema with popcorn and a cell phone,man på bio med popcorn och en mobiltelefon,رجل في السينما مع الفشار وهاتف خلوي,http://l7.alamy.com/zooms/05280780b00f426d85fa...,ar,arabic,1,ara,Helsinki-NLP/opus-mt-en-ar,450002



ds3 splits: ['train']
- ds3[train] rows: 1,325,899
  columns: ['en', 'ar']


Unnamed: 0,en,ar
0,and this,و هذه؟
1,it was um,...لقد كان



ds4 splits: ['train', 'validation', 'test']
- ds4[train] rows: 260,487
  columns: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id']


Unnamed: 0,client_id,file,audio,sentence,translation,id
0,69a495674a7d640f049bbe552424f75dc1263ecc706b49...,/content/cv4-en/clips/common_voice_en_19786080...,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x...,For around a decade Ivens lived in Eastern Eur...,عاش إيفينز في أوروبا الشرقية لما يُقارب من عقد...,common_voice_en_19786080
1,c5d41c1cf20243babd6d9650ee3c6a65b13fb743953450...,/content/cv4-en/clips/common_voice_en_18664696...,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x...,I almost forgive you the fright you gave me!,لقد سامحتك تقريبا على الرعب الذي سببته لي!,common_voice_en_18664696


- ds4[validation] rows: 26,049
  columns: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id']


Unnamed: 0,client_id,file,audio,sentence,translation,id
0,1e95bfcdd92ff136ab9d0501627c13866a00abbd3bda56...,/content/cv4-en/clips/common_voice_en_18664843...,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x...,"The key of the back door, sir?",مفتاح الباب الخلفي يا سيدي؟,common_voice_en_18664843
1,b28f486b414dbb5ffd2c3f8065c5ddbd9ac0a1e05c191d...,/content/cv4-en/clips/common_voice_en_19195751...,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x...,Completed and rigged bombs were found in Nebra...,تم العثور على قنابل جاهزة للاستعمال و مقلدة في...,common_voice_en_19195751


- ds4[test] rows: 15,531
  columns: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id']


Unnamed: 0,client_id,file,audio,sentence,translation,id
0,0013037a1d45cc33460806cc3f8ecee9d536c45639ba4c...,/content/cv4-en/clips/common_voice_en_699711.mp3,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00\x17TSS...,"""She'll be all right.""",ستكون بخير.,common_voice_en_699711
1,001509f4624a7dee75247f6a8b642c4a0d09f8be3eeea6...,/content/cv4-en/clips/common_voice_en_18132047...,{'bytes': b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x...,"""All's well that ends well.""",الأمور بخواتمها.,common_voice_en_18132047



ds5 schema:
- rows: 34,912
- columns: ['english', 'arabic']


Unnamed: 0,english,arabic
0,Hi.,مرحبًا.
1,Run!,اركض!


In [54]:
# Choose columns and build standardized EN/AR dataframes (ds2/ds3/ds4/ds5)
# Explicit column choices from the schema inspection above.
# Set a value to None to fall back to auto-detection via the candidate lists below.
DS2_EN_COL = "caption"
DS2_AR_COL = "caption_multi"
DS3_EN_COL = "en"
DS3_AR_COL = "ar"
DS4_EN_COL = "sentence"
DS4_AR_COL = "translation"
DS5_EN_COL = "english"
DS5_AR_COL = "arabic"

# Row caps (None = use all rows)
DS2_MAX_ROWS = None
DS3_MAX_ROWS = 412_000
DS4_MAX_ROWS = None
DS5_MAX_ROWS = None

EN_CANDIDATES = ["en", "english", "eng", "source", "src", "text_en", "sentence_en", "input", "caption"]
AR_CANDIDATES = ["ar", "arabic", "target", "tgt", "text_ar", "sentence_ar", "output", "caption_multi", "translation"]

def pick_column(column_names, preferred, candidates, side_name):
    cols = list(column_names)
    if preferred is not None:
        assert preferred in cols, f"{side_name} column `{preferred}` not found. Available: {cols}"
        return preferred
    for cand in candidates:
        if cand in cols:
            return cand
    raise ValueError(f"Could not infer {side_name} column. Available: {cols}. Set it manually in config.")

def dataset_dict_to_en_ar(ds, preferred_en, preferred_ar, max_rows, name):
    frames = []
    used = 0
    for split_name, split_ds in ds.items():
        chosen_en = pick_column(split_ds.column_names, preferred_en, EN_CANDIDATES, f"{name} EN")
        chosen_ar = pick_column(split_ds.column_names, preferred_ar, AR_CANDIDATES, f"{name} AR")
        split_df = split_ds.to_pandas()[[chosen_en, chosen_ar]].rename(columns={chosen_en: "en", chosen_ar: "ar"})
        if max_rows is not None:
            remaining = max_rows - used
            if remaining <= 0:
                break
            split_df = split_df.head(remaining).copy()
        used += len(split_df)
        frames.append(split_df)
        print(f"{name}[{split_name}] -> rows={len(split_df):,}, en_col={chosen_en}, ar_col={chosen_ar}")
    if not frames:
        raise ValueError(f"No rows collected from {name}.")
    return pd.concat(frames, ignore_index=True)

def dataframe_to_en_ar(df_in, preferred_en, preferred_ar, max_rows, name):
    chosen_en = pick_column(df_in.columns, preferred_en, EN_CANDIDATES, f"{name} EN")
    chosen_ar = pick_column(df_in.columns, preferred_ar, AR_CANDIDATES, f"{name} AR")
    df_out = df_in[[chosen_en, chosen_ar]].rename(columns={chosen_en: "en", chosen_ar: "ar"}).copy()
    if max_rows is not None:
        df_out = df_out.head(max_rows).copy()
    print(f"{name} -> rows={len(df_out):,}, en_col={chosen_en}, ar_col={chosen_ar}")
    return df_out

df_local_25k = pd.read_csv(LOCAL_25K_CSV_PATH, encoding="utf-8")
df_local_25k = df_local_25k.rename(columns={"English": "en", "Arabic": "ar"})[["en", "ar"]]
print(f"local_25k rows: {len(df_local_25k):,}")

df_ds2 = dataset_dict_to_en_ar(ds2, DS2_EN_COL, DS2_AR_COL, DS2_MAX_ROWS, "ds2")
print(f"ds2 total rows used: {len(df_ds2):,}")

df_ds3 = dataset_dict_to_en_ar(ds3, DS3_EN_COL, DS3_AR_COL, DS3_MAX_ROWS, "ds3")
print(f"ds3 total rows used: {len(df_ds3):,}")

df_ds4 = dataset_dict_to_en_ar(ds4, DS4_EN_COL, DS4_AR_COL, DS4_MAX_ROWS, "ds4")
print(f"ds4 total rows used: {len(df_ds4):,}")

df_ds5 = dataframe_to_en_ar(ds5, DS5_EN_COL, DS5_AR_COL, DS5_MAX_ROWS, "ds5")
print(f"ds5 total rows used: {len(df_ds5):,}")


local_25k rows: 25,000
ds2[train] -> rows=150,000, en_col=caption, ar_col=caption_multi
ds2 total rows used: 150,000
ds3[train] -> rows=412,000, en_col=en, ar_col=ar
ds3 total rows used: 412,000
ds4[train] -> rows=260,487, en_col=sentence, ar_col=translation
ds4[validation] -> rows=26,049, en_col=sentence, ar_col=translation
ds4[test] -> rows=15,531, en_col=sentence, ar_col=translation
ds4 total rows used: 302,067
ds5 -> rows=34,912, en_col=english, ar_col=arabic
ds5 total rows used: 34,912


In [55]:
# Merge all selected datasets and remove exact duplicate EN-AR pairs
source_frames = [
    ("local_25k", df_local_25k),
    ("ds2", df_ds2),
    ("ds3", df_ds3),
    ("ds4", df_ds4),
    ("ds5", df_ds5),
]

for name, frame in source_frames:
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    assert not missing, f"{name} is missing columns: {missing}"

df_merged = pd.concat([f for _, f in source_frames], ignore_index=True)
rows_before_dedup = len(df_merged)
unique_pairs_before = df_merged.drop_duplicates(subset=REQUIRED_COLUMNS).shape[0]
duplicate_pairs_before = rows_before_dedup - unique_pairs_before
duplicate_ratio_before = duplicate_pairs_before / rows_before_dedup if rows_before_dedup else 0.0

df = df_merged.drop_duplicates(subset=REQUIRED_COLUMNS, keep="first").reset_index(drop=True)
rows_after_dedup = len(df)

print("Rows by source:")
for name, frame in source_frames:
    print(f"- {name}: {len(frame):,}")
print(f"Merged rows (before dedup): {rows_before_dedup:,}")
print(f"Duplicate pairs before dedup: {duplicate_pairs_before:,} ({duplicate_ratio_before:.4%})")
print(f"Rows after dedup: {rows_after_dedup:,}")
print(f"Columns: {list(df.columns)}")


Rows by source:
- local_25k: 25,000
- ds2: 150,000
- ds3: 412,000
- ds4: 302,067
- ds5: 34,912
Merged rows (before dedup): 923,979
Duplicate pairs before dedup: 95,353 (10.3198%)
Rows after dedup: 828,626
Columns: ['en', 'ar']


In [56]:
# Basic shape, row count, and memory footprint
row_count, col_count = df.shape
memory_mb = df.memory_usage(deep=True).sum() / (1024 ** 2)

print(f"Shape: {df.shape}")
print(f"Row count: {row_count:,}")
print(f"Column count: {col_count}")
print(f"Approx memory usage: {memory_mb:,.2f} MB")


Shape: (828626, 2)
Row count: 828,626
Column count: 2
Approx memory usage: 231.63 MB


In [57]:
# Detect swapped-language rows (AR in en, EN in ar), fix them, then re-deduplicate
en_text = df["en"].fillna("").astype(str)
ar_text = df["ar"].fillna("").astype(str)

en_has_arabic = en_text.str.contains(r"[\u0600-\u06FF]", regex=True)
en_has_latin = en_text.str.contains(r"[A-Za-z]", regex=True)
ar_has_arabic = ar_text.str.contains(r"[\u0600-\u06FF]", regex=True)
ar_has_latin = ar_text.str.contains(r"[A-Za-z]", regex=True)

# Conservative swap rule to reduce false positives
swap_mask = en_has_arabic & ar_has_latin & (~en_has_latin | ~ar_has_arabic)
swap_count = int(swap_mask.sum())

preview_before_swap = df.loc[swap_mask, ["en", "ar"]].head(10).copy()

if swap_count > 0:
    df.loc[swap_mask, ["en", "ar"]] = df.loc[swap_mask, ["ar", "en"]].values

rows_before_rededup = len(df)
df = df.drop_duplicates(subset=REQUIRED_COLUMNS, keep="first").reset_index(drop=True)
rows_after_rededup = len(df)
rededup_removed = rows_before_rededup - rows_after_rededup

print(f"Detected likely swapped rows: {swap_count:,}")
print(f"Rows removed after re-dedup: {rededup_removed:,}")
print("Preview of detected rows before swap (first 10):")
preview_before_swap


Detected likely swapped rows: 1
Rows removed after re-dedup: 0
Preview of detected rows before swap (first 10):


Unnamed: 0,en,ar
780351,راجع به چی حرف میزنی؟,[TO REMOVE]
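
The single flagged row is worth a closer look: its `en` side is Arabic-script text (it appears to be Persian) and its `ar` side is a `[TO REMOVE]` placeholder, so swapping the columns does not produce a valid EN-AR pair. The sketch below re-implements the same mask as a per-row function, using only the standard library so it runs without the notebook state. The helper name `looks_swapped` and the toy rows are illustrative assumptions, not part of the notebook.

```python
import re

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")
LATIN_RE = re.compile(r"[A-Za-z]")

def looks_swapped(en: str, ar: str) -> bool:
    """Per-row version of swap_mask (hypothetical helper, same boolean logic)."""
    en_has_arabic = bool(ARABIC_RE.search(en))
    en_has_latin = bool(LATIN_RE.search(en))
    ar_has_arabic = bool(ARABIC_RE.search(ar))
    ar_has_latin = bool(LATIN_RE.search(ar))
    return en_has_arabic and ar_has_latin and (not en_has_latin or not ar_has_arabic)

toy_rows = [
    ("Hi.", "مرحبا."),                          # correct orientation
    ("مرحبا.", "Hi."),                          # genuinely swapped
    ("راجع به چی حرف میزنی؟", "[TO REMOVE]"),   # the row flagged in the output above
    ("Open the USB port", "افتح منفذ USB"),     # Latin inside Arabic text, not a swap
]

flags = [looks_swapped(en, ar) for en, ar in toy_rows]
print(flags)
```

Because the rule only checks scripts, it also fires on placeholder rows like the third one. Such rows may be better dropped than swapped.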


In [None]:
# Strip boundary markers from both columns: literal "\n", real line breaks, and leading/trailing "-"
newline_clean_stats = {}
for col in REQUIRED_COLUMNS:
    before_col = df[col].fillna("").astype(str)
    after_col = (
        before_col
        .str.replace(r"^(?:\s*\\n\s*|\s*-\s*)+", "", regex=True)
        .str.replace(r"(?:\s*\\n\s*|\s*-\s*)+$", "", regex=True)
        .str.replace(r"^[\r\n\-]+", "", regex=True)
        .str.replace(r"[\r\n\-]+$", "", regex=True)
        .str.strip()
    )
    changed_rows = int((before_col != after_col).sum())
    newline_clean_stats[col] = changed_rows
    df[col] = after_col

print("Boundary newline cleanup changes:")
for col, count in newline_clean_stats.items():
    print(f"- {col}: {count:,} rows updated")


Boundary newline cleanup changes:
- en: 54,067 rows updated
- ar: 194 rows updated
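
To make the effect of the four chained patterns easier to see, here is a minimal sketch that applies the same regular expressions to a few toy strings with the standard `re` module. The function name `clean_boundaries` and the sample strings are assumptions for illustration. The patterns themselves are copied from the cell above.

```python
import re

def clean_boundaries(text: str) -> str:
    """Apply the notebook's boundary patterns to one string (illustrative helper)."""
    text = re.sub(r"^(?:\s*\\n\s*|\s*-\s*)+", "", text)   # leading literal "\n" or "-"
    text = re.sub(r"(?:\s*\\n\s*|\s*-\s*)+$", "", text)   # trailing literal "\n" or "-"
    text = re.sub(r"^[\r\n\-]+", "", text)                # leading real line breaks / hyphens
    text = re.sub(r"[\r\n\-]+$", "", text)                # trailing real line breaks / hyphens
    return text.strip()

samples = [
    r"\n- hello there",        # literal backslash-n followed by a dash
    "- so what",               # subtitle-style leading dash
    "well-known fact -",       # interior hyphen must survive
    "line one\nline two\n",    # trailing real newline only
    "-5 degrees",              # leading minus sign is also stripped
]

cleaned = [clean_boundaries(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Interior hyphens and interior line breaks are left alone, but a leading minus sign is removed as well, which may matter for rows that start with a negative number.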


In [59]:
# Remove Arabic diacritics from the Arabic column
arabic_diacritics_pattern = r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]"
before_ar = df["ar"].fillna("").astype(str)
diacritics_removed_chars = int(before_ar.str.count(arabic_diacritics_pattern).sum())
after_ar = before_ar.str.replace(arabic_diacritics_pattern, "", regex=True)
changed_rows = int((before_ar != after_ar).sum())
df["ar"] = after_ar.str.replace(r"\s+", " ", regex=True).str.strip()

rows_before_rededup = len(df)
df = df.drop_duplicates(subset=REQUIRED_COLUMNS, keep="first").reset_index(drop=True)
rows_after_rededup = len(df)

print(f"Rows with Arabic diacritics removed: {changed_rows:,}")
print(f"Total diacritic characters removed: {diacritics_removed_chars:,}")
print(f"Rows removed after re-dedup: {rows_before_rededup - rows_after_rededup:,}")


Rows with Arabic diacritics removed: 85,165
Total diacritic characters removed: 136,473
Rows removed after re-dedup: 1,050
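
The re-dedup count above is non-zero because pairs that differed only in diacritics become identical once the marks are stripped. A small standard-library sketch of that effect follows. The helper `strip_diacritics` and the sample words are illustrative assumptions, while the character class is the one used in the cell above.

```python
import re

ARABIC_DIACRITICS_RE = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]")

def strip_diacritics(text: str) -> str:
    """Remove diacritics, then collapse whitespace (mirrors the cell above)."""
    without_marks = ARABIC_DIACRITICS_RE.sub("", text)
    return re.sub(r"\s+", " ", without_marks).strip()

# Written with escapes so the combining marks are unambiguous.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf, ta, ba, each followed by fatha
plain = "\u0643\u062A\u0628"                         # same word without marks

stripped = strip_diacritics(vocalized)
removed_chars = len(vocalized) - len(stripped)

print(stripped == plain, removed_chars)
```

After stripping, the vocalized and plain forms compare equal, so `drop_duplicates` treats rows containing them as the same pair.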


In [60]:
# Null and empty-row checks
null_counts = df[REQUIRED_COLUMNS].isna().sum()
empty_counts = {
    col: df[col].fillna("").astype(str).str.strip().eq("").sum()
    for col in REQUIRED_COLUMNS
}

quality_df = pd.DataFrame({
    "null_count": null_counts,
    "empty_or_whitespace_count": pd.Series(empty_counts),
})
quality_df


Unnamed: 0,null_count,empty_or_whitespace_count
en,0,28
ar,0,2
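
The table shows 28 rows with an empty `en` side and 2 with an empty `ar` side, and the later export cells write `df` without filtering them. If those rows are unwanted for training, a filter along the following lines could be applied before export. This is a sketch on toy data under that assumption, not a step the notebook performs.

```python
import pandas as pd

# Toy frame standing in for the merged dataset (values are illustrative).
toy = pd.DataFrame({
    "en": ["Hi.", "", "Run!", "   "],
    "ar": ["مرحبا.", "اركض!", "", "نعم"],
})

# Keep only rows where both sides contain non-whitespace text.
non_empty = toy["en"].str.strip().ne("") & toy["ar"].str.strip().ne("")
filtered = toy.loc[non_empty].reset_index(drop=True)

print(f"rows before: {len(toy)}, rows after: {len(filtered)}")
```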


In [61]:
# Exact EN-AR pair duplicate ratio after dedup
pair_count = len(df)
unique_pair_count = df.drop_duplicates(subset=REQUIRED_COLUMNS).shape[0]
duplicate_pair_count = pair_count - unique_pair_count
duplicate_ratio = duplicate_pair_count / pair_count if pair_count else 0.0

print(f"Total pairs: {pair_count:,}")
print(f"Unique pairs: {unique_pair_count:,}")
print(f"Duplicate pairs: {duplicate_pair_count:,}")
print(f"Duplicate ratio: {duplicate_ratio:.4%}")


Total pairs: 827,576
Unique pairs: 827,576
Duplicate pairs: 0
Duplicate ratio: 0.0000%


In [62]:
# Unique EN and AR counts
unique_en_count = df["en"].nunique(dropna=True)
unique_ar_count = df["ar"].nunique(dropna=True)
total_rows = len(df)

print(f"Unique EN sentences: {unique_en_count:,}")
print(f"Unique AR sentences: {unique_ar_count:,}")
print(f"EN uniqueness ratio: {unique_en_count / total_rows:.4%}")
print(f"AR uniqueness ratio: {unique_ar_count / total_rows:.4%}")


Unique EN sentences: 770,713
Unique AR sentences: 808,937
EN uniqueness ratio: 93.1290%
AR uniqueness ratio: 97.7478%


In [63]:
# EN/AR word-length statistics
en_word_lengths = df["en"].fillna("").astype(str).str.split().str.len()
ar_word_lengths = df["ar"].fillna("").astype(str).str.split().str.len()

word_length_stats = pd.DataFrame({
    "en": [
        en_word_lengths.mean(),
        en_word_lengths.quantile(0.50),
        en_word_lengths.quantile(0.90),
        en_word_lengths.quantile(0.95),
        en_word_lengths.quantile(0.99),
    ],
    "ar": [
        ar_word_lengths.mean(),
        ar_word_lengths.quantile(0.50),
        ar_word_lengths.quantile(0.90),
        ar_word_lengths.quantile(0.95),
        ar_word_lengths.quantile(0.99),
    ],
}, index=["mean", "p50", "p90", "p95", "p99"]).round(2)

word_length_stats


Unnamed: 0,en,ar
mean,7.91,6.44
p50,7.0,5.0
p90,13.0,11.0
p95,14.0,13.0
p99,21.0,18.0


In [64]:
# EN/AR character-length statistics + outlier preview
en_char_lengths = df["en"].fillna("").astype(str).str.len()
ar_char_lengths = df["ar"].fillna("").astype(str).str.len()

char_length_stats = pd.DataFrame({
    "en": [
        en_char_lengths.mean(),
        en_char_lengths.quantile(0.50),
        en_char_lengths.quantile(0.90),
        en_char_lengths.quantile(0.95),
        en_char_lengths.quantile(0.99),
        en_char_lengths.max(),
    ],
    "ar": [
        ar_char_lengths.mean(),
        ar_char_lengths.quantile(0.50),
        ar_char_lengths.quantile(0.90),
        ar_char_lengths.quantile(0.95),
        ar_char_lengths.quantile(0.99),
        ar_char_lengths.max(),
    ],
}, index=["mean", "p50", "p90", "p95", "p99", "max"]).round(2)

max_en_idx = en_char_lengths.idxmax()
max_ar_idx = ar_char_lengths.idxmax()
outlier_preview = pd.DataFrame([
    {"side": "en", "index": int(max_en_idx), "char_len": int(en_char_lengths.loc[max_en_idx]), "text": str(df.loc[max_en_idx, "en"])[:200]},
    {"side": "ar", "index": int(max_ar_idx), "char_len": int(ar_char_lengths.loc[max_ar_idx]), "text": str(df.loc[max_ar_idx, "ar"])[:200]},
])

char_length_stats, outlier_preview


(           en       ar
 mean    42.97    34.93
 p50     36.00    27.00
 p90     78.00    64.00
 p95     88.00    75.00
 p99    118.00   104.00
 max   1272.00  1281.00,
   side   index  char_len                                               text
 0   en  807787      1272  maids in ethiopia are mostly women who come to...
 1   ar  807787      1281  الخادمات في اثيوبيا معظمهن من النساء اللاتي يا...)

In [65]:
# Script anomaly checks
en_has_arabic = df["en"].fillna("").astype(str).str.contains(r"[\u0600-\u06FF]", regex=True)
ar_has_latin = df["ar"].fillna("").astype(str).str.contains(r"[A-Za-z]", regex=True)

anomaly_summary = pd.DataFrame({
    "count": [int(en_has_arabic.sum()), int(ar_has_latin.sum())],
    "ratio": [float(en_has_arabic.mean()), float(ar_has_latin.mean())],
}, index=["en_contains_arabic", "ar_contains_latin"])

anomaly_summary["ratio"] = (anomaly_summary["ratio"] * 100).round(4).astype(str) + "%"
anomaly_summary


Unnamed: 0,count,ratio
en_contains_arabic,0,0.0%
ar_contains_latin,28026,3.3865%
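
A binary "contains Latin" flag cannot tell an Arabic sentence with an embedded name or acronym from a row that was never translated. One possible follow-up, sketched here as an assumption rather than something the notebook does, is to measure the share of Latin letters among all Latin and Arabic-block characters. The helper name `latin_share` and the sample strings are hypothetical.

```python
import re

LATIN_RE = re.compile(r"[A-Za-z]")
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def latin_share(text: str) -> float:
    """Fraction of Latin letters among Latin + Arabic-block characters."""
    latin = len(LATIN_RE.findall(text))
    arabic = len(ARABIC_RE.findall(text))
    total = latin + arabic
    return latin / total if total else 0.0

samples = [
    "افتح منفذ USB",   # Arabic sentence with an embedded acronym
    "Hello world",     # untranslated text sitting in the Arabic column
    "نعم",             # pure Arabic
]

shares = [latin_share(s) for s in samples]
for text, share in zip(samples, shares):
    print(f"{share:.2f}  {text}")
```

Rows with a share close to 1.0 are candidates for manual review, while low shares usually indicate borrowed terms inside otherwise Arabic text. Any cut-off would need to be chosen by inspecting real rows.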


In [83]:
# Truncation impact estimate for MAX_SEQ_LEN (word-level proxy)
en_token_proxy = df["en"].fillna("").astype(str).str.split().str.len()
ar_token_proxy = df["ar"].fillna("").astype(str).str.split().str.len()

en_trunc_count = int((en_token_proxy > MAX_SEQ_LEN).sum())
ar_trunc_count = int((ar_token_proxy > MAX_SEQ_LEN).sum())
either_trunc_count = int(((en_token_proxy > MAX_SEQ_LEN) | (ar_token_proxy > MAX_SEQ_LEN)).sum())
total_rows = len(df)

truncation_estimate = pd.DataFrame({
    "count": [en_trunc_count, ar_trunc_count, either_trunc_count],
    "ratio": [
        en_trunc_count / total_rows if total_rows else 0.0,
        ar_trunc_count / total_rows if total_rows else 0.0,
        either_trunc_count / total_rows if total_rows else 0.0,
    ],
}, index=["en_gt_max_len", "ar_gt_max_len", "either_side_gt_max_len"])
truncation_estimate["ratio"] = (truncation_estimate["ratio"] * 100).round(4).astype(str) + "%"

print(f"MAX_SEQ_LEN reference: {MAX_SEQ_LEN}")
truncation_estimate


MAX_SEQ_LEN reference: 128


Unnamed: 0,count,ratio
en_gt_max_len,18,0.0022%
ar_gt_max_len,11,0.0013%
either_side_gt_max_len,20,0.0024%


In [81]:
# Top EN/AR frequencies + random sample preview
top_en = df["en"].fillna("").astype(str).value_counts().head(10).rename_axis("en").reset_index(name="count")
top_ar = df["ar"].fillna("").astype(str).value_counts().head(10).rename_axis("ar").reset_index(name="count")
sample_preview = df.sample(n=10)[["en", "ar"]].reset_index(drop=True)

print("Top 10 EN entries:")
display(top_en)
print("Top 10 AR entries:")
display(top_ar)
print("Random sample preview (10 rows):")
sample_preview


Top 10 EN entries:


Unnamed: 0,en,count
0,im sorry,183
1,thank you,180
2,come on,163
3,oh my god,139
4,i dont know,137
5,what are you talking about,133
6,all right,127
7,thats it,120
8,no no,119
9,excuse me,115


Top 10 AR entries:


Unnamed: 0,ar,count
0,يا إلهي,116
1,كيف حالك؟,85
2,هذا رائع,68
3,نعم,65
4,أليس كذلك؟,64
5,ما الأمر؟,57
6,هل أنت بخير ؟,55
7,شكرا لك,54
8,حقا ؟,52
9,يا إلهي!,47


Random sample preview (10 rows):


Unnamed: 0,en,ar
0,Intra-limb coordination involves the planning ...,ينطوي التآزر بين الأطراف على تخطيط المسارات في...
1,to the lambrick foundation,في نخب مؤسسة (لامبريك).
2,whispering he can hear us,إن بإمكانه سماعنا .
3,A woman is doing inclined dumbbell presses at ...,امرأة تقوم بتمارين الدمبل وهي مائلة في صالة أل...
4,step right up we had a winner,يوجد تقدم لدينا فائز
5,well have cos on post,سنحضر ضباط في المكان
6,theres a blue door,هناك باب أزرق
7,a document of the department's operations in t...,وثيقة عن عمليات الإدارة في مكان العمل
8,guano bowls,ذرق طائر
9,"two children are standing near a wall, one is ...",طفلان يقفان بالقرب من الجدار أحدهما يمسك بمؤخر...


In [84]:
# Export compact EDA artifacts (stats/tables/samples)
EDA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

summary = {
    "rows": int(len(df)),
    "columns": list(df.columns),
    "max_seq_len_reference": int(MAX_SEQ_LEN),
}

# Include commonly computed diagnostics when available in notebook state
if "duplicate_pairs_before" in globals():
    summary["duplicate_pairs_before_dedup"] = int(duplicate_pairs_before)
if "duplicate_ratio_before" in globals():
    summary["duplicate_ratio_before_dedup"] = float(duplicate_ratio_before)
if "swap_count" in globals():
    summary["swap_count"] = int(swap_count)
if "rededup_removed" in globals():
    summary["rededup_removed_after_swap"] = int(rededup_removed)
if "diacritics_removed_chars" in globals():
    summary["diacritics_removed_chars"] = int(diacritics_removed_chars)

pd.Series(summary).to_json(EDA_OUTPUT_DIR / "eda_summary.json", force_ascii=False, indent=2)

if "word_length_stats" in globals():
    word_length_stats.to_csv(EDA_OUTPUT_DIR / "word_length_stats.csv", encoding="utf-8")
if "char_length_stats" in globals():
    char_length_stats.to_csv(EDA_OUTPUT_DIR / "char_length_stats.csv", encoding="utf-8")
if "anomaly_summary" in globals():
    anomaly_summary.to_csv(EDA_OUTPUT_DIR / "anomaly_summary.csv", encoding="utf-8")
if "truncation_estimate" in globals():
    truncation_estimate.to_csv(EDA_OUTPUT_DIR / "truncation_estimate.csv", encoding="utf-8")
if "top_en" in globals():
    top_en.to_csv(EDA_OUTPUT_DIR / "top_en.csv", index=False, encoding="utf-8")
if "top_ar" in globals():
    top_ar.to_csv(EDA_OUTPUT_DIR / "top_ar.csv", index=False, encoding="utf-8")
if "sample_preview" in globals():
    sample_preview.to_csv(EDA_OUTPUT_DIR / "sample_preview.csv", index=False, encoding="utf-8")

print(f"Saved compact EDA artifacts to: {EDA_OUTPUT_DIR}")


Saved compact EDA artifacts to: c:\My Projects\en-ar-translation\artifacts\eda


In [85]:
# Export full cleaned combined dataset for training reuse
EDA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

full_dataset_parquet_path = EDA_OUTPUT_DIR / "final_cleaned_combined_dataset.parquet"
full_dataset_csv_path = EDA_OUTPUT_DIR / "final_cleaned_combined_dataset.csv"
metadata_path = EDA_OUTPUT_DIR / "final_cleaned_combined_dataset_metadata.json"

# Parquet is preferred for speed/size; CSV is added for portability.
df.to_parquet(full_dataset_parquet_path, index=False)
df.to_csv(full_dataset_csv_path, index=False, encoding="utf-8")

metadata = {
    "rows": int(len(df)),
    "columns": list(df.columns),
    "parquet_path": str(full_dataset_parquet_path),
    "csv_path": str(full_dataset_csv_path),
    "max_seq_len_reference": int(MAX_SEQ_LEN),
}
pd.Series(metadata).to_json(metadata_path, force_ascii=False, indent=2)

print(f"Saved parquet: {full_dataset_parquet_path}")
print(f"Saved csv: {full_dataset_csv_path}")
print(f"Saved metadata: {metadata_path}")
print(f"Final cleaned rows: {len(df):,}")


Saved parquet: c:\My Projects\en-ar-translation\artifacts\eda\final_cleaned_combined_dataset.parquet
Saved csv: c:\My Projects\en-ar-translation\artifacts\eda\final_cleaned_combined_dataset.csv
Saved metadata: c:\My Projects\en-ar-translation\artifacts\eda\final_cleaned_combined_dataset_metadata.json
Final cleaned rows: 827,576


## Decision Summary

- Final discovery dataset is built from: local 25k CSV + `ds2` + `ds3` (capped) + `ds4` + local `ds5` CSV.
- Exact duplicate EN-AR pairs are removed after merge.
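- Swapped-language rows are detected with a script-based rule and corrected by swapping `en`/`ar` (1 row flagged in this run).
- Boundary markers (literal `\n`, real line breaks, and leading/trailing hyphens) are stripped from both columns.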
- Arabic diacritics are removed from the Arabic side.
- Final cleaned combined dataset is exported for training reuse in `artifacts/eda/final_cleaned_combined_dataset.parquet` (and CSV copy).
- EDA artifacts and summaries are exported under `artifacts/eda/`.


## Dataset Credits

The combined EN-AR dataset in this project is sourced from the following datasets, listed in the order of their variable names in this notebook:

1. `local_25k` (Kaggle): https://www.kaggle.com/datasets/tahaalselwii/the-arabic-english-sentence-bank-25k?resource=download
2. `ds2` (Hugging Face): https://huggingface.co/datasets/Arabic-Clip-Archive/ImageCaptions-7M-Translations-Arabic/viewer/default/train?p=1132
3. `ds3` (Hugging Face): https://huggingface.co/datasets/salehalmansour/english-to-arabic-translate
4. `ds4` (Hugging Face): https://huggingface.co/datasets/ammagra/english-arabic-speech-translation
5. `ds5` (Kaggle): https://www.kaggle.com/datasets/yumnagamal/translation-english-arabic?select=merge_df.csv
