# Advanced Transliteration Pipeline Playground

Use this notebook to interactively test the 7-step hybrid transliteration pipeline.

## Setup
Run the cell below to load the necessary modules.

In [1]:
import sys
import os

# Add the project root (the parent of the notebook's working directory) to the import path
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

from services.transliteration_core import (
    calculate_name_similarity,
    tokenize_arabic_name,
    tokenize_latin_name,
    arabic_to_latin,
    jaro_winkler_similarity,
)
from utils.text_normalization import (
    normalize_arabic,
    normalize_latin,
)

print("Modules loaded successfully!")

Modules loaded successfully!


## Step 1: Arabic Normalization
Test how Arabic text is normalized: diacritics are removed, alef variants are unified, and, as the output below shows, ta marbuta (ة) is mapped to ha (ه).

In [2]:
arabic_names = [
    "أحمد",
    "محمّد",
    "فاطمة",
    "عبدالله"
]

for name in arabic_names:
    normalized = normalize_arabic(name)
    print(f"Original: {name}\nNormalized: {normalized}\n---")

Original: أحمد
Normalized: احمد
---
Original: محمّد
Normalized: محمد
---
Original: فاطمة
Normalized: فاطمه
---
Original: عبدالله
Normalized: عبدالله
---
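The normalization rules visible in the output above can be approximated in a few lines. This is a minimal sketch, not the project's `normalize_arabic`: the name `sketch_normalize_arabic` and the exact character sets are assumptions chosen to reproduce the four results shown, and the real function may handle more cases.

```python
import re
import unicodedata

# Hypothetical stand-in for normalize_arabic; the real implementation may differ.
# Harakat (U+064B..U+0652), dagger alef (U+0670), and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Alef with madda, hamza above, hamza below, and wasla all become bare alef.
_ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0671": "\u0627",
})

# Ta marbuta becomes ha, matching the third example in the output above.
_TA_MARBUTA = str.maketrans({"\u0629": "\u0647"})


def sketch_normalize_arabic(text: str) -> str:
    """Strip diacritics, unify alef forms, and fold ta marbuta into ha."""
    text = unicodedata.normalize("NFC", text)  # compose decomposed letters first
    text = _DIACRITICS.sub("", text)
    text = text.translate(_ALEF_VARIANTS)
    text = text.translate(_TA_MARBUTA)
    return " ".join(text.split())  # collapse runs of whitespace
```

Folding ta marbuta into ha loses information, so a sketch like this suits matching rather than display.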


## Step 2: Tokenization
See how names are split into tokens, including fused compounds such as "Abdullah" and particles such as "bin".

In [3]:
names_to_tokenize = [
    "عبدالله محمد",
    "أحمد بن سعيد",
    "Abdullah Mohammed",
    "Ahmed bin Saeed"
]

for name in names_to_tokenize:
    # Auto-detect language for tokenization demo
    if any("\u0600" <= c <= "\u06FF" for c in name):
        tokens = tokenize_arabic_name(name)
        lang = "Arabic"
    else:
        tokens = tokenize_latin_name(name)
        lang = "Latin"

    print(f"{lang}: '{name}' -> {tokens}")

Arabic: 'عبدالله محمد' -> ['عبد', 'الله', 'محمد']
Arabic: 'أحمد بن سعيد' -> ['احمد', 'بن', 'سعيد']
Latin: 'Abdullah Mohammed' -> ['abd', 'al', 'lah', 'mohammed']
Latin: 'Ahmed bin Saeed' -> ['ahmed', 'bin', 'said']
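The compound splitting seen in the first Arabic example can be sketched as a prefix rule. This is an illustration under stated assumptions, not the project's `tokenize_arabic_name`: the function name `sketch_tokenize_arabic` and the single "abd + al" rule are hypothetical. The output above also shows alef normalized inside the tokens, which this sketch leaves out.

```python
from typing import List

ABD = "\u0639\u0628\u062F"  # the prefix "abd" (ain, ba, dal)
AL = "\u0627\u0644"         # the definite article "al" (alef, lam)


def sketch_tokenize_arabic(name: str) -> List[str]:
    """Split on whitespace, then separate fused 'abd' + 'al...' compounds."""
    tokens: List[str] = []
    for word in name.split():
        # A word such as "abdallah" written without a space starts with
        # ABD + AL and has at least one more letter after the article.
        if word.startswith(ABD + AL) and len(word) > len(ABD) + len(AL):
            tokens.append(word[:len(ABD)])
            tokens.append(word[len(ABD):])
        else:
            tokens.append(word)
    return tokens
```

A real tokenizer would likely need a broader list of compound prefixes and particles; this rule covers only the case shown in the output.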


## Step 3: Cross-Script Bridge (Arabic -> Latin)
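Test the transliteration logic that converts Arabic names to Latin characters. Because short vowels are usually unwritten in Arabic, the result is largely a consonant skeleton (for example, 'Mhmd').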

In [4]:
arabic_names = [
    "أحمد",
    "محمد",
    "فاطمة",
    "سماح جابر"
]

for name in arabic_names:
    latin = arabic_to_latin(name)
    print(f"'{name}' -> '{latin}'")

'أحمد' -> 'Ahmd'
'محمد' -> 'Mhmd'
'فاطمة' -> 'Fatmh'
'سماح جابر' -> 'Smah Jabr'
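A letter-by-letter mapping is enough to reproduce the four results above. The sketch below assumes a hypothetical table `LETTER_MAP` covering only the letters in those examples, and a hypothetical function `sketch_arabic_to_latin`. The project's `arabic_to_latin` is not shown in this notebook and may use different rules.

```python
# Hypothetical letter map; it covers only the letters used in the examples above.
LETTER_MAP = {
    "\u0627": "a",  # alef
    "\u0623": "a",  # alef with hamza above
    "\u0628": "b",  # ba
    "\u062C": "j",  # jim
    "\u062D": "h",  # ha
    "\u062F": "d",  # dal
    "\u0631": "r",  # ra
    "\u0633": "s",  # sin
    "\u0637": "t",  # emphatic ta
    "\u0641": "f",  # fa
    "\u0645": "m",  # mim
    "\u0629": "h",  # ta marbuta
}


def sketch_arabic_to_latin(name: str) -> str:
    """Map each Arabic letter to one Latin letter and capitalize each word."""
    words = []
    for word in name.split():
        # Letters missing from the map are dropped in this sketch.
        latin = "".join(LETTER_MAP.get(ch, "") for ch in word)
        if latin:
            words.append(latin.capitalize())
    return " ".join(words)
```

Since no vowels are inserted, 'Mhmd' and 'Mohammed' differ considerably as raw strings. That gap is what the later comparison steps of the pipeline have to bridge.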


## Step 4: Full Pipeline Test
Set two names (the OCR-extracted name and the user-entered name) to see the results of the full 7-step comparison.

In [14]:
# EDIT THESE NAMES TO TEST
ocr_name = "سوجاتا"
user_name = "Sujita"

print(f"Comparing:\n  1. {ocr_name}\n  2. {user_name}\n")

result = calculate_name_similarity(ocr_name, user_name)

print("PIPELINE RESULTS:")
print(f"1. Normalized (OCR):  '{result['normalized']['text1_arabic']}'")
print(f"2. Normalized (User): '{result['normalized']['text2_latin']}'")
print(f"3. Tokens (OCR):  {result['tokens']['text1']}")
print(f"4. Tokens (User): {result['tokens']['text2']}")
print(f"5. Latin Bridge:  '{result['latin_bridges']['text1_to_latin']}'")
print(f"6. Latin Phonetic:  {result['latin_phonetic_similarity']:.3f}")
print(f"7. Final Score:     {result['final_score']:.3f}")

Comparing:
  1. سوجاتا
  2. Sujita

PIPELINE RESULTS:
1. Normalized (OCR):  'سوجاتا'
2. Normalized (User): 'sujita'
3. Tokens (OCR):  ['سوجاتا']
4. Tokens (User): ['sujita']
5. Latin Bridge:  'Sojata'
6. Latin Phonetic:  1.000
7. Final Score:     1.000
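The setup cell imports `jaro_winkler_similarity`, but the notebook never calls it directly. For reference, here is a textbook Jaro-Winkler sketch. The names `jaro` and `jaro_winkler` are hypothetical, and the project's own function may differ in signature, casing rules, or scaling.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

Applied to the raw strings `'sojata'` and `'sujita'`, this sketch returns a value below 1.0. The 1.000 reported above therefore presumably comes from comparing phonetic encodings rather than the raw strings.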
