# LLM-based Thai-to-Latin Transliteration

This notebook uses an OpenAI LLM to transliterate a list of common Thai words into various Latin script representations. The goal is to generate multiple possible transliterations that a human might use.

The process is as follows:
1. Load the most common Thai words from `artifacts/combined_word_freq.json`.
2. Set `TOP_K`, the number of words to transliterate, and `BATCH_SIZE`, the number of words sent per API call.
3. Prepare a system prompt to instruct the LLM on the transliteration task.
4. Batch the words and send requests to the OpenAI API.
5. Parse the JSON response from the LLM.
6. Save the aggregated transliterations to a JSON file named after the model in `artifacts/llm_transliteration`.
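
The steps above can be sketched end to end without touching the network. In this minimal sketch, `fake_llm` is a hypothetical stand-in for the real API call, `chunk` is an illustrative helper name, and the three-word list is sample data only; it shows just the batching, parsing, and merging logic.

```python
import json


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def fake_llm(batch: list[str]) -> str:
    """Hypothetical stand-in for the API call: returns a JSON object as text."""
    canned = {"สวัสดี": ["sawatdee", "sawasdee"], "ขอบคุณ": ["khobkhun"], "ส": ["so"]}
    return json.dumps({w: canned.get(w, []) for w in batch}, ensure_ascii=False)


words = ["สวัสดี", "ขอบคุณ", "ส"]
results: dict[str, list[str]] = {}
for batch in chunk(words, size=2):
    parsed = json.loads(fake_llm(batch))  # parse the JSON response
    results.update(parsed)                # merge into the aggregate

print(json.dumps(results, ensure_ascii=False, indent=2))
```

Keeping the model call behind a single function makes it possible to swap the stub for the real request later without changing the loop.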

In [1]:
import json
import os
from pathlib import Path
import openai
from tqdm import tqdm
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# --- Parameters ---
# Note: load_dotenv() does not override variables that are already set,
# so an OPENAI_API_KEY exported in the shell takes precedence over .env.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found in .env file or environment variables")

openai.api_key = OPENAI_API_KEY

# Select model
MODEL = "gpt-4.1-mini"

# Transliteration Parameters
TOP_K = 10000  # Number of most common words to transliterate
BATCH_SIZE = 50  # Number of words to send in each API call

# Define paths
WORD_FREQ_PATH = Path("../artifacts/combined_word_freq.json")
OUTPUT_DIR = Path("../artifacts/llm_transliteration")
OUTPUT_PATH = OUTPUT_DIR / f"{MODEL}.json"

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

In [2]:
# --- Load the combined word frequencies data ---
with open(WORD_FREQ_PATH, "r", encoding="utf-8") as f:
    word_freq = json.load(f)

# Get the list of words to transliterate
words_to_transliterate = list(word_freq.keys())[:TOP_K]

print(f"Selected top {TOP_K} words for transliteration.")
print(f"Batch size set to {BATCH_SIZE} words per API call.")
print(f"First 10 words: {words_to_transliterate[:10]}")

# Create the results dict, or load a partially completed file to resume
if OUTPUT_PATH.exists():
    print(f"Partial results file: {OUTPUT_PATH} already exists, attempting resume...")
    # Read as UTF-8 explicitly: the file holds Thai keys written with ensure_ascii=False
    with open(OUTPUT_PATH, "r", encoding="utf-8") as fp:
        results_dict = json.load(fp)
        print(f"Read {len(results_dict)} entries")
else:
    print(f"No partial results found, starting fresh at file {OUTPUT_PATH}")
    results_dict = dict()

# Queue every word that has no transliterations yet (new, or present but empty)
pending_words = list()
for word in words_to_transliterate:
    if not results_dict.get(word):
        results_dict.setdefault(word, list())
        pending_words.append(word)

print(f"Final output size: {len(results_dict)} words")
print(f"Number of words to work on: {len(pending_words)} words")

Selected top 10000 words for transliteration.
Batch size set to 50 words per API call.
First 10 words: ['ที่', 'ไม่', 'ใน', 'มี', 'และ', 'ของ', 'ได้', 'เป็น', 'มา', 'ไป']
Partial results file: ../artifacts/llm_transliteration/gpt-4.1-mini.json already exists, attempting resume...
Read 500 entries
Final output size: 10000 words
Number of words to work on: 9500 words
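
Taking the first `TOP_K` keys yields the most common words only if the JSON file is already ordered by descending frequency. If that ordering is not guaranteed, the selection can be made explicit. This sketch assumes the file maps each word to a numeric count, which the notebook does not confirm; `top_k_words` and the sample counts are hypothetical.

```python
import heapq


def top_k_words(word_freq: dict[str, float], k: int) -> list[str]:
    """Return the k highest-frequency words, most frequent first."""
    # Iterating a dict yields its keys; each key is ranked by its count.
    return heapq.nlargest(k, word_freq, key=word_freq.get)


sample_freq = {"ใน": 70, "ที่": 100, "มี": 60, "ไม่": 90}  # made-up counts
print(top_k_words(sample_freq, 3))
```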


In [3]:
# --- Function & System Prompt ---

SYSTEM_PROMPT = """You are an expert in Thai to English transliteration. Your task is to provide multiple, common-sense transliterations for a given list of Thai words. The transliterations should reflect how a human user might type the word using a standard Latin keyboard.

Please follow these guidelines:
1. For each Thai word, provide a list of possible transliterations.
2. The transliterations should be lowercase.
3. Prioritize common and intuitive spellings over strict phonetic or academic systems.
4. Provide as many reasonable variations as you can think of. For example, for "สวัสดี", you might provide ["sawatdee", "sawasdee", "sawasdi", "sawatdi"].
5. The output MUST be a valid JSON object where keys are the original Thai words and values are lists of their transliterations.

Example Input:
["สวัสดี", "ขอบคุณ", "ส"]

Example Output:
{
  "สวัสดี": ["sawatdee", "sawasdee", "sawasdi", "sawatdi"],
  "ขอบคุณ": ["khobkhun", "khopkhun", "kobkun", "kopkun"],
  "ส": ["s", "so", "sor"]
}
"""

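def get_transliterations_from_llm(word_batch: list[str]) -> dict | None:
    """
    Sends a batch of words to the OpenAI API for transliteration.

    Returns the parsed JSON object, or None if the request or parsing fails.
    """
    # Bind these before the try block so the error handlers and the
    # response dump below never hit an unbound name when the API call fails.
    response = None
    content = None
    try: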
        response = openai.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": json.dumps(word_batch, ensure_ascii=False)}
            ],
            response_format={"type": "json_object"},
            temperature=0.35,
            timeout=30,
            max_completion_tokens=3000
        )

        content = response.choices[0].message.content
        return json.loads(content)

    except openai.APITimeoutError as e:
        print(f"Timeout error: {e}")
    except openai.APIError as e:
        print(f"OpenAI API Error: {e}")
    except json.JSONDecodeError:
        print(f"Failed to decode JSON from response:\n{content}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

    print("\nResponse Dump:\n")
    print(response)
    return None
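
`get_transliterations_from_llm` returns `None` on any failure, and a caller can only skip that batch. Transient failures such as timeouts could instead be retried. A minimal sketch, assuming a hypothetical `with_retries` helper; the attempt count and delays are illustrative, not tuned values, and the demo uses a stub rather than the real API.

```python
import time
from typing import Callable, Optional


def with_retries(
    fetch: Callable[[list[str]], Optional[dict]],
    batch: list[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch(batch) until it returns a non-empty result or attempts run out.

    Waits base_delay * 2**attempt seconds between attempts (exponential backoff).
    """
    for attempt in range(max_attempts):
        result = fetch(batch)
        if result:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)
    return None


# Demo with a stub that fails twice and then succeeds (no network involved).
calls = {"n": 0}


def flaky_fetch(batch: list[str]) -> Optional[dict]:
    calls["n"] += 1
    return None if calls["n"] < 3 else {w: ["ok"] for w in batch}


delays: list[float] = []  # record the waits instead of actually sleeping
outcome = with_retries(flaky_fetch, ["มา"], sleep=delays.append)
print(outcome, delays)
```

In the notebook, `with_retries(get_transliterations_from_llm, batch)` could stand in for the direct call.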

In [4]:
# --- Main Processing Loop ---

# Add pending words to batches
word_batches = [
    pending_words[i : i+BATCH_SIZE]
    for i in range(0, len(pending_words), BATCH_SIZE)
]

print(f"Starting LLM transliteration for {len(pending_words)} words in {len(word_batches)} batches...")

for batch in tqdm(word_batches, desc="Transliterating batches"):
    result = get_transliterations_from_llm(batch)
    if not result:
        print("No results returned for batch")
        continue

    for thai, latin in result.items():
        # Hallucination catch: accept only words sent in this batch
        if thai not in batch:
            print(f"Hallucination Error: This word was not requested - {thai}\n Results dump below:\n")
            print(result)
            continue
        # Skip malformed values (the prompt asks for a list of strings)
        if not isinstance(latin, list):
            print(f"Malformed value for {thai}: {latin!r}")
            continue
        # Clean results - trim, lowercase, drop empty tokens, remove duplicates
        clean_latin = list({
            token.strip(' -').lower()
            for token in latin
            if isinstance(token, str) and token.strip(' -')
        })
        # Empty list catch
        if len(clean_latin) == 0:
            continue
        # Add results to output dict
        results_dict[thai] = clean_latin
        # Remove words with successful transliteration from pending list
        pending_words.remove(thai)

print(f"\nFinished transliteration. {len(pending_words)} words remaining to work on.")
print(f"Transliteration dictionary size: {len(results_dict)} words.")

Starting LLM transliteration for 9500 words in 190 batches...


Transliterating batches: 100%|██████████| 190/190 [1:02:00<00:00, 19.58s/it]


Finished transliteration. 0 words remaining to work on.
Transliteration dictionary size: 10000 words.
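
The per-batch checks in the loop above (rejecting unrequested words, normalising tokens, skipping empty lists) can be factored into a pure function, which makes them testable without calling the API. This is a sketch, not the notebook's code: `clean_batch_result` is a hypothetical helper, and it sorts the variants for deterministic output, whereas the loop keeps them in arbitrary set order.

```python
def clean_batch_result(result: dict, requested: list[str]) -> dict[str, list[str]]:
    """Keep only requested words and normalise their transliterations."""
    allowed = set(requested)
    cleaned: dict[str, list[str]] = {}
    for thai, latin in result.items():
        if thai not in allowed or not isinstance(latin, list):
            continue  # unrequested word or malformed value
        tokens = {
            t.strip(" -").lower()
            for t in latin
            if isinstance(t, str) and t.strip(" -")
        }
        if tokens:
            cleaned[thai] = sorted(tokens)
    return cleaned


raw = {
    "ไป": ["Pai", " bpai ", "pai", "-"],
    "มา": "ma",    # malformed: a string instead of a list
    "ดี": ["dee"],  # not requested
    "ใน": [],      # empty list
}
print(clean_batch_result(raw, ["ไป", "มา", "ใน"]))
```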





In [5]:
# --- Warn of remaining work ---

if pending_words:
    print(f"\nWarning: The following {len(pending_words)} words have no valid transliterations:")
    for word in pending_words:
        print(f"- {word}")
else:
    print("\nAll words have at least one valid transliteration.")


All words have at least one valid transliteration.


In [6]:
# --- Display Sample Results ---

print("Sample of transliterations:")
# Show the first 10 entries (dicts preserve insertion order)
for word, transliterations in list(results_dict.items())[:10]:
    print(f"- {word}: {transliterations}")

Sample of transliterations:
- ที่: ['tee', 'thi']
- ไม่: ['my', 'mai', 'mhai']
- ใน: ['nai']
- มี: ['mee', 'mi']
- และ: ['lae', 'le']
- ของ: ['kong', 'khong']
- ได้: ['dai']
- เป็น: ['pen', 'bpen']
- มา: ['ma', 'maa']
- ไป: ['bpai', 'pai']


In [7]:
# --- Save Results to File ---

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(results_dict, f, ensure_ascii=False, indent=2)

print(f"\nSaved all transliterations to: {OUTPUT_PATH}")


Saved all transliterations to: ../artifacts/llm_transliteration/gpt-4.1-mini.json
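
Because the results are written only once, at the end of the run, an interruption during the roughly hour-long loop would lose everything gathered in that session, and the resume logic would have nothing new to load. One option is to checkpoint after every batch. A minimal sketch, assuming a hypothetical `save_checkpoint` helper that writes to a temporary file and then renames it, so that a crash mid-write cannot leave a truncated JSON file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(results: dict, path: Path) -> None:
    """Write results as UTF-8 JSON via a temp file, then atomically replace path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
        os.replace(tmp_name, path)  # atomic when both are on the same filesystem
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Demo in a throwaway directory (not the notebook's artifacts folder).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.json"
    save_checkpoint({"ไป": ["pai", "bpai"]}, demo_path)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
    leftover_tmp = [p.name for p in Path(tmp_dir).iterdir() if p.suffix == ".tmp"]

print(reloaded)
```

Calling `save_checkpoint(results_dict, OUTPUT_PATH)` after each batch is merged would let the resume cell pick up where an interrupted run stopped.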
