# Extracting Rows with Specific Transliterations from the TLA Corpus

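This notebook loads a subset of the Thesaurus Linguae Aegyptiae (TLA) corpus, normalizes the transliterations of the Egyptian text to an ASCII-friendly form, and extracts the rows whose normalized transliteration contains **both** `3h3=k` and `nkht=k` (the normalized forms of `ꜥḥꜥ=k` and `nḫt=k` under the mapping defined in step 2).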

It is designed for research in historical linguistics or digital philology of Earlier Egyptian texts.

---

## Overview

1. Load the corpus from HuggingFace Datasets  
2. Normalize transliterations using a custom mapping  
3. Search for rows containing both specific normalized terms  
4. Display the filtered results

---


In [None]:
!pip install datasets pandas


In [None]:
# 1. Load TLA dataset from HuggingFace
from datasets import load_dataset
import pandas as pd

# Load dataset
dataset = load_dataset("thesaurus-linguae-aegyptiae/tla-Earlier_Egyptian_original-v18-premium", split="train")
df = dataset.to_pandas()


In [None]:
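# 2. Define normalization rules (a simplified, custom ASCII mapping)
def normalize(text):
    """Replace selected Egyptological transliteration signs with ASCII stand-ins."""
    return (
        text.replace("ꞽ", "E")    # yod
            .replace("ꜣ", "A")    # aleph
            .replace("ꜥ", "3")    # ayin
            .replace("ḫ", "kh")
            .replace("ṯ", "tj")
            .replace("š", "sh")
            .replace("ẖ", "x")
            .replace("ḥ", "h")    # lossy: becomes identical to plain h
    )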

# Apply normalization
df["transliteration_norm"] = df["transliteration"].fillna("").apply(normalize)
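
Before running the filter on the full corpus, it can help to confirm that the mapping really produces the two search strings. The sketch below re-declares the same mapping as a dictionary (`MAPPING` is a name introduced here for illustration, not part of the notebook) and applies it to three short sample words. Note that under this mapping `3` stands for `ꜥ`, not `ꜣ`, which is written `A`.

```python
# Same replacements as normalize() in this notebook, expressed as a dict.
# MAPPING is a hypothetical helper name used only in this sketch.
MAPPING = {
    "ꞽ": "E",
    "ꜣ": "A",
    "ꜥ": "3",
    "ḫ": "kh",
    "ṯ": "tj",
    "š": "sh",
    "ẖ": "x",
    "ḥ": "h",
}

def normalize(text):
    """Apply every replacement in MAPPING to the input string."""
    for src, dst in MAPPING.items():
        text = text.replace(src, dst)
    return text

# Sample words chosen for illustration; they are not rows from the corpus.
samples = ["ꜥḥꜥ=k", "nḫt=k", "ꜥḥꜣ=k"]
normalized = [normalize(s) for s in samples]

for original, result in zip(samples, normalized):
    print(f"{original} -> {result}")
```

The third sample shows why the distinction matters: `ꜥḥꜣ=k` normalizes to `3hA=k`, so it is not matched by a search for `3h3=k`.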


In [None]:
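# 3. Filter rows containing both '3h3=k' and 'nkht=k'
# regex=False treats the search terms as literal substrings, not regular expressions
norm = df["transliteration_norm"]
mask = norm.str.contains("3h3=k", regex=False) & norm.str.contains("nkht=k", regex=False)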
df_filtered = df[mask]
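
A substring search also matches longer words that merely contain the search term. If whole-word matches are wanted, one possible refinement is sketched below. It assumes that words in the transliteration are separated by whitespace, which should be checked against the actual data. `has_token` is a hypothetical helper, and the sample strings are invented for the demonstration.

```python
import re

def has_token(text, token):
    """Return True if `token` occurs as a whole whitespace-delimited word in `text`."""
    # (?<!\S) and (?!\S) require a string boundary or whitespace on each side.
    pattern = r"(?<!\S)" + re.escape(token) + r"(?!\S)"
    return re.search(pattern, text) is not None

# Invented strings for illustration only, not rows from the corpus.
rows = [
    "Ew 3h3=k nkht=k",    # both terms present as whole words
    "s3h3=k nkht=k",      # first term only appears inside a longer word
    "3h3=k nkht=kwj",     # second term only appears inside a longer word
    "nkht=k",             # first term missing
]

matches = [r for r in rows if has_token(r, "3h3=k") and has_token(r, "nkht=k")]
print(matches)
```

On a DataFrame, the same check could be applied row by row, for example with `df["transliteration_norm"].apply(lambda t: has_token(t, "3h3=k"))`, at the cost of being slower than the vectorized substring search.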


In [None]:
# 4. Display results
df_filtered[["transliteration", "transliteration_norm"]].head()

### Sample Output

This table shows entries that contain both `3h3=k` and `nkht=k` in their normalized transliteration.

| transliteration        | transliteration_norm |
|------------------------|----------------------|
| ...                    | ...                  |

---


## Notes

- This notebook assumes a basic understanding of Middle Egyptian transliteration.
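- The normalization scheme is simplified: it does not cover every transliteration sign (for example, `ḏ` is left unchanged), and it maps `ḥ` to `h`, so the two can no longer be told apart after normalization.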
- Make sure `datasets` and `pandas` are installed in your environment.
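
To see which signs the simplified mapping leaves untouched, one option is to count the non-ASCII characters that remain after normalization. The sketch below does this for a short list of invented strings. `leftover_characters` is a hypothetical helper, and in the notebook the input would be the values of `df["transliteration_norm"]`.

```python
from collections import Counter

def leftover_characters(texts):
    """Count every non-ASCII character that remains in the given strings."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if not ch.isascii())
    return counts

# Invented, already-normalized strings for illustration only.
normalized_texts = ["ḏd=f", "3h3=k", "wḏA"]

leftover = leftover_characters(normalized_texts)
for char, count in leftover.most_common():
    print(f"{char!r}: {count}")
```

Any character reported here is a candidate for an additional rule in `normalize`, or a sign that the search terms should be written with the original character.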

## Author
Mio Ohashi  
April 2025

