# 🪶 Kanien’kéha Tokenizer Demo

This demo loads vocabulary rules and a sample data template for the Akwesasne dialect of Kanien’kéha (Mohawk), processes them into morpheme segments, and prepares data for tokenizer training and evaluation.

**Author:** MoniGarr  
**Mission:** AI Research Residency Qualification · Onkwehonwehneha NLP

In [None]:
# 🔧 Install Required Packages
!pip install pyyaml pandas

In [None]:
# 📦 Import Modules
import yaml
import json
import pandas as pd
from pathlib import Path

In [None]:
# 📂 Load Vocab Rules (YAML)
with open("kanienkeha_vocab_rules.yaml", "r", encoding="utf-8") as file:
    vocab_rules = yaml.safe_load(file)
vocab_rules.keys()

In [None]:
# 📂 Load Sample Data Template (JSON)
with open("data_template.json", "r", encoding="utf-8") as file:
    sample_data = json.load(file)
df = pd.DataFrame(sample_data)
df[['input_text', 'translation', 'morpheme_gloss']]
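
Because the flattening step below indexes `id`, `input_text`, `translation`, and `morpheme_gloss` directly, an entry missing any of them would raise a `KeyError`. The following is a minimal sketch of a pre-flight check, not part of the original notebook. The helper name `find_invalid_entries`, the `REQUIRED_FIELDS` tuple, and the placeholder records are assumptions made for illustration.

```python
# Hypothetical list of required fields, inferred from the keys the notebook reads directly.
REQUIRED_FIELDS = ("id", "input_text", "translation", "morpheme_gloss")

def find_invalid_entries(entries):
    """Return (position, problem) pairs for entries that would break flattening."""
    problems = []
    for pos, entry in enumerate(entries):
        missing = [key for key in REQUIRED_FIELDS if key not in entry]
        if missing:
            problems.append((pos, f"missing fields: {', '.join(missing)}"))
        elif not isinstance(entry["morpheme_gloss"], list):
            problems.append((pos, "morpheme_gloss is not a list"))
    return problems

# Placeholder records for illustration only; these are not real language data.
demo_entries = [
    {"id": "ex-1", "input_text": "w1 w2", "translation": "t1", "morpheme_gloss": ["m1", "m2"]},
    {"id": "ex-2", "input_text": "w3", "translation": "t2"},
    {"id": "ex-3", "input_text": "w4", "translation": "t3", "morpheme_gloss": "m3"},
]
problems = find_invalid_entries(demo_entries)
print(problems)
```

In the notebook, the same helper could be called as `find_invalid_entries(sample_data)` before building the morphemes dataset.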

## 🧠 Build Morphemes Dataset

In [None]:
# 🔁 Flatten each entry into one JSON record per morpheme
morpheme_records = []

for entry in sample_data:
    # Treat a missing or null "morpheme_tags" field as an empty list.
    tags = entry.get("morpheme_tags") or []
    for idx, morph in enumerate(entry["morpheme_gloss"]):
        morpheme_records.append({
            "source_id": entry["id"],
            "surface_form": morph,
            "token_index": idx,
            "input_text": entry["input_text"],
            "translation": entry["translation"],
            # Morphemes without a matching tag get None.
            "tag": tags[idx] if idx < len(tags) else None,
            "stem_type": entry.get("stem_type"),
            "category": entry.get("category"),
            "usage_context": entry.get("usage_context"),
            "validation_status": entry.get("validation_status"),
        })

with open("morphemes.json", "w", encoding="utf-8") as f:
    json.dump(morpheme_records, f, ensure_ascii=False, indent=2)

print("✅ morphemes.json created.")
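
The flattening loop assigns `None` whenever an entry has fewer tags than morphemes, so misaligned annotations pass through silently. As a rough sketch (the helper name `find_tag_mismatches` and the placeholder records are hypothetical, not from the original notebook), such entries could be listed for manual review like this:

```python
def find_tag_mismatches(entries):
    """Return (id, n_morphemes, n_tags) for entries whose counts differ."""
    mismatches = []
    for entry in entries:
        morphemes = entry["morpheme_gloss"]
        # A missing or null tag list counts as zero tags.
        tags = entry.get("morpheme_tags") or []
        if len(morphemes) != len(tags):
            mismatches.append((entry["id"], len(morphemes), len(tags)))
    return mismatches

# Placeholder records for illustration only; these are not real language data.
demo_tagged_entries = [
    {"id": "ex-1", "morpheme_gloss": ["m1", "m2"], "morpheme_tags": ["T1", "T2"]},
    {"id": "ex-2", "morpheme_gloss": ["m1", "m2", "m3"], "morpheme_tags": ["T1"]},
    {"id": "ex-3", "morpheme_gloss": ["m1"]},
]
mismatches = find_tag_mismatches(demo_tagged_entries)
print(mismatches)
```

Running it on `sample_data` would show which source entries need their tags completed before the records are used for training.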

## 📊 Visual Inspection

In [None]:
# 🧾 Load morphemes and inspect sample
morphemes_df = pd.read_json("morphemes.json")
morphemes_df.head(10)
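
Beyond eyeballing the first rows, a quick tag count can show how much of the dataset is still untagged. This is a minimal sketch using only the standard library. The helper name `tag_coverage` and the placeholder records are assumptions for illustration.

```python
from collections import Counter

def tag_coverage(records):
    """Count morpheme records per tag; untagged records are counted under None."""
    return Counter(record.get("tag") for record in records)

# Placeholder records for illustration only; these are not real language data.
demo_records = [
    {"surface_form": "m1", "tag": "T1"},
    {"surface_form": "m2", "tag": "T2"},
    {"surface_form": "m3", "tag": "T1"},
    {"surface_form": "m4", "tag": None},
]
coverage = tag_coverage(demo_records)
untagged_share = coverage[None] / sum(coverage.values())
print(coverage.most_common())
print(f"untagged share: {untagged_share:.2f}")
```

With the DataFrame already loaded, `morphemes_df["tag"].value_counts(dropna=False)` should give a comparable summary.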

## 🤗 Export Hugging Face Tokenizer Configuration

In [None]:
tokenizer_config = {
    "lang": "kanienkeha",
    "name": "kanienkeha-akwesasne-tokenizer",
    "tokenizer_class": "PreTrainedTokenizerFast",
    "do_lower_case": False,
    "model_max_length": 128,
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "sep_token": "<sep>",
    "cls_token": "<cls>"
}

with open("tokenizer_config.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_config, f, indent=2)
print("✅ tokenizer_config.json generated.")
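
The configuration names the special tokens but does not assign any IDs. As an illustration of how the morpheme records and special tokens could be combined, here is a sketch of a plain morpheme-level lookup vocabulary. It is not a trained subword model and not necessarily the approach the notebook intends. The helpers `build_vocab` and `encode`, the ordering of special tokens, and the placeholder records are all assumptions.

```python
def build_vocab(records, special_tokens):
    """Assign IDs to special tokens first, then to surface forms in first-seen order."""
    vocab = {}
    for token in special_tokens:
        vocab.setdefault(token, len(vocab))
    for record in records:
        vocab.setdefault(record["surface_form"], len(vocab))
    return vocab

def encode(morphemes, vocab, unk_token="<unk>"):
    """Map morphemes to IDs, falling back to the unknown-token ID."""
    return [vocab.get(morph, vocab[unk_token]) for morph in morphemes]

# Special tokens taken from the config above; their order here is an assumption.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>", "<sep>", "<cls>"]

# Placeholder records for illustration only; these are not real language data.
demo_vocab_records = [
    {"surface_form": "m1"},
    {"surface_form": "m2"},
    {"surface_form": "m1"},
]
demo_vocab = build_vocab(demo_vocab_records, SPECIAL_TOKENS)
demo_ids = encode(["m2", "m9", "m1"], demo_vocab)
print(demo_vocab)
print(demo_ids)
```

Placing the special tokens first keeps their IDs stable when new morphemes are added to the dataset later.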

## 📤 Ready for Upload

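You now have:
- `morphemes.json`: one record per morpheme, with its source entry ID, position, tag, and the parent entry's text and translation
- `tokenizer_config.json`: special-token and length settings in the format Hugging Face tokenizers read

Both files can be uploaded to the Hugging Face Hub or used as inputs for tokenizer training. `tokenizer_config.json` holds settings only; loading a `PreTrainedTokenizerFast` also requires a trained tokenizer file (typically `tokenizer.json`).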