# 🪶 Kanien’kéha Tokenizer Demo

This demo loads vocabulary rules and a sample data template for the Akwesasne dialect of Kanien’kéha (Mohawk), processes them into morpheme segments, and prepares data for tokenizer training and evaluation.

**Author:** MoniGarr  
**Mission:** AI Research Residency Qualification · Onkwehonwehneha NLP

In [5]:
# 🔧 Install Required Packages
!pip install pyyaml pandas



In [6]:
# 📦 Import Modules
import yaml
import json
import pandas as pd
from pathlib import Path

In [24]:
# 📂 Load Vocab Rules (YAML)
with open("/Github/mini-indig-llm-kit/datasets/kanienkeha_vocab_extended_rules.yaml", "r", encoding="utf-8") as file:
    vocab_rules = yaml.safe_load(file)
vocab_rules.keys()

dict_keys(['language', 'type', 'dialects', 'description'])
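Before relying on the rules file, it can help to confirm that the expected top-level keys are present. The following is a minimal sketch, not part of the original notebook: `check_rule_keys` and `EXPECTED_KEYS` are hypothetical names, and the expected key set is taken from the output above rather than from a published schema.

```python
# Hypothetical helper: the expected keys mirror the dict_keys output shown above.
EXPECTED_KEYS = {"language", "type", "dialects", "description"}

def check_rule_keys(rules, expected=EXPECTED_KEYS):
    """Return the sorted list of expected top-level keys missing from `rules`."""
    return sorted(set(expected) - set(rules))

# Stand-in for the loaded YAML; the values are placeholders, not real rule data.
example_rules = {"language": "...", "type": "...", "dialects": [], "description": "..."}
missing = check_rule_keys(example_rules)
print(missing or "All expected keys present.")
```

In the notebook itself, `check_rule_keys(vocab_rules)` could be called right after loading the YAML file.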

In [26]:
# 📂 Load Sample Data Template (JSON)
with open("/Github/mini-indig-llm-kit/datasets/data_template.json", "r", encoding="utf-8") as file:
    sample_data = json.load(file)
df = pd.DataFrame(sample_data)
df[['input_text', 'translation', 'morpheme_gloss']]

| | input_text | translation | morpheme_gloss |
|---|---|---|---|
| 0 | Kenòn:we’s | I like it. | [ke-, nòn:we, -’s] |
| 1 | Wakenòn:we’ | I liked it. | [wa-, ke-, nòn:we, -’] |
| 2 | Ronòn:we’s | He likes it. | [ro-, nòn:we, -’s] |
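One way to sanity-check a segmentation is to strip the boundary hyphens, rejoin the morphemes, and compare the result with the original word. The sketch below uses the three rows shown above; `rejoin` and `segmentation_matches` are hypothetical helper names. It assumes that morphemes concatenate without sound changes at their boundaries, so a word where that assumption fails would be flagged and need manual review.

```python
import unicodedata

def rejoin(morphemes):
    """Strip boundary hyphens from each morpheme and concatenate the pieces."""
    return "".join(m.strip("-") for m in morphemes)

def segmentation_matches(input_text, morphemes):
    """Compare rejoined morphemes with the word, ignoring case and Unicode form."""
    def normalize(text):
        return unicodedata.normalize("NFC", text).lower()
    return normalize(rejoin(morphemes)) == normalize(input_text)

# Rows copied from the sample data displayed above.
rows = [
    ("Kenòn:we’s", ["ke-", "nòn:we", "-’s"]),
    ("Wakenòn:we’", ["wa-", "ke-", "nòn:we", "-’"]),
    ("Ronòn:we’s", ["ro-", "nòn:we", "-’s"]),
]

for word, morphs in rows:
    print(word, segmentation_matches(word, morphs))
```

Normalizing to NFC guards against the same accented letter being stored once as a single code point and once as a base letter plus a combining accent.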


## 🧠 Build Morphemes Dataset

In [27]:
# 🔁 Flatten each entry into one JSON record per morpheme
morpheme_records = []

for entry in sample_data:
    # Tags are optional; a morpheme with no matching tag gets None.
    tags = entry.get("morpheme_tags", [])
    for idx, morph in enumerate(entry["morpheme_gloss"]):
        morpheme_records.append({
            "source_id": entry["id"],
            "surface_form": morph,
            "token_index": idx,
            "input_text": entry["input_text"],
            "translation": entry["translation"],
            "tag": tags[idx] if idx < len(tags) else None,
            "stem_type": entry.get("stem_type"),
            "category": entry.get("category"),
            "usage_context": entry.get("usage_context"),
            "validation_status": entry.get("validation_status")
        })

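# Create the output directory if it does not exist yet, then write the records.
output_path = Path("/Github/mini-indig-llm-kit/datasets/generated/morphemes_extended.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
with output_path.open("w", encoding="utf-8") as f:
    json.dump(morpheme_records, f, ensure_ascii=False, indent=2)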

print("✅ morphemes_extended.json created.")

✅ morphemes_extended.json created.
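The flattening step stores `None` as the tag whenever an entry has fewer tags than morphemes, so a mismatch passes silently. A small check like the one below could surface such entries before export. This is a sketch under assumptions: `find_tag_mismatches` is a hypothetical helper, and the second example entry is made up to show a missing tag.

```python
def find_tag_mismatches(entries):
    """Return (id, n_morphemes, n_tags) for entries whose counts differ."""
    mismatches = []
    for entry in entries:
        n_morphs = len(entry["morpheme_gloss"])
        n_tags = len(entry.get("morpheme_tags", []))
        if n_morphs != n_tags:
            mismatches.append((entry["id"], n_morphs, n_tags))
    return mismatches

# Entry 1 follows the sample data; entry 99 is invented to show a missing tag.
entries = [
    {"id": 1, "morpheme_gloss": ["ke-", "nòn:we", "-’s"],
     "morpheme_tags": ["1SG", "like", "habitual"]},
    {"id": 99, "morpheme_gloss": ["ro-", "nòn:we", "-’s"],
     "morpheme_tags": ["3SG.M", "like"]},
]

print(find_tag_mismatches(entries))
```

In the notebook, `find_tag_mismatches(sample_data)` would run the same check on the loaded template.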


## 📊 Visual Inspection

In [30]:
# 🧾 Load morphemes and inspect sample
morphemes_df = pd.read_json("/Github/mini-indig-llm-kit/datasets/generated/morphemes_extended.json")
morphemes_df.head(10)

| | source_id | surface_form | token_index | input_text | translation | tag | stem_type | category | usage_context | validation_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ke- | 0 | Kenòn:we’s | I like it. | 1SG | C-STEM | blue | everyday expression | verified |
| 1 | 1 | nòn:we | 1 | Kenòn:we’s | I like it. | like | C-STEM | blue | everyday expression | verified |
| 2 | 1 | -’s | 2 | Kenòn:we’s | I like it. | habitual | C-STEM | blue | everyday expression | verified |
| 3 | 2 | wa- | 0 | Wakenòn:we’ | I liked it. | PAST | C-STEM | blue | past tense storytelling | pending_review |
| 4 | 2 | ke- | 1 | Wakenòn:we’ | I liked it. | 1SG | C-STEM | blue | past tense storytelling | pending_review |
| 5 | 2 | nòn:we | 2 | Wakenòn:we’ | I liked it. | like | C-STEM | blue | past tense storytelling | pending_review |
| 6 | 2 | -’ | 3 | Wakenòn:we’ | I liked it. | PUNC | C-STEM | blue | past tense storytelling | pending_review |
| 7 | 3 | ro- | 0 | Ronòn:we’s | He likes it. | 3SG.M | C-STEM | blue | basic sentence practice | needs_validation |
| 8 | 3 | nòn:we | 1 | Ronòn:we’s | He likes it. | like | C-STEM | blue | basic sentence practice | needs_validation |
| 9 | 3 | -’s | 2 | Ronòn:we’s | He likes it. | habitual | C-STEM | blue | basic sentence practice | needs_validation |
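Since the records carry a `validation_status`, it may be useful to see how many are in each state and to set aside only the verified ones. The sketch below uses just the standard library; `status_counts` and `verified_only` are hypothetical helpers, and the records are abbreviated to the two relevant fields from the table above.

```python
from collections import Counter

def status_counts(records):
    """Count records per validation_status value."""
    return Counter(r["validation_status"] for r in records)

def verified_only(records):
    """Keep only the records marked as verified."""
    return [r for r in records if r["validation_status"] == "verified"]

# (surface_form, validation_status) pairs copied from the table above.
pairs = [
    ("ke-", "verified"), ("nòn:we", "verified"), ("-’s", "verified"),
    ("wa-", "pending_review"), ("ke-", "pending_review"),
    ("nòn:we", "pending_review"), ("-’", "pending_review"),
    ("ro-", "needs_validation"), ("nòn:we", "needs_validation"),
    ("-’s", "needs_validation"),
]
records = [{"surface_form": s, "validation_status": v} for s, v in pairs]

print(status_counts(records))
print([r["surface_form"] for r in verified_only(records)])
```

The same functions accept `morpheme_records` from the flattening cell, because they only read the `validation_status` key.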


## 🤗 Export Hugging Face Tokenizer Configuration

In [31]:
tokenizer_config = {
    "lang": "kanienkeha",
    "name": "kanienkeha-akwesasne-tokenizer",
    "tokenizer_class": "PreTrainedTokenizerFast",
    "do_lower_case": False,
    "model_max_length": 128,
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "sep_token": "<sep>",
    "cls_token": "<cls>"
}

with open("/Github/mini-indig-llm-kit/datasets/generated/tokenizer_extended_config.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_config, f, indent=2)
print("✅ tokenizer_extended_config.json generated.")

✅ tokenizer_extended_config.json generated.
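The configuration names the special tokens but does not define a vocabulary. One possible next step, which is not part of the original notebook, is to assign integer ids to the special tokens and then to each distinct morpheme. In this sketch `build_vocab` is a hypothetical helper, and both the ordering of the special tokens and the first-seen ordering of the morphemes are arbitrary choices.

```python
def build_vocab(special_tokens, morphemes):
    """Assign ids to special tokens first, then to morphemes in first-seen order."""
    vocab = {}
    for token in list(special_tokens) + list(morphemes):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

# Special tokens as named in tokenizer_config; their order here is an assumption.
special_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<sep>", "<cls>"]

# Surface forms in the order they appear in the flattened morpheme records.
morphemes = ["ke-", "nòn:we", "-’s", "wa-", "ke-", "nòn:we", "-’",
             "ro-", "nòn:we", "-’s"]

vocab = build_vocab(special_tokens, morphemes)
for token, token_id in vocab.items():
    print(token_id, token)
```

Putting the special tokens first keeps their ids stable when more morphemes are added later.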


## 📤 Ready for Upload

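You now have:
- `morphemes_extended.json`: Structured token data, one record per morpheme
- `tokenizer_extended_config.json`: Hugging Face-style tokenizer configuration (special tokens and maximum length)

Both files are ready for upload to the Hugging Face Hub or for further tokenizer training.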
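An upload could look roughly like the sketch below, which uses `HfApi.upload_file` from the `huggingface_hub` package. Several things here are assumptions rather than part of the original notebook: the helper names `plan_uploads` and `upload_artifacts`, the choice of a dataset repository, and the placeholder repository id. It also assumes the package is installed and that you have already logged in.

```python
from pathlib import Path

# Output directory and file names as written by the cells above.
GENERATED_DIR = Path("/Github/mini-indig-llm-kit/datasets/generated")
ARTIFACTS = ["morphemes_extended.json", "tokenizer_extended_config.json"]

def plan_uploads(generated_dir=GENERATED_DIR, artifacts=ARTIFACTS):
    """Pair each local file path with the name it would get in the Hub repo."""
    return [(Path(generated_dir) / name, name) for name in artifacts]

def upload_artifacts(repo_id, repo_type="dataset"):
    """Upload each artifact; needs `pip install huggingface_hub` and a prior login."""
    from huggingface_hub import HfApi  # imported lazily so planning works without it

    api = HfApi()
    for local_path, path_in_repo in plan_uploads():
        api.upload_file(
            path_or_fileobj=str(local_path),
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type=repo_type,
        )

# Show what would be uploaded without touching the network.
for local_path, path_in_repo in plan_uploads():
    print(local_path, "->", path_in_repo)

# upload_artifacts("your-username/your-repo")  # placeholder id; uncomment to upload
```

Separating the plan from the upload makes it possible to review the file list before anything is published.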