# NaijaML Explorer

Interactive notebook to explore all NaijaML features.

**Tasks covered:**
- Task 1 & 2: Dataset loaders
- Task 3: Nigerian constants & text preprocessing
- Task 4: Language detection
- Task 5: Yorùbá & Igbo diacritization

---
## Task 1 & 2: Dataset Loaders

Load Nigerian NLP datasets with a simple API.

In [None]:
from naijaml.data import load_dataset, list_datasets, dataset_info

In [None]:
# See all available datasets
list_datasets()

In [None]:
# Get info about a dataset (without downloading)
dataset_info("naijasenti")

### NaijaSenti (Sentiment Analysis)

In [None]:
# Load Yorùbá sentiment data
yor_sentiment = load_dataset("naijasenti", lang="yor", split="train")
print(f"Loaded {len(yor_sentiment)} Yorùbá samples")
yor_sentiment[:3]

In [None]:
# Try other languages: hau, ibo, pcm
# YOUR EXPLORATION HERE


### MasakhaNER (Named Entity Recognition)

In [None]:
# Load Hausa NER data
hau_ner = load_dataset("masakhaner", lang="hau", split="test")
print(f"Loaded {len(hau_ner)} Hausa NER samples")
hau_ner[0]

In [None]:
# Visualize NER tags
sample = hau_ner[0]
for token, tag in zip(sample["tokens"], sample["ner_tags"]):
    if tag != "O":
        print(f"{token:20} → {tag}")
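
The loop above prints one tag per token. To see whole entities instead, consecutive tags can be merged into spans. The following is a minimal sketch that assumes the tags are BIO-style strings such as `B-PER` and `I-PER` (the comparison with `"O"` above suggests string tags). `group_entities` and the toy sentence are hypothetical and not part of NaijaML.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then open a new one.
            # A stray I- tag without a matching B- is treated as a start.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" (outside) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Toy example (made up for illustration, not taken from the dataset)
toy_tokens = ["Shugaba", "Bola", "Tinubu", "ya", "isa", "Kano", "jiya"]
toy_tags = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "B-DATE"]
for text, label in group_entities(toy_tokens, toy_tags):
    print(f"{text:20} {label}")
```

If the loader returns integer tag IDs rather than strings, they would need to be mapped to label names first.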

### MasakhaNEWS (News Classification)

In [None]:
# Load Hausa news data
hau_news = load_dataset("masakhanews", lang="hau", split="test")
print(f"Loaded {len(hau_news)} Hausa news articles")
hau_news[0]

In [None]:
# Count by category
from collections import Counter
Counter(item["label"] for item in hau_news)

---
## Task 3: Nigerian Constants & Preprocessing

### Nigerian Constants

In [None]:
from naijaml.utils import (
    STATES, STATE_NAMES, LGAS, BANKS, TELCOS,
    format_naira, parse_naira,
    is_valid_phone, normalize_phone, get_telco,
    is_valid_bvn, is_valid_nin
)

In [None]:
# Nigerian states and capitals
print(f"{len(STATES)} states + FCT")
print(f"Lagos capital: {STATES['Lagos']}")
print(f"Kano capital: {STATES['Kano']}")

In [None]:
# LGAs
print("Lagos LGAs:", LGAS["Lagos"][:5], "...")

In [None]:
# Nigerian banks
print("Banks:", list(BANKS.keys())[:10])

In [None]:
# Telcos and their prefixes
for telco, info in TELCOS.items():
    print(f"{telco}: {info['prefixes'][:3]}...")

### Naira Formatting

In [None]:
# Format amounts
print(format_naira(1500000))
print(format_naira(50000, include_kobo=False))

In [None]:
# Parse amounts
print(parse_naira("₦1,500,000.00"))
print(parse_naira("NGN 50,000"))
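
To make the formatting and parsing steps concrete, here is a standard-library sketch of the same idea. It is not NaijaML's implementation: `format_naira_sketch` and `parse_naira_sketch` are hypothetical names, and the exact output format (the `₦` prefix, two decimal places for kobo) is an assumption.

```python
import re
from decimal import Decimal


def format_naira_sketch(amount, include_kobo=True):
    """Format a number as a naira string with thousands separators."""
    value = Decimal(str(amount))
    if include_kobo:
        return f"₦{value:,.2f}"
    return f"₦{value:,.0f}"


def parse_naira_sketch(text):
    """Parse strings such as '₦1,500,000.00' or 'NGN 50,000' into a Decimal."""
    # Drop the currency marker, thousands separators, and whitespace.
    cleaned = re.sub(r"[₦,\s]|NGN", "", text, flags=re.IGNORECASE)
    return Decimal(cleaned)


print(format_naira_sketch(1500000))
print(format_naira_sketch(50000, include_kobo=False))
print(parse_naira_sketch("NGN 50,000"))
```

`Decimal` is used rather than `float` so that kobo values are not affected by binary rounding. Shorthand such as `₦100k` is not handled by this sketch.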

### Phone Number Utilities

In [None]:
# Validate phone numbers
test_phones = ["08031234567", "+2348012345678", "12345", "09012345678"]
for phone in test_phones:
    valid = is_valid_phone(phone)
    telco = get_telco(phone) if valid else None
    # str() keeps "True"/"False"; a bare bool with a width spec prints as 1/0
    print(f"{phone:20} valid={str(valid):5} telco={telco}")

In [None]:
# Normalize to international format
print(normalize_phone("08031234567"))
print(normalize_phone("0803-123-4567"))
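
Normalization to international format can be sketched with the standard library as follows. It assumes the target format is `+234` followed by the ten-digit national number; `normalize_phone_sketch` is a hypothetical helper, and NaijaML's own output format may differ.

```python
import re


def normalize_phone_sketch(raw):
    """Return +234XXXXXXXXXX for a Nigerian mobile number, or None if malformed."""
    digits = re.sub(r"\D", "", raw)  # drop "+", spaces, dashes, and so on
    if digits.startswith("234") and len(digits) == 13:
        national = digits[3:]   # already has the country code
    elif digits.startswith("0") and len(digits) == 11:
        national = digits[1:]   # local format: drop the leading zero
    else:
        return None
    return "+234" + national


for phone in ["08031234567", "0803-123-4567", "+2348012345678", "12345"]:
    print(f"{phone:20} {normalize_phone_sketch(phone)}")
```

This only checks length and the leading digits. It does not verify that the prefix belongs to a real operator, which is what a lookup against something like `TELCOS` would add.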

### Text Preprocessing

In [None]:
from naijaml.nlp import (
    clean_nigerian_text, clean_social_media,
    mask_pii, find_phones, find_naira_amounts,
    normalize_unicode, strip_diacritics,
    extract_hashtags, extract_mentions
)

In [None]:
# Clean social media text
tweet = "@user Check https://t.co/abc This film too sweet!!! #Nollywood #NaijaFilm"
print("Original:", tweet)
print("Cleaned: ", clean_social_media(tweet))
print("Hashtags:", extract_hashtags(tweet))
print("Mentions:", extract_mentions(tweet))

In [None]:
# Mask PII (personally identifiable information)
text_with_pii = "Call me on 08012345678 or email test@example.com for the ₦50,000 deal"
print("Original:", text_with_pii)
print("Masked:  ", mask_pii(text_with_pii))
print("Phones:  ", find_phones(text_with_pii))
print("Amounts: ", find_naira_amounts(text_with_pii))
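
For intuition, PII masking of this kind can be approximated with two regular expressions. This is a rough sketch, not NaijaML's implementation: the patterns are deliberately loose, and the `[PHONE]` and `[EMAIL]` placeholders and the name `mask_pii_sketch` are assumptions.

```python
import re

# Mobile-looking numbers: optional +234 or leading 0, then 10 national digits
# beginning 70, 71, 80, 81, 90, or 91. Looser than a real prefix table.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?234|0)[789][01]\d{8}(?!\d)")
# A simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii_sketch(text):
    """Replace phone numbers and email addresses with placeholder tokens."""
    text = PHONE_RE.sub("[PHONE]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text


sample = "Call me on 08012345678 or email test@example.com for the ₦50,000 deal"
print(mask_pii_sketch(sample))
```

The lookarounds `(?<!\d)` and `(?!\d)` stop the phone pattern from matching inside longer digit runs such as account numbers.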

In [None]:
# Yorùbá diacritic handling
yoruba_text = "Ọjọ́ dára púpọ̀, ẹ kú iṣẹ́"
print("Original:        ", yoruba_text)
print("Strip diacritics:", strip_diacritics(yoruba_text))

In [None]:
# All-in-one cleaning
messy_text = "@someone Check this ₦100k deal!!! https://bit.ly/xyz Call 08012345678 #Lagos"
print("Original:", messy_text)
print("Cleaned: ", clean_nigerian_text(messy_text, mask_pii_data=True))

---
## Task 4: Language Detection

In [None]:
from naijaml.nlp import (
    detect_language,
    detect_language_with_confidence,
    detect_all_languages,
    SUPPORTED_LANGUAGES
)

print("Supported languages:", SUPPORTED_LANGUAGES)

In [None]:
# Basic detection
samples = [
    "Ọjọ́ dára púpọ̀, ẹ kú iṣẹ́",        # Yorùbá
    "Ina kwana, yaya aiki?",             # Hausa
    "Kedu ka ị mere? Ọ dị mma",          # Igbo
    "Wetin dey happen for this country?", # Pidgin
    "The weather is quite pleasant today", # English
]

for text in samples:
    lang = detect_language(text)
    print(f"{lang}: {text[:40]}...")

In [None]:
# Detection with confidence
for text in samples:
    lang, conf = detect_language_with_confidence(text)
    print(f"{lang} ({conf:5.1%}): {text[:35]}...")

In [None]:
# Get all language probabilities
text = "Wetin dey happen?"
scores = detect_all_languages(text)
print(f"Text: '{text}'\n")
for lang, prob in sorted(scores.items(), key=lambda x: -x[1]):
    bar = "█" * int(prob * 30)
    print(f"{lang}: {bar} {prob:.1%}")
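
When the top score is weak, for example with short text that mixes Pidgin and English, it can be safer to fall back to an "unknown" label than to trust the best guess. The sketch below assumes the scores are a mapping from language code to probability, as the loop above suggests. `pick_language`, the threshold, and the toy numbers are hypothetical.

```python
def pick_language(scores, min_confidence=0.6, fallback="unknown"):
    """Return (lang, prob) for the top score, or (fallback, prob) if it is too weak."""
    if not scores:
        return fallback, 0.0
    lang, prob = max(scores.items(), key=lambda item: item[1])
    if prob < min_confidence:
        return fallback, prob
    return lang, prob


# Made-up scores for illustration only
toy_scores = {"pcm": 0.55, "eng": 0.40, "yor": 0.05}
print(pick_language(toy_scores))
print(pick_language(toy_scores, min_confidence=0.5))
```

A suitable threshold depends on the data and is best chosen by checking a labelled sample.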

In [None]:
# YOUR EXPLORATION: Try your own text!
my_text = ""  # <-- put your text here

if my_text:
    lang, conf = detect_language_with_confidence(my_text)
    print(f"Detected: {lang} ({conf:.1%})")
    print("\nAll scores:")
    for l, p in sorted(detect_all_languages(my_text).items(), key=lambda x: -x[1]):
        print(f"  {l}: {p:.1%}")

---
## Task 5: Diacritization

Restore diacritics to plain Yorùbá and Igbo text. Yorùbá uses tonal marks (á, à) and dot-below characters (ọ, ẹ, ṣ) that are often omitted when typing.

### Yorùbá Full Diacritization

Restores both tonal marks (á, à) and dot-below characters (ọ, ẹ, ṣ). It uses **bigram word context** to disambiguate: for example, the word "ile" becomes "ilé" (house) after "inu" but "ilẹ̀" (ground) after "ori".
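
The idea of bigram context with a unigram fallback can be sketched with two lookup tables. This is a toy illustration, not NaijaML's model: the table entries come from the example in this section, the default chosen for "ile" is an arbitrary assumption, and keying the bigram on the undiacritized previous word is a simplification.

```python
import unicodedata

# Toy tables: the default form for each word, and overrides that depend
# on the preceding (undiacritized) word.
UNIGRAMS = {"inu": "inú", "ori": "orí", "ile": "ilé"}
BIGRAMS = {("inu", "ile"): "ilé", ("ori", "ile"): "ilẹ̀"}


def diacritize_sketch(text):
    """Restore diacritics word by word, preferring bigram context when available."""
    restored = []
    prev = None
    for word in text.lower().split():
        # Try (previous word, word) first, then the unigram default,
        # and finally leave unknown words unchanged.
        best = BIGRAMS.get((prev, word)) or UNIGRAMS.get(word, word)
        restored.append(best)
        prev = word
    return unicodedata.normalize("NFC", " ".join(restored))


for phrase in ["inu ile", "ori ile", "ile nla"]:
    print(f"{phrase:10} {diacritize_sketch(phrase)}")
```

In the third phrase "ile" has no preceding word, so the sketch falls back to the unigram table, and "nla" passes through unchanged because it is missing from both tables.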

In [None]:
from naijaml.nlp import diacritize

# Everyday sentences
sentences = [
    "Ojo dara pupo",
    "E ku ise o, bawo ni?",
    "Mo fe ra onje ni oja",
    "Awon omo de n sere ni ona",
    "Oluwa a bukun fun e",
]

for s in sentences:
    print(f"{s:35s} → {diacritize(s)}")

In [None]:
# Bigram context disambiguation
# The same word "ile" gets different diacritics based on context
pairs = [
    ("inu ile",  "'inú ilé' = inside the house"),
    ("ori ile",  "'orí ilẹ̀' = on the ground"),
    ("ile nla",  "no context → unigram default"),
]

for text, note in pairs:
    result = diacritize(text)
    print(f"{text:15s} → {result:15s}  ({note})")

### Yorùbá Dot-Below Only

A simpler, higher-accuracy mode that restores only the dot-below characters (ọ, ẹ, ṣ) and leaves tonal marks out. Use it when reliable output matters more than full tone restoration.

In [None]:
from naijaml.nlp import diacritize_dot_below

sentences = [
    "Ojo dara pupo",
    "E ku ise o",
    "Omo mi dara",
]

for s in sentences:
    full = diacritize(s)
    dots = diacritize_dot_below(s)
    print(f"{s:20s} → full: {full:25s} dot-below: {dots}")

### Igbo Diacritization

Restores dot-below vowels (ị, ọ, ụ) in Igbo text.

In [None]:
from naijaml.nlp import diacritize_igbo

igbo_sentences = [
    "Kedu ka i mere",
    "O di mma",
    "Unu nile ekele",
    "Chineke gozie gi",
]

for s in igbo_sentences:
    print(f"{s:25s} → {diacritize_igbo(s)}")

### Strip Diacritics

Go the other direction: remove diacritics for search, indexing, or comparison.

In [None]:
from naijaml.nlp.diacritizer import strip_diacritics

text = "Ọjọ́ dára púpọ̀, ẹ kú iṣẹ́"
print(f"Original:          {text}")
print(f"Strip all:         {strip_diacritics(text)}")
print(f"Strip tones only:  {strip_diacritics(text, tones_only=True)}")
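
The two stripping modes can be reproduced with Python's standard `unicodedata` module, which helps show what is being removed. This sketch assumes "tones" means the combining grave, acute, and macron marks. `strip_diacritics_sketch` is a hypothetical name, and NaijaML's own function may treat edge cases differently.

```python
import unicodedata

# Combining grave, acute, and macron: the marks treated as tone marks here.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_diacritics_sketch(text, tones_only=False):
    """Remove combining marks; with tones_only=True, keep dot-below characters."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    if tones_only:
        kept = [ch for ch in decomposed if ch not in TONE_MARKS]
    else:
        # "Mn" is the Unicode category for non-spacing (combining) marks.
        kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    # NFC recomposes what is left, e.g. o + dot below becomes a single ọ.
    return unicodedata.normalize("NFC", "".join(kept))


demo = "Ọjọ́ dára púpọ̀, ẹ kú iṣẹ́"
print(strip_diacritics_sketch(demo))
print(strip_diacritics_sketch(demo, tones_only=True))
```

Normalizing to NFD first matters because the same visible character can be stored either precomposed or as a base letter plus marks, and only the decomposed form exposes the marks individually.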

In [None]:
# Try your own text!
my_text = ""  # <-- type undiacritized Yorùbá here

if my_text:
    print(f"Full:      {diacritize(my_text)}")
    print(f"Dot-below: {diacritize_dot_below(my_text)}")

---
## Your Playground

Use the cells below for your own experiments.

In [None]:
# Your code here
