# Wolof NLP Toolkit - Quickstart

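This notebook walks through the core functionality of the Wolof NLP toolkit: word-level and morpheme-level tokenization, language detection, orthography normalization, and morphological analysis.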

## Installation

```bash
pip install wolof-nlp
```

Or, for development, install from source in editable mode:
```bash
git clone https://github.com/maimouna-mbacke/wolof-nlp
cd wolof-nlp
pip install -e .
```
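
Before running the cells below, it can help to confirm that the package is importable. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and the module name `wolof_nlp` is taken from the import statement used in this notebook.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# Prints True after a successful install, False otherwise.
print("wolof_nlp importable:", is_installed("wolof_nlp"))
```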

In [19]:
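import sys
# Prefer the local checkout in ../src over any installed copy
# (the path is resolved relative to the current working directory).
sys.path.insert(0, '../src')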

from wolof_nlp import (
    WolofTokenizer,
    tokenize,
    morphemes,
    normalize,
    analyze_morphology
)

print("Wolof NLP loaded successfully")

Wolof NLP loaded successfully


## 1. Two Tokenization Modes

The toolkit provides two tokenization modes:
- `tokenize()`: Word-level tokenization for standard NLP pipelines
- `morphemes()`: Morpheme-level splitting for linguistic analysis

In [20]:
examples = [
    ("Damay dem.", "I'm going"),
    ("Dafa baax.", "It's good"),
    ("Dinaa lekk.", "I will eat"),
]

print(f"{'Input':<20} {'tokenize()':<25} {'morphemes()'}")
print("-" * 70)
for text, meaning in examples:
    tok_result = tokenize(text)
    morph_result = morphemes(text)
    print(f"{text:<20} {str(tok_result):<25} {morph_result}")

Input                tokenize()                morphemes()
----------------------------------------------------------------------
Damay dem.           ['Damay', 'dem']          ['da', 'ma', 'y', 'dem']
Dafa baax.           ['Dafa', 'baax']          ['Dafa', 'baax']
Dinaa lekk.          ['Dinaa', 'lekk']         ['di', 'na', 'a', 'lekk']
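
To make the difference between the two modes concrete, here is a minimal, self-contained sketch. It does not call the toolkit and is not how the toolkit is implemented: `toy_tokenize`, `toy_morphemes`, and the `TOY_SPLITS` table are hypothetical stand-ins, with the splits copied from the output above.

```python
import re

# Hypothetical lookup table; the splits are copied from the notebook output.
TOY_SPLITS = {
    "damay": ["da", "ma", "y"],
    "dinaa": ["di", "na", "a"],
}

def toy_tokenize(text):
    """Word-level mode: keep each word whole and drop punctuation."""
    return re.findall(r"\w+", text)

def toy_morphemes(text):
    """Morpheme-level mode: expand every word that has a known split."""
    pieces = []
    for word in toy_tokenize(text):
        pieces.extend(TOY_SPLITS.get(word.lower(), [word]))
    return pieces

print(toy_tokenize("Damay dem."))   # ['Damay', 'dem']
print(toy_morphemes("Damay dem."))  # ['da', 'ma', 'y', 'dem']
```

Word-level output suits pipelines that expect one token per word, while the morpheme-level output exposes the pieces fused inside forms such as `Damay`.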


## 2. Language Detection

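With `detect_language=True`, the tokenizer labels each word token with a language: Wolof, French, or Arabic (for loanwords). Senegalese place names are recognized as well; in the output below, `Sénégal` is labeled `WOLOF`.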

In [21]:
tokenizer = WolofTokenizer(normalize=True, detect_language=True)

text = "Sénégal dafa neex trop billahi"
tokens = tokenizer.tokenize(text)

print(f"Input: {text}\n")
for tok in tokens:
    if tok.type.name == 'WORD':
        print(f"  {tok.text:<12} -> {tok.language.name}")

Input: Sénégal dafa neex trop billahi

  Sénégal      -> WOLOF
  dafa         -> WOLOF
  neex         -> WOLOF
  trop         -> FRENCH
  billahi      -> ARABIC
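
One simple way to think about per-token language labels is a lexicon lookup with a default. The sketch below is an illustration only: the notebook does not show how the toolkit detects languages, and `FRENCH_WORDS`, `ARABIC_LOANWORDS`, and `toy_language` are hypothetical names with tiny hand-picked word lists.

```python
# Hypothetical mini-lexicons for illustration; the toolkit's real resources
# and detection method are not shown in this notebook.
FRENCH_WORDS = {"trop", "vraiment", "série"}
ARABIC_LOANWORDS = {"billahi", "wallahi"}

def toy_language(word):
    """Return a language label, defaulting to WOLOF for unlisted words."""
    key = word.lower()
    if key in FRENCH_WORDS:
        return "FRENCH"
    if key in ARABIC_LOANWORDS:
        return "ARABIC"
    return "WOLOF"

for word in "Sénégal dafa neex trop billahi".split():
    print(f"  {word:<12} -> {toy_language(word)}")
```

Because unlisted words fall back to `WOLOF`, a lookup like this is only as good as its word lists.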


## 3. Orthography Normalization

`normalize()` converts informal and non-standard spellings to the standard CLAD (Centre de linguistique appliquée de Dakar) orthography without altering meaning.

In [22]:
examples = [
    ("dieuradieuf", "jërëjëf", "Thank you"),
    ("serigne", "seriñ", "Sir/Mister"),
    ("sokhna", "soxna", "Madam"),
    ("thiep bi neex na", "ceeb bi neex na", "The rice is delicious"),
]

print(f"{'Informal':<20} {'CLAD Standard':<20} {'Meaning'}")
print("-" * 60)
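for informal, expected, meaning in examples:
    # `expected` documents the target spelling; the table shows the
    # actual normalize() output in the "CLAD Standard" column.
    normalized = normalize(informal)
    print(f"{informal:<20} {normalized:<20} {meaning}")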

Informal             CLAD Standard        Meaning
------------------------------------------------------------
dieuradieuf          jërëjëf              Thank you
serigne              seriñ                Sir/Mister
sokhna               soxna                Madam
thiep bi neex na     ceeb bi neex na      The rice is delicious
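
The loop above unpacks an `expected` spelling for each example but never compares it with the result. A small helper can do that comparison. This is a sketch, not part of the toolkit: `find_mismatches` is a hypothetical name, and `toy_normalize` with its `TOY_SPELLINGS` table is a stand-in so the block runs on its own.

```python
def find_mismatches(normalize_fn, cases):
    """Return (informal, expected, actual) for every case that disagrees."""
    mismatches = []
    for informal, expected, _meaning in cases:
        actual = normalize_fn(informal)
        if actual != expected:
            mismatches.append((informal, expected, actual))
    return mismatches

# Stand-in normalizer for illustration only (hypothetical whole-word lookup).
TOY_SPELLINGS = {"sokhna": "soxna", "thiep": "ceeb"}

def toy_normalize(text):
    return " ".join(TOY_SPELLINGS.get(word, word) for word in text.split())

cases = [
    ("sokhna", "soxna", "Madam"),
    ("thiep bi neex na", "ceeb bi neex na", "The rice is delicious"),
    ("serigne", "seriñ", "Sir/Mister"),  # not in the toy table, so it is reported
]
print(find_mismatches(toy_normalize, cases))
# [('serigne', 'seriñ', 'serigne')]
```

In the notebook, passing the toolkit's `normalize` in place of `toy_normalize` would check the real output against the `examples` list.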


## 4. Morphological Analysis

`analyze_morphology()` decomposes a word into its constituent morphemes and labels each one with its grammatical function (for example `ROOT` or `NEGATION`).

In [23]:
words = [
    ("bindkat", "writer (bind + kat)"),
    ("soppiku", "to change oneself (soppi + ku)"),
    ("gisante", "to see each other (gis + ante)"),
    ("defaat", "to do again (def + aat)"),
    ("demul", "didn't go (dem + ul)"),
]

for word, meaning in words:
    morphs = analyze_morphology(word)
    morph_str = " + ".join(f"{m.text}:{m.type.name}" for m in morphs)
    print(f"{word:<12} -> {morph_str}")
    print(f"             ({meaning})\n")

bindkat      -> bind:ROOT + kat:NOMINALIZATION
             (writer (bind + kat))

soppiku      -> soppi:ROOT + ku:REFLEXIVE
             (to change oneself (soppi + ku))

gisante      -> gis:ROOT + ante:RECIPROCAL
             (to see each other (gis + ante))

defaat       -> def:ROOT + aat:REPETITIVE
             (to do again (def + aat))

demul        -> dem:ROOT + ul:NEGATION
             (didn't go (dem + ul))
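
For intuition, the five analyses above can be reproduced with naive longest-suffix stripping. The sketch below assumes that strategy purely for illustration; the notebook does not show the toolkit's actual algorithm. `SUFFIXES` and `toy_analyze` are hypothetical names, and the labels are copied from the output above.

```python
# Suffix labels copied from the output above; the matching strategy is an assumption.
SUFFIXES = {
    "kat": "NOMINALIZATION",
    "ku": "REFLEXIVE",
    "ante": "RECIPROCAL",
    "aat": "REPETITIVE",
    "ul": "NEGATION",
}

def toy_analyze(word):
    """Split off the longest matching suffix; otherwise return a bare root."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)]
            return [(root, "ROOT"), (suffix, SUFFIXES[suffix])]
    return [(word, "ROOT")]

for word in ["bindkat", "soppiku", "gisante", "defaat", "demul", "dem"]:
    labelled = " + ".join(f"{text}:{label}" for text, label in toy_analyze(word))
    print(f"{word:<12} -> {labelled}")
```

A purely string-based rule like this over-segments any word that merely happens to end in one of these letter sequences, which is why a real analyzer needs more than a suffix table.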



## 5. Processing Real-World Text

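This example processes typical Senegalese social media comments that code-switch between Wolof, French, and Arabic. It reuses the `tokenizer` created in section 2 and groups the word tokens of each comment by detected language.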

In [24]:
comments = [
    "Dafa trop neex série bi wallahi",
    "Masha Allah vraiment rafet",
    "Dem naa Dakar tey",
]

for comment in comments:
    tokens = tokenizer.tokenize(comment)
    words = [t for t in tokens if t.type.name == 'WORD']
    
    wolof = [t.text for t in words if t.language and t.language.name == 'WOLOF']
    french = [t.text for t in words if t.language and t.language.name == 'FRENCH']
    arabic = [t.text for t in words if t.language and t.language.name == 'ARABIC']
    
    print(f"Text: {comment}")
    print(f"  Wolof:  {wolof}")
    print(f"  French: {french}")
    print(f"  Arabic: {arabic}\n")

Text: Dafa trop neex série bi wallahi
  Wolof:  ['Dafa', 'neex', 'bi']
  French: ['trop', 'série']
  Arabic: ['wallahi']

Text: Masha Allah vraiment rafet
  Wolof:  ['rafet']
  French: ['vraiment']
  Arabic: ['Allah']

Text: Dem naa Dakar tey
  Wolof:  ['Dem', 'na', 'a', 'Dakar', 'tey']
  French: []
  Arabic: []
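
Once tokens carry language labels, a per-comment language mix is a short step further. The sketch below is self-contained and does not call the toolkit: `language_shares` is a hypothetical helper, and the `(text, language)` pairs are copied from the first comment's output above.

```python
from collections import Counter

def language_shares(tagged_words):
    """Map each language label to its share of the tagged words."""
    counts = Counter(language for _text, language in tagged_words)
    total = sum(counts.values())
    return {language: count / total for language, count in counts.items()}

# (text, language) pairs copied from the first comment's output.
tagged = [
    ("Dafa", "WOLOF"), ("trop", "FRENCH"), ("neex", "WOLOF"),
    ("série", "FRENCH"), ("bi", "WOLOF"), ("wallahi", "ARABIC"),
]
shares = language_shares(tagged)
print(shares["WOLOF"])  # 0.5
```

With toolkit tokens, the pairs could be built from the `words` list in the cell above as `[(t.text, t.language.name) for t in words if t.language]`.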



## Next Steps

- See `02_applications.ipynb` for POS tagging, NER, and sentiment analysis
- See `03_evaluation.ipynb` for evaluation on gold-standard data