# Runyoro Pronunciation Lexicon
This notebook extracts unique words from the Runyoro Bible text and illustrates how to generate a pronunciation lexicon using [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus).

## 1. Install dependencies
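Install Phonetisaurus via pip if it is not already available. The later steps call the `phonetisaurus-apply` and `phonetisaurus-train` command-line tools, so both must be on your `PATH` after installation.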

In [None]:
!pip install -q phonetisaurus

## 2. Load the Runyoro Bible text
We read the text file located at `../data/raw/runyoro_bible/runyoro_bible.txt`.

In [None]:

from pathlib import Path
import re

text_path = Path('../data/raw/runyoro_bible/runyoro_bible.txt')
text = text_path.read_text(encoding='utf-8')
print(text[:300])


## 3. Extract unique words
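Use a regular expression to find alphabetic tokens, lowercase them, and deduplicate them. The pattern excludes digits, underscores, and apostrophes, so a word that contains an apostrophe is split at that point.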

In [None]:

words = re.findall(r"[^\W\d_']+", text, flags=re.UNICODE)
unique_words = sorted(set(w.lower() for w in words))
print(len(unique_words))
unique_words[:20]
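
To see what the pattern keeps and discards, the following minimal sketch runs the same regular expression on a made-up sample string (not taken from the Bible text). The helper name `extract_unique_words` is introduced here only for illustration.

```python
import re

# Same pattern as in the cell above: letters only.
# Digits, underscores, and apostrophes all end a token.
TOKEN_RE = re.compile(r"[^\W\d_']+")


def extract_unique_words(text):
    """Return the sorted set of lowercased alphabetic tokens in `text`."""
    return sorted({w.lower() for w in TOKEN_RE.findall(text)})


# Hypothetical sample string, used only to illustrate the tokenization.
sample = "Ruhanga n'abantu 123 be_bo Ruhanga"
print(extract_unique_words(sample))
# ['abantu', 'be', 'bo', 'n', 'ruhanga']
```

The apostrophe in `n'abantu` produces two tokens, `n` and `abantu`. If such forms should stay whole in the lexicon, the character class would need to be adjusted before running the extraction on the full text.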


## 4. Save the word list
The list of unique words is saved to `unique_words.txt` in the current working directory, one word per line.

In [None]:

word_list_path = Path('unique_words.txt')
word_list_path.write_text('\n'.join(unique_words), encoding='utf-8')
word_list_path
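
A quick round-trip check can confirm that the file reads back as the same list. The sketch below is self-contained and uses a temporary directory with a hypothetical three-word list; the helper names are illustrative, and unlike the cell above it writes a trailing newline, which is an optional choice.

```python
from pathlib import Path
import tempfile


def save_word_list(words, path):
    """Write one word per line, UTF-8 encoded, with a trailing newline."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def load_word_list(path):
    """Read the file back, skipping empty lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line]


# Round-trip a hypothetical word list through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "unique_words.txt"
    demo_words = ["abantu", "omukama", "ruhanga"]
    save_word_list(demo_words, demo_path)
    reloaded = load_word_list(demo_path)

print(reloaded == demo_words)
```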


## 5. Generate pronunciations with Phonetisaurus
If you already have a G2P model (`.fst` file), apply it to the word list to create the pronunciation lexicon. Replace `PATH_TO_MODEL.fst` with your model path. `phonetisaurus-apply` prints its results to standard output, so the command redirects them into `lexicon.txt`.

In [None]:

!phonetisaurus-apply --model PATH_TO_MODEL.fst --word_list unique_words.txt > lexicon.txt
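
Once `lexicon.txt` exists, it can be loaded into a dictionary for inspection. The sketch below assumes each line holds a word, a tab, and a space-separated phoneme sequence. That layout is an assumption and should be checked against the actual file, since some options may add extra columns such as scores. The function name `parse_lexicon`, the sample lines, and the phoneme symbols are invented placeholders, not model output.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phoneme symbols).

    Assumes lines of the form 'word<TAB>p1 p2 p3'; lines without a tab are skipped.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        word, sep, phones = line.partition("\t")
        if not sep:
            continue  # no tab found: not in the assumed format
        lexicon[word].append(phones.split())
    return dict(lexicon)


# Placeholder lines standing in for the contents of lexicon.txt.
sample_lines = ["abantu\ta b a n t u\n", "ruhanga\tr u h a n g a\n", "\n"]
lexicon = parse_lexicon(sample_lines)
print(lexicon["abantu"])
```

With the real file, the same function could be called as `parse_lexicon(open('lexicon.txt', encoding='utf-8'))`. Keeping a list of pronunciations per word leaves room for words that appear more than once in the output.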


### Training your own model
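If you do not yet have a model, you can train one from a lexicon of word–pronunciation pairs: one entry per line, with the word, then a tab or space, then the space-separated phonemes (the `.csv` extension used here is only a file name; the content is not comma-separated). Once the lexicon is prepared, run a command like the following:

In [None]:

!phonetisaurus-train --lexicon g2p_corpus.csv --model_prefix my_g2p_model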


After training finishes, the model is written as `my_g2p_model.fst` inside the training output directory (`train/` by default). Pass this file to `phonetisaurus-apply` as shown above to produce `lexicon.txt`.
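
The training file mentioned above can be produced from Python once word–pronunciation pairs are available. The following minimal sketch assumes the tab-separated layout (word, tab, space-separated phonemes). The helper `format_training_lexicon` is hypothetical, and the entries are placeholders whose phoneme symbols are not verified Runyoro pronunciations.

```python
def format_training_lexicon(pairs):
    """Render (word, phoneme list) pairs as 'word<TAB>p1 p2 ...' lines."""
    lines = []
    for word, phones in pairs:
        lines.append(word + "\t" + " ".join(phones) + "\n")
    return "".join(lines)


# Placeholder entries for illustration only; replace with real, checked pairs.
pairs = [
    ("abantu", ["a", "b", "a", "n", "t", "u"]),
    ("ruhanga", ["r", "u", "h", "a", "n", "g", "a"]),
]
training_text = format_training_lexicon(pairs)
print(training_text, end="")
```

The resulting string could then be written with `Path('g2p_corpus.csv').write_text(training_text, encoding='utf-8')` before running the training command.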