A lightweight and simple NLP toolkit for the Breton language, in Python.
ostilhou depends on hunspell:
sudo apt install hunspell libhunspell-dev
After cloning the repository:
pip install -r requirements.txt
Sentence splitting, simple pre-tokenizer, text normalization, text inverse-normalization.
Spelling error detection with An Drouizig's hunspell dictionary.
Various dictionary files (tsv format) for proper nouns, first names, last names, nouns (with gender), acronyms.
Word phonetization with the asr.phonetize_word function.
No API transcription for now...
Using a Vosk model.
Text transcriptions can be inferred from an audio file with the functions asr.recognizer.transcribe_file(path) and asr.recognizer.transcribe_file_timecoded(path).
If working with pydub.AudioSegment objects, use the functions asr.recognizer.transcribe_segment(audiosegment) and asr.transcribe_segment_timecoded(audiosegment).
No post-processing is applied by default.
Speech-to-Text post-processing steps:
Raw inferred text
-> Verbal fillers removal (optional)
-> n-token substitution (from 'asr/postproc_sub.tsv')
-> Numbers and units inverse-normalization (optional)
-> Regional adaptation, using a look-up table (optional)
The 'n-token substitution' step adds the necessary hyphens and capitalizes compound names and brand names.
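The first two post-processing steps can be illustrated with a minimal, self-contained sketch. It does not use ostilhou's own code: the function names (load_substitutions, remove_fillers, substitute, post_process), the filler list, the table entries and the two-column layout assumed for the substitution table are all hypothetical, chosen only to show how a multi-token lookup can restore hyphens and capitalization. The inverse-normalization and regional adaptation steps are left out.

```python
import csv
import io

# Hypothetical table contents; the real 'asr/postproc_sub.tsv' may be laid out differently.
SUB_TSV = "tro dro\ttro-dro\nradio kerne\tRadio Kerne\n"
# Illustrative filler list, not the one shipped with the library.
FILLERS = {"euh", "beñ"}


def load_substitutions(tsv_text):
    """Parse two-column TSV text into {source token tuple: replacement token list}."""
    table = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            table[tuple(row[0].split())] = row[1].split()
    return table


def remove_fillers(tokens, fillers=FILLERS):
    """Drop verbal fillers (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in fillers]


def substitute(tokens, table):
    """Replace token sequences left to right, trying the longest match first."""
    max_len = max((len(key) for key in table), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No entry starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def post_process(text, table, strip_fillers=False):
    """Optional filler removal, then n-token substitution."""
    tokens = text.split()
    if strip_fillers:
        tokens = remove_fillers(tokens)
    return " ".join(substitute(tokens, table))


table = load_substitutions(SUB_TSV)
print(post_process("euh selaou radio kerne tro dro", table, strip_fillers=True))
```

Matching the longest sequence first keeps a multi-word entry from being pre-empted by a shorter entry that starts with the same token.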
Audio data augmentation functions.
The library comes with a few toy corpora from the public domain.