Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

This repository contains the code for the paper "Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects" (https://arxiv.org/abs/2406.19564).

1. Data

Download the data folder from this drive.

Data Description

This dataset contains a corpus of high-quality, contemporary Yorùbá speech and text data that is parallel across four Yorùbá dialects (Standard Yorùbá, Ifè, Ìlàje, and Ìjèbú) in three domains (religious, news, and TED talks). The dataset can be used for text-to-text machine translation (MT), automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (STST).
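Because the data is parallel across the four dialects, each item can in principle serve as a source or target for any dialect pair. The sketch below illustrates that idea only. The `ParallelRecord` class, the dialect keys, and the `translation_pairs` helper are hypothetical names chosen for illustration; they do not reflect the actual file layout of the data folder, and the sentences are placeholders.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical dialect keys; the real data folder may name these differently.
DIALECTS = ("standard", "ife", "ilaje", "ijebu")


@dataclass(frozen=True)
class ParallelRecord:
    """One item that exists in every dialect (assumed in-memory layout)."""
    domain: str            # e.g. "religious", "news", or "ted"
    text: Dict[str, str]   # dialect key -> sentence in that dialect


def translation_pairs(record: ParallelRecord) -> List[Tuple[str, str, str, str]]:
    """Return (src_dialect, tgt_dialect, src_text, tgt_text) for every ordered pair."""
    pairs = []
    for src, tgt in permutations(DIALECTS, 2):
        # Skip a pair if either side is missing from this record.
        if src in record.text and tgt in record.text:
            pairs.append((src, tgt, record.text[src], record.text[tgt]))
    return pairs


# Placeholder sentences stand in for real corpus text.
record = ParallelRecord(
    domain="news",
    text={d: f"<{d} sentence>" for d in DIALECTS},
)
pairs = translation_pairs(record)
print(len(pairs))  # 12 ordered pairs from 4 dialects
```

The same enumeration would apply to audio paths instead of sentences for the speech tasks, under the same assumption about the record layout.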

2. Running the Code:

Clone the repository and install requirements

# --recurse-submodules also fetches the seamless_communication submodule
git clone --recurse-submodules https://github.com/orevaahia/yorulect
cd yorulect
pip install -r requirements.txt

Machine Translation:

Zero-shot:

# zero-shot evaluation of Aya and MT0
bash scripts/mt/zero_shot_lm.sh

# zero-shot evaluation of NLLB and M2M-100
bash scripts/mt/zero_shot_mt.sh

# zero-shot evaluation of Google Translate
bash scripts/mt/zero_shot_gmnmt.sh

Finetuning NLLB:

bash scripts/mt/finetune_mt.sh

Automatic Speech Recognition:

Zero-shot:

# zero-shot evaluation of MMS and Whisper
bash scripts/speech/zero_shot_asr.sh

Finetuning MMS and XLSR:

# MMS
bash scripts/speech/finetune_mms_asr.sh

# XLSR
bash scripts/speech/finetune_xslr_asr.sh
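To compare transcripts from the zero-shot and finetuned models, a word-level error measure is a common choice. This README does not state which metric the scripts report, so the following is only a minimal sketch of word error rate (WER), assuming whitespace tokenisation and no text normalisation. The function name `word_error_rate` is hypothetical and is not part of this repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

For Yorùbá, whether tone marks and underdots are kept or stripped before scoring changes the result noticeably, so any normalisation step should be applied identically to references and hypotheses.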

Citation

If you use this dataset, please cite our work.

@misc{ahia2024voicesunheardnlpresources,
      title={Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects}, 
      author={Orevaoghene Ahia and Anuoluwapo Aremu and Diana Abagyan and Hila Gonen and David Ifeoluwa Adelani and Daud Abolade and Noah A. Smith and Yulia Tsvetkov},
      year={2024},
      eprint={2406.19564},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19564}, 
}
