
orevaahia/yorulect


Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

This repository contains the code for the paper "Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects" (https://arxiv.org/abs/2406.19564).

1. Data

Download the data folder from this drive.

Data Description

The dataset is a corpus of high-quality, contemporary Yorùbá speech and text data that is parallel across four Yorùbá dialects (Standard Yorùbá, Ifè, Ìlàje, and Ìjèbú) and covers three domains (religious, news, and TED talks). It can be used for text-to-text machine translation (MT), automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (STST).
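Because the data is parallel across the four dialects, a single aligned item can supply a sentence pair for every translation direction. The sketch below illustrates that idea only. The record layout, the field names `domain` and `text`, and the placeholder sentences are hypothetical and do not describe the actual file format in the data folder.

```python
from itertools import permutations

# Dialect names as listed in the data description above.
DIALECTS = ["Standard Yorùbá", "Ifè", "Ìlàje", "Ìjèbú"]

# Hypothetical record: one sentence aligned across all four dialects.
# Field names and placeholder strings are illustrative assumptions.
record = {
    "domain": "news",
    "text": {d: f"<sentence in {d}>" for d in DIALECTS},
}


def translation_pairs(rec):
    """Return every ordered (source, target) pair from one parallel record."""
    return [
        (src, tgt, rec["text"][src], rec["text"][tgt])
        for src, tgt in permutations(DIALECTS, 2)
    ]


pairs = translation_pairs(record)
print(len(pairs))  # 4 dialects give 4 * 3 = 12 ordered directions
```

The same alignment would let speech in one dialect be paired with text in another for S2TT, assuming the audio files are indexed the same way as the text.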

2. Running the Code

Clone the repository and install requirements

git clone https://github.com/orevaahia/yorulect
cd yorulect
pip install -r requirements.txt

Machine Translation:

Zero-shot:

# zero-shot evaluation of Aya and MT0
bash scripts/mt/zero_shot_lm.sh

# zero-shot evaluation of NLLB and M2M-100
bash scripts/mt/zero_shot_mt.sh

# zero-shot evaluation of Google Translate
bash scripts/mt/zero_shot_gmnmt.sh

Finetuning NLLB:

bash scripts/mt/finetune_mt.sh

Automatic Speech Recognition:

Zero-shot:

# zero-shot evaluation of MMS and Whisper
bash scripts/speech/zero_shot_asr.sh

Finetuning MMS and XLSR:

# MMS
bash scripts/speech/finetune_mms_asr.sh

# XLSR
bash scripts/speech/finetune_xslr_asr.sh
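This README does not say which metric the ASR scripts report. Word error rate (WER) is a common choice for scoring transcriptions, so the sketch below assumes it and shows how it can be computed for one reference and hypothesis pair. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the repository's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Comparing this score per dialect before and after finetuning is one way to see how much a model improves on each variety.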

Citation

If you use this dataset, please cite our work.

@misc{ahia2024voicesunheardnlpresources,
      title={Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects}, 
      author={Orevaoghene Ahia and Anuoluwapo Aremu and Diana Abagyan and Hila Gonen and David Ifeoluwa Adelani and Daud Abolade and Noah A. Smith and Yulia Tsvetkov},
      year={2024},
      eprint={2406.19564},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19564}, 
}
