This repository contains the code for the paper "Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects" (https://arxiv.org/abs/2406.19564).
Download the data folder from this drive.
The dataset is a corpus of high-quality, contemporary Yorùbá speech and text data that is parallel across four Yorùbá dialects (Standard Yorùbá, Ifè, Ìlàje, and Ìjèbú) in three domains (religious, news, and TED talks). It can be used for text-to-text machine translation (MT), automatic speech recognition (ASR), speech-to-text translation (S2TT), and speech-to-speech translation (STST).
Clone the repository and install the requirements:
git clone https://github.com/orevaahia/yorulect
cd yorulect
pip install -r requirements.txt
# zero-shot evaluation of Aya and MT0
bash scripts/mt/zero_shot_lm.sh
# zero-shot evaluation of NLLB and M2M-100
bash scripts/mt/zero_shot_mt.sh
# zero-shot evaluation of Google Translate
bash scripts/mt/zero_shot_gmnmt.sh
Finetuning NLLB:
bash scripts/mt/finetune_mt.sh
# zero-shot evaluation of MMS and Whisper
bash scripts/speech/zero_shot_asr.sh
# finetuning MMS
bash scripts/speech/finetune_mms_asr.sh
# finetuning XLSR
bash scripts/speech/finetune_xslr_asr.sh
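ASR output is commonly scored with word error rate (WER) against reference transcripts. This README does not say which metric the speech scripts report, so the following is only a minimal, standard-library sketch of WER for sanity-checking a few transcripts by hand. The function name `word_error_rate` and the NFC normalization step are assumptions made for illustration, not part of this repository. Normalization is included because Yorùbá tone marks can be encoded either as precomposed characters or as combining marks, and the two forms would otherwise count as different words.

```python
import unicodedata


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; not taken from this repository.
    """
    # Assumption: compare NFC-normalized text so that precomposed and
    # combining-mark encodings of the same diacritic are treated as equal.
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real transcripts: one substitution and one
    # missing word against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because the score is divided by the reference length, it can exceed 1.0 when the hypothesis contains many extra words.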
If you use this dataset, please cite our work.
@misc{ahia2024voicesunheardnlpresources,
title={Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects},
author={Orevaoghene Ahia and Anuoluwapo Aremu and Diana Abagyan and Hila Gonen and David Ifeoluwa Adelani and Daud Abolade and Noah A. Smith and Yulia Tsvetkov},
year={2024},
eprint={2406.19564},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19564},
}