mdoumbouya/nicolingua

Using Radio Archives for Low-Resource Speech Recognition

Code

Dataset: West African Radio Corpus

17,091 audio clips, each 30 seconds long, sampled from archives collected from 6 Guinean radio stations. The broadcasts consist of news and various radio shows in languages including French, Guerze, Koniaka, Kissi, Kono, Maninka, Mano, Pular, Susu, and Toma. Some radio shows include phone calls, background and foreground music, and various types of noise.

Download from OpenSLR

http://openslr.org/105/

Download via https

wget \
    https://nicolingua.s3.eu-west-2.amazonaws.com/nicolingua-0003-west-african-radio-corpus.tgz \
    -O nicolingua-0003-west-african-radio-corpus.tgz

Download with aws-cli

aws s3 cp \
    s3://nicolingua/nicolingua-0003-west-african-radio-corpus-openslr.tgz \
    nicolingua-0003-west-african-radio-corpus-openslr.tgz
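
After downloading, it can be useful to unpack the archive and check how many clips it contains. The following is a minimal sketch, not part of the original repository: the helper names `extract_corpus` and `count_clips` are hypothetical, and the `wav` extension is an assumption about how the clips are stored, so adjust it to match the actual contents of the archive.

```shell
#!/bin/sh
# Sketch only: helper names and the default "wav" extension are assumptions,
# not something documented by the corpus itself.

# Extract a gzip-compressed tar archive ($1) into a directory ($2),
# creating the directory if it does not exist yet.
extract_corpus() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Print the number of files under directory $1 whose extension matches $2
# (case-insensitive). The extension defaults to "wav" when $2 is omitted.
count_clips() {
    find "$1" -type f -iname "*.${2:-wav}" | wc -l | tr -d ' '
}
```

For example, `extract_corpus nicolingua-0003-west-african-radio-corpus.tgz radio_corpus` followed by `count_clips radio_corpus` prints the number of matching files. If the clips are stored one per file with the assumed extension, that number should match the 17,091 clips stated above.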

Dataset: West African Virtual Assistant Speech Recognition Corpus

10,083 utterances recorded on a variety of devices by 49 speakers (16 female and 33 male) aged 5 to 76.

Download from OpenSLR

http://openslr.org/106/

Download via https

wget \
    https://nicolingua.s3.eu-west-2.amazonaws.com/nicolingua-0004-west-african-va-asr-corpus.tgz \
    -O nicolingua-0004-west-african-va-asr-corpus.tgz

Download with aws-cli

aws s3 cp \
    s3://nicolingua/nicolingua-0004-west-african-va-asr-corpus.tgz \
    nicolingua-0004-west-african-va-asr-corpus.tgz

Pre-Trained Model: West African Wav2vec

Compatible with the baseline wav2vec large model. Trained on the West African Radio Corpus.

Download via https

wget \
    https://nicolingua.s3.eu-west-2.amazonaws.com/nicolingua-0003-west-african-wav2vec.tgz \
    -O nicolingua-0003-west-african-wav2vec.tgz

Download with aws-cli

aws s3 cp \
    s3://nicolingua/nicolingua-0003-west-african-wav2vec.tgz \
    nicolingua-0003-west-african-wav2vec.tgz

Licence

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

How to cite our work

APA

Doumbouya, M., Einstein, L., & Piech, C. (2021). Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users. In AAAI.

BibTeX

@inproceedings{doumbouya2021usingradio,
    title={Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users},
    author={Doumbouya, Moussa and Einstein, Lisa and Piech, Chris},
    booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    year={2021}
}