MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Name | License | Hours | Languages | Label |
---|---|---|---|---|
CommonVoice | CC 0 | 6,732 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
CoVoST2 | CC 0 | 687 | en, fr, it, es, pt, et, nl, sv, lv, sl | ✅ |
CSS10 | Public Domain | 99 | nl, fi, fr, de, el, hu, es | ✅ |
EMU | CC BY 3.0 | 56 | pl | ✅ |
EU Parliament | CC BY 4.0 | 32 | pl | ✅ |
FLEURS | CC BY 4.0 | 215 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
Large Corpus of Czech Parliament Plenary Hearings | CC BY 4.0 | 444 | cs | ✅ |
LibriLight | Public Domain | 57,706 | en | ❌ |
LibriTTS | CC BY 4.0 | 585 | en | ✅ |
LibriSpeech | CC BY 4.0 | 360 | en | ✅ |
LibriVoxDeEn | Public Domain | 547 | de | ✅ |
MC Speech | CC 0 | 22 | pl | ✅ |
Multilingual LibriSpeech | CC BY 4.0 | 50,687 | nl, en, fr, de, it, pl, pt, es | ✅ |
SIWIS | CC BY 4.0 | 11 | fr | ✅ |
Speech Commands | CC BY 4.0 | 18 | en | ✅ |
VCTK | CC BY 4.0 | 44 | en | ✅ |
VoxPopuli | CC 0 | 383,500 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ❌ |
VoxPopuli | CC 0 | 1,791 | hr, cs, nl, en, et, fi, fr, de, hu, it, lt, pl, ro, sk, sl, es | ✅ |
YouTube-Commons | CC BY 4.0 | 3,261 | bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es | ❌ |
YouTube-Commons | CC BY 4.0 | 443,396 | bg, cs, nl, en, et, fi, fr, de, el, hu, it, lv, lt, pl, pt, ro, es, sv | ✅ |
MOSEL 🍇 | CC BY 4.0 | 441,206 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
Languages are identified by their two-letter ISO 639-1 codes.
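To work with the table programmatically, its rows can be parsed as pipe-delimited text. The snippet below is a minimal sketch, not an official MOSEL tool: the names `Dataset`, `parse_table`, `hours_by_label`, and `datasets_for` are hypothetical, and `SAMPLE_ROWS` contains only a handful of rows copied from the table above, so the totals it computes cover that sample rather than the full collection.

```python
from dataclasses import dataclass

# A few rows copied verbatim from the table above (illustrative subset only).
SAMPLE_ROWS = """\
CSS10 | Public Domain | 99 | nl, fi, fr, de, el, hu, es | ✅ |
EMU | CC BY 3.0 | 56 | pl | ✅ |
LibriLight | Public Domain | 57,706 | en | ❌ |
LibriTTS | CC BY 4.0 | 585 | en | ✅ |
LibriSpeech | CC BY 4.0 | 360 | en | ✅ |
"""


@dataclass
class Dataset:
    name: str
    license: str
    hours: int
    languages: list[str]
    labeled: bool


def parse_row(line: str) -> Dataset:
    """Parse one 'Name | License | Hours | Languages | Label |' row."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    name, license_, hours, langs, label = cells
    return Dataset(
        name=name,
        license=license_,
        hours=int(hours.replace(",", "")),  # "57,706" -> 57706
        languages=[code.strip() for code in langs.split(",")],
        labeled=(label == "✅"),
    )


def parse_table(text: str) -> list[Dataset]:
    """Parse every non-empty line of a pipe-delimited table body."""
    return [parse_row(line) for line in text.splitlines() if line.strip()]


def hours_by_label(datasets: list[Dataset]) -> dict[bool, int]:
    """Total hours, keyed by whether the data is labeled."""
    totals = {True: 0, False: 0}
    for d in datasets:
        totals[d.labeled] += d.hours
    return totals


def datasets_for(datasets: list[Dataset], lang: str) -> list[str]:
    """Names of the datasets that list the given two-letter language code."""
    return [d.name for d in datasets if lang in d.languages]


if __name__ == "__main__":
    data = parse_table(SAMPLE_ROWS)
    print(hours_by_label(data))
    print(datasets_for(data, "pl"))
```

Note that the table reports hours per dataset, not per language, so a per-language breakdown of hours cannot be derived from it; the sketch only answers which datasets cover a given language.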
If you want to add an open-source compliant dataset to the list, please open a Pull Request. If you want to report a problem with existing content, please use the issues section.
If you use the MOSEL dataset, please cite:
@inproceedings{mosel,
title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, United States",
publisher = "Association for Computational Linguistics",
}