💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

liuyanfeier/open-speech-corpora

💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list prefers corpora that are free (i.e. no monetary cost) and truly open (e.g. released under a Creative Commons license or a Community Data License Agreement). Not every corpus below meets both criteria, but all of them are accessible and usable for research and/or commercial purposes.

Feel free to propose additions to the list!
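Since the list is grouped by license, it can help to turn a license string into coarse usage flags before picking a corpus. The following is a minimal sketch, not part of this repository: the function name `cc_license_flags` and the flag names are hypothetical, and it only interprets single Creative Commons strings as they appear in the LICENSE column. Anything else returns `None`, because custom licenses have to be read in full.

```python
from typing import Optional


def cc_license_flags(license_name: str) -> Optional[dict]:
    """Return coarse usage flags for a single Creative Commons license string.

    Hypothetical helper: returns None for anything that is not a CC license,
    because custom licenses need to be read in full rather than pattern-matched.
    """
    # Treat "CC BY-NC 4.0" and "CC-BY-NC 4.0" alike by splitting on hyphens too.
    tokens = license_name.upper().replace("-", " ").split()
    if not tokens or tokens[0] != "CC":
        return None
    return {
        "attribution_required": "BY" in tokens,
        "non_commercial_only": "NC" in tokens,
        "no_derivatives": "ND" in tokens,
        "share_alike": "SA" in tokens,
    }


if __name__ == "__main__":
    for name in ["CC-0", "CC-BY 4.0", "CC-BY-NC-ND 3.0", "Apache 2.0"]:
        print(name, "->", cc_license_flags(name))
```

The flags are only a first filter; the license text itself remains the authority on what is permitted.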

📜 CC-0

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Common Voice English English 1,118 hours (validated); 1,488 hours (total) 51,072 speakers (reported: 13% female / 46% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice German German 483 hours (validated); 538 hours (total) 8,460 speakers (reported: 9% female / 67% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice French French 350 hours (validated); 412 hours (total) 8,164 speakers (reported: 12% female / 65% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Welsh Welsh 59 hours (validated); 77 hours (total) 1,149 speakers (reported: 18% female / 29% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Breton Breton 5 hours (validated); 12 hours (total) 133 speakers (reported: 2% female / 55% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Chuvash Chuvash <1 hour (validated); 2 hours (total) 38 speakers (reported: 0% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Turkish Turkish 13 hours (validated); 14 hours (total) 461 speakers (reported: 8% female / 74% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Tatar Tatar 25 hours (validated); 27 hours (total) 142 speakers (reported: 2% female / 81% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Kyrgyz Kyrgyz 11 hours (validated); 21 hours (total) 119 speakers (reported: 44% female / 45% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Irish Irish 2 hours (validated); 4 hours (total) 80 speakers (reported: 16% female / 59% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Kabyle Kabyle 262 hours (validated); 276 hours (total) 693 speakers (reported: 22% female / 55% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Catalan Catalan 245 hours (validated); 295 hours (total) 3,724 speakers (reported: 35% female / 43% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Taiwanese Mandarin Taiwanese Mandarin 42 hours (validated); 60 hours (total) 1,108 speakers (reported: 26% female / 48% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Slovenian Slovenian 3 hours (validated); 6 hours (total) 51 speakers (reported: 16% female / 80% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Italian Italian 85 hours (validated); 122 hours (total) 4,292 speakers (reported: 18% female / 47% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Dutch Dutch 24 hours (validated); 33 hours (total) 701 speakers (reported: 10% female / 66% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Hakha Chin Hakha Chin 2 hours (validated); 5 hours (total) 290 speakers (reported: 20% female / 23% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Esperanto Esperanto 35 hours (validated); 41 hours (total) 215 speakers (reported: 7% female / 70% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Estonian Estonian 10 hours (validated); 13 hours (total) 230 speakers (reported: 38% female / 57% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Persian Persian 211 hours (validated); 255 hours (total) 2,763 speakers (reported: 6% female / 78% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Basque Basque 65 hours (validated); 99 hours (total) 638 speakers (reported: 23% female / 51% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Spanish Spanish 167 hours (validated); 221 hours (total) 8,252 speakers (reported: 10% female / 55% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Mandarin Mandarin (China) 26 hours (validated); 31 hours (total) 963 speakers (reported: 10% female / 64% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Mongolian Mongolian 9 hours (validated); 12 hours (total) 296 speakers (reported: 25% female / 36% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Sakha Sakha 3 hours (validated); 6 hours (total) 37 speakers (reported: 10% female / 54% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Dhivehi Dhivehi 6 hours (validated); 8 hours (total) 101 speakers (reported: 64% female / 28% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Kinyarwanda Kinyarwanda <1 hour (validated); 17 hours (total) 129 speakers (reported: 8% female / 41% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Swedish Swedish 5 hours (validated); 6 hours (total) 99 speakers (reported: 8% female / 74% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Russian Russian 72 hours (validated); 76 hours (total) 496 speakers (reported: 23% female / 71% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Indonesian Indonesian 3 hours (validated); 3 hours (total) 56 speakers (reported: 4% female / 82% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Arabic Arabic 7 hours (validated); 12 hours (total) 228 speakers (reported: 24% female / 48% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Tamil Tamil 3 hours (validated); 4 hours (total) 91 speakers (reported: 10% female / 67% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Interlingua Interlingua 1 hour (validated); 3 hours (total) 12 speakers (reported: 2% female / 94% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Portuguese Portuguese 27 hours (validated); 29 hours (total) 354 speakers (reported: 2% female / 89% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Latvian Latvian 4 hours (validated); 6 hours (total) 86 speakers (reported: 17% female / 64% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Japanese Japanese 3 hours (validated); 3 hours (total) 52 speakers (reported: 0% female / 81% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Votic Votic <1 hour (validated); <1 hour (total) 2 speakers (reported: 0% female / 0% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Abkhaz Abkhaz <1 hour (validated); <1 hour (total) 3 speakers (reported: 2% female / 98% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Chinese (Hong Kong) Chinese (Hong Kong) <1 hour (validated); <1 hour (total) 15 speakers (reported: 24% female / 37% male) https://voice.mozilla.org/en/datasets CC-0
Common Voice Czech Czech 36 hours (validated); 45 hours (total) 353 speakers (reported: 2% female / 69% male) https://voice.mozilla.org/en/datasets CC-0
Yesno Hebrew 6 mins one male http://www.openslr.org/1/ CC-0
LJ Speech Corpus English ~24 hours one female https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 CC-0
NST Danish ASR Database Danish 229,992 utterances 616 speakers original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/ CC-0
NST Danish Dictation Danish 34,955 utterances 151 speakers https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/ CC-0
NST Danish Speech Synthesis Danish 4,108 utterances 1 male speaker https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/ CC-0
NST Swedish ASR Database Swedish 366,000 utterances 1,000 speakers original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/ CC-0
NST Swedish Dictation Swedish 45,620 utterances 195 speakers https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/ CC-0
NST Swedish Speech Synthesis Swedish 5,279 utterances 1 male speaker https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/ CC-0
NST Norwegian ASR Database Norwegian 359,760 utterances 980 speakers original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ CC-0
NST Norwegian Dictation Norwegian 33,360 utterances 144 speakers https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/ CC-0
NST Norwegian Speech Synthesis Norwegian 5,363 utterances 1 male speaker https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/ CC-0
NB Tale – Speech Database for Norwegian Norwegian 7,600 utterances + ~12 hours 380 speakers https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/ CC-0
Norwegian Parliamentary Speech Corpus (v0.1) Norwegian ~59 hours 203 speakers https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/ CC-0
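The Common Voice rows above report both validated and total hours, so the share of validated audio can be computed per language. The sketch below is only an illustration under stated assumptions: the helper names `parse_common_voice_hours` and `validated_share` are hypothetical, and the regular expression assumes the cell text is written exactly as in this list. Entries such as "<1 hour" are treated as upper bounds, so no exact ratio is returned for them.

```python
import re

# Matches cells such as "483 hours (validated); 538 hours (total)"
# or "<1 hour (validated); 2 hours (total)".
_HOURS = re.compile(r"(<?)\s*(\d[\d,]*)\s*hours?\s*\((validated|total)\)")


def parse_common_voice_hours(cell: str) -> dict:
    """Map 'validated' / 'total' to (hours, is_upper_bound) tuples."""
    parsed = {}
    for less_than, number, kind in _HOURS.findall(cell):
        parsed[kind] = (float(number.replace(",", "")), less_than == "<")
    return parsed


def validated_share(cell: str):
    """Return validated / total hours, or None when no exact ratio exists."""
    parsed = parse_common_voice_hours(cell)
    if "validated" not in parsed or "total" not in parsed:
        return None
    (validated, v_bound), (total, t_bound) = parsed["validated"], parsed["total"]
    if v_bound or t_bound or total == 0:
        return None  # "<1 hour" is only an upper bound
    return validated / total


if __name__ == "__main__":
    german = "483 hours (validated); 538 hours (total)"
    print(f"German validated share: {validated_share(german):.2%}")
```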

📜 CC-BY

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
ARU Speech Corpus English (UK) 720 utterances / speaker 12 speakers (6 female; 6 male) http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip CC-BY 3.0
Althingi Parliamentary Speech Corpus Icelandic 542 hours and 25 minutes 196 speakers http://www.malfong.is/index.php?dlid=73&lang=en CC-BY 4.0
Alþingisumræður Parliamentary Speech Corpus Icelandic ~21 hours http://www.malfong.is/index.php?dlid=8&lang=en CC-BY 3.0
Hjal Corpus Icelandic ~41,000 recordings 883 speakers http://www.malfong.is/index.php?dlid=5&lang=en CC-BY 3.0
The Malromur Corpus Icelandic 152 hours 563 speakers http://www.malfong.is/index.php?dlid=65&lang=en CC-BY 4.0
Telecooperation German Corpus for Kinect German ~35 hours ~180 speakers http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz CC-BY 2.0
African Speech Technology English-English Speech Corpus English ~21 hours https://repo.sadilar.org/handle/20.500.12185/283 CC-BY 2.5 South Africa
African Speech Technology isiXhosa Speech Corpus isiXhosa ~26 hours https://repo.sadilar.org/handle/20.500.12185/305 CC-BY 2.5 South Africa
NCHLT Afrikaans Afrikaans 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/280 CC-BY 3.0
NCHLT English English 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/274 CC-BY 3.0
NCHLT isiNdebele isiNdebele 56 hours 148 speakers (78 female / 70 male) https://repo.sadilar.org/handle/20.500.12185/272 CC-BY 3.0
NCHLT isiXhosa isiXhosa 56 hours 209 speakers (106 female / 103 male) https://repo.sadilar.org/handle/20.500.12185/279 CC-BY 3.0
NCHLT isiZulu isiZulu 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/275 CC-BY 3.0
NCHLT Sepedi Sepedi 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/270 CC-BY 3.0
NCHLT Sesotho Sesotho 56 hours 210 speakers (113 female / 97 male) https://repo.sadilar.org/handle/20.500.12185/278 CC-BY 3.0
NCHLT Setswana Setswana 56 hours 210 speakers (109 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/281 CC-BY 3.0
NCHLT Siswati Siswati 56 hours 197 speakers (96 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/271 CC-BY 3.0
NCHLT Tshivenda Tshivenda 56 hours 208 speakers (83 female / 125 male) https://repo.sadilar.org/handle/20.500.12185/276 CC-BY 3.0
NCHLT Xitsonga Xitsonga 56 hours 198 speakers (95 female / 103 male) https://repo.sadilar.org/handle/20.500.12185/277 CC-BY 3.0
Lwazi II Cross-lingual Proper Name Corpus Afrikaans; English; isiZulu; Sesotho 2 hours 5 mins 20 speakers https://repo.sadilar.org/handle/20.500.12185/445 CC-BY 3.0
Lwazi II Proper Name Call Routing Telephone Corpus English 2 hours 7 mins https://repo.sadilar.org/handle/20.500.12185/448 CC-BY 3.0
Lwazi II Afrikaans Trajectory Tracking Corpus Afrikaans 4 hours one male https://repo.sadilar.org/handle/20.500.12185/442 CC-BY 3.0
LibriSpeech English ~1000 hours 2484 speakers (1201 female / 1283 male) http://www.openslr.org/12/ CC-BY 4.0
Zeroth-Korean Korean 52.8 hours 115 speakers http://www.openslr.org/40/ CC-BY 4.0
Speech Commands English 17.8 hours >1,000 speakers https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html CC-BY 4.0
ParlamentParla Catalan 320 hours https://www.openslr.org/59/ CC-BY 4.0
SIWIS French ~10 hours one female http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip CC-BY 4.0
VCTK English 44 hours 109 speakers http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip CC-BY 4.0
LibriTTS English 586 hours 2,456 speakers (1,185 female / 1,271 male) http://www.openslr.org/60/ CC-BY 4.0
Augmented LibriSpeech Audio (English); Text (English, French) 236 hours https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91 CC-BY 4.0
Helsinki Prosody Corpus English 262.5 hours 1,230 speakers https://github.com/Helsinki-NLP/prosody CC-BY 4.0
Tuva Speech Database Norwegian 24 hours 40 speakers https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= CC-BY 4.0
COERLL Kʼicheʼ corpus Kʼicheʼ 34 minutes ? speakers https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz CC-BY 4.0
Timers and Such v0.1 English (synthetic: US, real: various nationalities) synthetic: 172 hours, real: 0.29 hours 21 synthetic, 11 real https://zenodo.org/record/4110812#.X9j0RmBOkYM CC-BY 4.0
Large Corpus of Czech Parliament Plenary Hearings Czech 444 hours https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126 CC-BY 4.0

📜 CC-BY-SA

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Iban Iban 8 hours http://www.openslr.org/24/ https://github.com/sarahjuan/iban CC-BY-SA 2.0
Vystadial 2013 English; Czech 41 hours; 15 hours http://www.openslr.org/6/ CC-BY-SA 3.0 US
Vystadial 2016 Czech Czech 77 hours; includes Vystadial 2013 Czech https://lindat.cz/repository/xmlui/handle/11234/1-1740 CC-BY-SA 4.0
Free Spoken Digit Dataset English 2,000 isolated digits 4 speakers https://github.com/Jakobovski/free-spoken-digit-dataset CC-BY-SA 4.0
Google Javanese Javanese 296 hours 1019 speakers http://www.openslr.org/35/ CC-BY-SA 4.0
Google Nepali Nepali 165 hours 527 speakers http://www.openslr.org/54/ CC-BY-SA 4.0
Google Bengali Bengali 229 hours 508 speakers http://www.openslr.org/53/ CC-BY-SA 4.0
Google Sinhala Sinhala 224 hours 478 speakers http://www.openslr.org/52/ CC-BY-SA 4.0
Google Sundanese Sundanese 333 hours 542 speakers http://www.openslr.org/36/ CC-BY-SA 4.0
Spoken Wikipedia Corpus (SWC-2017) English; German; Dutch 182 hours; 249 hours; 79 hours 395 speakers; 339 speakers; 145 speakers https://nats.gitlab.io/swc/ CC-BY-SA 4.0
Chuvash TTS Chuvash 4 hours 1 speaker https://github.com/ftyers/Turkic_TTS CC-BY-SA 4.0
Forschergeist German 2 hours 2 speakers (1 female; 1 male) female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz CC-BY-SA 4.0
Malayalam Speech Corpus by SMC Malayalam 1:36 hours 75 speakers (3 female, 12 male, 60 unidentified) https://releases.smc.org.in/msc-reviewed-speech/ CC-BY-SA 4.0
Google Malayalam Malayalam 3.02 hours 24 speakers http://www.openslr.org/63/ CC-BY-SA 4.0

📜 CC-BY-ND

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
IBM Recorded Debates v1 English 5 hours 10 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND
IBM Recorded Debates v2 English ~14 hours 14 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND

📜 CC-BY-NC

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
TV3Parla Catalan 240 hours http://laklak.eu/share/tv3_0.3.tar.gz CC-BY-NC 4.0
Russian Open STT Corpus Russian ~10,000 hours public, ~10,000 more upon request https://github.com/snakers4/open_stt/#links CC-BY-NC 4.0 with some exceptions
Russian Open TTS Corpus Russian 145 hours 3 males https://github.com/snakers4/open_tts/#links CC-BY-NC 4.0 with some exceptions
OVM – Otázky Václava Moravce Czech 35 hours https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3 CC-BY-NC 3.0

📜 CC-BY-NC-SA

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CHiME-Home English 6.8 hours https://archive.org/details/chime-home CC-BY-NC-SA 3.0
Cameroon Pidgin English Corpus Cameroon Pidgin English ~17 hours http://ota.ox.ac.uk/text/2563.zip CC-BY-NC-SA 3.0

📜 CC-BY-NC-ND

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Tatoeba-Eng English ~250 hours (rough estimate) 6 speakers https://voice.mozilla.org/en/datasets CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text)
TED-LIUM English 118 hours 685 speakers (36h female / 81h male) http://www.openslr.org/7/ CC-BY-NC-ND 3.0
TED-LIUM-2 English 207 hours 1242 speakers (66h female / 141h male) http://www.openslr.org/19/ CC-BY-NC-ND 3.0
TED-LIUM-3 English 452 hours 2028 speakers (134h female / 316h male) http://www.openslr.org/51/ CC-BY-NC-ND 3.0
Pansori TEDxKR Korean 3 hours 41 speakers http://www.openslr.org/58/ CC-BY-NC-ND 4.0
Primewords Mandarin Mandarin 100 hours 296 speakers http://www.openslr.org/47/ CC-BY-NC-ND 4.0
MuST-C v1.0 Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair https://ict.fbk.eu/must-c-release-v1-0/ CC-BY-NC-ND 4.0
Czech Parliament Meetings Czech 88 hours https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4 CC-BY-NC-ND 3.0

📜 CDLA-Permissive

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
DiPCo English ~5 hours 32 speakers (13 female; 19 male) https://s3.amazonaws.com/dipco/DiPCo.tgz CDLA-Permissive-1.0

📜 GNU GPL

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
VoxForge English ~120 hours ~2966 speakers http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets GNU-GPL 3.0
VoxForge Russian http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0
VoxForge German http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0

📜 Apache License

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
AISHELL-1 Mandarin 170 hours 400 speakers http://www.openslr.org/33/ Apache 2.0
Tunisian_MSA Modern Standard Arabic (Tunisia) 11.2 hours 118 speakers http://www.openslr.org/46/ Apache 2.0
African Accented French French 22 hours 232 speakers http://www.openslr.org/57/ Apache 2.0
THCHS-30 Mandarin Chinese 33.57 hours (13,389 utterances) 40 speakers (31 female; 9 male) http://www.openslr.org/18/ Apache 2.0
Living Audio Dataset - Dutch Dutch 57:49 min 1 speaker https://github.com/Idlak/Living-Audio-Dataset Apache 2.0
Living Audio Dataset - English English 50:50 min 1 speaker https://github.com/Idlak/Living-Audio-Dataset Apache 2.0
Living Audio Dataset - Irish Irish 61:56 min 1 speaker https://github.com/Idlak/Living-Audio-Dataset Apache 2.0
Living Audio Dataset - Russian Russian 34:58 min 1 speaker https://github.com/Idlak/Living-Audio-Dataset Apache 2.0
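The # HOURS column in these tables mixes several notations ("542 hours and 25 minutes", "57:49 min", "1:36 hours", "6 mins"), which makes corpora hard to compare by size. The following sketch converts such cells to a single number of hours. It is not part of this repository: `duration_to_hours` is a hypothetical helper, reading "MM:SS min" as minutes and seconds and "H:MM hours" as hours and minutes is an assumption, and for cells with several figures only the first hours value is used.

```python
import re
from typing import Optional

# A number such as "1,118", "52.8", or "6".
_NUM = r"(\d[\d,]*(?:\.\d+)?)"


def _to_float(text: str) -> float:
    return float(text.replace(",", ""))


def duration_to_hours(cell: str) -> Optional[float]:
    """Convert a '# HOURS' cell to hours; return None if no duration is found."""
    text = cell.strip().lstrip("~").strip()

    # Assumption: "57:49 min" means 57 minutes and 49 seconds.
    match = re.match(r"(\d+):(\d{2})\s*min", text)
    if match:
        return (int(match.group(1)) + int(match.group(2)) / 60) / 60

    # Assumption: "1:36 hours" means 1 hour and 36 minutes.
    match = re.match(r"(\d+):(\d{2})\s*hours?", text)
    if match:
        return int(match.group(1)) + int(match.group(2)) / 60

    # General case: the first "N hours" figure plus the first "N min..." figure.
    hours = re.search(_NUM + r"\s*hours?", text)
    minutes = re.search(_NUM + r"\s*min", text)
    if hours is None and minutes is None:
        return None  # e.g. cells that only give an utterance count
    total = 0.0
    if hours is not None:
        total += _to_float(hours.group(1))
    if minutes is not None:
        total += _to_float(minutes.group(1)) / 60
    return total


if __name__ == "__main__":
    for cell in ["542 hours and 25 minutes", "57:49 min", "1:36 hours", "6 mins"]:
        print(f"{cell!r} -> {duration_to_hours(cell):.3f} hours")
```

Cells that only give utterance counts (for example the NST databases) yield `None` rather than a guessed duration.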

📜 MIT License

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
ALFFA Amharic; Hausa (paid); Swahili; Wolof http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC MIT

📜 BSD License

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
M-AILABS German Corpus German 237 hours and 22 minutes http://www.caito.de/data/Training/stt_tts/de_DE.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Queen's English Corpus Queen's English 45 hours and 35 minutes http://www.caito.de/data/Training/stt_tts/en_UK.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS US English Corpus American English 102 hours and 7 minutes http://www.caito.de/data/Training/stt_tts/en_US.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Spanish Corpus Spanish 108 hours and 34 minutes http://www.caito.de/data/Training/stt_tts/es_ES.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Italian Corpus Italian 127 hours and 40 minutes http://www.caito.de/data/Training/stt_tts/it_IT.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Ukrainian Corpus Ukrainian 87 hours and 8 minutes http://www.caito.de/data/Training/stt_tts/uk_UK.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Russian Corpus Russian 46 hours and 47 minutes http://www.caito.de/data/Training/stt_tts/ru_RU.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS French-v0.9 Corpus French 190 hours and 30 minutes http://www.caito.de/data/Training/stt_tts/fr_FR.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Polish Corpus Polish 53 hours and 50 minutes http://www.caito.de/data/Training/stt_tts/pl_PL.tgz M-AILABS LICENSE (a data-specific BSD 3-Clause License)

📜 Custom License

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Fluent Speech Commands Corpus English 19 hours (30,043 utterances) 97 speakers http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz Fluent Speech Commands Public License
CMU Wilderness 700 Langs Alignments distributed without audio or text total:~14,000 hours; per lang: ~20 hours https://github.com/festvox/datasets-CMU_Wilderness Questionable Legality: https://live.bible.is/terms
CHiME-5 English 50 hours 48 speakers http://spandh.dcs.shef.ac.uk/chime_challenge/data.html CHiME-5 License
FalaBrasil-LAPS-Constituicao Brazilian-Portuguese 9 hours 1 speaker https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPSMail Brazilian-Portuguese 1 hour 25 speakers https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPS Benchmark Brazilian-Portuguese 1 hour 1 speaker https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo "Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita* pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita*] / Portanto, apenas as bases livres estão sendo disponibilizadas."
Fearless Steps Corpus English 19,000 hours (20 hours transcribed) ~450 speakers http://fearlesssteps.exploreapollo.org/
Microsoft Speech Corpus (Indian languages) Telugu; Tamil; Gujarati https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e Non-Commercial Microsoft Speech Corpus (Indian Languages) License
Microsoft Speech Language Translation Corpus English; Chinese; Japanese https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 Non-Commercial Microsoft Research Data License Agreement
Hey Snips Corpus English 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances 2215 speakers (positive & negative) and 4028 speakers (negative only) https://research.snips.ai/datasets/keyword-spotting Snips Data License
Snips SLU Corpus English; French 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances English: 69 speakers; French: 30 speakers https://research.snips.ai/datasets/spoken-language-understanding Snips Data License
CMU Sphinx Group - AN4 English "an4_clstk"(~50 minutes) "an4test_clstk" (~6 minutes) "an4_clstk": 21 female, 53 male "an4test_clstk": 3 female, 7 male http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz AN4
FT Speech Danish ~1,857 hours (1,017,244 utterances) 434 speakers (176 female, 258 male) https://ftspeech.dk FT Speech License
