Representing a corpus

In Lhotse, we represent data using a small number of Python classes that provide methods for common data manipulation tasks and can be stored as JSON or JSONL manifests. To fully describe most audio corpora, we need two types of manifests: a recording manifest and a supervision manifest.
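As a rough illustration of what the two manifests hold, the sketch below writes one recording entry and one supervision entry to gzip-compressed JSONL files using only the standard library. The field names, file names, and the helpers `write_jsonl_gz` and `read_jsonl_gz` are assumptions made for this example; in practice you would build `RecordingSet` and `SupervisionSet` objects and let Lhotse handle serialization.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(path, items):
    """Write one JSON object per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSONL file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative entries; the field names are assumptions for this sketch.
recording = {
    "id": "rec-001",
    "sources": [{"type": "file", "channels": [0], "source": "audio/rec-001.wav"}],
    "sampling_rate": 16000,
    "num_samples": 160000,
    "duration": 10.0,
}
supervision = {
    "id": "rec-001-seg-0",
    "recording_id": "rec-001",  # links the segment to its recording
    "start": 1.5,
    "duration": 3.0,
    "channel": 0,
    "text": "hello world",
}

out_dir = Path(tempfile.mkdtemp())
write_jsonl_gz(out_dir / "demo_recordings_all.jsonl.gz", [recording])
write_jsonl_gz(out_dir / "demo_supervisions_all.jsonl.gz", [supervision])

loaded_recordings = read_jsonl_gz(out_dir / "demo_recordings_all.jsonl.gz")
loaded_supervisions = read_jsonl_gz(out_dir / "demo_supervisions_all.jsonl.gz")
```

The recording entry describes the audio itself, while the supervision entry points back to it through `recording_id` and marks a time span within it.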

Recording manifest

lhotse.audio.Recording

lhotse.audio.RecordingSet

Supervision manifest

lhotse.supervision.SupervisionSegment

lhotse.supervision.SupervisionSet

Standard data preparation recipes

We provide a number of standard data preparation recipes. Each recipe is a Python function paired with a CLI tool that creates the manifests from a corpus directory.

Currently supported audio corpora
Corpus name Function
ADEPT lhotse.recipes.prepare_adept
Aidatatang_200zh lhotse.recipes.prepare_aidatatang_200zh
Aishell lhotse.recipes.prepare_aishell
Aishell-3 lhotse.recipes.prepare_aishell3
AISHELL-4 lhotse.recipes.prepare_aishell4
AliMeeting lhotse.recipes.prepare_alimeeting
AMI lhotse.recipes.prepare_ami
ASpIRE lhotse.recipes.prepare_aspire
ATCOSIM lhotse.recipes.prepare_atcosim
AudioMNIST lhotse.recipes.prepare_audio_mnist
BABEL lhotse.recipes.prepare_single_babel_language
Bengali.AI Speech lhotse.recipes.prepare_bengaliai_speech
BUT ReverbDB lhotse.recipes.prepare_but_reverb_db
BVCC / VoiceMOS Challenge lhotse.recipes.bvcc
CallHome Egyptian lhotse.recipes.prepare_callhome_egyptian
CallHome English lhotse.recipes.prepare_callhome_english
CHiME-6 lhotse.recipes.prepare_chime6
CMU Arctic lhotse.recipes.prepare_cmu_arctic
CMU Indic lhotse.recipes.prepare_cmu_indic
CMU Kids lhotse.recipes.prepare_cmu_kids
CommonVoice lhotse.recipes.prepare_commonvoice
Corpus of Spontaneous Japanese lhotse.recipes.prepare_csj
CSLU Kids lhotse.recipes.prepare_cslu_kids
DailyTalk lhotse.recipes.prepare_daily_talk
DIHARD III lhotse.recipes.prepare_dihard3
DiPCo lhotse.recipes.prepare_dipco
Earnings'21 lhotse.recipes.prepare_earnings21
Earnings'22 lhotse.recipes.prepare_earnings22
The Edinburgh International Accents of English Corpus lhotse.recipes.prepare_edacc
English Broadcast News 1997 lhotse.recipes.prepare_broadcast_news
Fisher English Part 1, 2 lhotse.recipes.prepare_fisher_english
Fisher Spanish lhotse.recipes.prepare_fisher_spanish
Fluent Speech Commands lhotse.recipes.slu
GALE Arabic Broadcast Speech lhotse.recipes.prepare_gale_arabic
GALE Mandarin Broadcast Speech lhotse.recipes.prepare_gale_mandarin
GigaSpeech lhotse.recipes.prepare_gigaspeech
GigaST lhotse.recipes.prepare_gigast
Heroico lhotse.recipes.prepare_heroico
HiFiTTS lhotse.recipes.prepare_hifitts
HI-MIA (including HI-MIA-CW) lhotse.recipes.prepare_himia
ICMC-ASR lhotse.recipes.prepare_icmcasr
ICSI lhotse.recipes.prepare_icsi
IWSLT22_Ta lhotse.recipes.prepare_iwslt22_ta
KeSpeech lhotse.recipes.prepare_kespeech
L2 Arctic lhotse.recipes.prepare_l2_arctic
LibriCSS lhotse.recipes.prepare_libricss
LibriLight lhotse.recipes.prepare_librilight
LibriSpeech (including "mini") lhotse.recipes.prepare_librispeech
LibriTTS lhotse.recipes.prepare_libritts
LibriTTS-R lhotse.recipes.prepare_librittsr
LJ Speech lhotse.recipes.prepare_ljspeech
Medical lhotse.recipes.prepare_medical
MiniLibriMix lhotse.recipes.prepare_librimix
MTEDx lhotse.recipes.prepare_mtedx
MobvoiHotWord lhotse.recipes.prepare_mobvoihotwords
Multilingual LibriSpeech (MLS) lhotse.recipes.prepare_mls
MUSAN lhotse.recipes.prepare_musan
MuST-C lhotse.recipes.prepare_must_c
National Speech Corpus (Singaporean English) lhotse.recipes.prepare_nsc
People's Speech lhotse.recipes.prepare_peoples_speech
RIRs and Noises Corpus (OpenSLR 28) lhotse.recipes.prepare_rir_noise
Speech Commands lhotse.recipes.prepare_speechcommands
SpeechIO lhotse.recipes.prepare_speechio
SPGISpeech lhotse.recipes.prepare_spgispeech
Switchboard lhotse.recipes.prepare_switchboard
TED-LIUM v2 lhotse.recipes.prepare_tedlium2
TED-LIUM v3 lhotse.recipes.prepare_tedlium
TIMIT lhotse.recipes.prepare_timit
This American Life lhotse.recipes.prepare_this_american_life
UWB-ATCC lhotse.recipes.prepare_uwb_atcc
VCTK lhotse.recipes.prepare_vctk
VoxCeleb lhotse.recipes.prepare_voxceleb
VoxConverse lhotse.recipes.prepare_voxconverse
VoxPopuli lhotse.recipes.prepare_voxpopuli
WenetSpeech lhotse.recipes.prepare_wenet_speech
YesNo lhotse.recipes.prepare_yesno
Eval2000 lhotse.recipes.prepare_eval2000
MGB2 lhotse.recipes.prepare_mgb2
XBMU-AMDO31 lhotse.recipes.xbmu_amdo31

Currently supported video corpora
Corpus name Function
Grid Audio-Visual Speech Corpus lhotse.recipes.prepare_grid

Adding new corpora

Hint

Python data preparation recipes. Each corpus has a dedicated Python file in lhotse/recipes, which you can use as the basis for your own recipe.

Hint

(optional) Downloading utility. For publicly available corpora that can be freely downloaded, we usually define a function called download_<corpus-name>().

Hint

Data preparation Python entry-point. Each data preparation recipe should expose a single function called prepare_<corpus-name> that returns a dict like: {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}.
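The shape of such an entry point can be sketched with the standard library alone. The function `prepare_mycorpus`, the corpus layout (each WAV file paired with a same-named `.txt` transcript), and the plain-dict entries are all hypothetical; a real recipe would construct Lhotse's `RecordingSet` and `SupervisionSet` rather than lists of dicts.

```python
import tempfile
import wave
from pathlib import Path


def prepare_mycorpus(corpus_dir):
    """Hypothetical recipe: scan corpus_dir for WAV files with paired .txt transcripts."""
    recordings, supervisions = [], []
    for wav_path in sorted(Path(corpus_dir).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            sampling_rate = w.getframerate()
            num_samples = w.getnframes()
            num_channels = w.getnchannels()
        duration = num_samples / sampling_rate
        rec_id = wav_path.stem
        recordings.append({
            "id": rec_id,
            "source": str(wav_path),
            "channels": list(range(num_channels)),
            "sampling_rate": sampling_rate,
            "num_samples": num_samples,
            "duration": duration,
        })
        txt_path = wav_path.with_suffix(".txt")
        if txt_path.is_file():
            supervisions.append({
                "id": f"{rec_id}-0",
                "recording_id": rec_id,
                "start": 0.0,
                "duration": duration,
                "channel": 0,
                "text": txt_path.read_text(encoding="utf-8").strip(),
            })
    # Same top-level keys as described above; a real recipe would return
    # a RecordingSet and a SupervisionSet instead of plain lists.
    return {"recordings": recordings, "supervisions": supervisions}


# Build a tiny one-utterance corpus to exercise the function.
demo_dir = Path(tempfile.mkdtemp())
with wave.open(str(demo_dir / "utt1.wav"), "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                  # 16-bit samples
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)  # one second of silence
(demo_dir / "utt1.txt").write_text("yes no yes\n", encoding="utf-8")

manifests = prepare_mycorpus(demo_dir)
```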

Hint

CLI recipe wrappers. We provide a command-line interface that wraps the download and prepare functions; see lhotse/bin/modes/recipes for examples of how to do it.

Hint

Pre-defined train/dev/test splits. When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure: {'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}
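A caller can walk this nested structure one split at a time. In the minimal sketch below, `summarize_splits` is a hypothetical helper and the lists of strings are stand-ins for the real manifest objects; it relies only on `len()`, so it should work with anything sized.

```python
def summarize_splits(manifests):
    """Return {split: (num_recordings, num_supervisions)} for a nested manifest dict."""
    return {
        split: (len(parts["recordings"]), len(parts["supervisions"]))
        for split, parts in manifests.items()
    }


# Stand-in for the return value of a recipe with predefined splits;
# the strings are placeholders for recording and supervision entries.
manifests = {
    "train": {"recordings": ["r1", "r2", "r3"], "supervisions": ["s1", "s2", "s3", "s4"]},
    "dev": {"recordings": ["r4"], "supervisions": ["s5"]},
    "test": {"recordings": ["r5"], "supervisions": ["s6", "s7"]},
}

summary = summarize_splits(manifests)
```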

Hint

Manifest naming convention. The default naming convention is <corpus-name>_<manifest-type>_<split>.jsonl.gz, i.e., we save the manifests as compressed JSONL files. Here, <manifest-type> can be recordings, supervisions, etc., and <split> can be train, dev, test, etc. If the corpus defines no such split, we use all as the default. Other information, such as mic type or language, may be included in <corpus-name>. Some examples are: cmu-indic_recordings_all.jsonl.gz, ami-ihm_supervisions_dev.jsonl.gz, mtedx-english_recordings_train.jsonl.gz.
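The convention is simple enough to capture in a pair of small functions. `manifest_filename` and `parse_manifest_filename` are hypothetical names used only in this sketch, not Lhotse utilities. The parser assumes that neither the manifest type nor the split contains an underscore.

```python
def manifest_filename(corpus_name, manifest_type, split="all", suffix="jsonl.gz"):
    """Build <corpus-name>_<manifest-type>_<split>.<suffix>."""
    return f"{corpus_name}_{manifest_type}_{split}.{suffix}"


def parse_manifest_filename(filename, suffix="jsonl.gz"):
    """Split a manifest file name back into (corpus_name, manifest_type, split)."""
    ending = "." + suffix
    if not filename.endswith(ending):
        raise ValueError(f"expected a name ending in {ending!r}: {filename!r}")
    stem = filename[: -len(ending)]
    # Split from the right so that underscores in the corpus name are preserved.
    corpus_name, manifest_type, split = stem.rsplit("_", 2)
    return corpus_name, manifest_type, split


name_all = manifest_filename("cmu-indic", "recordings")
name_dev = manifest_filename("ami-ihm", "supervisions", "dev")
parsed = parse_manifest_filename("mtedx-english_recordings_train.jsonl.gz")
```

Splitting from the right keeps corpus names such as `aidatatang_200zh` intact, which is why extra details like mic type or language fit best in the corpus-name part.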

Hint

Isolated utterance corpora. Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the lhotse.supervision.SupervisionSegment duration will exactly match the lhotse.audio.Recording duration (and there will likely be exactly one segment corresponding to any recording).
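One way to sanity-check a recipe for this kind of corpus is to confirm that each recording has exactly one segment covering it from start to end. The sketch below works on plain dicts as stand-ins for the Lhotse objects; `is_isolated_utterance_corpus`, the field names, and the tolerance value are assumptions made for illustration.

```python
from collections import Counter


def is_isolated_utterance_corpus(recordings, supervisions, tol=1e-3):
    """True if every recording has exactly one supervision spanning its full duration."""
    counts = Counter(seg["recording_id"] for seg in supervisions)
    durations = {rec["id"]: rec["duration"] for rec in recordings}
    if set(counts) - set(durations):
        return False  # a supervision refers to an unknown recording
    if any(counts.get(rec_id, 0) != 1 for rec_id in durations):
        return False  # a recording has zero or several supervisions
    return all(
        abs(seg["start"]) <= tol
        and abs(seg["duration"] - durations[seg["recording_id"]]) <= tol
        for seg in supervisions
    )


recordings = [{"id": "utt-1", "duration": 3.2}, {"id": "utt-2", "duration": 5.0}]
supervisions = [
    {"id": "utt-1", "recording_id": "utt-1", "start": 0.0, "duration": 3.2},
    {"id": "utt-2", "recording_id": "utt-2", "start": 0.0, "duration": 5.0},
]
ok = is_isolated_utterance_corpus(recordings, supervisions)

# Adding a second segment to utt-2 breaks the one-to-one correspondence.
extra = {"id": "utt-2-b", "recording_id": "utt-2", "start": 2.0, "duration": 1.0}
not_ok = is_isolated_utterance_corpus(recordings, supervisions + [extra])
```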

Hint

Conversational corpora. Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one lhotse.audio.Recording object corresponding to a single conversation/session, that spans its whole duration. Each speech segment in that recording should be represented as a lhotse.supervision.SupervisionSegment with the same recording_id value.
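The relationship between one long recording and its many segments can be illustrated with plain dicts. In this sketch, `segments_by_recording` and `out_of_bounds_segments` are hypothetical helpers, and the IDs and timings are made up; the point is only that segments share a `recording_id` and must fall within the recording's duration.

```python
from collections import defaultdict


def segments_by_recording(supervisions):
    """Group supervision entries by recording_id, sorted by start time."""
    grouped = defaultdict(list)
    for seg in supervisions:
        grouped[seg["recording_id"]].append(seg)
    for segs in grouped.values():
        segs.sort(key=lambda s: s["start"])
    return dict(grouped)


def out_of_bounds_segments(recording, segments, tol=1e-3):
    """Return IDs of segments that start before 0 or end after the recording does."""
    return [
        s["id"]
        for s in segments
        if s["start"] < -tol or s["start"] + s["duration"] > recording["duration"] + tol
    ]


# Made-up conversation with three segments from two speakers.
recording = {"id": "conv-001", "duration": 300.0}
supervisions = [
    {"id": "conv-001-A-2", "recording_id": "conv-001", "start": 12.4, "duration": 3.1, "speaker": "A"},
    {"id": "conv-001-B-1", "recording_id": "conv-001", "start": 1.0, "duration": 2.5, "speaker": "B"},
    {"id": "conv-001-B-3", "recording_id": "conv-001", "start": 298.5, "duration": 2.0, "speaker": "B"},
]

grouped = segments_by_recording(supervisions)
bad = out_of_bounds_segments(recording, grouped["conv-001"])
```

Here the last segment ends at 300.5 s, past the 300 s recording, so it is the only one reported.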

Hint

Multi-channel corpora. Corpora with multiple channels for each session (e.g. AMI) should have a single lhotse.audio.Recording with multiple lhotse.audio.AudioSource objects, each corresponding to a separate channel. Remember to make the lhotse.supervision.SupervisionSegment objects correspond to the right channels!
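A quick consistency check for the channel assignment might look like the following. It uses plain dicts in place of the Lhotse classes; the helper names, field names, and file names are assumptions for this sketch.

```python
def available_channels(recording):
    """Collect the channel indices provided by all audio sources of a recording."""
    return sorted({ch for src in recording["sources"] for ch in src["channels"]})


def mismatched_channels(recording, supervisions):
    """Return IDs of this recording's supervisions whose channel has no audio source."""
    channels = set(available_channels(recording))
    return [
        s["id"]
        for s in supervisions
        if s["recording_id"] == recording["id"] and s["channel"] not in channels
    ]


# One session recorded on two channels, each stored in its own (made-up) file.
recording = {
    "id": "session-1",
    "sources": [
        {"type": "file", "channels": [0], "source": "session-1.ch0.wav"},
        {"type": "file", "channels": [1], "source": "session-1.ch1.wav"},
    ],
    "duration": 60.0,
}
supervisions = [
    {"id": "seg-1", "recording_id": "session-1", "channel": 0, "start": 0.5, "duration": 2.0},
    {"id": "seg-2", "recording_id": "session-1", "channel": 1, "start": 3.0, "duration": 1.5},
    {"id": "seg-3", "recording_id": "session-1", "channel": 2, "start": 5.0, "duration": 1.0},
]

channels = available_channels(recording)
bad = mismatched_channels(recording, supervisions)
```

The third segment points at channel 2, which neither source provides, so it is flagged.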