Representing a corpus

In Lhotse, we represent the data using a small number of Python classes. These classes provide methods for common data manipulation tasks and can be stored as JSON or JSONL manifests. Most audio corpora need two types of manifests to be fully described: a recording manifest and a supervision manifest.

Recording manifest

lhotse.audio.Recording

lhotse.audio.RecordingSet

Supervision manifest

lhotse.supervision.SupervisionSegment

lhotse.supervision.SupervisionSet

Standard data preparation recipes

We provide a number of standard data preparation recipes. Each recipe consists of a Python function and a CLI tool that create the manifests given a corpus directory.

Currently supported audio corpora

Corpus name	Function
ADEPT	`lhotse.recipes.prepare_adept`
Aidatatang_200zh	`lhotse.recipes.prepare_aidatatang_200zh`
Aishell	`lhotse.recipes.prepare_aishell`
Aishell-3	`lhotse.recipes.prepare_aishell3`
AISHELL-4	`lhotse.recipes.prepare_aishell4`
AliMeeting	`lhotse.recipes.prepare_alimeeting`
AMI	`lhotse.recipes.prepare_ami`
ASpIRE	`lhotse.recipes.prepare_aspire`
ATCOSIM	`lhotse.recipes.prepare_atcosim`
AudioMNIST	`lhotse.recipes.prepare_audio_mnist`
BABEL	`lhotse.recipes.prepare_single_babel_language`
Bengali.AI Speech	`lhotse.recipes.prepare_bengaliai_speech`
BUT ReverbDB	`lhotse.recipes.prepare_but_reverb_db`
BVCC / VoiceMOS Challenge	`lhotse.recipes.bvcc`
CallHome Egyptian	`lhotse.recipes.prepare_callhome_egyptian`
CallHome English	`lhotse.recipes.prepare_callhome_english`
CHiME-6	`lhotse.recipes.prepare_chime6`
CMU Arctic	`lhotse.recipes.prepare_cmu_arctic`
CMU Indic	`lhotse.recipes.prepare_cmu_indic`
CMU Kids	`lhotse.recipes.prepare_cmu_kids`
CommonVoice	`lhotse.recipes.prepare_commonvoice`
Corpus of Spontaneous Japanese	`lhotse.recipes.prepare_csj`
CSLU Kids	`lhotse.recipes.prepare_cslu_kids`
DailyTalk	`lhotse.recipes.prepare_daily_talk`
DIHARD III	`lhotse.recipes.prepare_dihard3`
DiPCo	`lhotse.recipes.prepare_dipco`
Earnings'21	`lhotse.recipes.prepare_earnings21`
Earnings'22	`lhotse.recipes.prepare_earnings22`
The Edinburgh International Accents of English Corpus	`lhotse.recipes.prepare_edacc`
English Broadcast News 1997	`lhotse.recipes.prepare_broadcast_news`
Fisher English Part 1, 2	`lhotse.recipes.prepare_fisher_english`
Fisher Spanish	`lhotse.recipes.prepare_fisher_spanish`
Fluent Speech Commands	`lhotse.recipes.slu`
GALE Arabic Broadcast Speech	`lhotse.recipes.prepare_gale_arabic`
GALE Mandarin Broadcast Speech	`lhotse.recipes.prepare_gale_mandarin`
GigaSpeech	`lhotse.recipes.prepare_gigaspeech`
GigaST	`lhotse.recipes.prepare_gigast`
Heroico	`lhotse.recipes.prepare_heroico`
HiFiTTS	`lhotse.recipes.prepare_hifitts`
HI-MIA (including HI-MIA-CW)	`lhotse.recipes.prepare_himia`
ICMC-ASR	`lhotse.recipes.prepare_icmcasr`
ICSI	`lhotse.recipes.prepare_icsi`
IWSLT22_Ta	`lhotse.recipes.prepare_iwslt22_ta`
KeSpeech	`lhotse.recipes.prepare_kespeech`
L2 Arctic	`lhotse.recipes.prepare_l2_arctic`
LibriCSS	`lhotse.recipes.prepare_libricss`
LibriLight	`lhotse.recipes.prepare_librilight`
LibriSpeech (including "mini")	`lhotse.recipes.prepare_librispeech`
LibriTTS	`lhotse.recipes.prepare_libritts`
LibriTTS-R	`lhotse.recipes.prepare_librittsr`
LJ Speech	`lhotse.recipes.prepare_ljspeech`
Medical	`lhotse.recipes.prepare_medical`
MiniLibriMix	`lhotse.recipes.prepare_librimix`
MTEDx	`lhotse.recipes.prepare_mtedx`
MobvoiHotWord	`lhotse.recipes.prepare_mobvoihotwords`
Multilingual LibriSpeech (MLS)	`lhotse.recipes.prepare_mls`
MUSAN	`lhotse.recipes.prepare_musan`
MuST-C	`lhotse.recipes.prepare_must_c`
National Speech Corpus (Singaporean English)	`lhotse.recipes.prepare_nsc`
People's Speech	`lhotse.recipes.prepare_peoples_speech`
RIRs and Noises Corpus (OpenSLR 28)	`lhotse.recipes.prepare_rir_noise`
Speech Commands	`lhotse.recipes.prepare_speechcommands`
SpeechIO	`lhotse.recipes.prepare_speechio`
SPGISpeech	`lhotse.recipes.prepare_spgispeech`
Switchboard	`lhotse.recipes.prepare_switchboard`
TED-LIUM v2	`lhotse.recipes.prepare_tedlium2`
TED-LIUM v3	`lhotse.recipes.prepare_tedlium`
TIMIT	`lhotse.recipes.prepare_timit`
This American Life	`lhotse.recipes.prepare_this_american_life`
UWB-ATCC	`lhotse.recipes.prepare_uwb_atcc`
VCTK	`lhotse.recipes.prepare_vctk`
VoxCeleb	`lhotse.recipes.prepare_voxceleb`
VoxConverse	`lhotse.recipes.prepare_voxconverse`
VoxPopuli	`lhotse.recipes.prepare_voxpopuli`
WenetSpeech	`lhotse.recipes.prepare_wenet_speech`
YesNo	`lhotse.recipes.prepare_yesno`
Eval2000	`lhotse.recipes.prepare_eval2000`
MGB2	`lhotse.recipes.prepare_mgb2`
XBMU-AMDO31	`lhotse.recipes.xbmu_amdo31`

Currently supported video corpora

Corpus name	Function
Grid Audio-Visual Speech Corpus	`lhotse.recipes.prepare_grid`

Adding new corpora

Hint

Python data preparation recipes. Each corpus has a dedicated Python file in lhotse/recipes, which you can use as the basis for your own recipe.

Hint

(optional) Downloading utility. For publicly available corpora that can be freely downloaded, we usually define a function called download_<corpus-name>().

Hint

Data preparation Python entry-point. Each data preparation recipe should expose a single function called prepare_<corpus-name> that returns a dict like: {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}.
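
As a rough illustration of the entry-point shape described above, the sketch below uses only the standard library and plain dicts in place of RecordingSet and SupervisionSet. The corpus name toycorpus, the <utt-id>.wav plus <utt-id>.txt layout, and the record field names are all assumptions made for this example; a real recipe would build Lhotse objects instead.

```python
import wave
from pathlib import Path


def prepare_toycorpus(corpus_dir):
    """Hypothetical recipe: one WAV file and one same-named .txt transcript per utterance."""
    recordings, supervisions = [], []
    for wav_path in sorted(Path(corpus_dir).glob("*.wav")):
        # Read only the WAV header to get the sampling rate and length.
        with wave.open(str(wav_path), "rb") as f:
            sampling_rate = f.getframerate()
            num_samples = f.getnframes()
            num_channels = f.getnchannels()
        duration = num_samples / sampling_rate
        recordings.append({
            "id": wav_path.stem,
            "path": str(wav_path),
            "channels": list(range(num_channels)),
            "sampling_rate": sampling_rate,
            "num_samples": num_samples,
            "duration": duration,
        })
        # Pre-segmented layout: a single segment covering the whole file.
        text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
        supervisions.append({
            "id": f"{wav_path.stem}-0",
            "recording_id": wav_path.stem,
            "start": 0.0,
            "duration": duration,
            "channel": 0,
            "text": text,
        })
    # Same top-level keys as the dict described above.
    return {"recordings": recordings, "supervisions": supervisions}
```

The point of the sketch is the return value: one function per corpus, one dict with a recordings entry and a supervisions entry.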

Hint

CLI recipe wrappers. We provide a command-line interface that wraps the download and prepare functions -- see lhotse/bin/modes/recipes for examples of how to do it.

Hint

Pre-defined train/dev/test splits. When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure: {'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}
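
To make the nested structure concrete, here is a minimal sketch that walks such a dict split by split. The helper name iter_split_manifests is hypothetical, and placeholder strings stand in for the actual RecordingSet and SupervisionSet objects.

```python
def iter_split_manifests(manifests):
    """Yield (split, manifest_type, manifest) from a {split: {manifest_type: manifest}} dict."""
    for split, parts in manifests.items():
        for manifest_type, manifest in parts.items():
            yield split, manifest_type, manifest


# Placeholder strings stand in for RecordingSet / SupervisionSet objects.
manifests = {
    "train": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
    "dev": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
}

for split, manifest_type, manifest in iter_split_manifests(manifests):
    print(split, manifest_type, manifest)
```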

Hint

Manifest naming convention. The default naming convention is <corpus-name>_<manifest-type>_<split>.jsonl.gz, i.e., we save the manifests in a compressed JSONL file. Here, <manifest-type> can be recordings, supervisions, etc., and <split> can be train, dev, test, etc. If the corpus defines no such split, we use all as the default. Other information, such as the mic type or language, may be included in the <corpus-name>. Some examples are: cmu-indic_recordings_all.jsonl.gz, ami-ihm_supervisions_dev.jsonl.gz, mtedx-english_recordings_train.jsonl.gz.
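
The convention above can be sketched as a pair of small helpers. Both function names are hypothetical and not part of Lhotse, and the parser assumes that neither <manifest-type> nor <split> contains an underscore.

```python
SUFFIX = ".jsonl.gz"


def manifest_filename(corpus_name, manifest_type, split="all"):
    """Build <corpus-name>_<manifest-type>_<split>.jsonl.gz; split defaults to 'all'."""
    return f"{corpus_name}_{manifest_type}_{split}{SUFFIX}"


def parse_manifest_filename(filename):
    """Split a manifest file name back into (corpus_name, manifest_type, split)."""
    if not filename.endswith(SUFFIX):
        raise ValueError(f"Expected a {SUFFIX} file, got: {filename}")
    stem = filename[: -len(SUFFIX)]
    # Split from the right so that extra detail stays inside the corpus name.
    corpus_name, manifest_type, split = stem.rsplit("_", 2)
    return corpus_name, manifest_type, split


print(manifest_filename("cmu-indic", "recordings"))
print(manifest_filename("ami-ihm", "supervisions", "dev"))
print(parse_manifest_filename("mtedx-english_recordings_train.jsonl.gz"))
```

Splitting from the right keeps details such as the mic type or language attached to the corpus name, as in ami-ihm.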

Hint

Isolated utterance corpora. Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the lhotse.supervision.SupervisionSegment duration will exactly match the lhotse.audio.Recording duration (and there will likely be exactly one segment corresponding to any recording).

Hint

Conversational corpora. Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one lhotse.audio.Recording object corresponding to a single conversation/session, that spans its whole duration. Each speech segment in that recording should be represented as a lhotse.supervision.SupervisionSegment with the same recording_id value.
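
A minimal sketch of the relationship described above: one recording per conversation, with several segments pointing at it through recording_id. The Rec and Seg dataclasses are simplified stand-ins written for this example, not the Lhotse classes, and the check and its tolerance are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Simplified stand-in for a recording (not the Lhotse class)."""
    id: str
    duration: float


@dataclass
class Seg:
    """Simplified stand-in for a supervision segment (not the Lhotse class)."""
    id: str
    recording_id: str
    start: float
    duration: float


def find_bad_segments(recordings, segments, tolerance=1e-3):
    """Return ids of segments that reference no recording or fall outside its time span."""
    by_id = {r.id: r for r in recordings}
    bad = []
    for s in segments:
        rec = by_id.get(s.recording_id)
        if rec is None or s.start < 0 or s.start + s.duration > rec.duration + tolerance:
            bad.append(s.id)
    return bad


# One 300-second conversation with several speech segments inside it.
recordings = [Rec("conv1", 300.0)]
segments = [
    Seg("seg1", "conv1", 0.0, 4.5),
    Seg("seg2", "conv1", 4.5, 10.0),
    Seg("seg3", "conv1", 295.0, 10.0),  # runs past the end of the recording
    Seg("seg4", "conv2", 0.0, 1.0),     # points at a recording that does not exist
]
print(find_bad_segments(recordings, segments))  # ['seg3', 'seg4']
```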

Hint

Multi-channel corpora. Corpora with multiple channels for each session (e.g. AMI) should have a single lhotse.audio.Recording with multiple lhotse.audio.AudioSource objects, each corresponding to a separate channel. Remember to make the lhotse.supervision.SupervisionSegment objects correspond to the right channels!
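
The channel check mentioned above could look roughly like this. The helper mismatched_channels, the plain-dict inputs, and the session and segment ids are all hypothetical; the sketch only shows the idea of comparing each segment's channel against the channels of its recording.

```python
def mismatched_channels(recording_channels, segment_channels):
    """Return sorted ids of segments whose channel is absent from their recording.

    recording_channels: {recording_id: [channel, ...]}
    segment_channels:   {segment_id: (recording_id, channel)}
    """
    return sorted(
        seg_id
        for seg_id, (rec_id, channel) in segment_channels.items()
        if channel not in recording_channels.get(rec_id, [])
    )


# One session recorded on four channels, e.g. one microphone per speaker.
recording_channels = {"session1": [0, 1, 2, 3]}
segment_channels = {
    "session1-spkA-0001": ("session1", 0),
    "session1-spkB-0001": ("session1", 1),
    "session1-spkC-0001": ("session1", 7),  # channel 7 does not exist in session1
}
print(mismatched_channels(recording_channels, segment_channels))  # ['session1-spkC-0001']
```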
