lumaku/german-corpus-aligned

German Speech Corpus aligned with CTC segmentation

Alignments on Librivox and Spoken Wikipedia Corpus (SWC) with CTC segmentation:

| Dataset  | Length | Speakers | Utterances |
|----------|--------|----------|------------|
| SWC      | 210h   | 363      | 78214      |
| Librivox | 804h   | 251      | 368532     |

This repository contains the pre-processed text and the alignments. Both corpora are combined into one recipe; the audio file and source corpus of each utterance can be identified from file names and utterance IDs. The audio files are not included and can be downloaded separately.

For Librivox, the naming scheme is librivox_{book_id}_{chapter}_{utterance_id}. The separate file librivox_utt2spk contains speaker information.
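As an illustration of the naming scheme, the following Python sketch splits a Librivox utterance name into its fields and reads speaker information. It is not part of this repository. It assumes that none of the fields contain an underscore, and that librivox_utt2spk follows the Kaldi-style layout of one "utterance_id speaker_id" pair per line; neither assumption is confirmed here. The names parse_librivox_id, read_utt2spk and LibrivoxUtterance are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class LibrivoxUtterance(NamedTuple):
    """Fields of a name following librivox_{book_id}_{chapter}_{utterance_id}."""
    book_id: str
    chapter: str
    utterance_id: str


def parse_librivox_id(name: str) -> LibrivoxUtterance:
    """Split a Librivox utterance name into its three fields.

    Assumption: the fields themselves contain no underscores.
    """
    parts = name.split("_")
    if len(parts) != 4 or parts[0] != "librivox":
        raise ValueError(f"not a Librivox utterance name: {name!r}")
    return LibrivoxUtterance(*parts[1:])


def read_utt2spk(lines: Iterable[str]) -> Dict[str, str]:
    """Map utterance IDs to speaker IDs.

    Assumption: each non-empty line is "<utterance_id> <speaker_id>",
    separated by whitespace (Kaldi-style utt2spk).
    """
    mapping: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, spk_id = line.split(maxsplit=1)
        mapping[utt_id] = spk_id
    return mapping


# Example usage (the ID below is made up for illustration):
#   parse_librivox_id("librivox_123_04_0007")
#   with open("librivox_utt2spk", encoding="utf-8") as f:
#       speakers = read_utt2spk(f)
```

If the actual IDs do contain underscores inside a field, the split would need to be adapted, for example by splitting from the right with a fixed number of fields.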

A pretrained ASR model (Transformer) is in the Releases section of this repository.

A further description can be found in the CTC segmentation paper (on Springer Link, on arXiv).

Mirrors

The repository on GitHub has limited download capacity due to its Git LFS data quota. A GitLab mirror of this repository is available at: https://gitlab.com/Lumaku/german-corpus-aligned
