German Speech Corpus aligned with CTC segmentation

Alignments on Librivox and Spoken Wikipedia Corpus (SWC) with CTC segmentation:

Dataset	Length	Speakers	Utterances
SWC	210h	363	78214
Librivox	804h	251	368532

This repository contains pre-processed text and alignments. Both corpora are combined into one recipe; the source corpus and audio file of each utterance can be identified from its file name and utterance ID. The audio files must be downloaded separately:

SWC: German Spoken Wikipedia Corpus
Librivox: The audiobook IDs are listed in the metadata file books-German.json. Each ID can be looked up automatically through the LibriVox API, e.g. https://librivox.org/api/feed/audiobooks/?id=82&format=json , and the audio is then downloaded from the URL given in the response. As downloading the files individually takes time, an MP3 bundle is also available at the MMK website.
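The API lookup described above can be sketched in Python as follows. This is a minimal illustration and is not taken from the repository's own scripts. The endpoint and query parameters come from the example URL above; the helper names, the "books" list in the response, and the "url_zip_file" field are assumptions that should be checked against an actual API response before use.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example URL in this README.
API_ENDPOINT = "https://librivox.org/api/feed/audiobooks/"


def build_api_url(book_id):
    """Return the LibriVox API URL for one audiobook ID, requesting JSON."""
    query = urllib.parse.urlencode({"id": book_id, "format": "json"})
    return f"{API_ENDPOINT}?{query}"


def extract_download_url(api_response, field="url_zip_file"):
    """Pick the download link out of a parsed API response.

    Assumption: the response is a dict with a "books" list whose entries
    carry the link under `field`. Returns None if that layout is not found.
    """
    books = api_response.get("books", [])
    if not books:
        return None
    return books[0].get(field)


def fetch_book_metadata(book_id, timeout=30):
    """Query the API for one book (needs network access) and parse the JSON."""
    with urllib.request.urlopen(build_api_url(book_id), timeout=timeout) as response:
        return json.load(response)
```

A call such as extract_download_url(fetch_book_metadata(82)) would then yield the link to download, provided the response has the assumed layout. The IDs to loop over would come from books-German.json, whose structure is not described here.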

For Librivox, utterances follow the naming scheme librivox_{book_id}_{chapter}_{utterance_id}. The separate file librivox_utt2spk contains the speaker information.
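As an illustration of the naming scheme, the following Python sketch splits a Librivox utterance name into its parts. The function and class names are hypothetical, and the example names in the comments are invented. It assumes that book_id and chapter contain no underscores. The utt2spk parser additionally assumes the common Kaldi-style layout of one "utterance speaker" pair per line, which this README does not confirm.

```python
from typing import NamedTuple, Optional, Tuple


class LibrivoxUtterance(NamedTuple):
    book_id: str
    chapter: str
    utterance_id: str


def parse_librivox_name(name: str) -> Optional[LibrivoxUtterance]:
    """Split librivox_{book_id}_{chapter}_{utterance_id} into its fields.

    Returns None for names that do not follow the scheme, e.g. SWC names.
    Assumption: book_id and chapter contain no underscores, so anything
    after the third underscore belongs to utterance_id.
    """
    parts = name.split("_", 3)
    if len(parts) != 4 or parts[0] != "librivox":
        return None
    return LibrivoxUtterance(parts[1], parts[2], parts[3])


def parse_utt2spk_line(line: str) -> Optional[Tuple[str, str]]:
    """Parse one line of librivox_utt2spk.

    Assumption: Kaldi-style "utterance_name speaker_id", whitespace-separated.
    Returns None for lines that do not have exactly two fields.
    """
    fields = line.split()
    if len(fields) != 2:
        return None
    return fields[0], fields[1]


# Invented example name, for illustration only.
print(parse_librivox_name("librivox_82_03_0007"))
```

Returning None instead of raising keeps the helper usable on the combined recipe, where Librivox and SWC utterances appear side by side.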

A pretrained ASR model (Transformer) is in the Releases section of this repository.

Further details can be found in the CTC segmentation paper (on Springer Link, on arXiv).

Mirrors

The repository on GitHub has limited download capacity due to its Git LFS data quota. A GitLab mirror of this repository is available at: https://gitlab.com/Lumaku/german-corpus-aligned

