kaiidams/Kokoro-Speech-Dataset

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 43,253 short audio clips of a single speaker reading 14 novels. The metadata format is similar to that of LJ Speech, so the dataset is compatible with modern speech synthesis systems.

The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from the mixed kanji-kana text. The readings are romanized in a format similar to the one used by Julius.

The audio clips were split and transcripts were aligned automatically by Kokoro-Align.

Sample data

Listen in your browser or download 100 randomly sampled clips.

File Format

Metadata is provided in metadata.csv. This file contains one record per line, with fields delimited by the pipe character (0x7c). The fields are:

  • ID: this is the name of the corresponding .wav file
  • Transcription: Kanji-kana mixture text spoken by the reader (UTF-8)
  • Reading: Romanized text spoken by the reader (UTF-8)
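
The record layout above can be read with a few lines of Python. This is a minimal sketch, not part of the repository's tooling: the names `Record` and `parse_metadata` are hypothetical, and the sample line in the usage note is made up for illustration rather than taken from metadata.csv.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """One line of metadata.csv (field names are illustrative)."""
    clip_id: str
    transcription: str
    reading: str


def parse_metadata(lines: Iterable[str]) -> List[Record]:
    """Split pipe-delimited lines into (ID, Transcription, Reading) records."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two pipes only, so the record always has
        # three fields even if the reading happened to contain a pipe.
        clip_id, transcription, reading = line.split("|", 2)
        records.append(Record(clip_id, transcription, reading))
    return records


def load_metadata(path: str) -> List[Record]:
    """Read a metadata file from disk; the file is UTF-8 encoded."""
    with open(path, encoding="utf-8") as f:
        return parse_metadata(f)
```

For example, `parse_metadata(["clip-0001|吾輩は猫である|wagahai wa neko de aru"])` returns a single `Record` whose `clip_id` is `"clip-0001"`. Whether the ID field carries the `.wav` extension is not assumed here.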

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
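
To confirm that an extracted clip matches the format stated above, the standard-library `wave` module is enough. The sketch below is illustrative only; the function names `matches_expected_format` and `clip_duration_seconds` are hypothetical and do not come from this repository.

```python
import wave

EXPECTED_CHANNELS = 1        # single-channel (mono)
EXPECTED_SAMPLE_WIDTH = 2    # 16-bit PCM is 2 bytes per sample
EXPECTED_SAMPLE_RATE = 22050  # Hz


def matches_expected_format(path: str) -> bool:
    """Return True if the WAV header matches mono, 16-bit, 22050 Hz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
            and wav.getframerate() == EXPECTED_SAMPLE_RATE
        )


def clip_duration_seconds(path: str) -> float:
    """Compute the clip length from the frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Note that `wave` only reads WAV files, so this check applies to clips extracted in WAV format.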

Statistics

The dataset is provided in four sizes: xlarge, large, small, and tiny. The large, small, and tiny subsets do not share any clips. xlarge contains all available clips, including those in large, small, and tiny.

X Large:
Total clips: 44788
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.718 secs
Total duration: 58:41:39

Large:
Total clips: 23461
Min duration: 3.007 secs
Max duration: 14.861 secs
Mean duration: 4.742 secs
Total duration: 30:54:16

Small:
Total clips: 9199
Min duration: 3.007 secs
Max duration: 9.961 secs
Mean duration: 4.687 secs
Total duration: 11:58:31

Tiny:
Total clips: 308
Min duration: 3.030 secs
Max duration: 8.092 secs
Mean duration: 4.695 secs
Total duration: 00:24:05
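
Figures like those above can be recomputed from a list of per-clip durations. The following is a sketch under stated assumptions: the function name `summarize_durations` is hypothetical, and truncating the total to whole seconds before formatting is a choice made here, not necessarily how the published numbers were produced.

```python
from typing import Dict, Sequence, Union


def summarize_durations(durations: Sequence[float]) -> Dict[str, Union[int, float, str]]:
    """Summarize clip durations (in seconds) in the style of the tables above."""
    if not durations:
        raise ValueError("durations must not be empty")
    total = sum(durations)
    # Format the total as HH:MM:SS, truncating fractional seconds (assumption).
    hours, remainder = divmod(int(total), 3600)
    minutes, seconds = divmod(remainder, 60)
    return {
        "total_clips": len(durations),
        "min_duration": min(durations),
        "max_duration": max(durations),
        "mean_duration": total / len(durations),
        "total_duration": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
    }
```

The durations themselves could come from reading each extracted clip, for example by dividing its frame count by its sample rate.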

How to get the data

Because the dataset is large, the audio files are not included in this repository, but the metadata is.

To create the .wav files for the dataset, first run

$ bash download.sh

to download the metadata from the project page. Then run

$ pip3 install torchaudio
$ python3 extract.py --size tiny

If you have not already downloaded the source audio, this prints an example shell script that downloads the MP3 audio files from archive.org and extracts them.

After running that script, run the command again

$ python3 extract.py --size tiny

to get the files for tiny under the ./output directory.

To get a different subset, pass another size name to the --size option.

You can specify the audio clip format with the --format option.

Pretrained Tacotron model

A pretrained Tacotron model trained on Kokoro Speech Dataset is available, along with audio samples. The model was trained for 21K steps on small. According to the Tacotron repository, "Speech started to become intelligible around 20K steps" with LJ Speech Dataset. The audio samples read the first few sentences of Gon Gitsune, which is not included in small.

Books

The dataset contains recordings from these books read by ekzemplaro

Similar project

This project was also inspired by CSS10, which contains audio clips of various languages from LibriVox.

Changelog

  • v1.3 Keep word separators in transcripts with '_'
  • v1.2 New metadata generated with a new align model
  • v1.1.1 Added FLAC, MP3, OGG support
  • v1.1 Added more books
  • v1.0 Initial release

Credits

All texts are from Aozora Bunko. Recordings by ekzemplaro from LibriVox. Alignment and annotation by Katsuya Iida.

License

This dataset is in the public domain in the USA (and most likely other countries as well). There are no restrictions on its use. For more information, please see: librivox.org/pages/public-domain.