Provides downloading, unpacking and processing of major Open Source audio (speech) datasets

mcfletch/audiodatasets

Audio Datasets

Pulls and pre-processes major Open Source datasets for spoken audio

  • Supported Datasets:
  • Intended for use on Linux servers. The library is expected to feed a machine learning system (this is not required, but it is the point of collecting these datasets)
  • MIT license for the software, but please note that the datasets themselves are generally for non-commercial use only

Features

  • Downloads common Open Source datasets and performs basic preprocessing on them
  • Provides iterables that produce Numpy arrays from the audio data in common formats
  • Uses sphfile to directly access sph files instead of needing to convert to wav first
  • Uses a single shared location for the datasets intended to be used by multiple projects

Installation/Setup

You need to create the download directory and make it writable by the running user. Preferably you will do that via group-based permissions to allow sharing, but here we show setting ownership for a specific user and group:

$ mkdir -p /var/datasets
$ chown user:group /var/datasets
$ chmod g+rw /var/datasets

If /var/datasets doesn't exist or isn't writable, the downloader will instead populate ~/.config/datasets with the data. You may wish to link that directory to /var/datasets so that you can use default instantiations of the corpora:

$ ln -s /var/datasets ~/.config/datasets
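The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the downloader's actual code; the function name `choose_dataset_dir` is hypothetical.

```python
import os

def choose_dataset_dir(preferred="/var/datasets", fallback="~/.config/datasets"):
    """Return `preferred` if it is an existing, writable directory, else `fallback`.

    Hypothetical helper: it mirrors the rule described in this README,
    not the library's real implementation.
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    # Expand "~" so the caller gets an absolute path for the per-user fallback
    return os.path.expanduser(fallback)

if __name__ == "__main__":
    print(choose_dataset_dir())
```

Running this before the download shows which directory would be populated on the current machine under that rule.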

Note that the downloader expects the following tools to be available; this may not yet be the case in a Docker or minimal OS installation:

  • tar
  • wget
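A quick way to confirm both tools are on the PATH before starting a long download is a check like the following. It uses only the standard library; the helper name `missing_tools` is hypothetical and not part of audiodatasets.

```python
import shutil

def missing_tools(tools=("tar", "wget")):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install: " + ", ".join(missing))
    else:
        print("tar and wget are available")
```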

Now you can download the datasets.

Note

The datasets are big (100+GB)!

If you are paying for data or are working on a slow connection you will likely want to arrange to do this step during a low-rated period or on a separate data connection.

From a command prompt:

$ pip install audiodatasets
# this will download 100+GB and then unpack it on disk, it will take a while...
$ audiodatasets-download

Creating MFCC data-files:

# this will generate Mel-frequency Cepstral Coefficient (MFCC) summaries for the
# audio datasets (and download them if that hasn't been done). This isn't necessary
# if you are doing only raw-audio processing
$ audiodatasets-preprocess

Playing some audio:

# this will iterate through playing every utterance that includes 'moon' in the transcript
$ audiodatasets-search 'moon'
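If you want the same kind of filtering from Python rather than the command line, a transcript filter might look like the sketch below. It assumes utterances are (name, transcript, audio_file) tuples, as in the Usage examples. The whole-word, case-insensitive matching is an assumption (the command-line tool may match differently), and `matching_utterances` is a hypothetical name.

```python
import re

def matching_utterances(utterances, word):
    """Yield (name, transcript, audio_file) tuples whose transcript contains `word`.

    Assumption: whole-word, case-insensitive matching.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    for name, transcript, audio_file in utterances:
        if pattern.search(transcript):
            yield name, transcript, audio_file

if __name__ == "__main__":
    # Made-up records, for illustration only
    sample = [
        ("u1", "The moon is bright", "u1.wav"),
        ("u2", "MOONLIGHT SONATA", "u2.wav"),
        ("u3", "fly me to the Moon", "u3.wav"),
    ]
    for name, transcript, _ in matching_utterances(sample, "moon"):
        print(name, transcript)
```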

Usage

Once set up, you will likely want to iterate over the datasets using, for instance, a partition to separate out train/validation/test data. To iterate over the raw audio:

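from audiodatasets.corpora import build_corpora, partition
import random

def train_valid_test():
    """Create training, validation and test datasets

    returns three iterators yielding (array[10:512],transcript) batches
    """
    utterances = []
    for corpus in build_corpora():
        utterances.extend( corpus.iter_utterances())
    random.shuffle(utterances)
    # a (3,1,1) split: three parts training, one part each for test and validation
    train, test,valid = partition( utterances, (3,1,1) )
    def generation( utterances ):
        while True:
            # new random sample offset on each pass over the data
            offset = random.randint(0,511)
            for name,transcript,audio_file in utterances:
                # NOTE: `t` is not defined in this example as published; bind it to
                # whatever object provides iter_batches in your installed version
                for batch in t.iter_batches( audio_file, batch_size=10, input=512, offset=offset ):
                    yield batch,transcript
    # returned in the order the function name promises: train, validation, test
    return generation(train),generation(valid),generation(test)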
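To make the (3,1,1) split concrete, here is a minimal pure-Python sketch of ratio-based partitioning. It is not the library's `partition` function, whose exact behaviour may differ (for example in how it rounds). The name `partition_by_ratio` is hypothetical.

```python
def partition_by_ratio(items, ratios):
    """Split `items` into consecutive slices sized in proportion to `ratios`.

    Hypothetical stand-in for illustration; every item lands in exactly one slice.
    """
    total = sum(ratios)
    parts, start, cumulative = [], 0, 0
    for ratio in ratios:
        cumulative += ratio
        # Integer arithmetic on the cumulative ratio avoids losing items to rounding
        stop = len(items) * cumulative // total
        parts.append(items[start:stop])
        start = stop
    return parts

if __name__ == "__main__":
    train, test, valid = partition_by_ratio(list(range(10)), (3, 1, 1))
    print(len(train), len(test), len(valid))
```

Shuffling the utterances before splitting, as the example above does, keeps each slice from being dominated by a single corpus.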

To iterate over the preprocessed MFCC data, which provides 20 frequency bins per 10ms processing window:

from audiodatasets.corpora import build_corpora, partition
import random

def train_valid_test():
    """Create training, validation and test datasets

    Note: within each batch, *time* is the fastest-varying (last) axis,
    while the frequency bins are the second-fastest axis.

    See: `LibRosa MFCC <https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html>`_

    returns three iterators yielding (array[10:20:63],transcript) batches
    """
    utterances = []
    for corpus in build_corpora():
        utterances.extend( corpus.mfcc_utterances())
    random.shuffle(utterances)
    # a (3,1,1) split: three parts training, one part each for test and validation
    train, test,valid = partition( utterances, (3,1,1) )
    def generation( utterances ):
        while True:
            # new random frame offset on each pass over the data
            offset = random.randint(0,62)
            for name,transcript,audio_file in utterances:
                # NOTE: `t` is not defined in this example as published; bind it to
                # whatever object provides iter_batches in your installed version
                for batch in t.iter_batches( audio_file, batch_size=10, input=63, offset=offset ):
                    yield batch,transcript
    # returned in the order the function name promises: train, validation, test
    return generation(train),generation(valid),generation(test)
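The array[10:20:63] layout can be illustrated with plain NumPy. The sketch below cuts a (bins, frames) feature array into 63-frame windows along the time axis and stacks them ten at a time. It is not the library's `iter_batches`. The function name `iter_time_batches`, the decision to drop incomplete batches, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np

def iter_time_batches(features, batch_size=10, window=63, offset=0):
    """Yield arrays of shape (batch_size, n_bins, window) from a (n_bins, n_frames) array.

    Hypothetical sketch: windows are cut along the time (last) axis starting
    at `offset`; trailing frames and incomplete batches are dropped.
    """
    n_bins, n_frames = features.shape
    usable = (n_frames - offset) // window  # whole windows available after the offset
    windows = [
        features[:, offset + i * window : offset + (i + 1) * window]
        for i in range(usable)
    ]
    for start in range(0, usable - batch_size + 1, batch_size):
        yield np.stack(windows[start:start + batch_size])

if __name__ == "__main__":
    # Synthetic stand-in for MFCC data: 20 bins by 1300 frames
    fake = np.arange(20 * 1300).reshape(20, 1300)
    for batch in iter_time_batches(fake, offset=5):
        print(batch.shape)
```

Stepping along the last axis of a batch moves through time, and stepping along the middle axis moves between frequency bins, which matches the note in the docstring above.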
