# Bucketing and Tarring

NeMo's `convert_to_tarred_audio_dataset.py` script converts an existing audio dataset with a manifest into a tarred, sharded audio dataset that can be read by `TarredAudioToTextDataLayer`.
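
For orientation, the filtering step reported in the logs ("After filtering, manifest has ...") can be approximated as follows. This is a minimal sketch, not the script's own code. It assumes the usual NeMo manifest layout of one JSON object per line with a `duration` field in seconds. The helper name `filter_manifest`, the sample entries, and the inclusive bounds are assumptions made for illustration.

```python
import json


def filter_manifest(lines, min_duration, max_duration):
    """Keep entries whose duration lies in [min_duration, max_duration].

    `lines` is any iterable of JSON strings, one manifest entry per line.
    Inclusive bounds are an assumption of this sketch.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if min_duration <= entry["duration"] <= max_duration:
            kept.append(entry)
    total_seconds = sum(e["duration"] for e in kept)
    return kept, total_seconds


# Hypothetical manifest lines, for illustration only.
sample = [
    '{"audio_filepath": "a.wav", "duration": 0.05, "text": "too short"}',
    '{"audio_filepath": "b.wav", "duration": 3.5, "text": "kept"}',
    '{"audio_filepath": "c.wav", "duration": 25.0, "text": "too long"}',
]
kept, total_seconds = filter_manifest(sample, min_duration=0.1, max_duration=20)
print(len(kept), total_seconds)
```

With a real manifest, pass the open file object as `lines`; dividing `total_seconds` by 3600 gives the dataset size in hours.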

Bucketing can improve training speed. Use `--buckets_num` to specify the number of buckets; the script then creates multiple tarred datasets, one per bucket, based on audio duration. The range `(min_duration, max_duration)` is split into equal-sized buckets. We recommend `--sort_in_shards` to speed up training by reducing padding in the batches.
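
The equal-width split described above, and the reason it reduces padding, can be sketched as follows. This illustrates the idea and is not the script's implementation. The function names `bucket_edges`, `assign_bucket`, and `padding_fraction` are hypothetical, and the choice of 4 buckets is arbitrary. The sketch assumes each batch is padded to the length of its longest sample.

```python
from bisect import bisect_right


def bucket_edges(min_duration, max_duration, buckets_num):
    """Return buckets_num + 1 equally spaced edges over the duration range."""
    width = (max_duration - min_duration) / buckets_num
    edges = [min_duration + i * width for i in range(buckets_num)]
    edges.append(max_duration)  # pin the last edge to avoid float drift
    return edges


def assign_bucket(duration, edges):
    """Return the 0-based bucket index for a duration, or None if out of range."""
    if duration < edges[0] or duration > edges[-1]:
        return None
    # bisect_right counts edges <= duration; clamp so max_duration
    # falls into the last bucket instead of one past it.
    return min(bisect_right(edges, duration) - 1, len(edges) - 2)


def padding_fraction(durations):
    """Fraction of a batch that is padding when padded to its longest sample."""
    padded_total = len(durations) * max(durations)
    return 1.0 - sum(durations) / padded_total


edges = bucket_edges(0.1, 20.0, 4)
print([assign_bucket(d, edges) for d in (3.0, 7.0, 20.0)])

# A batch mixing short and long clips wastes more compute on padding
# than a batch drawn from a single bucket.
print(padding_fraction([2.0, 18.0]), padding_fraction([2.0, 3.0]))
```

Samples in the same bucket have similar durations, so batches drawn from one bucket need less padding than batches drawn from the full duration range.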

In [1]:
!git clone https://github.com/NVIDIA/NeMo

fatal: destination path 'NeMo' already exists and is not an empty directory.


# Train set

In [2]:
!python NeMo/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path=./data/processed/train_manifest_merged.json \
    --target_dir=./data/processed/tar/train \
    --num_shards=8 \
    --max_duration=20 \
    --min_duration=0.1 \
    --shuffle --shuffle_seed=1 \
    --sort_in_shards \
    --workers=-1

Creating new tarred dataset ...
After filtering, manifest has 321153 files which amounts to 2267423.2796264687 seconds of audio.
Shuffling...
Number of samples added : 321153
Remainder: 1
Shard 0 has entries 0 ~ 40144
Shard 0 contains 40144 files
Shard 1 has entries 40144 ~ 80288
Shard 1 contains 40144 files
Shard 2 has entries 80288 ~ 120432
Shard 2 contains 40144 files
Shard 3 has entries 120432 ~ 160576
Shard 3 contains 40144 files
Shard 4 has entries 160576 ~ 200720
Shard 4 contains 40144 files
Shard 5 has entries 200720 ~ 240864
Shard 5 contains 40144 files
Shard 6 has entries 240864 ~ 281008
Shard 6 contains 40144 files
Shard 7 has entries 281008 ~ 321152
Shard 7 contains 40144 files
Have 1 entries left over that will be discarded.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed: 20.0min remaining: 33.3min
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed: 20.5min remaining: 12.3min
[Parallel(n_j
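
The shard boundaries and the single discarded entry in the log above follow from integer division of the entry count by the shard count. The sketch below reproduces those numbers. `shard_ranges` is a hypothetical helper written for this illustration, not a function from the script.

```python
def shard_ranges(num_entries, num_shards):
    """Split num_entries into num_shards equal half-open ranges.

    Returns the (start, end) ranges and the number of leftover entries
    that do not fit into an equal-sized shard.
    """
    per_shard, remainder = divmod(num_entries, num_shards)
    ranges = [(i * per_shard, (i + 1) * per_shard) for i in range(num_shards)]
    return ranges, remainder


ranges, remainder = shard_ranges(321153, 8)
print(ranges[0], ranges[-1], remainder)
# prints: (0, 40144) (281008, 321152) 1
```

Choosing a `--num_shards` value that divides the filtered entry count evenly avoids discarding any samples.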

# Test set

In [3]:
!python NeMo/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path=./data/processed/test_manifest_merged.json \
    --target_dir=./data/processed/tar/test \
    --num_shards=8 \
    --max_duration=20 \
    --min_duration=0.1 \
    --shuffle --shuffle_seed=1 \
    --sort_in_shards \
    --workers=-1

Creating new tarred dataset ...
After filtering, manifest has 362320 files which amounts to 2575015.070671837 seconds of audio.
Shuffling...
Number of samples added : 362320
Remainder: 0
Shard 0 has entries 0 ~ 45290
Shard 0 contains 45290 files
Shard 1 has entries 45290 ~ 90580
Shard 1 contains 45290 files
Shard 2 has entries 90580 ~ 135870
Shard 2 contains 45290 files
Shard 3 has entries 135870 ~ 181160
Shard 3 contains 45290 files
Shard 4 has entries 181160 ~ 226450
Shard 4 contains 45290 files
Shard 5 has entries 226450 ~ 271740
Shard 5 contains 45290 files
Shard 6 has entries 271740 ~ 317030
Shard 6 contains 45290 files
Shard 7 has entries 317030 ~ 362320
Shard 7 contains 45290 files
Have 0 entries left over that will be discarded.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   8 | elapsed: 20.8min remaining: 34.7min
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed: 25.2min remaining: 15.1min
[Parallel(n_jo

# Dev set

In [None]:
!python NeMo/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path=./data/processed/dev_manifest_merged.json \
    --target_dir=./data/processed/tar/dev \
    --num_shards=8 \
    --max_duration=20 \
    --min_duration=0.1 \
    --shuffle --shuffle_seed=1 \
    --sort_in_shards \
    --workers=-1

Creating new tarred dataset ...
After filtering, manifest has 341952 files which amounts to 2422706.3850060245 seconds of audio.
Shuffling...
Number of samples added : 341952
Remainder: 0
Shard 0 has entries 0 ~ 42744
Shard 0 contains 42744 files
Shard 1 has entries 42744 ~ 85488
Shard 1 contains 42744 files
Shard 2 has entries 85488 ~ 128232
Shard 2 contains 42744 files
Shard 3 has entries 128232 ~ 170976
Shard 3 contains 42744 files
Shard 4 has entries 170976 ~ 213720
Shard 4 contains 42744 files
Shard 5 has entries 213720 ~ 256464
Shard 5 contains 42744 files
Shard 6 has entries 256464 ~ 299208
Shard 6 contains 42744 files
Shard 7 has entries 299208 ~ 341952
Shard 7 contains 42744 files
Have 0 entries left over that will be discarded.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.
