# How to combine multiple datasets

This notebook shows how to leverage Lhotse for combining multiple datasets.

We do not perform any data I/O or transforms here for simplicity, but the samplers defined in this tutorial can be used with everything we demonstrate in other tutorials (e.g., see the training dataset in `examples/00-basic-workflow.ipynb`).

⚠️ Throughout this notebook, we mostly use `SimpleCutSampler` and `BucketingSampler` and "eager" (fully in-memory) `CutSet`s because we work with very small datasets here. 
When working with larger datasets, you will usually want to read cuts lazily (e.g., with `CutSet.from_jsonl_lazy()`) and use dynamic samplers (e.g., `DynamicCutSampler` and `DynamicBucketingSampler`).

In [1]:
# Optional auto-formatting

#!pip install nb_black
%load_ext lab_black

In [2]:
# Get the latest version of Lhotse, if not installed:

#!pip install git+https://github.com/lhotse-speech/lhotse

In [3]:
import os
from pathlib import Path

import torch

from lhotse import CutSet
from lhotse.dataset import (
    BucketingSampler,
    DynamicBucketingSampler,
    DynamicCutSampler,
    SimpleCutSampler,
    RoundRobinSampler,
    ZipSampler,
)
from lhotse.recipes import (
    download_librispeech,
    download_yesno,
    prepare_librispeech,
    prepare_yesno,
)

In [4]:
root_dir = Path("data")
tmp_dir = Path("tmp")
tmp_dir.mkdir(exist_ok=True)
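# Leave one core free, but always use at least one job (os.cpu_count() may return None).
num_jobs = max(1, (os.cpu_count() or 2) - 1)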

# (mini) LibriSpeech

This dataset contains 5h of training data and 2h of dev data.

We download the data, prepare recording/supervision manifests, and compile them into CutSets.

Approx. download size 450MB.

In [5]:
# libri_variant = "librispeech"
libri_variant = "mini_librispeech"
libri_root = download_librispeech(root_dir, dataset_parts=libri_variant)
libri = prepare_librispeech(
    libri_root, dataset_parts=libri_variant, output_dir=root_dir, num_jobs=num_jobs
)
cuts_libri_train = CutSet.from_manifests(
    **libri["train-clean-5"]
).trim_to_supervisions()
cuts_libri_dev = CutSet.from_manifests(**libri["dev-clean-2"]).trim_to_supervisions()

Downloading LibriSpeech parts:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset parts:   0%|          | 0/2 [00:00<?, ?it/s]

Distributing tasks: 0it [00:00, ?it/s]

Processing:   0%|          | 0/1089 [00:00<?, ?it/s]

Distributing tasks: 0it [00:00, ?it/s]

Processing:   0%|          | 0/1519 [00:00<?, ?it/s]

In [6]:
cuts_libri_train.describe()

Cuts count: 1519
Total duration (hours): 5.3
Speech duration (hours): 5.3 (100.0%)
***
Duration statistics (seconds):
mean	12.6
std	3.6
min	1.9
25%	11.3
50%	13.9
75%	15.2
99%	16.6
99.5%	16.7
99.9%	17.1
max	17.3


# YesNo

This dataset contains 30 training utterances and 30 dev utterances. 
It has only two words: yes and no.
It is approx. 50x smaller than mini LibriSpeech in number of cuts (30 vs. 1519), resulting in a heavy data imbalance.

We download the data, prepare recording/supervision manifests, and compile them into CutSets.

Approx. download size 4.5MB.


In [7]:
yesno_root = download_yesno(root_dir)
yesno = prepare_yesno(yesno_root, output_dir=root_dir)
cuts_yesno_train = CutSet.from_manifests(**yesno["train"])
cuts_yesno_dev = CutSet.from_manifests(**yesno["test"])

In [8]:
cuts_yesno_train.describe()

Cuts count: 30
Total duration (hours): 0.1
Speech duration (hours): 0.1 (100.0%)
***
Duration statistics (seconds):
mean	6.0
std	0.4
min	4.9
25%	5.9
50%	6.1
75%	6.2
99%	6.7
99.5%	6.7
99.9%	6.7
max	6.7


Note: we can see that YesNo has much shorter utterances than mini LibriSpeech.

# Helper code

## Mark each cut to see which dataset it comes from

In [9]:
for c in cuts_libri_train:
    c.origin = "libri"
for c in cuts_yesno_train:
    c.origin = "yesno"

## Identity dataset, just to enable iterating DataLoader

In [10]:
class DummyDataset(torch.utils.data.Dataset):
    """
    Dataset that actually does nothing and just returns the CutSet.
    It will help us illustrate iteration over the data using a DataLoader.
    """

    def __getitem__(self, cuts: CutSet) -> CutSet:
        return cuts


dataset = DummyDataset()

# Method 1: simply concatenate datasets

This method is the simplest. 

**When is it good to use it?**

✅ You believe that the quantity ratio between the datasets is adequate (in other words, you don't care about data imbalance for any reason).

✅ You work with small to medium sized data and do not use lazy manifests: you can shuffle everything in memory, which ensures that the examples from the smaller dataset are seen uniformly throughout a training epoch.

**When to expect poor performance?**

⚠️ You expect the dataset imbalance to create issues for your model's training.

⚠️ You work with large datasets and Dynamic* samplers -- you will be shuffling data lazily with a buffer window, and examples from smaller datasets will likely only be seen closer to the start or the end of a training epoch.

## 1.1 Using SimpleCutSampler

SimpleCutSampler simply shuffles everything and iterates over the cuts without any regard for their durations. The shuffling is "exact", so you can expect good randomness. We observe that there aren't many YesNo cuts in the first 20 batches, which reflects the dataset distribution.

Note: unless you specifically prepared the cuts to have similar durations (e.g., for VAD, diarization, or speaker ID training), each mini-batch may contain cuts of varying duration, which results in excessive padding with this type of sampler.
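The padding figure printed in the cells below can be reproduced by hand. A minimal sketch, assuming every cut in a batch is padded to the longest cut in that batch; the durations are hypothetical values loosely based on the statistics above, and `padding_seconds` is an illustrative helper, not part of Lhotse:

```python
from typing import List


def padding_seconds(durations: List[float]) -> float:
    """Total padding when every cut is padded to the longest cut in the batch."""
    longest = max(durations)
    return sum(longest - d for d in durations)


# Hypothetical batches (durations in seconds).
libri_only = [13.9, 15.2, 11.3, 12.6, 16.6, 14.0]
# Same LibriSpeech-like cuts, but the last one is swapped for two short YesNo-like cuts.
mixed = [13.9, 15.2, 11.3, 12.6, 16.6, 6.0, 6.1]

print(f"libri only: {padding_seconds(libri_only):.1f}s padding")
print(f"mixed:      {padding_seconds(mixed):.1f}s padding")
```

Each short cut in the mixed batch has to be padded by more than 10 seconds, which is why batches containing YesNo cuts show much larger padding.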

In [11]:
cuts_train = cuts_yesno_train + cuts_libri_train

sampler = SimpleCutSampler(cuts_train, max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [12]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  1 | 88.3s speech | 50.5s padding |
| batch  1 | #libri-cuts  6 | #yesno-cuts  2 | 92.6s speech | 67.4s padding |
| batch  2 | #libri-cuts  6 | #yesno-cuts  0 | 86.5s speech | 11.2s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  1 | 98.4s speech | 42.6s padding |
| batch  4 | #libri-cuts  7 | #yesno-cuts  0 | 93.7s speech | 15.9s padding |
| batch  5 | #libri-cuts  8 | #yesno-cuts  1 | 98.7s speech | 59.9s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  0 | 96.8s speech |  7.8s padding |
| batch  7 | #libri-cuts  7 | #yesno-cuts  0 | 96.5s speech | 17.3s padding |
| batch  8 | #libri-cuts  8 | #yesno-cuts  0 | 87.3s speech | 41.9s padding |
| batch  9 | #libri-cuts  7 | #yesno-cuts  0 | 89.5s speech | 17.1s padding |
| batch 10 | #libri-cuts  6 | #yesno-cuts  1 | 92.3s speech | 35.3s padding |
| batch 11 | #libri-cuts  8 | #yesno-cuts  0 | 91.6s speech | 31.9s padding |
| batch 12 | #libri-cuts  6 | #yesno-cuts  0 | 87.7s speech |  6

## 1.2 Using BucketingSampler

BucketingSampler also shuffles the full cutset in memory, so you can expect good randomness and less padding overall. The batch size is more dynamic with this type of sampler.
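To see why grouping cuts of similar duration reduces padding, here is a simplified model of bucketing. It is only a sketch: it sorts hypothetical durations, splits them into equally sized buckets, and treats each bucket as one batch, whereas the real sampler also shuffles and applies the `max_duration` constraint. The helper names are illustrative and not part of Lhotse.

```python
from typing import List


def padding_seconds(durations: List[float]) -> float:
    """Total padding when every cut is padded to the longest cut in the batch."""
    longest = max(durations)
    return sum(longest - d for d in durations)


def split_into_buckets(durations: List[float], num_buckets: int) -> List[List[float]]:
    """Sort by duration and cut the sorted list into equally sized buckets."""
    ordered = sorted(durations)
    size = -(-len(ordered) // num_buckets)  # ceiling division
    return [ordered[i : i + size] for i in range(0, len(ordered), size)]


# Hypothetical durations (seconds): long and short cuts interleaved.
durations = [13.9, 6.0, 15.2, 6.1, 11.3, 5.9, 16.6, 6.2]

# Batches taken in arrival order mix long and short cuts.
arrival_batches = [durations[:4], durations[4:]]
# Batches taken from duration buckets contain cuts of similar length.
bucketed_batches = split_into_buckets(durations, num_buckets=2)

pad_arrival = sum(padding_seconds(batch) for batch in arrival_batches)
pad_bucketed = sum(padding_seconds(batch) for batch in bucketed_batches)
print(f"arrival order: {pad_arrival:.1f}s padding, bucketed: {pad_bucketed:.1f}s padding")
```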

In [13]:
cuts_train = cuts_yesno_train + cuts_libri_train

sampler = BucketingSampler(cuts_train.to_eager(), max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [14]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  0 | 88.1s speech |  2.8s padding |
| batch  1 | #libri-cuts  8 | #yesno-cuts  2 | 81.3s speech | 30.7s padding |
| batch  2 | #libri-cuts  6 | #yesno-cuts  0 | 84.7s speech |  1.3s padding |
| batch  3 | #libri-cuts  6 | #yesno-cuts  0 | 87.8s speech |  1.1s padding |
| batch  4 | #libri-cuts  6 | #yesno-cuts  0 | 93.1s speech |  0.9s padding |
| batch  5 | #libri-cuts 11 | #yesno-cuts  0 | 83.7s speech | 14.7s padding |
| batch  6 | #libri-cuts  6 | #yesno-cuts  0 | 90.7s speech |  1.0s padding |
| batch  7 | #libri-cuts  6 | #yesno-cuts  0 | 93.6s speech |  0.7s padding |
| batch  8 | #libri-cuts  6 | #yesno-cuts  0 | 93.3s speech |  0.7s padding |
| batch  9 | #libri-cuts  6 | #yesno-cuts  0 | 88.5s speech |  0.9s padding |
| batch 10 | #libri-cuts  8 | #yesno-cuts  0 | 88.1s speech |  7.7s padding |
| batch 11 | #libri-cuts  6 | #yesno-cuts  0 | 96.4s speech |  2.2s padding |
| batch 12 | #libri-cuts  9 | #yesno-cuts  1 | 79.8s speech | 25

## 1.3 Using DynamicCutSampler

This time, since the dynamic sampler performs shuffling with a fixed-window buffer, you'll notice that there is a higher concentration of YesNo cuts in the first few batches (even though shuffling is enabled!). This might cause some convergence issues during training.
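The effect of a fixed-window buffer can be illustrated without any audio data. The sketch below is a simplified model of buffered (streaming) shuffling, not Lhotse's implementation; `buffered_shuffle` and the buffer size of 100 are assumptions chosen for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Item = Tuple[str, int]  # (origin, index in the input stream)


def buffered_shuffle(items: Iterable[Item], buffer_size: int, seed: int = 0) -> Iterator[Item]:
    """Streaming shuffle: keep about `buffer_size` items and emit a random one as new items arrive."""
    rng = random.Random(seed)
    buffer: List[Item] = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Swap a random element to the end and emit it.
            pick = rng.randrange(len(buffer))
            buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
            yield buffer.pop()
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Same order as `cuts_yesno_train + cuts_libri_train`: 30 YesNo items, then 1519 LibriSpeech items.
stream = [("yesno", i) for i in range(30)] + [("libri", i) for i in range(30, 1549)]
shuffled = list(buffered_shuffle(stream, buffer_size=100))

yesno_positions = [pos for pos, (origin, _) in enumerate(shuffled) if origin == "yesno"]
mean_yesno_position = sum(yesno_positions) / len(yesno_positions)
print(f"mean output position of YesNo items: {mean_yesno_position:.0f} of {len(shuffled)}")
```

An item can only be emitted once it has entered the buffer, so with a buffer much smaller than the dataset the output order stays close to the input order, and the YesNo items remain near the start of the epoch.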

In [15]:
cuts_train = cuts_yesno_train + cuts_libri_train

sampler = DynamicCutSampler(cuts_train, max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [16]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  1 | 88.3s speech | 50.5s padding |
| batch  1 | #libri-cuts  6 | #yesno-cuts  2 | 92.6s speech | 67.4s padding |
| batch  2 | #libri-cuts  6 | #yesno-cuts  0 | 86.5s speech | 11.2s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  1 | 98.4s speech | 42.6s padding |
| batch  4 | #libri-cuts  7 | #yesno-cuts  0 | 93.7s speech | 15.9s padding |
| batch  5 | #libri-cuts  8 | #yesno-cuts  1 | 98.7s speech | 59.9s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  0 | 96.8s speech |  7.8s padding |
| batch  7 | #libri-cuts  7 | #yesno-cuts  0 | 96.5s speech | 17.3s padding |
| batch  8 | #libri-cuts  8 | #yesno-cuts  0 | 87.3s speech | 41.9s padding |
| batch  9 | #libri-cuts  7 | #yesno-cuts  0 | 89.5s speech | 17.1s padding |
| batch 10 | #libri-cuts  6 | #yesno-cuts  1 | 92.3s speech | 35.3s padding |
| batch 11 | #libri-cuts  8 | #yesno-cuts  0 | 91.6s speech | 31.9s padding |
| batch 12 | #libri-cuts  6 | #yesno-cuts  0 | 87.7s speech |  6

# Method 2: have two DataLoaders

This method is also conceptually simple, although it requires writing the training loop a bit differently.

**When is it good to use it?**

✅ You want to stop the epoch as soon as the smallest dataset has been fully iterated. This effectively under-samples the larger datasets and compensates for dataset imbalance.

**When to expect poor performance?**

⚠️ You want to leverage 100% of data at your disposal -- this might not happen here.

⚠️ You work with large datasets and Dynamic* samplers -- since you will be shuffling data lazily with a buffer window, during each epoch you'll probably see mostly the same examples from the larger dataset, just in a different order. Expect about `len(larger_dataset) - len(smaller_dataset)` examples from the larger dataset to be unused during training, unless you specifically design your code to alleviate that (e.g., by sharding the larger dataset and reading different shards each epoch).

In [17]:
sampler_libri = DynamicCutSampler(cuts_libri_train, max_duration=100, shuffle=True)
dloader_libri = torch.utils.data.DataLoader(
    dataset, sampler=sampler_libri, batch_size=None
)

sampler_yesno = DynamicCutSampler(cuts_yesno_train, max_duration=100, shuffle=True)
dloader_yesno = torch.utils.data.DataLoader(
    dataset, sampler=sampler_yesno, batch_size=None
)



In [18]:
dloaders = [iter(dloader_yesno), iter(dloader_libri)]
idx = 0
while True:
    choice = idx % 2
    chosen_dloader = dloaders[choice]

    try:
        batch = next(chosen_dloader)
    except StopIteration:
        break

    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

    idx += 1

| batch  0 | #libri-cuts  0 | #yesno-cuts 16 | 96.6s speech | 11.2s padding |
| batch  1 | #libri-cuts  7 | #yesno-cuts  0 | 90.4s speech | 23.3s padding |
| batch  2 | #libri-cuts  0 | #yesno-cuts 14 | 84.8s speech |  7.3s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  0 | 93.8s speech |  9.7s padding |
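
The four batches above show how little of the larger dataset is visited per epoch with this method. A back-of-the-envelope sketch, assuming about 7 LibriSpeech cuts per batch (as in the output above) and one LibriSpeech batch per YesNo batch:

```python
n_libri_cuts = 1519       # from cuts_libri_train.describe()
libri_cuts_per_batch = 7  # typical value in the output above (assumption)
n_yesno_batches = 2       # YesNo was exhausted after two batches

# The loop alternates between the loaders, so it draws one LibriSpeech batch per YesNo batch.
libri_cuts_seen = n_yesno_batches * libri_cuts_per_batch
fraction_seen = libri_cuts_seen / n_libri_cuts
print(f"{libri_cuts_seen} of {n_libri_cuts} LibriSpeech cuts seen ({fraction_seen:.1%})")
```

Under these assumptions, less than 1% of the LibriSpeech cuts are used in a single epoch.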


## Method 2b: RoundRobinSampler

This method is more straightforward to use than two DataLoaders and may be more memory-friendly and easier to manage, as sub-process workers only need to be spawned by a single DataLoader.

The `stop_early` argument lets us choose between a balanced (`True`) and an imbalanced (`False`) mix of datasets.
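
The two modes can be modelled with plain Python iterables. This is a sketch of the round-robin idea under the semantics described above, not the `RoundRobinSampler` source code; `round_robin` and the batch labels are hypothetical.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def round_robin(*sources: Iterable[T], stop_early: bool = False) -> Iterator[T]:
    """Yield one item from each source in turn."""
    iterators = [iter(source) for source in sources]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                if stop_early:
                    return  # the first depleted source ends the epoch
                continue  # otherwise drop the depleted source and carry on
            yield item
            still_active.append(it)
        iterators = still_active


# Hypothetical batch labels: six LibriSpeech batches, two YesNo batches.
libri_batches = ["L0", "L1", "L2", "L3", "L4", "L5"]
yesno_batches = ["Y0", "Y1"]

all_batches = list(round_robin(libri_batches, yesno_batches))
balanced = list(round_robin(libri_batches, yesno_batches, stop_early=True))
print(all_batches)
print(balanced)
```

With `stop_early=True`, the epoch ends at the first attempt to draw from the depleted source, so the larger source contributes at most one batch more than the smaller one.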

### `stop_early=False`

Notice that YesNo becomes exhausted after batch 3, and the rest of the epoch is purely mini LibriSpeech.

In [19]:
sampler_libri = DynamicCutSampler(cuts_libri_train, max_duration=100, shuffle=True)
sampler_yesno = DynamicCutSampler(cuts_yesno_train, max_duration=100, shuffle=True)

sampler_both = RoundRobinSampler(sampler_libri, sampler_yesno, stop_early=False)

dloader_both = torch.utils.data.DataLoader(
    dataset, sampler=sampler_both, batch_size=None
)

In [20]:
for idx, batch in enumerate(dloader_both):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  0 | 90.4s speech | 23.3s padding |
| batch  1 | #libri-cuts  0 | #yesno-cuts 16 | 96.6s speech | 11.2s padding |
| batch  2 | #libri-cuts  7 | #yesno-cuts  0 | 93.8s speech |  9.7s padding |
| batch  3 | #libri-cuts  0 | #yesno-cuts 14 | 84.8s speech |  7.3s padding |
| batch  4 | #libri-cuts  8 | #yesno-cuts  0 | 89.0s speech | 40.6s padding |
| batch  5 | #libri-cuts  7 | #yesno-cuts  0 | 97.2s speech |  9.2s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  0 | 87.7s speech | 22.8s padding |
| batch  7 | #libri-cuts  6 | #yesno-cuts  0 | 86.0s speech |  5.7s padding |
| batch  8 | #libri-cuts  6 | #yesno-cuts  0 | 91.8s speech |  8.8s padding |
| batch  9 | #libri-cuts  7 | #yesno-cuts  0 | 90.5s speech | 19.4s padding |
| batch 10 | #libri-cuts  9 | #yesno-cuts  0 | 98.5s speech | 35.9s padding |
| batch 11 | #libri-cuts  7 | #yesno-cuts  0 | 95.0s speech | 15.6s padding |
| batch 12 | #libri-cuts  8 | #yesno-cuts  0 | 96.0s speech | 24

### Balanced mix of datasets `stop_early=True`

Notice that the epoch consists of only 5 mini-batches, because one of the datasets (YesNo) is extremely small.

In [21]:
sampler_libri = DynamicCutSampler(cuts_libri_train, max_duration=100, shuffle=True)
sampler_yesno = DynamicCutSampler(cuts_yesno_train, max_duration=100, shuffle=True)

sampler_both = RoundRobinSampler(sampler_libri, sampler_yesno, stop_early=True)

dloader_both = torch.utils.data.DataLoader(
    dataset, sampler=sampler_both, batch_size=None
)

In [22]:
for idx, batch in enumerate(dloader_both):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  0 | 90.4s speech | 23.3s padding |
| batch  1 | #libri-cuts  0 | #yesno-cuts 16 | 96.6s speech | 11.2s padding |
| batch  2 | #libri-cuts  7 | #yesno-cuts  0 | 93.8s speech |  9.7s padding |
| batch  3 | #libri-cuts  0 | #yesno-cuts 14 | 84.8s speech |  7.3s padding |
| batch  4 | #libri-cuts  8 | #yesno-cuts  0 | 89.0s speech | 40.6s padding |


# Method 3: CutSet multiplexing

This method creates a lazily-evaluated CutSet out of two or more other CutSets. The result acts as a stochastic multiplexer.

**When is it good to use it?**

✅ You work with large datasets and Dynamic* samplers -- multiplexing doesn't require reading everything into memory and can improve the randomness of buffered shuffling.

✅ (when `stop_early=True`) You want to stop the epoch as soon as the smallest dataset has been fully iterated. This effectively under-samples the larger datasets and compensates for dataset imbalance. 

**When to expect poor performance?**

⚠️ You skipped setting the mux-ing weights despite the datasets being significantly imbalanced vs. each other.

⚠️ (when `stop_early=True`) You work with large datasets and Dynamic* samplers -- since you will be shuffling data lazily with a buffer window, during each epoch you'll probably see mostly the same examples from the larger dataset, just in a different order. Expect about `len(larger_dataset) - len(smaller_dataset)` examples from the larger dataset to be unused during training, unless you specifically design your code to alleviate that (e.g., by sharding the larger dataset and reading different shards each epoch).

⚠️ (when `stop_early=True`) You want to leverage 100% of data at your disposal -- this might not happen here.

## Unweighted multiplexing

This will tend to exhaust the smaller datasets much sooner than the larger ones, but will keep iterating until all data has been seen.
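
A toy model of a stochastic multiplexer makes this behaviour easy to inspect. The sketch below is an assumption about how such a multiplexer can work, not the code behind `CutSet.mux`; the local `mux` function, its `seed` parameter, and the item tuples are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Item = Tuple[str, int]  # (origin, index)


def mux(
    *sources: Iterable[Item],
    weights: Optional[Sequence[float]] = None,
    stop_early: bool = False,
    seed: int = 0,
) -> Iterator[Item]:
    """Draw the next item from a randomly chosen source until the sources run out."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    active_weights: List[float] = (
        list(weights) if weights is not None else [1.0] * len(iterators)
    )
    while iterators:
        pick = rng.choices(range(len(iterators)), weights=active_weights)[0]
        try:
            item = next(iterators[pick])
        except StopIteration:
            if stop_early:
                return  # one source is depleted: end the epoch
            # Otherwise drop the depleted source and keep sampling from the rest.
            del iterators[pick]
            del active_weights[pick]
            continue
        yield item


libri = [("libri", i) for i in range(1519)]
yesno = [("yesno", i) for i in range(30)]

full_epoch = list(mux(libri, yesno))
last_yesno = max(pos for pos, (origin, _) in enumerate(full_epoch) if origin == "yesno")
print(f"all {len(full_epoch)} items seen; last YesNo item at position {last_yesno}")

short_epoch = list(mux(libri, yesno, stop_early=True))
print(f"stop_early=True ends the epoch after {len(short_epoch)} items")
```

With equal weights, every draw picks YesNo about half of the time, so its 30 items are used up within the first few dozen draws while most LibriSpeech items are still waiting.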

In [23]:
cuts_train = CutSet.mux(cuts_libri_train, cuts_yesno_train)

sampler = DynamicCutSampler(cuts_train, max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [24]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  2 | 95.5s speech | 74.2s padding |
| batch  1 | #libri-cuts  5 | #yesno-cuts  3 | 84.8s speech | 91.2s padding |
| batch  2 | #libri-cuts  7 | #yesno-cuts  0 | 91.3s speech | 22.8s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  0 | 99.6s speech |  7.7s padding |
| batch  4 | #libri-cuts  6 | #yesno-cuts  1 | 91.0s speech | 35.9s padding |
| batch  5 | #libri-cuts  8 | #yesno-cuts  1 | 99.0s speech | 60.5s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  0 | 96.0s speech |  8.7s padding |
| batch  7 | #libri-cuts  8 | #yesno-cuts  0 | 98.9s speech | 30.2s padding |
| batch  8 | #libri-cuts  8 | #yesno-cuts  0 | 98.9s speech | 22.9s padding |
| batch  9 | #libri-cuts  7 | #yesno-cuts  0 | 86.5s speech | 22.0s padding |
| batch 10 | #libri-cuts  8 | #yesno-cuts  0 | 99.7s speech | 27.9s padding |
| batch 11 | #libri-cuts  7 | #yesno-cuts  0 | 92.0s speech | 16.3s padding |
| batch 12 | #libri-cuts  7 | #yesno-cuts  0 | 94.0s speech | 15

## Early stopping (ends iteration when any of the cutsets gets depleted)

This acts similarly to Method #2 with multiple DataLoaders and balances your datasets by under-sampling the larger dataset. At the time of writing, it is unclear to me which of these methods is better.

In [25]:
cuts_train = CutSet.mux(
    cuts_libri_train,
    cuts_yesno_train,
    stop_early=True,
)

sampler = DynamicCutSampler(cuts_train, max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [26]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  3 | #yesno-cuts  8 | 87.4s speech | 221.4s padding |
| batch  1 | #libri-cuts  3 | #yesno-cuts  7 | 85.8s speech | 169.0s padding |
| batch  2 | #libri-cuts  4 | #yesno-cuts  7 | 100.0s speech | 175.9s padding |
| batch  3 | #libri-cuts  6 | #yesno-cuts  2 | 93.8s speech | 62.9s padding |
| batch  4 | #libri-cuts  5 | #yesno-cuts  5 | 86.0s speech | 153.3s padding |
| batch  5 | #libri-cuts  3 | #yesno-cuts  1 | 49.6s speech | 27.8s padding |


## Proportionally weighted multiplexing

It works well for distributing the examples from all datasets roughly uniformly throughout the epoch, even if the datasets themselves are fairly large and CutSets cannot be fully read in memory (e.g., opened with `CutSet.from_jsonl_lazy`).
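
The effect of proportional weights can be checked with simple arithmetic. A minimal sketch, assuming the weights are normalized into per-draw probabilities while both sources still have data (the cut counts come from the `describe()` outputs above):

```python
n_libri, n_yesno = 1519, 30  # cut counts reported by describe()
weights = [n_libri, n_yesno]

# Normalize the weights into per-draw probabilities.
total = sum(weights)
probs = [w / total for w in weights]
print(f"P(libri) = {probs[0]:.3f}, P(yesno) = {probs[1]:.3f}")

# Average spacing between two YesNo cuts in the multiplexed stream.
cuts_per_yesno = total / n_yesno
print(f"about one YesNo cut every {cuts_per_yesno:.1f} cuts")
```

Because each source is drawn in proportion to its size, both are expected to run out at roughly the same point, which spreads the smaller dataset across the whole epoch.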

In [27]:
cuts_train = CutSet.mux(
    cuts_libri_train,
    cuts_yesno_train,
    weights=[len(cuts_libri_train), len(cuts_yesno_train)],
)

sampler = DynamicCutSampler(cuts_train, max_duration=100, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [28]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  7 | #yesno-cuts  0 | 95.1s speech | 18.6s padding |
| batch  1 | #libri-cuts  7 | #yesno-cuts  0 | 86.7s speech | 25.4s padding |
| batch  2 | #libri-cuts  8 | #yesno-cuts  0 | 98.1s speech | 26.1s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  0 | 88.1s speech | 19.9s padding |
| batch  4 | #libri-cuts  8 | #yesno-cuts  1 | 94.9s speech | 68.2s padding |
| batch  5 | #libri-cuts  7 | #yesno-cuts  0 | 86.8s speech | 28.3s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  0 | 97.6s speech | 14.2s padding |
| batch  7 | #libri-cuts  7 | #yesno-cuts  0 | 92.9s speech | 22.0s padding |
| batch  8 | #libri-cuts  6 | #yesno-cuts  3 | 93.4s speech | 83.8s padding |
| batch  9 | #libri-cuts  8 | #yesno-cuts  0 | 94.5s speech | 32.1s padding |
| batch 10 | #libri-cuts  7 | #yesno-cuts  0 | 95.9s speech | 11.6s padding |
| batch 11 | #libri-cuts  7 | #yesno-cuts  0 | 98.9s speech | 13.1s padding |
| batch 12 | #libri-cuts  7 | #yesno-cuts  0 | 91.3s speech | 12

# Method 4: ZipSampler

This method creates a sampler out of other samplers. The resulting sampler yields mini-batches with a constant ratio of data from each source. E.g., if sampler one has max_duration=80 and sampler two has max_duration=20, usually you will get around 100s of speech with an 80:20 proportion. The iteration stops as soon as the smaller sampler is depleted.
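
Conceptually, zipping samplers merges the i-th batch of every source into one mini-batch. The sketch below models this with plain lists; `zip_batches` and the batch contents are hypothetical and only illustrate the idea, not the `ZipSampler` implementation.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def zip_batches(*samplers: Iterable[List[T]]) -> Iterator[List[T]]:
    """Merge the i-th batch of every sampler into one batch; stop at the shortest sampler."""
    for parts in zip(*samplers):
        merged: List[T] = []
        for part in parts:
            merged.extend(part)
        yield merged


# Hypothetical batches: the first source yields three batches, the second only two.
libri_batches = [["L0", "L1"], ["L2", "L3"], ["L4", "L5"]]
yesno_batches = [["Y0"], ["Y1"]]

merged_batches = list(zip_batches(libri_batches, yesno_batches))
print(merged_batches)
```

The third LibriSpeech batch is never used, because iteration stops as soon as the smaller source runs out of batches.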

**When is it good to use it?**

✅ You want to have a constant ratio of duration between each dataset in every mini-batch.

✅ Your training examples have little or no variation in duration (e.g., fixed-size windows of utterances/recordings).

✅ You work with small to medium sized datasets and can shuffle them completely in memory. During each epoch, you'll observe mostly different examples from the larger dataset.

✅ You want to stop the epoch as soon as the smallest dataset has been fully iterated. This effectively under-samples the larger datasets and compensates for dataset imbalance. 

**When to expect poor performance?**

⚠️ You are using bucketing samplers (dynamic or regular). In these cases, you will usually sample buckets of different cut durations from both sources which will add excessive padding in your mini-batches.

⚠️ You work with large datasets and Dynamic* samplers -- since you will be shuffling data lazily with a buffer window, during each epoch you'll probably see mostly the same examples from the larger dataset, just in a different order. Expect about `len(larger_dataset) - len(smaller_dataset)` examples from the larger dataset to be unused during training, unless you specifically design your code to alleviate that (e.g., by sharding the larger dataset and reading different shards each epoch).


In [29]:
sampler_libri = SimpleCutSampler(cuts_libri_train, max_duration=80, shuffle=True)
sampler_yesno = SimpleCutSampler(cuts_yesno_train, max_duration=20, shuffle=True)

sampler = ZipSampler(sampler_libri, sampler_yesno)

dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

In [30]:
for idx, batch in enumerate(dloader):
    if idx == 20:
        break
    n_libri = len([cut for cut in batch if cut.origin == "libri"])
    n_yesno = len(batch) - n_libri
    tot_dur = sum(cut.duration for cut in batch)
    pad_dur = sum(cut.duration for cut in batch.pad()) - tot_dur
    print(
        f"| batch {idx:>2d} | #libri-cuts {n_libri:>2d} | #yesno-cuts {n_yesno:>2d} | {tot_dur:>4.1f}s speech | {pad_dur:>4.1f}s padding |"
    )

| batch  0 | #libri-cuts  6 | #yesno-cuts  3 | 90.0s speech | 94.5s padding |
| batch  1 | #libri-cuts  6 | #yesno-cuts  3 | 94.7s speech | 92.8s padding |
| batch  2 | #libri-cuts  5 | #yesno-cuts  3 | 91.1s speech | 82.7s padding |
| batch  3 | #libri-cuts  7 | #yesno-cuts  3 | 92.9s speech | 104.0s padding |
| batch  4 | #libri-cuts  5 | #yesno-cuts  3 | 86.2s speech | 84.3s padding |
| batch  5 | #libri-cuts  5 | #yesno-cuts  3 | 89.0s speech | 88.4s padding |
| batch  6 | #libri-cuts  7 | #yesno-cuts  3 | 92.9s speech | 107.2s padding |
| batch  7 | #libri-cuts  5 | #yesno-cuts  3 | 90.5s speech | 81.8s padding |
| batch  8 | #libri-cuts  5 | #yesno-cuts  3 | 88.7s speech | 89.9s padding |
| batch  9 | #libri-cuts  5 | #yesno-cuts  3 | 88.8s speech | 82.8s padding |


Note: in the example above, there is excessive padding because YesNo has much shorter cuts than mini LibriSpeech, and every batch contains YesNo data. For most small, medium, and large speech datasets, you shouldn't see such drastic differences in durations.