Documentation for random seeds in lhotse + extended support of lazy r… #1291

Merged
merged 4 commits into from Feb 29, 2024
6 changes: 4 additions & 2 deletions docs/api.rst
@@ -3,10 +3,10 @@ API Reference

This page contains a comprehensive list of all classes and functions within `lhotse`.

Recording manifests
Audio loading, saving, and manifests
-------------------

Data structures used for describing audio recordings in a dataset.
Data structures and utilities used for describing and manipulating audio recordings.

.. automodule:: lhotse.audio
:members:
@@ -24,6 +24,8 @@ Data structures used for describing supervisions in a dataset.
Lhotse Shar -- sequential storage
---------------------------------

Documentation for Lhotse Shar multi-tarfile sequential I/O format.

Lhotse Shar readers
*******************

2 changes: 1 addition & 1 deletion docs/cli.rst
@@ -3,4 +3,4 @@ Command-line interface

.. click:: lhotse.bin:cli
:prog: lhotse
:show-nested:
:nested: full
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -78,4 +78,4 @@
"exclude-members": "__weakref__",
}

autodoc_mock_imports = ["torchaudio", "SoundFile"]
autodoc_mock_imports = ["torchaudio", "SoundFile", "soundfile"]
18 changes: 18 additions & 0 deletions docs/datasets.rst
@@ -128,6 +128,19 @@ In general, pre-computed features can be greatly compressed (we achieve 70% size

When I/O is not the issue, it might be preferable to use on-the-fly computation as it shouldn't require any prior steps to perform the network training. It is also simpler to apply a vast range of data augmentation methods in a fully randomized way (e.g. reverberation), although Lhotse provides support for approximate feature-domain signal mixing (e.g. for additive noise augmentation) to alleviate that to some extent.

Handling random seeds
---------------------

Lhotse provides several mechanisms for controlling randomness. At a basic level, there is a function :func:`lhotse.utils.fix_random_seed` which seeds Python's, numpy's and torch's RNGs with the provided number.

However, many functions and classes in Lhotse accept either a random seed or an RNG instance to provide finer control over randomness. Whenever a random seed is accepted, it can be either an integer or one of two strings: ``"randomized"`` or ``"trng"``.

* The ``"randomized"`` seed is resolved lazily at the moment it's needed and is intended as a mechanism to provide a different seed to each dataloading worker. For ``"randomized"`` to work, you have to first invoke :func:`lhotse.dataset.dataloading.worker_init_fn` in a given subprocess, which sets the right environment variables. With a PyTorch ``DataLoader`` you can pass the keyword argument ``worker_init_fn=make_worker_init_fn(seed=int_seed, rank=..., world_size=...)`` using :func:`lhotse.dataset.dataloading.make_worker_init_fn`, which will set the right seeds for you in multiprocessing and multi-node training. Note that if you resume training, you should change the ``seed`` passed to ``make_worker_init_fn`` on each resumed run to make the model train on different data.
* The ``"trng"`` seed is also resolved lazily at runtime, but it uses a true RNG (if available on your OS; consult Python's ``secrets`` module documentation). It's an easy way to ensure that every iteration over the data happens in a different order, but it may make debugging data issues more difficult.

.. note:: The lazy seed resolution is done by calling :func:`lhotse.dataset.dataloading.resolve_seed`.

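A minimal, self-contained sketch of how this lazy seed resolution can work (the real logic lives in :func:`lhotse.dataset.dataloading.resolve_seed`; the ``LHOTSE_WORKER_SEED`` environment variable name below is an illustrative assumption, not Lhotse's actual mechanism):

```python
import os
import random
import secrets
from typing import Union


def resolve_seed_sketch(seed: Union[int, str]) -> int:
    """Toy re-implementation sketch of lazy seed resolution."""
    if isinstance(seed, int):
        return seed  # fixed integer seeds pass through unchanged
    if seed == "trng":
        # Draw from a true RNG if the OS provides one (see the `secrets` docs).
        return secrets.SystemRandom().randint(0, 2**32 - 1)
    if seed == "randomized":
        # Illustrative: read a per-worker seed previously stored by a
        # worker_init_fn; the env var name is an assumption for this sketch.
        env_seed = os.environ.get("LHOTSE_WORKER_SEED")
        if env_seed is not None:
            return int(env_seed)
        # Outside a dataloading worker: fall back to a global RNG draw.
        return random.randint(0, 2**32 - 1)
    raise ValueError(f"Unsupported seed: {seed!r}")
```

With this resolution in place, every consumer of a seed can simply construct `random.Random(resolve_seed_sketch(seed))` and handle integers, `"trng"`, and `"randomized"` uniformly.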

Dataset's list
--------------

@@ -185,3 +198,8 @@ Collation utilities for building custom Datasets
------------------------------------------------

.. automodule:: lhotse.dataset.collation

Dataloading seeding utilities
-----------------------------

.. automodule:: lhotse.dataset.dataloading
8 changes: 4 additions & 4 deletions docs/requirements.txt
@@ -1,5 +1,5 @@
numpy>=1.18.1
sphinx_rtd_theme
sphinx==4.2.0
sphinx-click==3.0.1
sphinx-autodoc-typehints==1.12.0
sphinx_rtd_theme==2.0.0
sphinx==7.2.6
sphinx-click==5.1.0
sphinx-autodoc-typehints==2.0.0
24 changes: 24 additions & 0 deletions lhotse/audio/__init__.py
@@ -6,6 +6,7 @@
get_ffmpeg_torchaudio_info_enabled,
info,
read_audio,
save_audio,
set_current_audio_backend,
set_ffmpeg_torchaudio_info_enabled,
)
@@ -21,3 +22,26 @@
set_audio_duration_mismatch_tolerance,
suppress_audio_loading_errors,
)

__all__ = [
"AudioSource",
"Recording",
"RecordingSet",
"AudioLoadingError",
"DurationMismatchError",
"VideoInfo",
"audio_backend",
"available_audio_backends",
"get_current_audio_backend",
"get_default_audio_backend",
"get_audio_duration_mismatch_tolerance",
"get_ffmpeg_torchaudio_info_enabled",
"info",
"read_audio",
"save_audio",
"set_current_audio_backend",
"set_audio_duration_mismatch_tolerance",
"set_ffmpeg_torchaudio_info_enabled",
"null_result_on_audio_loading_error",
"suppress_audio_loading_errors",
]
20 changes: 14 additions & 6 deletions lhotse/audio/recording.py
@@ -55,13 +55,19 @@ class Recording:
and a 1-hour session with multiple channels and speakers (e.g., in AMI).
In the latter case, it is partitioned into data suitable for model training using :class:`~lhotse.cut.Cut`.

.. hint::
Lhotse reads audio recordings using `pysoundfile`_ and `audioread`_, similarly to librosa,
to support multiple audio formats. For OPUS files we require ffmpeg to be installed.
Internally, Lhotse supports multiple audio backends to read audio files.
By default, we try to use libsoundfile, then torchaudio (with FFMPEG integration starting with torchaudio 2.1),
and then audioread (which is an ffmpeg CLI wrapper).
For SPHERE files we prefer to use the sph2pipe binary, as it can handle certain unique encodings such as "shorten".

Audio backends in Lhotse are configurable. See:

* :func:`~lhotse.audio.backend.available_audio_backends`
* :func:`~lhotse.audio.backend.audio_backend`
* :func:`~lhotse.audio.backend.get_current_audio_backend`
* :func:`~lhotse.audio.backend.set_current_audio_backend`
* :func:`~lhotse.audio.backend.get_default_audio_backend`

.. hint::
Since we support importing Kaldi data dirs, if ``wav.scp`` contains unix pipes,
:class:`~lhotse.audio.Recording` will also handle them correctly.

Examples

@@ -110,6 +116,8 @@ class Recording:
>>> assert samples.shape == (1, 16000)
>>> samples2 = recording.load_audio(offset=0.5)
>>> assert samples2.shape == (1, 8000)

See also: :class:`~lhotse.audio.recording.Recording`, :class:`~lhotse.cut.Cut`, :class:`~lhotse.cut.CutSet`.
"""

id: str
23 changes: 0 additions & 23 deletions lhotse/bin/modes/cut.py
@@ -134,29 +134,6 @@ def trim_to_supervisions(
Splits each input cut into as many cuts as there are supervisions.
These cuts have identical start times and durations as the supervisions.
When there are overlapping supervisions, they can be kept or discarded with options.

\b
For example, the following cut:
Cut
|-----------------|
Sup1
|----| Sup2
|-----------|

\b
is transformed into two cuts:
Cut1
|----|
Sup1
|----|
Sup2
|-|
Cut2
|-----------|
Sup1
|-|
Sup2
|-----------|
"""
cuts = CutSet.from_file(cuts)

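The trimming rule described in the docstring above can be sketched as a pure-Python toy (the real command operates on `Cut` objects; the `(start, duration)` pair representation here is an illustrative simplification):

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, duration) -- illustrative representation


def trim_to_supervisions_sketch(supervisions: List[Span]) -> List[Span]:
    # One output cut per supervision, copying its start time and duration.
    return [(start, duration) for (start, duration) in supervisions]


# Two (possibly overlapping) supervisions yield two cuts with matching spans.
cuts = trim_to_supervisions_sketch([(0.0, 5.0), (3.0, 11.0)])
```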
11 changes: 5 additions & 6 deletions lhotse/cut/set.py
@@ -1667,7 +1667,7 @@ def mix(
snr: Optional[Union[Decibels, Sequence[Decibels]]] = 20,
preserve_id: Optional[str] = None,
mix_prob: float = 1.0,
seed: Union[int, Literal["trng"]] = 42,
seed: Union[int, Literal["trng", "randomized"]] = 42,
random_mix_offset: bool = False,
) -> "CutSet":
"""
@@ -3440,7 +3440,7 @@ def __init__(
snr: Optional[Union[Decibels, Sequence[Decibels]]] = 20,
preserve_id: Optional[str] = None,
mix_prob: float = 1.0,
seed: Union[int, Literal["trng"]] = 42,
seed: Union[int, Literal["trng", "randomized"]] = 42,
random_mix_offset: bool = False,
) -> None:
self.source = cuts
@@ -3463,10 +3463,9 @@ def __init__(
assert isinstance(self.snr, (type(None), int, float))

def __iter__(self):
if self.seed == "trng":
rng = secrets.SystemRandom()
else:
rng = random.Random(self.seed)
from lhotse.dataset.dataloading import resolve_seed

rng = random.Random(resolve_seed(self.seed))
mix_in_cuts = iter(self.mix_in_cuts.repeat().shuffle(rng=rng, buffer_size=100))

for cut in self.source:
2 changes: 1 addition & 1 deletion lhotse/dataset/cut_transforms/mix.py
@@ -36,7 +36,7 @@ def __init__(
Otherwise, new random IDs are generated for the augmented cuts (default).
:param random_mix_offset: an optional bool.
When ``True`` and the duration of the cut to be mixed in is longer than the original cut,
select a random sub-region from the to be mixed in cut.
select a random sub-region from the to be mixed in cut.
"""
self.cuts = cuts
if len(self.cuts) == 0:
3 changes: 1 addition & 2 deletions lhotse/dataset/cut_transforms/reverberate.py
@@ -10,8 +10,7 @@ class ReverbWithImpulseResponse:
response with some probability :attr:`p`.
The impulse response is chosen randomly from a specified CutSet of RIRs :attr:`rir_cuts`.
If no RIRs are specified, we will generate them using a fast random generator (https://arxiv.org/abs/2208.04101).
If `early_only` is set to True, convolution is performed only with the first 50ms of
the impulse response.
If `early_only` is set to True, convolution is performed only with the first 50ms of the impulse response.
"""

def __init__(
10 changes: 7 additions & 3 deletions lhotse/dataset/dataloading.py
@@ -22,9 +22,10 @@ def make_worker_init_fn(
Calling this function creates a worker_init_fn suitable to pass to PyTorch's DataLoader.

It helps with two issues:
- sets the random seeds differently for each worker and node, which helps with

* sets the random seeds differently for each worker and node, which helps with
avoiding duplication in randomized data augmentation techniques.
- sets environment variables that help WebDataset detect it's inside multi-GPU (DDP)
* sets environment variables that help WebDataset detect it's inside multi-GPU (DDP)
training, so that it correctly de-duplicates the data across nodes.
"""
return partial(
@@ -43,6 +44,9 @@ def worker_init_fn(
set_different_node_and_worker_seeds: bool = True,
seed: Optional[int] = 42,
) -> None:
"""
Function created by :func:`~lhotse.dataset.dataloading.make_worker_init_fn`, refer to its documentation for details.
"""
if set_different_node_and_worker_seeds:
process_seed = seed + 100 * worker_id
if rank is not None:
@@ -74,7 +78,7 @@ def resolve_seed(seed: Union[int, Literal["trng", "randomized"]]) -> int:
using a true RNG (to the extent supported by the OS).

If it's "randomized", we'll check whether we're in a dataloading worker of ``torch.utils.data.DataLoader``.
If we are, we expect that it was passed the result of :func:``lhotse.dataset.dataloading.make_worker_init_fn``
If we are, we expect that it was passed the result of :func:`~lhotse.dataset.dataloading.make_worker_init_fn`
into its ``worker_init_fn`` argument, in which case we'll return a special seed exclusive to that worker.
If we are not in a dataloading worker (or ``num_workers`` was set to ``0``), we'll return Python's ``random``
module global seed.
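The per-process seed derivation visible in ``worker_init_fn`` above can be sketched as follows. The ``100 * worker_id`` term mirrors the code shown in this diff; the rank-dependent offset is an assumption for illustration, since the exact multiplier is truncated out of the hunk:

```python
from typing import Optional


def derive_process_seed(seed: int, worker_id: int, rank: Optional[int] = None) -> int:
    # Each dataloading worker gets a distinct offset from the base seed,
    # so randomized augmentations are not duplicated across workers.
    process_seed = seed + 100 * worker_id
    if rank is not None:
        # Assumption: some rank-dependent offset keeps DDP nodes distinct too;
        # the exact value Lhotse uses is not visible in this diff.
        process_seed += 1000 * rank
    return process_seed
```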
10 changes: 5 additions & 5 deletions lhotse/dataset/input_strategies.py
@@ -61,7 +61,7 @@ def supervision_intervals(self, cuts: CutSet) -> Dict[str, torch.Tensor]:

Depending on the strategy, the dict should look like:

.. code-block:
.. code-block::

{
"sequence_idx": tensor(shape=(S,)),
@@ -71,7 +71,7 @@ def supervision_intervals(self, cuts: CutSet) -> Dict[str, torch.Tensor]:

or

.. code-block:
.. code-block::

{
"sequence_idx": tensor(shape=(S,)),
@@ -127,7 +127,7 @@ def supervision_intervals(self, cuts: CutSet) -> Dict[str, torch.Tensor]:
Returns a dict that specifies the start and end bounds for each supervision,
as a 1-D int tensor, in terms of frames:

.. code-block:
.. code-block::

{
"sequence_idx": tensor(shape=(S,)),
@@ -233,7 +233,7 @@ def supervision_intervals(self, cuts: CutSet) -> Dict[str, torch.Tensor]:
Returns a dict that specifies the start and end bounds for each supervision,
as a 1-D int tensor, in terms of samples:

.. code-block:
.. code-block::

{
"sequence_idx": tensor(shape=(S,)),
@@ -410,7 +410,7 @@ def supervision_intervals(self, cuts: CutSet) -> Dict[str, torch.Tensor]:
Returns a dict that specifies the start and end bounds for each supervision,
as a 1-D int tensor, in terms of frames:

.. code-block:
.. code-block::

{
"sequence_idx": tensor(shape=(S,)),
1 change: 1 addition & 0 deletions lhotse/dataset/unsupervised.py
@@ -131,6 +131,7 @@ class RecordingChunkIterableDataset(IterableDataset):
overlapping audio chunks.

The format of yielded items is the following::

{
"recording_id": str
"begin_time": tensor with dtype=float32 shape=(1,)
23 changes: 14 additions & 9 deletions lhotse/lazy.py
@@ -68,8 +68,7 @@ def mux(
*manifests,
stop_early: bool = False,
weights: Optional[List[Union[int, float]]] = None,
seed: Union[int, Literal["trng"]] = 0,
max_open_streams: Optional[int] = None,
seed: Union[int, Literal["trng", "randomized"]] = 0,
):
"""
Merges multiple manifest iterables into a new iterable by lazily multiplexing them during iteration time.
@@ -96,7 +95,7 @@ def infinite_mux(
cls,
*manifests,
weights: Optional[List[Union[int, float]]] = None,
seed: Union[int, Literal["trng"]] = 0,
seed: Union[int, Literal["trng", "randomized"]] = 0,
max_open_streams: Optional[int] = None,
):
"""
@@ -315,7 +314,7 @@ def __init__(
self,
*iterators: Iterable,
shuffle_iters: bool = False,
seed: Optional[int] = None,
seed: Optional[Union[int, Literal["trng", "randomized"]]] = None,
) -> None:
self.iterators = []
self.shuffle_iters = shuffle_iters
@@ -330,12 +329,14 @@
self.iterators.append(it)

def __iter__(self):
from lhotse.dataset.dataloading import resolve_seed

iterators = self.iterators
if self.shuffle_iters:
if self.seed is None:
rng = random # global Python RNG
else:
rng = random.Random(self.seed + self.num_iters)
rng = random.Random(resolve_seed(self.seed) + self.num_iters)
rng.shuffle(iterators)
self.num_iters += 1
for it in iterators:
@@ -367,7 +368,7 @@ def __init__(
*iterators: Iterable,
stop_early: bool = False,
weights: Optional[List[Union[int, float]]] = None,
seed: Union[int, Literal["trng"]] = 0,
seed: Union[int, Literal["trng", "randomized"]] = 0,
) -> None:
self.iterators = list(iterators)
self.stop_early = stop_early
@@ -385,7 +386,9 @@
assert len(self.iterators) == len(self.weights)

def __iter__(self):
rng = build_rng(self.seed)
from lhotse.dataset.dataloading import resolve_seed

rng = random.Random(resolve_seed(self.seed))
iters = [iter(it) for it in self.iterators]
exhausted = [False for _ in range(len(iters))]

@@ -447,7 +450,7 @@ def __init__(
*iterators: Iterable,
stop_early: bool = False,
weights: Optional[List[Union[int, float]]] = None,
seed: Union[int, Literal["trng"]] = 0,
seed: Union[int, Literal["trng", "randomized"]] = 0,
max_open_streams: Optional[int] = None,
) -> None:
self.iterators = list(iterators)
@@ -475,7 +478,9 @@
- each stream may be interpreted as a shard belonging to some larger group of streams
(e.g. multiple shards of a given dataset).
"""
rng = build_rng(self.seed)
from lhotse.dataset.dataloading import resolve_seed

rng = random.Random(resolve_seed(self.seed))

def shuffled_streams():
# Create an infinite iterable of our streams.
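The weighted multiplexing loop these `__iter__` methods implement can be sketched in plain Python. This is a toy version under the assumption that streams are sampled with probability proportional to their weights until exhausted; the real implementation in `lhotse.lazy` adds seed resolution and (for the infinite variant) capped open-stream counts:

```python
import random
from typing import Iterable, Sequence


def mux_sketch(
    rng: random.Random,
    streams: Sequence[Iterable],
    weights: Sequence[float],
    stop_early: bool = False,
):
    """Toy weighted multiplexer: repeatedly pick a live stream with
    probability proportional to its weight and yield its next item."""
    iters = [iter(s) for s in streams]
    exhausted = [False] * len(iters)
    while not all(exhausted):
        # Zero out the weights of exhausted streams before sampling.
        live = [w if not e else 0.0 for w, e in zip(weights, exhausted)]
        idx = rng.choices(range(len(iters)), weights=live, k=1)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            exhausted[idx] = True
            if stop_early:
                return  # stop as soon as any single stream runs out
```

For example, `mux_sketch(random.Random(0), [cuts_a, cuts_b], weights=[3, 1])` would interleave the two sources, drawing from `cuts_a` roughly three times as often.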