Add new sampler: weighted sampler #1344

Merged
17 commits merged into lhotse-speech:master on Jun 5, 2024

Conversation

marcoyang1998
Contributor

Add a weighted sampler, where each cut's sampling probability is proportional to its weight. This is useful for unbalanced datasets, where some classes have very little data. The weight for each cut should be computed by the user and passed to the sampler. It is similar to PyTorch's WeightedRandomSampler.

This sampler only works with eager manifests, since we need to perform the sampling globally.
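
A minimal usage sketch of what this could look like. The import path, the argument names (weights, num_samples), and the compute_weight helper are assumptions for illustration, not necessarily the final API of this PR:

# Hypothetical sketch -- argument names and the compute_weight helper are
# assumptions for illustration, not necessarily the final API of this PR.
from lhotse import CutSet
from lhotse.dataset import WeightedSimpleCutSampler  # import path assumed

cuts = CutSet.from_file("cuts.jsonl.gz").to_eager()  # eager manifest required

# Per-cut weights computed by the user, e.g. inverse frequency of each
# cut's class (compute_weight is a hypothetical user-defined helper).
weights = [compute_weight(cut) for cut in cuts]

sampler = WeightedSimpleCutSampler(
    cuts,
    weights,
    num_samples=len(cuts),
    max_duration=600.0,
)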

marcoyang1998 changed the title from "Add new sample: weighted sampler" to "Add new sampler: weighted sampler" on May 29, 2024
@pzelasko
Collaborator

Thanks @marcoyang1998, I appreciate your work. I think you could achieve a similar outcome by splitting the CutSet into per-class sub-CutSets and then using CutSet.mux to get a single CutSet that can be passed to any of the existing samplers. That approach would also work with lazy manifests and bucketing.

class_cutsets = [cuts_class0, cuts_class1, ...]
class_weights = [w_class0, w_class1, ...]
cuts = CutSet.mux(*class_cutsets, weights=class_weights)
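
For example, assuming each cut's class label is stored in its first supervision's text field (an assumption for illustration; adapt the predicate to your manifests), the per-class subsets could be built roughly like this:

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")

# Illustrative class labels and class-level weights.
class_labels = ["speech", "music", "noise"]
class_weights = [0.5, 0.3, 0.2]

# One (possibly lazy) sub-CutSet per class.
class_cutsets = [
    cuts.filter(lambda cut, lbl=lbl: cut.supervisions[0].text == lbl)
    for lbl in class_labels
]

# Draw cuts from each class proportionally to its weight; the result can be
# passed to any existing sampler, including the bucketing ones.
cuts_mux = CutSet.mux(*class_cutsets, weights=class_weights)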

@marcoyang1998
Contributor Author

marcoyang1998 commented May 30, 2024

Hi Piotr,

Thanks for the CutSet.mux example. I thought about this as well, and it would work very well if all samples within the same class shared an equal weight. However, in my setup, a multi-class classification task (AudioSet), every sample has a unique sampling weight, so using CutSet.mux in this scenario is impractical.

Collaborator

@pzelasko pzelasko left a comment

That's a valid point; I didn't think of this use case. In that case LGTM, I just left one comment that may help us reduce code duplication.

from lhotse.dataset.sampling.data_source import WeightedDataSource


class WeightedSimpleCutSampler(CutSampler):
Collaborator

I wonder whether this class can inherit SimpleCutSampler and only override the necessary parts (e.g. using WeightedDataSource in __init__); otherwise most of this code looks very similar.
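
Roughly, the suggestion amounts to something like the sketch below (hypothetical; the parent __init__ signature and the WeightedDataSource constructor arguments are assumptions, not the code that was eventually merged):

from lhotse import CutSet
from lhotse.dataset.sampling.data_source import WeightedDataSource
from lhotse.dataset.sampling.simple import SimpleCutSampler


class WeightedSimpleCutSampler(SimpleCutSampler):
    """Like SimpleCutSampler, but draws cuts with probability proportional
    to user-provided weights. Requires an eager CutSet."""

    def __init__(self, cuts: CutSet, cut_weights, num_samples, **kwargs):
        # Reuse all of SimpleCutSampler's logic ...
        super().__init__(cuts, **kwargs)
        # ... and only swap the data source for a weighted one.
        self.data_source = WeightedDataSource(
            cuts, weights=cut_weights, num_samples=num_samples
        )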

Contributor Author

Good idea! I just changed the inheritance; please have a look.

@pzelasko pzelasko added this to the v1.24.0 milestone Jun 4, 2024
            drop_last=drop_last,
            shuffle=shuffle,
            world_size=world_size,
            rank=rank,
            max_duration=max_duration,
            max_cuts=max_cuts,
            seed=seed,
        )
        assert cuts.is_lazy == False, "This sampler does not support lazy mode!"
        assert (
            shuffle == False
Collaborator

Can we remove this assertion? Let's just ignore the value of shuffle in this case.

@@ -181,63 +152,3 @@ def __iter__(self) -> "WeightedSimpleCutSampler":
        self.data_source.shuffle(self.seed + self.epoch)
Collaborator

Is an if self.shuffle branch needed here?
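
For reference, the guarded version being asked about would look roughly like this (a sketch, not the merged code):

def __iter__(self) -> "WeightedSimpleCutSampler":
    # Reshuffle the weighted data source only when shuffling was requested.
    if self.shuffle:
        self.data_source.shuffle(self.seed + self.epoch)
    ...  # rest of the iteration setup unchanged
    return self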

Collaborator

@pzelasko pzelasko left a comment

I made the changes I requested myself, as I'd like to release a new version and include this. Thanks @marcoyang1998!

@pzelasko pzelasko enabled auto-merge (squash) June 5, 2024 19:08
@pzelasko pzelasko merged commit 4d57d53 into lhotse-speech:master Jun 5, 2024
10 of 11 checks passed