Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge shuffling and bucketing buffers in DynamicBucketingSampler #1276

Merged
merged 2 commits into from
Jan 30, 2024

Conversation

pzelasko
Copy link
Collaborator

@pzelasko pzelasko commented Jan 29, 2024

I realized it's redundant to have two buffers, and we can instead use a single buffer to achieve both goals of shuffling and bucketing (and achieve them better). The basic idea is that we remove the shuffle buffer and let DynamicBucketer choose examples from the bucketing buffer at random without replacements (whereas previously, it just sampled the first N until hitting the constraint). Effectively, it's using the larger size of both buffers in the old implementation for both tasks while saving the space of the smaller buffer.

@pzelasko pzelasko added this to the v1.20.0 milestone Jan 29, 2024
@pzelasko
Copy link
Collaborator Author

To illustrate this works on a toy example: let's use mini LibriSpeech (5h, ~1500 cuts) with DynamicBucketingSampler(cuts, ..., shuffle=True, shuffle_buffer_size=100, buffer_size=1000) (deliberately small values relative to the dataset size). Testing across multiple settings of num_buckets = (2, 5, 10, 20, 30, 50) and max_duration = (50, 100, 200, 400), we measure the Spearman's correlation between input and output list of cut IDs (input: as in manifest; output: flattened mini-batch CutSets yielded by the sampler).

We find the old implementation has a mean R of 0.78, the new implementation has a mean R of 0.17 -- means the sequence of cuts in the new implementation is much closer to truly shuffled, as it can use a 10x shuffling buffer capacity despite not having a separate buffer for that.

@pzelasko
Copy link
Collaborator Author

The failed tests are flaky -- merging.

@pzelasko pzelasko merged commit f8934bb into master Jan 30, 2024
8 of 10 checks passed
@pzelasko pzelasko deleted the feature/merge-dynamic-bucketing-sampler-buffers branch January 30, 2024 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant