Enable seed randomization in dynamic samplers #1278

Merged
merged 2 commits into from
Jan 31, 2024

@pzelasko pzelasko commented Jan 31, 2024

This PR enables specifying seed="randomized" and seed="trng" for DynamicCutSampler and DynamicBucketingSampler.

Both options are intended for use with IterableDatasetWrapper: they make the samplers iterate with a different random seed in each node and in each dataloading worker. Note that with bucketing, this de-synchronizes batch sizes across GPUs from the very start of iteration. Before this change, the same de-synchronization occurred anyway after a number of training steps, as observed in #857.
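To illustrate what per-node and per-worker seeding means, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name resolve_seed_sketch, its base_seed, rank, and worker_id parameters, and the hashing scheme are all assumptions made for illustration. The only parts taken from the description are the two option names, "randomized" and "trng".

```python
import hashlib
import secrets
from typing import Union


def resolve_seed_sketch(
    seed: Union[int, str],
    base_seed: int = 0,
    rank: int = 0,
    worker_id: int = 0,
) -> int:
    """Hypothetical helper: turn a seed spec into a concrete 32-bit integer.

    This is an illustration only, not the sampler's real seed resolution.
    """
    if isinstance(seed, int):
        # A fixed integer seed is used as-is, so every worker shares it.
        return seed
    if seed == "randomized":
        # Assumption: derive a reproducible seed that differs per node/worker
        # by hashing the (base_seed, rank, worker_id) triple.
        key = f"{base_seed}-{rank}-{worker_id}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    if seed == "trng":
        # Assumption: draw a non-reproducible seed from the OS entropy source.
        return secrets.randbelow(2**32)
    raise ValueError(f"Unsupported seed specification: {seed!r}")


# Seeds for 2 nodes x 2 dataloading workers under the "randomized" option.
seeds = {
    (rank, worker_id): resolve_seed_sketch("randomized", rank=rank, worker_id=worker_id)
    for rank in range(2)
    for worker_id in range(2)
}
```

In this sketch "randomized" stays reproducible for a given (base_seed, rank, worker_id) triple, while "trng" yields a fresh value on every call. Whether the actual samplers behave the same way should be checked against the library source.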

From now on, the sampler also attaches a custom field called dataloading_info to each cut. It is a dict with the keys rank, world_size, and worker_id, which help diagnose dataloading issues.
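As a rough sketch of how this field could be used for diagnosis, the snippet below counts how many cuts each (rank, worker_id) pair produced. The cuts here are stand-in SimpleNamespace objects rather than real cut objects, and the helper count_cuts_per_worker is hypothetical. Only the field name dataloading_info and its three keys come from the description above.

```python
from collections import Counter
from types import SimpleNamespace


def count_cuts_per_worker(cuts):
    """Hypothetical helper: tally cuts by the (rank, worker_id) that produced them."""
    counts = Counter()
    for cut in cuts:
        info = cut.dataloading_info  # dict with rank, world_size, worker_id
        counts[(info["rank"], info["worker_id"])] += 1
    return counts


# Stand-in objects that mimic cuts carrying the dataloading_info field.
fake_cuts = [
    SimpleNamespace(id="cut-0", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 0}),
    SimpleNamespace(id="cut-1", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 1}),
    SimpleNamespace(id="cut-2", dataloading_info={"rank": 1, "world_size": 2, "worker_id": 0}),
    SimpleNamespace(id="cut-3", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 0}),
]

per_worker = count_cuts_per_worker(fake_cuts)
```

A heavily skewed tally, or a (rank, worker_id) pair that never appears, would hint that some node or worker is not contributing data as expected.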

@pzelasko pzelasko added this to the v1.20.0 milestone Jan 31, 2024
@pzelasko pzelasko marked this pull request as ready for review January 31, 2024 16:05
@pzelasko

The failing test is flaky - merging

@pzelasko pzelasko merged commit e043228 into master Jan 31, 2024
7 of 10 checks passed
@pzelasko pzelasko deleted the feature/enable-randomized-sampler-seed branch January 31, 2024 18:17