Guarantee determinism when sampling (either overall via sample_ratio, or while balancing data) #3191

Merged
merged 13 commits into from
Mar 3, 2023

Conversation

arnavgarg1
Contributor

@arnavgarg1 arnavgarg1 commented Mar 3, 2023

We don't propagate random_seed to any of our df.sample calls. As a result, whenever there is a cache miss, or whenever we skip saving preprocessed inputs, the dataset is re-sampled non-deterministically, so two successive calls can train on different rows and report very different performance.
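To make the problem concrete, here is a minimal pandas sketch of seeded versus unseeded sampling. The helper name sample_rows and its parameters are hypothetical and are not taken from this PR's diff; only DataFrame.sample and its frac and random_state arguments are real pandas API.

```python
from typing import Optional

import pandas as pd


def sample_rows(df: pd.DataFrame, sample_ratio: float,
                random_seed: Optional[int] = None) -> pd.DataFrame:
    # Hypothetical helper: forwards the seed to pandas via random_state.
    # With random_seed=None, pandas draws from NumPy's global random state,
    # so repeated calls are not guaranteed to return the same rows.
    return df.sample(frac=sample_ratio, random_state=random_seed)


df = pd.DataFrame({"x": range(100)})

# Same seed, same rows in the same order.
seeded_a = sample_rows(df, 0.5, random_seed=42)
seeded_b = sample_rows(df, 0.5, random_seed=42)
print(seeded_a.equals(seeded_b))  # True
```

Without the seed, whether two calls agree depends on global random state, which is the behaviour described above.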

This PR ensures that two successive calls using either sample_ratio < 1 or oversample_minority/undersample_majority always sample the same rows in the same order. Passing a different random_seed changes the sample; otherwise the result is deterministic.
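The same idea applies when balancing classes. The sketch below is an illustration only, not the code merged here: the function undersample_majority_sketch, its signature, and the interpretation of ratio as the desired minority-to-majority size ratio are all assumptions made for the example.

```python
import pandas as pd


def undersample_majority_sketch(df: pd.DataFrame, target: str,
                                ratio: float, random_seed: int) -> pd.DataFrame:
    # Assumption: binary target; `ratio` is the desired minority/majority ratio.
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]
    # Keep only as many majority rows as the ratio allows.
    n_keep = min(len(majority), int(len(minority) / ratio))
    # Seeding the draw makes the selected majority rows reproducible.
    majority_sampled = majority.sample(n=n_keep, random_state=random_seed)
    return pd.concat([minority, majority_sampled])


# 90 majority rows (label 0) and 10 minority rows (label 1).
data = pd.DataFrame({"label": [0] * 90 + [1] * 10, "x": range(100)})

first = undersample_majority_sketch(data, "label", ratio=0.5, random_seed=42)
second = undersample_majority_sketch(data, "label", ratio=0.5, random_seed=42)
print(first.equals(second))  # True
print(len(first))            # 30
```

Here the 10 minority rows are kept and 20 majority rows are drawn, and repeating the call with the same seed selects the same 20 rows.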

github-actions bot commented Mar 3, 2023

Unit Test Results

Files: 6, suites: 6, duration: 5h 56m 48s
Tests: 4 011 total, 3 967 passed, 43 skipped, 1 failed
Runs: 11 995 total, 11 862 passed, 132 skipped, 1 failed

For more details on these failures, see this check.

Results for commit a5f2da6.

♻️ This comment has been updated with latest results.

@arnavgarg1 arnavgarg1 changed the title from "Propagate random seed to dataset sampling" to "Guarantee determinism when sampling (either overall via sample_ratio, or while balancing data)" Mar 3, 2023
@arnavgarg1 arnavgarg1 marked this pull request as ready for review March 3, 2023 19:46
@arnavgarg1 arnavgarg1 added the labels bug (Something isn't working) and release-0.7 (Needs cherry-pick into 0.7 release branch) Mar 3, 2023
@arnavgarg1 arnavgarg1 merged commit 385518b into master Mar 3, 2023
@arnavgarg1 arnavgarg1 deleted the determinstic_split branch March 3, 2023 21:56
tgaddair pushed a commit that referenced this pull request Mar 3, 2023
tgaddair pushed a commit that referenced this pull request Mar 3, 2023
Labels
bug (Something isn't working), release-0.7 (Needs cherry-pick into 0.7 release branch)

3 participants