
Perfectly shuffle files for better randomization. #155

Merged
merged 1 commit into main from test_405926522 on Oct 27, 2021


copybara-service[bot]

Perfectly shuffle files for better randomization.

With a limited shuffle buffer size, the initial cycle_length draws are likely
to come from the earlier shards, irrespective of the seed.

import tensorflow as tf

def sample(seed, buffer_size, cycle_length=16, num_files=10155):
  dataset = tf.data.Dataset.range(num_files)

  dataset = dataset.shuffle(buffer_size=buffer_size, seed=seed)

  dataset = dataset.interleave(
      lambda x: tf.data.Dataset.from_tensors(x).repeat(1),
      cycle_...
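
Since the snippet above is truncated, here is a minimal standalone sketch of the effect it describes. It does not use tf.data. The helper names (`buffered_shuffle`, `first_draws`, `perfect_first_draws`) are hypothetical, and the buffer logic is a simplified model of a streaming shuffle buffer, assumed for illustration rather than taken from TensorFlow's implementation.

```python
import itertools
import random


def buffered_shuffle(items, buffer_size, seed):
    """Streaming shuffle with a fixed-size buffer (simplified model).

    Assumes buffer_size >= 1. This is an illustrative stand-in, not
    tf.data's actual shuffle implementation.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def first_draws(seed, buffer_size, cycle_length=16, num_files=10155):
    """First cycle_length file indices produced by the buffered shuffle."""
    shuffled = buffered_shuffle(range(num_files), buffer_size, seed)
    return list(itertools.islice(shuffled, cycle_length))


def perfect_first_draws(seed, cycle_length=16, num_files=10155):
    """First cycle_length file indices after shuffling the whole file list."""
    files = list(range(num_files))
    random.Random(seed).shuffle(files)
    return files[:cycle_length]


if __name__ == "__main__":
    for seed in range(3):
        print("buffered:", sorted(first_draws(seed, buffer_size=100)))
        print("perfect: ", sorted(perfect_first_draws(seed)))
```

In this model the k-th emitted index (counting from 0) is always below `buffer_size + k`, so with `buffer_size=100` the first 16 draws can only come from the first 116 files, whatever the seed. Shuffling the full file list removes that bound.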



google-cla bot commented Oct 27, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment "@googlebot I fixed it." If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.


PiperOrigin-RevId: 405934172
copybara-service bot merged commit 2d4107c into main on Oct 27, 2021
copybara-service bot deleted the test_405926522 branch on October 27, 2021 18:17