Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). #223

Merged
merged 1 commit into from
Apr 11, 2023

Conversation

knighton
Copy link
Contributor

@knighton knighton commented Apr 10, 2023

Description of changes:

py1b is a shuffling algorithm that performs the final shuffle over fixed-size blocks instead of intra-shard like py1s. These units are presumably larger or much larger than single shards, leading to better shuffledness at the cost of having to download more shards to make progress.

naive is a shuffling algorithm that naively shuffles all-to-all. This is useful for single-node training on small data, where you want the most random shuffle possible. Statistically, this algorithm will result in all nodes downloading all shards, with those downloads all happening at the start of the epoch, bringing training to a crawl.

Also update the default shuffle algo from py1s to py1b.

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure it pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Can you also add a unit test for each shuffling algorithm (naive, py1b, py1s, py2s) to validate the output sample order with an expected sample order?

shuffle_seed (int): Seed for Deterministic data shuffling. Defaults to ``9176``.
shuffle_block_size (int): Unit of shuffle. Defaults to ``1 << 18``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more information here on what shuffle_block_size do ?

@@ -157,8 +158,9 @@ class StreamingDataset(IterableDataset):
partitioned over the workers. Defaults to ``None``.
shuffle (bool): Whether to iterate over the samples in randomized order. Defaults to
``False``.
shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1s``.
shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1b``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a one to line line of brief statement explaining when to use which naive, py1b, py1s, py2s algorithm ?

@abhi-mosaic
Copy link
Member

If we change the default to py1b, will all the StreamingDataset unit tests (aka for correctness, presence of samples, etc) rerun with py1b ? That would be great if so.

@knighton
Copy link
Contributor Author

If we change the default to py1b, will all the StreamingDataset unit tests (aka for correctness, presence of samples, etc) rerun with py1b ? That would be great if so.

pytest is happy :)

@karan6181
Copy link
Collaborator

Based on the offline discussion, James will be creating a follow up PR to address the above two comments

  1. Add a one to two line of brief statement explaining when to use which naive, py1b, py1s, py2s algorithm in the StreamingDataset class documentation.
  2. Add a unit test for each shuffling algorithm (naive, py1b, py1s, py2s) to validate the output sample order with an expected sample order.

@knighton knighton merged commit 124bf46 into main Apr 11, 2023
@knighton knighton deleted the james/shuffle-redux branch April 11, 2023 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants