Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). #223

knighton · 2023-04-10T20:17:49Z

Description of changes:

py1b is a shuffling algorithm that performs the final shuffle over fixed-size blocks of samples rather than within each shard, as py1s does. These blocks are presumably larger, or much larger, than single shards, which gives a better-mixed shuffle at the cost of having to download more shards to make progress.
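To make the block idea concrete, here is a minimal sketch of a py1b-style shuffle. It is not the code from this PR: the function name `block_shuffle` and its parameters are hypothetical, it uses the standard library's `random` module, and it ignores how samples are partitioned across nodes and workers.

```python
import random


def block_shuffle(shard_sizes, block_size, seed):
    """Simplified, hypothetical sketch of a py1b-style shuffle.

    Not the library's implementation: node/worker partitioning is ignored.
    """
    rng = random.Random(seed)

    # Give every shard a contiguous range of sample IDs.
    shards = []
    start = 0
    for size in shard_sizes:
        shards.append(list(range(start, start + size)))
        start += size

    # Step 1: shuffle the order in which shards are visited.
    rng.shuffle(shards)
    ordered = [sample for shard in shards for sample in shard]

    # Step 2: shuffle samples within each fixed-size block. A block may
    # span several shards, which is where the extra mixing comes from.
    shuffled = []
    for begin in range(0, len(ordered), block_size):
        block = ordered[begin:begin + block_size]
        rng.shuffle(block)
        shuffled.extend(block)
    return shuffled


# Four shards of four samples each, shuffled in blocks of eight samples.
ids = block_shuffle([4, 4, 4, 4], block_size=8, seed=9176)
```

In this toy setup each block of eight samples draws from exactly two shards, so only two shards need to be present before the first block can be served.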

naive is a shuffling algorithm that shuffles all samples globally (all-to-all). This is useful for single-node training on small data, where you want the most random shuffle possible. Statistically, this algorithm results in every node downloading every shard, with all of those downloads happening at the start of the epoch, bringing training to a crawl.
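The download behavior described above can be illustrated with a small sketch. This is an assumption-laden toy, not the PR's code: `naive_shuffle` and `shards_touched` are hypothetical helpers, and shards are assumed to hold equal-sized contiguous ranges of sample IDs.

```python
import random


def naive_shuffle(num_samples, seed):
    """Hypothetical sketch: one global shuffle over all sample IDs."""
    ids = list(range(num_samples))
    random.Random(seed).shuffle(ids)
    return ids


def shards_touched(ids, samples_per_shard, prefix):
    """Return the set of shards needed to serve the first `prefix` samples."""
    return {sample // samples_per_shard for sample in ids[:prefix]}


# 10,000 samples stored as 100 shards of 100 samples each (made-up sizes).
ids = naive_shuffle(10_000, seed=9176)
early = shards_touched(ids, samples_per_shard=100, prefix=1_000)
print(f'{len(early)} of 100 shards needed for the first 10% of the epoch')
```

Because the first samples of the epoch are drawn from nearly every shard, almost the whole dataset has to be downloaded before training gets far, which matches the slowdown described above.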

Also update the default shuffle algo from py1s to py1b.

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the contributor guidelines
This is a documentation change or typo fix. If so, skip the rest of this checklist.
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
I have added tests that prove my fix is effective or that my feature works (if appropriate).
I ran the tests locally to make sure they pass. (check out testing)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

karan6181

Overall looks good. Can you also add a unit test for each shuffling algorithm (naive, py1b, py1s, py2s) that validates the output sample order against an expected sample order?

karan6181 · 2023-04-11T15:19:08Z

streaming/base/dataset.py

        shuffle_seed (int): Seed for Deterministic data shuffling. Defaults to ``9176``.
+        shuffle_block_size (int): Unit of shuffle. Defaults to ``1 << 18``.


Can you add more information here on what shuffle_block_size does?
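As a rough illustration of what the block size implies, the sketch below estimates how many shards a single shuffle block spans. The default of `1 << 18` comes from the diff above; the shard sizes and the helper `shards_per_block` are hypothetical.

```python
def shards_per_block(block_size, samples_per_shard):
    """Shards spanned by one block that starts on a shard boundary.

    Ceiling division; a block that is not aligned to a shard boundary
    can touch one more shard than this.
    """
    return -(-block_size // samples_per_shard)


block_size = 1 << 18  # default from the diff: 262144 samples

# Hypothetical shard sizes, for illustration only.
for samples_per_shard in (4_096, 100_000):
    count = shards_per_block(block_size, samples_per_shard)
    print(f'{samples_per_shard} samples/shard -> {count} shards per block')
```

Under these assumptions, smaller shards mean each block mixes samples from more shards, and more shards must be downloaded before a block can be served.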

karan6181 · 2023-04-11T15:20:20Z

streaming/base/dataset.py

@@ -157,8 +158,9 @@ class StreamingDataset(IterableDataset):
            partitioned over the workers. Defaults to ``None``.
        shuffle (bool): Whether to iterate over the samples in randomized order. Defaults to
            ``False``.
-        shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1s``.
+        shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1b``.


Can you add a brief one- to two-line statement explaining when to use each algorithm (naive, py1b, py1s, py2s)?

abhi-mosaic · 2023-04-11T19:04:06Z

If we change the default to py1b, will all the StreamingDataset unit tests (i.e., for correctness, presence of samples, etc.) rerun with py1b? That would be great if so.

knighton · 2023-04-11T19:06:20Z

> If we change the default to py1b, will all the StreamingDataset unit tests (i.e., for correctness, presence of samples, etc.) rerun with py1b? That would be great if so.

pytest is happy :)

karan6181 · 2023-04-11T19:44:56Z

Based on the offline discussion, James will create a follow-up PR to address the above two comments:

Add a brief one- to two-line statement to the StreamingDataset class documentation explaining when to use each shuffling algorithm (naive, py1b, py1s, py2s).
Add a unit test for each shuffling algorithm (naive, py1b, py1s, py2s) that validates the output sample order against an expected sample order.
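One possible shape for the requested tests is sketched below. It does not call the library: `toy_shuffle` is a hypothetical stand-in for a real algorithm, and `check_shuffle` is an assumed helper name. A real test would plug in each of naive, py1b, py1s, and py2s, and could additionally compare the result against a recorded expected order.

```python
import random


def toy_shuffle(num_samples, seed):
    """Hypothetical stand-in for a real shuffle algorithm."""
    ids = list(range(num_samples))
    random.Random(seed).shuffle(ids)
    return ids


def check_shuffle(shuffle_fn, num_samples=1000, seed=9176):
    """Check the properties a pinned expected order relies on."""
    first = shuffle_fn(num_samples, seed)
    second = shuffle_fn(num_samples, seed)

    # The same seed must reproduce the same order; otherwise an expected
    # sample order could not be pinned in a test at all.
    assert first == second

    # Every sample appears exactly once.
    assert sorted(first) == list(range(num_samples))

    # The order was actually changed from the unshuffled sequence.
    assert first != list(range(num_samples))
    return first


order = check_shuffle(toy_shuffle)
```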

Add two shuffling algos: naive (globally) and py1b (fixed-size blocks).

295b33d

karan6181 reviewed Apr 11, 2023


karan6181 approved these changes Apr 11, 2023


knighton merged commit 124bf46 into main Apr 11, 2023

knighton deleted the james/shuffle-redux branch April 11, 2023 19:49
