Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better defaults for StreamingDataset subclasses #723

Merged
merged 7 commits into from
Nov 13, 2023

Conversation

snarayan21
Copy link
Contributor

With better defaults and new features being added to Streaming, this PR makes sure those improvements are reflected in llm-foundry. Specifically:

  • Default shuffle_algo changed to py1e: More balanced downloads and cache limit performance
  • Default partition_algo changed to relaxed: Elastic determinism and resumptions on many more numbers of nodes
  • Default predownload changed to None: Lower predownload for more balanced download demand. Will be set by StreamingDataset.
  • Default num_canonical_nodes changed to None: Now set to equal the number of physical nodes, unless using py1s and py2s shuffle algorithms, in which case will equal 64 * physical nodes.
  • Default shuffle_block_size changed to None: Now set based on num_canonical_nodes to give a consistently good shuffle quality without needing a crazy amount of downloads.

Defaults were changed in Streaming v0.7.0 with this commit: mosaicml/streaming@93bf054

Copy link
Contributor

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Do we need to change the defaults in

dataset = StreamingTextDataset(
as well as in
dataset = dataset_constructor.build_from_streaming(
?

@snarayan21
Copy link
Contributor Author

Overall looks good to me. Do we need to change the defaults in

dataset = StreamingTextDataset(

as well as in

dataset = dataset_constructor.build_from_streaming(

?

oop yes we do. Changing that now.

Copy link
Contributor

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also update the epoch_size data type to Optional[Union[int, str]] which is inline with the streaming args? Thanks!

Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some more to change in denoising.py. Will approve after that, lgtm otherwise.

Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snarayan21 please post a PSA in the research slack once you merge this. I'll also include a PSA in the next release notes, but want to make sure research knows that the defaults here are changing.

Copy link
Contributor

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank You!

@dakinggg dakinggg enabled auto-merge (squash) November 13, 2023 22:17
@dakinggg dakinggg merged commit d1960f2 into mosaicml:main Nov 13, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants