Support large size index.json (20GB +) #662

Open
andreamad8 opened this issue Apr 25, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@andreamad8

🚀 Feature Request

Large index.json files are slow to load. As a workaround, I am currently trying to increase the shard size so that the index contains fewer shard entries and stream.py#L473 runs faster (hopefully).

Motivation

These two steps are very slow for large index.json files.

https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L461

and

https://github.com/mosaicml/streaming/blob/main/streaming/base/stream.py#L473

especially with large-scale datasets (e.g., a billion samples).

@andreamad8 andreamad8 added the enhancement New feature or request label Apr 25, 2024
@ASchneidman

Some more context: we have a dataset with ~1.2 billion samples at roughly 1 MB per sample. The index.json file of the merged dataset will be in the tens of GBs, which makes the dataset prohibitively slow to initialize.
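To make the scale concrete, here is a rough back-of-the-envelope sketch of how the index size grows with the number of shards. The function names, the 64 MiB shard size limit, and the 500 bytes per shard entry are all assumptions chosen for illustration, not measurements from this dataset; real entries can be much larger depending on the columns and hashes stored.

```python
def estimate_num_shards(num_samples: int, bytes_per_sample: int,
                        shard_size_limit: int) -> int:
    """Estimate the shard count, assuming samples pack densely into shards."""
    total_bytes = num_samples * bytes_per_sample
    # Ceiling division without floating point.
    return -(-total_bytes // shard_size_limit)


def estimate_index_bytes(num_samples: int, bytes_per_sample: int,
                         shard_size_limit: int,
                         bytes_per_shard_entry: int) -> int:
    """Estimate index.json size as one fixed-size JSON entry per shard."""
    shards = estimate_num_shards(num_samples, bytes_per_sample, shard_size_limit)
    return shards * bytes_per_shard_entry


if __name__ == "__main__":
    NUM_SAMPLES = 1_200_000_000      # ~1.2 billion samples (from the thread)
    BYTES_PER_SAMPLE = 1_000_000     # ~1 MB per sample (from the thread)
    ENTRY_BYTES = 500                # ASSUMPTION: JSON bytes per shard entry

    # ASSUMPTION: compare an illustrative 64 MiB limit against a 1 GiB limit.
    for limit in (64 * 2**20, 2**30):
        size = estimate_index_bytes(NUM_SAMPLES, BYTES_PER_SAMPLE, limit, ENTRY_BYTES)
        print(f"shard limit {limit / 2**20:.0f} MiB -> index ~{size / 1e9:.2f} GB")
```

Under these assumptions the index size scales inversely with the shard size limit, which is why larger shards are a plausible workaround, although they also make each shard download coarser.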

@snarayan21
Collaborator

Hey, we have seen index.json load times be slow. I think that this is because we download the index file on every single rank, rather than downloading it on just one rank and then broadcasting its contents to other ranks. Downloading a file that's a few GB from cloud storage just on one rank should be relatively fast. This would be a good enhancement but isn't high priority for us right now -- if it's not too much of a hassle, mind submitting a PR?
