[Resumable IterableDataset] Add IterableDataset state_dict #6658
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
would be nice to have this feature in the new dataset release!
Before finalising this I'd like to make sure this philosophy makes sense for other libs. cc @muellerzr, I'd love your feedback on this one
Overall I think this looks like a very nice API decision, and super easy for us to bring into Accelerate as part of `load_state`. Will be nice to not have to use `skip_batches` if a user is using an `IterableDataset`.
One design question though: what's the logic behind `self._state_dict` rather than having it all be `state_dict`?
Private stuff doesn't exist in python, so what's the aim in doing that here and having `state_dict` be a passthrough to it? (If this is a common design pattern over in `datasets` that's okay)
We need to copy it every time the user accesses it. Otherwise we would get

    state_dict = ds.state_dict()
    for x in ds:
        assert ds.state_dict() == state_dict  # and actually `assert ds.state_dict() is state_dict`

The state is updated in-place since it's made of dictionaries that are shared with the steps in the IterableDataset pipeline.
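To make the copy argument concrete, here is a minimal sketch using a hypothetical toy class (`ToyResumableIterable` and its `index` key are made up for illustration and are not the `datasets` implementation). It assumes only what the comment describes: an internal dict mutated in place during iteration, and a `state_dict()` method that hands out a copy.

```python
import copy


class ToyResumableIterable:
    """Hypothetical stand-in for an iterable dataset; illustration only."""

    def __init__(self, data):
        self._data = list(data)
        # Internal state, mutated in place while iterating.
        self._state_dict = {"index": 0}

    def __iter__(self):
        self._state_dict["index"] = 0
        for item in self._data:
            # Advance the state before yielding, so a snapshot taken by the
            # consumer points at the next example to read.
            self._state_dict["index"] += 1
            yield item

    def state_dict(self):
        # Return a copy so earlier snapshots are not changed by later iteration.
        return copy.deepcopy(self._state_dict)


ds = ToyResumableIterable(["a", "b", "c"])
snapshot = ds.state_dict()  # copy taken before iterating
shared = ds._state_dict     # what a caller would hold without the copy

for _ in ds:
    pass

print(snapshot)  # {'index': 0}
print(shared)    # {'index': 3}
```

The snapshot keeps the value it had when it was taken, while the shared internal dict has silently moved on to the end of the data.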
What do you think of making it a full property with a docstring explicitly stating users shouldn’t call/modify it directly? I can imagine some exploratory users getting curious
I don't think users read docstrings of properties that often. What about explaining the logic in the
Sure, I can agree with that!
Just a small note mentioning that it returns a copy of the state dict should be enough imo
Looking forward to this PR being merged as well
A simple implementation of a mechanism to resume an IterableDataset.
This is WIP and untested.
Example:
returns
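As a rough illustration of the resume mechanism discussed in this PR, here is a minimal sketch built on a hypothetical toy class. `ToyResumableRange` and the `next_index` key are assumptions made for this sketch; only the method names `state_dict` and `load_state_dict` mirror the API discussed here, and the real `datasets` state layout may differ.

```python
import copy


class ToyResumableRange:
    """Hypothetical resumable iterable over range(n); illustration only."""

    def __init__(self, n):
        self._n = n
        self._state_dict = {"next_index": 0}

    def __iter__(self):
        # Start from wherever the state points, then keep it updated in place.
        start = self._state_dict["next_index"]
        for i in range(start, self._n):
            self._state_dict["next_index"] = i + 1
            yield i

    def state_dict(self):
        # Hand out a copy so a saved checkpoint is not mutated afterwards.
        return copy.deepcopy(self._state_dict)

    def load_state_dict(self, state_dict):
        # Copy on the way in too, so the caller's checkpoint stays untouched.
        self._state_dict = copy.deepcopy(state_dict)


ds = ToyResumableRange(6)
seen = []
for x in ds:
    seen.append(x)
    if x == 2:
        checkpoint = ds.state_dict()  # e.g. saved alongside a model checkpoint
        break

# Later: a fresh object picks up where the first one stopped.
resumed_ds = ToyResumableRange(6)
resumed_ds.load_state_dict(checkpoint)
resumed = list(resumed_ds)

print(seen)     # [0, 1, 2]
print(resumed)  # [3, 4, 5]
```

Updating the state before each `yield` means a checkpoint taken inside the loop points at the next unread example, so no example is repeated or skipped on resume.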