Add feature to drop last n keyframes for delta_timestamps #129 (Closed, +45 −8)
This will leave `self.hf_dataset["index"]` and `self.hf_dataset["episode_index"]` in an inconsistent state with `self.index` and `self.episode_data_index`.

One problem is that we are duplicating information that already exists on the `hf_dataset`, namely `hf_dataset["index"]`, so that we can mutate it. So instead of `LeRobotDataset` being a lightweight wrapper around an `hf_dataset`, we now introduce this new state in between (`self.index`) that the maintainers and the users will have to know and remember about.

Another problem is: what should `self.episode_data_index` now reflect? `self.episode_data_index` now holds info for the data in `hf_dataset` but not for `self.index`, which can be confusing.

And the dependencies on this newly introduced state only cascade down, where `num_samples` no longer depends directly on `self.hf_dataset`, etc.

So the intentions here are very good -- we want to make life better for the users (so that they can train better models), and in some sense this implementation adds the new functionality in as straightforward a way as possible. But the side effect is that the complexity of `LeRobotDataset` grows quite significantly and it is no longer clear what depends on what. This can lead to subtle bugs and will make extending the functionality much harder down the road (and new users to the codebase will have a much harder time understanding how things work -- for instance, what is the difference between `self.hf_dataset["index"]` and `self.index`, and under what conditions are they different? I know the answers to these questions, but only because I have spent a good bit of time with this code; for new folks that might be quite challenging).

One potential way of tackling this might be via sampling -- in the same way that the PyTorch `DataLoader` has `drop_last`. We could have a custom sampler, passed to the dataloader, that drops the last n examples from each episode based on the `episode_data_index`.

The advantage here is that we would not be adding complexity to `LeRobotDataset`. It would remain truer to its original intention: not a way to modify data, but a way to wrap a HuggingFace dataset in order to cater to the needs of training policies for controlling robots. The dependency would be one way -- `LeRobotDataset` would depend on the `hf_dataset` it gets passed, but the dependency would not go the other way. For all intents and purposes the data contained in the `hf_dataset` would remain immutable: we would not be modifying the data in the `hf_dataset`, nor would we be creating some state standing in for aspects of the `hf_dataset`. Our users would always know where the single source of truth is.

Also, the full functionality would be localized to a `Sampler`. Yes, if users wanted to make use of this functionality they would need to know of the existence of the sampler (maybe a good place to introduce it would be as an example in the docstring for `DiffusionPolicy`), but we would not be adding extra complexity to `LeRobotDataset`, and arguably we might be fighting complexity in the codebase by reducing coupling and doing something like dependency injection at the level of the dataloader.

Sorry, these are just some loose thoughts, and I understand that this problem is not easy to solve. But maybe some of this might be helpful.
I'll keep thinking about this and if anything comes to mind will comment again in this PR 🙂
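A minimal sketch of what the sampler idea above could look like. This is illustrative only: the class name, the `(start, end)` episode-range layout, and the `n_end_keyframes_dropped` parameter are assumptions from this discussion, and the class is written without a `torch` dependency -- it just mimics the iterable-of-indices interface that PyTorch's `Sampler` classes expose, so something shaped like it could be passed as `sampler=` to a `DataLoader`.

```python
import random


class DropLastKeyframesSampler:
    """Hypothetical sampler sketch: yields shuffled dataset indices while
    skipping the last ``n_end_keyframes_dropped`` frames of every episode.
    Mirrors the iterable interface of PyTorch samplers (``__iter__``/``__len__``)."""

    def __init__(self, episode_ranges, n_end_keyframes_dropped, seed=None):
        # episode_ranges: list of (start, end) frame-index pairs, end exclusive,
        # i.e. the information held in episode_data_index.
        self.indices = [
            i
            for start, end in episode_ranges
            for i in range(start, max(start, end - n_end_keyframes_dropped))
        ]
        self.rng = random.Random(seed)

    def __iter__(self):
        shuffled = self.indices[:]
        self.rng.shuffle(shuffled)
        return iter(shuffled)

    def __len__(self):
        return len(self.indices)


# Example: two 5-frame episodes (frames 0-4 and 5-9), dropping the last 2 of each.
sampler = DropLastKeyframesSampler([(0, 5), (5, 10)], 2, seed=0)
assert sorted(sampler) == [0, 1, 2, 5, 6, 7]
```

The key property is the one argued for above: the `hf_dataset` is never touched, and the "drop last n" behaviour lives entirely in the object handed to the dataloader.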
Just for reference, here is a `SubsetRandomSampler` from PyTorch. Might be nice to inherit from it and instantiate it with `episode_data_index` and `n_end_keyframes_dropped`. A minimal implementation might be a utils function that returns an index from a `LeRobotDataset` with the `n_end_keyframes_dropped` excluded.

Anyhow, sorry 🙂 Not sure if any of this is helpful; I might be completely off the mark with the intention behind this PR. Either way, it is very good to know of this difference vs the original training on `pusht`, so thank you for pointing me to this PR, @alexander-soare! 🙌
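The "minimal utils function" mentioned above might look something like this sketch. The function name and the `(start, end)` episode-range representation are assumptions for illustration; the returned list is what one would hand to PyTorch's `SubsetRandomSampler`.

```python
def episode_indices_without_last_n(episode_ranges, n_end_keyframes_dropped):
    """Hypothetical helper: flat dataset indices with the last
    ``n_end_keyframes_dropped`` frames of each episode excluded.

    episode_ranges: list of (start, end) frame-index pairs, end exclusive,
    i.e. the per-episode bounds stored in episode_data_index.
    """
    kept = []
    for start, end in episode_ranges:
        # max() guards against episodes shorter than the number of dropped frames.
        kept.extend(range(start, max(start, end - n_end_keyframes_dropped)))
    return kept


# Example: two episodes covering frames 0-4 and 5-9, dropping the last 2 of each.
indices = episode_indices_without_last_n([(0, 5), (5, 10)], 2)
# indices == [0, 1, 2, 5, 6, 7]
```

These indices could then be used as `DataLoader(dataset, sampler=SubsetRandomSampler(indices))`, keeping the dataset itself untouched.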
@radekosmulski I find your arguments very convincing :) Would you be interested in having a crack at the sampler approach by any chance? cc @Cadene who had this on his TODO list at some point but it got shoved back by other prios.
sure thing @alexander-soare, will give it a go! 🙂