This PR makes a number of changes to enable training of a SOTA diffusion policy model for the PushT environment.
About the trained model weights
The training losses for a model trained on lerobot and an equivalent model trained on the original repository look about the same:
I then ported their weights over to lerobot and ran 500 eval experiments for each model (I chose the best weights from each, at around 200k training steps).
For theirs we got:
For ours we got:
The "pc_success" measures the proportion (as a %) of rollouts that results in a >= 95% overlap between the T and the target being reached.
The "avg_max_reward" metric is the one they use in the paper, for which they report 0.91/0.84 (picking the best checkpoint / picking the average of the last 10 checkpoints). It measures max(clip(overlap / success_threshold, 0, 1)) such that if there is success, the reward is 1.
Because the gap in "avg_max_reward" is relatively small (0.1) while the gap in "pc_success" is larger (~0.2), I speculate that the issue has to do with fine-grained control when the T is near 95% overlap. For example, consider this rollout where our policy quickly achieves a near-optimal placement but isn't able to close the final gap:
Contrast this with the model trained on the official DP repo, where the initial approach is arguably worse but it applies the finishing touches much faster:
Also note that eval on their repo (with the same model weights) gives a higher "avg_max_reward" of 0.97 ~ 0.98, although the rollouts look qualitatively the same. Clearly there are some other differences in eval/data that we need to hunt down.