
Fixes issues with PushT diffusion #41

Merged
merged 9 commits into huggingface:main on Mar 21, 2024

Conversation


alexander-soare commented Mar 21, 2024

This PR makes a number of changes to enable training of a SOTA diffusion policy model for the PushT environment.

  • Tweaks to the observation encoder, including the use of SpatialSoftmax, ReLU activation, and switching the order in which observation features are concatenated (to match the original implementation). A sketch of the SpatialSoftmax idea is given after this list.
  • Enable random cropping of image observations during training.
  • Image normalization to match the original implementation.
  • Track the EMA model within the policy's torch.nn.Module parameters and use it for rollouts (see the EMA sketch after this list).
  • Don't draw the action marker on image observations during rollout.
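
For reference, here is a minimal sketch of the SpatialSoftmax idea (per-channel expected 2D keypoints computed from a CNN feature map). This is an illustration, not the exact lerobot module: the real implementation may, for example, first project the feature map to a fixed number of keypoint channels or handle the softmax temperature differently.

```python
import torch
from torch import nn


class SpatialSoftmax(nn.Module):
    """Per-channel expected (x, y) keypoints from a feature map.

    Input:  (B, C, H, W) feature maps.
    Output: (B, C * 2) keypoint coordinates, normalized to [-1, 1].
    """

    def __init__(self, height: int, width: int, temperature: float = 1.0):
        super().__init__()
        self.temperature = temperature
        # Fixed pixel-coordinate grids, normalized to [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, height),
            torch.linspace(-1.0, 1.0, width),
            indexing="ij",
        )
        self.register_buffer("pos_x", xs.reshape(1, 1, -1))
        self.register_buffer("pos_y", ys.reshape(1, 1, -1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        b, c, h, w = features.shape
        # Softmax over the spatial dimensions gives a per-channel attention map.
        attention = torch.softmax(features.reshape(b, c, h * w) / self.temperature, dim=-1)
        # Expected coordinates under that attention map.
        expected_x = (attention * self.pos_x).sum(dim=-1)  # (B, C)
        expected_y = (attention * self.pos_y).sum(dim=-1)  # (B, C)
        return torch.cat([expected_x, expected_y], dim=-1)  # (B, C * 2)
```

And a sketch of the EMA idea: keeping the EMA copy as a registered submodule means it travels with the policy's state_dict and can be selected for rollouts. The names (`PolicyWithEMA`, `update_ema`) are placeholders, and the real implementation may additionally use a decay warmup schedule and copy buffers as well as parameters.

```python
import copy

import torch
from torch import nn


class PolicyWithEMA(nn.Module):
    """Wrap a policy and keep an EMA copy of its weights as a registered submodule."""

    def __init__(self, policy: nn.Module, decay: float = 0.995):
        super().__init__()
        self.policy = policy
        self.decay = decay
        # Registered as a submodule, so it is saved/loaded with the usual state_dict machinery.
        self.ema_policy = copy.deepcopy(policy)
        self.ema_policy.requires_grad_(False)

    @torch.no_grad()
    def update_ema(self):
        """Call once after each optimizer step."""
        for ema_p, p in zip(self.ema_policy.parameters(), self.policy.parameters()):
            # ema = decay * ema + (1 - decay) * current
            ema_p.lerp_(p, 1.0 - self.decay)

    def forward(self, *args, use_ema: bool = False, **kwargs):
        # Pass use_ema=True at rollout time to use the smoothed weights.
        net = self.ema_policy if use_ema else self.policy
        return net(*args, **kwargs)
```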

About the trained model weights

The training losses for a model trained on lerobot and an equivalent model trained on the original repository look about the same:

[Image: training loss curves for lerobot vs. the original repository]

I then ported their weights over to lerobot and ran 500 eval rollouts for each model (I chose the best weights from each, at around 200k training steps).

For theirs we got:

{
  "avg_max_reward": 0.9191964075746946,
  "pc_success": 60.199999999999996
}

For ours we got:

{
  "avg_max_reward": 0.9064452842990868,
  "pc_success": 42.4
}

The "pc_success" measures the proportion (as a %) of rollouts that results in a >= 95% overlap between the T and the target being reached.

The "avg_max_reward" metric is the one they use in the paper, for which they report 0.91/0.84 (picking the best checkpoint / picking the average of the last 10 checkpoints). It measures max(clip(overlap / success_threshold, 0, 1)) such that if there is success, the reward is 1.

Because of the relatively small gap in "avg_max_reward" (~0.01) but the much larger gap in "pc_success" (~18 percentage points), I speculate that the issue has to do with fine-grained control when the T is near 95% overlap. For example, consider this rollout, where our policy quickly achieves a near-optimal placement but isn't able to close the final gap:

[Rollout video: our policy]

Contrast this with the model trained on the official DP repo, where the initial approach is arguably worse but the policy applies the finishing touches much faster:

[Rollout video: policy trained on the official repo]

Also note that eval on their repo (with the same model weights) gives a higher "avg_max_reward" of 0.97–0.98, although the rollouts look qualitatively the same. Clearly we have some other differences in eval/data that we need to hunt down.

Cadene left a comment

LGTM


Cadene commented Mar 21, 2024

Thanks for your PR. It looks super clean. Too bad we can't reproduce exactly the same policy; it seems a bit behind theirs, as you showed with the less fine-grained behavior it produces.

Cadene merged commit b633748 into huggingface:main on Mar 21, 2024
1 check passed

Cadene commented Mar 21, 2024

Merged to be able to load your pretrained model on main ;)
