
[question] HER does not sample very last state of episode as achieved goal #666

Open · nicoguertler opened this issue Jan 28, 2020 · 3 comments
Labels: question (Further information is requested)


nicoguertler commented Jan 28, 2020

It seems to me that when HER samples an achieved goal from the replay buffer it never samples the very last state of the episode. Is this intended?

As a consequence, the sampling strategy "final" seems to always sample the second-to-last state. My understanding is that the paper suggests using the very last state, and in some environments this could actually make a difference.

https://github.com/nicoguertler/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L99-L125
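Condensing the linked lines, the "final" branch boils down to this (a paraphrase of the linked code, with the other strategy branches omitted):

```python
# Paraphrase of the linked _sample_achieved_goal logic ("final" strategy only):
if self.goal_selection_strategy == GoalSelectionStrategy.FINAL:
    # the last *transition* of the episode is chosen...
    selected_transition = episode_transitions[-1]
# ...but the achieved goal is then read from the old observation (index 0)
return self.env.convert_obs_to_dict(selected_transition[0])['achieved_goal']
```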

araffin added the question label on Feb 1, 2020
araffin (Collaborator) commented Feb 1, 2020

Hello,

> It seems to me that when HER samples an achieved goal from the replay buffer it never samples the very last state of the episode.

`selected_transition = episode_transitions[-1]`

The index `[-1]` in Python is the last element of a list, no? (And indices start at zero.)
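A quick check with a toy list (illustrative values):

```python
episode_transitions = ["t0", "t1", "t2"]
print(episode_transitions[-1])  # 't2' -- [-1] picks the last element
print(episode_transitions[0])   # 't0' -- indices start at zero
```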

EDIT: this may be related to #400

nicoguertler (Author) commented

Hello and thank you for your answer!

The episode transitions are tuples containing the old observation, the action, the reward, the new observation, and the done boolean.

`self.episode_transitions.append((obs_t, action, reward, obs_tp1, done))`

The goal selection strategy "final" indeed chooses the last transition, but then uses the old observation (index 0), which corresponds to the second-to-last observation of the episode.

`return self.env.convert_obs_to_dict(selected_transition[0])['achieved_goal']`

In my opinion, it would be closer to what the paper suggests to use the element at index 3 of the transition (`obs_tp1`, i.e. the new and thus very last observation) for the goal selection strategy "final".
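Concretely, the change I have in mind would be something like this (just a sketch of the suggestion, not a tested patch):

```python
# Sketch (untested): for the "final" strategy, read the achieved goal from
# obs_tp1 (index 3), the new and thus very last observation of the episode,
# instead of obs_t (index 0).
selected_transition = episode_transitions[-1]
return self.env.convert_obs_to_dict(selected_transition[3])['achieved_goal']
```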

This could be relevant when the last observation of the episode is important, for example if the episode ends with a contact between objects that appears only in the very last observation.

The other goal selection strategies do not use the very last observation either, but this is probably not a big problem as long as the episodes are long enough. In issue #578, however, it is reported that the implementation of HER does not work properly if the episode length is 1. I think this is precisely because of the behavior described above.
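A toy example of the length-1 case (hypothetical dict observations standing in for `convert_obs_to_dict`):

```python
# Episode of length 1: a single transition (obs_t, action, reward, obs_tp1, done)
obs_0 = {'achieved_goal': 0.0}  # initial observation
obs_1 = {'achieved_goal': 1.0}  # observation after the only step
episode_transitions = [(obs_0, 'action', 0.0, obs_1, True)]

selected_transition = episode_transitions[-1]
print(selected_transition[0]['achieved_goal'])  # 0.0: goal "achieved" before acting
print(selected_transition[3]['achieved_goal'])  # 1.0: the goal actually reached
```

So with episodes of length 1, the only achieved goal that can ever be sampled is the one from the initial observation.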

Thank you for your help!

araffin (Collaborator) commented Feb 2, 2020

Thanks for the clarification.
For #578, it seems normal for the "future" strategy (cf. the answer in #578 (comment)).

For the rest, I need to think more about it.
