
[question] HER does not sample very last state of episode as achieved goal #666

Open · nicoguertler opened this issue Jan 28, 2020 · 3 comments
Labels: question (Further information is requested)


nicoguertler commented Jan 28, 2020

It seems to me that when HER samples an achieved goal from the replay buffer it never samples the very last state of the episode. Is this intended?

As a consequence, the sampling strategy "final" seems to always sample the second-to-last state. My understanding is that the paper suggests using the very last state, and in some environments this could actually make a difference.

https://github.com/nicoguertler/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L99-L125
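Condensing the linked lines, the "final" branch boils down to this (a paraphrase of the linked code, with the other strategy branches omitted):

```python
# Paraphrase of the linked _sample_achieved_goal logic ("final" strategy only):
if self.goal_selection_strategy == GoalSelectionStrategy.FINAL:
    # the last *transition* of the episode is chosen...
    selected_transition = episode_transitions[-1]
# ...but the achieved goal is then read from the old observation (index 0)
return self.env.convert_obs_to_dict(selected_transition[0])['achieved_goal']
```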

araffin added the question label on Feb 1, 2020
araffin (Collaborator) commented Feb 1, 2020

Hello,

> It seems to me that when HER samples an achieved goal from the replay buffer it never samples the very last state of the episode.

`selected_transition = episode_transitions[-1]`

The index `[-1]` in Python is the last element of a list, no? (And indices start at zero.)
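A quick check with a toy list (illustrative values):

```python
episode_transitions = ["t0", "t1", "t2"]
print(episode_transitions[-1])  # 't2' -- [-1] picks the last element
print(episode_transitions[0])   # 't0' -- indices start at zero
```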

EDIT: this may be related to #400

nicoguertler (Author) commented

Hello and thank you for your answer!

The episode transitions are tuples containing the old observation, the action, the reward, the new observation, and the done boolean.

`self.episode_transitions.append((obs_t, action, reward, obs_tp1, done))`

The goal selection strategy "final" indeed chooses the last transition, but then uses the old observation (index 0), which corresponds to the second-to-last observation of the episode.

`return self.env.convert_obs_to_dict(selected_transition[0])['achieved_goal']`

In my opinion, it would be closer to what the paper suggests to use the element at index 3 of the transition (`obs_tp1`, i.e. the new and thus very last observation) for the goal selection strategy "final".
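Concretely, the change I have in mind would be something like this (just a sketch of the suggestion, not a tested patch):

```python
# Sketch (untested): for the "final" strategy, read the achieved goal from
# obs_tp1 (index 3), the new and thus very last observation of the episode,
# instead of obs_t (index 0).
selected_transition = episode_transitions[-1]
return self.env.convert_obs_to_dict(selected_transition[3])['achieved_goal']
```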

This could be relevant when the last observation of the episode is important, for example if the episode ends with a contact between objects that appears only in the very last observation.

The other goal selection strategies do not use the very last observation either, but this is probably not a big problem as long as the episodes are long enough. In issue #578, however, it is reported that the implementation of HER does not work properly if the episode length is 1. I think this is precisely because of the behavior described above.
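A toy example of the length-1 case (hypothetical dict observations standing in for `convert_obs_to_dict`):

```python
# Episode of length 1: a single transition (obs_t, action, reward, obs_tp1, done)
obs_0 = {'achieved_goal': 0.0}  # initial observation
obs_1 = {'achieved_goal': 1.0}  # observation after the only step
episode_transitions = [(obs_0, 'action', 0.0, obs_1, True)]

selected_transition = episode_transitions[-1]
print(selected_transition[0]['achieved_goal'])  # 0.0: goal "achieved" before acting
print(selected_transition[3]['achieved_goal'])  # 1.0: the goal actually reached
```

So with episodes of length 1, the only achieved goal that can ever be sampled is the one from the initial observation.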

Thank you for your help!

araffin (Collaborator) commented Feb 2, 2020

Thanks for the clarification.
For #578, it seems normal for the "future" strategy (cf. the answer in #578 (comment)).

For the rest, I need to think more about it.
