
PPO2 on Swimmer-v2. Avg reward plot matches the one in the paper but video is not meaningful. #1013

Open
hiteshisharma opened this issue Oct 2, 2019 · 6 comments

Comments

@hiteshisharma

hiteshisharma commented Oct 2, 2019

I ran PPO2 on the MuJoCo Swimmer-v2 environment with the hyperparameters from the paper for 1 million timesteps. I was able to generate an average-reward plot similar to the one in the paper, but when I ran the trained model to visualize it, the agent did not seem to be working.
[Attached: average-reward plot for PPO2 on Swimmer-v2, and a video of the trained swimmer]

Has anyone faced the same issue?
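For reference, training and playback with baselines are usually invoked along these lines (the save path and exact flag values here are illustrative assumptions, not necessarily the commands used above):

```shell
# Train PPO2 on Swimmer-v2 for 1M timesteps and save the parameters
# (the save path is a hypothetical example).
python -m baselines.run --alg=ppo2 --env=Swimmer-v2 \
    --num_timesteps=1e6 --save_path=~/models/swimmer_1M_ppo2

# Reload the saved parameters and render the agent without further training.
python -m baselines.run --alg=ppo2 --env=Swimmer-v2 \
    --num_timesteps=0 --load_path=~/models/swimmer_1M_ppo2 --play
```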

@christopherhesse
Contributor

Assuming you're not seeing the correct average reward when visualizing the agent, are you failing to load the parameters or using different observation normalization?
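One common source of such a mismatch is observation normalization: baselines' MuJoCo setup wraps the environment in a running-mean/std normalizer (VecNormalize), and if the statistics accumulated during training are not restored at playback, the policy sees differently scaled observations. A minimal sketch of the idea, using a simplified toy normalizer (not baselines' actual class):

```python
class RunningNormalizer:
    """Running mean/std normalizer (Welford's algorithm) -- a toy
    stand-in for the statistics a wrapper like VecNormalize maintains."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.count if self.count > 0 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-8)


# Statistics accumulated during training...
train_norm = RunningNormalizer()
for obs in [0.5, 1.5, 2.5, 3.5]:
    train_norm.update(obs)

# ...versus a fresh normalizer at playback time: the same raw
# observation maps to a wildly different input for the policy.
fresh_norm = RunningNormalizer()
fresh_norm.update(2.0)
print(train_norm.normalize(3.5))  # scaled with the trained statistics
print(fresh_norm.normalize(3.5))  # blows up: fresh stats have zero variance
```

If the playback script rebuilds the normalization wrapper from scratch instead of restoring the saved statistics, the policy can act on garbage inputs even though the network parameters loaded correctly.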

@hiteshisharma
Author

It looks like, in order to have a working agent in the visualization, the Swimmer-v2 environment should reach an average reward of ~300, but mine is stuck around 100-110. So while the average reward during visualization matches the training curve, the swimmer does not seem to be moving.

Please let me know if I am not understanding it fully.

@christopherhesse
Contributor

Oh, so you're saying the learning curve is consistent with the video, but that the learning curve isn't very good because it should get to a reward of 300, but only gets to like 100, right?

@christopherhesse
Contributor

It looks like baselines expects to get about 100 for swimmer: http://htmlpreview.github.io/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm

Where did you get the 300 value from?

@hiteshisharma
Author

This paper: https://arxiv.org/pdf/1906.08649.pdf

What I meant was that we need more than 100-120 to see the swimmer working in the video.

@christopherhesse
Contributor

Looking at that paper, it seems to say that PPO gets a score of 155 after 200k timesteps, right?
