[feature request] Problem in evaluation methodology in DDPG #199
Comments
This is legacy code, so I do agree with using a number of episodes for evaluation instead of a fixed number of steps. I think the best way to evaluate the agent would be to do it outside the training loop, that is to say, save the current policy using a callback and use another script to evaluate it. As a result, I would deprecate that feature and then remove it in a future version. Pinging @hill-a and @erniejunior.
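As a rough sketch of that idea (not the actual implementation: the save frequency, checkpoint paths and environment below are made up, and it uses the legacy function-based callback):

```python
from stable_baselines import DDPG

# Sketch: save the current policy every `save_freq` callback calls;
# evaluation then happens outside the training loop, in a separate script.
save_freq = 10000  # illustrative value
n_calls = [0]

def save_callback(_locals, _globals):
    n_calls[0] += 1
    if n_calls[0] % save_freq == 0:
        _locals['self'].save('checkpoints/ddpg_{}'.format(n_calls[0]))
    return True  # returning False would stop training

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=save_callback)
```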
I think DDPG is the only one implementing that testing mechanism.
Yes, I have written an episodic version here (also for DDPG and PPO).
Yes, all the metrics (for the different algos) are about the current training run only. For now, to properly test your policy at different points during training, you will need an external script.
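Such an external script could be as simple as the following sketch (the checkpoint path, environment and number of episodes are placeholders):

```python
import gym
from stable_baselines import DDPG

# Sketch: load a saved checkpoint and evaluate this single, fixed policy
# over several full episodes, reporting the mean episodic return.
model = DDPG.load('checkpoints/ddpg_100000')  # hypothetical checkpoint path
env = gym.make('Pendulum-v0')

n_eval_episodes = 20
returns = []
for _ in range(n_eval_episodes):
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    returns.append(ep_return)

print('Mean return over {} episodes: {:.2f}'.format(n_eval_episodes, sum(returns) / len(returns)))
```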
For evaluation, I would use
I think you are right, this implementation doesn't really make sense in an episodic environment. The implementation in rllab is much clearer.
Solved by the new callback collection.
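For reference, with the callback collection this can look roughly like the following (evaluation frequency, episode count and environment are illustrative):

```python
import gym
from stable_baselines import DDPG
from stable_baselines.common.callbacks import EvalCallback

# Sketch: periodically evaluate the current policy on a separate environment
# for a fixed number of full episodes, keeping the best model seen so far.
eval_env = gym.make('Pendulum-v0')
eval_callback = EvalCallback(eval_env, n_eval_episodes=20, eval_freq=5000,
                             best_model_save_path='./logs/', log_path='./logs/',
                             deterministic=True)

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=eval_callback)
```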
I do not really agree with the evaluation procedure for DDPG (I don't know if it's the same in other algos).
How it works now:
For nb_rollout_steps, the agent performs steps in the environment.
Then, for nb_train_steps, it trains (nb_train_steps batches).
Then it evaluates for nb_eval_steps.
I guess it could kind of make sense if there were no notion of episode or reset; otherwise it doesn't. You first perform part of an episode, then you train, then you evaluate on part of an episode. Why make this so complicated? My main problem with this is that the evaluation return is based on a single episode, and that this one evaluation episode is performed with different policies (since you roll out and train every nb_eval_steps). You want a good estimate of your policy at time t, and you end up with a very noisy (because the sample size is 1) estimate of several different policies.
What I did when I worked with DDPG (and what the paper Deep Reinforcement Learning that Matters ended up doing, I think) is the following (see the sketch below):
Repeat for N episodes
Do 1 full episode of rollout
Train
End repeat
Evaluate on nb_eval_rollouts (~20) episodes.
The point on your learning curve is the average of these ~20 evaluation rollouts, all performed with the same policy.
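Roughly, that schedule would look like this (a sketch only: the agent here is a random-action placeholder standing in for DDPG, and the episode and evaluation counts are arbitrary):

```python
import gym
import numpy as np

def evaluate(policy_fn, env, n_eval_rollouts=20):
    """Average episodic return over n_eval_rollouts full episodes with one fixed policy."""
    returns = []
    for _ in range(n_eval_rollouts):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))

env = gym.make('Pendulum-v0')
eval_env = gym.make('Pendulum-v0')

# Placeholder policies standing in for the DDPG exploration / greedy policies.
exploration_policy = lambda obs: env.action_space.sample()
greedy_policy = lambda obs: eval_env.action_space.sample()

n_episodes, eval_every, learning_curve = 100, 10, []
for episode in range(n_episodes):
    # 1. One full rollout episode (transitions would go into the replay buffer).
    obs, done = env.reset(), False
    while not done:
        obs, reward, done, _ = env.step(exploration_policy(obs))
    # 2. Train: gradient updates on the replay buffer (omitted for the placeholder agent).
    # 3. Every `eval_every` episodes, evaluate the *same* frozen policy over ~20 episodes.
    if (episode + 1) % eval_every == 0:
        learning_curve.append(evaluate(greedy_policy, eval_env))
```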
Actually, I can't think of any example that is not episode-based and does not use reset.
What do you think?
Is it the same in the other algos?