[feature request] Problem in evaluation methodology in DDPG #199

Closed
ccolas opened this issue Feb 13, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@ccolas
Collaborator

ccolas commented Feb 13, 2019

I do not really agree with the evaluation procedure for DDPG (I don't know if it's the same in other algos).

How it works now:
For nb_rollout_steps, the agent performs steps in the environment.
Then, for nb_train_steps, it trains (nb_train_steps batches).
Then it evaluates for nb_eval_steps.
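
Schematically, the current interleaved loop looks something like the sketch below (agent.act, agent.store_transition and agent.train_batch are hypothetical placeholders, not the actual stable-baselines API; only the loop shape matters):

```python
# Rough sketch of the current interleaved rollout/train/eval structure.
def interleaved_loop(agent, env, eval_env, nb_cycles,
                     nb_rollout_steps, nb_train_steps, nb_eval_steps):
    obs = env.reset()
    eval_obs = eval_env.reset()
    for _ in range(nb_cycles):
        # 1) collect nb_rollout_steps transitions (often only part of an episode)
        for _ in range(nb_rollout_steps):
            action = agent.act(obs, exploration=True)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs, done)
            obs = env.reset() if done else next_obs
        # 2) nb_train_steps gradient updates: the policy changes here
        for _ in range(nb_train_steps):
            agent.train_batch()
        # 3) nb_eval_steps evaluation steps, resuming the same evaluation episode
        #    started in earlier cycles, so its return mixes several policies
        for _ in range(nb_eval_steps):
            action = agent.act(eval_obs, exploration=False)
            eval_obs, _, eval_done, _ = eval_env.step(action)
            if eval_done:
                eval_obs = eval_env.reset()
```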

I guess it could kind of make sense if there were no notion of episode or reset; otherwise it doesn't. You first perform part of an episode, then you train, then you evaluate on part of an episode. Why make this so complicated? My main problem is that the evaluation return is based on a single data point, and that this single evaluation episode is performed with different policies (since you roll out and train every nb_eval_steps). You want a good estimate of your policy at time t, and you end up with a very noisy estimate (because the sample size is 1) of several different policies.

What I did when I worked with DDPG (and what the paper Deep Reinforcement Learning that Matters ended up doing, I think):

Repeat for N episodes
Do 1 full episode of rollout
Train
End repeat
Evaluate on nb_eval_rollouts (~20) episodes.

The point on your learning curve is the average of these 20 evaluation rollouts, all performed with the same policy.
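
A minimal sketch of that protocol, with the same placeholder agent API as above (n_train_episodes, nb_eval_rollouts and the agent methods are assumptions for illustration):

```python
import numpy as np

# Train on full episodes, then freeze the policy and average the return over
# nb_eval_rollouts evaluation episodes, all run with that single frozen policy.
def train_then_evaluate(agent, env, eval_env, n_train_episodes, nb_eval_rollouts=20):
    for _ in range(n_train_episodes):
        obs, done = env.reset(), False
        while not done:                      # 1 full episode of rollout
            action = agent.act(obs, exploration=True)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs, done)
            obs = next_obs
        agent.train()                        # train after the episode ends

    returns = []
    for _ in range(nb_eval_rollouts):        # evaluation with the frozen policy
        obs, done, ep_return = eval_env.reset(), False, 0.0
        while not done:
            action = agent.act(obs, exploration=False)
            obs, reward, done, _ = eval_env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return np.mean(returns)                  # one point on the learning curve
```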

Actually, I can't think of any example that is not episode-based and does not use resets.

What do you think?
Is it the same in other algos?

@araffin araffin added the enhancement New feature or request label Feb 13, 2019
@araffin
Collaborator

araffin commented Feb 13, 2019

What do you think?

This is legacy code, so I do agree with evaluating over a number of episodes instead of a fixed number of steps.
For training, the problem may be that an episode has a lot of steps, so you don't want to wait for the end of the episode to update your policy.

I think the best way to evaluate the agent would be to do it outside the training loop, that is to say, save the current policy using a callback and use another script to evaluate it. As a result, I would deprecate that feature and then remove it in a future version.
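
Something like the sketch below, assuming the function-style callback (callback(locals_, globals_)) that stable-baselines currently uses; the save frequency, paths and environment are arbitrary choices for illustration:

```python
import os
from stable_baselines import DDPG

os.makedirs('snapshots', exist_ok=True)

class SnapshotSaver:
    """Save the current policy every `save_freq` callback calls; evaluation is
    then done later, in a separate script that loads each snapshot."""
    def __init__(self, save_freq=10000):
        self.save_freq = save_freq
        self.n_calls = 0

    def __call__(self, locals_, globals_):
        self.n_calls += 1
        if self.n_calls % self.save_freq == 0:
            locals_['self'].save('snapshots/ddpg_step_{}'.format(self.n_calls))
        return True  # returning True keeps training going

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=SnapshotSaver())
```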

Pinging @hill-a and @erniejunior .

Is it the same in other algos?

I think DDPG is the only one implementing that testing mechanism.

@ccolas
Collaborator Author

ccolas commented Feb 13, 2019

  • I agree, we could do a small number of rollout steps, then training steps, repeatedly, and at some point freeze the whole thing and do N episodes of evaluation. The only point is to evaluate a single policy, and to do it with a sufficient number of episodes to get a meaningful estimate.

  • After talking with Rémy from my team, it looks like SAC is doing the same. In addition, there is no proper evaluation: the reported returns are computed over the training episodes. I guess in evaluation episodes we would use the mean of the policy distribution?

@araffin
Collaborator

araffin commented Feb 13, 2019

it looks like SAC is doing the same

Yes, I have written an episodic version here (also for DDPG and PPO).

In addition, there is no proper evaluation: the reported returns are computed over the training episodes.

Yes, all the metrics (for the different algos) are about the current training only. For now, to properly test your policy at different points during training, you will need an external script.

I guess in evaluation episodes we would use the mean of the policy distribution?

For evaluation, I would use the predict() method with deterministic=True (at least for SAC).
This uses the mean of the distribution, yes.
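
For example (the saved model path and the number of evaluation episodes are arbitrary here):

```python
import gym
import numpy as np
from stable_baselines import SAC

model = SAC.load('sac_pendulum')  # hypothetical snapshot saved during training
env = gym.make('Pendulum-v0')

returns = []
for _ in range(20):
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        # deterministic=True -> the mean of the Gaussian policy, no sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    returns.append(ep_return)
print('mean return over 20 episodes:', np.mean(returns))
```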

@ffnc1020

ffnc1020 commented Mar 7, 2019

I think you are right: this implementation doesn't really make sense in an episodic environment. The implementation in rllab is much clearer.

@araffin
Collaborator

araffin commented Mar 12, 2020

solved by the new callback collection
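
For reference, a minimal sketch with EvalCallback from the callback collection (environment, frequencies and paths are illustrative):

```python
import gym
from stable_baselines import DDPG
from stable_baselines.common.callbacks import EvalCallback

eval_env = gym.make('Pendulum-v0')
# Every eval_freq training steps, run n_eval_episodes with the current policy
# and log the mean return; the best model is saved along the way.
eval_callback = EvalCallback(eval_env, n_eval_episodes=20, eval_freq=5000,
                             best_model_save_path='./logs/', log_path='./logs/')

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=eval_callback)
```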

@araffin araffin closed this as completed Mar 12, 2020