[feature request] Problem in evaluation methodology in DDPG #199

Closed
ccolas opened this issue Feb 13, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@ccolas
Collaborator

ccolas commented Feb 13, 2019

I do not really agree with the evaluation procedure for DDPG (I don't know if it's the same in other algos).

How it works now:
For nb_rollout_steps, the agent performs steps in the environment.
Then, for nb_train_steps, it trains (nb_train_steps batches).
Then it evaluates for nb_eval_steps.
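
Schematically, the current interleaved loop looks something like the sketch below (agent.act, agent.store_transition and agent.train_batch are hypothetical placeholders, not the actual stable-baselines API; only the loop shape matters):

```python
# Rough sketch of the current interleaved rollout/train/eval structure.
def interleaved_loop(agent, env, eval_env, nb_cycles,
                     nb_rollout_steps, nb_train_steps, nb_eval_steps):
    obs = env.reset()
    eval_obs = eval_env.reset()
    for _ in range(nb_cycles):
        # 1) collect nb_rollout_steps transitions (often only part of an episode)
        for _ in range(nb_rollout_steps):
            action = agent.act(obs, exploration=True)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs, done)
            obs = env.reset() if done else next_obs
        # 2) nb_train_steps gradient updates: the policy changes here
        for _ in range(nb_train_steps):
            agent.train_batch()
        # 3) nb_eval_steps evaluation steps, resuming the same evaluation episode
        #    started in earlier cycles, so its return mixes several policies
        for _ in range(nb_eval_steps):
            action = agent.act(eval_obs, exploration=False)
            eval_obs, _, eval_done, _ = eval_env.step(action)
            if eval_done:
                eval_obs = eval_env.reset()
```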

I guess it could kind of make sense if there were no notion of episode or reset; otherwise it doesn't. You first perform part of an episode, then you train, then you evaluate on part of an episode. Why make this so complicated? My main problem is that the evaluation return is based on a single data point, and that this single evaluation episode is performed with different policies (since you roll out and train every nb_eval_steps). You want a good estimate of your policy at time t, and you end up with a very noisy estimate (because the sample size is 1) of several different policies.

What I did when I worked with DDPG (and what the paper Deep Reinforcement Learning that Matters ended up doing, I think):

Repeat for N episodes
Do 1 full episode of rollout
Train
End repeat
Evaluate on nb_eval_rollouts (~20) episodes.

The point on your learning curve is the average of these 20 evaluation rollouts, all performed with the same policy.
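
A minimal sketch of that protocol, with the same placeholder agent API as above (n_train_episodes, nb_eval_rollouts and the agent methods are assumptions for illustration):

```python
import numpy as np

# Train on full episodes, then freeze the policy and average the return over
# nb_eval_rollouts evaluation episodes, all run with that single frozen policy.
def train_then_evaluate(agent, env, eval_env, n_train_episodes, nb_eval_rollouts=20):
    for _ in range(n_train_episodes):
        obs, done = env.reset(), False
        while not done:                      # 1 full episode of rollout
            action = agent.act(obs, exploration=True)
            next_obs, reward, done, _ = env.step(action)
            agent.store_transition(obs, action, reward, next_obs, done)
            obs = next_obs
        agent.train()                        # train after the episode ends

    returns = []
    for _ in range(nb_eval_rollouts):        # evaluation with the frozen policy
        obs, done, ep_return = eval_env.reset(), False, 0.0
        while not done:
            action = agent.act(obs, exploration=False)
            obs, reward, done, _ = eval_env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return np.mean(returns)                  # one point on the learning curve
```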

Actually, I can't think of any example that is not episode-based and does not use resets.

What do you think?
Is it the same in other algos?

@araffin araffin added the enhancement New feature or request label Feb 13, 2019
@araffin
Collaborator

araffin commented Feb 13, 2019

What do you think?

This is legacy code, so I do agree with evaluating over a number of episodes instead of a fixed number of steps.
For training, the problem may be that an episode has a lot of steps, so you don't want to wait for the end of the episode to update your policy.

I think the best way to evaluate the agent would be to do it outside the training loop, that is to say, save the current policy using a callback and use another script to evaluate it. As a result, I would deprecate that feature and then remove it in a future version.
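
Something like the sketch below, assuming the function-style callback (callback(locals_, globals_)) that stable-baselines currently uses; the save frequency, paths and environment are arbitrary choices for illustration:

```python
import os
from stable_baselines import DDPG

os.makedirs('snapshots', exist_ok=True)

class SnapshotSaver:
    """Save the current policy every `save_freq` callback calls; evaluation is
    then done later, in a separate script that loads each snapshot."""
    def __init__(self, save_freq=10000):
        self.save_freq = save_freq
        self.n_calls = 0

    def __call__(self, locals_, globals_):
        self.n_calls += 1
        if self.n_calls % self.save_freq == 0:
            locals_['self'].save('snapshots/ddpg_step_{}'.format(self.n_calls))
        return True  # returning True keeps training going

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=SnapshotSaver())
```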

Pinging @hill-a and @erniejunior .

Is it the same in other algos?

I think DDPG is the only one implementing that testing mechanism.

@ccolas
Collaborator Author

ccolas commented Feb 13, 2019

  • I agree, we could do a small number of rollout steps, then training steps, repeatedly, and at some point freeze the whole thing and do N episodes of evaluation. The only point is to evaluate a single policy, and to do it with a sufficient number of episodes to get a meaningful estimate.

  • After talking with Rémy from my team, it looks like SAC is doing the same. In addition, there is no proper evaluation: the reported returns are computed over the training episodes. I guess in evaluation episodes we would use the mean of the policy distribution?

@araffin
Collaborator

araffin commented Feb 13, 2019

it looks like SAC is doing the same

Yes, I have written an episodic version here (also for DDPG and PPO).

In addition, there is no proper evaluation: the reported returns are computed over the training episodes.

Yes, all the metrics (for the different algos) are about the current training only. For now, to properly test your policy at different points during training, you will need an external script.

I guess in evaluation episodes we would use the mean of the policy distribution?

For evaluation, I would use the predict() method with deterministic=True (at least for SAC).
This uses the mean of the distribution, yes.
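
For example (the saved model path and the number of evaluation episodes are arbitrary here):

```python
import gym
import numpy as np
from stable_baselines import SAC

model = SAC.load('sac_pendulum')  # hypothetical snapshot saved during training
env = gym.make('Pendulum-v0')

returns = []
for _ in range(20):
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        # deterministic=True -> the mean of the Gaussian policy, no sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    returns.append(ep_return)
print('mean return over 20 episodes:', np.mean(returns))
```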

@ffnc1020

ffnc1020 commented Mar 7, 2019

I think you are right: this implementation doesn't really make sense in an episodic environment. The implementation in rllab is much clearer.

@araffin
Collaborator

araffin commented Mar 12, 2020

solved by the new callback collection
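
For reference, a minimal sketch with EvalCallback from the callback collection (environment, frequencies and paths are illustrative):

```python
import gym
from stable_baselines import DDPG
from stable_baselines.common.callbacks import EvalCallback

eval_env = gym.make('Pendulum-v0')
# Every eval_freq training steps, run n_eval_episodes with the current policy
# and log the mean return; the best model is saved along the way.
eval_callback = EvalCallback(eval_env, n_eval_episodes=20, eval_freq=5000,
                             best_model_save_path='./logs/', log_path='./logs/')

model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(total_timesteps=100000, callback=eval_callback)
```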

@araffin araffin closed this as completed Mar 12, 2020