rl_unplugged/rwrl_d4pg.ipynb does not reproduce #74

Closed
pmineiro opened this issue Sep 3, 2020 · 6 comments
pmineiro commented Sep 3, 2020

The notebook is easy to get running, kudos for that. However, the results I get do not match the cell outputs checked into the repository.

When I run it, the output of the "Training Loop" cell is:

[Learner] Critic Loss = 4.062 | Policy Loss = 0.500 | Steps = 1 | Walltime = 0
[Learner] Critic Loss = 3.844 | Policy Loss = 0.269 | Steps = 46 | Walltime = 3.173
[Learner] Critic Loss = 3.770 | Policy Loss = 0.296 | Steps = 92 | Walltime = 4.182

and the output of the "Evaluation" cell is:

[Evaluation] Episode Length = 1000 | Episode Return = 68.235 | Episodes = 1 | Steps = 1000 | Steps Per Second = 420.795
[Evaluation] Episode Length = 1000 | Episode Return = 73.514 | Episodes = 2 | Steps = 2000 | Steps Per Second = 448.120
[Evaluation] Episode Length = 1000 | Episode Return = 71.517 | Episodes = 3 | Steps = 3000 | Steps Per Second = 463.122
[Evaluation] Episode Length = 1000 | Episode Return = 74.285 | Episodes = 4 | Steps = 4000 | Steps Per Second = 464.442
[Evaluation] Episode Length = 1000 | Episode Return = 72.500 | Episodes = 5 | Steps = 5000 | Steps Per Second = 459.378

Is this expected?

diegolascasas self-assigned this Sep 16, 2020
@sergomezcol (Contributor) commented:
Thanks Paul! I'm not sure I understand your question, but let me clarify what we mean by the training and evaluation loops and what the numbers you're seeing mean.

In the training loop, the agent reads data batches from the RWRL dataset and performs D4PG learning steps on them. Every step corresponds to a batch of data.

In the evaluation loop, the agent is kept fixed and interacts with the RWRL environment for a few episodes. We report the episode return to estimate the agent's performance.

What you typically want to do in offline RL is to interleave training and evaluation, so you keep learning more and more from the data and evaluate periodically to estimate learning progress.
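For concreteness, here is a minimal sketch of that interleaving (not taken from the notebook), assuming the `learner`, `actor`, and `environment` objects constructed earlier in the colab; the names and step counts are placeholders:

```python
# Hedged sketch: interleave offline D4PG training with periodic evaluation.
# Assumes `learner` (steps on offline RWRL batches), `actor` (shares the
# policy variables with the learner), and `environment` (the RWRL task)
# have been built as in the notebook.
import acme

num_learner_steps = 10_000   # total gradient steps on the offline dataset
evaluate_every = 1_000       # pause for evaluation this often
num_eval_episodes = 5        # episodes per evaluation round

eval_loop = acme.EnvironmentLoop(environment, actor)

for step in range(num_learner_steps):
  learner.step()  # one D4PG update on a batch of offline data
  if (step + 1) % evaluate_every == 0:
    eval_loop.run(num_episodes=num_eval_episodes)  # logs return per episode
```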

I hope that helps!

@pmineiro (Author) commented:
> Thanks Paul! I'm not sure I understand your question,

My claim is: the notebook checked in at https://github.com/deepmind/deepmind-research/blob/master/rl_unplugged/rwrl_d4pg.ipynb has cell output that is quite different from what is produced when the notebook is downloaded and executed.

@sergomezcol (Contributor) commented:
Oh, that is expected. The weights of the neural network are initialized randomly and the data is also randomly shuffled, so the loss values will differ every time you run the notebook unless you fix the seed for the TF random number generator. Since the weights are different, the actions taken during evaluation will also differ, and the episode returns will change too.
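As a rough illustration (not part of the notebook), fixing the common seeds before building the networks and the dataset pipeline would look something like this; note that GPU nondeterminism and tf.data shuffling can still leave some run-to-run variation:

```python
# Hedged sketch: fix the main random seeds before constructing the networks
# and the dataset pipeline. This removes the dominant sources of variation,
# but full determinism is not guaranteed (e.g. some GPU ops are stochastic).
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary; publish whichever value you use

random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```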

@pmineiro (Author) commented:
Can you set and publish the seed(s) in the notebook?

I'm having trouble getting any episode return close to what is published in the notebook (was it a "lucky run"?).

@jerryli27 commented:
Unfortunately, releasing the seed for this specific notebook result would likely require more effort than it is worth. As Sergio pointed out, it is entirely possible to have a good or a bad run depending on the random seed, and the notebook result may indeed have been a lucky run. In our experiments we observed that D4PG runs tend to have high variance, making the method less robust. The purpose of this colab is to be a starting point and to show D4PG as a baseline -- neither a ~70 episode return nor a ~136 episode return is a high bar, and both should be easy to beat.

In the paper we average the results of three different runs, and I would suggest following a similar protocol so that "lucky runs" do not affect your experimental results.
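A sketch of that protocol, assuming the notebook's training and evaluation cells can be wrapped in a single function (the `run_experiment` helper below is hypothetical, not part of the released code):

```python
# Hedged sketch: report the mean and spread of returns over several seeds
# instead of a single run. `run_experiment(seed)` is a hypothetical wrapper
# around the notebook's training + evaluation cells that returns the mean
# episode return obtained with that seed.
import numpy as np

seeds = [0, 1, 2]
returns = [run_experiment(seed) for seed in seeds]

print(f'mean return = {np.mean(returns):.1f} '
      f'+/- {np.std(returns):.1f} over {len(seeds)} seeds')
```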

@pmineiro (Author) commented:
First, let me emphasize my appreciation for providing a benchmark to the community. The great thing about the notebook is that, once you reduce any policy (however obtained) to an acme.FeedForwardActor(), evaluation is straightforward. So I'm not actually blocked per se, and I have multiple tasks and difficulty levels to play with. However, I'm cautious because I can't reproduce the baselines in the publication: as far as I know, I have to read them off a figure in the reference paper, since there is no detailed data I could use to make a new plot with my results merged in.
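A minimal sketch of that evaluation pattern, assuming Acme's TF actors and the RWRL `environment` built earlier in the notebook; `my_policy_network` is a placeholder for whatever Sonnet module maps observations to actions, and exact module paths may differ across Acme versions:

```python
# Hedged sketch: evaluate an arbitrary policy by wrapping it in a
# FeedForwardActor and rolling it out in the RWRL environment.
# `my_policy_network` and `environment` are assumed to exist already.
import acme
from acme.agents.tf import actors

actor = actors.FeedForwardActor(policy_network=my_policy_network)
eval_loop = acme.EnvironmentLoop(environment, actor)
eval_loop.run(num_episodes=5)  # logs episode length / return per episode
```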

Furthermore:

> I would suggest following a similar protocol so that "lucky runs" do not affect your experimental results.

The point of a benchmark is reliable replication for the purpose of scientific comparison. So I would suggest that an even better benchmark would 1) actually reduce such proposed procedures to code, and 2) provide a notebook, coupled with the reference paper, that produces the baseline results and prescribes the comparison procedure.

In any event, thanks again, I'll close the issue now.
