rl_unplugged/rwrl_d4pg.ipynb does not reproduce #74

Closed
pmineiro opened this issue Sep 3, 2020 · 6 comments
pmineiro commented Sep 3, 2020

The notebook is easy to get running, kudos for that. However, the results I get do not match the cell outputs checked into the repository.

When I run it, the output of the "Training Loop" cell is:

[Learner] Critic Loss = 4.062 | Policy Loss = 0.500 | Steps = 1 | Walltime = 0
[Learner] Critic Loss = 3.844 | Policy Loss = 0.269 | Steps = 46 | Walltime = 3.173
[Learner] Critic Loss = 3.770 | Policy Loss = 0.296 | Steps = 92 | Walltime = 4.182

and the output of the "Evaluation" cell is:

[Evaluation] Episode Length = 1000 | Episode Return = 68.235 | Episodes = 1 | Steps = 1000 | Steps Per Second = 420.795
[Evaluation] Episode Length = 1000 | Episode Return = 73.514 | Episodes = 2 | Steps = 2000 | Steps Per Second = 448.120
[Evaluation] Episode Length = 1000 | Episode Return = 71.517 | Episodes = 3 | Steps = 3000 | Steps Per Second = 463.122
[Evaluation] Episode Length = 1000 | Episode Return = 74.285 | Episodes = 4 | Steps = 4000 | Steps Per Second = 464.442
[Evaluation] Episode Length = 1000 | Episode Return = 72.500 | Episodes = 5 | Steps = 5000 | Steps Per Second = 459.378

Is this expected?

diegolascasas self-assigned this Sep 16, 2020
@sergomezcol (Contributor) commented:
Thanks Paul! I'm not sure I understand your question, but let me clarify what we mean by the training and evaluation loops and what the numbers you're seeing mean.

In the training loop, the agent reads data batches from the RWRL dataset and performs D4PG learning steps on them. Every step corresponds to a batch of data.

In the evaluation loop, the agent is kept fixed and interacts with the RWRL environment for a few episodes. We report the episode return to estimate the agent's performance.

What you typically want to do in offline RL is to interleave training and evaluation, so you keep learning more and more from the data and evaluate periodically to estimate learning progress.
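For concreteness, here is a minimal sketch of that interleaving (not taken from the notebook), assuming the `learner`, `actor`, and `environment` objects constructed earlier in the colab; the names and step counts are placeholders:

```python
# Hedged sketch: interleave offline D4PG training with periodic evaluation.
# Assumes `learner` (steps on offline RWRL batches), `actor` (shares the
# policy variables with the learner), and `environment` (the RWRL task)
# have been built as in the notebook.
import acme

num_learner_steps = 10_000   # total gradient steps on the offline dataset
evaluate_every = 1_000       # pause for evaluation this often
num_eval_episodes = 5        # episodes per evaluation round

eval_loop = acme.EnvironmentLoop(environment, actor)

for step in range(num_learner_steps):
  learner.step()  # one D4PG update on a batch of offline data
  if (step + 1) % evaluate_every == 0:
    eval_loop.run(num_episodes=num_eval_episodes)  # logs return per episode
```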

I hope that helps!

@pmineiro (Author) commented:
> Thanks Paul! I'm not sure I understand your question,

My claim is: the notebook checked in at https://github.com/deepmind/deepmind-research/blob/master/rl_unplugged/rwrl_d4pg.ipynb has cell output that is quite different from what is produced when the notebook is downloaded and executed.

@sergomezcol (Contributor) commented:
Oh, that is expected. The weights of the neural network are initialized randomly and the data is also randomly shuffled, so the loss values will differ every time you run the notebook unless you fix the seed for the TF random number generator. Since the weights are different, the actions taken during evaluation will also differ, and the episode returns will change too.
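As a rough illustration (not part of the notebook), fixing the common seeds before building the networks and the dataset pipeline would look something like this; note that GPU nondeterminism and tf.data shuffling can still leave some run-to-run variation:

```python
# Hedged sketch: fix the main random seeds before constructing the networks
# and the dataset pipeline. This removes the dominant sources of variation,
# but full determinism is not guaranteed (e.g. some GPU ops are stochastic).
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary; publish whichever value you use

random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```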

@pmineiro (Author) commented:
Can you set and publish the seed(s) in the notebook?

I'm having trouble getting any episode return close to what is published in the notebook (was it a "lucky run"?).

@jerryli27 commented:
Unfortunately, releasing the seed for this specific notebook result would likely require more effort than it is worth. As Sergio pointed out, it is entirely possible to have a good or a bad run depending on the random seed, and the notebook result may indeed have been a lucky run. In our experiments we observed that D4PG runs tend to have high variance, making the method less robust. The purpose of this colab is to be a starting point and to show D4PG as a baseline -- neither a ~70 episode return nor a ~136 episode return is a high bar, and both should be easy to beat.

In the paper we average the results of three different runs, and I would suggest following a similar protocol so that "lucky runs" do not affect your experimental results.
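A sketch of that protocol, assuming the notebook's training and evaluation cells can be wrapped in a single function (the `run_experiment` helper below is hypothetical, not part of the released code):

```python
# Hedged sketch: report the mean and spread of returns over several seeds
# instead of a single run. `run_experiment(seed)` is a hypothetical wrapper
# around the notebook's training + evaluation cells that returns the mean
# episode return obtained with that seed.
import numpy as np

seeds = [0, 1, 2]
returns = [run_experiment(seed) for seed in seeds]

print(f'mean return = {np.mean(returns):.1f} '
      f'+/- {np.std(returns):.1f} over {len(seeds)} seeds')
```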

@pmineiro (Author) commented:
First, let me emphasize my appreciation for providing a benchmark to the community. The great thing about the notebook is that, once you reduce any policy (however obtained) to an acme.FeedForwardActor(), evaluation is straightforward. So I'm not actually blocked per se, and I have multiple tasks and difficulty levels to play with. However, I'm cautious because I can't reproduce the baselines in the publication: as far as I know, I have to read them off a figure in the reference paper, since there is no detailed data I could use to make a new plot with my results merged in.
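A minimal sketch of that evaluation pattern, assuming Acme's TF actors and the RWRL `environment` built earlier in the notebook; `my_policy_network` is a placeholder for whatever Sonnet module maps observations to actions, and exact module paths may differ across Acme versions:

```python
# Hedged sketch: evaluate an arbitrary policy by wrapping it in a
# FeedForwardActor and rolling it out in the RWRL environment.
# `my_policy_network` and `environment` are assumed to exist already.
import acme
from acme.agents.tf import actors

actor = actors.FeedForwardActor(policy_network=my_policy_network)
eval_loop = acme.EnvironmentLoop(environment, actor)
eval_loop.run(num_episodes=5)  # logs episode length / return per episode
```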

Furthermore:

> I would suggest following a similar protocol so that "lucky runs" do not affect your experimental results.

The point of a benchmark is reliable replication for the purpose of scientific comparison. So I would suggest that an even better benchmark would 1) actually reduce such proposed procedures to code, and 2) provide a notebook, coupled with the reference paper, that produces the baseline results and prescribes the comparison procedure.

In any event, thanks again, I'll close the issue now.
