Reproducing the scores reported by the IQN paper #37
hi, thank you for your support and reporting this! could you verify if you are running with that bug fix? we are currently re-running the baseline results for IQN on all games and will be releasing these once they are done (should be by next week). i'll add a note in the "What's New" section when this is done. |
Thank you. The plot I pasted above is before that commit. I'll try again with the current master branch. What about my first question on where these values come from?
|
see the line right before equation (4) in the paper (n=64): https://arxiv.org/pdf/1806.06923.pdf |
I mean |
ah yes, you're right. but look in the second-to-last paragraph of the same page, where they discuss varying N and N' in {1, 8, 32, 64}. Figure 2 compares the results. Although it suggests N' doesn't change much past 8, we decided to use the larger of these explored values. |
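(For anyone following along: here is how N and N' enter the loss in the IQN paper. N quantile fractions are sampled for the online estimate, N' for the target, and the quantile Huber loss is averaged over all N x N' pairs. Below is a minimal NumPy sketch of that loss; the names are illustrative and this is not Dopamine's code.)

```python
import numpy as np

def iqn_loss(online_quantiles, target_quantiles, taus, kappa=1.0):
    """Quantile Huber loss from the IQN paper, over N x N' sampled pairs.

    online_quantiles: shape (N,), Z_{tau_i}(x, a) from the online network.
    target_quantiles: shape (N',), r + gamma * Z_{tau'_j}(x', a') from the target network.
    taus: shape (N,), the fractions tau_i sampled for the online quantiles.
    """
    # Pairwise TD errors delta_ij, shape (N, N').
    deltas = target_quantiles[None, :] - online_quantiles[:, None]
    abs_deltas = np.abs(deltas)
    # Huber loss L_kappa(delta).
    huber = np.where(abs_deltas <= kappa,
                     0.5 * deltas ** 2,
                     kappa * (abs_deltas - 0.5 * kappa))
    # Quantile weighting |tau_i - 1{delta_ij < 0}|.
    weights = np.abs(taus[:, None] - (deltas < 0.0).astype(deltas.dtype))
    # Sum over the N online samples, average over the N' target samples.
    return np.sum(np.mean(weights * huber / kappa, axis=1))

# Illustrative call with N = N' = 64, the values discussed above.
rng = np.random.default_rng(0)
loss = iqn_loss(rng.normal(size=64), rng.normal(size=64), rng.uniform(size=64))
```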
I see. Thank you for clarifying it. |
Does dopamine follow the up-to-30-noop evaluation protocol used in the paper? I cannot find code that sends noop actions after reset. |
hi, |
Because sticky actions are disabled by the config file and up-to-30-noop is not implemented, I suspect that the current evaluation protocol of dopamine differs from the one used in the paper. |
BTW I sent Will Dabney an email asking for the values of N and N' about a month ago and still haven't received a reply. Does anyone know the values? |
Hi, thanks for the thorough look at IQN! Hopefully by now you received the answer, but N and N' should be as in the |
Also, out of curiosity -- did you figure out what was wrong? |
Thank you for the information!
I haven't figured it out. The only difference between dopamine's IQN and the paper's that I'm aware of is 30-noop, and I don't know how it affects the scores. I would really appreciate it if you could share any other differences you are aware of. |
I got a reply from Georg Ostrovski and confirmed that N=N'=64. He said the weight initialization was as below:
|
Dopamine uses 2D convolutions with SAME padding, whereas DeepMind's original networks use VALID padding. |
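(For context: the padding mode changes the size of the final convolutional feature map, and therefore the number of weights in the first fully connected layer. A quick sketch of the output-shape arithmetic for the standard 84x84 Atari network used by DQN-style agents; the helper below is illustrative, not library code.)

```python
import math

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution along one dimension."""
    if padding == "VALID":
        return math.floor((size - kernel) / stride) + 1
    if padding == "SAME":
        return math.ceil(size / stride)
    raise ValueError(padding)

# The Nature-DQN conv stack on 84x84 inputs:
# (kernel, stride) = (8, 4), (4, 2), (3, 1), with 64 filters in the last layer.
for padding in ("VALID", "SAME"):
    size = 84
    for kernel, stride in ((8, 4), (4, 2), (3, 1)):
        size = conv_out(size, kernel, stride, padding)
    print(padding, f"final feature map: {size}x{size}x64 = {size * size * 64} units")
```

With VALID padding the flattened feature map has 3,136 units (the familiar 7x7x64 from the DQN paper); with SAME padding it has 7,744, so the first dense layer ends up with more than twice as many weights.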
I tried VALID padding.
The results seem slightly worse than with the default SAME padding. |
Hi, Very interesting! So it seems like it makes a small but noticeable difference, possibly due to the way Adam handles step size adaptation. What do you think? |
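(For reference, the per-parameter step-size adaptation being referred to is the standard Adam update of Kingma & Ba, 2015. A minimal sketch with illustrative names:)

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015) for a single parameter array."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Effective per-parameter step size is lr / (sqrt(v_hat) + eps).
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Because the effective step depends on each parameter's gradient scale relative to eps, the same learning rate and epsilon can behave somewhat differently when the padding change alters the layer sizes and gradient magnitudes, which is presumably the kind of interaction being speculated about here.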
It would be interesting to average this over multiple runs and make sure it's a statistically significant difference. I can try running this after the ICML deadline, unless you beat me to it, Yasuhiro :).
|
FYI, I have pasted the plots of SAME vs VALID for the six games here, although they are all single runs. https://docs.google.com/document/d/1fsYzmNhfLvtPP4Cm-dbtp_MviH5WLo8qeiUJYMgRVio/edit?usp=sharing |
Could you elaborate on this? |
Hi, I am also trying to reproduce the scores of IQN, but according to @psc-g dopamine doesn't follow the 30-noop protocol. Does this mean that if I want to use C51, QR-DQN and IQN as baselines, I will have to redo all the experiments instead of using the scores reported in their papers? I don't have a lot of resources, so it is not likely that I can finish all of them before the NeurIPS deadline. So I am wondering if you @muupan have figured out the effect of 30-noop yet? Were there significant differences with and without 30-noop? |
@cathera I haven't checked differences between with and without 30-noop. |
The 30-noop protocol is a bit of a hack, and it doesn't have a big impact on performance. It was designed to discourage open-loop policies like The Brute (discussed in the Machado et al., 2018 paper). With sticky actions (same paper), the 30-noop protocol becomes less relevant. |
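(For anyone who wants to measure the effect themselves, here is a minimal sketch of the two evaluation tweaks being discussed, assuming a classic gym-style `reset()`/`step()` interface. This is illustrative only, not Dopamine's or DQN Zoo's code.)

```python
import random

NOOP_ACTION = 0  # ALE action index 0 is a no-op

def reset_with_random_noops(env, max_noops=30):
    """Up-to-30-no-op starts (Mnih et al., 2015): after reset, take a random
    number of no-op actions so episodes don't always begin from the same state."""
    obs = env.reset()
    for _ in range(random.randint(1, max_noops)):
        obs, _, done, _ = env.step(NOOP_ACTION)
        if done:
            obs = env.reset()
    return obs

class StickyActions:
    """Sticky actions (Machado et al., 2018): with probability p, the previously
    executed action is repeated instead of the requested one."""
    def __init__(self, env, p=0.25):
        self.env, self.p, self.last_action = env, p, NOOP_ACTION

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```

With sticky actions enabled the environment is already stochastic, so the initial no-ops matter much less, which is the point made above.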
@psc-g @mgbellemare So what is the status of this issue? The baseline scores of IQN, at least for BeamRider, still don't match the paper. I was just about to open a duplicate issue here, so let me paste the text that I have already written: I have been especially interested in reproducing the Atari scores of distributional algorithms (QR-DQN / IQN) from the original papers / DQN Zoo. I have selected one particular game where the difference between QR-DQN and IQN should be very obvious: BeamRider, see the following plot from dqn_zoo (see https://github.com/deepmind/dqn_zoo/blob/master/plot_atari_individual.svg). So far I have tried exactly replicating the environment from dqn_zoo by registering the custom gym environments (https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/gym_atari.py#L36), but I have not been successful in reproducing the IQN score (I can reach the exact same QR-DQN score with stable-baselines3's QRDQN and the same hyperparameters as in the paper). Also, from looking at your baseline plots here: https://google.github.io/dopamine/baselines/atari/plots.html, IQN is nowhere near the results in the paper / dqn_zoo's IQN for the BeamRider game. So I have a couple of questions:
Generally I have a hard time figuring out which scores I should rely on when benchmarking my own implementation (to verify its correctness). So far I have found dqn_zoo to be the most promising source, since its gym config is easy to replicate (compared to xitari from the original papers, which does not follow gym and is therefore hard to use with other implementations) and it seems to be closer to the paper than the OpenAI Atari configs (e.g. NoFrameskip-v4 versions paired with the common env wrappers, see https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py, modulo the FireReset wrapper, which DeepMind did not use according to openai/baselines#240). |
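(For concreteness, the "NoFrameskip-v4 plus common wrappers" setup mentioned above typically looks like the sketch below, using OpenAI Baselines' helpers. Whether this exactly matches any given paper's preprocessing is precisely the open question in this thread.)

```python
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# make_atari adds NoopResetEnv(noop_max=30) and MaxAndSkipEnv(skip=4).
env = make_atari('BeamRiderNoFrameskip-v4')
# wrap_deepmind adds terminal-on-life-loss, 84x84 grayscale frames, reward
# clipping and frame stacking; note that it also inserts FireResetEnv for
# games with a FIRE action, which is the wrapper questioned in openai/baselines#240.
env = wrap_deepmind(env, episode_life=True, clip_rewards=True,
                    frame_stack=True, scale=False)
```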
Never mind about the question asking you to point me to the source of the gym config; I figured it out:
|
Hi, Does DQN Zoo use sticky actions? My understanding is no. The 7000 is with sticky actions, so that might explain the difference. (Maybe it's time someone ran a side-by-side comparison.) I can't explain the IQN paper vs DQN Zoo difference, although I would take DQN Zoo as the authoritative source. Frame preprocessing (colour, cropping, etc.) can make differences that add up, unfortunately. If your agent is learning and achieving a score in the range of [Dopamine, DQN Zoo], I would assume you've mostly done things correctly. If you are trying to reproduce the algorithm perfectly, why not just use the DQN Zoo code? |
FWIW I've been told (but not verified) that DQN Zoo uses terminal on life loss and a larger eval epsilon, both of which would result in different performance than e.g. the Dopamine implementation. |
Thank you very much @mgbellemare for sharing your thoughts and giving some important hints.
From what I understand the I compared the
It would be really great to have your feedback here. |
Thank you for open-sourcing such great code!
I have some questions about your IQN implementation, especially about whether it can reproduce the scores reported by the paper.
First, your config file https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin specifies N=N'=64. How did you choose these values?
Second, can the IQN implementation reproduce the scores reported by the paper? I ran it myself on six games, but the results do not match the paper.
I used this command:
Here is the tensorboard plot I got:
Looking at Figure 7 of the IQN paper, they report raw scores of 342,016 for Asterix, 42,776 for Beam Rider, 734 for Breakout, 25,750 for Q*Bert, 30,140 for Seaquest, and 28,888 for Space Invaders. Have you successfully reproduced scores at this level? If yes, how? If not, are you aware of any differences in implementation or settings compared to DeepMind's?