Reproducing the scores reported by the IQN paper #37

Open
muupan opened this issue Oct 11, 2018 · 29 comments

@muupan

muupan commented Oct 11, 2018

Thank you for open-sourcing such great code!

I have some questions about your IQN implementation, especially about whether it can reproduce the scores reported by the paper.

First, your config file https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin specifies N=N'=64. How did you choose these values?

Second, can the IQN implementation reproduce the scores reported by the paper? I ran it myself on six games, but the results do not match the paper.

I used this command:

python3 -um dopamine.atari.train '--agent_name=implicit_quantile' '--base_dir=results' '--gin_files=dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin' '--gin_bindings=AtariPreprocessing.terminal_on_life_loss=True' "--gin_bindings=Runner.game_name='Breakout'"

Here is the tensorboard plot I got:
[tensorboard screenshot]

Looking at Figure 7 of the IQN paper, they report raw scores of 342,016 for Asterix, 42,776 for Beam Rider, 734 for Breakout, 25,750 for Q*Bert, 30,140 for Seaquest, and 28,888 for Space Invaders. Have you successfully reproduced scores at the same level? If yes, how? If not, are you aware of any differences in implementation or settings from DeepMind's?

@psc-g
Collaborator

psc-g commented Oct 11, 2018

hi, thank you for your support and for reporting this!
it turns out there was a very subtle bug in the IQN implementation that was recently fixed here: 2de70a4
with that bug fix, we were able to reproduce the published results.

could you verify if you are running with that bug fix?

we are currently re-running the baseline results for IQN on all games and will be releasing these once they are done (should be by next week). i'll add a note in the "What's New" section when this is done.

@muupan
Author

muupan commented Oct 11, 2018

Thank you. The plot I pasted above is from before that commit. I'll try again with the current master branch.

What about my first question on where these values come from?

ImplicitQuantileAgent.num_tau_samples = 64
ImplicitQuantileAgent.num_tau_prime_samples = 64

@psc-g
Collaborator

psc-g commented Oct 11, 2018

see the line right before equation (4) in the paper (n=64): https://arxiv.org/pdf/1806.06923.pdf

@muupan
Author

muupan commented Oct 11, 2018

I mean N and N' in (3), not n in (4). ImplicitQuantileAgent.num_tau_samples and ImplicitQuantileAgent.num_tau_prime_samples correspond to N and N', respectively, correct?
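For reference, my reading of equation (3) is that the sample-based loss averages the quantile Huber loss over N outer and N' inner quantile samples, roughly (my own transcription, so it may differ slightly from the paper's notation):

\mathcal{L}(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^{\kappa}_{\tau_i}\left(\delta^{\tau_i,\tau'_j}_t\right),
\quad \text{with} \quad
\delta^{\tau_i,\tau'_j}_t = r_t + \gamma Z_{\tau'_j}(x_{t+1}, \pi_\beta(x_{t+1})) - Z_{\tau_i}(x_t, a_t)

so I would expect num_tau_samples and num_tau_prime_samples to map to N and N' here.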

@psc-g
Collaborator

psc-g commented Oct 11, 2018

ah yes, you're right. but look at the second-to-last paragraph of the same page, where they discuss varying N and N' over {1, 8, 32, 64}; Figure 2 compares the results. although it suggests performance doesn't change much once N' goes past 8, we decided to use the largest of the explored values.

@muupan
Author

muupan commented Oct 11, 2018

I see. Thank you for clarifying it.

@muupan
Author

muupan commented Oct 15, 2018

Does dopamine follow the up-to-30-noop evaluation protocol used in the paper? I cannot find code that sends noop actions after reset.
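To make sure we are talking about the same thing, by up-to-30-noop I mean something like this (a minimal sketch assuming a Gym-style env; the names are just for illustration, this is not Dopamine code):

import numpy as np

NOOP_ACTION = 0  # assuming action index 0 is the no-op, as in ALE

def reset_with_random_noops(env, max_noops=30, rng=np.random):
  # Reset the environment, then take a uniformly random number (1..max_noops)
  # of no-op steps before handing control to the agent.
  observation = env.reset()
  for _ in range(rng.randint(1, max_noops + 1)):
    observation, _, done, _ = env.step(NOOP_ACTION)
    if done:
      observation = env.reset()
  return observation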

@psc-g
Collaborator

psc-g commented Oct 15, 2018

hi,
we don't follow the up-to-30-noop evaluation protocol in dopamine. we have chosen to follow the recommendations in Machado et al., 2018 (https://arxiv.org/abs/1709.06009), which do not include that protocol.

@muupan
Author

muupan commented Oct 16, 2018

Because sticky actions are disabled by the config file and up-to-30-noop is not implemented, I suspect that the current evaluation protocol of implicit_quantile_icml.gin is more deterministic and thus easier than that of the IQN paper. Have you compared the scores of dopamine with and without up-to-30-noop?

@muupan
Author

muupan commented Oct 22, 2018

I ran implicit_quantile_icml.gin again with the fix. The results are much better now, but on Breakout and Seaquest they haven't reached the paper's scores yet. Any ideas?

[three tensorboard screenshots]

@muupan
Author

muupan commented Nov 3, 2018

BTW I sent Will Dabney an email about a month ago asking for the values of N and N', but I still don't have a reply. Does anyone know the values?

@mgbellemare
Collaborator

Hi, thanks for the thorough look at IQN! Hopefully by now you received the answer, but N and N' should be as in the implicit_quantile_icml.gin file.

@mgbellemare
Collaborator

Also, out of curiosity -- did you figure out what was wrong?

@muupan
Author

muupan commented Nov 14, 2018

N and N' should be as in the implicit_quantile_icml.gin file.

Thank you for the information!

Also, out of curiosity -- did you figure out what was wrong?

I haven't figured it out. The only difference between Dopamine's IQN and the paper's that I'm aware of is the 30-noop protocol, but I don't know how it affects scores. I would really appreciate it if you could share any other differences you are aware of.

@muupan
Author

muupan commented Nov 15, 2018

I got a reply from Georg Ostrovski and confirmed that N = N' = 64. He said the weight initialization was as follows (a rough code sketch is below):

  • for all linear layers, all weights are drawn uniformly from [-z, z], with z = 1 / sqrt(num_inputs)
  • for all conv layers, all weights are drawn uniformly from [-z, z], with z = 1 / sqrt(num_channels * filter_size^2)
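In TF terms, I believe this amounts to something like the following (my own sketch, not DeepMind's code; fan_in here means num_inputs for linear layers and num_channels * filter_size^2 for conv layers):

import math
import tensorflow as tf

def uniform_fan_in_initializer(fan_in):
  # Uniform on [-z, z] with z = 1 / sqrt(fan_in), as described above.
  z = 1.0 / math.sqrt(fan_in)
  return tf.random_uniform_initializer(minval=-z, maxval=z)

# Hypothetical example: first conv layer with 4 input channels and 8x8 filters,
# so fan_in = 4 * 8 * 8 = 256.
conv1_initializer = uniform_fan_in_initializer(4 * 8 * 8)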

@muupan
Author

muupan commented Dec 27, 2018

Dopamine uses 2D convolutions with padding=SAME, which makes the number of activations after the three convolutions 11*11*64 = 7744, but it should be padding=VALID, giving 7*7*64 = 3136 (confirmed by Georg Ostrovski).
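For reference, the spatial sizes with an 84x84 input and the usual DQN conv stack work out like this (my own arithmetic sketch, not Dopamine code):

import math

def conv_output_size(size, filter_size, stride, padding):
  # Output size of a 2D convolution along one spatial dimension.
  if padding == 'SAME':
    return math.ceil(size / stride)
  return (size - filter_size) // stride + 1  # 'VALID'

for padding in ('SAME', 'VALID'):
  s = 84  # preprocessed Atari frames are 84x84
  for filter_size, stride in ((8, 4), (4, 2), (3, 1)):
    s = conv_output_size(s, filter_size, stride, padding)
  print(padding, s, s * s * 64)
# SAME  -> 11, so 11*11*64 = 7744 activations
# VALID ->  7, so  7*7*64  = 3136 activations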

@muupan
Author

muupan commented Jan 7, 2019

I tried padding=VALID for the same set of games by changing these lines:

--- a/dopamine/agents/implicit_quantile/implicit_quantile_agent.py
+++ b/dopamine/agents/implicit_quantile/implicit_quantile_agent.py
@@ -121,13 +121,16 @@ class ImplicitQuantileAgent(rainbow_agent.RainbowAgent):
     state_net = tf.div(state_net, 255.)
     state_net = slim.conv2d(
         state_net, 32, [8, 8], stride=4,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.conv2d(
         state_net, 64, [4, 4], stride=2,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.conv2d(
         state_net, 64, [3, 3], stride=1,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.flatten(state_net)
     state_net_size = state_net.get_shape().as_list()[-1]
     state_net_tiled = tf.tile(state_net, [num_quantiles, 1])

The results seem slightly worse than with padding=SAME. Note that the score reported by the paper is 42,776 for Beam Rider.

[tensorboard screenshot]

@mgbellemare
Collaborator

Hi,

Very interesting! So it seems like it makes a small but noticeable difference, possibly due to the way Adam handles step-size adaptation. What do you think?

@psc-g
Collaborator

psc-g commented Jan 8, 2019 via email

@muupan
Author

muupan commented Jan 8, 2019

FYI, I have pasted the plots of SAME vs VALID for the six games here, although they are all single runs. https://docs.google.com/document/d/1fsYzmNhfLvtPP4Cm-dbtp_MviH5WLo8qeiUJYMgRVio/edit?usp=sharing

@muupan
Author

muupan commented Jan 8, 2019

possibly due to the way Adam handles step-size adaptation

Could you elaborate on this?

@cathera

cathera commented May 5, 2019

Hi, I am also trying to reproduce the scores of IQN, but according to @psc-g, dopamine doesn't follow the 30-noop protocol. Does this mean that if I want to use C51, QR-DQN, and IQN as baselines, I will have to redo all the experiments instead of using the scores reported in their papers?

I don't have a lot of resources, so it is not likely that I can finish all of them before the NeurIPS deadline. So I am wondering, @muupan, have you figured out the effect of 30-noop yet? Were there significant differences with and without it?

@muupan
Author

muupan commented May 6, 2019

@cathera I haven't checked differences between with and without 30-noop.

@mgbellemare
Collaborator

The 30 no-ops are a bit of a hack, and they don't have a big impact on performance. They were designed to discourage open-loop policies like The Brute (discussed in the Machado et al., 2018 paper). With sticky actions (same paper), the 30 no-ops become less relevant.

@hh0rva1h

@psc-g @mgbellemare So what is the status of this issue? The baseline scores of IQN, at least for BeamRider, still don't match the paper. I was just about to open a duplicate issue, so let me paste the text I had already written:

I have been especially interested in reproducing the Atari scores of distributional algorithms (QR-DQN / IQN) from the original papers / dqn_zoo. I have selected one particular game where the difference between QR-DQN and IQN should be very obvious: BeamRider; see the following plot from dqn_zoo (see https://github.com/deepmind/dqn_zoo/blob/master/plot_atari_individual.svg):
[plot from dqn_zoo]
where the green line corresponds to IQN, the red one to QR-DQN, and the gray one to Rainbow.

So far I have tried replicating the environment from dqn_zoo exactly by registering the custom gym environments (https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/gym_atari.py#L36), but I have not been successful in reproducing the IQN score (I can reach the exact same QR-DQN score with stable_baselines3's QRDQN and the same hyperparameters as in the paper). Also, from looking at your baseline plots here: https://google.github.io/dopamine/baselines/atari/plots.html, IQN is nowhere near the results of the paper / dqn_zoo's IQN on the BeamRider game.

So I have a couple of questions:

  1. Why is there such a large discrepancy between the IQN scores of dopamine and dqn_zoo on the BeamRider game? I failed to figure out which gym config you use for your benchmark; it would be highly appreciated if you could point me to the source.
  2. Why are there such large discrepancies between the papers and the code repos in general? The IQN paper reports a score of 42,776 for BeamRider, dqn_zoo's implementation gets between 20k and 30k but nowhere near over 40k, and dopamine's IQN settles around 7k. According to both the IQN paper and the QR-DQN paper, BeamRider should be over 30k for QR-DQN, but in dqn_zoo QR-DQN gets stuck at around 10k (which I could confirm with different QR-DQN implementations).

Generally, I have a hard time figuring out which scores I should rely on when benchmarking my own implementation (to verify its correctness). So far I have found dqn_zoo to be the most promising source, since its gym config is easy to replicate (compared to xitari from the original papers, which does not follow gym and is therefore hard to use with other implementations) and seems to be closer to the paper than the openai atari configs (e.g. NoFrameskip-v4 versions paired with the common env wrappers, see https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py, modulo the FireReset wrapper, which DeepMind did not use according to openai/baselines#240).

@hh0rva1h

Never mind the question about pointing me to the source of the gym config; I figured it out:

def create_atari_environment(game_name=None, sticky_actions=True):
So it seems that, contrary to dqn_zoo, you prefer the NoFrameskip-v4 version of the respective game (despite sticky actions being the default in the lib, you explicitly disable them in the gin files to match the paper). Still, I'd like to be able to reproduce the paper scores; help would be highly appreciated here.
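For anyone else looking, the mapping as I understand it is roughly the following (module path and version suffixes are my assumptions, please correct me if wrong):

from dopamine.discrete_domains import atari_lib

# sticky_actions=True  -> '<Game>NoFrameskip-v0' (repeat_action_probability = 0.25)
# sticky_actions=False -> '<Game>NoFrameskip-v4' (fully deterministic ALE)
env_sticky = atari_lib.create_atari_environment('BeamRider', sticky_actions=True)
env_deterministic = atari_lib.create_atari_environment('BeamRider', sticky_actions=False)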

@mgbellemare
Collaborator

Hi,

Does DQN Zoo use sticky actions? My understanding is no. Dopamine's ~7,000 is with sticky actions, so that might explain the difference. (Maybe it's time someone ran a side-by-side comparison.)

I can't explain the IQN paper vs. DQN Zoo difference, although I would take DQN Zoo as the authoritative source.

Frame preprocessing (colour, cropping, etc.) can make differences that add up, unfortunately. If your agent is learning and achieving a score in the range of [Dopamine, DQN Zoo] I would assume you've mostly done things correctly. If you are trying to reproduce the algorithm perfectly - why not just use the DQN Zoo code?

@mgbellemare
Collaborator

FWIW I've been told (but not verified) that DQN Zoo uses terminal on life loss and a larger eval epsilon, both of which would result in different performance than e.g. the Dopamine implementation.

@hh0rva1h

Thank you very much @mgbellemare for sharing your thoughts and giving some important hints.
Dopamine is indeed using a different config from the respective papers for the baselines plots (in the case of IQN, https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile.gin instead of https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin), which differs in the following ways:

  1. some hyper-parameters for the agent are changed to be comparable to the Rainbow paper
  2. the sticky-actions variant of the environment is used
  3. there is no terminal-on-life-loss condition

From what I understand, the _icml.gin config is supposed to replicate the setup of the paper (and it is the config the OP used above), with the only difference being the 30-noop protocol, which should not make much difference according to #37 (comment).

I compared the _icml.gin config to the respective config of dqn_zoo (https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/iqn/run_atari.py#L47, see lines 47 to 79) and they seem to be mostly identical; however, I noticed two differences:

  1. dqn_zoo has exploration_epsilon_decay_frame_fraction=0.02, which works out to 1 million frames, i.e. 250k agent steps, while implicit_quantile_icml.gin says RainbowAgent.epsilon_decay_period = 1000000 # agent steps. The papers talk about a decay over 1 million frames, so shouldn't the gin file actually use RainbowAgent.epsilon_decay_period = 250000? (The same applies to dqn_nature.gin, where this value also does not match the publication.)
  2. dqn_zoo says flags.DEFINE_integer('tau_samples_policy', 64, '') while dopamine uses ImplicitQuantileAgent.num_quantile_samples = 32, but from what I understand, this time dopamine is in accordance with the paper and dqn_zoo is not.

It would be really great to have your feedback here.
