
Cannot reproduce Breakout benchmark using Double DQN #176

gbg141 opened this issue Oct 20, 2017 · 15 comments · May be fixed by #255

@gbg141 commented Oct 20, 2017

I haven't been able to reproduce the results of the Breakout benchmark with Double DQN when using hyperparameter values similar to those presented in the original paper. After more than 20M observed frames (~100,000 episodes), the mean 100-episode reward still hovers around 10, with a maximum of 12.

Here is the neural network configuration and the hyperparameter values I'm using, in case I'm missing or getting something important wrong:

```python
env = gym.make("BreakoutNoFrameskip-v4")
env = ScaledFloatFrame(wrap_dqn(env))
model = deepq.models.cnn_to_mlp(
    convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
    hiddens=[256],  # fully connected head; completed here, assumed standard DQN config
)
act = deepq.learn(
    env,
    q_func=model,  # env/q_func arguments restored to make the call well-formed
    buffer_size=100000,  # cannot store 1M frames as the paper suggests
    exploration_fraction=1000000 / float(200000000),  # so as to finish after 1M steps
)
```

Does anyone have an idea of what is going wrong? The analogous results shown in a jupyter notebook in openai/baselines-results indicate that I should be getting much better scores.

Thanks in advance.

@asimmunawar commented Nov 14, 2017

Same here, I also get similar average rewards. I also ran the deepq/experiments/ example without any modifications, and it still just converges to ~11 in around 5 million steps. Any help or suggestions would be appreciated.

@candytalking commented Nov 24, 2017

I observe the same problem when training with the "learn" function in the "" file, which is the path taken when using "". Training with "deepq/experiments/atari/" instead works fine.

@BNSneha commented Nov 25, 2017

```
File "", line 244, in
    start_time, start_steps = time.time(), info['steps']
KeyError: 'steps'
```

How do I get rid of this error when trying to run atari/?

@btaba commented Jan 15, 2018

I have been running in baselines/deepq/experiments/atari/ with the following command: `python --env BeamRider --save-dir 'savedir-dueling' --dueling --prioritized`, and I also cannot reproduce the BeamRider results from the jupyter notebook (although it seems that the script was used to create those benchmarks). I had to make minor corrections to get the script running, due to the issues referenced in the comment directly above and in this ticket. I'm effectively running a lightly modified version.

@ichaelm ichaelm linked a pull request that will close this issue Jan 17, 2018
@btaba commented Jan 25, 2018

@gbg141 Part of your issue might be that the rewards from the environment wrapped with wrap_deepmind are by default clipped to {-1, 0, 1} using np.sign, so the rewards reported in deepq/experiments/atari/ are clipped. If you turn reward clipping off for reporting, and explicitly save the clipped reward in the replay buffer for training, that might work for you.
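A minimal sketch of the clipping behavior described above, keeping the unclipped reward around for logging (the wrapper shape here is illustrative, not baselines' actual class; a real implementation would subclass gym.RewardWrapper):

```python
import numpy as np


class ClipRewardWrapper:
    """Clip rewards to {-1, 0, +1} with np.sign, as wrap_deepmind does by default.

    Sketch only: stores the raw reward in `info` so episode scores can be
    logged unclipped while training still sees clipped rewards.
    """

    def __init__(self, env):
        self.env = env

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["raw_reward"] = reward  # keep the unclipped reward for logging
        return obs, float(np.sign(reward)), done, info
```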

@kdu4108 commented Feb 3, 2018

@btaba Did you try this/did it work for you?

@btaba commented Feb 3, 2018

@kdu4108 that actually didn't work for me. I also tried a `git reset --hard 1f3c3e33e7891cb3` and wasn't able to reproduce the results in this notebook for Breakout.

Edit: trained on commit 1f3c3e33e7891cb3 using `python --env Breakout --target-update-freq 10000 --learning-freq 4 --prioritized --dueling` for 50M frames, and I am only able to reach a reward of ~250, as opposed to ~400.

@kdu4108 commented Feb 9, 2018

@btaba Okay thanks for the response. I tried training the default Pong using that version and successfully reproduced their results. Out of curiosity, have you tried to reproduce results on any other environments using that commit? Or, have you tested any later commits that might hold fixes for the Breakout reward difference?

@btaba commented Feb 12, 2018

@kdu4108 I only tried that commit on Breakout and BeamRider, and was not able to reproduce results.

@AshishMehtaIO commented Feb 20, 2018

I'm facing the same issue.
The only major difference between the DQN paper and the baselines implementation is the optimizer (RMSProp vs. Adam). Is there a major difference when using one or the other?
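For concreteness, the two update rules side by side in plain NumPy. Hyperparameters follow the respective papers (note the DQN paper actually used a centered RMSProp variant, and baselines' deepq defaults to Adam with lr=1e-4), so treat this as a sketch rather than either codebase's exact optimizer:

```python
import numpy as np


def rmsprop_step(w, g, sq, lr=0.00025, decay=0.95, eps=0.01):
    # RMSProp with the DQN paper's reported hyperparameters:
    # running average of squared gradients, then scale the step by its root.
    sq = decay * sq + (1.0 - decay) * g ** 2
    return w - lr * g / np.sqrt(sq + eps), sq


def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates.
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

The practical difference early in training is Adam's bias correction, which makes its effective step size much less sensitive to gradient scale than RMSProp's.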

@btaba commented Feb 20, 2018

@ashishm-io You can try. Don't forget to log the actual episode rewards and not the clipped ones.

I find this DQN implementation to actually work. It's probably easier from there to add double-q and dueling networks.

@AshishMehtaIO commented Mar 13, 2018

Shouldn't Baselines log both the clipped and the unclipped episode rewards by default? Isn't that essential for comparing results with other implementations?

@benbotto commented May 14, 2018

@ashishm-io Another difference is the size of the replay buffer. You might try bumping that to 1e6, because by default it's only 1e4. Note that the ScaledFloatFrame wrapper is used, so observations are stored as 32-bit floats rather than 8-bit ints. In other words, you'll need a ton of memory!
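Back-of-the-envelope numbers for that, assuming 84x84x4 stacked preprocessed frames and a 1e6-capacity buffer (and ignoring that a buffer typically also stores next-state observations):

```python
import numpy as np

frame_shape = (84, 84, 4)  # stacked, preprocessed Atari frames
capacity = int(1e6)
pixels = int(np.prod(frame_shape))

bytes_uint8 = capacity * pixels * np.dtype(np.uint8).itemsize
bytes_float32 = capacity * pixels * np.dtype(np.float32).itemsize

print(f"uint8 buffer:   {bytes_uint8 / 2**30:.1f} GiB")    # ~26.3 GiB
print(f"float32 buffer: {bytes_float32 / 2**30:.1f} GiB")  # ~105.1 GiB
```

This is why storing observations as uint8 and converting to float only at batch-sampling time is the usual trick: scaling first, as ScaledFloatFrame does, quadruples the buffer's memory footprint.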

@kdu4108 Yea, but Pong is the simplest of the Atari games as far as I know. In my implementation I achieve an average of over 20 in about 3 million frames. Breakout is significantly harder.

@btaba When you achieved the 250 average, that's the actual score, right? As opposed to the clipped score? And also, is that with or without episodic life? In other words, is that an average of 250 in one life, or in 5 lives?

OpenAI team: how do we reproduce what's reported in the baselines-results repository ( It shows average scores of 400+; however, it references files that no longer exist, like . I'm using the script with dueling off but otherwise default settings, and I'm getting an average of just over 18 after 10M frames (the default). I'm trying to implement DQN, but most of the code I find online has subtle bugs. It's important to have a reference out there with reproducible results!

@ppwwyyxx commented May 14, 2018

@benbotto The implementation I open sourced two years ago ( can reproduce a 400+ average score on Breakout within 10 hours on one GTX 1080 Ti.

@benbotto commented May 15, 2018

Thank you @ppwwyyxx, I'll definitely run your implementation and compare the results against my own. I'm able to reproduce the 400 score in my code with vanilla DQN as well, but I'm running into trouble with Prioritized Experience Replay. This is the only implementation I know of that uses PER and takes the importance-sampling weights into account; most forgo that last part. I've found this implementation, which does not correctly normalize the weights, and this one, which ignores the IS weights altogether. The Baselines implementation looks right to me, aside from a minor off-by-one bug that's awaiting a pull request. That said, it would be nice to be able to reliably reproduce the numbers reported in the baselines-results repository!
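The importance-sampling correction being discussed, in isolation: with sampling probabilities P(i) proportional to p_i^alpha, each sampled transition's loss gets weight w_i = (N * P(i))^(-beta), usually normalized by the maximum weight for stability. A minimal sketch over a whole priority array (names are illustrative, not baselines' API, which computes this over a segment tree):

```python
import numpy as np


def is_weights(priorities, beta, alpha=0.6):
    """Importance-sampling weights for prioritized replay (Schaul et al. 2015)."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()          # P(i) proportional to p_i^alpha
    n = len(priorities)
    w = (n * probs) ** (-beta)   # correct for non-uniform sampling
    return w / w.max()           # normalize so the largest weight is 1
```

With uniform priorities every weight is exactly 1 (no correction needed); skipping the `w.max()` normalization is the kind of bug mentioned above, since unnormalized weights rescale the effective learning rate.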


9 participants