Testing should be not deterministic #12
Comments
I set off a run on Space Invaders last night - it's one where Rainbow is clearly better than alternatives, but it'll take a few days for it to get to the point where I can tell if that's the case or not. Out of the previous runs I've made, making sure that transitions next to the buffer aren't sampled seemed like an important fix, but I've never run anything for that long. You can have a look at the training curves in the paper to see if any other game might be useful to look at. Non-deterministic evaluation does sound good, but I'm wondering why the random no-ops in the environment wouldn't provide a "stochastic" environment. It could well be that it's just not providing enough stochasticity. Also, not sure if sampling via Q-values or simply taking a new draw of weights via the NoisyLinear layers is the better way to go?
Ok. On my side I will launch a training run on Breakout as a quick sanity check. Concerning sampling via Q-values versus just taking new weights for the Noisy layers, I really don't know; we should maybe try both and compare. Q-value sampling may lead to far too much exploration, but on the other hand, in the late stage of training the agent may have learned to ignore all incoming noise from the Noisy layers.
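To make the "new draw of weights" option concrete, here is a minimal pure-Python sketch of a noisy linear layer with factorised Gaussian noise. The class `ToyNoisyLinear`, its `reset_noise` method and the `noisy` flag are illustrative names chosen for this sketch, not necessarily the repository's API, and the bias term is left out for brevity. Redrawing the noise before each evaluation episode changes the effective weights, so the Q-values (and possibly the greedy action) can differ between episodes.

```python
import math
import random


def _scaled_noise(n, rng):
    """Draw n Gaussian samples and apply f(x) = sign(x) * sqrt(|x|)."""
    draws = (rng.gauss(0.0, 1.0) for _ in range(n))
    return [math.copysign(math.sqrt(abs(x)), x) for x in draws]


class ToyNoisyLinear:
    """Hypothetical bias-free noisy linear layer (illustration only)."""

    def __init__(self, in_features, out_features, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        self.in_features = in_features
        self.out_features = out_features
        bound = 1.0 / math.sqrt(in_features)
        # Learnable mean of each weight (randomly initialised here).
        self.weight_mu = [
            [self.rng.uniform(-bound, bound) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        # Learnable noise scale of each weight (constant initial value).
        self.weight_sigma = [
            [sigma_init / math.sqrt(in_features)] * in_features
            for _ in range(out_features)
        ]
        self.reset_noise()

    def reset_noise(self):
        """Take a fresh draw of factorised noise for the weights."""
        eps_in = _scaled_noise(self.in_features, self.rng)
        eps_out = _scaled_noise(self.out_features, self.rng)
        self.weight_epsilon = [[eo * ei for ei in eps_in] for eo in eps_out]

    def forward(self, x, noisy=True):
        """Compute W x with W = mu + sigma * epsilon (or just mu if not noisy)."""
        out = []
        for o in range(self.out_features):
            total = 0.0
            for i in range(self.in_features):
                w = self.weight_mu[o][i]
                if noisy:
                    w += self.weight_sigma[o][i] * self.weight_epsilon[o][i]
                total += w * x[i]
            out.append(total)
        return out
```

Calling `reset_noise()` at the start of each test episode would give one weight draw per episode. Whether that adds meaningful variance late in training depends on how small the learned sigmas have become, which is exactly the concern raised above.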
I think it's difficult for a random agent to do really well at Space Invaders. In any case I plot Q-values on a held-out validation memory, and that's somewhat informative as to learning. Let me know how sampling Q-values goes - I've had a skim through the DM papers and they seem to average results over many testing episodes, but I'm not sure I see anything different about NoisyNet evaluation - without it you'd do a random action uniformly with a very low probability.
So is your training run on Space Invaders doing better than a random agent now? ^^
Looks reasonable so far. The Q-values increased rapidly, and have now stabilised (looking very similar to the values of the Double DQN). The reward itself is clearly increasing (noisily, but at a reasonable level - not one at which I'd say there's definitely a problem). It's pretty much at the level of a trained Double DQN at about 1/3 of the training steps - but of course according to the Rainbow paper the score only really takes off after the halfway mark (and even then many runs may not work out so well, so even if this fails after the full run it's unfortunately not conclusive).
Hm, ok, that looks really nice to me and definitely seems to be working! (Did you add non-deterministic testing, e.g. by using new weights in the Noisy layers?)
I ran this as soon as I got in the last few fixes, so testing is completely deterministic. If DM followed previous evaluation protocol, then we should actually use an ε-greedy policy with ε = 0.001 (the below quote is from the Double DQN paper on DQN evaluation but they later on mention using a lower ε):
So if you're able to do quick tests (perhaps on Pong) for evaluation, the first thing is to see if using 100 instead of 10 evaluation episodes does introduce some variance. Otherwise, given how it is trained to maximise reward even with noisy layers, taking different draws of weights seems like a better (albeit non-backwards-compatible) way of evaluating the network.
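For reference, the ε-greedy evaluation policy mentioned above can be sketched as follows. The function name `epsilon_greedy_action` and its signature are assumptions made for illustration, not code from the repository. The default ε = 0.001 is the value quoted in this thread.

```python
import random


def epsilon_greedy_action(q_values, epsilon=0.001, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Rare exploratory step: any action, chosen uniformly.
        return rng.randrange(len(q_values))
    # Greedy step: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With ε this small, most steps are still greedy, but a single random action early in an episode can be enough to send otherwise identical evaluation episodes down different trajectories.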
Closing this issue as injecting even a small amount of noise via ε-greedy gives a sufficient distribution over test performance, and it is (AFAIK) DM's standard method of evaluating DQN variants.
There is a parameter `--evaluation-episodes`, but in the current implementation, since we always act greedily, all the episodes are going to be exactly the same. I think that to get a better testing evaluation, you should add a `deterministic=False` option when you are testing (i.e. instead of taking the action with the highest Q-value, you sample over all the actions, using each action's Q-value as its probability). I implemented that on my branch in the last commit, marintoro/Rainbow@d061caf (it's really straightforward).
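As an illustration of sampling actions from Q-values, here is a small sketch. It is not the code from the commit referenced above. Raw Q-values can be negative and need not sum to one, so this sketch swaps in a softmax with a hypothetical `temperature` parameter to turn them into a probability distribution. The function names are also assumptions.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Convert Q-values to probabilities; subtracting the max keeps exp() stable."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability given by softmax(Q / temperature)."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A low temperature makes the policy close to greedy, while a high one approaches uniform random actions, so the temperature would need tuning to avoid exploring too much during testing.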
Btw, I launched a training run last night and everything worked properly. But I don't have access to a powerful computer yet, so the agent's performance was still pretty poor (it was in the early stage of training). I just wanted to know if you have already launched a long training run, on which game, and whether you compared it to a standard DRL algorithm (like a simple DQN, for example)?
There may still be some non-breaking errors in the implementation that could be sneaky to spot and debug (for example, if the agent learns worse than a simple DQN, there must be something wrong).