Implement asynchronous methods #5
I currently have no plans to implement A3C in this repo because it is quite different from the original DQN, rather than a relatively simple addition to it. You're welcome to submit a PR if you manage to come up with an implementation.

Edit: Actually, asynchronous one-step Q-learning could be in scope. And I agree that …
@Kaixhin @lake4790k I was actually working off this repo as a starting point not long ago (maybe a month?) to attempt to do this. I evaluated threads and concluded it was not a very good option for one-step or n-step, due to the way upvalues work within the threads (in particular, the singleton instances and step counts within those algorithms cause a number of issues). If you're interested in collaborating, I can invite you to the library I was working on this in. I think that lua---parallel can handle A3C better, and I was utilizing a structure similar to the original DeepMind code.

Edit: To add to @Kaixhin's edit, this is possible. You can pretty easily run multiple instances of the Atari code in parallel; it's just that serialization at a given step is neither straightforward nor painless between threads.
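To make the upvalue issue concrete, here is a toy sketch (illustrative only, using the default serializer of the torch `threads` package rather than `sharedserialize`): a step counter captured as a plain Lua upvalue is copied into each worker, so per-thread increments never make it back to the main thread.

```lua
-- Toy illustration: with the default serializer, upvalues are copied into
-- each worker thread, so a "shared" step counter silently diverges.
local threads = require 'threads'

local stepCounter = 0
local pool = threads.Threads(2)

pool:addjob(function()
  stepCounter = stepCounter + 1      -- increments the worker's private copy
  return stepCounter
end, function(c)
  print('worker saw counter =', c)   -- prints 1
end)

pool:synchronize()
print('main thread counter =', stepCounter)  -- still 0: the increment never came back
```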
@Kaixhin I think these async methods can be done in pure torch using threads with sharedserialize. A network with the global shared parameters is created in each thread, so in the one-step case the learner then does n forwards/backwards, accumulating gradients locally, before applying them to the shared parameters. The shared RMSProp they describe, with the shared statistics, could be handled the same way.
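Roughly, something along these lines could work (a hedged sketch only, with made-up network sizes, a fake transition instead of a real environment, and no target network; none of this is taken from this repo):

```lua
-- Sketch of one async learner thread doing one-step Q-learning against a
-- shared parameter vector, with a shared-RMSProp-style update (Hogwild).
require 'nn'
local threads = require 'threads'
threads.Threads.serialization('threads.sharedserialize')  -- tensors are shared by pointer

-- Global state created in the main thread.
local globalNet = nn.Sequential()
  :add(nn.Linear(4, 32)):add(nn.ReLU()):add(nn.Linear(32, 2))
local theta = globalNet:getParameters()   -- shared parameter vector
local sharedG = theta:clone():zero()      -- shared RMSProp statistic g

local pool = threads.Threads(4, function() require 'nn' end)

for t = 1, 4 do
  pool:addjob(function()
    -- theta and sharedG are upvalues: with sharedserialize every thread
    -- aliases the same storage, which is what makes the update lock-free.
    local localNet = nn.Sequential()
      :add(nn.Linear(4, 32)):add(nn.ReLU()):add(nn.Linear(32, 2))
    local localTheta, localDTheta = localNet:getParameters()

    local lr, alpha, eps, nApply = 7e-4, 0.99, 0.1, 5
    localDTheta:zero()
    localTheta:copy(theta)
    for step = 1, 1000 do
      -- Hypothetical transition (s, a, r, s'); a real learner would act in an env.
      local s, sPrime = torch.rand(4), torch.rand(4)
      local a, r = math.random(2), torch.uniform()

      local qPrime = localNet:forward(sPrime):max()             -- max_a' Q(s', a')
      local q = localNet:forward(s)
      local target = q:clone(); target[a] = r + 0.99 * qPrime   -- one-step Q target
      localNet:backward(s, q - target)                          -- accumulate dLoss/dQ for 0.5*(target - Q)^2

      if step % nApply == 0 then
        -- Shared RMSProp-style update: g and theta are modified in place,
        -- without locks (Hogwild), then the local copy re-syncs.
        sharedG:mul(alpha):addcmul(1 - alpha, localDTheta, localDTheta)
        theta:addcdiv(-lr, localDTheta, torch.sqrt(sharedG + eps))
        localDTheta:zero()
        localTheta:copy(theta)
      end
    end
  end)
end
pool:synchronize()
```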
@michaelghaben @lake4790k Thanks for the comments - one-step (and possibly n-step) Q-learning can hopefully be integrated with the other features in this repo, e.g. the dueling architecture. If that can be achieved then the others, like advantage actor-critic, can be considered. I'm not going to be able to take the lead on this in the near future, but I'm happy to lend a hand to either a fork or a new repo if I can. For now I've just added an …
@Kaixhin I did a reference implementation as per the above in a simple codebase without all the other methods. I hope I will have time to merge it with your codebase; it would be interesting to see the performance compared to all the other methods. Will do a PR once ready.
Someone is trying to replicate this, and after skimming through miyosuda/async_deep_reinforce#1 it seems like they got hold of hyperparameters not noted in the paper. Worth keeping an eye on.
@Kaixhin yes, interesting about the hyperparameters. But I'm not sure that implementing this in Python is a good idea; afaik Python does not support multithreaded execution of Python script code at all (i.e. all of the RL logic), only of code that runs outside the GIL (i.e. TensorFlow operations, Cython parts). They also mention slow performance compared to the original; I would not be surprised if that is because of this Python issue (though even single-threaded Python performance could be poor compared to Lua/native for this use case).

I'll add my implementation of async to Atari in the coming days; I'm curious how it will work...
@Kaixhin btw I started with Catch, comparing CPU and GPU behaviour, and noticed that the CPU did not converge as the GPU did, which should not be the case as all the code is the same for both. Except for the random initialization: by default the code sets a manual seed of 1, but in the GPU case it then does a cutorch.manualSeed(torch.random()) before constructing the net. So I think the CPU always gets a poor initialization from seed 1, while the GPU net gets some other random weights that work better. If I set the CPU seed to random, it then converges similarly to the GPU (and it's also faster, as you note, because of the small net). I'm not sure if this random initialization behaviour is intended; it got me quite confused at first...
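In other words, the behaviour looks roughly like this (an illustrative sketch, not the repo's exact code; `useGpu` and `createNetwork` are stand-ins):

```lua
-- The CPU path always initialises the network from the fixed seed of 1,
-- while the GPU path draws a fresh seed before the net is constructed.
torch.manualSeed(1)                     -- fixed default seed

if useGpu then                          -- hypothetical flag
  require 'cutorch'
  cutorch.manualSeed(torch.random())    -- GPU net ends up with different random weights
end

local net = createNetwork()             -- hypothetical constructor; its weights depend on the path above
```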
@lake4790k Interesting - I went with the same initialisation as in DeepMind's code. I don't think a seed of 1 is worse than any other seed for a random number generator - it seems like you just got unlucky. It might be more obvious with Catch, but as far as I know weight initialisation hasn't been looked into for deep reinforcement learning.
Some results running the async 1-step Q. I used Pong to compare the learning speed with the async paper's page 20 results.

The first experiment I ran on 10 hyperthreads (5 physical cores). I would expect the equivalent DeepMind performance to be somewhat below the midpoint between the 4- and 8-thread curves, as the speed is limited by the 5 physical cores, but having more threads with diverse experiences helps a bit. The time scale of this figure is a little less than 14 hours. It achieved a score of 0 in about 11 hours, which is exactly where the interpolated DeepMind curve would be. I used a learning rate of 0.0007.

The second experiment I ran on 8 hyperthreads (4 physical cores). The equivalent DeepMind curve should be a bit above the 4-thread curve on page 20. The time scale of this figure is a little less than 24 hours. At 14 hours it achieved a score of -3, which is exactly where the 4-threaded DeepMind curve is. I used a learning rate of 0.0016.

In these experiments I did not have learning rate decay as in the paper. The paper says they used the same experiment setup as the double Q paper, but then also says they used gradient norm clipping (which I didn't turn on either), which was introduced in the duelling paper. I also had an experiment with the more aggressive 0.0016 learning rate that got stuck at the beginning, not improving for a long time. My guess would be that gradient clipping would have helped it get out of there (also the learning rate decay, eventually).

As the curves in the paper are the average of the 3 best agents out of 50 experiments, they most likely used an optimized C++ implementation (with TensorFlow) while ours is pure torch, and I ran only a few experiments, these results look pretty good. I still plan to implement n-step Q (in combination with double/PAL/dueling) and A3C, and to unify as much as possible with the experience replay codebase as we discussed.
@lake4790k looks great - thanks for comparing with DeepMind results. Epochs are more meaningful than training time due to differences in hardware, but your estimates sound about right. Keep at it!
Closed by #30.
Ready for the next method based on A3C...
In A3CAgent:accumulateGradients, what is the reason we have 0.5 in vTarget instead of 2?
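(As a hedged aside, not an answer given in this thread: one common convention is to define the value loss as half the squared error, in which case no factor of 2 appears in its gradient; multiplying the plain squared-error gradient by 0.5 gives the same scaling.)

```latex
% One possible reading (an assumption, not confirmed here):
% with a half-squared-error value loss the derivative carries no factor of 2.
L_V = \tfrac{1}{2}\,\bigl(R - V(s;\theta_v)\bigr)^2
\qquad\Longrightarrow\qquad
\frac{\partial L_V}{\partial V(s;\theta_v)} = -\bigl(R - V(s;\theta_v)\bigr)
```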
http://arxiv.org/pdf/1602.01783v1.pdf describes asynchronous methods using off-policy (1-step / n-step Q-learning) and even on-policy (Sarsa and advantage actor-critic (A3C)) reinforcement learning.
These algorithms converge faster with fewer resources (CPU-only, multithreaded on a single machine, without using a large replay memory) and can achieve better results than other methods.
I think the Hogwild method they use for the lock-free updating of the shared network can be implemented with Torch/Lua threads.sharedserialize.
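As a rough illustration of that sharing mechanism (a toy sketch, not code from this repo): with threads.sharedserialize, tensor upvalues are handed to worker threads by pointer rather than copied, so in-place updates from any thread are immediately visible to all the others.

```lua
-- Toy Hogwild-style sketch: a shared tensor is mutated lock-free by workers.
local threads = require 'threads'
threads.Threads.serialization('threads.sharedserialize')

local shared = torch.zeros(10)        -- stands in for the flattened network parameters
local pool = threads.Threads(4, function() require 'torch' end)

for t = 1, 4 do
  pool:addjob(function()
    -- 'shared' aliases the same storage here as in the main thread.
    shared:add(1)                     -- lock-free, in-place update
  end)
end
pool:synchronize()

print(shared:sum())                   -- 40 (modulo benign races), not 0
```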