Implement asynchronous methods #5
I currently have no plans to implement A3C in this repo because it is quite different from the original DQN, rather than a relatively simple addition to it. You're welcome to submit a PR if you manage to come up with an implementation.

Edit: Actually, asynchronous one-step Q-learning could be in scope. And I agree that …
@Kaixhin @lake4790k I was actually working off this repo as a starting point not long ago (maybe a month?) to attempt to do this. I evaluated threads and concluded it was not a very good option for one-step or n-step, due to the way upvalues work within the threads (in particular, the singleton instances and step counts within those algorithms cause a number of issues). If you're interested in collaborating, I can invite you to the library I was working on this in. I think that lua---parallel can handle A3C better, and I was utilizing a structure similar to the original DeepMind code.

Edit: To add to @Kaixhin's edit, this is possible. You can pretty easily run multiple instances of the Atari code in parallel; it's just that serialization at a given step is neither straightforward nor painless between threads.
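To make the upvalue issue concrete, here is a toy sketch (illustrative only, using the default serializer of the torch `threads` package rather than `sharedserialize`): a step counter captured as a plain Lua upvalue is copied into each worker, so per-thread increments never make it back to the main thread.

```lua
-- Toy illustration: with the default serializer, upvalues are copied into
-- each worker thread, so a "shared" step counter silently diverges.
local threads = require 'threads'

local stepCounter = 0
local pool = threads.Threads(2)

pool:addjob(function()
  stepCounter = stepCounter + 1      -- increments the worker's private copy
  return stepCounter
end, function(c)
  print('worker saw counter =', c)   -- prints 1
end)

pool:synchronize()
print('main thread counter =', stepCounter)  -- still 0: the increment never came back
```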
@Kaixhin I think these async methods can be done in pure torch using threads with sharedserialize. A network with the global shared parameters is created in each thread, so in the one-step case the learner then does n forwards/backwards, accumulating gradients locally, before applying them to the shared parameters. The shared RMSProp they describe, with the shared statistics, could be handled the same way.
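Roughly, something along these lines could work (a hedged sketch only, with made-up network sizes, a fake transition instead of a real environment, and no target network; none of this is taken from this repo):

```lua
-- Sketch of one async learner thread doing one-step Q-learning against a
-- shared parameter vector, with a shared-RMSProp-style update (Hogwild).
require 'nn'
local threads = require 'threads'
threads.Threads.serialization('threads.sharedserialize')  -- tensors are shared by pointer

-- Global state created in the main thread.
local globalNet = nn.Sequential()
  :add(nn.Linear(4, 32)):add(nn.ReLU()):add(nn.Linear(32, 2))
local theta = globalNet:getParameters()   -- shared parameter vector
local sharedG = theta:clone():zero()      -- shared RMSProp statistic g

local pool = threads.Threads(4, function() require 'nn' end)

for t = 1, 4 do
  pool:addjob(function()
    -- theta and sharedG are upvalues: with sharedserialize every thread
    -- aliases the same storage, which is what makes the update lock-free.
    local localNet = nn.Sequential()
      :add(nn.Linear(4, 32)):add(nn.ReLU()):add(nn.Linear(32, 2))
    local localTheta, localDTheta = localNet:getParameters()

    local lr, alpha, eps, nApply = 7e-4, 0.99, 0.1, 5
    localDTheta:zero()
    localTheta:copy(theta)
    for step = 1, 1000 do
      -- Hypothetical transition (s, a, r, s'); a real learner would act in an env.
      local s, sPrime = torch.rand(4), torch.rand(4)
      local a, r = math.random(2), torch.uniform()

      local qPrime = localNet:forward(sPrime):max()             -- max_a' Q(s', a')
      local q = localNet:forward(s)
      local target = q:clone(); target[a] = r + 0.99 * qPrime   -- one-step Q target
      localNet:backward(s, q - target)                          -- accumulate dLoss/dQ for 0.5*(target - Q)^2

      if step % nApply == 0 then
        -- Shared RMSProp-style update: g and theta are modified in place,
        -- without locks (Hogwild), then the local copy re-syncs.
        sharedG:mul(alpha):addcmul(1 - alpha, localDTheta, localDTheta)
        theta:addcdiv(-lr, localDTheta, torch.sqrt(sharedG + eps))
        localDTheta:zero()
        localTheta:copy(theta)
      end
    end
  end)
end
pool:synchronize()
```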
@michaelghaben @lake4790k Thanks for the comments - one-step (and possibly n-step) Q-learning can hopefully be integrated with the other features in this repo, e.g. the dueling architecture. If that can be achieved then the others, like advantage actor-critic, can be considered. I'm not going to be able to take the lead on this in the near future, but I'm happy to lend a hand to either a fork or a new repo if I can. For now I've just added an …
@Kaixhin I did a reference implementation as per the above in a simple codebase without all the other methods. I hope I will have time to merge it with your codebase; it would be interesting to see the performance compared to all the other methods. Will do a PR once ready.
Someone is trying to replicate this, and after skimming through miyosuda/async_deep_reinforce#1 it seems like they got hold of hyperparameters not noted in the paper. Worth keeping an eye on.
@Kaixhin yes, interesting about the hyperparameters. But I'm not sure that implementing this in Python is a good idea; afaik Python does not support multithreaded execution of Python script code at all (i.e. all of the RL logic), only of code that runs outside the GIL (i.e. TensorFlow operations, Cython parts). They also mention slow performance compared to the original; I would not be surprised if that is because of this Python issue (though even single-threaded Python performance could be poor compared to Lua/native for this use case).

I'll add my implementation of async to Atari in the coming days; I'm curious how it will work...
@Kaixhin btw I started with Catch, comparing CPU and GPU behaviour, and noticed that the CPU did not converge as the GPU did, which should not be the case as all the code is the same for both. Except for the random initialization: by default the code sets a manual seed of 1, but in the GPU case it then does a cutorch.manualSeed(torch.random()) before constructing the net. So I think the CPU always gets a poor initialization from seed 1, while the GPU net gets some other random weights that work better. If I set the CPU seed to random, it then converges similarly to the GPU (and it's also faster, as you note, because of the small net). I'm not sure if this random initialization behaviour is intended; it got me quite confused at first...
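In other words, the behaviour looks roughly like this (an illustrative sketch, not the repo's exact code; `useGpu` and `createNetwork` are stand-ins):

```lua
-- The CPU path always initialises the network from the fixed seed of 1,
-- while the GPU path draws a fresh seed before the net is constructed.
torch.manualSeed(1)                     -- fixed default seed

if useGpu then                          -- hypothetical flag
  require 'cutorch'
  cutorch.manualSeed(torch.random())    -- GPU net ends up with different random weights
end

local net = createNetwork()             -- hypothetical constructor; its weights depend on the path above
```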
@lake4790k Interesting - I went with the same initialisation as in DeepMind's code. I don't think a seed of 1 is worse than any other seed for a random number generator - it seems like you just got unlucky. It might be more obvious with Catch, but as far as I know weight initialisation hasn't been looked into for deep reinforcement learning.
Some results running the async 1-step Q. I used Pong to compare the learning speed with the async paper's page 20 results.

The first experiment I ran on 10 hyperthreads (5 physical cores). I would expect the equivalent DeepMind performance to be somewhat below the midpoint between the 4- and 8-thread curves, as the speed is limited by the 5 physical cores, but having more threads with diverse experiences helps a bit. The time scale of this figure is a little less than 14 hours. It achieved a score of 0 in about 11 hours, which is exactly where the interpolated DeepMind curve would be. I used a learning rate of 0.0007.

The second experiment I ran on 8 hyperthreads (4 physical cores). The equivalent DeepMind curve should be a bit above the 4-thread curve on page 20. The time scale of this figure is a little less than 24 hours. At 14 hours it achieved a score of -3, which is exactly where the 4-threaded DeepMind curve is. I used a learning rate of 0.0016.

In these experiments I did not have learning rate decay as in the paper. The paper says they used the same experiment setup as the double Q paper, but then also says they used gradient norm clipping (which I didn't turn on either), which was introduced in the duelling paper. I also had an experiment with the more aggressive 0.0016 learning rate that got stuck at the beginning, not improving for a long time. My guess would be that gradient clipping would have helped it get out of there (also the learning rate decay, eventually).

As the curves in the paper are the average of the 3 best agents out of 50 experiments, they most likely used an optimized C++ implementation (with TensorFlow) while ours is pure torch, and I ran only a few experiments, these results look pretty good. I still plan to implement n-step Q (in combination with double/PAL/dueling) and A3C, and to unify as much as possible with the experience replay codebase as we discussed.
@lake4790k looks great - thanks for comparing with DeepMind results. Epochs are more meaningful than training time due to differences in hardware, but your estimates sound about right. Keep at it!
Closed by #30.
Ready for the next method based on A3C...
In A3CAgent:accumulateGradients, what is the reason we have 0.5 in vTarget instead of 2?
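(As a hedged aside, not an answer given in this thread: one common convention is to define the value loss as half the squared error, in which case no factor of 2 appears in its gradient; multiplying the plain squared-error gradient by 0.5 gives the same scaling.)

```latex
% One possible reading (an assumption, not confirmed here):
% with a half-squared-error value loss the derivative carries no factor of 2.
L_V = \tfrac{1}{2}\,\bigl(R - V(s;\theta_v)\bigr)^2
\qquad\Longrightarrow\qquad
\frac{\partial L_V}{\partial V(s;\theta_v)} = -\bigl(R - V(s;\theta_v)\bigr)
```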
http://arxiv.org/pdf/1602.01783v1.pdf describes asynchronous methods using off-policy (1-step / n-step Q-learning) and even on-policy (Sarsa and advantage actor-critic (A3C)) reinforcement learning.
These algorithms converge faster with fewer resources (CPU-only, multithreaded on a single machine, without using a large replay memory) and can achieve better results than other methods.
I think the Hogwild method they use for the lock-free updating of the shared network can be implemented with Torch/Lua threads.sharedserialize.
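As a rough illustration of that sharing mechanism (a toy sketch, not code from this repo): with threads.sharedserialize, tensor upvalues are handed to worker threads by pointer rather than copied, so in-place updates from any thread are immediately visible to all the others.

```lua
-- Toy Hogwild-style sketch: a shared tensor is mutated lock-free by workers.
local threads = require 'threads'
threads.Threads.serialization('threads.sharedserialize')

local shared = torch.zeros(10)        -- stands in for the flattened network parameters
local pool = threads.Threads(4, function() require 'torch' end)

for t = 1, 4 do
  pool:addjob(function()
    -- 'shared' aliases the same storage here as in the main thread.
    shared:add(1)                     -- lock-free, in-place update
  end)
end
pool:synchronize()

print(shared:sum())                   -- 40 (modulo benign races), not 0
```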