How to configure the code to run with GPU? #26
Comments
By default, Tensorflow will grab all available memory on the GPU when the first process is created, and all subsequent processes will fail since no memory remains. This behavior can be changed by using the 'allow_growth' option, which allows the memory for each process to dynamically expand as needed. This is covered in detail here: https://www.tensorflow.org/tutorials/using_gpu

With that said, these asynchronous algorithms are not optimized for GPU and, in my experience, perform much worse when forced to run on GPU. The A3C algorithm was released first, and a GPU-friendly version called A2C was released some time later. Something similar was done with PPO, as OpenAI has released a PPO2 algorithm which is optimized for GPU. These GPU-optimized algorithms trade async behavior for batches, which is where GPUs really perform.

It would be great to compare performance between different systems, but I've noticed that the global_step/sec parameter Andrew has implemented in the tensorboard monitor is greatly affected by many settings, making comparison difficult. The best I have seen to date is around 1800 global_steps/sec using A3C with close-to-default settings on a dual Xeon 2669 workstation with 18 workers.
tmorgan4, Thanks for your comments! Tom
@vincetom1980,
Yes, it's not a good metric here. BTW, I have included an option to run several environments for each worker in a batch.
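The idea of one worker stepping several environment copies and collecting their transitions batch-wise can be sketched in plain Python. All names here (`ToyEnv`, `worker_batch`) are illustrative placeholders, not btgym's actual API:

```python
# Illustrative sketch: one worker steps several environment copies
# and stacks their transitions into a single batch.
import random

class ToyEnv:
    """Stand-in for a single environment instance."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def step(self, action):
        # Returns (observation, reward) for the given action.
        return self.rng.random(), float(action)

def worker_batch(envs, actions):
    """Step every env once and collect the results batch-wise."""
    observations, rewards = [], []
    for env, action in zip(envs, actions):
        obs, rew = env.step(action)
        observations.append(obs)
        rewards.append(rew)
    return observations, rewards

envs = [ToyEnv(seed=i) for i in range(4)]
obs, rew = worker_batch(envs, actions=[0, 1, 0, 1])
print(len(obs), rew)
```

Batching per worker amortizes the per-step overhead and produces larger training batches, which is exactly what helps when the model update runs on a GPU.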
Andrew, Tom
The A3C algorithm was designed to work with 'workers' that run on CPU, so running the whole framework on GPU doesn't make a lot of sense. But what about running specific parts on GPU? I'm currently experimenting with this idea: maybe I can wrap only the encoder so that it runs on the GPU. I've made the following changes to the code:
I ran a test using only 1 worker but couldn't get it to work (error: CUDA is out of memory). The log Tensorflow generates shows there is an active GPU, and 'tensorflow-gpu' is the only version installed (and works properly). I'm having a hard time understanding the source of the problem. Hopefully there is a solution, as CPU power is not enough to experiment at larger scale.
@JacobHanouna,
The config you posted is deprecated and not related at all; do not use it. For the proper place to configure distributed TF device placement, see:

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/worker.py#L195
https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L439
https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L264

Some explanations:
Thanks for the guidance @Kismuz |
I've been reading both the code and the TensorFlow distributed docs for a couple of hours; not the easiest topic to follow. This is what I understand so far:
I'm not sure how to modify the code so that I have another worker bound to the GPU that is not part of the A3C training. I'm not looking for something pretty, just a minimal way to make it work.
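For illustration, pinning only a chosen sub-graph to the GPU in TF1-style code looks roughly like the following pseudocode sketch. `build_encoder` is a made-up placeholder, not btgym's actual code, and whether this alone avoids the OOM and worker-concurrency issues is not guaranteed:

```
# Pseudocode sketch (TF1-style device scopes, names are hypothetical):
import tensorflow as tf

with tf.device('/cpu:0'):
    inputs = tf.placeholder(tf.float32, [None, 64])

with tf.device('/gpu:0'):            # only this block lands on GPU
    encoded = build_encoder(inputs)  # hypothetical encoder builder

with tf.device('/cpu:0'):
    logits = tf.layers.dense(encoded, 4)
```

Everything outside the `'/gpu:0'` scope keeps its CPU placement, so the async worker logic is untouched; only the encoder ops would be dispatched to the GPU.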
@JacobHanouna, that is correct, except it's essential to understand that it is the tensorflow graph (or even a specific part of it) that gets assigned to a specific device, not a python object or process (instance of a worker etc.). In a nutshell, there is a replica of the graph assigned to each worker process and one replica held by the parameter server process; the latter receives trainable parameter updates from the workers' graphs (to be exact, it gets the computed gradients and applies them to its own variables following the optimiser rule); then each worker copies the updated variables to its own graph to work with. That's a big topic indeed with a lot of pitfalls, and I do recommend digging through github for some well-written distributed code from the big guys. There is no guarantee that, even if one correctly assigns the computation-heavy part of the graph ops to the GPU device, there will be no lock-ups due to workers' concurrency; that's why A2C is more efficient here: it forces each worker to put its own batch in a synchronous manner, concatenates everything batch-wise and sends it to the GPU in a single pass.
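The contrast can be sketched in plain Python (no TF): A3C-style workers apply their own updates to the shared parameter one after another as they arrive, while A2C-style collection concatenates every worker's batch and performs one synchronous update, which is the pattern that keeps a GPU busy. All names below are invented for illustration:

```python
# Toy contrast between async per-worker updates and one batched update.
# A single scalar stands in for the trainable parameters.

def a3c_style(param, worker_grads, lr=0.1):
    """Each worker applies its gradient as soon as it arrives."""
    for g in worker_grads:        # updates interleave, order-dependent
        param -= lr * g
    return param

def a2c_style(param, worker_batches, lr=0.1):
    """Concatenate all workers' samples, then do one batched update."""
    merged = [x for batch in worker_batches for x in batch]
    grad = sum(merged) / len(merged)   # one gradient over the big batch
    return param - lr * grad

p1 = a3c_style(1.0, worker_grads=[0.2, 0.4, 0.6])
p2 = a2c_style(1.0, worker_batches=[[0.2], [0.4], [0.6]])
print(p1, p2)
```

In the real algorithms the batched path is the one that maps well to GPU hardware: one large matrix-multiply per update instead of many small, contended ones.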
@JacobHanouna Making this work on GPU will require a fair amount of rework. You are most likely getting the 'cuda is out of memory' error because Tensorflow by default grabs all available memory on the device in the first session, so all other workers don't see any available memory. You actually posted the solution above (from BTgymMonitor, which is not being used): you need to specify 'config.gpu_options.allow_growth = True'. This tells Tensorflow to allocate a small amount of memory to start and expand as needed. You can also specify a fraction of memory for each process to allocate, if that is more convenient. It's all covered in detail under 'Allowing GPU memory growth' in the Tensorflow docs.

As an aside, I have been digging through @Kismuz's code for a long time (a year?) and am just now finally understanding how certain parts work together. Andrew has done an extraordinary job, especially considering he's done nearly all of it himself.
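For reference, the TF1-style session config being described is sketched below; where btgym actually constructs its sessions is covered by the links above, so treat this as an illustrative fragment rather than a drop-in fix:

```
# TF1-style session config fragment (sketch):
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # start small, grow memory as needed
# ...or instead cap each process at a fixed share of GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.25

sess = tf.Session(config=config)
```

With `allow_growth` enabled, the first worker no longer claims the whole device, so subsequent workers can still allocate.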
hi Andrew,
Thanks for your great work! I wonder if I can run this code in my GPU environment. I've tried modifying the code from 'cpu:0' to 'gpu:0' but got an error saying that resources are not available.
Could you please tell me the correct way to do this?
Tom