
How to configure the code to run with GPU? #26

Closed
vincetom1980 opened this issue Dec 25, 2017 · 11 comments

Comments

@vincetom1980

Hi Andrew,
Thanks for your great work! I wonder if I can run this code in my GPU environment. I've tried changing 'cpu:0' in the code to 'gpu:0' but got an error saying the resource is not available.

Could you please tell me the correct way to do this?

Tom

@tmorgan4

By default, TensorFlow grabs all available memory on the GPU when the first process is created, so all subsequent processes fail because no memory remains. This behavior can be changed with the 'allow_growth' option, which lets the memory for each process expand dynamically as needed. This is covered in detail here: https://www.tensorflow.org/tutorials/using_gpu
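
For reference, a minimal TF1-style sketch of the allow_growth option (generic TensorFlow, not code from this repo):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of grabbing it all up front

with tf.Session(config=config) as sess:
    # build and run the graph as usual
    pass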

With that said, these asynchronous algorithms are not optimized for GPU and, in my experience, perform much worse when forced to run on one. The A3C algorithm was released first, and a GPU-friendly version called A2C was released some time later. Something similar was done with PPO, as OpenAI released a PPO2 algorithm optimized for GPU. These GPU-optimized algorithms trade async behavior for large batches, which is where GPUs really shine.

It would be great to compare performance between different systems, but I've noticed the global_step/sec metric Andrew has implemented in the TensorBoard monitor is greatly affected by many settings, making it difficult to compare. The best I have seen to date is around 1800 global_steps/sec using A3C with close-to-default settings on a dual Xeon 2669 workstation with 18 workers.

@vincetom1980
Author

tmorgan4,

Thanks for your comments!

Tom

@Kismuz
Owner

Kismuz commented Dec 25, 2017

@vincetom1980,
Yes indeed, @tmorgan4's comment is right to the point: A3C is good for those who don't have access to cheap GPU resources (like me :). As for performance, here is a post from the A2C developers:
https://blog.openai.com/baselines-acktr-a2c/
which can be summarised as 'it's better to run A2C on GPU than A3C on CPU'.

global_step/sec metric Andrew has implemented in the TensorBoard monitor is greatly affected by many settings, making it difficult to compare.

Yes, not a good metric, since here global_step is defined not as 'number of algorithm training iterations' but as 'number of environment steps made so far by all workers', and would better be named sample count or so. I found it more convenient for this particular task. Anyway, a train_global_step is easy to insert.
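
In case it helps, a minimal sketch of such a counter (illustrative only; the name train_global_step is hypothetical, not an existing field in this repo):

import tensorflow as tf

# separate counter, incremented once per training iteration,
# independent of the environment-step-based global_step
train_global_step = tf.get_variable(
    'train_global_step', shape=[], dtype=tf.int64,
    initializer=tf.zeros_initializer(), trainable=False)
inc_train_global_step = tf.assign_add(train_global_step, 1)
tf.summary.scalar('train_global_step', train_global_step)

# run it together with the train op, e.g.:
# sess.run([train_op, inc_train_global_step], feed_dict=...)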

BTW, I have included an option to run several environments for each worker in a batch, like this:

cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,  
    num_ps=1,
    num_envs=4,
    log_dir=os.path.expanduser('~/tmp/test_4_8'),
)
  • here 4 x 4 = 16 environment instances will be run, and each worker will get a batch of four rollouts per train step; but it seems such a setup further lowers sampling efficiency, and I'm not confident it can run well on GPU.

@vincetom1980
Author

Andrew,
Thank you for your answer!

Tom

@JaCoderX
Contributor

JaCoderX commented Dec 27, 2018

The A3C algorithm was designed to work with 'workers' that run on CPU, so running the whole framework on GPU doesn't make a lot of sense.

But what about running specific parts on GPU?

I'm currently experimenting with conv_1d_casual_encoder using a large time_dim = 4096. My problem is that, because it adds a lot more parameters to the model, every step's computation takes considerably more time.

So I was thinking maybe I could wrap only the encoder with tf.device('/gpu:0'): and force the encoder block to run on GPU. This way everything would run on CPU except the convolution part, which is known to work very well on GPU.

I've made the following changes to the code:

  • wrapped the encoder:
def conv_1d_casual_encoder(
...
    with tf.device('/gpu:0'):
        with tf.variable_scope(name_or_scope=name, reuse=reuse):
...
  • added a config to the Session for GPU logging/control
    (not sure it was the right place to put the config, but tensorboard.py was the only place I found tf.Session() in the code)
class BTgymMonitor():
...
    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=False)
    config.gpu_options.allow_growth = True
    self.sess = self.tf.Session(config=config)

I ran a test using only 1 worker but I couldn't get it to work (error: CUDA out of memory).
I even get this error if I use with tf.device('/gpu:0'): on a simple operation inside the encoder.

The log TensorFlow generates shows there is an active GPU, and 'tensorflow-gpu' is the only version installed (and works properly).

I'm having a hard time understanding the source of the problem.

Hopefully there is a solution, as CPU power is not enough to experiment with a large time_dim efficiently.

@Kismuz
Owner

Kismuz commented Dec 27, 2018

@JacobHanouna,

A3C algorithm was designed to work on 'workers' that run on CPU. So running the whole framework on GPU doesn't make a lot of sense.

  • there exists an extension of A3C optimised for GPU, named A3G;
  • another option is the batched version, A2C; both can be found on arXiv / GitHub

class BTgymMonitor()

...is deprecated and not related at all; do not use it. For the proper place to configure distributed TF device placement, see:

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/worker.py#L195

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L439

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L264

some explanations:

https://www.tensorflow.org/deploy/distributed
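
For orientation only, here is a generic sketch of how device placement is typically set up in distributed TF1 with tf.train.replica_device_setter (the linked worker.py / aac.py lines are the authoritative places in this repo; the cluster addresses and task index below are made up):

import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    'ps': ['127.0.0.1:12230'],
    'worker': ['127.0.0.1:12231', '127.0.0.1:12232'],
})
task = 0  # index of this worker

# variables go to the parameter server, ops stay on this worker's device
with tf.device(tf.train.replica_device_setter(
        cluster=cluster_spec,
        worker_device='/job:worker/task:{}/cpu:0'.format(task))):
    # build the model graph here
    pass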

@JaCoderX
Contributor

Thanks for the guidance @Kismuz
I will try to give it a go :)

@Kismuz Kismuz reopened this Dec 27, 2018
@JaCoderX
Contributor

I've been reading both the code and the TensorFlow distributed docs for a couple of hours; not the easiest topic to follow.

This is what I understand so far:

  • we define the number of workers in cluster_config and pass it to the launcher.
  • the launcher then gives each worker a task number and all the config data needed to run A3C.
  • all workers are then instantiated via the worker class
  • each worker then instantiates the A3C model in the BaseAAC class
  • in the A3C class the worker gets assigned a device that binds it to CPU computation.

I'm not sure how to modify the code so that I have another worker that is bound to GPU and is not part of the A3C.

I'm not looking for something pretty, just a way to use something like with tf.device("/job:worker/task:{}/gpu:0".format(task)): over the encoder block.
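
Purely as an illustration of that idea (hypothetical code, not the actual btgym encoder), pinning only a convolutional sub-graph to the worker's GPU could look roughly like this:

import tensorflow as tf

def conv_1d_encoder(x, task, name='encoder', reuse=False):
    # only this sub-graph is requested to run on the worker's GPU;
    # the rest of the worker's graph keeps its CPU placement
    with tf.device('/job:worker/task:{}/gpu:0'.format(task)):
        with tf.variable_scope(name, reuse=reuse):
            return tf.layers.conv1d(x, filters=32, kernel_size=4, strides=2)

With allow_soft_placement=True in the session config, any op without a GPU kernel would fall back to CPU instead of raising an error.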

@Kismuz
Copy link
Owner

Kismuz commented Dec 27, 2018

@JacobHanouna, it is correct, except that it's essential to understand that it is the TensorFlow graph (or even a specific part of it) that gets assigned to a specific device, not a Python object or process (an instance of a worker, etc.);

In a nutshell, there is a replica of the graph assigned to each worker process and one replica held by the parameter server process; the latter receives trainable parameter updates from the workers' graphs (to be exact, it gets the computed gradients and applies them to its own variables following the optimiser rule); then each worker copies the updated variables back to its own graph to work with.

That's a big topic indeed, with a lot of pitfalls, and I do recommend digging through GitHub for some well-written distributed code from the big guys; there is no guarantee that, even if one correctly assigns the computation-heavy part of the graph ops to a GPU device, there will be no lock-ups due to worker concurrency; that's why A2C is more efficient here: it forces each worker to produce its own batch in a synchronous manner, concatenates everything batch-wise and sends it to the GPU in a single pass.

@tmorgan4

@JacobHanouna Making this work on GPU will require a fair amount of rework. You are most likely getting the 'CUDA out of memory' error because TensorFlow by default grabs all available memory on the device in the first session, so all other workers don't see any available memory.

You actually posted the solution above (from BTgymMonitor, which is not being used): you need to specify 'config.gpu_options.allow_growth = True'. This tells TensorFlow to allocate a small amount of memory to start and expand as needed. You can also specify a fraction of memory for each process to allocate if that is more convenient. It's all covered in detail under 'Allowing GPU memory growth':
https://www.tensorflow.org/guide/using_gpu
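
A small sketch of those two options together (plain TF1, nothing repo-specific; the 0.2 fraction is just an example value):

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True  # grow allocation as needed
config.gpu_options.per_process_gpu_memory_fraction = 0.2  # or cap each process at ~20% of the card

sess = tf.Session(config=config)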

As an aside, I have been digging through @Kismuz's code for a long time (a year?) and am just finally understanding how certain parts work together. Andrew has done an extraordinary job, especially considering he's done nearly all of it himself.

@JaCoderX
Contributor

@tmorgan4, @Kismuz Thank you both for your replies.

I think for now I'll stick to CPU :)

@Kismuz Kismuz closed this as completed Feb 6, 2019