Why are Keras apps using multi_gpu_model slower than a single GPU? #9204

Open · ghostplant opened this issue Jan 27, 2018 · 36 comments

ghostplant commented Jan 27, 2018

multi_gpu_model comes from keras.utils and wraps an application model so that it trains on multiple GPUs. However, using multi_gpu_model seems to make training heavier and slower. Is this expected?
The GPU I am using is NVIDIA Tesla P100.

ghostplant commented Jan 27, 2018

Besides, the TensorFlow version I use is 1.4.0 and the Keras version is 2.1.3, both with default settings. The test example is cifar10_cnn.py.

ghostplant commented Jan 27, 2018

I am reopening this post because the issue still exists. Whether I place the model weights on the CPU or on a GPU, most example models such as cifar10_cnn.py perform worse than on a single GPU.

ghostplant closed this Jan 27, 2018

ghostplant reopened this Jan 29, 2018

anj-s commented Jan 30, 2018

What benchmarks are you seeing? Is the code you are running unmodified, i.e. exactly the same as the cifar10_cnn example?

anj-s self-assigned this Jan 30, 2018

ghostplant commented Feb 2, 2018

I ran the benchmarks myself, and the examples come directly from the keras/examples folder.
For mnist_cnn.py, for example, I added the following line before model.compile:

model = keras.utils.training_utils.multi_gpu_model(model, 4)

Here is the benchmark for mnist_cnn.py on NVIDIA Tesla P100s (4 GPUs, 16 GB memory per device):

original_single: gpu=1, perf = 5s/epoch 75us/step
multi_gpu_model: gpu=2, perf = 5s/epoch 74us/step
multi_gpu_model: gpu=3, perf = 5s/epoch 81us/step
multi_gpu_model: gpu=4, perf = 6s/epoch 103us/step

For cifar10_cnn.py, I also added the one line above, and the performance is as follows:

# By default, data_augmentation = True in cifar10_cnn.py
original_single: gpu=1, perf = 23s 15ms/step
multi_gpu_model: gpu=2, perf = 23s 15ms/step
multi_gpu_model: gpu=3, perf = 24s 15ms/step
multi_gpu_model: gpu=4, perf = 22s 14ms/step

There is hardly any difference, because the CPU-side data augmentation is the real bottleneck.
If we turn data_augmentation off, the performance is as follows:

# data_augmentation = False
original_single: gpu=1, perf = 14s 286us/step
multi_gpu_model: gpu=2, perf = 16s 325us/step
multi_gpu_model: gpu=3, perf = 19s 389us/step
multi_gpu_model: gpu=4, perf = 22s 445us/step

We see that multi_gpu_model is never faster than a single GPU; it actually gets slower as more GPUs are added.
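
A minimal sketch of this setup (a small stand-in CNN rather than the full mnist_cnn.py model, with compile options in the spirit of the example):

    # Minimal sketch: a small stand-in CNN wrapped with multi_gpu_model before compiling.
    # Assumes 4 visible GPUs; everything else follows the shape of the mnist_cnn.py example.
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.utils import multi_gpu_model

    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(10, activation='softmax'),
    ])

    # The single added line: replicate the model across 4 GPUs.
    model = multi_gpu_model(model, gpus=4)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])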

spate141 commented Mar 2, 2018

@ghostplant @anj-s I'm having the same issue! #9502 Any updates on this?

mohapatras commented Mar 6, 2018

Any updates yet ? @ghostplant

ghostplant commented Mar 6, 2018

@mohapatras It seems that CPU-side data preprocessing can be one of the reasons that greatly slow down multi-GPU training. Have you tried disabling some preprocessing options, such as data augmentation, to see whether there is any speedup?

Besides, the current version of multi_gpu_model seems to benefit only large models such as Xception, where weight synchronization is not the bottleneck. When it wraps a simple model such as mnist_cnn or cifar10_cnn, weight synchronization happens very frequently and makes the whole run much slower.

I'm also working on a customized multi-GPU implementation to see whether anything works better.

ghostplant commented Mar 28, 2018

Hi, I found some explanations for this issue.

It seems that not all models benefit from multi_gpu_model.
Different models have different scalability because of the overhead of weight synchronization.

ResNetV1 and ResNetV2 are a typical pair of examples: ResNetV2 scales better than ResNetV1.

There is a balance between the cost of training one mini-batch and the cost of weight synchronization. InceptionV3 has a heavy computational cost per mini-batch but relatively few weights to synchronize, so it gains a decent boost from multi_gpu_model.

However, any model with a large Dense layer usually scales badly. mnist_mlp is such a case: it has a light computational cost per mini-batch, while its weights are too large to synchronize efficiently. In the mnist_mlp example, the time spent on one weight synchronization is enough for a single GPU to train MANY mini-batches, so mnist_mlp does not benefit from multi_gpu_model; its dense network design leads to bad scalability.

This also means that the same model trained on different GPU architectures can give a different answer about whether multi_gpu_model helps, since it largely depends on whether training one mini-batch on the GPU takes longer than one weight synchronization. So another conclusion is: the faster the GPU, the less likely multi_gpu_model is to boost your model.
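
A rough back-of-envelope illustration of this trade-off (my own sketch, not a measurement from this thread) is to compare how many bytes of gradients each model has to exchange per step:

    # Back-of-envelope: megabytes of float32 gradients exchanged per synchronization step.
    # Whether multi-GPU pays off depends on whether the per-mini-batch compute time is
    # large compared to the time needed to move this many bytes between CPU and GPUs.
    from keras.applications import InceptionV3
    from keras.models import Sequential
    from keras.layers import Dense

    mnist_mlp = Sequential([Dense(512, activation='relu', input_shape=(784,)),
                            Dense(512, activation='relu'),
                            Dense(10, activation='softmax')])
    inception = InceptionV3(weights=None)

    for name, m in [('mnist_mlp', mnist_mlp), ('InceptionV3', inception)]:
        grad_mb = m.count_params() * 4.0 / 1e6  # 4 bytes per float32 weight/gradient
        print('%-12s ~%.1f MB of gradients per sync' % (name, grad_mb))

    # mnist_mlp only moves a few MB, but its mini-batch is so cheap to compute that the
    # transfer dominates; InceptionV3 moves far more, yet each mini-batch costs so much
    # compute that the synchronization is easier to hide.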

ppwwyyxx commented Mar 28, 2018

It's true that different models have different challenges in scalability.
But also note that with ResNetV1, https://github.com/tensorflow/benchmarks can scale with >7.5x speedup on 8 P100s or V100s.

ghostplant commented Mar 29, 2018

@ppwwyyxx Are they testing a ResNetV1 with the same number of layers? Besides, if they use distributed TF as the backend, as in the link you added, maybe the benchmarks they provide are based on RDMA or other high-efficiency network fabrics, which would shorten the overhead of weight synchronization.

ppwwyyxx commented Mar 29, 2018

I was talking about ResNet50; that's what they mainly test with, e.g. https://www.tensorflow.org/performance/benchmarks.
I was not talking about a distributed backend. The TensorFlow code I linked to, as well as PyTorch, Caffe2, and MXNet, can all scale ResNet50 training to at least 7.5x on 8 P100s on a single machine, given good parameters (batch size, etc.).

ghostplant commented Mar 29, 2018

Using the ResNet50 provided by keras.applications, I can see nearly a 2x boost with 2 Tesla P100 GPUs. How about your benchmark?

ppwwyyxx commented Mar 29, 2018

@ghostplant could you share your code somewhere?

ghostplant commented Mar 29, 2018

    import numpy as np
    import tensorflow as tf
    from keras.applications import ResNet50
    from keras.utils import multi_gpu_model

    num_samples = 1000
    height = 224
    width = 224
    num_classes = 1000

    # Instantiate the base model with its weights on the CPU so the replicas share one copy.
    with tf.device('/cpu:0'):
        model = ResNet50(weights=None,
                         input_shape=(height, width, 3),
                         classes=num_classes)

    # Replicate the model on 2 GPUs; each replica trains on half of every batch.
    parallel_model = multi_gpu_model(model, 2)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop')

    # Dummy NHWC data.
    x = np.random.random((num_samples, height, width, 3))
    y = np.random.random((num_samples, num_classes))
    parallel_model.fit(x, y, epochs=20, batch_size=256)

ppwwyyxx commented Mar 29, 2018

Your code uses the NHWC image layout, which is slower than NCHW. As you've pointed out, the slower the code is, the better it scales.
https://www.tensorflow.org/performance/benchmarks shows 422 images/s for ResNet50 training on two P100s. So you should expect your code to finish each epoch (1000 samples) in 2.36s. I assume it's not the case.

ghostplant commented Mar 29, 2018

In my experiments with the NHWC format, it finishes training 1000 samples in 5 s/epoch on 2 GPUs and 3 s/epoch on 4 GPUs.
According to https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py, channels_last is said to give the best performance with the TensorFlow backend.

TristanJM commented Mar 29, 2018

@ghostplant Interesting conclusions regarding whether multi_gpu could benefit a model.

I'm also finding that training (on custom models) is taking longer with 2 or 3 Tesla K80s than just 1.

Am I correct in thinking that multi_gpu still has the advantage of more available GPU memory, so you can run more data through training or use larger batch sizes?

ghostplant commented Mar 30, 2018

@TristanJM Yes. Regardless of whether a model's computation gets a boost from multiple GPUs (computational scaling), another case where we have to use multi_gpu_model is when a single GPU cannot train the model with a large batch_size; the model then benefits from multi_gpu_model through memory-capacity scaling.

ppwwyyxx commented Apr 1, 2018

In my experiments, if using NHWC format, it can finish training 1000 samples in 5 sec/epoch using 2 GPUs, and 3 sec/epoch using 4 GPUs.

So it's already 2x slower than it should be, and then of course it scales better. The slower the model is, the better it scales.

it says that channels_last for Tensorflow will have the best performance.

This is definitely not true for TensorFlow on a P100. The cuDNN implementation on every GPU architecture before Volta favors NCHW over NHWC.
It may be true for Keras, however. For example, there is a recent performance fix in #8785 that makes it use a faster batchnorm kernel for NCHW. I wouldn't be surprised if NHWC was faster than NCHW before this PR, but the PR seems to suggest that NCHW is faster now.

ghostplant commented Apr 2, 2018

@ppwwyyxx I updated the channel mode to NCHW for ResNet50 (v1), and it is about 1 second per epoch faster (but only on 2 GPUs):

    import numpy as np
    import tensorflow as tf
    from keras.applications import ResNet50
    from keras.utils import multi_gpu_model

    # Requires "image_data_format": "channels_first" in ~/.keras/keras.json,
    # otherwise ResNet50 rejects the (3, height, width) input shape.
    num_samples = 1000
    height = 224
    width = 224
    num_classes = 1000

    with tf.device('/cpu:0'):
        model = ResNet50(weights=None,
                         input_shape=(3, height, width),
                         classes=num_classes)

    # Replicate the model on the 2 GPUs used for the timings below.
    parallel_model = multi_gpu_model(model, 2)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop')

    # Dummy NCHW data.
    x = np.random.random((num_samples, 3, height, width))
    y = np.random.random((num_samples, num_classes))
    parallel_model.fit(x, y, epochs=20, batch_size=256)

After the changes, it finishes training 1000 samples in 4 s/epoch on 2 GPUs, but still 3 s/epoch on 4 GPUs.

USTClj commented Apr 22, 2018

Hi, could you help me? I installed TensorFlow 1.4 and my Keras is 2.1.5. When I specify the input shape as NCHW, an error occurs. It seems that this format is not supported.

ghostplant commented Apr 22, 2018

Did you set channels_first in ~/.keras/keras.json?
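
For completeness, a short sketch of the two usual ways to select channels_first (standard Keras configuration, added here as a reminder):

    # Option 1: edit ~/.keras/keras.json so that "image_data_format" is "channels_first".
    # Option 2: set it programmatically before building the model:
    from keras import backend as K

    K.set_image_data_format('channels_first')
    assert K.image_data_format() == 'channels_first'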

USTClj commented Apr 22, 2018

Yes. I have changed it already.

USTClj commented Apr 22, 2018

The resnet50 in keras.applications contains:

    if K.image_data_format() == 'channels_first' and K.backend() == 'tensorflow':
        warnings.warn('You are using the TensorFlow backend, yet you ' ...

Which TensorFlow version do you use?

ghostplant commented Apr 22, 2018

My environment: pip3 install tensorflow-gpu==1.4.1 keras==2.1.5, and there is no such warning using the code I pasted above.

ghostplant commented May 17, 2018

@USTClj I think Keras' multi_gpu_model is inefficient in its input splitting. According to NVIDIA profiling, a Keras model using 4 GPUs pushes the full input batch to every GPU, which results in 4x more memory copies than the TensorFlow benchmark script. This is at least one bottleneck that reduces Keras' scalability.
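
For context, a simplified sketch of the get_slice-style splitting that multi_gpu_model performs (not Keras' exact code; the real helper also handles batches that do not divide evenly):

    # Simplified sketch of how multi_gpu_model shards a batch: the full input tensor is
    # fed once, and each replica slices out its own shard with a Lambda layer. Where this
    # slicing runs determines how much input data has to be copied to every GPU.
    import tensorflow as tf
    from keras.layers import Lambda

    def get_slice(data, i, parts):
        """Return the i-th of `parts` contiguous shards along the batch dimension."""
        shape = tf.shape(data)                 # dynamic shape, e.g. [batch, h, w, c]
        step = shape[:1] // parts              # shard size along the batch axis
        start = tf.concat([step * i, tf.zeros_like(shape[1:])], axis=0)
        size = tf.concat([step, shape[1:]], axis=0)
        return tf.slice(data, start, size)

    # Inside the wrapper, replica i of n wraps each model input roughly like this:
    # shard_i = Lambda(get_slice, arguments={'i': i, 'parts': n})(model_input)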

ghostplant commented May 17, 2018

It would be better if there were an example showing how to use TensorFlow native data tensors (TFRecord) as the input of a Keras multi_gpu_model, which might be faster than the current inefficient get_slice method.
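
A sketch of what that could look like for a single model (my own example, with synthetic data standing in for a parsed TFRecord pipeline; it assumes a Keras version that supports Input(tensor=...), target_tensors, and fitting with steps_per_epoch):

    # Feed the model from TensorFlow tensors instead of NumPy arrays, so the input
    # pipeline stays inside the TF graph. With real data, the Dataset would come from
    # tf.data.TFRecordDataset plus a map() that parses each record.
    import numpy as np
    import tensorflow as tf
    from keras.layers import Input, Dense
    from keras.models import Model

    x = np.random.random((1000, 784)).astype('float32')   # stand-in for parsed records
    y = np.random.random((1000, 10)).astype('float32')
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(64)
    x_batch, y_batch = dataset.make_one_shot_iterator().get_next()

    inputs = Input(tensor=x_batch)                         # model reads straight from the graph
    outputs = Dense(10, activation='softmax')(inputs)
    model = Model(inputs, outputs)
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  target_tensors=[y_batch])                # labels also come from the graph
    model.fit(epochs=1, steps_per_epoch=1000 // 64)        # no arrays passed to fit()

Wiring this up per replica under multi_gpu_model, so that each GPU gets its own iterator instead of a slice, is exactly the part that still lacks an example.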

ghostplant commented Jun 15, 2018

@ppwwyyxx I found some partial reasons why Keras is around 2x slower than TensorFlow when training ResNet50.
First, Keras' Conv2D adds a bias term to every convolution layer, which adds extra compute for the bias forward and backward passes.

Second, even when Keras is configured for the channels_first image format, it still performs extra GPU work to frequently swap between NCHW and NHWC. In other words, it is not computing entirely in NCHW, so a lot of tensor conversions add further overhead.

Third, since Keras 2.2.0, multi_gpu_model with cpu_merge=True splits the data in place before it is pushed to GPU memory, which slightly reduces the overhead from 2x to about 1.8x~1.9x.

There may still be other reasons I haven't found yet.
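
On the first point, if the per-layer bias really is redundant (e.g. when the convolution is immediately followed by batch normalization, as in the stock TensorFlow ResNet), it can be dropped in Keras with use_bias=False; a minimal sketch:

    # Conv -> BN -> ReLU block without the convolution bias; BatchNormalization's beta
    # plays the same role, so the extra bias forward/backward work is avoided.
    from keras.layers import Conv2D, BatchNormalization, Activation

    def conv_bn_relu(x, filters, kernel_size, strides=(1, 1)):
        x = Conv2D(filters, kernel_size, strides=strides,
                   padding='same', use_bias=False)(x)
        x = BatchNormalization(axis=1)(x)   # axis=1 assumes channels_first (NCHW) data
        return Activation('relu')(x)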

ppwwyyxx commented Jun 15, 2018

multi_gpu_model using cpu_merge=True will allow in-place data split before pushed to GPU memory

The optimal solution is not to split at all. As I put it in the tensorpack docs here:

Splitting a tensor for data-parallel training makes no sense at all, only to put unnecessary shape constraints on the data. By letting each GPU train on its own input tensors, they can train on inputs of different shapes simultaneously.

ghostplant commented Jun 15, 2018

The problem is that the full data lives on the CPU, and the GPUs cannot access it without the full batch being copied to each GPU, which causes heavy I/O between host and device. In my tests the split does more than just constrain the data shapes: PCIe traffic is largely reduced as well, and you can see a slight boost.

ghostplant commented Jun 15, 2018

I agree that if there is a way to feed the data without any splitting at all, it will be much more efficient.

majiali1995 commented Jun 27, 2018

I also experienced this problem yesterday. When I increased the batch_size, multi-GPU became faster than single-GPU. Maybe it is because increasing the batch_size makes the GPU computational cost larger, while the communication cost between CPU and GPU doesn't change.

mattdornfeld commented Aug 10, 2018

That's because you're looking at the wrong metric: seconds per step. It makes sense that this is slower than for non-distributed training, because distributed training does more work. Each GPU has to run backpropagation on its batch, send the gradients back to RAM, apply the gradients to the parameters stored in RAM, and finally sync the parameter values in RAM with those stored in GPU memory. Non-distributed training only has to apply gradients to the parameters stored in GPU memory, so it makes sense that a single batch takes less time.

Distributed training sees a speedup when you look at the number of global steps per second (i.e. the number of batches trained across all the GPU workers). That is what makes the model converge faster.

ppwwyyxx commented Aug 10, 2018

@mattdornfeld No. The batch is evenly split across the GPUs, so it's not the wrong metric.

manuelblancovalentin commented Sep 20, 2018

I also experienced this problem yesterday. When I increased the batch_size, multi-GPU became faster than single-GPU. Maybe it is because increasing the batch_size makes the GPU computational cost larger, while the communication cost between CPU and GPU doesn't change.

That's because Keras splits each batch across the GPUs used in multi_gpu_model. A model trained on a single GPU with a batch size of 64 will be approximately as fast as the same model trained with multi_gpu_model on 2 GPUs and the same batch size, because each GPU then only processes 32 samples at a time. So to compare results fairly you should multiply the original batch size by the number of GPUs used.

Thus, if 2 GPUs are used, your batch size should be 128. Keras will split each batch into 2 groups of 64 samples, one per GPU. That way processing should be ~2x faster than using a single GPU.
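
In other words (illustrative numbers only):

    # Keep each GPU as busy as in the single-GPU run by scaling the global batch size.
    n_gpus = 2
    per_gpu_batch = 64
    global_batch = per_gpu_batch * n_gpus   # 128; multi_gpu_model slices it into 2 x 64
    # parallel_model.fit(x, y, batch_size=global_batch, epochs=...)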

jianing-sun commented Nov 7, 2018

@TristanJM Yes. Regardless of whether a model's computation gets a boost from multiple GPUs (computational scaling), another case where we have to use multi_gpu_model is when a single GPU cannot train the model with a large batch_size; the model then benefits from multi_gpu_model through memory-capacity scaling.

Hello, I recently encountered an OOM (out of memory) error, so I used two GPUs with almost the same setup as the code you shared here (with tf.device('/cpu:0'), a parallel model, and gpus=2). But I still hit the same OOM error even with two GPUs. I don't understand how to spread the memory pressure across multiple GPUs. My assumption is that the Keras multi-GPU function basically just makes a replica of the model on the other GPU but doesn't address the memory problem. Do you know how to deal with the OOM error with multiple GPUs?

Thanks!
Jianing
