How to train on multi-GPUs when using fit_generator? #9502
Correct me if I'm wrong, but if your batch_size=8, multiple GPUs won't give you much of a speedup, right? It says as much here: https://keras.io/utils/#multi_gpu_model
You're right, but I'm actually using batch_size = 64*4.
@JeniaNovellusDx Did you find any solution for this? I just tried training a model on …
What's the right number of workers? I believe the more the better, but how do I know precisely how many workers I should use?
From the docs: workers: Integer. Maximum number of processes to spin up when using process-based threading.
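As a starting point, a common heuristic ties the worker count to the machine's CPU count. A minimal runnable sketch, assuming Keras 2.x; the tiny model and dummy generator are placeholders, not anyone's real pipeline:

```python
import multiprocessing
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Rule of thumb: one worker per physical core. cpu_count() reports
# logical cores, so halving it is a reasonable starting point.
n_workers = max(1, multiprocessing.cpu_count() // 2)

def dummy_gen(batch_size=32):
    # Stand-in for a real data generator. Note: with a plain Python
    # generator, use_multiprocessing=True can duplicate data across
    # worker processes; a keras.utils.Sequence (discussed later in
    # this thread) is the safe choice.
    x = np.random.random((batch_size, 16)).astype('float32')
    y = np.random.randint(2, size=(batch_size, 1))
    while True:
        yield x, y

model = Sequential([Dense(1, activation='sigmoid', input_shape=(16,))])
model.compile(loss='binary_crossentropy', optimizer='sgd')

model.fit_generator(
    dummy_gen(),
    steps_per_epoch=100,
    epochs=2,
    workers=n_workers,
    use_multiprocessing=True,  # process-based workers sidestep the GIL
    max_queue_size=10,         # batches pre-fetched ahead of the model
)
```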
New finding: I just spun up a p2.xlarge with 1 GPU, and one epoch takes approx. 8 hours to finish; meanwhile, 8 GPUs take approx. 7 hours! This doesn't make any sense!
UPDATE: There was a disk I/O bottleneck in my code. If possible, only read from a file once! I solved it by keeping as much data as possible in memory.
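To illustrate the fix being described, here is a hypothetical sketch of the two patterns, assuming the data lives in `.npy` files (the paths and file format are illustrative):

```python
import numpy as np

# Slow pattern: the disk is hit again on every epoch (I/O bound).
def gen_from_disk(paths, batch_size):
    while True:
        for i in range(0, len(paths), batch_size):
            batch = [np.load(p) for p in paths[i:i + batch_size]]  # re-read each time
            yield np.stack(batch)

# Faster pattern: pay the I/O cost once up front, then serve from memory.
def gen_from_memory(paths, batch_size):
    data = np.stack([np.load(p) for p in paths])  # single pass over the disk
    while True:
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]
```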
@spate141 Any updates on this? I am facing the same issue on a 1080 GTX 8 GB with TF 1.4.0 and Keras 2.1.3. I am using a single GPU and still seeing the problem.
@mohapatras I don't think there is any issue with a single GPU; you basically get the speedup without making any changes to your code. If you are somehow not getting a boost, check how you pre-process the data before feeding it to your model. As I understand it, most pre-processing is done on the CPU, and if you are using a generator, disk I/O can be the main bottleneck.
I am reading the data from disk using a DataGenerator in Keras. I am working with 256 x 256 images, 56k in training and 2k in validation. It takes 6 hours/epoch, which is insane.
Any workarounds for this?
Hi, can you share the benchmark difference between your configurations? From my experiments, I see no performance improvement:
use_multiprocessing=False, workers=1, data=disk, gpu=4, perf=88s/epoch
use_multiprocessing=True, workers=8, data=disk, gpu=4, perf=89s/epoch
@ghostplant Currently I'm not running the instance. My issue was solved by adjusting the way I fetch data from disk with the generator and pre-processing it before feeding it to the GPU.
I will post the exact log next time I start the instance. Cheers!
Good news! So did you solve the bottleneck by loading some of the files into memory?
I launched a new EC2 p3.8xlarge with the following packages: … I also set the number of workers to 16 (CPU count // 2) with use_multiprocessing=True, and it performed well, at about a 3.4x speedup compared to 1 GPU.
Benchmarking things separately always helps with understanding. You can benchmark without worrying about the ImageDataGenerator part first, for example:

```python
import keras
keras.backend.set_image_data_format('channels_first')
from keras.applications.resnet50 import ResNet50
from keras.utils import np_utils
import numpy as np

NUM_GPU = 1
batch_size = 32 * NUM_GPU
img_rows, img_cols = 224, 224

# Synthetic data: removes disk I/O and pre-processing from the measurement.
X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32')
Y_train = np.random.randint(1000, size=(batch_size,))  # random class indices
Y_train = np_utils.to_categorical(Y_train, 1000)

def gen():
    while True:
        yield (X_train, Y_train)

model = ResNet50(weights=None, input_shape=X_train.shape[1:])
if NUM_GPU != 1:
    model = keras.utils.multi_gpu_model(model, gpus=NUM_GPU)
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit_generator(gen(), epochs=100, steps_per_epoch=50)
```
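(Setting NUM_GPU=4 exercises multi_gpu_model on the same synthetic batch; comparing the seconds/epoch Keras reports against the NUM_GPU=1 run isolates GPU scaling from any input-pipeline effects.)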
@ghostplant Yes, I converted the data into a format that can be loaded into memory, and with a little post-processing I managed to feed data to the model on 8 GPUs. Usage on all 8 GPUs was ~70-75%, so I can say it was pretty much disk I/O in my case.
@ppwwyyxx Hi, thanks for your example. Could I discuss something about how the model is defined?
Recently, I found that after adding the CPU scope with … And GPU usage: …
@ghostplant Also, did you benchmark with and without the CPU scope?
You certainly should not put the model on the CPU if you want to use a GPU.
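For context, the "CPU scope" under discussion appears to be the pattern from the Keras multi_gpu_model docstring, where the template model is instantiated under a CPU device scope so its weights live in host memory and replicas are created on each GPU. A minimal sketch, assuming TF 1.x, Keras 2.x, and a 4-GPU machine:

```python
import tensorflow as tf
import keras
from keras.applications.resnet50 import ResNet50

# Documented pattern: build the template model on the CPU, then
# replicate it across the GPUs for data-parallel training.
with tf.device('/cpu:0'):
    template = ResNet50(weights=None, input_shape=(224, 224, 3))

parallel = keras.utils.multi_gpu_model(template, gpus=4)
parallel.compile(loss='categorical_crossentropy', optimizer='sgd')
```

Whether this CPU scope helps or hurts depends on how much weight traffic crosses the host-device boundary, which is presumably why the benchmarks above differ with and without it.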
@Hayao41 I did both kinds of benchmarks, and the GPUs are always working. However, the performance of using …
@ppwwyyxx @ghostplant Thanks guys, but it doesn't work for me :( as shown above. If I drop the CPU scope, everything goes back to normal and the GPUs work hard.
@Hayao41 I am not using …
@ghostplant That may be the problem spot; I will go test it on the mainline version. Thanks very much!
I am using …
@xuzheyuan624 Why Keras multi-GPU is slow: …
If the above 3 problems were solved, Keras multi_gpu could match the performance of the official TensorFlow CNN benchmark.
@ghostplant So should I just use … ?
@xuzheyuan624 Yes, it is just a workaround that happens to look simple. For the fastest solution, you would have to modify a lot of code related to input processing and gradient updates.
@ghostplant OK, it sounds too difficult for me to speed up the code with Keras. I will train my model in PyTorch and then transfer its weights to Keras.
You can also use tensorpack, which includes a built-in fast multi-GPU solution. There are some comparison scripts here.
Hey guys! Just found the solution, and it has been running properly, haha! We cannot use a traditional self-made generator: the GPUs may finish their batches at different times, and multiple workers then advance the same shared generator concurrently, so it is NOT thread-safe. The generator needs to be a Sequence (tf.keras.utils.Sequence), which is thread-safe. I am running my model on g3.16xlarge due to the size of the model and the size of the training set! The generator is now a class, not a method!
The following code should work when inserted: …
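The snippet itself did not survive in the thread; below is a minimal sketch of what a thread-safe Sequence-based generator looks like. The class name, array shapes, and batch size are all illustrative:

```python
import math
import numpy as np
from tensorflow.keras.utils import Sequence

class BatchSequence(Sequence):
    """Thread-safe data feeder: batches are addressed by index, so
    multiple workers can fetch different batches concurrently without
    sharing mutable generator state."""

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        return (self.x[lo:lo + self.batch_size],
                self.y[lo:lo + self.batch_size])

# Illustrative usage with random data:
X = np.random.random((1024, 32)).astype('float32')
y = np.random.randint(2, size=(1024, 1))
seq = BatchSequence(X, y, batch_size=64)
# model.fit_generator(seq, epochs=10, workers=8, use_multiprocessing=True)
```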
Hi, I went through all the discussions, and the takeaway is that you need to use a data generator based on keras.utils.Sequence with fit_generator, set the batch size to be divisible by the number of available GPUs, and use use_multiprocessing=True. This didn't work in my case. I'm training a sequence-to-sequence NLP model. The data is sentences (text) with source and target. Multiprocessing on 4 GPUs still runs at 50% of the speed of a single GPU. Something is wrong, and from my experience it's related to Keras. I'm using TensorFlow GPU v1.15.3 (latest before v2.0) and Keras v2.1.5 (I tried newer ones, no change).
Hello guys,
I opened a p3.8xlarge instance to benefit from its 4 GPUs, but the training time improves only by a factor of 2 (x2) compared to my p3.2xlarge machine (not x4, and not even x3... I'm a bit disappointed).
I believe it's the data generator that slows down the process. Is there a way to overcome this issue? I have directories with thousands of images (~30GB), so I'm compelled to use the flow_from_directory method.
Here's a sample from my code: …
Thanks!
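The code sample did not survive extraction; below is a minimal sketch of the kind of flow_from_directory setup being described. The directory path, image size, batch size, and class mode are placeholders:

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)

train_flow = datagen.flow_from_directory(
    'data/train',            # hypothetical path: one subfolder per class
    target_size=(224, 224),
    batch_size=128,          # keep this divisible by the number of GPUs
    class_mode='categorical',
)

# flow_from_directory returns a Sequence-backed iterator, so it can be
# consumed with multiple workers:
# model.fit_generator(train_flow, epochs=10, workers=8,
#                     use_multiprocessing=True)
```

Even so, decoding thousands of JPEGs per epoch is CPU- and disk-bound, which is consistent with the sublinear scaling reported above.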