How to train on multi-GPUs when using fit_generator? #9502

Closed · Golbstein opened this issue Feb 27, 2018 · 31 comments
@Golbstein commented Feb 27, 2018

Hello guys,
I have opened a p3.8xlarge instance to benefit from its 4 GPUs, but training time improves only by a factor of 2 (x2) compared to my p3.2xlarge machine (not x4, and not even x3; I'm a bit disappointed).
I believe the data generator is slowing the process down. Is there a way to overcome this? I have directories with thousands of images (~30 GB), so I am compelled to use flow_from_directory.

Here's a sample from my code:



import tensorflow as tf
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras.utils import multi_gpu_model

batch_size = 64 * 4  # 64 samples per GPU across the 4 GPUs

def get_batches(dirname, gen=image.ImageDataGenerator(), shuffle=True, batch_size=8, class_mode='categorical',
                target_size=(256, 256), classes=None):
    return gen.flow_from_directory(dirname, target_size=target_size, classes=classes,
            class_mode=class_mode, shuffle=shuffle, batch_size=batch_size)

# Build the template model on the CPU, then replicate it across the 4 GPUs.
with tf.device('/cpu:0'):
    model = ResNet50(weights=None, include_top=True,
                     input_shape=(224, 224, 3),
                     classes=3)

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer=Adam(), metrics=['accuracy'])

filepath = '/home/ubuntu/efs/images/IndexNew/toSep/KRAS_RN_multiGPU.h5'
checkpointer = ModelCheckpoint(filepath=filepath, verbose=1, save_best_only=True, save_weights_only=True)
stop_train = EarlyStopping(monitor='val_acc', patience=7, verbose=1)
reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5,
            patience=2, min_lr=0.00001)

callbacks = [checkpointer, reduce_lr, stop_train]

path = '/home/ubuntu/efs/images/IndexNew/toSep/'
gen = image.ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
batches = get_batches(path + 'train', gen, batch_size=batch_size, shuffle=True, target_size=(224, 224))
val_batches = get_batches(path + 'valid', batch_size=batch_size, shuffle=False, target_size=(224, 224))
parallel_model.fit_generator(batches, steps_per_epoch=batches.n // batch_size, epochs=20,
                    validation_data=val_batches, validation_steps=val_batches.n // batch_size,
                    verbose=1, callbacks=callbacks, workers=8, use_multiprocessing=True)

Thanks!

@spate141 commented:

Correct me if I'm wrong, but if your batch_size=8, multiple GPUs won't give you much of a speedup, right? The docs at https://keras.io/utils/#multi_gpu_model say:

if your batch_size is 64 and you use gpus=2, then we will divide the input into 2 sub-batches of 32 samples, process each sub-batch on one GPU, then return the full batch of 64 processed samples.

This induces quasi-linear speedup on up to 8 GPUs.
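In other words, the generator should yield the full global batch, and that global batch should be a multiple of the GPU count. A small illustrative sketch (values hypothetical, not from this issue):

NUM_GPUS = 4
PER_GPU_BATCH = 64
batch_size = PER_GPU_BATCH * NUM_GPUS   # the generator yields 256 samples per step
assert batch_size % NUM_GPUS == 0       # multi_gpu_model splits this into 4 sub-batches of 64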

@Golbstein (Author) commented:

You're right, but I'm actually using batch_size = 64*4.
The default in my function is 8, but I override it when I call get_batches.

@spate141 commented:

max_queue_size: Integer. Maximum size for the generator queue. If unspecified, max_queue_size will default to 10.

max_queue_size in fit_generator() defaults to 10. Is your model consuming batches faster than your generator can produce them?
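For reference, max_queue_size, workers, and use_multiprocessing are all arguments of fit_generator itself. A hedged sketch, reusing the variable names from the snippet at the top of the issue and with illustrative values:

# Sketch: enlarge the batch queue and use several worker processes so CPU-side
# batch preparation can keep ahead of the GPUs.
parallel_model.fit_generator(
    batches,
    steps_per_epoch=batches.n // batch_size,
    epochs=20,
    max_queue_size=50,         # default is 10: number of prepared batches buffered in memory
    workers=8,                 # number of generator workers
    use_multiprocessing=True)  # worker processes instead of threads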

@spate141 commented Mar 2, 2018

@JeniaNovellusDx Did you find a solution for this? I just tried training a model on a p2.8xlarge with 8 GPUs, and I'm also not getting the expected speedup! In theory one epoch should take around ~3000 s, but it's taking almost 6x longer. I'm using batch_size = 512*8 so that each of the 8 GPUs gets a batch of 512 for training, and max_queue_size = 300 so that enough data is cached in memory to be consumed by all GPUs. I'm using workers=32 with a thread-safe generator, and I still can't see where the bottleneck is. Any thoughts? I tried lr=1e-3 with Adam, still no improvement :(

@Golbstein (Author) commented:

What should the number of workers be? I believe more is better, but how do I know precisely how many workers to use?

@spate141 commented Mar 2, 2018

workers: Integer. Maximum number of processes to spin up when using process based threading.

@spate141 commented Mar 2, 2018

New finding: I just spun up a p2.xlarge with 1 GPU, and one epoch takes approx. 8 hours to finish; on the other hand, 8 GPUs take approx. 7 hours! This doesn't make any sense!

CPU: 
Epoch 1/100
  501/523828 [..............................] - ETA: 35:47:40 - loss: 7.3692 - acc: 7.9918e-04
  
GPU:
Epoch 1/100
  536/523828 [..............................] - ETA: 8:39:52 - loss: 7.3663 - acc: 7.2513e-04
  
Multi-GPUs (8):
Epoch 1/100
 551/65478 [..............................] - ETA: 7:15:07 - loss: 7.3602 - acc: 7.8515e-04

UPDATE: There was a disk I/O bottleneck in my code. If possible, read each file only once! I solved it by keeping as much data as possible in memory (a sketch of the idea follows).
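A minimal sketch of that idea (not spate141's exact fix; the .npy file names are hypothetical, produced by a one-off preprocessing pass): decode the data once, keep the arrays in RAM, and have the generator index into them.

import numpy as np

def in_memory_batches(x_path='features.npy', y_path='labels.npy', batch_size=512):
    # Read the preprocessed arrays from disk exactly once...
    x = np.load(x_path)
    y = np.load(y_path)
    n = len(x)
    # ...then serve random batches from RAM forever, as fit_generator expects.
    while True:
        idx = np.random.randint(0, n, batch_size)
        yield x[idx], y[idx]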

@mohapatras commented:

@spate141 Any updates on this? I am facing the same issue on a GTX 1080 8 GB with TF 1.4.0 and Keras 2.1.3. I am using a single GPU and still hitting the problem.

@spate141 commented Mar 6, 2018

@mohapatras I don't think there is any issue with a single GPU; you basically get the speedup without making any changes to your code. If you are somehow not getting a boost, check how you pre-process the data before feeding it to your model. As I understand it, most pre-processing is done on the CPU, and if you are using a generator, disk I/O can be the main bottleneck.
P.S.: if you are using any methods/functions from Keras or another package for pre-processing, you can write your own version with NumPy (see the sketch below).
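As a concrete (hypothetical) example of that suggestion, a per-image Keras preprocessing call can often be replaced by a vectorised NumPy function applied to the whole batch at once:

import numpy as np

def preprocess_batch(batch):
    # batch: uint8 array of shape (N, H, W, C) in NHWC layout.
    # Scale to [-1, 1] and randomly flip half of the images horizontally,
    # all in vectorised NumPy rather than one call per image.
    batch = batch.astype('float32') / 127.5 - 1.0
    flip = np.random.rand(len(batch)) < 0.5
    batch[flip] = batch[flip, :, ::-1, :]   # reverse the width axis
    return batch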

@mohapatras commented:

I am reading the data from disk using a Keras DataGenerator, with 256 x 256 images: 56k in training and 2k in validation. It takes 6 hours/epoch, which is insane.

if you are using generator, disk i/o can be the main bottleneck.

Any workarounds regarding this ?

@ghostplant (Contributor) commented Mar 9, 2018

@mohapatras @spate141

Hi, can you share the benchmark difference between use_multiprocessing=True and use_multiprocessing=False? I just want to know whether you have the same issue.

From my experiments, I see no performance improvement with use_multiprocessing=True:

use_multiprocessing=False, workers=1, data=disk, gpu=4, perf=88s/epoch
use_multiprocessing=True, workers=8, data=disk, gpu=4, perf=89s/epoch

(The value of workers takes effect only when use_multiprocessing=True.)
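A rough way to measure that difference yourself (a sketch only; parallel_model and batches refer to the snippet at the top of the issue, and the step count is arbitrary):

import time

# Time one short epoch under each setting on the same generator.
for use_mp, n_workers in [(False, 1), (True, 8)]:
    start = time.time()
    parallel_model.fit_generator(batches, steps_per_epoch=100, epochs=1,
                                 workers=n_workers, use_multiprocessing=use_mp)
    print('use_multiprocessing=%s, workers=%d: %.1fs'
          % (use_mp, n_workers, time.time() - start))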

@spate141 commented:

@ghostplant Currently I'm not running the instance. My issue was solved by adjusting the way I was fetching data from disk with the generator and pre-processing it before feeding it to the GPUs.

  • I also noticed the same behavior: if I use a thread-safe generator (Proper way of making a data generator which can handle multiple workers #1638) with use_multiprocessing=True and workers=28 or 32, performance was actually worse than with the same thread-safe generator and use_multiprocessing=False.

  • From the nvidia-smi log, all 8 GPUs were at ~70% usage when multiprocessing was off; when I turned multiprocessing on, usage oscillated between 0-40% on all 8 GPUs.

I will post the exact log next time I start the instance.

Cheers!

@ghostplant (Contributor) commented:

Good news! So did you solve the bottleneck by putting some of the files into memory?

@Golbstein (Author) commented:

I launched a new EC2 p3.8xlarge with the following packages:
Keras 2.1.3
TensorFlow 1.5.0
cuDNN 7
CUDA 9

I also set the number of workers to 16 (cpu count // 2) with use_multiprocessing=True, and it performs well: a 3.4x speedup compared to 1 GPU (sketch below).
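A small sketch of that setting (a p3.8xlarge exposes 32 vCPUs, so cpu_count() // 2 gives 16):

import multiprocessing

# Half of the instance's vCPUs as generator workers.
workers = multiprocessing.cpu_count() // 2   # 16 on a p3.8xlarge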

@ppwwyyxx commented:

Benchmarking things separately always helps with understanding. You can start by benchmarking without worrying about the ImageDataGenerator part, for example:

import keras
keras.backend.set_image_data_format('channels_first')
from keras.applications.resnet50 import ResNet50
from keras.utils import np_utils
import numpy as np

NUM_GPU = 1
batch_size = 32 * NUM_GPU

img_rows, img_cols = 224, 224

X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32')
Y_train = np.random.random((batch_size,)).astype('int32')
Y_train = np_utils.to_categorical(Y_train, 1000)

def gen():
    while True:
        yield (X_train, Y_train)

model = ResNet50(weights=None, input_shape=X_train.shape[1:])

if NUM_GPU != 1:
    model = keras.utils.multi_gpu_model(model, gpus=NUM_GPU)

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit_generator(gen(), epochs=100, steps_per_epoch=50)

@spate141 commented Mar 12, 2018

@ghostplant Yes, I converted the data into a format that can be loaded into memory, and then with a little post-processing I managed to feed data to the model on 8 GPUs. All 8 GPUs were at ~70-75% usage, so I can say it was pretty much the disk I/O in my case.

@joelxiangnanchen commented Sep 17, 2018

@ppwwyyxx Hi, thanks for your example; could I discuss how the model is defined?
In your example you define the model on the GPU, not on the CPU as the Keras docs suggest:

model = ResNet50(weights=None, input_shape=X_train.shape[1:])
if NUM_GPU != 1: 
    model = keras.utils.multi_gpu_model(model, gpus=NUM_GPU)

Recently I found that after adding the CPU scope with tf.device("/cpu:0"):, something odd happens: my GPUs are idle for most of the training (e.g. usage shows lots of 0% and only an occasional 100%) :( Have you ever faced this problem?
This is the block that defines the model:

with tf.device('/cpu:0'):
    model_factory = ModelFactory()
    base_model, img_size = model_factory.get_model(conf)
    x = base_model.output
    x = Flatten(name="flat_last")(x)
    prediction = Dense(conf["num_classes"], activation='softmax', name='logits',
                                     kernel_initializer="he_normal")(x)
    model = Model(inputs=base_model.input, outputs=prediction)
    model.summary()
model = multi_gpu_model(model, FLAGS.gpus)

And GPU usage:

-----------------Params-----------------------
Collection interval is 10(s)
Check every 1.00(h)
The minimum GPU average usage is 10.00%
----------------------------------------------
Has 2 GPUs in task, should audit.
[2018-09-17 07:57:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:57:52 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:58:02 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:58:12 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:58:22 UTC] [GPU usage] [GPU0 10%] [GPU1 9%] 
[2018-09-17 07:58:32 UTC] [GPU usage] [GPU0 94%] [GPU1 82%] 
[2018-09-17 07:58:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:58:52 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:02 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:12 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:22 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:32 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 07:59:52 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:00:02 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:00:12 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 08:00:23 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:00:33 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:00:43 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:00:53 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:03 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:13 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:22 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:32 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:01:52 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:02 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:12 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:22 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:32 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:02:52 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:03:02 UTC] [GPU usage] [GPU0 14%] [GPU1 23%] 
[2018-09-17 08:03:12 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:03:22 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:03:32 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:03:42 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:03:52 UTC] [GPU usage] [GPU0 82%] [GPU1 100%] 
[2018-09-17 08:04:02 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:04:13 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:04:23 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:04:33 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:04:43 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:04:53 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:05:03 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:05:12 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 08:05:22 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 

@ghostplant Also, did you test the benchmark with and without the CPU scope?
Thanks everybody; this has confused me a lot. My environment is as follows:

  • tensorflow = 1.7.0
  • keras = tf.keras

@ppwwyyxx commented:

You certainly should not put the model on the CPU if you want to use a GPU.

@ghostplant (Contributor) commented:

@Hayao41 I did both kinds of benchmarks, and the GPUs are always working. However, the performance with tf.device('/cpu:0') is a bit faster.

@joelxiangnanchen commented:

@ppwwyyxx @ghostplant Thanks guys, but it doesn't work for me :( as shown above. If I drop the CPU scope, everything goes back to normal and the GPUs work hard.

-----------------Params-----------------------
Collection interval is 10(s)
Check every 1.00(h)
The minimum GPU average usage is 10.00%
----------------------------------------------
Has 2 GPUs in task, should audit.
[2018-09-17 06:43:06 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:43:16 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:43:26 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:43:36 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:43:46 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:43:56 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:44:06 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:44:16 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:44:27 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:44:37 UTC] [GPU usage] [GPU0 0%] [GPU1 0%] 
[2018-09-17 06:44:47 UTC] [GPU usage] [GPU0 86%] [GPU1 88%] 
[2018-09-17 06:44:57 UTC] [GPU usage] [GPU0 100%] [GPU1 98%] 
[2018-09-17 06:45:07 UTC] [GPU usage] [GPU0 100%] [GPU1 0%] 
[2018-09-17 06:45:17 UTC] [GPU usage] [GPU0 100%] [GPU1 45%] 
[2018-09-17 06:45:27 UTC] [GPU usage] [GPU0 94%] [GPU1 100%] 
[2018-09-17 06:45:37 UTC] [GPU usage] [GPU0 79%] [GPU1 11%] 
[2018-09-17 06:45:47 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:45:57 UTC] [GPU usage] [GPU0 64%] [GPU1 98%] 
[2018-09-17 06:46:07 UTC] [GPU usage] [GPU0 0%] [GPU1 100%] 
[2018-09-17 06:46:17 UTC] [GPU usage] [GPU0 100%] [GPU1 0%] 
[2018-09-17 06:46:27 UTC] [GPU usage] [GPU0 47%] [GPU1 43%] 
[2018-09-17 06:46:37 UTC] [GPU usage] [GPU0 100%] [GPU1 81%] 
[2018-09-17 06:46:47 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:46:57 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:47:07 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:47:16 UTC] [GPU usage] [GPU0 100%] [GPU1 99%] 
[2018-09-17 06:47:26 UTC] [GPU usage] [GPU0 100%] [GPU1 57%] 
[2018-09-17 06:47:36 UTC] [GPU usage] [GPU0 62%] [GPU1 44%] 
[2018-09-17 06:47:46 UTC] [GPU usage] [GPU0 68%] [GPU1 52%] 
[2018-09-17 06:47:56 UTC] [GPU usage] [GPU0 77%] [GPU1 82%] 
[2018-09-17 06:48:06 UTC] [GPU usage] [GPU0 59%] [GPU1 88%] 
[2018-09-17 06:48:16 UTC] [GPU usage] [GPU0 56%] [GPU1 43%] 
[2018-09-17 06:48:27 UTC] [GPU usage] [GPU0 78%] [GPU1 90%] 
[2018-09-17 06:48:37 UTC] [GPU usage] [GPU0 100%] [GPU1 60%] 
[2018-09-17 06:48:47 UTC] [GPU usage] [GPU0 100%] [GPU1 79%] 
[2018-09-17 06:48:57 UTC] [GPU usage] [GPU0 100%] [GPU1 50%] 
[2018-09-17 06:49:07 UTC] [GPU usage] [GPU0 100%] [GPU1 47%] 
[2018-09-17 06:49:17 UTC] [GPU usage] [GPU0 100%] [GPU1 0%] 
[2018-09-17 06:49:27 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:49:37 UTC] [GPU usage] [GPU0 100%] [GPU1 85%] 
[2018-09-17 06:49:47 UTC] [GPU usage] [GPU0 100%] [GPU1 0%] 
[2018-09-17 06:49:57 UTC] [GPU usage] [GPU0 100%] [GPU1 100%] 
[2018-09-17 06:50:07 UTC] [GPU usage] [GPU0 100%] [GPU1 87%] 
[2018-09-17 06:50:17 UTC] [GPU usage] [GPU0 100%] [GPU1 71%] 
[2018-09-17 06:50:27 UTC] [GPU usage] [GPU0 70%] [GPU1 40%] 
[2018-09-17 06:50:37 UTC] [GPU usage] [GPU0 78%] [GPU1 100%] 
[2018-09-17 06:50:47 UTC] [GPU usage] [GPU0 100%] [GPU1 80%]

@ghostplant (Contributor) commented:

@Hayao41 I am not using tf.keras, which might be older than the mainstream Keras version; I use the mainstream Keras installed from pip.

@joelxiangnanchen commented:

@ghostplant That may be the culprit; I will test against the mainstream version. Thanks very much!

@xuzheyuan624 commented Mar 17, 2019

I am using tf.keras and tf.keras.utils.Sequence to load data. I also rewrote my code in PyTorch and found that the Keras version is much slower. I want to know whether the reason is that torch.utils.data.Dataset loads data faster than tf.keras.utils.Sequence.

@ghostplant (Contributor) commented:

@xuzheyuan624 Why Keras multi-GPU training is slow:

  1. Each GPU should have its own data generator;
  2. the data generator should store image inputs in pinned host memory allocated by cuMemHostAlloc();
  3. all-reduce should be used instead of a parameter server.

If these three problems are solved, Keras multi_gpu_model can be as fast as the official TensorFlow CNN benchmark (see the sketch below).
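For readers on newer TensorFlow versions, these three points are roughly what tf.distribute.MirroredStrategy with a tf.data input pipeline provides (per-replica input, prefetching, NCCL all-reduce instead of a parameter server). A minimal TF 2.x sketch, not something available in the Keras releases discussed in this issue; the 'train' directory is hypothetical:

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and aggregates
# gradients with all-reduce rather than a parameter server.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=3)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# tf.data handles the input pipeline: one global batch per step, prefetched
# so the host stays ahead of the GPUs.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    'train', image_size=(224, 224),
    batch_size=64 * strategy.num_replicas_in_sync)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

model.fit(dataset, epochs=20)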

@xuzheyuan624 commented:

@ghostplant So just using keras.utils.multi_gpu_model and setting use_multiprocessing=True in fit_generator is not enough to run fast? Are there any examples that solve this problem?

@ghostplant (Contributor) commented:

@xuzheyuan624 Yes, it is just a workaround that looks simple. For the fastest solution, you have to modify a lot of code related to input processing and gradient updates.

@xuzheyuan624 commented:

@ghostplant OK, that sounds too difficult for me to speed up the code with Keras. I will train my model in PyTorch and then transfer its weights to Keras.

@ppwwyyxx commented Mar 17, 2019

You can also use tensorpack which includes built-in fast multi-GPU solution. Some comparison scripts here.

@NiksanJP commented Feb 7, 2020

Hey guys!

Just found the solution, and it has been running properly, haha!

We cannot use a traditional hand-rolled generator, because the GPUs may finish their sub-batches at different times and request data concurrently, so a plain generator is NOT thread-safe. The generator needs to be a Sequence (tf.keras.utils.Sequence), which is thread-safe.

I am running my model on a g3.16xlarge because of the size of the model and the training set!

The generator is now a class, not a method:
import random
import pandas as pd
import tensorflow as tf

class traingen(tf.keras.utils.Sequence):

    def __init__(self, batchSize):
        self.dataset = pd.read_csv('mydataset.csv')
        self.batchSize = batchSize

    def __len__(self):
        return self.dataset.shape[0] // self.batchSize

    def getLen(self):
        return self.dataset.shape[0] // self.batchSize

    def __getitem__(self, idx):
        # Returns a random batch of rows; in a real setup this would be split
        # into (inputs, targets) before being returned.
        rows = random.sample(range(0, self.dataset.shape[0]), self.batchSize)
        return self.dataset.iloc[rows]

The following should work when invoked like this:

trainSeq = traingen(batchSize)
history = model.fit(trainSeq, epochs=5, steps_per_epoch=trainSeq.getLen(), verbose=1, use_multiprocessing=True, workers=32)

This works for 4-GPU parallel training!
Good luck, y'all!

@kattan1969 commented:

Hi, I went through the whole discussion, and the takeaway is that you need to use a data generator based on keras.utils.Sequence inside fit_generator, set the batch size to be divisible by the number of available GPUs, and use use_multiprocessing=True. This didn't work in my case. I'm training a sequence-to-sequence NLP model; the data is sentences (text) with source and target. With multiprocessing on 4 GPUs, training time is still about 50% of that of a single GPU. Something is wrong, and from my experience it's related to Keras. I'm using TensorFlow GPU v1.15.3 (the latest before v2.0) and Keras v2.1.5 (I tried newer versions; no change).
I have no reported errors or warnings. I have 20+ years of experience in programming/A.I., and I can say that this Python-based framework is lacking, not easy to work with, and makes things hard. Documentation is poor, and they keep rolling out new versions that deprecate former methods and classes. That is simply BAD, as you keep modifying your source and facing the ugly reality of dependency issues.
Based on Amdahl's law, there should be a noticeable speedup, as this task is certainly parallelizable. I believe it's a Keras issue.
