Keras TensorFlow w/ GPU - fit_generator and multi_gpu_model #10708

Closed · ghost opened this issue Jul 17, 2018 · 2 comments

@ghost commented Jul 17, 2018

Hello,

Running Keras 2.2.0 + TensorFlow 1.9.0 w/ GPU

I'm having what is probably a simple issue with fit_generator and multi_gpu_model that I've yet to understand. I'm using the Keras PointNet implementation with some small modifications, and a custom data generator that does some basic preprocessing and feeds fit_generator. It all works fine when I don't use multi_gpu_model.
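For reference, the training setup is essentially the following (a minimal sketch, not my exact code; the generator names, label count, optimizer, and loss are placeholders):

from keras.utils import multi_gpu_model

# build_pointnet_model is defined below; train_generator is a hypothetical
# stand-in for my custom preprocessing generator
model = build_pointnet_model(input_shape=(2560, 3), n_labels=2)
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
parallel_model.fit_generator(train_generator,
                             steps_per_epoch=100,
                             epochs=10,
                             workers=4)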

With multi_gpu_model enabled (2 GPUs), I get this traceback:

Traceback (most recent call last):
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-be732ed3db25>", line 1, in <module>
    machinelearn.current_working_aws_build(traindata_pth=r'F:\NN\NRT6_F00632\traindata_pts', valdata_pth=r'F:\NN\NRT6_F00632\valdata_pts', multi_gpu=True, modeltype='pointnet', stepsperepoch=100, epochs=10)
  File "C:\Users\eric.g.younkin\PycharmProjects\machinelearn\machinelearn.py", line 878, in current_working_aws_build
    workers=4)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training_generator.py", line 191, in fit_generator
    class_weight=class_weight)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1220, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2661, in __call__
    return self._call(inputs)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2631, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\client\session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0].dim(0) and In[1].dim(0) must be the same: [4,2560,3] vs [8,3,3]
	 [[Node: replica_0/PointNet/lambda_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/lambda_4/Slice, reshape_1/Reshape)]]
	 [[Node: training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1/_1415 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_5623_training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

I interpret this to mean that the inputs to the mat_mul Lambda layer are receiving different sample counts (the first part of the network is shown below). In the run that produced the traceback above, I had a batch of shape (8, 2560, 3) with 2 GPUs, which I believe means each replica should see an input of (4, 2560, 3). By the same logic I would expect the output of the input_T layer to be (4, 3, 3), not the (8, 3, 3) the traceback suggests it is actually seeing.

Is there something I don't understand about how multi_gpu_model works? I thought I'd verified that the input batch is split by the number of GPUs, but maybe there is something else going on?

Thanks in advance.

import numpy as np
import tensorflow as tf
from keras.layers import (Input, Convolution1D, BatchNormalization,
                          MaxPooling1D, Dense, Reshape, Lambda)


def mat_mul(a, b):
    return tf.matmul(a, b)


def build_pointnet_model(input_shape, n_labels, output_mode="softmax"):
    inputs = Input(shape=input_shape)

    # input transformation net: predicts a 3x3 alignment matrix per sample
    conv_1 = Convolution1D(64, 1, activation='relu', input_shape=(input_shape[0], 3))(inputs)
    conv_1 = BatchNormalization()(conv_1)
    conv_2 = Convolution1D(128, 1, activation='relu')(conv_1)
    conv_2 = BatchNormalization()(conv_2)
    conv_3 = Convolution1D(1024, 1, activation='relu')(conv_2)
    conv_3 = BatchNormalization()(conv_3)
    pool_1 = MaxPooling1D(pool_size=input_shape[0])(conv_3)
    dense_1 = Dense(512, activation='relu')(pool_1)
    dense_1 = BatchNormalization()(dense_1)
    dense_2 = Dense(256, activation='relu')(dense_1)
    dense_2 = BatchNormalization()(dense_2)
    # weights initialized so the transform starts as the 3x3 identity
    dense_3 = Dense(9, weights=[np.zeros([256, 9]), np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32)])(dense_2)
    input_T = Reshape((3, 3))(dense_3)

    # forward net: apply the learned transform to the input points
    # (this is the Lambda layer that fails under multi_gpu_model)
    matmul_1 = Lambda(mat_mul, arguments={'b': input_T})(inputs)
    conv_4 = Convolution1D(64, 1, input_shape=(input_shape[0], 3), activation='relu')(matmul_1)
    conv_4 = BatchNormalization()(conv_4)
    conv_5 = Convolution1D(64, 1, input_shape=(input_shape[0], 3), activation='relu')(conv_4)
    conv_5 = BatchNormalization()(conv_5)
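One thing I wondered about, though this is only a guess: multi_gpu_model re-calls the model once per GPU on a per-replica slice of the inputs, but a tensor passed through the Lambda layer's arguments dict is captured once at build time and may never get re-wired to the slice. That would give mat_mul the sliced a (batch 4) against the full-batch input_T (batch 8), which is exactly the [4,2560,3] vs [8,3,3] mismatch in the traceback. If so, passing both tensors as real inputs to the Lambda might avoid it, something like:

# hypothetical rewrite: feed both tensors as Lambda inputs so that
# multi_gpu_model slices them together, rather than freezing input_T
# into the layer's arguments
matmul_1 = Lambda(lambda t: tf.matmul(t[0], t[1]))([inputs, input_T])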

@ghost (Author) commented Jul 17, 2018

Hello again,

I believe I found an answer of sorts, though it mostly leaves me with more questions. All my tests seem to indicate that the second tensor I pass to my Lambda layer is not being split across the GPU replicas, so the batch shapes disagree by the time they reach the Lambda layer. The good news is that I think I should have been using the Dot layer all along. I ran a few tests, see below:

import tensorflow as tf
from keras import backend as K
from keras.layers import Dot

a = K.random_uniform_variable(shape=(4, 2560, 3), low=0, high=1)
b = K.random_uniform_variable(shape=(4, 3, 3), low=0, high=1)

K.dot(a, b)
Out[28]: <tf.Tensor 'Reshape_9:0' shape=(4, 2560, 4, 3) dtype=float32>
Dot(axes=2)([a, b])
Out[29]: <tf.Tensor 'dot_7/MatMul:0' shape=(4, 2560, 3) dtype=float32>
tf.matmul(a, b)
Out[30]: <tf.Tensor 'MatMul_4:0' shape=(4, 2560, 3) dtype=float32>

Now I could be way off, but keras.backend.dot and the Dot layer clearly give different results here. On closer reading, the keras.backend.dot output is actually consistent with the documented Theano-style behavior, which states that:

When attempting to multiply a nD tensor with a nD tensor, it reproduces the Theano behavior. (e.g. (2, 3) * (4, 3, 5) -> (2, 4, 5))

By that rule, (4, 2560, 3) dotted with (4, 3, 3) sums over the last axis of a and the second-to-last axis of b, leaving (4, 2560) + (4,) + (3,) = (4, 2560, 4, 3), which is exactly what I got. The catch is that this generalized dot is not the batched matrix multiply I need. Both tf.matmul and the Dot layer produce the expected (4, 2560, 3), so I'm going to continue with the Dot layer and see if I get the results I expect.
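In the model itself, the swap would look something like the sketch below (untested against the full network, so treat it as an assumption). One caveat: Dot(axes=2) contracts the last axis of both tensors, which amounts to multiplying by the transpose of the 3x3 matrix; Dot(axes=(2, 1)) matches tf.matmul(a, b) exactly:

from keras.layers import Dot

# replaces: matmul_1 = Lambda(mat_mul, arguments={'b': input_T})(inputs)
# axes=(2, 1) contracts the last axis of the points (size 3) against the
# row axis of the 3x3 transform, i.e. batched inputs @ input_T
matmul_1 = Dot(axes=(2, 1))([inputs, input_T])

Since input_T is now a real second input to the layer rather than a frozen argument, multi_gpu_model should slice it together with inputs instead of leaving it at the full batch size.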

@DineshChandra94 commented

Did you find any solution to this error?
