Keras TensorFlow w/ GPU - fit_generator and multi_gpu_model #10708

Closed · ghost opened this issue Jul 17, 2018 · 2 comments

@ghost commented Jul 17, 2018

Hello,

Running Keras 2.2.0 + TensorFlow 1.9.0 w/ GPU

I'm having what is probably a simple issue with fit_generator and multi_gpu_model that I've yet to understand. I'm using the Keras PointNet implementation with some small modifications, and a custom data generator that does some basic preprocessing and feeds fit_generator. It all works fine when I don't use multi_gpu_model.
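For reference, the training setup is essentially the following (a minimal sketch, not my exact code; the generator names, label count, optimizer, and loss are placeholders):

from keras.utils import multi_gpu_model

# build_pointnet_model is defined below; train_generator is a hypothetical
# stand-in for my custom preprocessing generator
model = build_pointnet_model(input_shape=(2560, 3), n_labels=2)
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
parallel_model.fit_generator(train_generator,
                             steps_per_epoch=100,
                             epochs=10,
                             workers=4)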

With multi_gpu_model enabled (2 GPUs), I get this traceback:

Traceback (most recent call last):
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-be732ed3db25>", line 1, in <module>
    machinelearn.current_working_aws_build(traindata_pth=r'F:\NN\NRT6_F00632\traindata_pts', valdata_pth=r'F:\NN\NRT6_F00632\valdata_pts', multi_gpu=True, modeltype='pointnet', stepsperepoch=100, epochs=10)
  File "C:\Users\eric.g.younkin\PycharmProjects\machinelearn\machinelearn.py", line 878, in current_working_aws_build
    workers=4)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training_generator.py", line 191, in fit_generator
    class_weight=class_weight)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1220, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2661, in __call__
    return self._call(inputs)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2631, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\client\session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0].dim(0) and In[1].dim(0) must be the same: [4,2560,3] vs [8,3,3]
	 [[Node: replica_0/PointNet/lambda_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/lambda_4/Slice, reshape_1/Reshape)]]
	 [[Node: training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1/_1415 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_5623_training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]

I interpret this to mean that the inputs to the mat_mul Lambda layer are receiving different sample counts (the first part of the network is shown below). In the run that produced the traceback above, I had a batch of shape (8, 2560, 3) with 2 GPUs, which I believe means each replica should see an input of (4, 2560, 3). By the same logic I would expect the output of the input_T layer to be (4, 3, 3), not the (8, 3, 3) the traceback suggests it is actually seeing.

Is there something I don't understand about how multi_gpu_model works? I thought I'd verified that the input batch is split by the number of GPUs, but maybe there is something else going on?

Thanks in advance.

import numpy as np
import tensorflow as tf
from keras.layers import (Input, Convolution1D, BatchNormalization,
                          MaxPooling1D, Dense, Reshape, Lambda)


def mat_mul(a, b):
    return tf.matmul(a, b)


def build_pointnet_model(input_shape, n_labels, output_mode="softmax"):
    inputs = Input(shape=input_shape)

    # input transformation net: predicts a 3x3 alignment matrix per sample
    conv_1 = Convolution1D(64, 1, activation='relu', input_shape=(input_shape[0], 3))(inputs)
    conv_1 = BatchNormalization()(conv_1)
    conv_2 = Convolution1D(128, 1, activation='relu')(conv_1)
    conv_2 = BatchNormalization()(conv_2)
    conv_3 = Convolution1D(1024, 1, activation='relu')(conv_2)
    conv_3 = BatchNormalization()(conv_3)
    pool_1 = MaxPooling1D(pool_size=input_shape[0])(conv_3)
    dense_1 = Dense(512, activation='relu')(pool_1)
    dense_1 = BatchNormalization()(dense_1)
    dense_2 = Dense(256, activation='relu')(dense_1)
    dense_2 = BatchNormalization()(dense_2)
    # weights initialized so the transform starts as the 3x3 identity
    dense_3 = Dense(9, weights=[np.zeros([256, 9]), np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32)])(dense_2)
    input_T = Reshape((3, 3))(dense_3)

    # forward net: apply the learned transform to the input points
    # (this is the Lambda layer that fails under multi_gpu_model)
    matmul_1 = Lambda(mat_mul, arguments={'b': input_T})(inputs)
    conv_4 = Convolution1D(64, 1, input_shape=(input_shape[0], 3), activation='relu')(matmul_1)
    conv_4 = BatchNormalization()(conv_4)
    conv_5 = Convolution1D(64, 1, input_shape=(input_shape[0], 3), activation='relu')(conv_4)
    conv_5 = BatchNormalization()(conv_5)
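One thing I wondered about, though this is only a guess: multi_gpu_model re-calls the model once per GPU on a per-replica slice of the inputs, but a tensor passed through the Lambda layer's arguments dict is captured once at build time and may never get re-wired to the slice. That would give mat_mul the sliced a (batch 4) against the full-batch input_T (batch 8), which is exactly the [4,2560,3] vs [8,3,3] mismatch in the traceback. If so, passing both tensors as real inputs to the Lambda might avoid it, something like:

# hypothetical rewrite: feed both tensors as Lambda inputs so that
# multi_gpu_model slices them together, rather than freezing input_T
# into the layer's arguments
matmul_1 = Lambda(lambda t: tf.matmul(t[0], t[1]))([inputs, input_T])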

@ghost (Author) commented Jul 17, 2018

Hello again,

I believe I found an answer of sorts, though it mostly leaves me with more questions. All my tests seem to indicate that the second tensor I pass to my Lambda layer is not being split across the GPU replicas, so the batch shapes disagree by the time they reach the Lambda layer. The good news is that I think I should have been using the Dot layer all along. I ran a few tests, see below:

import tensorflow as tf
from keras import backend as K
from keras.layers import Dot

a = K.random_uniform_variable(shape=(4, 2560, 3), low=0, high=1)
b = K.random_uniform_variable(shape=(4, 3, 3), low=0, high=1)

K.dot(a, b)
Out[28]: <tf.Tensor 'Reshape_9:0' shape=(4, 2560, 4, 3) dtype=float32>
Dot(axes=2)([a, b])
Out[29]: <tf.Tensor 'dot_7/MatMul:0' shape=(4, 2560, 3) dtype=float32>
tf.matmul(a, b)
Out[30]: <tf.Tensor 'MatMul_4:0' shape=(4, 2560, 3) dtype=float32>

Now I could be way off, but keras.backend.dot and the Dot layer clearly give different results here. On closer reading, the keras.backend.dot output is actually consistent with the documented Theano-style behavior, which states that:

When attempting to multiply a nD tensor with a nD tensor, it reproduces the Theano behavior. (e.g. (2, 3) * (4, 3, 5) -> (2, 4, 5))

By that rule, (4, 2560, 3) dotted with (4, 3, 3) sums over the last axis of a and the second-to-last axis of b, leaving (4, 2560) + (4,) + (3,) = (4, 2560, 4, 3), which is exactly what I got. The catch is that this generalized dot is not the batched matrix multiply I need. Both tf.matmul and the Dot layer produce the expected (4, 2560, 3), so I'm going to continue with the Dot layer and see if I get the results I expect.
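In the model itself, the swap would look something like the sketch below (untested against the full network, so treat it as an assumption). One caveat: Dot(axes=2) contracts the last axis of both tensors, which amounts to multiplying by the transpose of the 3x3 matrix; Dot(axes=(2, 1)) matches tf.matmul(a, b) exactly:

from keras.layers import Dot

# replaces: matmul_1 = Lambda(mat_mul, arguments={'b': input_T})(inputs)
# axes=(2, 1) contracts the last axis of the points (size 3) against the
# row axis of the 3x3 transform, i.e. batched inputs @ input_T
matmul_1 = Dot(axes=(2, 1))([inputs, input_T])

Since input_T is now a real second input to the layer rather than a frozen argument, multi_gpu_model should slice it together with inputs instead of leaving it at the full batch size.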

@DineshChandra94 commented

Did you find any solution to this error?
