I'm running into what is probably a simple issue with fit_generator and multi_gpu_model that I haven't been able to figure out. I'm using the Keras PointNet implementation with some small modifications, and I have a custom data generator that does some basic preprocessing and feeds fit_generator. Everything works fine when I don't use multi_gpu_model.
When I do use it (multi_gpu_model with 2 GPUs), I get this traceback:
Traceback (most recent call last):
File "C:\PydroXL\envs\machinelearning\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-be732ed3db25>", line 1, in <module>
machinelearn.current_working_aws_build(traindata_pth=r'F:\NN\NRT6_F00632\traindata_pts', valdata_pth=r'F:\NN\NRT6_F00632\valdata_pts', multi_gpu=True, modeltype='pointnet', stepsperepoch=100, epochs=10)
File "C:\Users\eric.g.younkin\PycharmProjects\machinelearn\machinelearn.py", line 878, in current_working_aws_build
workers=4)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1426, in fit_generator
initial_epoch=initial_epoch)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training_generator.py", line 191, in fit_generator
class_weight=class_weight)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\engine\training.py", line 1220, in train_on_batch
outputs = self.train_function(ins)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2661, in __call__
return self._call(inputs)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\keras\backend\tensorflow_backend.py", line 2631, in _call
fetched = self._callable_fn(*array_vals)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\client\session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "C:\PydroXL\envs\machinelearning\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0].dim(0) and In[1].dim(0) must be the same: [4,2560,3] vs [8,3,3]
[[Node: replica_0/PointNet/lambda_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/lambda_4/Slice, reshape_1/Reshape)]]
[[Node: training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1/_1415 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_5623_training/Adam/gradients/conv1d_16_1/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
I interpret this to mean that the two inputs to the mat_mul Lambda layer have different sample counts (the first part of the network is shown below). In the run that produced the traceback above, the batch had shape (8, 2560, 3) with 2 GPUs, which I believe means each replica should see an input of (4, 2560, 3). I would therefore expect the output of the Input_T layer to be (4, 3, 3), not the (8, 3, 3) the traceback suggests it is actually receiving.
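The shape constraint in the error can be reproduced outside Keras with plain NumPy (a minimal sketch; the batch sizes 4 and 8 and the point-cloud shape are taken from the traceback, the arrays themselves are dummy data): np.matmul broadcasts matrix multiplication over the leading batch dimension, and mismatched batch sizes fail the same way BatchMatMul does.

```python
import numpy as np

points = np.zeros((4, 2560, 3), dtype=np.float32)      # per-GPU slice of the point cloud
transform_ok = np.zeros((4, 3, 3), dtype=np.float32)   # transform with the matching batch size
transform_bad = np.zeros((8, 3, 3), dtype=np.float32)  # transform still at the full batch size

# Matching batch dims: one 3x3 matrix is applied per sample.
out = np.matmul(points, transform_ok)
print(out.shape)  # (4, 2560, 3)

# Mismatched batch dims (4 vs 8): raises, mirroring the
# "In[0].dim(0) and In[1].dim(0) must be the same" error from BatchMatMul.
try:
    np.matmul(points, transform_bad)
except ValueError as exc:
    print("mismatch:", exc)
```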
Is there something I don't understand about how multi_gpu_model works? I thought I'd verified that the input batch is split across the number of GPUs, but maybe something else is going on?
I believe I found an answer of sorts, but it really just leaves me with more questions. All my tests seem to indicate that the Lambda layer is not replicated onto the GPU devices, so the batch shape is wrong by the time it reaches my Lambda layer. The good news is that I think I should have been using the Dot layer all along. I ran a few tests, see below:
Now I could be way off, but I believe dot from keras.backend and the Dot layer give different results, and the results of keras.backend.dot conflict with the documentation, which states:
When attempting to multiply a nD tensor with a nD tensor, it reproduces the Theano behavior. (e.g. (2, 3) * (4, 3, 5) -> (2, 4, 5))
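For reference, the Theano-style behavior that the documentation describes can be reproduced in NumPy with np.tensordot (a sketch of the documented shape rule, not a call into Keras itself): contract the last axis of the first tensor with the second-to-last axis of the second.

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones((4, 3, 5))

# Contract a's last axis (length 3) with b's second-to-last axis (length 3),
# matching the documented (2, 3) * (4, 3, 5) -> (2, 4, 5) rule.
c = np.tensordot(a, b, axes=(1, 1))
print(c.shape)  # (2, 4, 5)
```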
Both tf.matmul and the Dot layer seem to behave as you would expect. I'm going to continue with the Dot layer and see whether I get the results I expect.
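As a sanity check on what the per-sample (batched) product should give, the result expected from tf.matmul, or from the Keras Dot layer with axes=(2, 1) on 3D inputs, can be sketched in NumPy with einsum (shapes borrowed from the run above; the random data is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2560, 3))  # batch of point clouds
t = rng.normal(size=(4, 3, 3))     # one 3x3 transform per sample

# Per-sample matrix product: contract x's axis 2 with t's axis 1,
# keeping the batch axis aligned.
batched = np.einsum('bij,bjk->bik', x, t)

# np.matmul performs the same batched contraction.
assert np.allclose(batched, np.matmul(x, t))
print(batched.shape)  # (4, 2560, 3)
```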
Hello,
Running Keras 2.2.0 + TensorFlow 1.9.0 w/ GPU
Thanks in advance.