
Satellite Unet in multi-gpu #12

Open
LJ-20 opened this issue Feb 12, 2020 · 13 comments
Comments


LJ-20 commented Feb 12, 2020

Hello
I wasn't able to run the Satellite Unet in multi-gpu. I didn't have this problem with the custom unet.


@karolzak karolzak self-assigned this Feb 13, 2020
@karolzak
Owner

Hi @LJ-20
Honestly, I haven't tested any of the U-Net implementations from this repo on a multi-GPU setup, but in theory there shouldn't be any issues. You said you were able to run custom unet on multi GPU but it's not working for satellite unet, which I find weird because there's no significant difference in implementation or dependencies between the custom and satellite variants. My educated guess would be that you either had some errors in your code or problems with allocating resources on the GPU.
Can you share the code you used, the TF/Keras version and the error message? That way I might be able to help you out, or at least investigate it.
Same for you, @muminoff: share the same information and I'll look into it.
Thanks,
Karol

@karolzak karolzak added the question Further information is requested label Feb 13, 2020
@LJ-20
Author

LJ-20 commented Feb 13, 2020

These are the model-creation lines for each of the two models:
model = satellite_unet(input_shape=(256,256,3))

model = custom_unet((256,256,3), num_classes=1, use_batch_norm=True, upsample_mode='deconv', use_dropout_on_upsampling=False, dropout=0.0, dropout_change_per_layer=0.0, filters=64, num_layers=4, output_activation='sigmoid')

with the command:

model = multi_gpu_model(model, gpus=4, cpu_relocation=True)

The implementation followed the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/utils/multi_gpu_model?version=stable

The error was the following:
The accuracy would drop to 0.0e+00 and the loss would be constant at 0.3 after the first epoch. It would never improve even after one day of training. Also, this was not an issue when using 1 GPU or CPU.
I used both tensorflow 1.14 and 1.15.
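
Roughly, a minimal end-to-end version of this setup looks like the sketch below. The optimizer, loss and dummy data are placeholders for illustration only, not the exact values used:

    import numpy as np
    import tensorflow as tf
    from keras_unet.models import satellite_unet

    # Build the single-GPU model exactly as above.
    model = satellite_unet(input_shape=(256, 256, 3))

    # Replicate it across 4 GPUs, keeping the template weights on the CPU
    # (cpu_relocation=True), following the TF 1.x multi_gpu_model docs.
    parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=4, cpu_relocation=True)

    # Placeholder compile settings.
    parallel_model.compile(optimizer='adam',
                           loss='binary_crossentropy',
                           metrics=['accuracy'])

    # Dummy data just to make the snippet self-contained.
    x_train = np.random.rand(16, 256, 256, 3).astype(np.float32)
    y_train = (np.random.rand(16, 256, 256, 1) > 0.5).astype(np.float32)
    parallel_model.fit(x_train, y_train, batch_size=8, epochs=2)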

@karolzak
Owner

karolzak commented Feb 14, 2020

@LJ-20 So it's not that you're unable to run satellite unet on multi GPU, it's more that it doesn't converge in a multi-GPU setup. Hmm, very interesting, thank you for bringing that up! I will look into it and try debugging it, however as of right now I don't see anything from the model implementation perspective that could be causing this.
My best guess would be that there's something wrong with your input data.
Check the following (see the sketch below):

  • make sure your input data pixel values are in the 0-1 range
  • make sure your input dtype is set to float32
  • as a last resort, try playing around with the multi_gpu_model function's params (cpu_merge, etc.) and see if that helps
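
A quick sanity check along those lines could look something like this (the helper name and the x_train / y_train arguments are just placeholders for your own arrays):

    import numpy as np

    def sanity_check(x_train, y_train):
        """Verify range/dtype and cast to float32 (placeholder helper)."""
        print(x_train.dtype, x_train.min(), x_train.max())  # expect: float32 0.0 1.0
        # Rescale if the images are still in 0-255 and cast everything to float32.
        if x_train.max() > 1.0:
            x_train = x_train / 255.0
        return x_train.astype(np.float32), y_train.astype(np.float32)

    # Last resort: tweak multi_gpu_model's merge behaviour, e.g.
    # parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=4, cpu_merge=False)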

Let me know how it goes!

EDIT:
Removed a part related to another topic in this discussion; it was moved to issue #14.

@karolzak
Owner

@muminoff I was able to reproduce your problem and debug it, and it is related to a Keras/tf.keras dependency mismatch.
I will introduce a fix for this problem in the next PR, but could you please create a separate issue for your problem, as it is not directly related to @LJ-20's issue? Feel free to copy the content of your comments from this issue into your new issue.
Thanks!

@LJ-20
Author

LJ-20 commented Feb 18, 2020

@karolzak the pixel values are read in the 0 to 1 range, the numpy arrays are dtype float32, and I also tried the multi_gpu_model parameters. Our first thought was batch normalization or the way the weights are merged in multi_gpu_model, but we didn't have this problem with custom_unet using the exact same code.

@karolzak karolzak reopened this Feb 18, 2020
@LJ-20
Author

LJ-20 commented Feb 18, 2020

Update: upon revision, it seems the problem was the dtype. I had it set to float64 instead of float32. Why is this an issue?

@karolzak
Owner

@LJ-20, so you used float64 for both custom_unet and satellite_unet, or just for satellite_unet?
In general, single precision (float32) is the most commonly used (it's also the default for TF, and maybe that's the root of the problem?) and I haven't seen float64 used in any experiments. Half precision (float16), on the other hand, can be used in cases where you want to squeeze more data into memory.
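
For reference, you can check the backend default and cast your arrays down to it; this is a generic snippet, nothing specific to this repo, and the helper name is just illustrative:

    import numpy as np
    from tensorflow.keras import backend as K

    print(K.floatx())  # 'float32' by default in TF/Keras

    # Casting float64 numpy arrays to the backend default before training
    # avoids mixing double-precision inputs with float32 weights.
    def to_backend_dtype(arr):
        return arr.astype(K.floatx())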

@LJ-20
Author

LJ-20 commented Feb 18, 2020

I used float64 for both custom_unet and satellite_unet, and it only worked with custom_unet.
