
Satellite Unet in multi-gpu #12

Open
LJ-20 opened this issue Feb 12, 2020 · 13 comments
Comments


LJ-20 commented Feb 12, 2020

Hello
I wasn't able to run the Satellite Unet in multi-gpu. I didn't have this problem with the custom unet.


@karolzak karolzak self-assigned this Feb 13, 2020
@karolzak
Owner

Hi @LJ-20
Honestly, I haven't tested any of the U-Net implementations from this repo on a multi-GPU setup, but in theory there shouldn't be any issues. You said you were able to run custom unet on multi GPU but it's not working for satellite unet, which I find weird because there's no significant difference in implementation or dependencies between the custom and satellite variants. My educated guess would be that you either had some errors in your code or problems with allocating resources on the GPU.
Can you share the code you used, the TF/Keras version and the error message? That way I might be able to help you out, or at least investigate it.
Same for you, @muminoff: share the same information and I'll look into it.
Thanks,
Karol

@karolzak karolzak added the question Further information is requested label Feb 13, 2020
@LJ-20
Author

LJ-20 commented Feb 13, 2020

These are the model-creation lines for each of the two models:
model = satellite_unet(input_shape=(256,256,3))

model = custom_unet((256,256,3), num_classes=1, use_batch_norm=True, upsample_mode='deconv', use_dropout_on_upsampling=False, dropout=0.0, dropout_change_per_layer=0.0, filters=64, num_layers=4, output_activation='sigmoid')

with the command:

model = multi_gpu_model(model, gpus=4, cpu_relocation=True)

The implementation followed the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/utils/multi_gpu_model?version=stable

The error was the following:
The accuracy would drop to 0.0e+00 and the loss would be constant at 0.3 after the first epoch. It would never improve even after one day of training. Also, this was not an issue when using 1 GPU or CPU.
I used both tensorflow 1.14 and 1.15.
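
Roughly, a minimal end-to-end version of this setup looks like the sketch below. The optimizer, loss and dummy data are placeholders for illustration only, not the exact values used:

    import numpy as np
    import tensorflow as tf
    from keras_unet.models import satellite_unet

    # Build the single-GPU model exactly as above.
    model = satellite_unet(input_shape=(256, 256, 3))

    # Replicate it across 4 GPUs, keeping the template weights on the CPU
    # (cpu_relocation=True), following the TF 1.x multi_gpu_model docs.
    parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=4, cpu_relocation=True)

    # Placeholder compile settings.
    parallel_model.compile(optimizer='adam',
                           loss='binary_crossentropy',
                           metrics=['accuracy'])

    # Dummy data just to make the snippet self-contained.
    x_train = np.random.rand(16, 256, 256, 3).astype(np.float32)
    y_train = (np.random.rand(16, 256, 256, 1) > 0.5).astype(np.float32)
    parallel_model.fit(x_train, y_train, batch_size=8, epochs=2)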

@karolzak
Owner

karolzak commented Feb 14, 2020

@LJ-20 So it's not that you're unable to run satellite unet on multi GPU, it's more that it doesn't converge in a multi-GPU setup. Hmm, very interesting, thank you for bringing that up! I will look into it and try debugging it, however as of right now I don't see anything from the model implementation perspective that could be causing this.
My best guess would be that there's something wrong with your input data.
Check the following (see the sketch below):

  • make sure your input data pixel values are in the 0-1 range
  • make sure your input dtype is set to float32
  • as a last resort, try playing around with the multi_gpu_model function's params (cpu_merge, etc.) and see if that helps
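
A quick sanity check along those lines could look something like this (the helper name and the x_train / y_train arguments are just placeholders for your own arrays):

    import numpy as np

    def sanity_check(x_train, y_train):
        """Verify range/dtype and cast to float32 (placeholder helper)."""
        print(x_train.dtype, x_train.min(), x_train.max())  # expect: float32 0.0 1.0
        # Rescale if the images are still in 0-255 and cast everything to float32.
        if x_train.max() > 1.0:
            x_train = x_train / 255.0
        return x_train.astype(np.float32), y_train.astype(np.float32)

    # Last resort: tweak multi_gpu_model's merge behaviour, e.g.
    # parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=4, cpu_merge=False)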

Let me know how it goes!

EDIT:
Removed a part related to another topic in this discussion; it was moved to issue #14.

@karolzak
Owner

@muminoff I was able to reproduce your problem and debug it, and it is related to a Keras/tf.keras dependency mismatch.
I will introduce a fix for this problem in the next PR, but could you please create a separate issue for your problem, as it is not directly related to @LJ-20's issue? Feel free to copy the content of your comments from this issue into your new issue.
Thanks!

@LJ-20
Author

LJ-20 commented Feb 18, 2020

@karolzak the pixel values are read in the 0 to 1 range, the numpy arrays are dtype float32, and I also tried the multi_gpu_model parameters. Our first thought was batch normalization or the way the weights are merged in multi_gpu_model, but we didn't have this problem with custom_unet using the exact same code.

@karolzak karolzak reopened this Feb 18, 2020
@LJ-20
Author

LJ-20 commented Feb 18, 2020

Update: upon revision, it seems the problem was the dtype. I had it set to float64 instead of float32. Why is this an issue?

@karolzak
Owner

@LJ-20, so you used float64 for both custom_unet and satellite_unet, or just for satellite_unet?
In general, single precision (float32) is the most commonly used (it's also the default for TF, and maybe that's the root of the problem?) and I haven't seen float64 used in any experiments. Half precision (float16), on the other hand, can be used in cases where you want to squeeze more data into memory.
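
For reference, you can check the backend default and cast your arrays down to it; this is a generic snippet, nothing specific to this repo, and the helper name is just illustrative:

    import numpy as np
    from tensorflow.keras import backend as K

    print(K.floatx())  # 'float32' by default in TF/Keras

    # Casting float64 numpy arrays to the backend default before training
    # avoids mixing double-precision inputs with float32 weights.
    def to_backend_dtype(arr):
        return arr.astype(K.floatx())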

@LJ-20
Author

LJ-20 commented Feb 18, 2020

I used float64 for both custom_unet and satellite_unet, and it only worked with custom_unet.
