Issue with running on multiple GPUs #14

Closed
muminoff opened this issue Feb 14, 2020 · 11 comments · Fixed by #15

@muminoff
Contributor

Copy-pasting comments from #12


@muminoff:
I cannot run custom_unet on multiple GPUs. I followed the distributed training section of the TensorFlow documentation, but had no luck. It seems I need to refactor the code and use custom distributed training (namely strategy.experimental_distribute_dataset).


@karolzak:
Can you share the code you used, TF/Keras version and error msg? That way I might be able to help you out or at least investigate it.


@muminoff:
I haven't tried tf.keras.utils.multi_gpu_model since it is deprecated. But, I tried with tf.distribute.MirroredStrategy().

And, here is my code:

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

    model.summary()

    model_filename = 'model-v2.h5'

    callback_checkpoint = ModelCheckpoint(
        model_filename, 
        verbose=1, 
        monitor='val_loss', 
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(), 
        #optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        #loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

    history = model.fit_generator(
        train_gen,
        steps_per_epoch=200,
        epochs=50,
        validation_data=(x_val, y_val),
        callbacks=[callback_checkpoint]
    )

Error:

ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.

FYI, using multi_gpu_model raises the following exception:

ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.training.Model object at 0x7f1b347372d0>)

@karolzak:
Can you specify which TF/Keras versions you're using? This seems to be related to that.


@muminoff:

tf.__version__
'2.1.0'

keras.__version__
'2.3.1'
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

model.summary()

model_filename = 'model-v2.h5'

callback_checkpoint = ModelCheckpoint(
    model_filename, 
    verbose=1, 
    monitor='val_loss', 
    save_best_only=True,
)

model.compile(
    optimizer=Adam(), 
    #optimizer=SGD(lr=0.01, momentum=0.99),
    loss='binary_crossentropy',
    #loss=jaccard_distance,
    metrics=[iou, iou_thresholded]
)

history = model.fit_generator(
    train_gen,
    steps_per_epoch=200,
    epochs=50,
    validation_data=(x_val, y_val),
    callbacks=[callback_checkpoint]
)
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
@karolzak karolzak self-assigned this Feb 14, 2020
@karolzak karolzak added the enhancement New feature or request label Feb 14, 2020
@karolzak
Owner

karolzak commented Feb 14, 2020

Thanks for reposting this as a new issue, @muminoff.
This issue is caused by keras_unet internally using import keras (the old standalone Keras package) rather than import tensorflow.keras.
I just merged a PR which adds support for TF 2.x. Depending on whether TF is installed and on its version, keras_unet will now use either the regular Keras package or tf.keras.
Could you please check whether it now works for your multi-GPU setup?
Make sure you pip uninstall keras_unet first, and then use this to install the most recent 'patched' version:
pip install git+https://github.com/karolzak/keras-unet

@muminoff
Contributor Author

I uninstalled the previous one and installed the Git version, but I can't figure out why it still shows the previous version.

[screenshot]

@muminoff
Contributor Author

I finally managed to reinstall keras_unet from the Git version. Unfortunately, the same error occurs again. I think it is unrelated to the data generator: even when I comment out the model.fit(...) part, it still raises the exception.

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    
    input_shape = x_train[0].shape
    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

    model.summary()

    model_filename = 'model-v2.h5'

    callback_checkpoint = ModelCheckpoint(
        model_filename, 
        verbose=1, 
        monitor='val_loss', 
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(), 
        #optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        #loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

#     history = model.fit_generator(
#         train_gen,
#         steps_per_epoch=200,
#         epochs=50,
#         validation_data=(x_val, y_val),
#         callbacks=[callback_checkpoint]
#     )

Output:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-336fbdac5ffd> in <module>
     17         dropout=0.3,
     18         dropout_change_per_layer=0.0,
---> 19         num_layers=6
     20     )
     21 

~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in custom_unet(input_shape, num_classes, use_batch_norm, upsample_mode, use_dropout_on_upsampling, dropout, dropout_change_per_layer, filters, num_layers, output_activation)
     51     down_layers = []
     52     for l in range(num_layers):
---> 53         x = conv2d_block(inputs=x, filters=filters, use_batch_norm=use_batch_norm, dropout=dropout)
     54         down_layers.append(x)
     55         x = MaxPooling2D((2, 2)) (x)

~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in conv2d_block(inputs, use_batch_norm, dropout, filters, kernel_size, activation, kernel_initializer, padding)
     20     c = Conv2D(filters, kernel_size, activation=activation, kernel_initializer=kernel_initializer, padding=padding) (inputs)
     21     if use_batch_norm:
---> 22         c = BatchNormalization()(c)
     23     if dropout > 0.0:
     24         c = Dropout(dropout)(c)

~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in symbolic_fn_wrapper(*args, **kwargs)
     73         if _SYMBOLIC_SCOPE.value:
     74             with get_graph().as_default():
---> 75                 return func(*args, **kwargs)
     76         else:
     77             return func(*args, **kwargs)

~/unetnew/env/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    487             # Actually call the layer,
    488             # collecting output(s), mask(s), and shape(s).
--> 489             output = self.call(inputs, **kwargs)
    490             output_mask = self.compute_mask(inputs, previous_mask)
    491 

~/unetnew/env/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
    197         self.add_update([K.moving_average_update(self.moving_mean,
    198                                                  mean,
--> 199                                                  self.momentum),
    200                          K.moving_average_update(self.moving_variance,
    201                                                  variance,

~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in symbolic_fn_wrapper(*args, **kwargs)
     73         if _SYMBOLIC_SCOPE.value:
     74             with get_graph().as_default():
---> 75                 return func(*args, **kwargs)
     76         else:
     77             return func(*args, **kwargs)

~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in moving_average_update(x, value, momentum)
   1294         An operation to update the variable.
   1295     """
-> 1296     with tf_ops.colocate_with(x):
   1297         decay = tf_ops.convert_to_tensor(1.0 - momentum)
   1298         if decay.dtype != x.dtype.base_dtype:

/usr/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _colocate_with_for_gradient(self, op, gradient_uid, ignore_existing)
   4110   def _colocate_with_for_gradient(self, op, gradient_uid,
   4111                                   ignore_existing=False):
-> 4112     with self.colocate_with(op, ignore_existing):
   4113       if gradient_uid is not None and self._control_flow_context is not None:
   4114         self._control_flow_context.EnterGradientColocation(op, gradient_uid)

/usr/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in colocate_with(self, op, ignore_existing)
   4159       raise ValueError("Trying to reset colocation (op is None) but "
   4160                        "ignore_existing is not True")
-> 4161     op = _op_to_colocate_with(op, self)
   4162 
   4163     # By default, colocate_with resets the device function stack,

~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _op_to_colocate_with(v, graph)
   6546   # happen soon, perhaps this hack to work around the circular
   6547   # import dependency is acceptable.
-> 6548   if hasattr(v, "handle") and isinstance(v.handle, Tensor):
   6549     if graph.building_function:
   6550       return graph.capture(v.handle).op

~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/distribute/values.py in handle(self)
    718       device = distribute_lib.get_update_device()
    719       if device is None:
--> 720         raise ValueError("`handle` is not available outside the replica context"
    721                          " or a `tf.distribute.Strategy.update()` call.")
    722     return self.get(device=device).handle

ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
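
A quick way to read a trace like the one above is to look at which installed packages its frames come from. The sketch below is only an illustration (the helper name standalone_keras_frames and the regular expression are assumptions, not part of keras_unet or TensorFlow): it scans pasted traceback text and lists frames that live under site-packages/keras/, which would indicate that the standalone Keras package, rather than tf.keras, is executing the layer code.

```python
import re

# Matches IPython-style frame headers such as
# "~/env/lib/python3.7/site-packages/keras/layers/normalization.py in call(...)".
# Group 2 captures the path relative to site-packages.
_FRAME = re.compile(r"^(\S*site-packages/(\S+?\.py)) in ", re.MULTILINE)


def standalone_keras_frames(traceback_text):
    """Return site-packages-relative paths of frames from the standalone `keras` package."""
    hits = []
    for match in _FRAME.finditer(traceback_text):
        relative_path = match.group(2)
        # "keras/" matches standalone Keras only, not "keras_unet/" or "tensorflow_core/".
        if relative_path.startswith("keras/"):
            hits.append(relative_path)
    return hits


# Three frame headers shortened from the trace in this thread.
sample = """\
~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in conv2d_block(inputs)
~/unetnew/env/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/distribute/values.py in handle(self)
"""
print(standalone_keras_frames(sample))  # ['keras/layers/normalization.py']
```

An empty result would suggest that no standalone Keras code is on the call path.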

@karolzak
Owner

karolzak commented Feb 14, 2020

~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in conv2d_block(inputs, use_batch_norm, dropout, filters, kernel_size, activation, kernel_initializer, padding)
     20     c = Conv2D(filters, kernel_size, activation=activation, kernel_initializer=kernel_initializer, padding=padding) (inputs)
     21     if use_batch_norm:
---> 22         c = BatchNormalization()(c)
     23     if dropout > 0.0:
     24         c = Dropout(dropout)(c)

@muminoff this code is definitely coming from the 'old' version. The newest code looks like this:

    c = Conv2D(
        filters,
        kernel_size,
        activation=activation,
        kernel_initializer=kernel_initializer,
        padding=padding,
        use_bias=not use_batch_norm,
    )(inputs)
    if use_batch_norm:
        c = BatchNormalization()(c)
    if dropout > 0.0:
        c = DO(dropout)(c)

Make sure you're using the latest version.
If you're installing through pip in notebooks, please make sure you're targeting the current kernel's environment, like below:

import sys
! {sys.executable} -m pip install git+https://github.com/karolzak/keras-unet

Also, please ignore keras_unet.__version__ - I totally forgot to bump it while rushing the PR...
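
Since the version string is not a reliable signal here, one alternative is to ask the running interpreter where it would load the package from. This is a minimal sketch using only the standard library; the helper name describe_package is hypothetical. Run it in the same notebook kernel that raises the error.

```python
import importlib.util
import sys


def describe_package(name):
    """Report where `name` would be imported from in the running interpreter."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not importable from this interpreter
    return {"interpreter": sys.executable, "origin": spec.origin}


# Prints None if keras_unet is not installed for this kernel; otherwise shows the
# interpreter path and the package's __init__.py location.
print(describe_package("keras_unet"))
```

If the reported origin is not inside the environment that pip installed into, the kernel and the pip command are probably pointing at different environments.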

@muminoff
Contributor Author

My example code still imports old keras (from keras.callbacks import ModelCheckpoint) which was directly taken from kz-whale-tails notebook example. Should I change those imports also?

@muminoff
Contributor Author

Before:

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras.utils import multi_gpu_model
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

After:

from keras_unet.models import custom_unet
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import multi_gpu_model
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

Both (before and after) give the same error message.

@karolzak
Owner

My example code still imports old keras (from keras.callbacks import ModelCheckpoint) which was directly taken from kz-whale-tails notebook example. Should I change those imports also?

Yes, change everything to from tensorflow.keras...
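
To catch leftover imports of the standalone package, a notebook cell or script can be scanned with the standard ast module. The following is a sketch under assumptions: the helper name standalone_keras_imports is made up for illustration, and the source passed in must be plain Python (notebook magics such as %matplotlib inline would need to be stripped first, since ast.parse rejects them).

```python
import ast


def standalone_keras_imports(source):
    """Return (line_number, module) pairs for imports of the standalone `keras` package."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Match `keras` and `keras.*`, but not `keras_unet` or `tensorflow.keras`.
            if module == "keras" or module.startswith("keras."):
                found.append((node.lineno, module))
    return sorted(found)


cell = """\
from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam, SGD
import keras
"""
print(standalone_keras_imports(cell))  # [(2, 'keras.callbacks'), (4, 'keras')]
```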

@karolzak
Owner

karolzak commented Feb 14, 2020

@muminoff can you paste again the most recent error msg + stack trace?

@muminoff
Contributor Author

Jupyter Notebook exported to Markdown format

import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import glob
import os
import sys
from PIL import Image
masks = glob.glob("/home/user/beefscan/experiments/new_models_201912/images/newtrain/*.png")
orgs = list(map(lambda x: x.replace(".png", ".jpg"), masks))
imgs_list = []
masks_list = []
for image, mask in zip(orgs, masks):
    imgs_list.append(np.array(Image.open(image).resize((64,64))))
    masks_list.append(np.array(Image.open(mask).resize((64,64))))

imgs_np = np.asarray(imgs_list)
masks_np = np.asarray(masks_list)
print(imgs_np.shape, masks_np.shape)
(1970, 64, 64, 3) (1970, 64, 64)
from keras_unet.utils import plot_imgs

plot_imgs(org_imgs=imgs_np, mask_imgs=masks_np, nm_img_to_plot=10, figsize=6)
print(imgs_np.max(), masks_np.max())
255 255
x = np.asarray(imgs_np, dtype=np.float32)/255
y = np.asarray(masks_np, dtype=np.float32)
print(x.max(), y.max())
1.0 255.0
print(x.shape, y.shape)
(1970, 64, 64, 3) (1970, 64, 64)
y = y.reshape(y.shape[0], y.shape[1], y.shape[2], 1)
print(x.shape, y.shape)
(1970, 64, 64, 3) (1970, 64, 64, 1)
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.9, shuffle=True)

print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)
print("x_val: ", x_val.shape)
print("y_val: ", y_val.shape)
x_train:  (197, 64, 64, 3)
y_train:  (197, 64, 64, 1)
x_val:  (1773, 64, 64, 3)
y_val:  (1773, 64, 64, 1)
from keras_unet.utils import get_augmented

train_gen = get_augmented(
    x_train, y_train, batch_size=8,
    data_gen_args = dict(
        rotation_range=5.,
        width_shift_range=0.05,
        height_shift_range=0.05,
        shear_range=40,
        zoom_range=0.2,
        horizontal_flip=True,
        vertical_flip=False,
        fill_mode='constant'
    ))
sample_batch = next(train_gen)
xx, yy = sample_batch
print(xx.shape, yy.shape)
from keras_unet.utils import plot_imgs

plot_imgs(org_imgs=xx, mask_imgs=yy, nm_img_to_plot=2, figsize=6)
from keras_unet.models import custom_unet
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import multi_gpu_model
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance


strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    
    input_shape = x_train[0].shape
    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

    model.summary()

    model_filename = 'model-v2.h5'

    callback_checkpoint = ModelCheckpoint(
        model_filename, 
        verbose=1, 
        monitor='val_loss', 
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(), 
        #optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        #loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

#     history = model.fit_generator(
#         train_gen,
#         steps_per_epoch=200,
#         epochs=50,
#         validation_data=(x_val, y_val),
#         callbacks=[callback_checkpoint]
#     )
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')



---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-33-336fbdac5ffd> in <module>
     17         dropout=0.3,
     18         dropout_change_per_layer=0.0,
---> 19         num_layers=6
     20     )
     21 


~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in custom_unet(input_shape, num_classes, activation, use_batch_norm, upsample_mode, dropout, dropout_change_per_layer, dropout_type, use_dropout_on_upsampling, filters, num_layers, output_activation)
    155             dropout=dropout,
    156             dropout_type=dropout_type,
--> 157             activation=activation,
    158         )
    159         down_layers.append(x)


~/unetnew/env/lib/python3.7/site-packages/keras_unet/models/custom_unet.py in conv2d_block(inputs, use_batch_norm, dropout, dropout_type, filters, kernel_size, activation, kernel_initializer, padding)
     68     )(inputs)
     69     if use_batch_norm:
---> 70         c = BatchNormalization()(c)
     71     if dropout > 0.0:
     72         c = DO(dropout)(c)


~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in symbolic_fn_wrapper(*args, **kwargs)
     73         if _SYMBOLIC_SCOPE.value:
     74             with get_graph().as_default():
---> 75                 return func(*args, **kwargs)
     76         else:
     77             return func(*args, **kwargs)


~/unetnew/env/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    487             # Actually call the layer,
    488             # collecting output(s), mask(s), and shape(s).
--> 489             output = self.call(inputs, **kwargs)
    490             output_mask = self.compute_mask(inputs, previous_mask)
    491 


~/unetnew/env/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
    197         self.add_update([K.moving_average_update(self.moving_mean,
    198                                                  mean,
--> 199                                                  self.momentum),
    200                          K.moving_average_update(self.moving_variance,
    201                                                  variance,


~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in symbolic_fn_wrapper(*args, **kwargs)
     73         if _SYMBOLIC_SCOPE.value:
     74             with get_graph().as_default():
---> 75                 return func(*args, **kwargs)
     76         else:
     77             return func(*args, **kwargs)


~/unetnew/env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in moving_average_update(x, value, momentum)
   1294         An operation to update the variable.
   1295     """
-> 1296     with tf_ops.colocate_with(x):
   1297         decay = tf_ops.convert_to_tensor(1.0 - momentum)
   1298         if decay.dtype != x.dtype.base_dtype:


/usr/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None


~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _colocate_with_for_gradient(self, op, gradient_uid, ignore_existing)
   4110   def _colocate_with_for_gradient(self, op, gradient_uid,
   4111                                   ignore_existing=False):
-> 4112     with self.colocate_with(op, ignore_existing):
   4113       if gradient_uid is not None and self._control_flow_context is not None:
   4114         self._control_flow_context.EnterGradientColocation(op, gradient_uid)


/usr/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None


~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in colocate_with(self, op, ignore_existing)
   4159       raise ValueError("Trying to reset colocation (op is None) but "
   4160                        "ignore_existing is not True")
-> 4161     op = _op_to_colocate_with(op, self)
   4162 
   4163     # By default, colocate_with resets the device function stack,


~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _op_to_colocate_with(v, graph)
   6546   # happen soon, perhaps this hack to work around the circular
   6547   # import dependency is acceptable.
-> 6548   if hasattr(v, "handle") and isinstance(v.handle, Tensor):
   6549     if graph.building_function:
   6550       return graph.capture(v.handle).op


~/unetnew/env/lib/python3.7/site-packages/tensorflow_core/python/distribute/values.py in handle(self)
    718       device = distribute_lib.get_update_device()
    719       if device is None:
--> 720         raise ValueError("`handle` is not available outside the replica context"
    721                          " or a `tf.distribute.Strategy.update()` call.")
    722     return self.get(device=device).handle


ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
!pip uninstall keras_unet
Found existing installation: keras-unet 0.0.8
Uninstalling keras-unet-0.0.8:
  Would remove:
    /home/user/unetnew/env/lib/python3.7/site-packages/keras_unet-0.0.8.dist-info/*
    /home/user/unetnew/env/lib/python3.7/site-packages/keras_unet/*
    /home/user/unetnew/env/lib/python3.7/site-packages/tests/*
  Would not remove (might be manually added):
    /home/user/unetnew/env/lib/python3.7/site-packages/tests/host_fake.py
    /home/user/unetnew/env/lib/python3.7/site-packages/tests/host_test.py
    /home/user/unetnew/env/lib/python3.7/site-packages/tests/lib_test.py
    /home/user/unetnew/env/lib/python3.7/site-packages/tests/tool_test.py
Proceed (y/n)? ^C
ERROR: Operation cancelled by user
from keras_unet.utils import plot_segm_history

plot_segm_history(history)
model.load_weights(model_filename)
y_pred = model.predict(x_val)
from keras_unet.utils import plot_imgs

plot_imgs(org_imgs=x_val, mask_imgs=y_val, pred_imgs=y_pred, nm_img_to_plot=10)
tf.__version__
'2.1.0'
import keras
keras.__version__
'2.3.1'

@karolzak
Owner

@muminoff
Ok, I think I finally got it.
I'm using the packaging module internally to compare TF versions, and I was convinced that it's part of the Python standard library. It turns out it's not, and it needs to be installed separately:

pip install packaging

Sorry for the trouble and let me know if that finally solved the issue
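
For readers wondering how a missing helper package could change which Keras gets used, here is a hypothetical sketch of the general pattern. It is not keras_unet's actual code: the function names, the 2.0.0 threshold, and the silent fallback to standalone Keras are all assumptions made for illustration.

```python
import importlib


def load_version_parser(module_name="packaging.version"):
    """Return the module's parse() function, or None if the module is not installed."""
    try:
        return importlib.import_module(module_name).parse
    except ImportError:
        return None


def choose_keras_namespace(tf_version, parse):
    """Pick which Keras namespace to import layers from (illustrative logic only)."""
    if parse is None:
        # Assumed behaviour: without a version parser the check cannot run,
        # so the code quietly takes the standalone Keras branch.
        return "keras"
    if parse(tf_version) >= parse("2.0.0"):
        return "tensorflow.keras"
    return "keras"


parse = load_version_parser()
print(choose_keras_namespace("2.1.0", parse))
```

Under these assumptions, a swallowed ImportError selects the standalone branch without any visible error, which would be consistent with the symptom in this thread. Raising or logging in the except branch would make such a failure obvious.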

@muminoff
Contributor Author

@karolzak I have followed your instruction. Now it works. Thanks a lot for your support!