demogen: loading resnet models fails on GPU #39

Open
uricohen opened this issue Jul 31, 2019 · 8 comments

@uricohen

The problem described in the previous issue is resolved when using a TensorFlow build with GPU support enabled, but loading the models then shows a zoo of behaviors:

  • 5 of the saved resnet models load correctly
  • Most fail with Not found: Key resnet/group_norm/beta not found in checkpoint
  • Many fail with Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [1,32,1,1] rhs shape= [32]
  • A few fail with ValueError: Trying to share variable resnet/conv2d/kernel, but specified shape (3, 3, 3, 32) and found shape (3, 3, 3, 16).

Correctly loaded:

resnet cifar10 resnet_wide_1.0x_batchnorm_aug_decay_0.0_1
resnet cifar10 resnet_wide_1.0x_batchnorm_aug_decay_0.0_lr_0.001_1

resnet cifar100 resnet_wide_1.0x_batchnorm_aug_decay_0.0_1
resnet cifar100 resnet_wide_1.0x_batchnorm_aug_decay_0.0_lr_0.001_1
resnet cifar100 resnet_wide_1.0x_batchnorm_aug_decay_0.0_lr_0.1_1

Not found:

resnet cifar10 resnet_wide_1.0x_batchnorm_aug_decay_0.0_2
2019-07-31 13:22:08.859328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
2019-07-31 13:22:08.859377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:22:08.859386: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-31 13:22:08.859393: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-31 13:22:08.859405: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-31 13:22:08.859413: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-31 13:22:08.859420: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-31 13:22:08.859428: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-31 13:22:08.859802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-31 13:22:08.859822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 13:22:08.859826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-31 13:22:08.859829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-31 13:22:08.860214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10481 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Collecting 3072 neurons from 4 layers (5024 samples, 10 objects)
W0731 13:22:08.969132 139832240281408 deprecation.py:323] From ~/google-research/demogen/models/resnet.py:47: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
2019-07-31 13:22:10.279506: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key resnet/group_norm/beta not found in checkpoint
Traceback (most recent call last):
  File "demogen/parse_tuning.py", line 84, in <module>
    all_activations, samples_per_object, layer_names, layer_indices, layer_n_neurons = elu.extract_layers(input_fn, root_dir, model_config)
  File "~/google-research/demogen/extract_layers_util.py", line 98, in extract_layers
    model_config.load_parameters(param_path, sess)
  File "~/google-research/demogen/model_config.py", line 262, in load_parameters
    saver.restore(tf_session, model_dir)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Not found: Key resnet/group_norm/beta not found in checkpoint
         [[node save_2/RestoreV2 (defined at ~/google-research/demogen/model_config.py:261) ]]
  (1) Not found: Key resnet/group_norm/beta not found in checkpoint
         [[node save_2/RestoreV2 (defined at ~/google-research/demogen/model_config.py:261) ]]
         [[save_2/RestoreV2/_383]]
0 successful operations.
0 derived errors ignored.

Original stack trace for u'save_2/RestoreV2':
  File "demogen/parse_tuning.py", line 84, in <module>
    all_activations, samples_per_object, layer_names, layer_indices, layer_n_neurons = elu.extract_layers(input_fn, root_dir, model_config)
  File "~/google-research/demogen/extract_layers_util.py", line 98, in extract_layers
    model_config.load_parameters(param_path, sess)
  File "~/google-research/demogen/model_config.py", line 261, in load_parameters
    saver = tf.train.Saver(model_var_list)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Invalid argument:

resnet cifar10 resnet_wide_1.0x_groupnorm_aug_decay_0.0_1
W0731 13:25:40.192379 140184543594304 deprecation_wrapper.py:119] From ~/google-research/demogen/extract_layers_util.py:68: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-31 13:25:40.193601: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-31 13:25:40.530512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
2019-07-31 13:25:40.530699: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:25:40.531574: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-31 13:25:40.532371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-31 13:25:40.532577: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-31 13:25:40.533520: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-31 13:25:40.534268: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-31 13:25:40.536462: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-31 13:25:40.537216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-31 13:25:40.537544: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-07-31 13:25:40.596500: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0d5471f80 executing computations on platform CUDA. Devices:
2019-07-31 13:25:40.596528: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-31 13:25:40.627506: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3300000000 Hz
2019-07-31 13:25:40.628479: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b0d3885e60 executing computations on platform Host. Devices:
2019-07-31 13:25:40.628495: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-31 13:25:40.628967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
2019-07-31 13:25:40.629006: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:25:40.629014: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-31 13:25:40.629021: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-31 13:25:40.629036: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-31 13:25:40.629043: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-31 13:25:40.629066: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-31 13:25:40.629073: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-31 13:25:40.629738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-31 13:25:40.629757: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:25:40.630519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 13:25:40.630526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-31 13:25:40.630529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-31 13:25:40.631262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10481 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
W0731 13:25:40.637204 140184543594304 deprecation.py:323] From ~/.local/lib64/python2.7/site-packages/tensor2tensor/data_generators/problem.py:680: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0731 13:25:40.649481 140184543594304 deprecation_wrapper.py:119] From ~/.local/lib64/python2.7/site-packages/tensor2tensor/data_generators/image_utils.py:169: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W0731 13:25:40.820377 140184543594304 deprecation.py:323] From ~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/image_ops_impl.py:1514: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
W0731 13:25:40.825278 140184543594304 deprecation.py:323] From ~/google-research/demogen/data_util.py:76: make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
W0731 13:25:40.837275 140184543594304 deprecation_wrapper.py:119] From ~/google-research/demogen/extract_layers_util.py:76: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Collecting 3072 neurons from 4 layers (5024 samples, 10 objects)
W0731 13:25:40.838011 140184543594304 deprecation_wrapper.py:119] From ~/google-research/demogen/models/resnet.py:383: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0731 13:25:40.838236 140184543594304 deprecation.py:323] From ~/google-research/demogen/models/resnet.py:136: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
W0731 13:25:41.450165 140184543594304 deprecation.py:323] From ~/google-research/demogen/models/resnet.py:430: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
W0731 13:25:41.451108 140184543594304 deprecation.py:506] From ~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0731 13:25:42.371058 140184543594304 deprecation_wrapper.py:119] From ~/google-research/demogen/model_config.py:261: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W0731 13:25:42.421227 140184543594304 deprecation.py:323] From ~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "demogen/parse_tuning.py", line 84, in <module>
    all_activations, samples_per_object, layer_names, layer_indices, layer_n_neurons = elu.extract_layers(input_fn, root_dir, model_config)
  File "~/google-research/demogen/extract_layers_util.py", line 98, in extract_layers
    model_config.load_parameters(param_path, sess)
  File "~/google-research/demogen/model_config.py", line 262, in load_parameters
    saver.restore(tf_session, model_dir)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 1322, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [1,32,1,1] rhs shape= [32]
         [[node save/Assign_50 (defined at ~/google-research/demogen/model_config.py:261) ]]
         [[save/RestoreV2/_120]]
  (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [1,32,1,1] rhs shape= [32]
         [[node save/Assign_50 (defined at ~/google-research/demogen/model_config.py:261) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/Assign_50:
 resnet/group_norm_15/beta (defined at ~/google-research/demogen/models/resnet.py:66)

Input Source operations connected to node save/Assign_50:
 resnet/group_norm_15/beta (defined at ~/google-research/demogen/models/resnet.py:66)

Original stack trace for u'save/Assign_50':
  File "demogen/parse_tuning.py", line 84, in <module>
    all_activations, samples_per_object, layer_names, layer_indices, layer_n_neurons = elu.extract_layers(input_fn, root_dir, model_config)
  File "~/google-research/demogen/extract_layers_util.py", line 98, in extract_layers
    model_config.load_parameters(param_path, sess)
  File "~/google-research/demogen/model_config.py", line 261, in load_parameters
    saver = tf.train.Saver(model_var_list)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saver.py", line 350, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 72, in restore
    self.op.get_shape().is_fully_defined())
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 227, in assign
    validate_shape=validate_shape)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 66, in assign
    use_locking=use_locking, name=name)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

ValueError logs:

resnet cifar10 resnet_wide_1.0x_groupnorm__decay_0.002_lr_0.001_3
2019-07-31 13:19:29.317723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
2019-07-31 13:19:29.317779: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:19:29.317789: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-31 13:19:29.317803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-31 13:19:29.317811: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-31 13:19:29.317819: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-31 13:19:29.317826: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-31 13:19:29.317834: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-31 13:19:29.318213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-31 13:19:29.318235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 13:19:29.318239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-31 13:19:29.318243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-31 13:19:29.318637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10481 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Collecting 3072 neurons from 4 layers (5024 samples, 10 objects)
2019-07-31 13:19:36.919128: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key resnet/batch_normalization/beta not found in checkpoint
resnet cifar10 resnet_wide_2.0x_batchnorm_aug_decay_0.0_1
2019-07-31 13:19:37.275569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
2019-07-31 13:19:37.275624: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-31 13:19:37.275643: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-31 13:19:37.275652: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-31 13:19:37.275659: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-31 13:19:37.275667: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-31 13:19:37.275675: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-31 13:19:37.275691: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-31 13:19:37.276063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-31 13:19:37.276086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 13:19:37.276090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-31 13:19:37.276094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-31 13:19:37.276490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10481 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Collecting 3072 neurons from 4 layers (5024 samples, 10 objects)
Traceback (most recent call last):
  File "demogen/parse_tuning.py", line 84, in <module>
    all_activations, samples_per_object, layer_names, layer_indices, layer_n_neurons = elu.extract_layers(input_fn, root_dir, model_config)
  File "~/google-research/demogen/extract_layers_util.py", line 89, in extract_layers
    end_points_collection=end_points_collection)
  File "~/google-research/demogen/models/resnet.py", line 391, in __call__
    strides=self.conv_stride, data_format=self.data_format)
  File "~/google-research/demogen/models/resnet.py", line 136, in conv2d_fixed_padding
    data_format=data_format)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/layers/convolutional.py", line 424, in conv2d
    return layer.apply(inputs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/layers/base.py", line 537, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 591, in __call__
    self._maybe_build(inputs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1881, in _maybe_build
    self.build(input_shapes)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/keras/layers/convolutional.py", line 165, in build
    dtype=self.dtype)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/layers/base.py", line 450, in add_weight
    **kwargs)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 384, in add_weight
    aggregation=aggregation)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/training/tracking/base.py", line 663, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable
    aggregation=aggregation)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable
    aggregation=aggregation)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable
    aggregation=aggregation)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter
    aggregation=aggregation)
  File "~/.local/lib64/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 869, in _get_single_variable
    (name, shape, found_var.get_shape()))
ValueError: Trying to share variable resnet/conv2d/kernel, but specified shape (3, 3, 3, 32) and found shape (3, 3, 3, 16).
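
The three failure modes in the logs above (a missing key, a shape mismatch at assign time, and a variable-sharing conflict) all come down to the graph's variables disagreeing with what is stored in the checkpoint or already defined in the graph. Below is a minimal sketch of a check that could run before the restore step. It assumes the two name-to-shape mappings have already been collected, for example from `tf.train.list_variables` on the checkpoint and from the variable list handed to the saver. `diff_variables` and the sample shapes are hypothetical illustrations modelled on the error messages, not part of demogen:

```python
def diff_variables(graph_shapes, checkpoint_shapes):
    """Compare the variables a graph expects with those stored in a checkpoint.

    Both arguments are dicts mapping a variable name to a shape (sequence of ints).
    Returns (missing, mismatched):
      missing:    sorted names the graph expects but the checkpoint lacks
      mismatched: dict of name -> (graph_shape, checkpoint_shape)
    """
    missing = sorted(name for name in graph_shapes if name not in checkpoint_shapes)
    mismatched = {}
    for name, shape in graph_shapes.items():
        if name in checkpoint_shapes and tuple(shape) != tuple(checkpoint_shapes[name]):
            mismatched[name] = (tuple(shape), tuple(checkpoint_shapes[name]))
    return missing, mismatched


# Hypothetical shapes, chosen to mirror the errors reported in this issue.
graph = {
    "resnet/conv2d/kernel": [3, 3, 3, 16],
    "resnet/group_norm/beta": [1, 32, 1, 1],
    "resnet/group_norm/gamma": [1, 32, 1, 1],
}
checkpoint = {
    "resnet/conv2d/kernel": [3, 3, 3, 16],
    "resnet/group_norm/beta": [32],
}

missing, mismatched = diff_variables(graph, checkpoint)
print(missing)     # names that would trigger "Key ... not found in checkpoint"
print(mismatched)  # names that would trigger "Assign requires shapes ... to match"
```

A name in `missing` corresponds to the NotFoundError case, and an entry in `mismatched` corresponds to the InvalidArgumentError case.
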
@uricohen
Author

An example that reproduces these failures is available in my fork:

python demogen/parse_tuning.py

@yidingjiang
Contributor

yidingjiang commented Aug 6, 2019

Are the problems with the resnets only?

@uricohen
Author

uricohen commented Aug 6, 2019 via email

@yidingjiang
Contributor

I couldn't find extract_layers_util in your repo, but here is one quick thing to try: can you call tf.reset_default_graph between loading different models?
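
A rough sketch of what that suggestion could look like in a loading loop. The reasoning, offered as an assumption rather than a confirmed diagnosis, is that variables left in the default graph by a previously built model can collide with or leak into the next one. `load_all`, `fake_reset`, and `fake_load` are hypothetical names; the stand-ins keep the sketch runnable without TensorFlow:

```python
def load_all(model_dirs, load_one, reset_graph):
    """Load each model in turn, resetting graph state before every load.

    load_one(model_dir) builds the model and restores its checkpoint.
    reset_graph() clears whatever the previous model left behind.
    Returns a dict mapping model_dir to None on success, or to the
    exception type name on failure, so failures can be grouped later.
    """
    results = {}
    for model_dir in model_dirs:
        reset_graph()  # start from an empty graph so variable names are not reused
        try:
            load_one(model_dir)
            results[model_dir] = None
        except Exception as err:  # keep going so every failure is recorded
            results[model_dir] = type(err).__name__
    return results


# Stand-ins so the sketch runs without TensorFlow.
calls = []

def fake_reset():
    calls.append("reset")

def fake_load(model_dir):
    calls.append(model_dir)
    if "groupnorm" in model_dir:
        raise ValueError("shape mismatch")

results = load_all(["batchnorm_model", "groupnorm_model"], fake_load, fake_reset)
print(results)
```

With TensorFlow 1.x, `reset_graph` would be `tf.reset_default_graph`, and `load_one` would build the model and run the restore step inside a fresh session.
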

@uricohen
Author

uricohen commented Aug 7, 2019 via email

@uricohen
Author

uricohen commented Aug 7, 2019

This indeed solves the issue for all batchnorm resnet models, but not for groupnorm!

The following errors are no longer there:

  • Most fail with Not found: Key resnet/group_norm/beta not found in checkpoint
  • A few fail with ValueError: Trying to share variable resnet/conv2d/kernel, but specified shape (3, 3, 3, 32) and found shape (3, 3, 3, 16).

The following error is still there, in all groupnorm models:

  • Many fail with Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [1,32,1,1] rhs shape= [32]

That is, for resnet I could read 108 / 216 cifar10 models and 162 / 324 cifar100 models.
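
To check that the remaining failures line up with the normalization type, the load results can be tallied by fields parsed out of the model directory names. This is only a sketch: the name pattern is inferred from the directory names quoted in this issue and may not cover every model, and `parse_model_name`, `tally_by_norm`, and the sample outcomes are hypothetical illustrations:

```python
import re
from collections import Counter

# Assumed naming pattern, inferred from the names quoted in this issue.
NAME_RE = re.compile(
    r"^resnet_wide_(?P<width>[\d.]+)x_(?P<norm>batchnorm|groupnorm)"
    r"_(?P<aug>aug)?_decay_(?P<decay>[\d.]+)"
    r"(?:_lr_(?P<lr>[\d.]+))?_(?P<copy>\d+)$"
)

def parse_model_name(name):
    """Split a model directory name into its fields (all values are strings or None)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("unrecognized model name: %s" % name)
    return match.groupdict()

def tally_by_norm(results):
    """results maps model name -> True if it loaded; counts by (norm, loaded)."""
    counts = Counter()
    for name, loaded in results.items():
        counts[(parse_model_name(name)["norm"], loaded)] += 1
    return counts

# Illustrative outcomes, consistent with the report above
# (batchnorm models load, groupnorm models do not).
results = {
    "resnet_wide_1.0x_batchnorm_aug_decay_0.0_1": True,
    "resnet_wide_1.0x_batchnorm_aug_decay_0.0_2": True,
    "resnet_wide_1.0x_groupnorm_aug_decay_0.0_1": False,
    "resnet_wide_1.0x_groupnorm__decay_0.002_lr_0.001_3": False,
}
counts = tally_by_norm(results)
print(counts)
```
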

@yidingjiang
Contributor

yidingjiang commented Aug 7, 2019

I think the issue is that in the original code these tensors were created with shape [c] and then reshaped to [1, c, 1, 1], but the code was later changed to create them with shape [1, c, 1, 1] directly. My bad for not catching it. It might take me a bit of time to push the change, but the following should fix the issue:

  1. Go to models/resnet.py
  2. Go to the function group_norm
  3. Change:
    gamma = tf.get_variable('gamma', [1, c, 1, 1],
                            initializer=tf.constant_initializer(1.0))
    beta = tf.get_variable('beta', [1, c, 1, 1],
                           initializer=tf.constant_initializer(0.0))

to

    gamma = tf.get_variable('gamma', [c],
                            initializer=tf.constant_initializer(1.0))
    beta = tf.get_variable('beta', [c],
                           initializer=tf.constant_initializer(0.0))
    gamma = tf.reshape(gamma, [1, c, 1, 1])
    beta = tf.reshape(beta, [1, c, 1, 1])
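
The change above works because a [c] vector and a [1, c, 1, 1] tensor hold the same c values; only the stored shape differs, and the reshape puts the values back into the layout used for broadcasting. A small sketch of that distinction, using the shapes from the error messages in this issue (`num_elements` and `reshape_compatible` are hypothetical helper names):

```python
from functools import reduce
from operator import mul

def num_elements(shape):
    """Total number of scalar values in a tensor of the given shape."""
    return reduce(mul, shape, 1)

def reshape_compatible(graph_shape, checkpoint_shape):
    """True if the stored values could fill the graph variable after a reshape."""
    return num_elements(graph_shape) == num_elements(checkpoint_shape)

# Group norm beta: same 32 values, different layout, so a reshape is enough.
print(reshape_compatible([1, 32, 1, 1], [32]))            # True
# Conv kernels of different widths hold different numbers of values.
print(reshape_compatible([3, 3, 3, 32], [3, 3, 3, 16]))   # False
```

`tf.train.Saver` also documents a `reshape` argument for restoring variables whose shape differs while the element count matches; whether that route works for these checkpoints is untested here.
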

@uricohen
Author

uricohen commented Aug 7, 2019 via email
