CUDNN_STATUS_INTERNAL_ERROR when training on Market dataset #93

Open
chaaarlieee opened this issue Jul 23, 2021 · 0 comments

chaaarlieee commented Jul 23, 2021

I think I'm running out of GPU memory when I run the training code. In nvidia-smi I can see memory usage climb to the maximum and then drop again. However, I don't know whether this is TensorFlow trying to claim all available memory or the training code itself. I changed the batch size to 16 but still have the same problem.
After all the CUDA libraries load, I get "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR" at 13:47:58.783613.
I'm using Python 3.6.9 and 10 GB of GPU.

$ python train_market1501.py --mode=train --batch_size=16
2021-07-23 13:47:53.804516: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Using TensorFlow backend.
Train set size: 11606 images, 676 identities
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:229: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:236: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:238: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py:373: The name tf.FIFOQueue is deprecated. Please use tf.queue.FIFOQueue instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:252: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:19: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:28: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

feature dimensionality:  128
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py:97: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/losses.py:142: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:258: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:266: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:268: The name tf.losses.get_total_loss is deprecated. Please use tf.compat.v1.losses.get_total_loss instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:270: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py:276: The name tf.losses.get_regularization_loss is deprecated. Please use tf.compat.v1.losses.get_regularization_loss instead.

---------------------------------------
Run ID:  RLKUKK
Log directory:  ./output/market1501/RLKUKK
---------------------------------------
WARNING:tensorflow:From /workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py:464: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

2021-07-23 13:47:56.598199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-23 13:47:56.630062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.630541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1634] Found device 0 with properties: 
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.56
pciBusID: 0000:01:00.0
2021-07-23 13:47:56.630588: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.632912: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:56.633926: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-23 13:47:56.634333: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-23 13:47:56.636568: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-23 13:47:56.637146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-23 13:47:56.637365: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:56.637504: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.638077: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.638349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1762] Adding visible gpu devices: 0
2021-07-23 13:47:56.643456: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2021-07-23 13:47:56.643902: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61dc1a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-07-23 13:47:56.643913: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-07-23 13:47:56.690559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.691044: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61aefc0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-07-23 13:47:56.691061: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1650, Compute Capability 7.5
2021-07-23 13:47:56.691354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.691822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1634] Found device 0 with properties: 
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.56
pciBusID: 0000:01:00.0
2021-07-23 13:47:56.691863: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.691943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:56.692006: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-23 13:47:56.692055: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-23 13:47:56.692082: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-23 13:47:56.692113: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-23 13:47:56.692127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:56.692242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.692557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.692838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1762] Adding visible gpu devices: 0
2021-07-23 13:47:56.692879: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-23 13:47:56.922074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1175] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-23 13:47:56.922102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181]      0 
2021-07-23 13:47:56.922108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1194] 0:   N 
2021-07-23 13:47:56.922484: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.923047: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:985] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-23 13:47:56.923448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2647 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-07-23 13:47:58.050318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-23 13:47:58.454611: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-23 13:47:58.783613: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.791662: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.799769: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.807322: E tensorflow/stream_executor/cuda/cuda_dnn.cc:341] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-07-23 13:47:58.914962: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915018: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915063: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-07-23 13:47:58.915090: W tensorflow/core/kernels/queue_base.cc:277] _0_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv1_1/Conv2D}}]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv1_1/Conv2D}}]]
	 [[train_op/control_dependency/_373]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1004, in managed_session
    yield sess
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775, in train
    train_step_kwargs)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py", line 613, in _train_step_fn
    session, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[train_op/control_dependency/_373]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv1_1/Conv2D':
  File "train_market1501.py", line 130, in <module>
    main()
  File "train_market1501.py", line 71, in main
    **train_kwargs)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 177, in train_loop
    trainable_scopes=trainable_scopes)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 254, in create_trainer
    feature_var, logit_var = network_factory(image_var)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 120, in factory_fn
    weight_decay=weight_decay)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 26, in create_network
    weights_regularizer=conv_regularizer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1162, in convolution2d
    conv_dims=2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1060, in convolution
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 201, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1176, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 662, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 252, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2052, in conv2d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_market1501.py", line 130, in <module>
    main()
  File "train_market1501.py", line 71, in main
    **train_kwargs)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 188, in train_loop
    save_interval_secs=save_interval_secs, number_of_steps=number_of_steps)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/queued_trainer.py", line 468, in run
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 790, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1014, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 839, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/supervisor.py", line 1045, in run_loop
    [self._sv.summary_op, self._sv.global_step])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[conv4_1/1/Elu-0-1-TransposeNCHWToNHWC-LayoutOptimizer/_333]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv1_1/Conv2D':
  File "train_market1501.py", line 130, in <module>
    main()
  File "train_market1501.py", line 71, in main
    **train_kwargs)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 177, in train_loop
    trainable_scopes=trainable_scopes)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/train_app.py", line 254, in create_trainer
    feature_var, logit_var = network_factory(image_var)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 120, in factory_fn
    weight_decay=weight_decay)
  File "/workspace/rvai_algorithms/cells/trackers/deep_sort/rvai/cells/deep_sort/cosine_metric_learning_code/nets/deep_sort/network_definition.py", line 26, in create_network
    weights_regularizer=conv_regularizer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1162, in convolution2d
    conv_dims=2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py", line 1060, in convolution
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 201, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1176, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 662, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 252, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2052, in conv2d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
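
One way to check whether TensorFlow's up-front memory reservation is what triggers the cuDNN handle failure in the log above is to rerun the same command with on-demand GPU memory growth. The sketch below is only an illustration, not part of the repository: it assumes the installed TensorFlow build honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable, and the helper names (allow_growth_env, training_command, run_training) are hypothetical. It launches the same training command as above in a child process with that variable set.

```python
import os
import subprocess
import sys


def allow_growth_env(base_env=None):
    """Return a copy of the environment that asks TensorFlow to grow GPU
    memory on demand instead of reserving most of it at startup.

    Assumption: the installed TensorFlow build reads
    TF_FORCE_GPU_ALLOW_GROWTH. base_env defaults to the current environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def training_command(batch_size=16):
    """Build the same command line that was used in the report above."""
    return [
        sys.executable,
        "train_market1501.py",
        "--mode=train",
        "--batch_size={}".format(batch_size),
    ]


def run_training(batch_size=16):
    """Run the training script in a child process with memory growth enabled.

    Returns the child's exit code; the training output goes to the console.
    """
    result = subprocess.run(training_command(batch_size), env=allow_growth_env())
    return result.returncode
```

Calling run_training() from the directory that contains train_market1501.py would start the run. If the error disappears, the default preallocation was likely the trigger; if it persists, the cause is probably elsewhere (for example, another process holding GPU memory, since the log reports the device being created with only 2647 MB). The in-code counterpart is setting gpu_options.allow_growth = True on a tf.compat.v1.ConfigProto and passing it to the session, but where that config would be wired into queued_trainer.py is not shown in this report.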