CUDA_ERROR_LAUNCH_FAILED when training on GPU locally #29

Hi, I'm trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I'm getting the error in the title just before the model is supposed to start training. I will copy the complete log below. My configuration is as follows:

I can't put my finger on where the problem is, because:

This is on a Windows system. On Ubuntu the situation was the same, but I was getting the following error:

Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Any help will be appreciated.

Comments
Can you add the exact command you're running and any other details (dataset etc.) that might be relevant?
This is the command being run:

And I believe that the dataset at the moment is just a single wav (around 15 seconds) that I prepared with ddsp_prepare_tfrecord. You can find the tfrecord files attached. As I said, what confuses me most is that the same command runs perfectly fine when only the CPU is used for training. At the same time, judging from the execution of a TensorFlow toy example, TF and CUDA seem to be configured correctly to work together.
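(For context, a sketch of the preparation step: the flags below follow the DDSP README, and the paths are placeholders rather than the ones actually used here.)

```bash
ddsp_prepare_tfrecord \
  --input_audio_filepatterns=/path/to/audio/*wav \
  --output_tfrecord_path=/path/to/dataset/train.tfrecord \
  --num_shards=10 \
  --alsologtostderr
```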
The problem was also discussed in this TensorFlow issue: tensorflow/tensorflow#24496. Pasting this code inside train_util.py solved the problem.
What was happening is that the process started filling the GPU memory very quickly, and when it exceeded the available memory, the aforementioned error popped up.
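(The snippet itself isn't reproduced above; judging from the linked TensorFlow issue and the reply below, it is presumably the session-based allow_growth workaround, along these lines, as a sketch rather than the exact pasted code:)

```python
import tensorflow as tf

# Ask the CUDA allocator to grow GPU memory on demand instead of
# reserving all of it up front (TF1-style session configuration).
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
tf.compat.v1.Session(config=config)
```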
Thanks for looking into this! It seems you're using a GPU with about half the memory of what we've been testing on (a V100), so sorry you bumped into this edge case. I am a little confused why that code snippet works (since we don't use sessions in 2.0), but I assume it's somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?

```python
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
```
You're welcome :) Yes, that code also does the job for training. By the way, ddsp_prepare_tfrecord also has the same (or a similar) problem. The console output is different, but I can still see that it just allocates the whole GPU memory and then crashes. Where should I put that fix? I've put it everywhere I can think of (prepare_tfrecord.py, prepare_tfrecord_lib.py, spectral_ops.py, core.py) and it doesn't seem to work.

Edit: I was trying to prepare a bigger dataset when I got this error (970 audio files, 264 MB), and found out it didn't work even on CPU. A small dataset with only one wav is prepared correctly both with GPU and CPU. How can I get around this? Thank you very much.
Cool, any interest in adding that to the code? I think it should probably just be a function (one possible shape is sketched below). The dataset creation is perhaps a different issue, as it's being caught by this assert:

Would you like to create a different issue for that?
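(A minimal sketch of what such a helper could look like; the function name and placement are illustrative, not the actual code from the eventual PR:)

```python
import tensorflow as tf

def allow_memory_growth():
  """Enable on-demand GPU memory growth instead of pre-allocating it all."""
  for gpu in tf.config.experimental.list_physical_devices('GPU'):
    try:
      # Memory growth must be set before the GPUs have been initialized.
      tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
      print(e)
```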
I found out that the problem was with a specific .wav file, and not because of the size of the dataset. It would be interesting to find out why the code crashes with it, so I will open a new issue later. I also created a PR with the fix for this issue in the way you suggested, so I'm closing it. Thank you for your responsiveness!
I got a similar issue while training on a T4. The code suggested by jesseengel (#29 (comment)) fixed the issue.