CUDA_ERROR_LAUNCH_FAILED when training on GPU locally #29

Closed
andreykramer opened this issue Feb 21, 2020 · 8 comments

@andreykramer
Contributor

Hi, I'm trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I'm getting the error in the title just before the model is supposed to start training. I will copy the complete log below. My configuration is as follows:

  • Tensorflow 2.1
  • CUDA 10.1
  • cudnn 7.6.5 for CUDA 10.1

2020-02-21 13:39:39.259132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:41.110202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
I0221 13:39:43.156791  2672 train_util.py:56] Defaulting to MirroredStrategy
2020-02-21 13:39:43.164404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-02-21 13:39:43.237886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.241122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.246274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.250949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.253287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.257189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.261498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.269133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.271574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.272927: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-02-21 13:39:43.275556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.278705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.280447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.282142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.283834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.285671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.287438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.289994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.291835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.970857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-21 13:39:43.973353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0
2020-02-21 13:39:43.974871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N
2020-02-21 13:39:43.976781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6306 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:43.974044  2672 mirrored_strategy.py:501] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:44.343264  2672 train_util.py:201] Building the model...
WARNING:tensorflow:From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0221 13:39:48.817270  3952 deprecation.py:506] From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-02-21 13:39:52.821030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:53.103556: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:53.327462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
I0221 13:39:54.833573  2672 train_util.py:172] Restoring from checkpoint...
I0221 13:39:54.833573  2672 train_util.py:184] No checkpoint, skipping.
I0221 13:39:54.833573  2672 train_util.py:256] Creating metrics for ListWrapper(['spectral_loss', 'total_loss'])
2020-02-21 13:40:02.551385: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-02-21 13:40:02.554137: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00000a70 (most recent call first):
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\execute.py", line 60 in quick_execute
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 598 in call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1741 in _call_flat
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1660 in _filtered_call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 646 in _call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 576 in __call__
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\train_util.py", line 273 in train
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\gin\config.py", line 1055 in gin_wrapper
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 151 in main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 250 in _run_main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 299 in run
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 172 in console_entry_point
  File "C:\Users\andrey\Anaconda3\envs\test\Scripts\ddsp_run.exe\__main__.py", line 7 in <module>
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 85 in _run_code
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 193 in _run_module_as_main

I can't put my finger on where the problem is, because:

  • TensorFlow trains on the GPU correctly with a toy example, so it is configured correctly to work with CUDA
  • TensorFlow trains DDSP correctly when run on the CPU

This is on a Windows system. On Ubuntu the situation was the same, but I was getting the following error instead:
Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Any help would be appreciated.

@jesseengel
Contributor

Can you add the exact command you're running and any other details (dataset etc.) that might be relevant?

@andreykramer
Contributor Author

This is the command being run:

ddsp_run --mode=train --alsologtostderr --model_dir="C:\Users\andrey\Desktop\winDDSP\MODEL" --gin_file="C:/Users/andrey/Desktop/winDDSP/soloinstrument.gin" --gin_file="C:/Users/andrey/Anaconda3/envs/test/lib/site-packages/ddsp/training/gin/datasets/tfrecord.gin" --gin_param="batch_size=16" --gin_param="TFRecordProvider.file_pattern='C:/Users/andrey/Desktop/winDDSP/data/train.tfrecord*'" --gin_param="train_util.train.num_steps=30000" --gin_param="train_util.train.steps_per_save=100" --gin_param="train_util.Trainer.checkpoints_to_keep=10"

And I believe that the dataset at the moment is just a single wav (around 15 seconds) that I prepared with ddsp_prepare_tfrecord. You can find the tfrecord files attached.
data.zip

As I said, the thing that confuses me most is that the same command runs perfectly fine when only the CPU is used for training. At the same time, judging from running a toy TensorFlow example, TF and CUDA seem to be configured correctly to work together.

@andreykramer
Contributor Author

andreykramer commented Feb 25, 2020

The problem was also discussed in this TensorFlow issue: tensorflow/tensorflow#24496

Pasting this code into train_util.py has solved the problem:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Let GPU memory allocation grow as needed instead of
# pre-allocating nearly all of the available memory up front.
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

What was happening is that the process started filling the GPU memory very quickly, and when it exceeded the available memory the aforementioned error popped up.

@jesseengel
Contributor

jesseengel commented Feb 26, 2020

Thanks for looking into this!

It seems you're using a GPU with about half the memory of what we've been testing on (a V100), so sorry you bumped into this edge case.

I am a little confused why that code snippet works (since we don't use sessions in 2.0), but I assume it's somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

@andreykramer
Contributor Author

andreykramer commented Feb 26, 2020

You're welcome :) Yes, that code also does the job for training. By the way, ddsp_prepare_tfrecord also has the same (or a similar) problem. The console output is different, but I can still see that it just allocates the whole GPU memory and then crashes. Where should I put that fix? I've put it everywhere I can think of (prepare_tfrecord.py, prepare_tfrecord_lib.py, spectral_ops.py, core.py) and it doesn't seem to work.

Edit: I was trying to prepare a bigger dataset (970 audio files, 264 MB) when I got this error, and found out it didn't work even on the CPU. A small dataset with only one wav is prepared correctly with both GPU and CPU. How can I get around this? Thank you very much.

(base) andrey@andrey-PC:~/Escritorio/voicemodIA/DDSP$ ddsp_prepare_tfrecord --input_audio_filepatterns="/media/andrey/DATOS/Datasets/english/train/voice/male/*" --output_tfrecord_path="data/train.tfrecord" --num_shards=10 --alsologtostderr
2020-02-26 14:33:58.648031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-26 14:33:58.649092: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-26 14:33:59.577649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-26 14:33:59.598016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.598330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-26 14:33:59.598359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.598407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:33:59.599519: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-26 14:33:59.599689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-26 14:33:59.600654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-26 14:33:59.601324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-26 14:33:59.601352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-26 14:33:59.601433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.601771: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.602051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-02-26 14:33:59.602304: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-26 14:33:59.606149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2020-02-26 14:33:59.606322: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b42a7ac960 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-26 14:33:59.606332: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-26 14:33:59.670131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.670429: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b42a79a280 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-26 14:33:59.670443: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060 SUPER, Compute Capability 7.5
2020-02-26 14:33:59.670574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.670790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-26 14:33:59.670810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.670818: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:33:59.670834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-26 14:33:59.670847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-26 14:33:59.670858: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-26 14:33:59.670869: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-26 14:33:59.670877: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-26 14:33:59.670913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.671140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.671336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-02-26 14:33:59.671355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.826876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-26 14:33:59.826903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-02-26 14:33:59.826908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-02-26 14:33:59.827068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.827327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.827535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7028 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
I0226 14:34:00.700125 140064621770560 fn_api_runner_transforms.py:540] ==================== <function annotate_downstream_side_inputs at 0x7f61f93db4d0> ====================
I0226 14:34:00.700693 140064621770560 fn_api_runner_transforms.py:540] ==================== <function fix_side_input_pcoll_coders at 0x7f61f93db5f0> ====================
I0226 14:34:00.700984 140064621770560 fn_api_runner_transforms.py:540] ==================== <function lift_combiners at 0x7f61f93db680> ====================
I0226 14:34:00.701103 140064621770560 fn_api_runner_transforms.py:540] ==================== <function expand_sdf at 0x7f61f93db710> ====================
I0226 14:34:00.701330 140064621770560 fn_api_runner_transforms.py:540] ==================== <function expand_gbk at 0x7f61f93db7a0> ====================
I0226 14:34:00.701719 140064621770560 fn_api_runner_transforms.py:540] ==================== <function sink_flattens at 0x7f61f93db8c0> ====================
I0226 14:34:00.701858 140064621770560 fn_api_runner_transforms.py:540] ==================== <function greedily_fuse at 0x7f61f93db950> ====================
I0226 14:34:00.702906 140064621770560 fn_api_runner_transforms.py:540] ==================== <function read_to_impulse at 0x7f61f93db9e0> ====================
I0226 14:34:00.703006 140064621770560 fn_api_runner_transforms.py:540] ==================== <function impulse_to_input at 0x7f61f93dba70> ====================
I0226 14:34:00.703125 140064621770560 fn_api_runner_transforms.py:540] ==================== <function inject_timer_pcollections at 0x7f61f93dbc20> ====================
I0226 14:34:00.703323 140064621770560 fn_api_runner_transforms.py:540] ==================== <function sort_stages at 0x7f61f93dbcb0> ====================
I0226 14:34:00.703435 140064621770560 fn_api_runner_transforms.py:540] ==================== <function window_pcollection_coders at 0x7f61f93dbd40> ====================
I0226 14:34:00.704764 140064621770560 statecache.py:150] Creating state cache with size 100
I0226 14:34:00.704909 140064621770560 fn_api_runner.py:1797] Created Worker handler <apache_beam.runners.portability.fn_api_runner.EmbeddedWorkerHandler object at 0x7f61f935ac10> for environment urn: "beam:env:embedded_python:v1"

I0226 14:34:00.705106 140064621770560 fn_api_runner.py:822] Running ((((ref_AppliedPTransform_Create/Impulse_3)+(ref_AppliedPTransform_Create/FlatMap(<lambda at core.py:2597>)_4))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/AddRandomKeys_7))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_9))+(Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
I0226 14:34:00.731557 140064621770560 fn_api_runner.py:822] Running ((((((((((Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_14))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_15))+(ref_AppliedPTransform_Create/Map(decode)_16))+(ref_AppliedPTransform_Map(_load_audio)_17))+(ref_AppliedPTransform_Map(_add_f0_estimate)_18))+(ref_AppliedPTransform_Map(_add_loudness)_19))+(ref_AppliedPTransform_FlatMap(_split_example)_20))+(ref_AppliedPTransform_Reshuffle/AddRandomKeys_22))+(ref_AppliedPTransform_Reshuffle/ReshufflePerKey/Map(reify_timestamps)_24))+(Reshuffle/ReshufflePerKey/GroupByKey/Write)
I0226 14:34:00.753737 140058657015552 prepare_tfrecord_lib.py:43] Loading '/media/andrey/DATOS/Datasets/english/train/voice/male/V001_0001595577.wav'.
2020-02-26 14:34:01.440541: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:34:01.586315: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
/home/andrey/anaconda3/lib/python3.7/site-packages/librosa/core/time_frequency.py:1208: RuntimeWarning: divide by zero encountered in log10
  - 0.5 * np.log10(f_sq + const[3]))
I0226 14:34:04.901932 140058657015552 prepare_tfrecord_lib.py:43] Loading '/media/andrey/DATOS/Datasets/english/train/voice/male/V001_0001866840.wav'.
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/andrey/anaconda3/bin/ddsp_prepare_tfrecord", line 10, in <module>
    sys.exit(console_entry_point())
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 105, in console_entry_point
    app.run(main)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 100, in main
    run()
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 95, in run
    pipeline_options=FLAGS.pipeline_options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 170, in prepare_tfrecord
    coder=beam.coders.ProtoCoder(tf.train.Example))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 481, in __exit__
    self.run().wait_until_finish()
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 486, in run_pipeline
    default_environment=self._default_environment))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 494, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 583, in run_stages
    stage_context.safe_coders)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 904, in _run_stage
    result, splits = bundle_manager.process_bundle(data_input, data_output)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2105, in process_bundle
    for result, split_result in executor.map(execute, part_inputs):
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2102, in execute
    return bundle_manager.process_bundle(part_map, expected_outputs)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2025, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 1358, in push
    response = self.worker.do_instruction(request)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 352, in do_instruction
    request.instruction_id)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 386, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 812, in process_bundle
    data.transform_id].process_encoded(data.data)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 205, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 302, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 304, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 747, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 956, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/future/utils/__init__.py", line 421, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
RuntimeError: AssertionError [while running 'Map(_add_f0_estimate)']

@jesseengel
Contributor

jesseengel commented Feb 27, 2020

Cool, any interest in adding that to the code? I think it should probably just be a function allow_memory_growth() in train_util.py that gets called from ddsp_run.py when a boolean --allow_memory_growth flag is set (defaulting to False).
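
For reference, a rough sketch of what that could look like (the names and placement are only a suggestion, not a final implementation; it reuses the set_memory_growth call from the TF guide above):

# Hypothetical sketch of the suggested helper and flag (not the actual implementation).
# In train_util.py:
import tensorflow as tf

def allow_memory_growth():
  """Grows GPU memory allocation as needed instead of pre-allocating it all."""
  gpus = tf.config.experimental.list_physical_devices('GPU')
  for gpu in gpus:
    try:
      # Memory growth must be set before the GPUs have been initialized.
      tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
      print(e)

# In ddsp_run.py:
from absl import flags
from ddsp.training import train_util

flags.DEFINE_boolean('allow_memory_growth', False,
                     'Whether to grow GPU memory usage as needed by the process.')
FLAGS = flags.FLAGS

def main(unused_argv):
  if FLAGS.allow_memory_growth:
    train_util.allow_memory_growth()
  # ... rest of main() unchanged ...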

The dataset creation seems like it's perhaps a different issue, as it's being caught by this assert:

  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
AssertionError

Would you like to create a different issue for that?

@andreykramer
Contributor Author

I found out that the problem was with a specific .wav file and not with the size of the dataset. It would be interesting to find out why the code crashes on it, so I will open a new issue later. I also created a PR with the fix for this issue in the way you suggested, so I'm closing it.

Thank you for your responsiveness!

@erl-j

erl-j commented May 8, 2020

I got a similar issue while training on a T4:

failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan. Fatal Python error: Aborted

The code suggested by jesseengel (#29 (comment)) fixed the issue.
