CUDA_ERROR_LAUNCH_FAILED when training on GPU locally #29

Hi, I'm trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I'm getting the error in the title just before the model is supposed to start training. I will copy the complete log below. My configuration is as follows:

I can't put my finger on where the problem is, because:

This is on a Windows system. On Ubuntu the situation was the same, but I was getting the following error:

Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Any help will be appreciated.

Comments
Can you add the exact command you're running and any other details (dataset etc.) that might be relevant?
This is the command being run:

And I believe that the dataset at the moment is just a single wav (around 15 seconds) that I prepared with ddsp_prepare_tfrecord. You can find the tfrecord files attached. As I said, what confuses me most is that the same command runs perfectly fine when only the CPU is used for training. At the same time, judging from the execution of a TensorFlow toy example, TF and CUDA seem to be configured correctly to work together.
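(For context, a sketch of the preparation step: the flags below follow the DDSP README, and the paths are placeholders rather than the ones actually used here.)

```bash
ddsp_prepare_tfrecord \
  --input_audio_filepatterns=/path/to/audio/*wav \
  --output_tfrecord_path=/path/to/dataset/train.tfrecord \
  --num_shards=10 \
  --alsologtostderr
```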
The problem was also discussed in this TensorFlow issue: tensorflow/tensorflow#24496. Pasting this code inside train_util.py solved the problem.
What was happening is that the process started filling the GPU memory very quickly, and when it exceeded the available memory, the aforementioned error popped up.
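(The snippet itself isn't reproduced above; judging from the linked TensorFlow issue and the reply below, it is presumably the session-based allow_growth workaround, along these lines, as a sketch rather than the exact pasted code:)

```python
import tensorflow as tf

# Ask the CUDA allocator to grow GPU memory on demand instead of
# reserving all of it up front (TF1-style session configuration).
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
tf.compat.v1.Session(config=config)
```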
Thanks for looking into this! It seems you're using a GPU with about half the memory of what we've been testing on (a V100), so sorry you bumped into this edge case. I am a little confused why that code snippet works (since we don't use sessions in 2.0), but I assume it's somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?

```python
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
```
You're welcome :) Yes, that code also does the job for training. By the way, ddsp_prepare_tfrecord also has the same (or a similar) problem. The console output is different, but I can still see that it just allocates the whole GPU memory and then crashes. Where should I put that fix? I've put it everywhere I can think of (prepare_tfrecord.py, prepare_tfrecord_lib.py, spectral_ops.py, core.py) and it doesn't seem to work.

Edit: I was trying to prepare a bigger dataset when I got this error (970 audio files, 264 MB), and found out it didn't work even on CPU. A small dataset with only one wav is prepared correctly both with GPU and CPU. How can I get around this? Thank you very much.
Cool, any interest in adding that to the code? I think it should probably just be a function (one possible shape is sketched below). The dataset creation is perhaps a different issue, as it's being caught by this assert:

Would you like to create a different issue for that?
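(A minimal sketch of what such a helper could look like; the function name and placement are illustrative, not the actual code from the eventual PR:)

```python
import tensorflow as tf

def allow_memory_growth():
  """Enable on-demand GPU memory growth instead of pre-allocating it all."""
  for gpu in tf.config.experimental.list_physical_devices('GPU'):
    try:
      # Memory growth must be set before the GPUs have been initialized.
      tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
      print(e)
```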
I found out that the problem was with a specific .wav file, and not because of the size of the dataset. It would be interesting to find out why the code crashes with it, so I will open a new issue later. I also created a PR with the fix for this issue in the way you suggested, so I'm closing it. Thank you for your responsiveness!
I got a similar issue while training on a T4. The code suggested by jesseengel (#29 (comment)) fixed the issue.