RuntimeError: CUDA error: invalid configuration argument #702

Closed
huangruizhe opened this issue Nov 23, 2022 · 8 comments

Comments

@huangruizhe
Contributor

While training a Zipformer (pruned_transducer_stateless7) on spgispeech, I ran the following command:

python pruned_transducer_stateless7/train.py --world-size 2 --max-duration 250

I got the following error after training had been running for a while:

2022-11-22 23:32:27,044 INFO [train.py:876] (1/2) Epoch 1, batch 14850, loss[loss=0.4185, simple_loss=0.4207, pruned_loss=0.2081, over 4933.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.3939, pruned_loss=0.1859, over 1194952.18 frames. ], batch size: 40, lr: 2.82e-02,
2022-11-22 23:32:27,046 INFO [train.py:876] (0/2) Epoch 1, batch 14850, loss[loss=0.3625, simple_loss=0.366, pruned_loss=0.1795, over 6045.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.3933, pruned_loss=0.1856, over 1200748.57 frames. ], batch size: 17, lr: 2.82e-02,
2022-11-22 23:32:33,171 INFO [zipformer.py:1414] (0/2) attn_weights_entropy = tensor([2.0664, 2.2180, 2.0900, 2.1552, 1.5869, 1.5967, 1.0766, 1.7589],
       device='cuda:0'), covar=tensor([0.0361, 0.0796, 0.3119, 0.0494, 0.0348, 0.0510, 0.0635, 0.0371],
       device='cuda:0'), in_proj_covar=tensor([0.0022, 0.0021, 0.0021, 0.0026, 0.0021, 0.0027, 0.0025, 0.0024],
       device='cuda:0'), out_proj_covar=tensor([3.2804e-05, 3.6589e-05, 3.2693e-05, 4.0759e-05, 3.4965e-05, 4.2606e-05,
        4.0940e-05, 3.8753e-05], device='cuda:0')
2022-11-22 23:32:38,075 INFO [train.py:1134] (0/2) Saving batch to pruned_transducer_stateless7/exp/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
2022-11-22 23:32:38,115 INFO [train.py:1140] (0/2) features shape: torch.Size([40, 621, 80])
2022-11-22 23:32:38,117 INFO [train.py:1144] (0/2) num tokens: 1462
Traceback (most recent call last):
  File "pruned_transducer_stateless7/train.py", line 1207, in <module>
    main()
  File "pruned_transducer_stateless7/train.py", line 1198, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 1078, in run
    train_one_epoch(
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 809, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It does not seem to be an OOM error. With --max-duration 300, the error can appear as early as batch 50.
On the other hand, with the default --max-duration 100, training runs fine for many batches, but GPU memory usage is very low.
Do you know what the issue might be?
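As the error message suggests, a more accurate stack trace can usually be obtained by setting CUDA_LAUNCH_BLOCKING=1, and PyTorch's CUDA memory counters can help confirm whether the failure is really an OOM. A minimal sketch (generic PyTorch, not tied to the icefall scripts; the helper name log_cuda_memory is just for illustration):

import os

# Sketch only: make CUDA kernel launches synchronous so the Python stack
# trace points at the op that actually failed. This must be set before
# torch initializes CUDA (i.e. before the first tensor is moved to a GPU).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def log_cuda_memory(device: int = 0) -> None:
    # If the peak is close to the card's capacity, an OOM is likely;
    # if it is far below, a bad kernel configuration is more plausible.
    mib = 1024 ** 2
    print(f"allocated: {torch.cuda.memory_allocated(device) / mib:.0f} MiB")
    print(f"peak:      {torch.cuda.max_memory_allocated(device) / mib:.0f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / mib:.0f} MiB")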

@csukuangfj
Collaborator

Which versions of CUDA and PyTorch are you using?

@huangruizhe
Contributor Author

huangruizhe commented Nov 23, 2022

CUDA 11.1 and PyTorch 1.10.0
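(For reference, the versions PyTorch reports about its own build can be printed with a generic snippet like the following:)

import torch

print(torch.__version__)               # e.g. 1.10.0
print(torch.version.cuda)              # CUDA toolkit PyTorch was built against, e.g. 11.1
print(torch.backends.cudnn.version())  # cuDNN build version
print(torch.cuda.get_device_name(0))   # GPU model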

@csukuangfj
Collaborator

Could you switch to another CUDA version, e.g., CUDA 10.2?

RuntimeError: CUDA error: invalid configuration argument

Most people who hit this issue are using CUDA 11.1.

@huangruizhe
Contributor Author

Sure, I will try. Thanks for the suggestion!

@csukuangfj
Collaborator

For future reference: the following issues are related to this one, and all of them involve CUDA 11.1.

@danpovey
Collaborator

Looks like this is most likely a PyTorch bug that we just happen to be triggering, so it would probably be easiest to try different versions of PyTorch and/or CUDA, since we would not be able to fix this ourselves.

@huangruizhe
Contributor Author

After switching to CUDA 10.2, the issue was resolved. Thanks a lot!

@desh2608
Collaborator

(We can now use --max-duration 600, and GPU memory utilization is very good.)
