RuntimeError: CUDA error: invalid configuration argument #702

Closed
huangruizhe opened this issue Nov 23, 2022 · 8 comments

Comments

@huangruizhe
Contributor

While training a Zipformer (pruned_transducer_stateless7) on spgispeech, I ran the following command:

python pruned_transducer_stateless7/train.py --world-size 2 --max-duration 250

I got the following error after training had been running for a while:

2022-11-22 23:32:27,044 INFO [train.py:876] (1/2) Epoch 1, batch 14850, loss[loss=0.4185, simple_loss=0.4207, pruned_loss=0.2081, over 4933.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.3939, pruned_loss=0.1859, over 1194952.18 frames. ], batch size: 40, lr: 2.82e-02,
2022-11-22 23:32:27,046 INFO [train.py:876] (0/2) Epoch 1, batch 14850, loss[loss=0.3625, simple_loss=0.366, pruned_loss=0.1795, over 6045.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.3933, pruned_loss=0.1856, over 1200748.57 frames. ], batch size: 17, lr: 2.82e-02,
2022-11-22 23:32:33,171 INFO [zipformer.py:1414] (0/2) attn_weights_entropy = tensor([2.0664, 2.2180, 2.0900, 2.1552, 1.5869, 1.5967, 1.0766, 1.7589],
       device='cuda:0'), covar=tensor([0.0361, 0.0796, 0.3119, 0.0494, 0.0348, 0.0510, 0.0635, 0.0371],
       device='cuda:0'), in_proj_covar=tensor([0.0022, 0.0021, 0.0021, 0.0026, 0.0021, 0.0027, 0.0025, 0.0024],
       device='cuda:0'), out_proj_covar=tensor([3.2804e-05, 3.6589e-05, 3.2693e-05, 4.0759e-05, 3.4965e-05, 4.2606e-05,
        4.0940e-05, 3.8753e-05], device='cuda:0')
2022-11-22 23:32:38,075 INFO [train.py:1134] (0/2) Saving batch to pruned_transducer_stateless7/exp/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
2022-11-22 23:32:38,115 INFO [train.py:1140] (0/2) features shape: torch.Size([40, 621, 80])
2022-11-22 23:32:38,117 INFO [train.py:1144] (0/2) num tokens: 1462
Traceback (most recent call last):
  File "pruned_transducer_stateless7/train.py", line 1207, in <module>
    main()
  File "pruned_transducer_stateless7/train.py", line 1198, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 1078, in run
    train_one_epoch(
  File "/home/hltcoe/rhuang/icefall/egs/spgispeech/ASR/pruned_transducer_stateless7/train.py", line 809, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/rhuang/mambaforge/envs/icefall/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It does not seem to be an OOM error. With --max-duration 300, the error can appear as early as batch 50.
On the other hand, with the default --max-duration 100, training runs fine for many batches, but GPU memory usage is very low.
Do you know what the issue might be?
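As the error message suggests, a more accurate stack trace can usually be obtained by setting CUDA_LAUNCH_BLOCKING=1, and PyTorch's CUDA memory counters can help confirm whether the failure is really an OOM. A minimal sketch (generic PyTorch, not tied to the icefall scripts; the helper name log_cuda_memory is just for illustration):

import os

# Sketch only: make CUDA kernel launches synchronous so the Python stack
# trace points at the op that actually failed. This must be set before
# torch initializes CUDA (i.e. before the first tensor is moved to a GPU).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def log_cuda_memory(device: int = 0) -> None:
    # If the peak is close to the card's capacity, an OOM is likely;
    # if it is far below, a bad kernel configuration is more plausible.
    mib = 1024 ** 2
    print(f"allocated: {torch.cuda.memory_allocated(device) / mib:.0f} MiB")
    print(f"peak:      {torch.cuda.max_memory_allocated(device) / mib:.0f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / mib:.0f} MiB")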

@csukuangfj
Collaborator

Which versions of CUDA and PyTorch are you using?

@huangruizhe
Contributor Author

huangruizhe commented Nov 23, 2022

CUDA 11.1 and PyTorch 1.10.0
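(For reference, the versions PyTorch reports about its own build can be printed with a generic snippet like the following:)

import torch

print(torch.__version__)               # e.g. 1.10.0
print(torch.version.cuda)              # CUDA toolkit PyTorch was built against, e.g. 11.1
print(torch.backends.cudnn.version())  # cuDNN build version
print(torch.cuda.get_device_name(0))   # GPU model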

@csukuangfj
Collaborator

Could you switch to another CUDA version, e.g., CUDA 10.2?

RuntimeError: CUDA error: invalid configuration argument

Most people who hit this issue are using CUDA 11.1.

@huangruizhe
Contributor Author

Sure, I will try. Thanks for the suggestion!

@csukuangfj
Collaborator

For future reference: the following issues are related to this one, and all of them involve CUDA 11.1.

@danpovey
Collaborator

Looks like this is most likely a PyTorch bug that we just happen to be triggering, so it would probably be easiest to try different versions of PyTorch and/or CUDA, since we would not be able to fix this ourselves.

@huangruizhe
Contributor Author

After switching to CUDA 10.2, the issue was resolved. Thanks a lot!

@desh2608
Collaborator

(We can now use --max-duration 600, and GPU memory utilization is very good.)
