
No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error" #4076

Closed
andrenatal opened this issue May 18, 2020 · 6 comments

@andrenatal

Hello everyone,

I started intermittently getting "No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error"" while training ASPIRE's TDNN-LSTM chain recipe (local/chain/tuning/run_tdnn_lstm_1a.sh), after having trained local/chain/run_tdnn.sh with no errors.

I have a local setup with 4 RTX 2080 Ti running CUDA 10.2, and the crash usually occurred when 8 to 12 jobs were running. I then reduced the minimum and maximum number of jobs to match the four GPUs, but the crash occurred even earlier than before.

I'm wondering whether anyone has run into this before I start suspecting a hardware problem: this server suffered an accidental shutdown a couple of weeks ago, pretty late into a training run that had been going without failures up to that point.

Maybe if I use --use-gpu=optional, any job for which SelectGpuId fails would fall back to the CPU, so the training would be mixed between CPUs and GPUs without further consequences for the final model?

Thanks in advance.

Andre

more exp/chain/tdnn_lstm_1a/log/train.1399.2.log
# nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_lstm_1a/cache.1399 --xent-regularize=0.025 --optimization.min-deriv-time=-25 --optimization.max-deriv-time-relative=35 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.25 --srand=1399 "nnet3-am-copy --raw=true --learning-rate=0.00308267122033 --scale=1.0 exp/chain/tdnn_lstm_1a/1399.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_lstm_1a/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_lstm_1a/egs/cegs.102.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1399 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |" exp/chain/tdnn_lstm_1a/1400.2.raw
# Started at Mon May 18 06:25:13 PDT 2020
#
nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_lstm_1a/cache.1399 --xent-regularize=0.025 --optimization.min-deriv-time=-25 --optimization.max-deriv-time-relative=35 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.25 --srand=1399 "nnet3-am-copy --raw=true --learning-rate=0.00308267122033 --scale=1.0 exp/chain/tdnn_lstm_1a/1399.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_lstm_1a/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_lstm_1a/egs/cegs.102.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1399 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |' exp/chain/tdnn_lstm_1a/1400.2.raw
ERROR (nnet3-chain-train[5.5.569~1-6f329]:SelectGpuId():cu-device.cc:166) No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error", in cu-device.cc:166

[ Stack-Trace: ]
nnet3-chain-train(kaldi::MessageLogger::LogMessage() const+0xb42) [0x56242619d3d8]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x562425e36fd7]
nnet3-chain-train(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x40b) [0x56242604aa8d]
nnet3-chain-train(main+0x469) [0x562425e35ba3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7efc675d2b97]
nnet3-chain-train(_start+0x2a) [0x562425e3565a]

kaldi::KaldiFatalError
# Accounting: time=0 threads=1
# Ended (code 255) at Mon May 18 06:25:13 PDT 2020, elapsed time 0 seconds
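
In case it's useful for diagnosis, below is a rough sketch of the shell-level checks I'd run for this kind of intermittent initialization failure. The compute-mode and persistence-mode settings are only assumptions about what might matter when many jobs share the GPUs, not something confirmed from this log.

# What the driver currently reports for each card:
nvidia-smi --query-gpu=index,name,driver_version,compute_mode --format=csv

# Compute-mode details; Kaldi generally suggests exclusive-process mode when
# several training jobs share a machine (whether that is related to this
# particular error is a guess):
nvidia-smi -q -d COMPUTE | grep -i "compute mode"

# Settings often used on multi-GPU Kaldi boxes (require root):
sudo nvidia-smi -c EXCLUSIVE_PROCESS   # one compute process per GPU
sudo nvidia-smi -pm 1                  # keep the driver loaded between jobs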
andrenatal added the bug label on May 18, 2020
@andrenatal
Author

OK, I tried running with --use-gpu=optional, but apparently that's not a valid choice for this training script:


train.py: error: argument --use-gpu: invalid choice: 'optional' (choose from 'true', 'false', 'yes', 'no', 'wait')
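
The restriction seems to be at the Python wrapper level; the nnet3-chain-train binary in the log above accepted --use-gpu=wait, and I think the C++ SelectGpuId code also understands "optional". A quick way to see where the wrapper declares the option, assuming the usual egs/.../s5 layout with steps symlinked in:

# Find where the training wrapper defines --use-gpu and its allowed choices
# (run from the recipe directory, e.g. egs/aspire/s5; directory layout assumed)
grep -rn -- "--use-gpu" steps/nnet3/chain/train.py steps/libs/nnet3/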

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@andrenatal
Author

Hi @jtrmal, I had already set both of those.

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@andrenatal
Author

andrenatal commented Jun 1, 2020

I believe it was a driver issue: after rebuilding the OS, CUDA, and Kaldi installations, the training completed fully. Thanks @jtrmal!
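
For anyone hitting the same thing, here is a minimal sketch of how one might sanity-check a rebuilt driver/CUDA/Kaldi stack before restarting training. The cuda-gpu-available binary name and location are from memory of a standard Kaldi source build, so treat the path as an assumption and adjust KALDI_ROOT for your setup.

#!/usr/bin/env bash
# Post-rebuild sanity check: confirm the driver and toolkit are visible,
# then probe the GPUs from Kaldi itself, one probe per card, to mimic
# several training jobs starting at once.
KALDI_ROOT=${KALDI_ROOT:-$HOME/kaldi}   # assumption: standard source build

nvidia-smi --query-gpu=index,name,driver_version --format=csv
nvcc --version | grep -i release

# cuda-gpu-available should exit non-zero if it cannot initialize a GPU;
# with exclusive-process compute mode, keep the probe count <= GPU count.
# (Binary location is assumed; it may live elsewhere in your tree.)
for i in 1 2 3 4; do
  "$KALDI_ROOT/src/nnet3bin/cuda-gpu-available" &
done
wait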
