
No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error" #4076

Closed
andrenatal opened this issue May 18, 2020 · 6 comments

@andrenatal

Hello everyone,

I started intermittently getting "No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error"" while training ASPIRE's TDNN-LSTM chain recipe (local/chain/tuning/run_tdnn_lstm_1a.sh), after having trained local/chain/run_tdnn.sh with no errors.

I have a local setup with 4 RTX 2080 Ti running CUDA 10.2, and the crash usually occurred when 8 to 12 jobs were running. I then reduced the minimum and maximum number of jobs to match the four GPUs, but the crash occurred even earlier than before.

I'm wondering whether anyone has run into this before I start suspecting a hardware problem: this server suffered an accidental shutdown a couple of weeks ago, pretty late into a training run that had been going without failures up to that point.

Maybe if I use --use-gpu=optional, any job for which SelectGpuId fails would fall back to the CPU, so the training would be mixed between CPUs and GPUs without further consequences for the final model?

Thanks in advance.

Andre

more exp/chain/tdnn_lstm_1a/log/train.1399.2.log
# nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_lstm_1a/cache.1399 --xent-regularize=0.025 --optimization.min-deriv-time=-25 --optimization.max-deriv-time-relative=35 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.25 --srand=1399 "nnet3-am-copy --raw=true --learning-rate=0.00308267122033 --scale=1.0 exp/chain/tdnn_lstm_1a/1399.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_lstm_1a/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_lstm_1a/egs/cegs.102.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1399 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |" exp/chain/tdnn_lstm_1a/1400.2.raw
# Started at Mon May 18 06:25:13 PDT 2020
#
nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --read-cache=exp/chain/tdnn_lstm_1a/cache.1399 --xent-regularize=0.025 --optimization.min-deriv-time=-25 --optimization.max-deriv-time-relative=35 --print-interval=10 --momentum=0.0 --max-param-change=2.0 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.25 --srand=1399 "nnet3-am-copy --raw=true --learning-rate=0.00308267122033 --scale=1.0 exp/chain/tdnn_lstm_1a/1399.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_lstm_1a/den.fst 'ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_lstm_1a/egs/cegs.102.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=1399 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |' exp/chain/tdnn_lstm_1a/1400.2.raw
ERROR (nnet3-chain-train[5.5.569~1-6f329]:SelectGpuId():cu-device.cc:166) No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error", in cu-device.cc:166

[ Stack-Trace: ]
nnet3-chain-train(kaldi::MessageLogger::LogMessage() const+0xb42) [0x56242619d3d8]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x562425e36fd7]
nnet3-chain-train(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x40b) [0x56242604aa8d]
nnet3-chain-train(main+0x469) [0x562425e35ba3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7efc675d2b97]
nnet3-chain-train(_start+0x2a) [0x562425e3565a]

kaldi::KaldiFatalError
# Accounting: time=0 threads=1
# Ended (code 255) at Mon May 18 06:25:13 PDT 2020, elapsed time 0 seconds
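
In case it's useful for diagnosis, below is a rough sketch of the shell-level checks I'd run for this kind of intermittent initialization failure. The compute-mode and persistence-mode settings are only assumptions about what might matter when many jobs share the GPUs, not something confirmed from this log.

# What the driver currently reports for each card:
nvidia-smi --query-gpu=index,name,driver_version,compute_mode --format=csv

# Compute-mode details; Kaldi generally suggests exclusive-process mode when
# several training jobs share a machine (whether that is related to this
# particular error is a guess):
nvidia-smi -q -d COMPUTE | grep -i "compute mode"

# Settings often used on multi-GPU Kaldi boxes (require root):
sudo nvidia-smi -c EXCLUSIVE_PROCESS   # one compute process per GPU
sudo nvidia-smi -pm 1                  # keep the driver loaded between jobs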
andrenatal added the bug label on May 18, 2020
@andrenatal
Author

OK, I tried running with --use-gpu=optional, but apparently that's not a valid choice for this training script:


train.py: error: argument --use-gpu: invalid choice: 'optional' (choose from 'true', 'false', 'yes', 'no', 'wait')
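
The restriction seems to be at the Python wrapper level; the nnet3-chain-train binary in the log above accepted --use-gpu=wait, and I think the C++ SelectGpuId code also understands "optional". A quick way to see where the wrapper declares the option, assuming the usual egs/.../s5 layout with steps symlinked in:

# Find where the training wrapper defines --use-gpu and its allowed choices
# (run from the recipe directory, e.g. egs/aspire/s5; directory layout assumed)
grep -rn -- "--use-gpu" steps/nnet3/chain/train.py steps/libs/nnet3/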

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@andrenatal
Author

Hi @jtrmal, I had already set both of those.

@jtrmal
Contributor

jtrmal commented May 18, 2020 via email

@andrenatal
Author

andrenatal commented Jun 1, 2020

I believe it was a driver issue: after rebuilding the OS, CUDA, and Kaldi installations, the training completed fully. Thanks @jtrmal!
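
For anyone hitting the same thing, here is a minimal sketch of how one might sanity-check a rebuilt driver/CUDA/Kaldi stack before restarting training. The cuda-gpu-available binary name and location are from memory of a standard Kaldi source build, so treat the path as an assumption and adjust KALDI_ROOT for your setup.

#!/usr/bin/env bash
# Post-rebuild sanity check: confirm the driver and toolkit are visible,
# then probe the GPUs from Kaldi itself, one probe per card, to mimic
# several training jobs starting at once.
KALDI_ROOT=${KALDI_ROOT:-$HOME/kaldi}   # assumption: standard source build

nvidia-smi --query-gpu=index,name,driver_version --format=csv
nvcc --version | grep -i release

# cuda-gpu-available should exit non-zero if it cannot initialize a GPU;
# with exclusive-process compute mode, keep the probe count <= GPU count.
# (Binary location is assumed; it may live elsewhere in your tree.)
for i in 1 2 3 4; do
  "$KALDI_ROOT/src/nnet3bin/cuda-gpu-available" &
done
wait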
