"CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_copy(GetCublasHandle(), dim_, src.Data(), 1, data_, 1)' #4501

Closed
jtrmal opened this issue Apr 20, 2021 · 8 comments · Fixed by #4517

jtrmal (Contributor) commented Apr 20, 2021

Hi, in some proprietary code I have discovered an issue with the cudadecoder. The repro is quite easy --
in file src/cudadecoderbin/batched-wav-nnet3-cuda2.cc simply replace

     BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
         batched_decoder_config, *decode_fst, am_nnet, *trans_model);

by

    {
      BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
          batched_decoder_config, *decode_fst, am_nnet, *trans_model);
    }
    BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
        batched_decoder_config, *decode_fst, am_nnet, *trans_model);

(i.e. allocate the object, deallocate it, and try to allocate it again).
The backtrace is below. Any suggestion on how to fix that? Imagine allocating the decoder again with different params (it will still crash).

LOG (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
ERROR (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:CopyFromVec<double>():cu-vector.cc:892) cublasStatus_t 1 : "CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_copy(GetCublasHandle(), dim_, src.Data(), 1, data_, 1)'

[ Stack-Trace: ]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-base.so(+0x8e9e) [0x7f51fbd19e9e]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x6bc) [0x7f51fbd1aaec]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55993466fcab]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudamatrix.so(void kaldi::CuVectorBase<float>::CopyFromVec<double>(kaldi::CuVectorBase<double> const&)+0x269) [0x7f51fc2891c7]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::ComputeDerived()+0x1c0) [0x7f51fdade202]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::BatchNormComponent(kaldi::nnet3::BatchNormComponent const&)+0xe6) [0x7f51fdade3e4]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::Copy() const+0x25) [0x7f51fdae0c81]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Nnet(kaldi::nnet3::Nnet const&)+0x458) [0x7f51fdb2fbee]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::ReadParametersFromModel()+0x3c2) [0x7f51fdda515c]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::Initialize(fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&)+0x15) [0x7f51fdda57cf]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::BatchedThreadedNnet3CudaOnlinePipeline(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipelineConfig const&, fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&, kaldi::nnet3::AmNnetSimple const&, kaldi::TransitionModel const&)+0xaa6) [0x7f51fddd5500]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline2::BatchedThreadedNnet3CudaPipeline2(kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline2Config const&, fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&, kaldi::nnet3::AmNnetSimple const&, kaldi::TransitionModel const&)+0x43) [0x7f51fddcc515]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(main+0xea3) [0x55993466e2ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f51faed40b3]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(_start+0x2e) [0x55993466a20e]

LOG (lattice-scale[5.5.906~3-e7455]:main():lattice-scale.cc:107) Done 0 lattices.
jtrmal added the bug label Apr 20, 2021
jtrmal (Contributor, Author) commented Apr 21, 2021

CUDA 11.2, verified on latest kaldi master.
@hugovbraun any suggestion?

hugovbraun (Contributor) commented:
Could you check if the CUDA device destructor gets called between the two?

`if (cublas_handle_)`

hugovbraun (Contributor) commented:
I can repro. Looking at this. Thanks

jtrmal (Contributor, Author) commented Apr 23, 2021 via email

hugovbraun (Contributor) commented:
Thanks, it seems to be an exotic bug. I've already spent some time with gdb and nothing obvious shows up. I'm wondering if some previous cublas call goes wrong and the cublas handle somehow gets reset. I'll continue searching.

kkm000 (Contributor) commented May 2, 2021

@hugovbraun, FWIW, if it helps your analysis, BatchedThreadedNnet3CudaOnlinePipeline doesn't seem to be affected. CUDA 11.3, driver 465.19.01, Debian x64, kernel 4.19, patch level 181. The repro code is basically the same, only with an added call to the `.WaitForLatticeCallbacks()` method before letting the object go out of the inner scope. I was getting a different crash without this Wait call. I also tried both constructing and immediately destroying the dynamic batcher (also with its own `.WaitForCompletion()` call), and going vanilla without it; no repro for this crash in either case.

Wondering if a similar call to `.WaitForAllTasks()` would change the outcome in @jtrmal's repro code (the simple loop in this method looks benign to skip if there are no tasks pending completion; I just do not know whether there are any).
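
For concreteness, a hedged sketch of the scoped variant described above (the constructor signature matches the stack trace earlier in this issue; `online_config` is a stand-in for however the surrounding binary builds its options, so treat the exact names as assumptions rather than verified API):

    // Sketch only: online pipeline constructed in an inner scope, with an
    // explicit wait before the destructor runs.
    {
      BatchedThreadedNnet3CudaOnlinePipeline online_pipeline(
          online_config, *decode_fst, am_nnet, *trans_model);
      online_pipeline.WaitForLatticeCallbacks();  // without this, a different crash
    }  // pipeline destructor runs here
    // Constructing a second instance afterwards did not reproduce the crash.
    BatchedThreadedNnet3CudaOnlinePipeline online_pipeline2(
        online_config, *decode_fst, am_nnet, *trans_model);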

galv (Contributor) commented May 4, 2021

First of all, for those curious, I'm going to be working with Hugo a lot more now from within NVIDIA. Anyway, I found what is likely to be the bug, @jtrmal.

I couldn't exactly reproduce your issue. Instead, I encountered a CUDA runtime error (rather than a cuBLAS error) at

    CU_SAFE_CALL(cudaGetLastError());
This is probably because I am using ivectors in my pipeline, while you are not.

It became clear fairly quickly that the source of the error was somewhere in the destructor of BatchedThreadedNnet3CudaPipeline2. The destructor of an object runs before the destructors of its members, so I also learned that one of the members was the problem. I used ltrace to figure out which CUDA call in particular was the cause of the issue:

    ltrace -i -l 'libcu*' ./batched-wav-nnet3-cuda2 \
                              --frame-subsampling-factor=3 \
                              --config=/home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/conf/online.conf \
                              --max-active=7000 \
                              --beam=15.0 \
                              --lattice-beam=6.0 \
                              --acoustic-scale=1.0 \
                              --cuda-decoder-copy-threads=2 \
                              --cuda-worker-threads=2 \
                              --word-symbol-table=/home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/graph_pp/words.txt \
                              /home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/chain/tdnn_7b/final.mdl \
                              /home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
                              scp:wav.scp \
                              ark,t:-

And then I learned that cudaFreeHost() was being called on a pointer allocated via cudaMalloc (therefore, on the device) in the destructor of BatchedStaticNnet3. So that's most likely the issue. A fix is on the way.
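
To make the mismatch concrete, here is a standalone sketch (not the Kaldi code itself; only the CUDA runtime API is involved): memory obtained with cudaMalloc must be released with cudaFree, and handing it to cudaFreeHost instead fails, leaving an error for a later cudaGetLastError() check to report.

    // Standalone illustration of the host/device free mismatch (not Kaldi code).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      int* d_ptr = nullptr;
      cudaMalloc(&d_ptr, 1024 * sizeof(int));    // device allocation

      // Wrong pairing: cudaFreeHost expects pinned host memory, so this call
      // fails, and the error is what a later cudaGetLastError() would report.
      cudaError_t err = cudaFreeHost(d_ptr);
      std::printf("cudaFreeHost(device ptr): %s\n", cudaGetErrorString(err));

      // Correct pairing for cudaMalloc'd memory.
      err = cudaFree(d_ptr);
      std::printf("cudaFree(device ptr):     %s\n", cudaGetErrorString(err));
      return 0;
    }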

In addition, I found a memory leak in ThreadPoolLight (circular shared_ptr references), but that can wait until later.
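
The shared_ptr cycle is a generic C++ pattern; a minimal sketch of how such a leak happens (not the actual ThreadPoolLight code):

    // Two objects owning each other via shared_ptr never reach use_count 0,
    // so neither destructor runs and the memory leaks.
    #include <memory>

    struct Node {
      std::shared_ptr<Node> other;   // owning pointer in both directions
      // std::weak_ptr<Node> other;  // usual fix: make one direction non-owning
    };

    int main() {
      auto a = std::make_shared<Node>();
      auto b = std::make_shared<Node>();
      a->other = b;
      b->other = a;   // cycle established
      return 0;       // a and b go out of scope, but neither Node is freed
    }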

jtrmal (Contributor, Author) commented May 4, 2021 via email

galv added a commit to galv/kaldi that referenced this issue May 4, 2021
cudaFreeHost() was called instead of cudaFree() on
d_batch_slot_assignment_, which is a pointer to device memory, causing
an error.

This hadn't been noticed before because people usually destroyed the
BatchedThreadedNnet3CudaPipeline2 only when terminating the program.

Testing: I manually applied the change described in
kaldi-asr#4501 (comment)

No unit test.

Additionally, add several defensive CU_SAFE_CALL guards that weren't
there before.
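
A hedged sketch of the change this message describes (member and class names come from the message; whether the real fix also wraps the call in CU_SAFE_CALL is an assumption based on the last paragraph):

    // In the BatchedStaticNnet3 destructor (sketch, not the actual diff):
    // cudaFreeHost(d_batch_slot_assignment_);         // before: wrong API for a device pointer
    CU_SAFE_CALL(cudaFree(d_batch_slot_assignment_));  // after: correct pairing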
kkm000 linked a pull request May 8, 2021 that will close this issue
galv added a commit to galv/kaldi that referenced this issue May 25, 2021