"CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_copy(GetCublasHandle(), dim_, src.Data(), 1, data_, 1)' #4501

Closed
jtrmal opened this issue Apr 20, 2021 · 8 comments · Fixed by #4517

jtrmal (Contributor) commented Apr 20, 2021

Hi, in some proprietary code I have discovered an issue with the cudadecoder. The repro is quite easy --
in file src/cudadecoderbin/batched-wav-nnet3-cuda2.cc simply replace

     BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
         batched_decoder_config, *decode_fst, am_nnet, *trans_model);

by

    {
      BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
          batched_decoder_config, *decode_fst, am_nnet, *trans_model);
    }
    BatchedThreadedNnet3CudaPipeline2 cuda_pipeline(
        batched_decoder_config, *decode_fst, am_nnet, *trans_model);

(i.e. allocate the object, deallocate it, and try to allocate it again).
The backtrace is below. Any suggestion on how to fix that? Imagine allocating the decoder again with different params (it will still crash).

LOG (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
ERROR (batched-wav-nnet3-cuda2[5.5.906~3-e7455]:CopyFromVec<double>():cu-vector.cc:892) cublasStatus_t 1 : "CUBLAS_STATUS_NOT_INITIALIZED" returned from 'cublas_copy(GetCublasHandle(), dim_, src.Data(), 1, data_, 1)'

[ Stack-Trace: ]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-base.so(+0x8e9e) [0x7f51fbd19e9e]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x6bc) [0x7f51fbd1aaec]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x55993466fcab]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudamatrix.so(void kaldi::CuVectorBase<float>::CopyFromVec<double>(kaldi::CuVectorBase<double> const&)+0x269) [0x7f51fc2891c7]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::ComputeDerived()+0x1c0) [0x7f51fdade202]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::BatchNormComponent(kaldi::nnet3::BatchNormComponent const&)+0xe6) [0x7f51fdade3e4]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::BatchNormComponent::Copy() const+0x25) [0x7f51fdae0c81]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::Nnet::Nnet(kaldi::nnet3::Nnet const&)+0x458) [0x7f51fdb2fbee]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::ReadParametersFromModel()+0x3c2) [0x7f51fdda515c]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::Initialize(fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&)+0x15) [0x7f51fdda57cf]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipeline::BatchedThreadedNnet3CudaOnlinePipeline(kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipelineConfig const&, fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&, kaldi::nnet3::AmNnetSimple const&, kaldi::TransitionModel const&)+0xaa6) [0x7f51fddd5500]
/home/jtrmal/projects/kaldi/src/lib/libkaldi-cudadecoder.so(kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline2::BatchedThreadedNnet3CudaPipeline2(kaldi::cuda_decoder::BatchedThreadedNnet3CudaPipeline2Config const&, fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const&, kaldi::nnet3::AmNnetSimple const&, kaldi::TransitionModel const&)+0x43) [0x7f51fddcc515]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(main+0xea3) [0x55993466e2ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f51faed40b3]
/home/jtrmal/projects/kaldi/src/cudadecoderbin/batched-wav-nnet3-cuda2(_start+0x2e) [0x55993466a20e]

LOG (lattice-scale[5.5.906~3-e7455]:main():lattice-scale.cc:107) Done 0 lattices.
jtrmal added the bug label Apr 20, 2021
jtrmal (Contributor, Author) commented Apr 21, 2021

CUDA 11.2, verified on latest kaldi master.
@hugovbraun any suggestion?

hugovbraun (Contributor) commented:
Could you check if the CUDA device destructor gets called between the two?

`if (cublas_handle_)`

hugovbraun (Contributor) commented:
I can repro. Looking at this. Thanks

jtrmal (Contributor, Author) commented Apr 23, 2021 via email

hugovbraun (Contributor) commented:
Thanks, it seems to be an exotic bug. I've already spent some time with gdb and nothing obvious shows up. I'm wondering if some previous cublas call goes wrong and the cublas handle somehow gets reset. I'll continue searching.

kkm000 (Contributor) commented May 2, 2021

@hugovbraun, FWIW, if it helps your analysis, BatchedThreadedNnet3CudaOnlinePipeline doesn't seem to be affected. CUDA 11.3, driver 465.19.01, Debian x64, kernel 4.19, patch level 181. The repro code is basically the same, only with an added call to the `.WaitForLatticeCallbacks()` method before letting the object go out of the inner scope. I was getting a different crash without this Wait call. I also tried both constructing and immediately destroying the dynamic batcher (also with its own `.WaitForCompletion()` call), and going vanilla without it; no repro for this crash in either case.

Wondering if a similar call to `.WaitForAllTasks()` would change the outcome in @jtrmal's repro code (the simple loop in this method looks benign to skip if there are no tasks pending completion; I just do not know whether there are any).
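
For concreteness, a hedged sketch of the scoped variant described above (the constructor signature matches the stack trace earlier in this issue; `online_config` is a stand-in for however the surrounding binary builds its options, so treat the exact names as assumptions rather than verified API):

    // Sketch only: online pipeline constructed in an inner scope, with an
    // explicit wait before the destructor runs.
    {
      BatchedThreadedNnet3CudaOnlinePipeline online_pipeline(
          online_config, *decode_fst, am_nnet, *trans_model);
      online_pipeline.WaitForLatticeCallbacks();  // without this, a different crash
    }  // pipeline destructor runs here
    // Constructing a second instance afterwards did not reproduce the crash.
    BatchedThreadedNnet3CudaOnlinePipeline online_pipeline2(
        online_config, *decode_fst, am_nnet, *trans_model);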

galv (Contributor) commented May 4, 2021

First of all, for those curious, I'm going to be working with Hugo a lot more now from within NVIDIA. Anyway, I found what is likely to be the bug, @jtrmal.

I couldn't exactly reproduce your issue. Instead, I encountered a CUDA runtime error (rather than a cuBLAS error) at

    CU_SAFE_CALL(cudaGetLastError());
This is probably because I am using ivectors in my pipeline, while you are not.

It became clear fairly quickly that the source of the error was somewhere in the destructor of BatchedThreadedNnet3CudaPipeline2. The destructor of an object runs before the destructors of its members, so I also learned that one of the members was the problem. I used ltrace to figure out which CUDA call in particular was the cause of the issue:

    ltrace -i -l 'libcu*' ./batched-wav-nnet3-cuda2 \
                              --frame-subsampling-factor=3 \
                              --config=/home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/conf/online.conf \
                              --max-active=7000 \
                              --beam=15.0 \
                              --lattice-beam=6.0 \
                              --acoustic-scale=1.0 \
                              --cuda-decoder-copy-threads=2 \
                              --cuda-worker-threads=2 \
                              --word-symbol-table=/home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/graph_pp/words.txt \
                              /home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/chain/tdnn_7b/final.mdl \
                              /home/dgalvez/code/asr/kaldi/egs/aspire/s5/exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
                              scp:wav.scp \
                              ark,t:-

And then I learned that cudaFreeHost() was being called on a pointer allocated via cudaMalloc (therefore, on the device) in the destructor of BatchedStaticNnet3. So that's most likely the issue. A fix is on the way.
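
To make the mismatch concrete, here is a standalone sketch (not the Kaldi code itself; only the CUDA runtime API is involved): memory obtained with cudaMalloc must be released with cudaFree, and handing it to cudaFreeHost instead fails, leaving an error for a later cudaGetLastError() check to report.

    // Standalone illustration of the host/device free mismatch (not Kaldi code).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      int* d_ptr = nullptr;
      cudaMalloc(&d_ptr, 1024 * sizeof(int));    // device allocation

      // Wrong pairing: cudaFreeHost expects pinned host memory, so this call
      // fails, and the error is what a later cudaGetLastError() would report.
      cudaError_t err = cudaFreeHost(d_ptr);
      std::printf("cudaFreeHost(device ptr): %s\n", cudaGetErrorString(err));

      // Correct pairing for cudaMalloc'd memory.
      err = cudaFree(d_ptr);
      std::printf("cudaFree(device ptr):     %s\n", cudaGetErrorString(err));
      return 0;
    }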

In addition, I found a memory leak in ThreadPoolLight (circular shared_ptr references), but that can wait until later.
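
The shared_ptr cycle is a generic C++ pattern; a minimal sketch of how such a leak happens (not the actual ThreadPoolLight code):

    // Two objects owning each other via shared_ptr never reach use_count 0,
    // so neither destructor runs and the memory leaks.
    #include <memory>

    struct Node {
      std::shared_ptr<Node> other;   // owning pointer in both directions
      // std::weak_ptr<Node> other;  // usual fix: make one direction non-owning
    };

    int main() {
      auto a = std::make_shared<Node>();
      auto b = std::make_shared<Node>();
      a->other = b;
      b->other = a;   // cycle established
      return 0;       // a and b go out of scope, but neither Node is freed
    }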

jtrmal (Contributor, Author) commented May 4, 2021 via email

galv added a commit to galv/kaldi that referenced this issue May 4, 2021
cudaFreeHost() was called instead of cudaFree() on
d_batch_slot_assignment_, which is a pointer to device memory, causing
an error.

This hadn't been noticed before because people usually destroyed the
BatchedThreadedNnet3CudaPipeline2 only when terminating the program.

Testing: I manually applied the change described in
kaldi-asr#4501 (comment)

No unit test.

Additionally, add several defensive CU_SAFE_CALL guards that weren't
there before.
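
A hedged sketch of the change this message describes (member and class names come from the message; whether the real fix also wraps the call in CU_SAFE_CALL is an assumption based on the last paragraph):

    // In the BatchedStaticNnet3 destructor (sketch, not the actual diff):
    // cudaFreeHost(d_batch_slot_assignment_);         // before: wrong API for a device pointer
    CU_SAFE_CALL(cudaFree(d_batch_slot_assignment_));  // after: correct pairing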
kkm000 linked a pull request May 8, 2021 that will close this issue
galv added a commit to galv/kaldi that referenced this issue May 25, 2021