
Update intersect_dense.cu #797

Closed
wants to merge 1 commit into from

Conversation

luomingshuang
Contributor

When I run the new mmi_bigram_{train, decode}.py, there is a bug like the following.
With Kangwei's help, I removed some lines of code in k2/csrc/intersect_dense.cu and the script can run successfully.
So I made this PR to update intersect_dense.cu. I'm not sure if there are other ways to fix it.

batch 2170, epoch 1/10 global average objf: 0.242149 over 28408356.0 frames (100.0% kept), current batch average objf: 0.255091 over 12872 frames (100.0% kept) avg time waiting for batch 0.005s
batch 2180, epoch 1/10 global average objf: 0.242089 over 28538822.0 frames (100.0% kept), current batch average objf: 0.192952 over 13173 frames (100.0% kept) avg time waiting for batch 0.005s
[F] /ceph-ly/open-source/k2/k2/csrc/intersect_dense.cu:851:lambda [](signed int)->void::operator()(signed int)->void block:[0,0,0], thread: [41,0,0] Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0 -658531.687500 vs -658530.625000
/ceph-ly/open-source/k2/k2/csrc/intersect_dense.cu:851: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [41,0,0] Assertion `Some bad things happened` failed.
[F] /ceph-ly/open-source/k2/k2/csrc/array.h:329:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int] Check failed: ret == cudaSuccess (710 vs. 0)  Error: device-side assert triggered. 


[ Stack-Trace: ]
/ceph-ly/open-source/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7f4da74943c7]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Array1<int>::operator[](int) const+0xeb9) [0x7f4da77795b9]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Renumbering::ComputeOld2New()+0x14e) [0x7f4da777451e]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x7f8) [0x7f4da7775c98]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*)+0x7ec) [0x7f4da78d344c]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, k2::Array1<int> const*, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x420) [0x7f4da78c3bc0]
/ceph-ly/open-source/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x59c40) [0x7f4da8594c40]
/ceph-ly/open-source/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x1c4ee) [0x7f4da85574ee]
python(PyCFunction_Call+0x56) [0x5ff8a6]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x6095) [0x57e855]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(PyVectorcall_Call+0x51) [0x5ff3b1]
/root/luomingshuang/py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x8fd) [0x7f4e9fc5298d]
python(PyCFunction_Call+0xfb) [0x5ff94b]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x5b9e) [0x57e35e]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyFunction_Vectorcall+0x19c) [0x602b2c]
python() [0x4ffa96]
python(PyVectorcall_Call+0x51) [0x5ff3b1]
python(_PyEval_EvalFrameDefault+0x1c4a) [0x57a40a]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x247) [0x602bd7]
python(_PyObject_FastCallDict+0x4a) [0x60261a]
python() [0x5b034b]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x5553) [0x57dd13]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyFunction_Vectorcall+0x19c) [0x602b2c]
python(_PyEval_EvalFrameDefault+0x619) [0x578dd9]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python() [0x662c2e]
python(PyRun_FileExFlags+0x97) [0x662d07]
python(PyRun_SimpleFileExFlags+0x17f) [0x663a1f]
python(Py_RunMain+0x20e) [0x687dbe]
python(Py_BytesMain+0x29) [0x688149]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f4ea4011bf7]
python(_start+0x2a) [0x607daa]

Traceback (most recent call last):
  File "mmi_bigram_train_new.py", line 537, in <module>
    main()
  File "mmi_bigram_train_new.py", line 466, in main
    objf, valid_objf, global_batch_idx_train = train_one_epoch(
  File "mmi_bigram_train_new.py", line 187, in train_one_epoch
    curr_batch_objf, curr_batch_frames, curr_batch_all_frames = get_objf(
  File "mmi_bigram_train_new.py", line 105, in get_objf
    mmi_loss, tot_frames, all_frames = loss_fn(nnet_output, texts, supervision_segments)
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/snowfall-0.1-py3.8.egg/snowfall/objectives/mmi.py", line 212, in forward
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/snowfall-0.1-py3.8.egg/snowfall/objectives/mmi.py", line 88, in _compute_mmi_loss_exact_optimized
  File "/ceph-ly/open-source/k2/k2/python/k2/autograd.py", line 768, in intersect_dense
    _IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
  File "/ceph-ly/open-source/k2/k2/python/k2/autograd.py", line 538, in forward
    ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
RuntimeError: Some bad things happed.

@danpovey
Collaborator

danpovey commented Aug 6, 2021

I think it would be better to put the score difference in a special buffer computed for that purpose (perhaps one element per FSA in the minibatch) that can be transferred to the CPU, with the largest difference printed as a warning if it is larger than 1.0. If we just let this happen silently and the score diff is larger than the beam, intersection may sometimes fail and we won't have any idea why it failed.
You may need Kangwei's help with this.
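The idea above could look something like the following minimal Python sketch (this is not k2's actual API; the function name `check_score_diffs` and its arguments are hypothetical): collect one forward/backward score difference per FSA in the minibatch, transfer them to the CPU, and warn with the largest one instead of asserting on the device.

```python
import warnings

def check_score_diffs(tot_scores_start, tot_scores_end, beam=10.0, tol=1.0):
    """Per-FSA score-difference check: instead of a device-side assert,
    gather one |end - start| difference per FSA and warn with the largest
    if it exceeds tol (hypothetical sketch, not k2's implementation)."""
    diffs = [abs(e - s) for s, e in zip(tot_scores_start, tot_scores_end)]
    max_diff = max(diffs, default=0.0)
    if max_diff > tol:
        warnings.warn(
            f"max forward/backward score difference {max_diff:.6f} exceeds "
            f"{tol}; intersection may fail if it exceeds the output beam "
            f"({beam})")
    return max_diff
```

With the two totals from the failing check above (-658530.625 vs -658531.6875) this would warn about a difference of about 1.06 rather than killing the kernel.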

@luomingshuang
Contributor Author

OK, we will give it a try.


@danpovey
Collaborator

danpovey commented Aug 6, 2021 via email

@luomingshuang
Contributor Author

Got it.

And please also try changing clip-grad to clip-grad-norm; sometimes in LSTMs we can get gradient explosion. I don't know whether it might be possible to print out the total gradient norm for each step; that would let us know if gradient explosion is happening (caution: if this does happen, it might be quite rare).
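The per-step monitoring suggested here can be sketched in plain Python (this mimics the behaviour of `torch.nn.utils.clip_grad_norm_`; the helper names are our own, not torch's):

```python
import math

def global_grad_norm(param_grads):
    """Total L2 norm over all parameter gradients -- the quantity worth
    logging each step to spot gradient explosion."""
    return math.sqrt(sum(g * g for vec in param_grads for g in vec))

def clip_grad_norm(param_grads, max_norm):
    """Rescale all gradients in place when the global norm exceeds max_norm
    (the behaviour of torch.nn.utils.clip_grad_norm_, which also returns
    the pre-clip norm so it can be printed)."""
    total = global_grad_norm(param_grads)
    if total > max_norm:
        scale = max_norm / total
        for vec in param_grads:
            for i, g in enumerate(vec):
                vec[i] = g * scale
    return total

# Per-step logging sketch: two parameter tensors flattened to lists.
grads = [[3.0, 4.0], [0.0, 12.0]]
norm = clip_grad_norm(grads, max_norm=5.0)
print(f"step grad norm before clipping: {norm:.3f}")
```

In real training code one would call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` and log its return value each step.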


@luomingshuang
Contributor Author

luomingshuang commented Aug 11, 2021

When I changed clip_grad_value_ to clip_grad_norm_ in mmi_bigram_train.py, there were two good results.

  1. The error about the score difference doesn't happen with the original version of k2. I also used clip_grad_norm_ for ctc_train.py, and the error doesn't happen there either. So I think clip_grad_norm_ may solve the error about the score difference, and the lines of code in k2/csrc/intersect_dense.cu don't need to be removed. Maybe I will close this PR.
  2. The WER (test-clean) with clip_grad_norm_ is 10.44%, which is better than the result (10.56%) with clip_grad_value_. It seems that clip_grad_norm_ works better than clip_grad_value_ for our current acoustic model (TDNN-LSTM). So maybe I will make a PR replacing clip_grad_value_ with clip_grad_norm_ in all training scripts. Is it necessary?
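The difference between the two clipping strategies can be illustrated with a small self-contained sketch (plain Python, mimicking the behaviour of torch's clip_grad_value_ and clip_grad_norm_; the function names here are our own): value clipping clamps each element independently and can change the gradient direction, while norm clipping rescales the whole vector and preserves it.

```python
import math

def clip_by_value(grad, clip):
    """Element-wise clamp to [-clip, clip], like clip_grad_value_."""
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector to max_norm if its L2 norm exceeds it,
    like clip_grad_norm_; the direction is unchanged."""
    n = math.sqrt(sum(g * g for g in grad))
    return [g * max_norm / n for g in grad] if n > max_norm else list(grad)

g = [3.0, -4.0]                 # L2 norm 5.0
print(clip_by_value(g, 1.0))    # each element clamped: [1.0, -1.0]
print(clip_by_norm(g, 1.0))     # direction preserved: [0.6, -0.8]
```

The direction-preserving property may be why norm clipping behaves better here, though the source only reports the empirical WER difference.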


@danpovey
Collaborator

Let's not change the other scripts until we have time to test the effect of the change. For now we can merge this PR, after you have updated RESULTS.md.

@luomingshuang
Contributor Author

OK, I get it.

