
Update intersect_dense.cu #797

Closed
wants to merge 1 commit into from

Conversation

luomingshuang
Contributor

When I run the new mmi_bigram_{train, decode}.py, there is a bug like the following.
With Kangwei's help, I removed some lines of code in k2/csrc/intersect_dense.cu and the script can run successfully.
So I made this PR to update intersect_dense.cu. I'm not sure if there are other ways to fix it.

batch 2170, epoch 1/10 global average objf: 0.242149 over 28408356.0 frames (100.0% kept), current batch average objf: 0.255091 over 12872 frames (100.0% kept) avg time waiting for batch 0.005s
batch 2180, epoch 1/10 global average objf: 0.242089 over 28538822.0 frames (100.0% kept), current batch average objf: 0.192952 over 13173 frames (100.0% kept) avg time waiting for batch 0.005s
[F] /ceph-ly/open-source/k2/k2/csrc/intersect_dense.cu:851:lambda [](signed int)->void::operator()(signed int)->void block:[0,0,0], thread: [41,0,0] Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0 -658531.687500 vs -658530.625000
/ceph-ly/open-source/k2/k2/csrc/intersect_dense.cu:851: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [41,0,0] Assertion `Some bad things happened` failed.
[F] /ceph-ly/open-source/k2/k2/csrc/array.h:329:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int] Check failed: ret == cudaSuccess (710 vs. 0)  Error: device-side assert triggered. 


[ Stack-Trace: ]
/ceph-ly/open-source/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7f4da74943c7]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Array1<int>::operator[](int) const+0xeb9) [0x7f4da77795b9]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Renumbering::ComputeOld2New()+0x14e) [0x7f4da777451e]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x7f8) [0x7f4da7775c98]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*)+0x7ec) [0x7f4da78d344c]
/ceph-ly/open-source/k2/build/lib/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, k2::Array1<int> const*, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x420) [0x7f4da78c3bc0]
/ceph-ly/open-source/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x59c40) [0x7f4da8594c40]
/ceph-ly/open-source/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x1c4ee) [0x7f4da85574ee]
python(PyCFunction_Call+0x56) [0x5ff8a6]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x6095) [0x57e855]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(PyVectorcall_Call+0x51) [0x5ff3b1]
/root/luomingshuang/py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x8fd) [0x7f4e9fc5298d]
python(PyCFunction_Call+0xfb) [0x5ff94b]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x5b9e) [0x57e35e]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyFunction_Vectorcall+0x19c) [0x602b2c]
python() [0x4ffa96]
python(PyVectorcall_Call+0x51) [0x5ff3b1]
python(_PyEval_EvalFrameDefault+0x1c4a) [0x57a40a]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x247) [0x602bd7]
python(_PyObject_FastCallDict+0x4a) [0x60261a]
python() [0x5b034b]
python(_PyObject_MakeTpCall+0x28f) [0x5fff6f]
python(_PyEval_EvalFrameDefault+0x5553) [0x57dd13]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python(_PyFunction_Vectorcall+0x442) [0x602dd2]
python(_PyEval_EvalFrameDefault+0x1930) [0x57a0f0]
python(_PyFunction_Vectorcall+0x19c) [0x602b2c]
python(_PyEval_EvalFrameDefault+0x619) [0x578dd9]
python(_PyEval_EvalCodeWithName+0x25c) [0x5765ec]
python() [0x662c2e]
python(PyRun_FileExFlags+0x97) [0x662d07]
python(PyRun_SimpleFileExFlags+0x17f) [0x663a1f]
python(Py_RunMain+0x20e) [0x687dbe]
python(Py_BytesMain+0x29) [0x688149]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f4ea4011bf7]
python(_start+0x2a) [0x607daa]

Traceback (most recent call last):
  File "mmi_bigram_train_new.py", line 537, in <module>
    main()
  File "mmi_bigram_train_new.py", line 466, in main
    objf, valid_objf, global_batch_idx_train = train_one_epoch(
  File "mmi_bigram_train_new.py", line 187, in train_one_epoch
    curr_batch_objf, curr_batch_frames, curr_batch_all_frames = get_objf(
  File "mmi_bigram_train_new.py", line 105, in get_objf
    mmi_loss, tot_frames, all_frames = loss_fn(nnet_output, texts, supervision_segments)
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/snowfall-0.1-py3.8.egg/snowfall/objectives/mmi.py", line 212, in forward
  File "/root/luomingshuang/py38/lib/python3.8/site-packages/snowfall-0.1-py3.8.egg/snowfall/objectives/mmi.py", line 88, in _compute_mmi_loss_exact_optimized
  File "/ceph-ly/open-source/k2/k2/python/k2/autograd.py", line 768, in intersect_dense
    _IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
  File "/ceph-ly/open-source/k2/k2/python/k2/autograd.py", line 538, in forward
    ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
RuntimeError: Some bad things happed.

@danpovey
Collaborator

danpovey commented Aug 6, 2021

I think it would be better to put the score difference in a special buffer computed for that purpose (perhaps one element per FSA in the minibatch) that can be transferred to the CPU, with the largest difference printed as a warning if it is larger than 1.0. If we just let this happen silently and the score diff is larger than the beam, intersection may sometimes fail and we won't have any idea why it failed.
You may need Kangwei's help with this.
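The idea above could look something like the following minimal Python sketch (this is not k2's actual API; the function name `check_score_diffs` and its arguments are hypothetical): collect one forward/backward score difference per FSA in the minibatch, transfer them to the CPU, and warn with the largest one instead of asserting on the device.

```python
import warnings

def check_score_diffs(tot_scores_start, tot_scores_end, beam=10.0, tol=1.0):
    """Per-FSA score-difference check: instead of a device-side assert,
    gather one |end - start| difference per FSA and warn with the largest
    if it exceeds tol (hypothetical sketch, not k2's implementation)."""
    diffs = [abs(e - s) for s, e in zip(tot_scores_start, tot_scores_end)]
    max_diff = max(diffs, default=0.0)
    if max_diff > tol:
        warnings.warn(
            f"max forward/backward score difference {max_diff:.6f} exceeds "
            f"{tol}; intersection may fail if it exceeds the output beam "
            f"({beam})")
    return max_diff
```

With the two totals from the failing check above (-658530.625 vs -658531.6875) this would warn about a difference of about 1.06 rather than killing the kernel.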

@luomingshuang
Contributor Author

OK, we will give it a try.


@danpovey
Collaborator

danpovey commented Aug 6, 2021 via email

@luomingshuang
Contributor Author

Got it.

And please also try changing clip-grad to clip-grad-norm; sometimes in LSTMs we can get gradient explosion. I don't know whether it might be possible to print out the total gradient norm for each step; that would let us know if gradient explosion is happening (caution: if this does happen, it might be quite rare).
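The per-step monitoring suggested here can be sketched in plain Python (this mimics the behaviour of `torch.nn.utils.clip_grad_norm_`; the helper names are our own, not torch's):

```python
import math

def global_grad_norm(param_grads):
    """Total L2 norm over all parameter gradients -- the quantity worth
    logging each step to spot gradient explosion."""
    return math.sqrt(sum(g * g for vec in param_grads for g in vec))

def clip_grad_norm(param_grads, max_norm):
    """Rescale all gradients in place when the global norm exceeds max_norm
    (the behaviour of torch.nn.utils.clip_grad_norm_, which also returns
    the pre-clip norm so it can be printed)."""
    total = global_grad_norm(param_grads)
    if total > max_norm:
        scale = max_norm / total
        for vec in param_grads:
            for i, g in enumerate(vec):
                vec[i] = g * scale
    return total

# Per-step logging sketch: two parameter tensors flattened to lists.
grads = [[3.0, 4.0], [0.0, 12.0]]
norm = clip_grad_norm(grads, max_norm=5.0)
print(f"step grad norm before clipping: {norm:.3f}")
```

In real training code one would call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` and log its return value each step.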


@luomingshuang
Contributor Author

luomingshuang commented Aug 11, 2021

When I changed clip_grad_value_ to clip_grad_norm_ in mmi_bigram_train.py, there were two good results.

  1. The error about the score difference doesn't happen with the original version of k2. I also used clip_grad_norm_ for ctc_train.py, and the error doesn't happen there either. So I think clip_grad_norm_ may solve the error about the score difference, and the lines of code in k2/csrc/intersect_dense.cu don't need to be removed. Maybe I will close this PR.
  2. The WER (test-clean) with clip_grad_norm_ is 10.44%, which is better than the result (10.56%) with clip_grad_value_. It seems that clip_grad_norm_ works better than clip_grad_value_ for our current acoustic model (TDNN-LSTM). So maybe I will make a PR replacing clip_grad_value_ with clip_grad_norm_ in all training scripts. Is it necessary?
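The difference between the two clipping strategies can be illustrated with a small self-contained sketch (plain Python, mimicking the behaviour of torch's clip_grad_value_ and clip_grad_norm_; the function names here are our own): value clipping clamps each element independently and can change the gradient direction, while norm clipping rescales the whole vector and preserves it.

```python
import math

def clip_by_value(grad, clip):
    """Element-wise clamp to [-clip, clip], like clip_grad_value_."""
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector to max_norm if its L2 norm exceeds it,
    like clip_grad_norm_; the direction is unchanged."""
    n = math.sqrt(sum(g * g for g in grad))
    return [g * max_norm / n for g in grad] if n > max_norm else list(grad)

g = [3.0, -4.0]                 # L2 norm 5.0
print(clip_by_value(g, 1.0))    # each element clamped: [1.0, -1.0]
print(clip_by_norm(g, 1.0))     # direction preserved: [0.6, -0.8]
```

The direction-preserving property may be why norm clipping behaves better here, though the source only reports the empirical WER difference.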


@danpovey
Collaborator

Let's not change the other scripts until we have time to test the effect of the change. For now we can merge this PR, after you have updated RESULTS.md.

@luomingshuang
Contributor Author

OK, I get it.

