k2::TopSorter::TopSort assertion, but only when using GPU #1204

rouseabout · 2023-05-28T00:36:16Z

Using icefall/egs/librispeech/ASR/pruned_transducer_stateless7 recipe, using only train-clean-5 and dev-clean-2 to train a model, and running pruned_transducer_stateless7/decode.py on GPU with --decoding-method fast_beam_search_nbest_LG produces the following error.

[F] /home/user/k2/k2/csrc/top_sort.cu:324:k2::FsaVec k2::TopSorter::TopSort(k2::Array1<int>*) Check failed: start_state_present[0] == 1 (0 vs. 1) Our current implementation requires that the start state in each Fsa must be present in the first batch

However when pruned_transducer_stateless7/decode.py is forced to use the CPU, fast_beam_search_nbest_LG runs successfully.

Any suggestions what I might be doing wrong?

Image: nvcr.io/nvidia/pytorch:23.04-py3

k2.version:

k2 version: 1.24.3
Build type: Release
Git SHA1: fdb76bf4b3d9f28699eaf854b6b54e015b6b8a62
Git date: Wed May 24 23:51:07 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so

The text was updated successfully, but these errors were encountered:

csukuangfj · 2023-05-30T03:54:39Z

@rouseabout

Are you using the latest icefall and have you made any changes to the code?
Also, how did you generate LG.pt?

rouseabout · 2023-05-30T04:48:48Z

Icefall: Latest. k2-fsa/icefall@1aeffa7

Code changes: Minimal to make it download and run on my limited hardware. Diff: rouseabout/icefall@9e23b38

prepare.sh: download librispeech mini instead of full
prepare.sh: use only dev-clean-2 and train-clean-5 dataset parts
prepare.sh: comment out download musan, prepare musan and compute fbank musan
prepare.sh: comment out building G_4_gram.fst.txt (it is not used by ./local/compile_lg.py)
pruned_transducer_stateless7/asr_datamodule.py: default --mini-libri to True and default --enable-musan to False
pruned_transducer_stateless7/decode.py: test against dev_clean_2_cuts() only

./prepare.sh was used to populate ./data folder and build ./data/lang_bpe_500/LG.pt. The LG.pt model appears to work on CPU.

-rw-r--r-- 1 user user 1226762 May 26 11:07 data/lang_bpe_500/LG.pt

danpovey · 2023-05-30T06:08:40Z

Can you make sure you ran any tests that are available in k2? Sorry I don't recall the details of how.
It could be a bug in k2, relevant on your specific GPU or CPU hardware.

danpovey · 2023-05-30T06:16:53Z

Guys especially @pkufool I noticed an issue in top_sort.cu.
The comment for GetInitialBatch() says:

  /*                                                                                                                                                                                                                               
    Return the ragged array containing the states active on the 1st iteration of                                                                                                                                                   
    the algorithm.  These just correspond to the start-states of all                                                                                                                                                               
    the FSAs, and also the final-states for all FSAs in which final-states                                                                                                                                                         
    had in-degree zero (no arcs entering them).                                                                                                                                                                                    
                                                                                                                                                                                                                                   
    Note: in the originally published algorithm we start with all states                                                                                                                                                           
    that have in-degree zero, but in the context of this toolkit there                                                                                                                                                             
    is (I believe) no use in states that aren't accessible from the start                                                                                                                                                          
    state, so we remove them.                                                                                                                                                                                                      
   */

but the actual code does not do this, it actually just gets the states of in-degree 0, as in the original
published algorithm. Actually this is probably OK (although I think we should change the comment
to just say:

  /*                                                                                                                                                                                                                               
    Return the ragged array containing the states active on the 1st iteration of                                                                                                                                                   
    the algorithm.  These just correspond to all states                                                                                                                                                           
    that have in-degree zero.
*/

Notice that if the start-state has in-degree >0 (this is after removing self-loops), the start-state
will not be included in the 1st batch. This is consistent with the documentation of TopSort.
We must be careful to ensure that the input is acyclic. Can someone ensure that fast_beam_search_nbest_LG
would always give acyclic input? (Note: the 1st/start state must have an arc coming into it for this assertion
to come up, I think).
To avoid cycles, it may be necessary to add something to the "state" that is the number of symbols we have seen
on this frame. I'm assuming right now that the "state" consists of: [state_in_LG, current_frame]; we could augment
it to [state_in_LG, current_frame, num_syms_seen_this_frame].

pkufool · 2023-05-30T07:54:59Z

OK，I will have a look.

danpovey · 2023-05-30T09:14:06Z

Also, @rouseabout, can you try running it in pdb and getting a python stack trace when it fails? It would be nice to know for sure exactly when TopSort is being called.

pkufool · 2023-05-30T09:20:13Z

I suspect the top_sort is in https://github.com/k2-fsa/icefall/blob/7b0afbdc16066701759e088f7edbb648a0b879f0/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L213

I paste the code here (the top_sort is in the last 5th line), can you dump the problemic lattice, you can do it with torch.save(lattice.as_dict(), file_name.pt).

lattice = fast_beam_search(
      model=model,
      decoding_graph=decoding_graph,
      encoder_out=encoder_out,
      encoder_out_lens=encoder_out_lens,
      beam=beam,
      max_states=max_states,
      max_contexts=max_contexts,
      temperature=temperature,
  )

  nbest = Nbest.from_lattice(
      lattice=lattice,
      num_paths=num_paths,
      use_double_scores=use_double_scores,
      nbest_scale=nbest_scale,
  )

  # The following code is modified from nbest.intersect()
  word_fsa = k2.invert(nbest.fsa)
  if hasattr(lattice, "aux_labels"):
      # delete token IDs as it is not needed
      del word_fsa.aux_labels
  word_fsa.scores.zero_()
  word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
  path_to_utt_map = nbest.shape.row_ids(1)

  if hasattr(lattice, "aux_labels"):
      # lattice has token IDs as labels and word IDs as aux_labels.
      # inv_lattice has word IDs as labels and token IDs as aux_labels
      inv_lattice = k2.invert(lattice)
      inv_lattice = k2.arc_sort(inv_lattice)
  else:
      inv_lattice = k2.arc_sort(lattice)

  if inv_lattice.shape[0] == 1:
      path_lattice = k2.intersect_device(
          inv_lattice,
          word_fsa_with_epsilon_loops,
          b_to_a_map=torch.zeros_like(path_to_utt_map),
          sorted_match_a=True,
      )
  else:
      path_lattice = k2.intersect_device(
          inv_lattice,
          word_fsa_with_epsilon_loops,
          b_to_a_map=path_to_utt_map,
          sorted_match_a=True,
      )

  # path_lattice has word IDs as labels and token IDs as aux_labels
  path_lattice = k2.top_sort(k2.connect(path_lattice))
  tot_scores = path_lattice.get_tot_scores(
      use_double_scores=use_double_scores,
      log_semiring=True,  # Note: we always use True
  )

rouseabout · 2023-05-30T09:55:41Z

Thanks for looking into this.

Quick note setup.py disables building the C++ tests. I suggest changing this.

extra_cmake_args += " -DK2_ENABLE_TESTS=OFF "

After rebuilding, I can see 2 C++ tests are failing. All python tests are passing.

user@0411f528f430:~/k2/build/temp.linux-x86_64-3.8$ ctest
Test project /home/user/k2/build/temp.linux-x86_64-3.8
        Start   1: Test.Cuda.cu_algorithms_test
  1/111 Test   #1: Test.Cuda.cu_algorithms_test .................   Passed    1.62 sec
[...]
110/111 Test #110: Test.Cuda.cu_k2_torch_wave_reader_test .......   Passed    0.56 sec
        Start 111: Test.torch_api_test
111/111 Test #111: Test.torch_api_test ..........................   Passed    0.57 sec

98% tests passed, 2 tests failed out of 111

Total Test time (real) = 349.05 sec

The following tests FAILED:
         10 - Test.Cuda.cu_hash_test (Failed)
        109 - Test.Cuda.cu_k2_torch_parse_options_test (Failed)
Errors while running CTest
Output from these tests are in: /home/user/k2/build/temp.linux-x86_64-3.8/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

user@0411f528f430:~/k2$ pytest k2/python/tests
============================================= test session starts =============================================
platform linux -- Python 3.8.10, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/user/k2
plugins: typeguard-4.0.0, xdist-3.2.1, shard-0.1.2, hypothesis-5.35.1, xdoctest-1.0.2, rerunfailures-11.1.2
collected 233 items                                                                                           
Running 233 items in this shard
[...]
============================================ 233 passed in 45.95s =============================================

Hardware: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Driver: Driver Version: 530.41.03 CUDA Version: 12.1

Stack trace from ./pruned_transducer_stateless7/decode.py:

[ Stack-Trace: ]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2_log.so(k2::internal::GetStackTrace[abi:cxx11]()+0x58) [0x7f94b9486538]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::internal::Logger::~Logger()+0x5a) [0x7f94b9b3ac3a]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSorter::TopSort(k2::Array1<int>*)+0x46a) [0x7f94b9f286da]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSort(k2::Ragged<k2::Arc>&, k2::Ragged<k2::Arc>*, k2::Array1<int>*)+0x14b) [0x7f94b9f19f2b]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x8d504) [0x7f94bf9b8504]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x42737) [0x7f94bf96d737]
python3(PyCFunction_Call+0x59) [0x5f6489]
python3(_PyObject_MakeTpCall+0x296) [0x5f7056]
python3(_PyEval_EvalFrameDefault+0x62d2) [0x5715a2]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(_PyEval_EvalFrameDefault+0x57f2) [0x570ac2]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(PyObject_Call+0x62) [0x5f5c02]
python3(_PyEval_EvalFrameDefault+0x1f2c) [0x56d1fc]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x72d) [0x56b9fd]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(PyEval_EvalCode+0x27) [0x68e7b7]
python3() [0x680001]
python3() [0x68007f]
python3() [0x680121]
python3(PyRun_SimpleFileExFlags+0x197) [0x680db7]
python3(Py_RunMain+0x212) [0x6b8122]
python3(Py_BytesMain+0x2d) [0x6b84ad]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f958ec7d083]
python3(_start+0x2e) [0x5fb39e]

and python error message:

Traceback (most recent call last):
  File "./pruned_transducer_stateless7/decode.py", line 972, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./pruned_transducer_stateless7/decode.py", line 950, in main
    results_dict = decode_dataset(
  File "./pruned_transducer_stateless7/decode.py", line 656, in decode_dataset
    hyps_dict = decode_one_batch(
  File "./pruned_transducer_stateless7/decode.py", line 479, in decode_one_batch
    hyp_tokens = fast_beam_search_nbest_LG(
  File "/home/user/icefall/egs/atcosim/ASR/pruned_transducer_stateless7/beam_search.py", line 213, in fast_beam_search_nbest_LG
    path_lattice = k2.top_sort(k2.connect(path_lattice))
  File "/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 244, in top_sort
    ragged_arc, arc_map = _k2.top_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

danpovey · 2023-05-30T10:36:04Z

Thanks! Can you rerun the tests with the --rerun-failed --output-on-failure options as it mentions? It might be CTest, not sure which directory it would have been in.

rouseabout · 2023-05-31T01:38:36Z

ctest --rerun-failed --output-on-failure 2>&1 | tee /tmp/log.txt

https://pross.sdf.org/sandpit/log.txt (467 KiB)

I paste the code here (the top_sort is in the last 5th line), can you dump the problemic lattice, you can do it with torch.save(lattice.as_dict(), file_name.pt).

https://pross.sdf.org/sandpit/path_lattice.pt (355 MiB)

I will delete these files in a few days. Cheers.

pkufool · 2023-05-31T01:44:14Z

Thanks! I am debuging it, will post the results here once available.

pkufool · 2023-05-31T01:48:41Z

@rouseabout Could you also dump the lattice from fast_beam_search, thank you very much!

lattice = fast_beam_search(
      model=model,
      decoding_graph=decoding_graph,
      encoder_out=encoder_out,
      encoder_out_lens=encoder_out_lens,
      beam=beam,
      max_states=max_states,
      max_contexts=max_contexts,
      temperature=temperature,
  )

rouseabout · 2023-05-31T02:58:35Z

https://pross.sdf.org/sandpit/lattice.pt (6.7M)

Observation: The contents of path_lattice.pt changes each time I run decode.py (md5sum changes), whereas lattice.pt content is always the same. I expected these to be deterministic.

pkufool · 2023-05-31T03:03:21Z

https://pross.sdf.org/sandpit/lattice.pt (6.7M)

Observation: The contents of path_lattice.pt changes each time I run decode.py (md5sum changes), whereas lattice.pt content is always the same. I expected these to be deterministic.

Thank you!

~~Yes, because the nbest is random sampled from lattice, so the path_lattice may change.~~

edit: Sorry, I am wrong, the paths are not randomly sampled, see https://k2-fsa.github.io/k2/python_api/api.html#random-paths. So this might be another issue.

pkufool · 2023-06-02T04:34:08Z

@rouseabout Sorry for the slow reply, I can not reproduce the error with the lattices you provided.
The properties of path_lattice.pt and lattice.pt are:

I also tried create path_lattice from lattice, and the top_sort runs normally.

from icefall.decode import Nbest

lattice = k2.Fsa.from_dict(torch.load("/star-kw/kangwei/issues/k2_1204/lattice.pt"))
lattice = lattice.to("cuda:4")

nbest = Nbest.from_lattice(
    lattice=lattice,
    num_paths=200,
    use_double_scores=True,
    nbest_scale=0.5,
)

# The following code is modified from nbest.intersect()
word_fsa = k2.invert(nbest.fsa)
if hasattr(lattice, "aux_labels"):
    # delete token IDs as it is not needed
    del word_fsa.aux_labels
word_fsa.scores.zero_()
word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
path_to_utt_map = nbest.shape.row_ids(1)

if hasattr(lattice, "aux_labels"):
    # lattice has token IDs as labels and word IDs as aux_labels.
    # inv_lattice has word IDs as labels and token IDs as aux_labels
    inv_lattice = k2.invert(lattice)
    inv_lattice = k2.arc_sort(inv_lattice)
else:
    inv_lattice = k2.arc_sort(lattice)

if inv_lattice.shape[0] == 1:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=torch.zeros_like(path_to_utt_map),
        sorted_match_a=True,
    )
else:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=path_to_utt_map,
        sorted_match_a=True,
    )

Could you check that the lattice you have dumpped is the problemic one, thank you very much!

rouseabout · 2023-06-02T07:39:56Z

@pkufool Really appreciate you looking intro this. It is not urgent.

I can confirm the lattice.pt and path_lattice.pt were output from /pruned_transducer_stateless7/decode.py --decoding-method fast_beam_search_nbest_LG and it crashed at top_sort.cu:324.

When I run you notebook lines, I observe the same shape and properties_str output.

When I run your code, changing cuda:4 to cuda:0, it runs normally, no crash...

HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

path_lattice = k2.top_sort(k2.connect(path_lattice))

After adding this line to your code, it crashes at top_sort.cu:324.

What GPU are you testing on?

pkufool · 2023-06-02T07:52:57Z

HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

See cell 14.

What GPU are you testing on?

I tested it on a nvidia V100. (pytorch version 1.8.1, cuda version 10.2).

pkufool · 2023-06-02T07:58:14Z

Before invokeing top_sort, path_lattice already has a properties TopSortedAndAcyclic, so I think TopSort algorithm will not do anything, the crash is odd. So, what's your k2, pytorch and cuda version?

rouseabout · 2023-06-02T08:24:49Z

Opps, I missed cell 14 :(

k2 version: 1.24.3
Build type: Release
Git SHA1: 1a76309e5c6343c4d18965b7ce134a7f311d9d3a
Git date: Sun May 28 06:04:03 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so

I am using this docker image (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-04.html). I will try an older image.

user@2dfad2bc2655:~$ pip list | grep ^torch
torch                   2.1.0a0+fe05266
torch-tensorrt          1.4.0.dev0
torchaudio              2.1.0a0+6425d46                          /home/user/audio
torchtext               0.13.0a0+fae8e8c
torchvision             0.15.0a0

user@2dfad2bc2655:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

rouseabout · 2023-06-05T09:45:55Z

Results:

8GB Tesla P4:

Container	PyTorch	CUDA	Status
nvcr.io/nvidia/pytorch:22.05-py3	1.12.0a0+8a1a93a	11.7.0	WORKING
nvcr.io/nvidia/pytorch:22.12-py3	1.14.0a0+410ce96	11.8.0	WORKING
nvcr.io/nvidia/pytorch:23.02-py3	1.14.0a0+44dac51	12.0.1	CRASH
nvcr.io/nvidia/pytorch:23.04-py3	2.1.0a0+fe05266f	12.1.0	CRASH

16GB Tesla T4:

Container	PyTorch	CUDA	Status
nvcr.io/nvidia/pytorch:22.12-py3	1.14.0a0+410ce96	11.8.0	WORKING
nvcr.io/nvidia/pytorch:23.02-py3	1.14.0a0+44dac51	12.0.1	CRASH

Software/hardware configurations were otherwise identical. While its only a few data points, one might conclude k2 + CUDA 12.x has problems.

rouseabout · 2023-06-05T21:22:12Z

8GB Tesla P4:

Container	PyTorch	CUDA	Status
nvcr.io/nvidia/pytorch:23.05-py3	2.0.0	12.1.1	CRASH

pkufool · 2023-06-06T11:28:51Z

Thanks! we will debug it on cuda 12.x

divyeshrajpura4114 mentioned this issue Apr 12, 2024

Decoding Issue: fast beam search nbest LG k2-fsa/icefall#1591

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

k2::TopSorter::TopSort assertion, but only when using GPU #1204

k2::TopSorter::TopSort assertion, but only when using GPU #1204

rouseabout commented May 28, 2023 •

edited

Loading

csukuangfj commented May 30, 2023

rouseabout commented May 30, 2023

danpovey commented May 30, 2023

danpovey commented May 30, 2023 •

edited

Loading

pkufool commented May 30, 2023

danpovey commented May 30, 2023

pkufool commented May 30, 2023 •

edited

Loading

rouseabout commented May 30, 2023

danpovey commented May 30, 2023

rouseabout commented May 31, 2023

pkufool commented May 31, 2023

pkufool commented May 31, 2023 •

edited

Loading

rouseabout commented May 31, 2023

pkufool commented May 31, 2023 •

edited

Loading

pkufool commented Jun 2, 2023

rouseabout commented Jun 2, 2023

pkufool commented Jun 2, 2023

pkufool commented Jun 2, 2023

rouseabout commented Jun 2, 2023

rouseabout commented Jun 5, 2023

rouseabout commented Jun 5, 2023

pkufool commented Jun 6, 2023

k2::TopSorter::TopSort assertion, but only when using GPU #1204

k2::TopSorter::TopSort assertion, but only when using GPU #1204

Comments

rouseabout commented May 28, 2023 • edited Loading

csukuangfj commented May 30, 2023

rouseabout commented May 30, 2023

danpovey commented May 30, 2023

danpovey commented May 30, 2023 • edited Loading

pkufool commented May 30, 2023

danpovey commented May 30, 2023

pkufool commented May 30, 2023 • edited Loading

rouseabout commented May 30, 2023

danpovey commented May 30, 2023

rouseabout commented May 31, 2023

pkufool commented May 31, 2023

pkufool commented May 31, 2023 • edited Loading

rouseabout commented May 31, 2023

pkufool commented May 31, 2023 • edited Loading

pkufool commented Jun 2, 2023

rouseabout commented Jun 2, 2023

pkufool commented Jun 2, 2023

pkufool commented Jun 2, 2023

rouseabout commented Jun 2, 2023

rouseabout commented Jun 5, 2023

rouseabout commented Jun 5, 2023

pkufool commented Jun 6, 2023

rouseabout commented May 28, 2023 •

edited

Loading

danpovey commented May 30, 2023 •

edited

Loading

pkufool commented May 30, 2023 •

edited

Loading

pkufool commented May 31, 2023 •

edited

Loading

pkufool commented May 31, 2023 •

edited

Loading