k2::TopSorter::TopSort assertion, but only when using GPU #1204

Open
rouseabout opened this issue May 28, 2023 · 22 comments

@rouseabout commented May 28, 2023

Using the icefall/egs/librispeech/ASR/pruned_transducer_stateless7 recipe, training a model on only train-clean-5 and dev-clean-2, and then running pruned_transducer_stateless7/decode.py on the GPU with --decoding-method fast_beam_search_nbest_LG produces the following error.

[F] /home/user/k2/k2/csrc/top_sort.cu:324:k2::FsaVec k2::TopSorter::TopSort(k2::Array1<int>*) Check failed: start_state_present[0] == 1 (0 vs. 1) Our current implementation requires that the start state in each Fsa must be present in the first batch

However, when pruned_transducer_stateless7/decode.py is forced to use the CPU, fast_beam_search_nbest_LG runs successfully.

Any suggestions as to what I might be doing wrong?

Image: nvcr.io/nvidia/pytorch:23.04-py3

k2.version:

k2 version: 1.24.3
Build type: Release
Git SHA1: fdb76bf4b3d9f28699eaf854b6b54e015b6b8a62
Git date: Wed May 24 23:51:07 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so 
@csukuangfj (Collaborator)

@rouseabout

Are you using the latest icefall and have you made any changes to the code?
Also, how did you generate LG.pt?

@rouseabout (Author)

Icefall: Latest. k2-fsa/icefall@1aeffa7

Code changes: Minimal to make it download and run on my limited hardware. Diff: rouseabout/icefall@9e23b38

  • prepare.sh: download librispeech mini instead of full
  • prepare.sh: use only dev-clean-2 and train-clean-5 dataset parts
  • prepare.sh: comment out download musan, prepare musan and compute fbank musan
  • prepare.sh: comment out building G_4_gram.fst.txt (it is not used by ./local/compile_lg.py)
  • pruned_transducer_stateless7/asr_datamodule.py: default --mini-libri to True and default --enable-musan to False
  • pruned_transducer_stateless7/decode.py: test against dev_clean_2_cuts() only

./prepare.sh was used to populate the ./data folder and build ./data/lang_bpe_500/LG.pt. The LG.pt model appears to work on the CPU.

-rw-r--r-- 1 user user 1226762 May 26 11:07 data/lang_bpe_500/LG.pt
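
For reference, a minimal sketch (it only roughly mirrors what decode.py does for fast_beam_search_nbest_LG; the device string is just an example) showing that this LG.pt loads and moves to the GPU:

    import torch
    import k2

    # Load the LG built by ./local/compile_lg.py and move it to the GPU;
    # this roughly mirrors what decode.py does before fast_beam_search_nbest_LG.
    LG = k2.Fsa.from_dict(torch.load("data/lang_bpe_500/LG.pt", map_location="cpu"))
    LG = LG.to("cuda:0")
    print(LG.shape, LG.properties_str)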

@danpovey (Collaborator)

Can you make sure you have run the tests that are available in k2? Sorry, I don't recall the details of how.
It could be a bug in k2 that only shows up on your specific GPU or CPU hardware.

@danpovey (Collaborator) commented May 30, 2023

Guys, especially @pkufool: I noticed an issue in top_sort.cu.
The comment for GetInitialBatch() says:

  /*
    Return the ragged array containing the states active on the 1st iteration of
    the algorithm.  These just correspond to the start-states of all
    the FSAs, and also the final-states for all FSAs in which final-states
    had in-degree zero (no arcs entering them).

    Note: in the originally published algorithm we start with all states
    that have in-degree zero, but in the context of this toolkit there
    is (I believe) no use in states that aren't accessible from the start
    state, so we remove them.
   */

but the actual code does not do this; it just gets the states of in-degree 0, as in the originally
published algorithm. This is probably OK (although I think we should change the comment
to just say:

  /*
    Return the ragged array containing the states active on the 1st iteration of
    the algorithm.  These just correspond to all states
    that have in-degree zero.
  */

Notice that if the start-state has in-degree >0 (this is after removing self-loops), the start-state
will not be included in the 1st batch. This is consistent with the documentation of TopSort.
We must be careful to ensure that the input is acyclic. Can someone ensure that fast_beam_search_nbest_LG
would always give acyclic input? (Note: the 1st/start state must have an arc coming into it for this assertion
to come up, I think).
To avoid cycles, it may be necessary to add something to the "state" that is the number of symbols we have seen
on this frame. I'm assuming right now that the "state" consists of: [state_in_LG, current_frame]; we could augment
it to [state_in_LG, current_frame, num_syms_seen_this_frame].
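
For checking this on a concrete example, a minimal sketch (it assumes a lattice dumped with torch.save(lattice.as_dict(), ...) as suggested further down in this thread; the file name is just an example):

    import torch
    import k2

    # Load a lattice previously dumped with torch.save(lattice.as_dict(), "path_lattice.pt");
    # the file name is just an example.
    fsa = k2.Fsa.from_dict(torch.load("path_lattice.pt"))

    # properties_str lists flags such as "TopSorted" and "TopSortedAndAcyclic".
    # If "TopSortedAndAcyclic" is missing after k2.connect(), the input handed to
    # top_sort may contain a cycle, violating the requirement discussed above.
    print(fsa.properties_str)
    print(k2.connect(fsa).properties_str)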

@pkufool (Collaborator) commented May 30, 2023

OK, I will have a look.

@danpovey (Collaborator)

Also, @rouseabout, can you try running it in pdb and getting a python stack trace when it fails? It would be nice to know for sure exactly when TopSort is being called.
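
One possible way to do that, as a sketch (the remaining decode.py flags used above are omitted here):

    import pdb
    import runpy
    import sys

    # Run decode.py in-process and drop into the post-mortem debugger when the
    # RuntimeError from k2 is raised; `where` then prints the Python-level stack.
    # The other decode.py flags are omitted for brevity.
    sys.argv = ["decode.py", "--decoding-method", "fast_beam_search_nbest_LG"]
    try:
        runpy.run_path("./pruned_transducer_stateless7/decode.py", run_name="__main__")
    except RuntimeError:
        pdb.post_mortem()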

@pkufool (Collaborator) commented May 30, 2023

I suspect the top_sort is in https://github.com/k2-fsa/icefall/blob/7b0afbdc16066701759e088f7edbb648a0b879f0/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L213

I paste the code here (the top_sort is on the 5th line from the end). Can you dump the problematic lattice? You can do it with torch.save(lattice.as_dict(), file_name.pt).

lattice = fast_beam_search(
    model=model,
    decoding_graph=decoding_graph,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=beam,
    max_states=max_states,
    max_contexts=max_contexts,
    temperature=temperature,
)

nbest = Nbest.from_lattice(
    lattice=lattice,
    num_paths=num_paths,
    use_double_scores=use_double_scores,
    nbest_scale=nbest_scale,
)

# The following code is modified from nbest.intersect()
word_fsa = k2.invert(nbest.fsa)
if hasattr(lattice, "aux_labels"):
    # delete token IDs as it is not needed
    del word_fsa.aux_labels
word_fsa.scores.zero_()
word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
path_to_utt_map = nbest.shape.row_ids(1)

if hasattr(lattice, "aux_labels"):
    # lattice has token IDs as labels and word IDs as aux_labels.
    # inv_lattice has word IDs as labels and token IDs as aux_labels
    inv_lattice = k2.invert(lattice)
    inv_lattice = k2.arc_sort(inv_lattice)
else:
    inv_lattice = k2.arc_sort(lattice)

if inv_lattice.shape[0] == 1:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=torch.zeros_like(path_to_utt_map),
        sorted_match_a=True,
    )
else:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=path_to_utt_map,
        sorted_match_a=True,
    )

# path_lattice has word IDs as labels and token IDs as aux_labels
path_lattice = k2.top_sort(k2.connect(path_lattice))
tot_scores = path_lattice.get_tot_scores(
    use_double_scores=use_double_scores,
    log_semiring=True,  # Note: we always use True
)
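
Concretely, a small sketch of where the dump could go (path_lattice here refers to the variable in the snippet above; the file name is just an example):

    # Just before the k2.top_sort call above, dump the input for offline inspection:
    torch.save(path_lattice.as_dict(), "path_lattice.pt")

    # Later, reload it with:
    path_lattice = k2.Fsa.from_dict(torch.load("path_lattice.pt"))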

@rouseabout (Author)

Thanks for looking into this.

A quick note: setup.py disables building the C++ tests. I suggest changing this:

extra_cmake_args += " -DK2_ENABLE_TESTS=OFF "

After rebuilding, I can see that 2 C++ tests are failing. All Python tests pass.

user@0411f528f430:~/k2/build/temp.linux-x86_64-3.8$ ctest
Test project /home/user/k2/build/temp.linux-x86_64-3.8
        Start   1: Test.Cuda.cu_algorithms_test
  1/111 Test   #1: Test.Cuda.cu_algorithms_test .................   Passed    1.62 sec
[...]
110/111 Test #110: Test.Cuda.cu_k2_torch_wave_reader_test .......   Passed    0.56 sec
        Start 111: Test.torch_api_test
111/111 Test #111: Test.torch_api_test ..........................   Passed    0.57 sec

98% tests passed, 2 tests failed out of 111

Total Test time (real) = 349.05 sec

The following tests FAILED:
         10 - Test.Cuda.cu_hash_test (Failed)
        109 - Test.Cuda.cu_k2_torch_parse_options_test (Failed)
Errors while running CTest
Output from these tests are in: /home/user/k2/build/temp.linux-x86_64-3.8/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
user@0411f528f430:~/k2$ pytest k2/python/tests
============================================= test session starts =============================================
platform linux -- Python 3.8.10, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/user/k2
plugins: typeguard-4.0.0, xdist-3.2.1, shard-0.1.2, hypothesis-5.35.1, xdoctest-1.0.2, rerunfailures-11.1.2
collected 233 items                                                                                           
Running 233 items in this shard
[...]
============================================ 233 passed in 45.95s =============================================

Hardware: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Driver: Driver Version: 530.41.03 CUDA Version: 12.1

Stack trace from ./pruned_transducer_stateless7/decode.py:

[ Stack-Trace: ]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2_log.so(k2::internal::GetStackTrace[abi:cxx11]()+0x58) [0x7f94b9486538]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::internal::Logger::~Logger()+0x5a) [0x7f94b9b3ac3a]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSorter::TopSort(k2::Array1<int>*)+0x46a) [0x7f94b9f286da]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSort(k2::Ragged<k2::Arc>&, k2::Ragged<k2::Arc>*, k2::Array1<int>*)+0x14b) [0x7f94b9f19f2b]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x8d504) [0x7f94bf9b8504]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x42737) [0x7f94bf96d737]
python3(PyCFunction_Call+0x59) [0x5f6489]
python3(_PyObject_MakeTpCall+0x296) [0x5f7056]
python3(_PyEval_EvalFrameDefault+0x62d2) [0x5715a2]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(_PyEval_EvalFrameDefault+0x57f2) [0x570ac2]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(PyObject_Call+0x62) [0x5f5c02]
python3(_PyEval_EvalFrameDefault+0x1f2c) [0x56d1fc]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x72d) [0x56b9fd]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(PyEval_EvalCode+0x27) [0x68e7b7]
python3() [0x680001]
python3() [0x68007f]
python3() [0x680121]
python3(PyRun_SimpleFileExFlags+0x197) [0x680db7]
python3(Py_RunMain+0x212) [0x6b8122]
python3(Py_BytesMain+0x2d) [0x6b84ad]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f958ec7d083]
python3(_start+0x2e) [0x5fb39e]

and the Python error message:

Traceback (most recent call last):
  File "./pruned_transducer_stateless7/decode.py", line 972, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./pruned_transducer_stateless7/decode.py", line 950, in main
    results_dict = decode_dataset(
  File "./pruned_transducer_stateless7/decode.py", line 656, in decode_dataset
    hyps_dict = decode_one_batch(
  File "./pruned_transducer_stateless7/decode.py", line 479, in decode_one_batch
    hyp_tokens = fast_beam_search_nbest_LG(
  File "/home/user/icefall/egs/atcosim/ASR/pruned_transducer_stateless7/beam_search.py", line 213, in fast_beam_search_nbest_LG
    path_lattice = k2.top_sort(k2.connect(path_lattice))
  File "/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 244, in top_sort
    ragged_arc, arc_map = _k2.top_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

@danpovey (Collaborator)

Thanks! Can you rerun the tests with the --rerun-failed --output-on-failure options, as it mentions? It might be ctest; I'm not sure which directory it would have been in.

@rouseabout (Author)

ctest --rerun-failed --output-on-failure 2>&1 | tee /tmp/log.txt

https://pross.sdf.org/sandpit/log.txt (467 KiB)

I paste the code here (the top_sort is on the 5th line from the end). Can you dump the problematic lattice? You can do it with torch.save(lattice.as_dict(), file_name.pt).

https://pross.sdf.org/sandpit/path_lattice.pt (355 MiB)

I will delete these files in a few days. Cheers.

@pkufool (Collaborator) commented May 31, 2023

Thanks! I am debugging it and will post the results here once available.

@pkufool (Collaborator) commented May 31, 2023

@rouseabout Could you also dump the lattice from fast_beam_search? Thank you very much!

lattice = fast_beam_search(
    model=model,
    decoding_graph=decoding_graph,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=beam,
    max_states=max_states,
    max_contexts=max_contexts,
    temperature=temperature,
)

@rouseabout (Author)

https://pross.sdf.org/sandpit/lattice.pt (6.7M)

Observation: the contents of path_lattice.pt change each time I run decode.py (the md5sum changes), whereas lattice.pt is always the same. I expected both to be deterministic.

@pkufool (Collaborator) commented May 31, 2023

https://pross.sdf.org/sandpit/lattice.pt (6.7M)

Observation: the contents of path_lattice.pt change each time I run decode.py (the md5sum changes), whereas lattice.pt is always the same. I expected both to be deterministic.

Thank you!

Yes: the nbest paths are randomly sampled from the lattice, so the path_lattice may change.

Edit: Sorry, I was wrong; the paths are not randomly sampled (see https://k2-fsa.github.io/k2/python_api/api.html#random-paths). So this might be another issue.

@pkufool (Collaborator) commented Jun 2, 2023

@rouseabout Sorry for the slow reply; I cannot reproduce the error with the lattices you provided.
The properties of path_lattice.pt and lattice.pt are:
(screenshot of the properties output omitted)

I also tried creating path_lattice from lattice, and the top_sort runs normally.

from icefall.decode import Nbest

lattice = k2.Fsa.from_dict(torch.load("/star-kw/kangwei/issues/k2_1204/lattice.pt"))
lattice = lattice.to("cuda:4")

nbest = Nbest.from_lattice(
    lattice=lattice,
    num_paths=200,
    use_double_scores=True,
    nbest_scale=0.5,
)

# The following code is modified from nbest.intersect()
word_fsa = k2.invert(nbest.fsa)
if hasattr(lattice, "aux_labels"):
    # delete token IDs as it is not needed
    del word_fsa.aux_labels
word_fsa.scores.zero_()
word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
path_to_utt_map = nbest.shape.row_ids(1)

if hasattr(lattice, "aux_labels"):
    # lattice has token IDs as labels and word IDs as aux_labels.
    # inv_lattice has word IDs as labels and token IDs as aux_labels
    inv_lattice = k2.invert(lattice)
    inv_lattice = k2.arc_sort(inv_lattice)
else:
    inv_lattice = k2.arc_sort(lattice)

if inv_lattice.shape[0] == 1:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=torch.zeros_like(path_to_utt_map),
        sorted_match_a=True,
    )
else:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=path_to_utt_map,
        sorted_match_a=True,
    )

Could you check that the lattice you dumped is the problematic one? Thank you very much!

@rouseabout (Author)

@pkufool I really appreciate you looking into this. It is not urgent.

I can confirm that lattice.pt and path_lattice.pt were output from ./pruned_transducer_stateless7/decode.py --decoding-method fast_beam_search_nbest_LG, and that it crashed at top_sort.cu:324.

When I run your notebook lines, I observe the same shape and properties_str output.

When I run your code (changing cuda:4 to cuda:0), it runs normally; no crash...

HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

path_lattice = k2.top_sort(k2.connect(path_lattice))

After adding this line to your code, it crashes at top_sort.cu:324.

What GPU are you testing on?

@pkufool (Collaborator) commented Jun 2, 2023

HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

See cell 14.

What GPU are you testing on?

I tested it on an NVIDIA V100 (PyTorch version 1.8.1, CUDA version 10.2).

@pkufool (Collaborator) commented Jun 2, 2023

Before invoking top_sort, path_lattice already has the property TopSortedAndAcyclic, so I think the TopSort algorithm will not do anything; the crash is odd. So, what are your k2, PyTorch and CUDA versions?

@rouseabout (Author)

Oops, I missed cell 14 :(

k2 version: 1.24.3
Build type: Release
Git SHA1: 1a76309e5c6343c4d18965b7ce134a7f311d9d3a
Git date: Sun May 28 06:04:03 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so

I am using this Docker image (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-04.html). I will try an older image.

user@2dfad2bc2655:~$ pip list | grep ^torch
torch                   2.1.0a0+fe05266
torch-tensorrt          1.4.0.dev0
torchaudio              2.1.0a0+6425d46                          /home/user/audio
torchtext               0.13.0a0+fae8e8c
torchvision             0.15.0a0
user@2dfad2bc2655:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

@rouseabout (Author)

Results:

8GB Tesla P4:

Container                        | PyTorch          | CUDA   | Status
nvcr.io/nvidia/pytorch:22.05-py3 | 1.12.0a0+8a1a93a | 11.7.0 | WORKING
nvcr.io/nvidia/pytorch:22.12-py3 | 1.14.0a0+410ce96 | 11.8.0 | WORKING
nvcr.io/nvidia/pytorch:23.02-py3 | 1.14.0a0+44dac51 | 12.0.1 | CRASH
nvcr.io/nvidia/pytorch:23.04-py3 | 2.1.0a0+fe05266f | 12.1.0 | CRASH

16GB Tesla T4:

Container                        | PyTorch          | CUDA   | Status
nvcr.io/nvidia/pytorch:22.12-py3 | 1.14.0a0+410ce96 | 11.8.0 | WORKING
nvcr.io/nvidia/pytorch:23.02-py3 | 1.14.0a0+44dac51 | 12.0.1 | CRASH

Software/hardware configurations were otherwise identical. While it is only a few data points, one might conclude that k2 + CUDA 12.x has problems.

@rouseabout (Author)

8GB Tesla P4:

Container                        | PyTorch | CUDA   | Status
nvcr.io/nvidia/pytorch:23.05-py3 | 2.0.0   | 12.1.1 | CRASH
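
For completeness, a small sketch of how the PyTorch/CUDA columns in the tables above can be read inside each container (the k2 build details come from the k2.version output pasted earlier):

    import torch

    # Version information distinguishing the WORKING and CRASH rows above.
    print("PyTorch:", torch.__version__)
    print("CUDA used by PyTorch:", torch.version.cuda)
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
        print("Compute capability:", torch.cuda.get_device_capability(0))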

@pkufool (Collaborator) commented Jun 6, 2023

Thanks! We will debug it on CUDA 12.x.
