About a problem installing k2 #569

Closed

shanguanma opened this issue Jan 5, 2021 · 60 comments

@shanguanma

I installed k2 on another compute server and encountered an error during installation. The install steps are as follows:

$ conda create -n k2-fsa python=3.7
$ conda activate k2-fsa
$  conda install pytorch==1.7.1 cudatoolkit=10.1 -c pytorch
$ conda install -c pytorch torchaudio

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build

$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..
$ make _k2
$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

The cmake log is as follows:

-- The CUDA compiler identification is NVIDIA 10.1.168
-- The CXX compiler identification is GNU 7.4.0
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: Ubuntu 18.04.2 LTS
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now


-- Could NOT find Valgrind (missing: Valgrind_INCLUDE_DIR Valgrind_EXECUTABLE) 
-- Downloading pybind11
-- pybind11 is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda/cudnn/lib64  
-- Found cuDNN: v7.6.0  (include: /usr/local/cuda/cudnn/include, library: /usr/local/cuda/cudnn/lib64)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Found Torch: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.1
-- Downloading cub
-- cub is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home/users/ntu/tlvu/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home/users/ntu/tlvu/k2-fsa/k2/build

Then I run make _k2; the error is as follows:

[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
[ 77%] Linking CUDA device code CMakeFiles/context.dir/cmake_device_link.o
[ 77%] Linking CUDA shared library ../../lib/libk2context.so
/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND
/usr/bin/ld: cannot find /usr/local/cuda/cudnn/lib64: File format not recognized
collect2: error: ld returned 1 exit status
k2/csrc/CMakeFiles/context.dir/build.make:525: recipe for target 'lib/libk2context.so' failed
make[3]: *** [lib/libk2context.so] Error 1
CMakeFiles/Makefile2:706: recipe for target 'k2/csrc/CMakeFiles/context.dir/all' failed
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
CMakeFiles/Makefile2:2210: recipe for target 'k2/python/csrc/CMakeFiles/_k2.dir/rule' failed
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
Makefile:727: recipe for target '_k2' failed
make: *** [_k2] Error 2
@csukuangfj
Collaborator

-D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/"

---> change to

-D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" 

@danpovey
Collaborator

danpovey commented Jan 5, 2021 via email

@csukuangfj
Collaborator

CUDNN_LIBRARY_PATH expects a .so filename, not a directory. That is why ld complains:

/usr/bin/ld: cannot find /usr/local/cuda/cudnn/lib64: File format not recognized
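
As a quick sanity check (paths taken from this thread; adjust to your system), the value passed to CUDNN_LIBRARY_PATH should be a shared-object file, not a directory:

$ file -L /usr/local/cuda/cudnn/lib64/libcudnn.so   # should report an ELF shared object
$ file -L /usr/local/cuda/cudnn/lib64/              # reports "directory", which ld cannot link against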

@csukuangfj
Collaborator

I forget the variable name

Do you mean

-DCUDA_TOOLKIT_ROOT="/usr/local/cuda"

@csukuangfj
Collaborator

For the second error:

/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND

Please use

-D CUDA_cublas_LIBRARY="/path/to/libcublas.so"

In general, you do not need to specify so many values for cmake. CMake can figure it out.
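
If you are not sure where a real libcublas.so lives on the machine, standard tools can help locate one to pass to CMake (nothing here is k2-specific):

$ ldconfig -p | grep -i libcublas                              # libraries known to the dynamic linker
$ find /usr/local/cuda* -name 'libcublas.so*' 2>/dev/null      # copies shipped with the installed CUDA toolkits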

@shanguanma
Author

For the second error:

/usr/bin/ld: cannot find -lCUDA_cublas_LIBRARY-NOTFOUND

Please use

-D CUDA_cublas_LIBRARY="/path/to/libcublas.so"

In general, you do not need to specify so many values for cmake. CMake can figure it out.

If I don't specify the cuDNN path, CMake can't find it, because cuDNN is not in the default location on the compute server cluster.

The CUDA and cuDNN paths on the compute server cluster:

$ ls /usr/local/cuda
LICENSE  README  bin  compat  cudnn  doc  extras  include  lib64  nvml	nvvm  share  src  targets  version.txt
$ ls /usr/local/cuda/cudnn/*       
/usr/local/cuda/cudnn/doc:
libcudnn7  libcudnn7-dev

/usr/local/cuda/cudnn/include:
cudnn.h

/usr/local/cuda/cudnn/lib64:
libcudnn.so  libcudnn.so.7  libcudnn.so.7.6.0  libcudnn_static.a  libcudnn_static_v7.a

I will follow your suggestion and try it.

@shanguanma
Author

The compile command is as follows:

$ cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

because /usr/local/cuda does not contain libcublas.so:

grep -rn "libcublas.so" /usr/local      
grep: /usr/local/libexec/dgx-cgroup/cgroup-classify: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-remove: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-create: Permission denied
grep: /usr/local/libexec/dgx-cgroup/cgroup-cleanup: Permission denied
grep: /usr/local/libexec/dgx-cgroup/common: Permission denied
/usr/local/cuda-9.0/doc/EULA.txt:1009:  Linux   : libcublas.so, libcublas_static.a, libcublas_device.a
/usr/local/cuda-9.0/doc/EULA.txt:1010:  Android : libcublas.so, libcublas_static.a, libcublas_device.a
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libnvgraph.so.9.0.176 matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcublas.so.9.0.333 matches
Binary file /usr/local/cuda-9.0/targets/x86_64-linux/lib/libnvblas.so.9.0.333 matches
grep: /usr/local/bin/pbs-dgx-cgroup-create: Permission denied
grep: /usr/local/bin/pbs-dgx-cleanup: Permission denied
grep: /usr/local/bin/dgx-cgroup-create: Permission denied
grep: /usr/local/bin/dgx-cgroup-remove: Permission denied
grep: /usr/local/bin/dgx-cgroup-classify: Permission denied
grep: /usr/local/bin/dgx-docker-cleanup: Permission denied
grep: /usr/local/bin/pam-sshd-attach: Permission denied
grep: /usr/local/bin/dgx-cgroup-cleanup: Permission denied
grep: /usr/local/etc/dgx-cgroup: Permission denied
grep: /usr/local/sbin/docker-log: Permission denied
grep: /usr/local/sbin/pbs-move-undelivered: Permission denied
grep: /usr/local/sbin/node-load: Permission denied
grep: /usr/local/sbin/purge-log: Permission denied
grep: /usr/local/sbin/cleanup-tmp: Permission denied
/usr/local/cuda-10.1/doc/EULA.txt:649:libcublas.so, libcublasLt.so, libcublas_static.a,
/usr/local/cuda-10.1/doc/EULA.txt:654:libcublas.so, libcublasLt.so, libcublas_static.a,
/usr/local/cuda-8.0/doc/EULA.txt:535:  Linux   : libcublas.so, libcublas_static.a, libcublas_device.a
/usr/local/cuda-8.0/doc/EULA.txt:536:  Android : libcublas.so, libcublas_static.a, libcublas_device.a
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvblas.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvgraph.so.8.0.61 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.88 matches
Binary file /usr/local/cuda-8.0/targets/x86_64-linux/lib/libnvblas.so.8.0.88 matches

When I run make _k2, the error is as follows:

[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
make[3]: *** No rule to make target '/usr/local/cuda/cudnn/lib64/libcudnn.so', needed by 'k2/csrc/CMakeFiles/context.dir/cmake_device_link.o'.  Stop.
CMakeFiles/Makefile2:706: recipe for target 'k2/csrc/CMakeFiles/context.dir/all' failed
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
CMakeFiles/Makefile2:2210: recipe for target 'k2/python/csrc/CMakeFiles/_k2.dir/rule' failed
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
Makefile:727: recipe for target '_k2' failed
make: *** [_k2] Error 2
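
One general CMake troubleshooting step here, assuming the earlier configure (which had CUDNN_LIBRARY_PATH set to a directory) left a stale cache or stale generated makefiles, would be to check what CMake recorded and reconfigure from a clean build directory:

$ grep -i cudnn CMakeCache.txt     # run inside build/: shows the cuDNN values CMake actually cached
$ cd .. && rm -rf build && mkdir build && cd build
$ cmake -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..   # plus the other -D flags used above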

@danpovey
Collaborator

danpovey commented Jan 5, 2021 via email

@shanguanma
Author

/usr/local/cuda/cudnn/lib64/libcudnn.so exists?

Yes:

ls /usr/local/cuda/cudnn/lib64/libcudnn.so
/usr/local/cuda/cudnn/lib64/libcudnn.so

@shanguanma
Author

$ ls /usr/local/cuda/cudnn/lib64/*
/usr/local/cuda/cudnn/lib64/libcudnn.so    /usr/local/cuda/cudnn/lib64/libcudnn.so.7.6.0  /usr/local/cuda/cudnn/lib64/libcudnn_static_v7.a
/usr/local/cuda/cudnn/lib64/libcudnn.so.7  /usr/local/cuda/cudnn/lib64/libcudnn_static.a
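
To rule out a permission problem or a dangling symbolic link anywhere along that path (a possibility raised later in this thread), a quick check could be:

$ ls -lL /usr/local/cuda/cudnn/lib64/libcudnn.so    # -L dereferences the link; errors out if it is dangling
$ namei -l /usr/local/cuda/cudnn/lib64/libcudnn.so  # shows owner and permissions of every path component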

@danpovey
Collaborator

danpovey commented Jan 5, 2021 via email

@csukuangfj
Collaborator

What is the output of

cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

You only posted the compilation log, without the configuration log.

@shanguanma
Author

What is the output of

cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

You only posted the compilation log, without the configuration log.

Yes, it is as follows:

$  cmake -D CMAKE_CUDA_COMPILER="/usr/local/cuda/bin/nvcc " -D CMAKE_CXX_COMPILER="/usr/bin/g++" -D CUDNN_LIBRARY_PATH="/usr/local/cuda/cudnn/lib64/libcudnn.so" -D CUDA_cublas_LIBRARY="/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcublas.so" -D CUDNN_INCLUDE_PATH="/usr/local/cuda/cudnn/include" -DCMAKE_BUILD_TYPE=Release ..

-- The CUDA compiler identification is NVIDIA 10.1.168
-- The CXX compiler identification is GNU 7.4.0
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: Ubuntu 18.04.2 LTS
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now


-- Could NOT find Valgrind (missing: Valgrind_INCLUDE_DIR Valgrind_EXECUTABLE) 
-- Downloading pybind11
-- pybind11 is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda/cudnn/lib64/libcudnn.so  
-- Found cuDNN: v7.6.0  (include: /usr/local/cuda/cudnn/include, library: /usr/local/cuda/cudnn/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Found Torch: /home/users/ntu/tlvu/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.1
-- Downloading cub
-- cub is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home/users/ntu/tlvu/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home/users/ntu/tlvu/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home/users/ntu/tlvu/k2-fsa/k2/build

@shanguanma
Author

Do ls -l, may be permission or dangling soft link problem

Yes, maybe there is a problem there. Currently, as far as I know, k2 only supports CUDA 10.1 and 10.2. Can k2 support more CUDA versions, e.g. CUDA 10.0, CUDA 9.2, etc.? I don't know if there is such a plan.

@csukuangfj
Collaborator

We only check that k2 is compiled with the same CUDA version that PyTorch is using.

You can try k2 with cuda 10.0 or 9.2. It may work but I think it has not been tested.
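
To see which CUDA version the installed PyTorch was built against (so that k2 can be configured with the same toolkit), a one-liner like this works with any recent PyTorch:

$ python3 -c "import torch; print(torch.__version__, torch.version.cuda)"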

@shanguanma
Author

We only check that k2 is compiled with the same CUDA version that PyTorch is using.

You can try k2 with cuda 10.0 or 9.2. It may work but I think it has not been tested.

Previously I tried that, but it failed. Anyway, the server shut down just now; once it is working again, I will try again with the newest master branch.

@shanguanma
Author

I tried to install k2 with CUDA 10.0, because with CUDA 10.0 the maximum supported PyTorch version is 1.4.0, so I used the commands below to install k2 step by step:

$ conda create -n k2-fsa python=3.8
$ conda activate k2-fsa
$ conda install pytorch==1.4.0 cudatoolkit=10.0 -c pytorch
$ conda install -c pytorch torchaudio

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build

$ cmake -DCMAKE_BUILD_TYPE=Release ..

It doesn't error; its log is as follows:

-- The CUDA compiler identification is NVIDIA 10.0.130
-- The CXX compiler identification is GNU 7.5.0
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: CentOS Linux release 7.1.1503 (Core) 
-- Found Git: /usr/bin/git (found version "1.8.3.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now


-- Found Valgrind: /usr/bin  
-- Found Valgrind: /usr/bin/valgrind
-- To check memory, run `ctest -R <NAME> -D ExperimentalMemCheck`
-- Downloading pybind11
-- pybind11 is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home4/md510/anaconda3/envs/k2-fsa/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home4/md510/anaconda3/envs/k2-fsa/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home4/md510/anaconda3/envs/k2-fsa/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
CMake Warning (dev) at /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /cm/shared/apps/cuda10.0/toolkit/10.0.130

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
  cmake/torch.cmake:11 (find_package)
  CMakeLists.txt:134 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /cm/shared/apps/cuda10.0/toolkit/10.0.130 (found version "10.0") 
-- Caffe2: CUDA detected: 10.0
-- Caffe2: CUDA nvcc is: /cm/shared/apps/cuda10.0/toolkit/10.0.130/bin/nvcc
-- Caffe2: CUDA toolkit directory: /cm/shared/apps/cuda10.0/toolkit/10.0.130
-- Caffe2: Header version is: 10.0
-- Found CUDNN: /cm/shared/apps/cudnn-7.6/cuda/lib64/libcudnn.so  
-- Found cuDNN: v7.6.0  (include: /cm/shared/apps/cudnn-7.6/cuda/include, library: /cm/shared/apps/cudnn-7.6/cuda/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  6.0 6.0 6.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_60,code=sm_60
-- Found torch: /home4/md510/anaconda3/envs/k2-fsa/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.4.0
-- PyTorch cuda version: 10.0
-- Downloading cub
-- cub is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.5.0
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home4/md510/w2020/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home4/md510/w2020/k2-fsa/k2/build

Then I run make _k2; the error is as follows:

[ 61%] Building CUDA object k2/csrc/CMakeFiles/context.dir/ragged_utils.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h: In function ‘void k2::CheckLayerEqual(int32_t, int32_t, k2::RaggedShape**)’:
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:165:39: warning: ‘row_ids_dim’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (cur_level_ <= level_) printf("%d", i);
                                 ~~~~~~^~~~~~~~ 
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_utils.cu:33:25: note: ‘row_ids_dim’ was declared here
   int32_t row_splits_dim, row_ids_dim;
                         ^~~~~~~~~~~
/home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:165:39: warning: ‘row_splits_dim’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (cur_level_ <= level_) printf("%d", i);
                                 ~~~~~~^~~~~~~~ 
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_utils.cu:33:9: note: ‘row_splits_dim’ was declared here
   int32_t row_splits_dim, row_ids_dim;
         ^~~~~~~~~~~~~~
[ 64%] Building CUDA object k2/csrc/CMakeFiles/context.dir/rm_epsilon.cu.o
[ 64%] Building CUDA object k2/csrc/CMakeFiles/context.dir/tensor.cu.o
[ 67%] Building CUDA object k2/csrc/CMakeFiles/context.dir/tensor_ops.cu.o
[ 67%] Building CUDA object k2/csrc/CMakeFiles/context.dir/thread_pool.cu.o
[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/timer.cu.o
[ 70%] Building CUDA object k2/csrc/CMakeFiles/context.dir/utils.cu.o
[ 74%] Building CUDA object k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/pytorch_context.cu(196): error: class "c10::Storage" has no member "nbytes"

1 error detected in the compilation of "/tmp/tmpxft_0000524d_00000000-11_pytorch_context.compute_75.cpp1.ii".
make[3]: *** [k2/csrc/CMakeFiles/context.dir/pytorch_context.cu.o] Error 1
make[2]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
make[1]: *** [k2/python/csrc/CMakeFiles/_k2.dir/rule] Error 2
make: *** [_k2] Error 2

@csukuangfj
Collaborator

I tried to install k2 with CUDA 10.0, because with CUDA 10.0 the maximum supported PyTorch version is 1.4.0

I would recommend using CUDA 9.2 as there are lots of different PyTorch versions for it.

Only PyTorch 1.6.0 and 1.7.0 have been tested and are known to work.
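
For reference, the combination used at the top of this thread (PyTorch 1.7.1 built against CUDA 10.1) can be installed with:

$ conda install pytorch==1.7.1 cudatoolkit=10.1 -c pytorch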

@shanguanma
Author

I tried to install k2 with CUDA 10.0, because with CUDA 10.0 the maximum supported PyTorch version is 1.4.0

I would recommend using CUDA 9.2 as there are lots of different PyTorch versions for it.

Only PyTorch 1.6.0 and 1.7.0 have been tested and are known to work.

Sorry, currently I don't have a compute server with CUDA 9.2, so I can't test it right now.

@danpovey
Collaborator

danpovey commented Jan 5, 2021 via email

@shanguanma
Author

@danpovey OK, I see. Thanks for your reply.

@shanguanma
Author

shanguanma commented Jan 12, 2021

@danpovey @csukuangfj Today (2021-01-12), the compute server was updated to CUDA 10.2 and cuDNN 7.6.5, so I compiled the latest k2 master branch. I summarize the install details as follows:

$ conda create -n k2-fsa1 python=3.7
$ conda activate k2-fsa1
$ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..

-- The CUDA compiler identification is NVIDIA 10.2.89
-- The CXX compiler identification is GNU 7.5.0
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc
-- Check for working CUDA compiler: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++
-- Check for working CXX compiler: /home4/md510/gcc-7.5.0/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- K2_OS: CentOS Linux release 7.8.2003 (Core)
-- Found Git: /usr/bin/git (found version "1.8.3.1") 
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Performing Test K2_COMPILER_SUPPORTS_CXX14
-- Performing Test K2_COMPILER_SUPPORTS_CXX14 - Success
-- C++ Standard version: 14
CMake Warning at CMakeLists.txt:112 (message):
  arch 62/72 are not supported for now


-- Found Valgrind: /usr/bin  
-- Found Valgrind: /usr/bin/valgrind
-- To check memory, run `ctest -R <NAME> -D ExperimentalMemCheck`
-- Downloading pybind11
-- pybind11 is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/pybind11-src
-- pybind11 v2.6.0 
-- Found PythonInterp: /home4/md510/anaconda3/envs/k2-fsa1/bin/python (found version "3.7.9") 
-- Found PythonLibs: /home4/md510/anaconda3/envs/k2-fsa1/lib/libpython3.7m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Python executable: /home4/md510/anaconda3/envs/k2-fsa1/bin/python
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
CMake Warning (dev) at /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /cm/shared/apps/cuda10.2/toolkit/10.2.89

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
  cmake/torch.cmake:11 (find_package)
  CMakeLists.txt:134 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /cm/shared/apps/cuda10.2/toolkit/10.2.89 (found version "10.2") 
-- Caffe2: CUDA detected: 10.2
-- Caffe2: CUDA nvcc is: /cm/shared/apps/cuda10.2/toolkit/10.2.89/bin/nvcc
-- Caffe2: CUDA toolkit directory: /cm/shared/apps/cuda10.2/toolkit/10.2.89
-- Caffe2: Header version is: 10.2
-- Found CUDNN: /cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64/libcudnn.so  
-- Found cuDNN: v7.6.5  (include: /cm/shared/apps/cuda10.2/toolkit/10.2.89/include, library: /cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: /home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/lib/libtorch.so  
-- PyTorch version: 1.7.1
-- PyTorch cuda version: 10.2
-- Downloading cub
-- cub is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/cub-src
-- Downloading moderngpu
-- moderngpu is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/moderngpu-src
-- Downloading googletest
-- googletest is downloaded to /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-src
-- googletest's binary dir is /home4/md510/w2020/k2-fsa/k2/build/_deps/googletest-build
-- The C compiler identification is GNU 7.5.0
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc
-- Check for working C compiler: /home4/md510/gcc-7.5.0/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Generated /home4/md510/w2020/k2-fsa/k2/build/k2/csrc/version.h
-- Configuring done
-- Generating done
-- Build files have been written to: /home4/md510/w2020/k2-fsa/k2/build

$ make _k2 ## no error
$ python3 -m pip install --no-deps --force-reinstall graphviz ## no error
$ make -j ## no error
$ ctest --parallel 5 ## no error
$ make test ## no error
$ pip3 install wheel twine
$ ./scripts/build_pip.sh
$ python3 -m pip install --no-deps --force-reinstall dist/k2-*.whl
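
As an extra sanity check (not part of the steps above), you can confirm that the freshly built wheel is the copy Python will actually import:

$ python3 -m pip show k2                        # version and install location of the wheel
$ python3 -c "import k2; print(k2.__file__)"    # the module Python resolves at import time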

Next, install lhotse:

$ pip install --force-reinstall git+https://github.com/lhotse-speech/lhotse

Next, install snowfall:

$ git clone https://github.com/k2-fsa/snowfall.git
$ cd snowfall
$ vim ../readme.txt 

#k2
kaldialign
#lhotse@git+https://github.com/lhotse-speech/lhotse
tensorboard
#torch>=1.6.0
#torchaudio

$ python3 -m pip install -e .

Run the LibriSpeech recipe:
$ ./run.sh --stage 1 --stop_stage 5 ## no error

$ ./run.sh --stage 6

Its error is as follows:

2021-01-12 17:42:56,883 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 /home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:lambda [](int)->void::operator()(int)->void:722 block:[0,0,0], thread: [37,0,0] block:[0,0,0], thread: [38,0,0] block:[0,0,0], thread: [39,0,0] block:[0,0,0], thread: [40,0,0] block:[0,0,0], thread: [41,0,0] block:[0,0,0], thread: [42,0,0] block:[0,0,0], thread: [43,0,0] block:[0,0,0], thread: [44,0,0] block:[0,0,0], thread: [45,0,0] block:[0,0,0], thread: [46,0,0] block:[0,0,0], thread: [47,0,0] block:[0,0,0], thread: [49,0,0] block:[0,0,0], thread: [50,0,0] block:[0,0,0], thread: [51,0,0] block:[0,0,0], thread: [52,0,0] block:[0,0,0], thread: [56,0,0] block:[0,0,0], thread: [57,0,0] Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - 
tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0                 
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [37,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [38,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [39,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [40,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [41,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [42,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [43,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [44,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [45,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [46,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [47,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [49,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [50,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [51,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [52,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [56,0,0] Assertion `Some bad things happened` failed.
/home4/md510/w2020/k2-fsa/k2/k2/csrc/intersect_dense.cu:722: lambda [](int)->void::operator()(int)->void: block: [0,0,0], thread: [57,0,0] Assertion `Some bad things happened` failed.
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (710 vs. 0)  Error: device-side assert triggered. 


[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x2aaccdcc1904]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x28) [0x2aaccaaf4108]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Array1<int>::operator[](int) const+0x1929) [0x2aaccaaf5d89]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Renumbering::ComputeOld2New()+0x13a) [0x2aaccaaf160a]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x5e0) [0x2aaccaaf2640]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*)+0x13dc) [0x2aaccabf44bc]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x364) [0x2aaccabe6ef4]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x51f23) [0x2aacc742df23]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x1a3a3) [0x2aacc73f63a3]
python3(_PyMethodDef_RawFastCallKeywords+0x316) [0x5555556b99b6]
python3(_PyCFunction_FastCallKeywords+0x21) [0x5555556b9a31]
python3(_PyEval_EvalFrameDefault+0x53e3) [0x555555726483]
python3(_PyFunction_FastCallDict+0x10b) [0x55555566985b]
/home4/md510/anaconda3/envs/k2-fsa1/lib/python3.7/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaab378fa6d]
python3(_PyMethodDef_RawFastCallKeywords+0x1e4) [0x5555556b9884]
python3(_PyCFunction_FastCallKeywords+0x21) [0x5555556b9a31]
python3(_PyEval_EvalFrameDefault+0x4e1d) [0x555555725ebd]
python3(_PyFunction_FastCallKeywords+0xfb) [0x5555556b8e7b]
python3(_PyEval_EvalFrameDefault+0x4a89) [0x555555725b29]
python3(_PyEval_EvalCodeWithName+0xc30) [0x555555669160]
python3(_PyFunction_FastCallKeywords+0x387) [0x5555556b9107]
python3(_PyEval_EvalFrameDefault+0x416) [0x5555557214b6]
python3(_PyEval_EvalCodeWithName+0x2f9) [0x555555668829]
python3(_PyFunction_FastCallKeywords+0x387) [0x5555556b9107]
python3(_PyEval_EvalFrameDefault+0x14e5) [0x555555722585]
python3(_PyFunction_FastCallKeywords+0xfb) [0x5555556b8e7b]
python3(_PyEval_EvalFrameDefault+0x416) [0x5555557214b6]
python3(_PyEval_EvalCodeWithName+0x2f9) [0x555555668829]
python3(PyEval_EvalCodeEx+0x44) [0x555555669714]
python3(PyEval_EvalCode+0x1c) [0x55555566973c]
python3(+0x22cf14) [0x555555780f14]
python3(PyRun_FileExFlags+0xa1) [0x55555578b331]
python3(PyRun_SimpleFileExFlags+0x1c3) [0x55555578b523]
python3(+0x238655) [0x55555578c655]
python3(_Py_UnixMain+0x3c) [0x55555578c77c]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
python3(+0x1dcff0) [0x555555730ff0]

Aborted

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@shanguanma
Author

Yes, I tried it again, and there is no error.

[md510@node02 k2]$ cd build/
[md510@node02 build]$ ctest 
Test project /home4/md510/w2020/k2-fsa/k2/build
      Start  1: Test.Cuda.cu_algorithms_test
 1/75 Test  #1: Test.Cuda.cu_algorithms_test .......   Passed    6.42 sec
      Start  2: Test.Cuda.cu_array_ops_test
 2/75 Test  #2: Test.Cuda.cu_array_ops_test ........   Passed    8.96 sec
      Start  3: Test.Cuda.cu_array_test
 3/75 Test  #3: Test.Cuda.cu_array_test ............   Passed    6.39 sec
      Start  4: Test.Cuda.cu_fsa_algo_test
 4/75 Test  #4: Test.Cuda.cu_fsa_algo_test .........   Passed    8.75 sec
      Start  5: Test.Cuda.cu_fsa_test
 5/75 Test  #5: Test.Cuda.cu_fsa_test ..............   Passed    6.53 sec
      Start  6: Test.Cuda.cu_fsa_utils_test
 6/75 Test  #6: Test.Cuda.cu_fsa_utils_test ........   Passed    6.84 sec
      Start  7: Test.Cuda.cu_hash_test
 7/75 Test  #7: Test.Cuda.cu_hash_test .............   Passed    6.75 sec
      Start  8: Test.Cuda.cu_host_shim_test
 8/75 Test  #8: Test.Cuda.cu_host_shim_test ........   Passed    0.19 sec
      Start  9: Test.Cuda.cu_intersect_test
 9/75 Test  #9: Test.Cuda.cu_intersect_test ........   Passed    7.11 sec
      Start 10: Test.Cuda.cu_log_test
10/75 Test #10: Test.Cuda.cu_log_test ..............   Passed    6.42 sec
      Start 11: Test.Cuda.cu_macros_test
11/75 Test #11: Test.Cuda.cu_macros_test ...........   Passed    6.32 sec
      Start 12: Test.Cuda.cu_nvtx_test
12/75 Test #12: Test.Cuda.cu_nvtx_test .............   Passed    4.21 sec
      Start 13: Test.Cuda.cu_pinned_context_test
13/75 Test #13: Test.Cuda.cu_pinned_context_test ...   Passed   40.72 sec
      Start 14: Test.Cuda.cu_ragged_shape_test
14/75 Test #14: Test.Cuda.cu_ragged_shape_test .....   Passed    6.40 sec
      Start 15: Test.Cuda.cu_ragged_test
15/75 Test #15: Test.Cuda.cu_ragged_test ...........   Passed    7.07 sec
      Start 16: Test.Cuda.cu_ragged_utils_test
16/75 Test #16: Test.Cuda.cu_ragged_utils_test .....   Passed    6.32 sec
      Start 17: Test.Cuda.cu_rm_epsilon_test
17/75 Test #17: Test.Cuda.cu_rm_epsilon_test .......   Passed    7.27 sec
      Start 18: Test.Cuda.cu_tensor_ops_test
18/75 Test #18: Test.Cuda.cu_tensor_ops_test .......   Passed    6.71 sec
      Start 19: Test.Cuda.cu_tensor_test
19/75 Test #19: Test.Cuda.cu_tensor_test ...........   Passed    0.19 sec
      Start 20: Test.Cuda.cu_thread_pool_test
20/75 Test #20: Test.Cuda.cu_thread_pool_test ......   Passed    0.28 sec
      Start 21: Test.Cuda.cu_top_sort_test
21/75 Test #21: Test.Cuda.cu_top_sort_test .........   Passed    8.10 sec
      Start 22: Test.Cuda.cu_utils_test
22/75 Test #22: Test.Cuda.cu_utils_test ............   Passed    6.78 sec
      Start 23: Test.arcsort_test
23/75 Test #23: Test.arcsort_test ..................   Passed    0.01 sec
      Start 24: Test.array_test
24/75 Test #24: Test.array_test ....................   Passed    0.01 sec
      Start 25: Test.aux_labels_test
25/75 Test #25: Test.aux_labels_test ...............   Passed    0.01 sec
      Start 26: Test.connect_test
26/75 Test #26: Test.connect_test ..................   Passed    0.01 sec
      Start 27: Test.determinize_test
27/75 Test #27: Test.determinize_test ..............   Passed    0.02 sec
      Start 28: Test.fsa_equivalent_test
28/75 Test #28: Test.fsa_equivalent_test ...........   Passed    0.01 sec
      Start 29: Test.fsa_renderer_test
29/75 Test #29: Test.fsa_renderer_test .............   Passed    0.01 sec
      Start 30: Test.fsa_test
30/75 Test #30: Test.fsa_test ......................   Passed    0.01 sec
      Start 31: Test.fsa_util_test
31/75 Test #31: Test.fsa_util_test .................   Passed    0.01 sec
      Start 32: Test.intersect_test
32/75 Test #32: Test.intersect_test ................   Passed    0.01 sec
      Start 33: Test.properties_test
33/75 Test #33: Test.properties_test ...............   Passed    0.01 sec
      Start 34: Test.rmepsilon_test
34/75 Test #34: Test.rmepsilon_test ................   Passed    0.01 sec
      Start 35: Test.topsort_test
35/75 Test #35: Test.topsort_test ..................   Passed    0.01 sec
      Start 36: Test.weights_test
36/75 Test #36: Test.weights_test ..................   Passed    0.01 sec
      Start 37: add_epsilon_self_loops_test_py
37/75 Test #37: add_epsilon_self_loops_test_py .....   Passed    1.07 sec
      Start 38: arc_sort_test_py
38/75 Test #38: arc_sort_test_py ...................   Passed    0.68 sec
      Start 39: closure_test_py
39/75 Test #39: closure_test_py ....................   Passed    7.34 sec
      Start 40: compose_test_py
40/75 Test #40: compose_test_py ....................   Passed    0.74 sec
      Start 41: connect_test_py
41/75 Test #41: connect_test_py ....................   Passed    0.79 sec
      Start 42: ctc_gradients_test_py
42/75 Test #42: ctc_gradients_test_py ..............   Passed    8.10 sec
      Start 43: dense_fsa_vec_test_py
43/75 Test #43: dense_fsa_vec_test_py ..............   Passed    6.63 sec
      Start 44: determinize_test_py
44/75 Test #44: determinize_test_py ................   Passed    0.73 sec
      Start 45: fsa_test_py
45/75 Test #45: fsa_test_py ........................   Passed    7.19 sec
      Start 46: get_tot_scores_test_py
46/75 Test #46: get_tot_scores_test_py .............   Passed    6.39 sec
      Start 47: index_add_test_py
47/75 Test #47: index_add_test_py ..................   Passed    7.25 sec
      Start 48: index_select_test_py
48/75 Test #48: index_select_test_py ...............   Passed    7.22 sec
      Start 49: index_test_py
49/75 Test #49: index_test_py ......................   Passed    7.26 sec
      Start 50: intersect_dense_pruned_test_py
50/75 Test #50: intersect_dense_pruned_test_py .....   Passed    6.69 sec
      Start 51: intersect_dense_test_py
51/75 Test #51: intersect_dense_test_py ............   Passed    6.80 sec
      Start 52: intersect_test_py
52/75 Test #52: intersect_test_py ..................   Passed    0.74 sec
      Start 53: invert_test_py
53/75 Test #53: invert_test_py .....................   Passed    0.67 sec
      Start 54: linear_fsa_test_py
54/75 Test #54: linear_fsa_test_py .................   Passed    0.66 sec
      Start 55: numerical_gradient_check_test_py
55/75 Test #55: numerical_gradient_check_test_py ...   Passed   10.05 sec
      Start 56: ragged_ops_test_py
56/75 Test #56: ragged_ops_test_py .................   Passed    0.79 sec
      Start 57: ragged_shape_test_py
57/75 Test #57: ragged_shape_test_py ...............   Passed    6.92 sec
      Start 58: ragged_test_py
58/75 Test #58: ragged_test_py .....................   Passed    0.66 sec
      Start 59: remove_epsilon_test_py
59/75 Test #59: remove_epsilon_test_py .............   Passed    0.66 sec
      Start 60: shortest_path_test_py
60/75 Test #60: shortest_path_test_py ..............   Passed    0.74 sec
      Start 61: symbol_table_test_py
61/75 Test #61: symbol_table_test_py ...............   Passed    0.73 sec
      Start 62: top_sort_test_py
62/75 Test #62: top_sort_test_py ...................   Passed    0.68 sec
      Start 63: union_test_py
63/75 Test #63: union_test_py ......................   Passed    6.74 sec
      Start 64: host_arcsort_test_py
64/75 Test #64: host_arcsort_test_py ...............   Passed    0.68 sec
      Start 65: host_array_test_py
65/75 Test #65: host_array_test_py .................   Passed    0.70 sec
      Start 66: host_aux_labels_test_py
66/75 Test #66: host_aux_labels_test_py ............   Passed    0.68 sec
      Start 67: host_connect_test_py
67/75 Test #67: host_connect_test_py ...............   Passed    0.67 sec
      Start 68: host_determinize_test_py
68/75 Test #68: host_determinize_test_py ...........   Passed    0.63 sec
      Start 69: host_fsa_equivalent_test_py
69/75 Test #69: host_fsa_equivalent_test_py ........   Passed    0.69 sec
      Start 70: host_fsa_test_py
70/75 Test #70: host_fsa_test_py ...................   Passed    0.68 sec
      Start 71: host_intersect_test_py
71/75 Test #71: host_intersect_test_py .............   Passed    0.65 sec
      Start 72: host_properties_test_py
72/75 Test #72: host_properties_test_py ............   Passed    0.65 sec
      Start 73: host_rmepsilon_test_py
73/75 Test #73: host_rmepsilon_test_py .............   Passed    0.62 sec
      Start 74: host_topsort_test_py
74/75 Test #74: host_topsort_test_py ...............   Passed    0.71 sec
      Start 75: host_weights_test_py
75/75 Test #75: host_weights_test_py ...............   Passed    0.71 sec

100% tests passed, 0 tests failed out of 75

Total Test time (real) = 278.15 sec

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@danpovey
Collaborator

Also this could result from over-aggressive compiler optimization. It is checking that -inf == -inf, probably. Sometimes comparisons involving infinity can be optimized out, e.g. if the compiler assumes that fabs(a-b) should be zero if a==b.
So touching the file and doing make again in build/, to see the compilation commands and associated flags, may be useful. And debug vs. release mode may matter.
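
To see the exact compilation command and flags for that file, one way (assuming the Makefile generator used in this thread) is to touch it and rebuild verbosely from build/:

$ touch ../k2/csrc/intersect_dense.cu   # force the file to be recompiled
$ make VERBOSE=1 _k2                    # CMake-generated makefiles print full compiler command lines with VERBOSE=1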

@shanguanma
Author

Do you mean that I should pip uninstall pytorch and torchaudio, and then reinstall k2? OK, I will do it again.
I also found that in https://github.com/k2-fsa/k2/blob/master/.github/workflows/build.yml#L25 the k2 build environment is only Ubuntu 16.04 and Ubuntu 18.04, but the OS of my compute server cluster is CentOS 7.

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@shanguanma
Author

OK, I have reinstalled k2 with the commands below, following your suggestion:

$ conda create -n k2-fsa2 python=3.8
$ conda activate k2-fsa2
$ conda install pytorch  torchaudio cudatoolkit=10.2 -c pytorch


$ git clone https://github.com/k2-fsa/k2.git
$ cd k2
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug ..
$ make
$  python3 -m pip install --no-deps --force-reinstall graphviz
$ ctest
$ cd ..
$ pip3 install wheel twine
$ ./scripts/build_pip.sh
$ python3 -m pip install --no-deps --force-reinstall dist/k2-*.whl
Install snowfall:
$ git clone https://github.com/k2-fsa/snowfall.git
$ cd  snowfall
$ python3 -m pip install -e .

The compile and install steps show no errors. When I run gdb --args python3 mmi_bigram_train.py, it gives an error, and it isn't the same as the previous error:

[md510@node02 simple_v1]$ gdb --args  python3 mmi_bigram_train.py 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home4/md510/anaconda3/envs/k2-fsa2/bin/python3.8...done.
(gdb) r
Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py
warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
[Detaching after fork from child process 46736]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug
[Detaching after fork from child process 46744]
Missing separate debuginfo for /lib64/libsndfile.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug
Missing separate debuginfo for /lib64/libgsm.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug
Missing separate debuginfo for /lib64/libFLAC.so.8
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug
Missing separate debuginfo for /lib64/libvorbisenc.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug
Missing separate debuginfo for /lib64/libvorbis.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug
Missing separate debuginfo for /lib64/libogg.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
[New Thread 0x2aab3309b700 (LWP 46745)]
2021-01-12 22:54:24,746 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-12 22:54:25,032 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-12 22:54:30,810 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-12 22:54:30,903 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-12 22:54:31,388 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-12 22:54:31,409 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-12 22:54:31,409 INFO [mmi_bigram_train.py:343] About to create dev dataloader
[New Thread 0x2aab451f3700 (LWP 46754)]
2021-01-12 22:54:31,441 INFO [mmi_bigram_train.py:350] About to create model
[New Thread 0x2aab453f4700 (LWP 46755)]
[New Thread 0x2aab455f5700 (LWP 46756)]
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-12 22:54:38,940 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[Detaching after fork from child process 46807]
[Detaching after fork from child process 46808]
[Detaching after fork from child process 46809]
[Detaching after fork from child process 46810]
[New Thread 0x2aab45a08700 (LWP 46811)]
[New Thread 0x2aab45c09700 (LWP 46812)]
[New Thread 0x2aab45e0a700 (LWP 46813)]
[New Thread 0x2aab48200700 (LWP 46814)]
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged.cu:bool k2::RaggedShape::Validate(bool) const:385 Problem validating row-ids: for layers_[0], row_splits = [ 0 1 3 5 9 13 15 17 20 22 25 27 29 34 39 41 43 48 53 58 60 63 65 68 71 73 76 79 81 84 87 89 91 100 102 109 111 113 115 117 119 122 124 126 129 131 134 136 139 141 144 146 149 151 154 156 159 161 164 166 169 172 174 179 181 184 186 189 191 193 196 198 201 204 206 211 (... I omit most of the numbers here because the list is very long ...) 077 35077 35077 35077 ], see index 96409 of row_ids, whose dim is 101526


[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x2e) [0x2aab2cf365ee]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::Validate(bool) const+0xe8a) [0x2aab2d083846]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::Check()+0x1e) [0x2aab2cfdba5e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape::RaggedShape(std::vector<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> > const&, bool)+0x57) [0x2aab2cfdba1b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int)+0x59a) [0x2aab2d08ec52]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int, k2::Array1<int>*, k2::Array1<int>*, int)+0x27a) [0x2aab2d08f86c]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::GetIncomingArcs(k2::Ragged<k2::Arc>&, k2::Array1<int> const&)+0x38b) [0x2aab2cfc7398]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float)+0x551) [0x2aab2d040b2b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x91) [0x2aab2d03b65e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb356e) [0x2aab296be56e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbc772) [0x2aab296c7772]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbb9b0) [0x2aab296c69b0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb99d5) [0x2aab296c49d5]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb9a5f) [0x2aab296c4a5f]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x48c20) [0x2aab29653c20]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0x56) [0x5555556d3f76]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x11d0) [0x555555715b90]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyVectorcall_Call+0x71) [0x555555691041]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaacd9aa98d]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0xdb) [0x5555556d3ffb]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x4596) [0x555555718f56]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10077f) [0x55555565477f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x7df) [0x5555556def9f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10011a) [0x55555565411a]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCodeEx+0x44) [0x5555556df754]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCode+0x1c) [0x55555576dedc]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x219f84) [0x55555576df84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x24c1f4) [0x5555557a01f4]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_FileExFlags+0xa1) [0x5555556686e1]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_SimpleFileExFlags+0x3b4) [0x555555668ac6]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x11598b) [0x55555566998b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(Py_BytesMain+0x39) [0x5555557a2d19]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x1dee93) [0x555555732e93]


Program received signal SIGABRT, Aborted.
0x00002aaaaaf21387 in raise () from /lib64/libc.so.6
(gdb) 
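
For context on the failing check: in k2 a RaggedShape layer is described by two arrays, row_splits and row_ids, and RaggedShape::Validate() verifies that they are mutually consistent, i.e. that row_ids[i] is exactly the row r whose range [row_splits[r], row_splits[r+1]) contains i. The log above reports that this consistency is violated at index 96409 of row_ids. A minimal sketch of the invariant in plain C++ (not k2's actual code, just the idea behind the check):

#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if row_ids is consistent with row_splits, i.e. every element i
// with row_splits[r] <= i < row_splits[r + 1] has row_ids[i] == r.
bool RowIdsMatchRowSplits(const std::vector<int32_t> &row_splits,
                          const std::vector<int32_t> &row_ids) {
  if (row_splits.empty() || row_splits.front() != 0) return false;
  if (static_cast<int32_t>(row_ids.size()) != row_splits.back()) return false;
  for (size_t i = 0; i < row_ids.size(); ++i) {
    int32_t r = row_ids[i];
    if (r < 0 || r + 1 >= static_cast<int32_t>(row_splits.size())) return false;
    if (!(row_splits[r] <= static_cast<int32_t>(i) &&
          static_cast<int32_t>(i) < row_splits[r + 1]))
      return false;
  }
  return true;
}

int main() {
  // Two rows: [a b c] [d], so row_splits = [0 3 4] and row_ids = [0 0 0 1].
  std::vector<int32_t> row_splits = {0, 3, 4};
  std::vector<int32_t> row_ids = {0, 0, 0, 1};
  assert(RowIdsMatchRowSplits(row_splits, row_ids));
  return 0;
}

On the GPU both arrays are produced by kernels, so an inconsistency like this usually points to an earlier GPU computation having written bad data, which is consistent with the illegal-memory-access error seen in the next run below.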

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@shanguanma
Author

[md510@node02 simple_v1]$ gdb --args  python3 mmi_bigram_train.py 
(gdb) r
Starting program: /home4/md510/anaconda3/envs/k2-fsa2/bin/python3 mmi_bigram_train.py
warning: Unable to open "librpm.so.3" (/home4/md510/anaconda3/lib/liblzma.so.5: version `XZ_5.1.2alpha' not found (required by /lib64/librpmio.so.3)), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libutil.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/15/86cefa927d26f144de15389f28c1cbf04c81ef.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
[Detaching after fork from child process 66884]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/3a587b4d79216ae274467480fa10f2c44ed2d0.debug
[Detaching after fork from child process 66894]
Missing separate debuginfo for /lib64/libsndfile.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/bf/637fda83ef4f46cd3e5c172031e926dac51faa.debug
Missing separate debuginfo for /lib64/libgsm.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/8c2bd826e5837d3cee7c5cee8ed85827a90d5c.debug
Missing separate debuginfo for /lib64/libFLAC.so.8
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/9584153c0799926a60973fb77de214161e7072.debug
Missing separate debuginfo for /lib64/libvorbisenc.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e5/4da1382c034ef216379710265df600eb741e6d.debug
Missing separate debuginfo for /lib64/libvorbis.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/75/48d115412cc33bf67c1598e446c70daa1b7461.debug
Missing separate debuginfo for /lib64/libogg.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/6c/77e88fb8736ffe5770b2e96ee60c8a6460d782.debug
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
[New Thread 0x2aab3309b700 (LWP 66896)]
2021-01-12 23:40:11,250 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-12 23:40:11,533 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-12 23:40:17,630 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-12 23:40:17,727 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-12 23:40:18,201 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-12 23:40:18,223 INFO [mmi_bigram_train.py:343] About to create dev dataloader
[New Thread 0x2aab451f3700 (LWP 66931)]
2021-01-12 23:40:18,276 INFO [mmi_bigram_train.py:350] About to create model
[New Thread 0x2aab453f4700 (LWP 66933)]
[New Thread 0x2aab455f5700 (LWP 66934)]
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-12 23:40:21,868 INFO [mmi_bigram_train.py:400] epoch 0, learning rate 0.001
[Detaching after fork from child process 66939]
[Detaching after fork from child process 66940]
[Detaching after fork from child process 66941]
[Detaching after fork from child process 66942]
[New Thread 0x2aab45a08700 (LWP 66943)]
[New Thread 0x2aab45c09700 (LWP 66944)]
[New Thread 0x2aab45e0a700 (LWP 66945)]
[New Thread 0x2aab48200700 (LWP 66946)]
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]:280 Check failed: ret == cudaSuccess (700 vs. 0)  Error: an illegal memory access was encountered. 


[ Stack-Trace: ]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x2e) [0x2aab2cf365ee]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1<int>::operator[](int) const+0x56c) [0x2aab2cf3ad80]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::Array1<int>::Back() const+0x130) [0x2aab2cf385a0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int)+0x27f) [0x2aab2d08e937]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int, k2::Array1<int>*, k2::Array1<int>*, int)+0x70) [0x2aab2d08f662]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::GetIncomingArcs(k2::Ragged<k2::Arc>&, k2::Array1<int> const&)+0x38b) [0x2aab2cfc7398]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersect::MultiGraphDenseIntersect(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float)+0x551) [0x2aab2d040b2b]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDense(k2::Ragged<k2::Arc>&, k2::DenseFsaVec&, float, k2::Ragged<k2::Arc>*, k2::Array1<int>*, k2::Array1<int>*)+0x91) [0x2aab2d03b65e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb356e) [0x2aab296be56e]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbc772) [0x2aab296c7772]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xbb9b0) [0x2aab296c69b0]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb99d5) [0x2aab296c49d5]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb9a5f) [0x2aab296c4a5f]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x48c20) [0x2aab29653c20]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0x56) [0x5555556d3f76]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x11d0) [0x555555715b90]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyVectorcall_Call+0x71) [0x555555691041]
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object*, _object*)+0x93d) [0x2aaacd9aa98d]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyCFunction_Call+0xdb) [0x5555556d3ffb]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55555569185f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalFrameDefault+0x4596) [0x555555718f56]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10077f) [0x55555565477f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x7df) [0x5555556def9f]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x1e3) [0x5555556df943]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x10011a) [0x55555565411a]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyFunction_Vectorcall+0x10b) [0x5555556df86b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0xfeb84) [0x555555652b84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(_PyEval_EvalCodeWithName+0x2d2) [0x5555556dea92]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCodeEx+0x44) [0x5555556df754]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyEval_EvalCode+0x1c) [0x55555576dedc]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x219f84) [0x55555576df84]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x24c1f4) [0x5555557a01f4]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_FileExFlags+0xa1) [0x5555556686e1]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(PyRun_SimpleFileExFlags+0x3b4) [0x555555668ac6]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x11598b) [0x55555566998b]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(Py_BytesMain+0x39) [0x5555557a2d19]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaaaf0d555]
/home4/md510/anaconda3/envs/k2-fsa2/bin/python3(+0x1dee93) [0x555555732e93]


Program received signal SIGABRT, Aborted.
0x00002aaaaaf21387 in raise () from /lib64/libc.so.6

(gdb) bt full 
#0  0x00002aaaaaf21387 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00002aaaaaf22a78 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00002aab2cf36630 in k2::internal::Logger::~Logger (this=0x7fffffffb340, __in_chrg=<optimized out>) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/log.h:149
        stack_trace = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
            _M_p = 0x5555c7e0dee8 "[ Stack-Trace: ]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x46) [0x2aab3048cc12]\n/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/sit"...}}
#3  0x00002aab2cf3ad80 in k2::Array1<int>::operator[] (this=0x7fffffffb680, i=64) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:280
        ans = 21845
        ret = cudaErrorIllegalAddress
        __PRETTY_FUNCTION__ = "T k2::Array1<T>::operator[](int32_t) const [with T = int; int32_t = int]"
        k2_nvtx_6 = {<No data fields>}
        data = 0x2aabaae45100
        type = k2::kCuda
#4  0x00002aab2cf385a0 in k2::Array1<int>::Back (this=0x7fffffffb680) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/array.h:289
        __PRETTY_FUNCTION__ = "T k2::Array1<T>::Back() const [with T = int]"
#5  0x00002aab2d08e937 in k2::RaggedShape2 (row_splits=0x7fffffffb680, row_ids=0x7fffffffb6a0, cached_tot_size=35078) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:112
        k2_nvtx_65 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape2(k2::Array1<int>*, k2::Array1<int>*, int32_t)"
        ctx = {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
        axes = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
            _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
              _M_start = 0x5555c69c4e38, _M_finish = 0x7fffffffb498, _M_end_of_storage = 0xffffffffffffb460}}, <No data fields>}
#6  0x00002aab2d08f662 in k2::RaggedShape3 (row_splits1=0x7fffffffb680, row_ids1=0x7fffffffb6a0, cached_tot_size1=35078, row_splits2=0x7fffffffb6c0, row_ids2=0x7fffffffb6e0, 
    cached_tot_size2=101526) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:193
        k2_nvtx_68 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::RaggedShape k2::RaggedShape3(k2::Array1<int>*, k2::Array1<int>*, int32_t, k2::Array1<int>*, k2::Array1<int>*, int32_t)"
        shape1 = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
              _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
                _M_start = 0x5555c69bd278, _M_finish = 0x7fffffffb5b8, _M_end_of_storage = 0x2aab29689143
     <__gnu_cxx::__atomic_add_dispatch(_Atomic_word*, int)+46>}}, <No data fields>}}
        temp_array = {dim_ = -962881248, byte_offset_ = 140737488337984, 
          region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fffffffb5a0, _M_refcount = {_M_pi = 0x12cf6eaa2}}, <No data fields>}}
#7  0x00002aab2cfc7398 in k2::GetIncomingArcs (fsas=..., dest_states=...) at /home4/md510/w2020/k2-fsa/k2/k2/csrc/fsa_utils.cu:837
        k2_nvtx_76 = {<No data fields>}
        __PRETTY_FUNCTION__ = "k2::Ragged<int> k2::GetIncomingArcs(k2::FsaVec&, const k2::Array1<int>&)"
        c = @0x5555c8017fa0: {<std::__shared_ptr<k2::Context, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Context, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c69b9920, _M_refcount = {_M_pi = 0x5555c69b9910}}, <No data fields>}
        dest_states_tensor = {shape = {layers_ = {<std::_Vector_base<k2::RaggedShapeLayer, std::allocator<k2::RaggedShapeLayer> >> = {
                _M_impl = {<std::allocator<k2::RaggedShapeLayer>> = {<__gnu_cxx::new_allocator<k2::RaggedShapeLayer>> = {<No data fields>}, <No data fields>}, 
                  _M_start = 0x5555c8014070, _M_finish = 0x5555c8014100, _M_end_of_storage = 0x5555c8014100}}, <No data fields>}}, values = {dim_ = 101526, byte_offset_ = 0, 
            region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c8056db0, _M_refcount = {_M_pi = 0x5555c8056da0}}, <No data fields>}}}
        num_fsas = 64
        num_states = 35078
        num_arcs = 101526
        incoming_arcs_order = {dim_ = 101526, byte_offset_ = 0, 
          region_ = {<std::__shared_ptr<k2::Region, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<k2::Region, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x5555c7fc3b10, _M_refcount = {_M_pi = 0x5555c7fc3b00}}, <No data fields>}}
        ans_row_ids2 = {dim_ = 101526, byte_offset_ = 0, 


@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@danpovey
Collaborator

danpovey commented Jan 12, 2021 via email

@csukuangfj
Collaborator

csukuangfj commented Jan 12, 2021 via email

@shanguanma
Author

Will training with CPU give the same error?

I ran python3 mmi_bigram_train.py on the CPU; there is no error. The log is as follows:

[md510@node02 simple_v1]$ python3 mmi_bigram_train.py 
/home4/md510/anaconda3/envs/k2-fsa2/lib/python3.8/site-packages/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  warnings.warn(
2021-01-13 09:46:02,825 INFO [mmi_bigram_train.py:310] Loading L.fst
2021-01-13 09:46:03,058 INFO [mmi_bigram_train.py:328] About to get train cuts
2021-01-13 09:46:07,104 INFO [mmi_bigram_train.py:330] About to get dev cuts
2021-01-13 09:46:07,176 INFO [mmi_bigram_train.py:333] About to create train dataset
2021-01-13 09:46:07,599 INFO [mmi_bigram_train.py:337] About to create dev dataset
2021-01-13 09:46:07,613 INFO [mmi_bigram_train.py:341] About to create train dataloader
2021-01-13 09:46:07,613 INFO [mmi_bigram_train.py:343] About to create dev dataloader
2021-01-13 09:46:07,697 INFO [mmi_bigram_train.py:350] About to create model
================================================================================
Model parameters summary:
================================================================================
* P_scores:                                                                 7568
* tdnn.0.weight:                                                           60000
* tdnn.0.bias:                                                               500
* tdnn.3.weight:                                                          750000
* tdnn.3.bias:                                                               500
* tdnn.6.weight:                                                          750000
* tdnn.6.bias:                                                               500
* lstms.0.weight_ih_l0:                                                  1000000
* lstms.0.weight_hh_l0:                                                  1000000
* lstms.0.bias_ih_l0:                                                       2000
* lstms.0.bias_hh_l0:                                                       2000
* lstms.1.weight_ih_l0:                                                  1000000
* lstms.1.weight_hh_l0:                                                  1000000
* lstms.1.bias_ih_l0:                                                       2000
* lstms.1.bias_hh_l0:                                                       2000
* lstms.2.weight_ih_l0:                                                  1000000
* lstms.2.weight_hh_l0:                                                  1000000
* lstms.2.bias_ih_l0:                                                       2000
* lstms.2.bias_hh_l0:                                                       2000
* lstms.3.weight_ih_l0:                                                  1000000
* lstms.3.weight_hh_l0:                                                  1000000
* lstms.3.bias_ih_l0:                                                       2000
* lstms.3.bias_hh_l0:                                                       2000
* lstms.4.weight_ih_l0:                                                  1000000
* lstms.4.weight_hh_l0:                                                  1000000
* lstms.4.bias_ih_l0:                                                       2000
* lstms.4.bias_hh_l0:                                                       2000
* linear.weight:                                                           43500
* linear.bias:                                                                87
================================================================================
Total: 11632655
================================================================================
2021-01-13 09:46:07,771 INFO [mmi_bigram_train.py:401] epoch 0, learning rate 0.001
2021-01-13 09:47:32,896 INFO [mmi_bigram_train.py:220] batch 0, epoch 0/10 global average objf: 1.989916 over 29599.0 frames (100.0% kept), current batch average objf: 1.989915 over 29599 frames (100.0% kept) avg time waiting for batch 3.367s
2021-01-13 09:58:43,705 INFO [mmi_bigram_train.py:220] batch 10, epoch 0/10 global average objf: 1.760037 over 327009.0 frames (100.0% kept), current batch average objf: 1.610216 over 29735 frames (100.0% kept) avg time waiting for batch 0.343s

@shanguanma
Author

Do the same after doing export K2_SYNC_KERNELS=1 .. wanna see if the error was the first one.

Yes, I tried it again and it produces the same error. Note: I run mmi_bigram_train.py on a single GPU.
The error is the same as in #569 (comment)
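
For anyone else following along: K2_SYNC_KERNELS=1 asks k2 to synchronize after its CUDA kernels so that an asynchronous error is reported at the kernel that caused it rather than at a later call. The sketch below only shows the general shape of such a debug flag; it is not k2's actual implementation, and the helper names are made up for illustration:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true if K2_SYNC_KERNELS is set to something other than
// "0" or the empty string.
static bool SyncKernelsEnabled() {
  const char *v = std::getenv("K2_SYNC_KERNELS");
  return v != nullptr && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}

// Hypothetical wrapper a library could call after each kernel launch: when the
// flag is on, synchronize so the error is attributed to the offending kernel.
static void MaybeSyncAndCheck(const char *where) {
  if (!SyncKernelsEnabled()) return;
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
    std::fprintf(stderr, "CUDA error after %s: %s\n", where,
                 cudaGetErrorString(err));
}

int main() {
  std::printf("K2_SYNC_KERNELS is %s\n",
              SyncKernelsEnabled() ? "enabled" : "disabled");
  MaybeSyncAndCheck("startup");  // no-op unless the flag is set
  return 0;
}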

@danpovey
Collaborator

danpovey commented Jan 13, 2021 via email

@danpovey
Collaborator

Please try running with this code:
#585
which may make the error show up earlier. Note: haven't finished running tests yet.

@shanguanma
Author

OK, I will try to run it.

@csukuangfj
Collaborator

csukuangfj commented Jan 13, 2021 via email

@csukuangfj
Collaborator

@shanguanma
Could you try this pull-request: #586

I think it should fix your problem.

@shanguanma
Author

Please try running with this code:
#585
which may make the error show up earlier. Note: haven't finished running tests yet.

I applied your code to my k2 codebase and re-installed k2, then ran python3 mmi_bigram_train.py. The error is as follows:

2021-01-13 12:44:34,119 INFO [mmi_bigram_train.py:401] epoch 0, learning rate 0.001
[F] /home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu:void k2::CheckGetTransposeReordering(k2::Ragged<int>&, k2::Array1<int>&):1171 Check failed: IsPermutation(ans) 
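
This looks like the extra check from #585 doing its job: GetTransposeReordering is expected to return a reordering of the element indices, i.e. a permutation of 0 .. num_elements-1, and the IsPermutation check catches the bad result right where it is produced instead of letting it crash later inside IntersectDense. The idea of the check, as a plain C++ sketch (not k2's actual implementation):

#include <cstdint>
#include <vector>

// Returns true if 'ans' contains each value in 0 .. ans.size()-1 exactly once.
bool IsPermutation(const std::vector<int32_t> &ans) {
  std::vector<bool> seen(ans.size(), false);
  for (int32_t v : ans) {
    if (v < 0 || v >= static_cast<int32_t>(ans.size()) || seen[v]) return false;
    seen[v] = true;
  }
  return true;
}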

@shanguanma
Could you try this pull-request: #586

I think it should fix your problem.

OK, I will try it now.

@shanguanma
Author

@shanguanma
Could you try this pull-request: #586

I think it should fix your problem.

I applied your code and re-installed k2. When I run make, I get the error below:

[ 19%] Building CUDA object k2/csrc/CMakeFiles/context.dir/ragged_ops.cu.o
/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "context" is not a type name

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "temp_storage_bytes" is not a type name

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: expected a ")"

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1223): error: expression must have class type

4 errors detected in the compilation of "/tmp/tmpxft_000251ad_00000000-11_ragged_ops.compute_75.cpp1.ii".
make[2]: *** [k2/csrc/CMakeFiles/context.dir/ragged_ops.cu.o] Error 1
make[1]: *** [k2/csrc/CMakeFiles/context.dir/all] Error 2
make: *** [all] Error 2

@csukuangfj
Collaborator

Can you check that you did git checkout the correct commit?

@shanguanma
Author

I added your #586 (comment) to my local k2 codebase and then re-installed. Is that the wrong way to do it?

@csukuangfj
Collaborator

/home4/md510/w2020/k2-fsa/k2/k2/csrc/ragged_ops.cu(1221): error: variable "context" is not a type name

What does line 1221 in your local ragged_ops.cu look like? Is it the same as the one in #586?

@danpovey
Collaborator

danpovey commented Jan 13, 2021 via email

@shanguanma
Author

Maybe he merged with my PR?

Yes, I first merged your (@danpovey) code and then merged @csukuangfj's code.
Anyway, here is my code:

 #if __CUDACC_VER_MAJOR__ > 10 ||   \
1192     (__CUDACC_VER_MAJOR__ == 10 && \
1193      (__CUDACC_VER_MINOR__ > 1 ||  \
1194       (__CUDACC_VER_MINOR__ == 1 && __CUDACC_VER_BUILD__ > 105)))
1195   // Enable it only for NVCC > 10.1.105
1196   //
1197   // Refer to https://github.com/LLNL/axom/issues/88
1198   // NVCC 10.1.105 has a known issue for cub::DeviceRadixSort
1199   int32_t num_buckets = num_cols;
1200   int32_t num_elements = src.values.Dim();
1201   int32_t log_buckets = static_cast<int32_t>(ceilf(log2f(num_buckets)));
1202 
1203   //Array1<int32_t> ans = Range(context, num_elements, 0);
1204   Array1<int32_t> order = Range(context, num_elements, 0);
1205   Array1<int32_t> src_tmp_out(context, num_elements);
1206   Array1<int32_t> ans(context, num_elements);
1207 
1208   cudaStream_t stream = context->GetCudaStream();
1209 
1210   size_t temp_storage_bytes = 0;
1211   K2_CUDA_SAFE_CALL(cub::DeviceRadixSort::SortPairs(
1212       //nullptr, temp_storage_bytes, src.values.Data(),
1213       //static_cast<int32_t *>(nullptr), ans.Data(), ans.Data(), num_elements, 0,
1214       //log_buckets, stream));
1215       nullptr, temp_storage_bytes, src.values.Data(), src_tmp_out.Data(),
1216       order.Data(), ans.Data(), num_elements, 0, log_buckets, stream));
1217 
1218 
1219   Array1<int8_t> d_temp_storage(
1220       //context, temp_storage_bytes + num_elements * sizeof(int32_t));
1221       Array1<int8_t> d_temp_storage(context, temp_storage_bytes));
1222 
1223   K2_CUDA_SAFE_CALL(cub::DeviceRadixSort::SortPairs(
1224       //d_temp_storage.Data() + sizeof(int32_t) * num_elements,
1225       //temp_storage_bytes, src.values.Data(),
1226       //reinterpret_cast<int32_t *>(d_temp_storage.Data()), ans.Data(),
1227       //ans.Data(), num_elements, 0, log_buckets, stream));
1228       d_temp_storage.Data(), temp_storage_bytes, src.values.Data(),
1229       src_tmp_out.Data(), order.Data(), ans.Data(), num_elements, 0,
1230       log_buckets, stream));
1231 
1232   //if (!kDisableDebug && !DisableChecks())
1233   CheckGetTransposeReordering(src, ans);
1234   return ans;
1235 #else
1236   //if (src.NumAxes() == 3)
1237   //  return GetTransposeReorderingThreeAxesCuda(src, num_cols);
1238   if (src.NumAxes() == 3){
1239      Array1<int3_t> ans = GetTansposeReorderingThreeAxesCuda(src, num_cols);
1240      //if (!kDisableDebug && !DisableChecks())
1241      CheckGetTransposeReordering(src, ans);
1242      return ans;
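
The four compile errors above all come from the hand-merged line 1221, where the declaration of d_temp_storage ended up nested inside itself. Based on the commented-out original just above it, that line presumably needs to be a single constructor call, i.e. Array1<int8_t> d_temp_storage(context, temp_storage_bytes);. A self-contained analogue of the same parsing problem, using std::vector instead of k2::Array1:

#include <cstdint>
#include <vector>

int main() {
  std::size_t temp_storage_bytes = 1024;

  // What the hand-merged code at line 1221 effectively contains is a
  // declaration nested inside another declaration's argument list:
  //
  //   std::vector<int8_t> buf(
  //       std::vector<int8_t> buf(temp_storage_bytes));
  //
  // The compiler tries to parse the inner part as a function parameter list,
  // so "temp_storage_bytes" (and, in the k2 version, "context") would have to
  // be type names -- hence the "is not a type name" errors.  The intended form
  // is a single constructor call:
  std::vector<int8_t> buf(temp_storage_bytes);

  return buf.size() == temp_storage_bytes ? 0 : 1;
}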

@danpovey
Collaborator

danpovey commented Jan 13, 2021 via email

@csukuangfj
Collaborator

Can you check that you did git checkout the correct commit?

Please use git checkout, not git merge.

@csukuangfj
Collaborator

Anyway, here is my code:

You did not resolve the merge conflicts in a correct way.

@shanguanma
Author

Anyway, here is my code:

You did not resolve the merge conflicts in a correct way.

Yes, I merged your code manually. I am not very good with git commands, so I don't know how to merge your #586 (comment) automatically.

@csukuangfj
Collaborator

I would suggest

cd k2
git remote add kk https://github.com/csukuangfj/k2.git
git fetch kk
git checkout fangjun-get-transpose-reordering
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..

@shanguanma
Author

Thanks, I will follow your suggestion.

@shanguanma
Author

Solved. Thanks a lot, @danpovey and @csukuangfj.

@danpovey
Collaborator

So did it work?

@shanguanma
Author

So did it work?

Yes, Dan, it works.
