
100% GPU 'freeze' with Zipformer #416

Open
gabor-pinter opened this issue Jun 27, 2023 · 11 comments

@gabor-pinter

We are having an issue using Zipformer with multiple worker threads that looks like a livelock/busy-deadlock situation:

  • GPU utilization at 100% (jumping suddenly from about 30% to 100%)
  • no further processing is done

Further notes:

  • less likely to happen with a lower number of worker threads
  • less likely to happen with smaller batch sizes
  • does not happen with the pruned transducer model
  • does not happen during CPU-only computation (without a GPU)

I am attaching a trace log captured during a deadlock.
The node I am using has 8 virtual nodes, and Sherpa uses 11 threads.
Sherpa seems to be active on 4 threads:

#1  sherpa::OnlineZipformerTransducerModel::GetEncoderInitStates(...)
#11 sherpa::OnlineRecognizer::OnlineRecognizerImpl::DecodeStreams(...)
#15 sherpa::OnlineTransducerGreedySearchDecoder::Decode(...)
#18 sherpa::OnlineZipformerTransducerModel::RunEncoder(...)
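
For reference, a sketch of how a per-thread backtrace like the one above can be captured from the hung process (assuming gdb is available; <PID> is the server's process id):

# Attach to the stuck server and dump a backtrace for every thread
gdb -p <PID> -batch -ex "thread apply all bt" > freeze_backtrace.log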

Environment:

Nvidia driver version: 510.108.03
CUDA runtime version: 11.8.89
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
Is debug build: False
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)

Versions of relevant libraries:
[pip3] k2==1.23.3.dev20230105+cuda11.7.torch1.13.1
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.13.1
[pip3] torch-tensorrt==1.3.0a0
[pip3] torchaudio==2.0.2
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0

Does it look like a race/sync issue?
Hopefully I will be able to post results from NVIDIA's compute-sanitizer.

@gabor-pinter
Author

Answering myself: when using matching system CUDA (11.7) and PyTorch CUDA (11.7) versions (by using nvcr.io/nvidia/pytorch:22.08-py3 as the base image), the problem seems to disappear. No exhaustive testing has been done yet, though.
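
A quick way to sanity-check that the CUDA toolkit and PyTorch agree inside such a container (illustrative only; assumes Docker with the NVIDIA runtime):

docker pull nvcr.io/nvidia/pytorch:22.08-py3
# Print the PyTorch version and the CUDA version it was built against
docker run --rm --gpus all nvcr.io/nvidia/pytorch:22.08-py3 \
    python3 -c "import torch; print(torch.__version__, torch.version.cuda)"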

@danpovey
Collaborator

danpovey commented Jul 3, 2023 via email

@csukuangfj
Collaborator

The node I am using has 8 virtual nodes, and Sherpa uses 11 threads.
Sherpa seems to be active on 4 threads:

By the way, could you post the complete commands you are using? Also, did you change any code?

@gabor-pinter
Author

Hi Dan,
Thanks for the comment on the profiler.
Though I have only used it on the "fixed" setup so far, the nsys output is really informative; thanks for mentioning it. For the reader, a human-readable report can be generated with
nsys stats report3.nsys-rep
(where report3.nsys-rep is the nsys dump).
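
For completeness, a sketch of how such a dump can be captured in the first place (the output name is arbitrary; the server path and flags are the ones from the command posted further down):

# Profile the server under Nsight Systems; report3.nsys-rep is written on exit
nsys profile -o report3 \
    /workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
    --use-gpu=true ...   # remaining flags as in the full command below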

that could require a "debug" version of PyTorch, though.

Do you mean a static build of torch with debug symbols? I am slowly developing an itch to build torch in-house, and there will probably be a point where we cannot avoid it.

@gabor-pinter
Author

Hi @csukuangfj,

Also, did you change any code?

Yes, we made some changes, but mainly around logging.

could you post the complete commands you are using?

Sure, let me go back to a version where I can reproduce the issue, and I will post the command (hopefully with some insights from nsys).

@danpovey
Collaborator

danpovey commented Jul 4, 2023 via email

@gabor-pinter
Author

When it comes to debug builds, I believe there are too many flags/options to consider for a release version.

@gabor-pinter
Author

An update:
I tested the crash-prone version under 3 conditions:

[1] running the binary directly

  • sometimes completes
  • sometimes 100% GPU freeze

[2] nsys run

  • no freezes

[3] compute-sanitizer run (an illustrative invocation is sketched right after this list)

  • crashes with:
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/zipformer.py", line 67, in forward
  ... [omitted]
RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.
The libnvrtc-builtins copies present on the system:
/usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.11.8.89
/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7
  • it seems that during the compute-sanitizer run the system and dist-package versions of libnvrtc-builtins get mixed up
  • not sure if this is the reason for the 100% GPU freeze, though (it is possible that the original problem is a plain old out-of-memory error, during which the above ABI-incompatibility error happens)
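
A sketch of how such a sanitizer run can be launched (illustrative only, not the exact command used; the default memcheck tool is selected explicitly and output is logged to a file):

compute-sanitizer --tool memcheck --log-file sanitizer.log \
    /workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
    --use-gpu=true ...   # remaining flags as in the full command below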

@csukuangfj, here is the command used to start the server:

/workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
     --port=7014 \
     --nn-model=${MDL_DIR}/cpu_jit.pt \
     --tokens=${MDL_DIR}/tokens.txt \
     --doc-root=$WEB_INDEX \
     --use-gpu=true \
     --sample-frequency=8000 \
     --num-work-threads=10 \
     --max-batch-size=400 \
     --decode-chunk-size=64

@csukuangfj
Collaborator

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.

The error shows that it cannot find the following file:

/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7

Could you set

export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH

and see if the error goes away?

@gabor-pinter
Author

Hi @csukuangfj,
Thanks for the hint. The modification of LD_LIBRARY_PATH worked. However, after 2 runs the server crashed. Since a few of us are using the same server, (1) I am not absolutely sure whether this modification has anything to do with the crash, and (2) I will have to find a calm period when I can try again.

One thing I noticed, though, is that the python/dist-packages path now precedes CUDA's compat library. My guess is that the compat lib is supposed to come left-most in LD_LIBRARY_PATH.
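
If that guess is right, an ordering like the following might be safer (the compat directory path is an assumption based on where the NGC images usually place it, not something verified on this box):

# Keep the CUDA compat libraries ahead of the pip-installed nvrtc copy
export LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH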

@danpovey
Collaborator

If the server just rebooted without anything in the logs, it's likely that the power supply was not sufficient and it tripped due to the GPUs being used too much.
