
100% GPU 'freeze' with Zipformer #416

Open
gabor-pinter opened this issue Jun 27, 2023 · 11 comments

@gabor-pinter

We are having an issue using Zipformer with multiple worker threads that looks like a livelock/busy-deadlock situation:

  • GPU utilization at 100% (jumping suddenly from about 30% to 100%)
  • no further processing is done

Further notes:

  • less likely to happen with a lower number of worker threads
  • less likely to happen with smaller batch sizes
  • does not happen with the pruned transducer model
  • does not happen during CPU-only computation (without a GPU)

I am attaching a trace log captured during a deadlock.
The node I am using has 8 virtual nodes, and Sherpa uses 11 threads.
Sherpa seems to be active on 4 threads:

#1  sherpa::OnlineZipformerTransducerModel::GetEncoderInitStates(...)
#11 sherpa::OnlineRecognizer::OnlineRecognizerImpl::DecodeStreams(...)
#15 sherpa::OnlineTransducerGreedySearchDecoder::Decode(...)
#18 sherpa::OnlineZipformerTransducerModel::RunEncoder(...)
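
For reference, a sketch of how a per-thread backtrace like the one above can be captured from the hung process (assuming gdb is available; <PID> is the server's process id):

# Attach to the stuck server and dump a backtrace for every thread
gdb -p <PID> -batch -ex "thread apply all bt" > freeze_backtrace.log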

Environment:

Nvidia driver version: 510.108.03
CUDA runtime version: 11.8.89
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
Is debug build: False
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)

Versions of relevant libraries:
[pip3] k2==1.23.3.dev20230105+cuda11.7.torch1.13.1
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.13.1
[pip3] torch-tensorrt==1.3.0a0
[pip3] torchaudio==2.0.2
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0

Does it look like a race/sync issue?
Hopefully I will be able to post results from NVIDIA's compute-sanitizer.

@gabor-pinter
Author

Answering myself: when using matching system CUDA (11.7) and PyTorch CUDA (11.7) versions (by using nvcr.io/nvidia/pytorch:22.08-py3 as the base image), the problem seems to disappear. No exhaustive testing has been done yet, though.
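
A quick way to sanity-check that the CUDA toolkit and PyTorch agree inside such a container (illustrative only; assumes Docker with the NVIDIA runtime):

docker pull nvcr.io/nvidia/pytorch:22.08-py3
# Print the PyTorch version and the CUDA version it was built against
docker run --rm --gpus all nvcr.io/nvidia/pytorch:22.08-py3 \
    python3 -c "import torch; print(torch.__version__, torch.version.cuda)"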

@danpovey
Collaborator

danpovey commented Jul 3, 2023 via email

@csukuangfj
Collaborator

The node I am using has 8 virtual nodes, and Sherpa uses 11 threads.
Sherpa seems to be active on 4 threads:

By the way, could you post the complete commands you are using? Also, did you change any code?

@gabor-pinter
Author

Hi Dan,
Thanks for the comment on the profiler.
Though I have only used it on the "fixed" setup so far, the nsys output is really informative; thanks for mentioning it. For the reader, a human-readable report can be generated with
nsys stats report3.nsys-rep
(where report3.nsys-rep is the nsys dump).
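
For completeness, a sketch of how such a dump can be captured in the first place (the output name is arbitrary; the server path and flags are the ones from the command posted further down):

# Profile the server under Nsight Systems; report3.nsys-rep is written on exit
nsys profile -o report3 \
    /workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
    --use-gpu=true ...   # remaining flags as in the full command below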

that could require a "debug" version of PyTorch, though.

Do you mean a static build of torch with debug symbols? I am slowly developing an itch to build torch in-house, and there will probably be a point where we cannot avoid it.

@gabor-pinter
Author

Hi @csukuangfj,

Also, did you change any code?

Yes, we made some changes, but mainly around logging.

could you post the complete commands you are using?

Sure, let me go back to a version where I can reproduce the issue, and I will post the command (hopefully with some insights from nsys).

@danpovey
Collaborator

danpovey commented Jul 4, 2023 via email

@gabor-pinter
Author

When it comes to debug builds, I believe there are too many flags/options to consider for a release version.

@gabor-pinter
Author

An update:
I tested the crash-prone version under 3 conditions:

[1] running the binary directly

  • sometimes completes
  • sometimes 100% GPU freeze

[2] nsys run

  • no freezes

[3] compute-sanitizer run (an illustrative invocation is sketched right after this list)

  • crashes with:
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/zipformer.py", line 67, in forward
  ... [omitted]
RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.
The libnvrtc-builtins copies present on the system:
/usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.11.8.89
/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7
  • it seems that during the compute-sanitizer run the system and dist-package versions of libnvrtc-builtins get mixed up
  • not sure if this is the reason for the 100% GPU freeze, though (it is possible that the original problem is a plain old out-of-memory error, during which the above ABI-incompatibility error happens)
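
A sketch of how such a sanitizer run can be launched (illustrative only, not the exact command used; the default memcheck tool is selected explicitly and output is logged to a file):

compute-sanitizer --tool memcheck --log-file sanitizer.log \
    /workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
    --use-gpu=true ...   # remaining flags as in the full command below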

@csukuangfj, here is the command used to start the server:

/workspace/sherpa/build/temp.linux-x86_64-3.8/bin/sherpa-online-websocket-server \
     --port=7014 \
     --nn-model=${MDL_DIR}/cpu_jit.pt \
     --tokens=${MDL_DIR}/tokens.txt \
     --doc-root=$WEB_INDEX \
     --use-gpu=true \
     --sample-frequency=8000 \
     --num-work-threads=10 \
     --max-batch-size=400 \
     --decode-chunk-size=64

@csukuangfj
Collaborator

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.7.
  Make sure that libnvrtc-builtins.so.11.7 is installed correctly.

The error shows that it cannot find the following file:

/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7

Could you set

export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH

and see if the error goes away?

@gabor-pinter
Author

Hi @csukuangfj,
Thanks for the hint. The modification of LD_LIBRARY_PATH worked. However, after 2 runs the server crashed. Since a few of us are using the same server, (1) I am not absolutely sure whether this modification has anything to do with the crash, and (2) I will have to find a calm period when I can try again.

One thing I noticed, though, is that the python/dist-packages path now precedes CUDA's compat library. My guess is that the compat lib is supposed to come left-most in LD_LIBRARY_PATH.
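
If that guess is right, an ordering like the following might be safer (the compat directory path is an assumption based on where the NGC images usually place it, not something verified on this box):

# Keep the CUDA compat libraries ahead of the pip-installed nvrtc copy
export LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH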

@danpovey
Collaborator

If the server just rebooted without anything in the logs, it's likely that the power supply was not sufficient and it tripped due to the GPUs being used too much.
