100% GPU 'freeze' with Zipformer #416
Answering myself. |
If it happens frequently enough, it may be possible to find out which kernel
was running when it crashed, by doing something like:
nsys profile python3 <your top level python script>
... and by looking at the .qdrep file with NVIDIA Nsight Systems, you may be
able to see that.
That could require a "debug" version of PyTorch, though.
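A minimal sketch of such a profiling run, with placeholder names for the output report and the top-level script (neither is taken from this issue):
# Capture CUDA kernel, NVTX, and OS-runtime activity for the whole run.
nsys profile --trace=cuda,nvtx,osrt -o my_report python3 serve.py
# Open the resulting my_report.qdrep (newer nsys versions write .nsys-rep)
# in the Nsight Systems GUI and check which kernels sit at the end of the CUDA timeline.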
…On Sun, Jul 2, 2023 at 9:39 PM gabor-pinter ***@***.***> wrote:
Answering myself.
No exhaustive testing was done, but when using matching CUDA (=11.7) and
PyTorch CUDA(=11.7) versions (by using nvcr.io/nvidia/pytorch:22.08-py3
as base image), the problem seems to disappear. Again, no exhaustive
testing was done yet.
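A quick way to confirm that the toolkit in the image and the CUDA build of PyTorch actually agree (generic checks, not commands taken from this issue):
# CUDA toolkit shipped in the image
nvcc --version | grep release
# Driver version visible inside the container
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA version PyTorch was compiled against
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
The toolkit release and torch.version.cuda should match (11.7 here); the driver just needs to be new enough, otherwise the compat package comes into play.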
|
By the way, could you post the complete commands you are using? Also, did you change any code? |
Hi Dan,
Do you mean a static build of torch with debug symbols? I am slowly developing an itch to build torch in-house, and there will probably be a point where we cannot avoid it. |
Hi @csukuangfj,
Yes, we did make some changes, but mainly around logging.
Sure, let me go back to a version where I can reproduce the issue, and I will post the command (hopefully with some insights from |
Torch has some kind of debug build option, I think... I don't know whether they
distribute these via pip etc.
…On Tue, Jul 4, 2023, 8:02 AM gabor-pinter ***@***.***> wrote:
Hi Dan,
Thanks for the comment on the profiler.
Though I only used it on the "fixed" setup, nsys output is really
informative, thanks for mentioning it. For the reader, a human-readable
report can be generated with
nsys stats report3.nsys-rep
(where report3.nsys-rep is the nsys dump):
that could require a "debug" version of PyTorch, though.
Do you mean a static build of torch with debug symbols? I am slowly
developing an itch to build torch in house - and probably there will be a
point where we cannot avoid it.
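As a side note on the nsys stats command quoted above: a per-kernel GPU time summary can also be pulled out directly on the command line; the report name varies between nsys releases (cuda_gpu_kern_sum in newer ones, gpukernsum in older ones), so treat this as a sketch:
# Summarize total GPU time per kernel from the capture.
nsys stats --report cuda_gpu_kern_sum report3.nsys-rep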
|
When it comes to debug builds, I believe there are too many flags/options to consider for a release version. |
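For reference, the upstream PyTorch source build does expose debug switches through environment variables; a rough sketch, assuming a PyTorch source checkout and the usual build prerequisites (this is the generic upstream procedure, nothing sherpa-specific):
# Unoptimized build with full debug symbols (slow to build and to run).
DEBUG=1 python3 setup.py develop
# Or an optimized build that still keeps debug symbols.
REL_WITH_DEB_INFO=1 python3 setup.py develop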
An update:
[1] running the binary directly
[2] nsys run
[3] compute-sanitizer run
@csukuangfj , here is the command to start the server:
|
The error shows it cannot find the following file: /usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7. Could you set
export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH
and see if the error goes away? |
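One way to double-check that the loader then picks libnvrtc up from that directory is glibc's LD_DEBUG tracing (a generic sketch; note that libnvrtc may only be loaded lazily, so it might not show up until the failing code path actually runs):
export LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH
# Trace library resolution while importing torch and keep only the nvrtc lines.
LD_DEBUG=libs python3 -c "import torch" 2>&1 | grep -i nvrtc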
Hi @csukuangfj, one thing I noticed though is that the python/dist-packages path preceded CUDA's compat library. My guess is that the compat lib is supposed to come left-most in LD_LIBRARY_PATH. |
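If the compat driver libraries are indeed supposed to win, the ordering would look something like this; /usr/local/cuda/compat is where NGC images usually place them, but verify the path in your image before relying on it:
# Compat libraries first, then the pip-installed nvrtc, then everything else.
export LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/lib/python3.8/dist-packages/nvidia/cuda_nvrtc/lib:$LD_LIBRARY_PATH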
If the server just rebooted without anything in the logs, it's likely that the power supply was not sufficient and it tripped due to the GPUs being used too much. |
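To test that hypothesis, GPU power draw can be logged against the board limit while the workload runs (a generic nvidia-smi query, not specific to this setup):
# Print per-GPU power draw vs. limit once per second.
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1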
We are having an issue using zipformer with multiple worker threads, which looks like a livelock/busy-deadlock situation:
Further notes:
I am attaching a trace log captured during a deadlock.
The node I am using has 8 virtual nodes; Sherpa uses 11 threads.
Sherpa seems to be active on 4 threads:
Environment:
Does it look like a race/sync issue?
Hopefully I will be able to post results from NVIDIA's compute-sanitizer (see the sketch below).
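Since a race/sync issue is suspected, here is a rough sketch of the kind of checks that could be run; the script name is a placeholder, and compute-sanitizer only covers device-side hazards, so a host-side livelock is better inspected through per-thread stacks:
# Device-side: racecheck looks for shared-memory hazards, synccheck for invalid synchronization.
compute-sanitizer --tool racecheck python3 serve.py
compute-sanitizer --tool synccheck python3 serve.py
# Host-side: dump the stacks of all threads of the stuck process to see where they spin.
gdb -p <PID> -batch -ex "thread apply all bt"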