CUDA crashes when passed complex record array #9471
Comments
Many thanks for taking the time to capture what's needed for this bug report - I appreciate this has been a bit of a pain and difficult to nail down. In my attempts so far I haven't seen a crash, but running under valgrind does suggest some incorrect behaviour in NVVM:
Often this sort of thing can be triggered by invalid IR that gets past validation. The next step is to identify whether we're producing invalid IR, or triggering an NVVM bug.
Thank you! For me it crashes all the time. Also, the real data structures are somewhat more complex (this is probably 75% of them) - I deleted fields somewhat randomly until I got to the point where, if I deleted more, it would not crash. (Also, I didn't try valgrind - with compute-sanitizer, when it crashed I wouldn't get any data.) I'll be quite happy to see if any fix for what you see also fixes my problem. If not, I can also send a version with MOAR fields.

NB - as far as best practices go, I realize that most kernels simply aren't so large and complex. I'm taking advantage of cooperative groups to do "many different shapes of (parallel) operation" in sequence, as in the sketch below. As a baseline I would expect this to save a bit of time by avoiding extra communication with the CPU at the end of kernels (even if the data stays on the device), and I hope that longer term, when more cooperative group primitives are available, I can save even more time (e.g. by splitting the groups).

Besides, ahem, crashing - I would be curious to know if there are any other reasons to avoid this approach. It would seem there aren't that many people using cooperative groups, despite Numba as a whole (rightfully) being quite popular.
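(A minimal sketch of the single-kernel, multi-phase pattern being described - the names and phases here are illustrative, not from the real code. Numba detects the use of `this_grid()` and performs a cooperative launch, so `g.sync()` acts as a grid-wide barrier in place of a kernel boundary:)

```python
import numpy as np
from numba import cuda

@cuda.jit
def multi_phase(x):
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0   # phase 1
    g.sync()          # grid-wide barrier instead of ending the kernel
    if i < x.size:
        x[i] *= 2.0   # phase 2 sees every thread's phase-1 result

x = cuda.to_device(np.zeros(64))
multi_phase[2, 32](x)
```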
I just realised, looking at your traceback, that we're observing issues in different places - yours is in the linker and mine is in NVVM. I tried with a later version of NVVM and didn't see the corruption, but I also didn't get a crash. So I need to go back and attempt to reproduce your issue again.
Is there any chance you could run again with the environment variables set and provide the output, please? Note the output will go to stderr and will be around 1 MB, so you may need to do whatever you need to do to capture that in pytest - I modified the reproducer so I could run it outside of pytest by calling it directly.
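(The specific variables requested here were not captured in this thread. As an illustration only - `NUMBA_CUDA_LOG_LEVEL` is a real Numba setting, but whether it is the one requested is an assumption - this is how Numba's CUDA driver logging is typically enabled from Python:)

```python
# Illustrative only: the exact variables requested above were not captured.
# NUMBA_CUDA_LOG_LEVEL enables CUDA driver API logging to stderr; it must be
# set before the CUDA subsystem is initialized.
import os
os.environ["NUMBA_CUDA_LOG_LEVEL"] = "DEBUG"

from numba import cuda  # import only after setting the variable

# ... run the reproducer; capture stderr, e.g. `python repro.py 2> log.txt`
```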
OK - less output than expected, but that may be due to the crash itself. logs
Many thanks! Could you now please run with:

set and provide the output? (We are homing in on the problem - I apologise for the back-and-forth, though.)
And you are correct - there's not so much output because of the crash; the majority of it in my setup came from a dump of the cubin, which your run doesn't reach the point of producing.
OK - you wanted a huge amount of text - this time you got it! :)
Thanks! Using the PTX you provided in that dump, I can get:
when linking it with the driver - a crash at the same point you saw. Similarly, ptxas crashes with it:
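(A hedged sketch of checking a dumped PTX file with ptxas directly from Python - the filename and target architecture are assumptions, and the actual command used above was not captured:)

```python
# Assumed setup: the PTX from the debug dump was saved to kernel.ptx, and
# ptxas from the CUDA toolkit is on PATH. sm_70 is an illustrative target.
import subprocess

result = subprocess.run(
    ["ptxas", "-arch=sm_70", "kernel.ptx", "-o", "kernel.cubin"],
    capture_output=True, text=True,
)
# A negative return code on POSIX means ptxas died from a signal (a crash),
# as opposed to reporting an ordinary compilation error on stderr.
print(result.returncode)
print(result.stderr)
```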
Excellent - glad we are zeroing in. Odd that you get different PTX, though (I presume). Are you indeed seeing a difference in a diff?
I haven't looked yet; my first hypothesis is that maybe our file paths differ, leading to longer variable names, or something like that. I will resume looking at this on Monday.
Any luck?
That sounds worrying - if data structures rely on concatenated names (rather than mangling, say), then there is an implicit limit on their complexity - one that varies from platform to platform, and maybe from one installation path to another (?). A sketch of the concern is below.
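(A minimal sketch of that concern, using illustrative dtypes rather than the reproducer's: nested record types in Numba carry long string representations, so any scheme that embeds such names in generated symbols grows quickly with structural complexity:)

```python
# Illustrative dtypes only - not the reproducer's. The point is that the
# stringified Numba type for a nested record grows with every level.
import numpy as np
from numba import from_dtype

inner = np.dtype([("a", np.float64), ("b", np.int32)])
mid = np.dtype([("x", inner), ("y", inner)])
outer = np.dtype([("p", mid), ("q", mid), ("r", mid)])

print(len(str(from_dtype(inner))))  # short
print(len(str(from_dtype(outer))))  # much longer - roughly multiplicative
```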
👀 @gmarkall - any chance you'll have time to look into this? I assume you got absorbed in the release (which incorporated a few other things I was waiting for - thank you!). I've split up my kernel and data structures into pieces, but it's an ad hoc, messy solution. I'm eager to learn at least what might be going on here, and what the prospects of fixing it would be.
Passing sufficiently complex data to a CUDA kernel, then passing that data on to a device function and reading a piece of it, causes the kernel to crash. A minimal sketch of the pattern is below; the actual reproducer follows.
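(The sketch uses assumed names and a trivial dtype - the real reproducer's record type is far more complex, and the crash only appears at that complexity:)

```python
import numpy as np
from numba import cuda

rec = np.dtype([("pos", np.float64, 3), ("flag", np.int32)])

@cuda.jit(device=True)
def get_flag(arr, i):
    return arr[i]["flag"]  # device function reads one field of the record

@cuda.jit
def kernel(arr, out):
    i = cuda.grid(1)
    if i < arr.size:
        out[i] = get_flag(arr, i)

arr = np.zeros(8, dtype=rec)
out = np.zeros(8, dtype=np.int32)
kernel[1, 8](arr, out)  # with the real, larger structures, this crashes
```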
Code to reproduce
Error message on running
With apologies for the size of the example - as it is, it took me a whole day to isolate (after a couple of days trying to debug). There may be ways to simplify it, but it seems that if I delete almost any field, it tends to pass rather than fail; it doesn't seem to care which field. I suspect there is some buffer somewhere - or the equivalent - that I am overflowing with the shape of the structure.
I have been struggling with similar problems for at least a few months now. Usually I have been able to fiddle with the data structures to make the problem go away, and the complexity of the reproducers has dissuaded me from submitting them. However, the continual fiddling with data structures is becoming a long-term maintenance issue for the code base. Also, this time the error seems more recondite. (I have trimmed a few things out of the real structures to create this example.)
numba -s output
Update: I noticed that there were newer NVIDIA libraries available, and updated Python 3.11.3 => 3.11.8. No change to results. I've updated the numba -s output and crash report to the latest.