CURAND_STATUS_PREEXISTING_FAILURE with v2.0.1 but not v1.7.3 #682
Comments
Could you add a call to |
I tried both

    @checked function curandGenerateSeeds(generator)
        initialize_api()
        CUDAdrv.synchronize()
        @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                       (curandGenerator_t,),
                       generator)
    end

and

    @checked function curandGenerateSeeds(generator)
        CUDAdrv.synchronize()
        initialize_api()
        @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                       (curandGenerator_t,),
                       generator)
    end

and I still get the same error / stack trace, although anecdotally it seems to take a little longer to trigger (that might just be in my head). Is that what you meant?
Yes, but sadly it doesn't catch anything. I wonder why CURAND thinks there's a preexisting failure then. Bisecting would be useful. Due to the coupling between CuArrays/CUDAnative/GPUArrays you'll probably have to use the Manifest that's part of CuArrays (only a few commits don't work, you can |
Ok, I bisected it, and the first bad commit is 65a35b1. I checked a couple of times and I'm pretty sure this is it. I'm using the Manifest like you suggested, so the breakdown is:
I notice that whenever I switch between these two commits I get
which may be relevant.
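For anyone repeating the bisect, the Manifest-based setup suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: `cuarrays_dir` is a hypothetical path to a local clone of CuArrays that has already been checked out at the commit under test, and the reproducer itself is not shown because no MWE exists in this thread.

```julia
using Pkg

# Hypothetical location of a local CuArrays clone, already checked out
# at the commit being tested (an assumption, not a path from this thread).
cuarrays_dir = joinpath(homedir(), "CuArrays.jl")

# The bisect relies on the Manifest that ships with the repository,
# so stop early if it is not there.
isfile(joinpath(cuarrays_dir, "Manifest.toml")) ||
    error("no Manifest.toml found in $cuarrays_dir")

# Activate the repository's own environment so that the matching
# CUDAnative / GPUArrays / CUDAdrv versions are used.
Pkg.activate(cuarrays_dir)

# Install exactly the dependency versions recorded in that Manifest.
Pkg.instantiate()

# Print the resolved versions to confirm which combination is under test.
Pkg.status()
```

Re-running this after each checkout keeps the coupled packages in sync with the CuArrays commit being tested.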
Hmm, that doesn't help much. Are you using multiple threads or tasks?
My code is single-threaded and can run in a one-MPI-process-per-GPU configuration. I mentioned above that it sometimes hangs instead of giving me the CURAND_STATUS_PREEXISTING_FAILURE error, but based on your comment and that bisect I ran my code with a single MPI process, and in that case it looks like it just always hangs. Maybe the CURAND_STATUS_PREEXISTING_FAILURE is a red herring / side effect of the real issue? With a single process, I reproduced the hang about 5 times (with the "bad" versions from above), and each time I get this identical stack trace if I just kill it:
Ah, so even |
This appears to be fixed for me on 2.2.0, presumably by the referenced issue above. Guessing the CURAND thing was just a random side-effect.
I recently upgraded from 1.7.3 to 2.0.1 and started seeing this error sporadically in my code. It sometimes takes about a minute of running, but I always hit it eventually. Unfortunately I'm unable to come up with a MWE, but I can reliably reproduce this on my system, including switching back and forth between 1.7.3 and 2.0.1 and seeing it appear / disappear.
The code is doing what I think are fairly standard manipulations of CuArrays (no custom kernels). I have Julia 1.4,
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2
, and stacktrace (which I see maybe 80% of the time; the other 20% it just seems to hang):
Any ideas what may have changed that could be causing this?
Would it be helpful if I bisect to the exact commit? Or maybe an expert can just guess what's going on from here?