Implement non-blocking kernel launches for HIP backend #3697
Conversation
skyreflectedinmirrors commented Jan 4, 2021 (edited):
- HIP instance owns an array of drivers allocated in host-pinned memory, marked non-coherent to enable caching in L2 when possible
- Successive kernel calls cycle through the available drivers, forcing a synchronization only once the limit is reached
- Calling a fence on the instance resets the cycle index
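The cycling scheme described above can be sketched in plain C++ (no HIP calls; `DriverPool`, `DriverSlot`, `kMaxDrivers`, and the `in_flight` flag are illustrative stand-ins, not the PR's actual types):

```cpp
#include <array>
#include <cstddef>

// Sketch of the launch-slot cycling described in the PR: a fixed pool of
// driver slots is reused round-robin, and we only synchronize when we hand
// out a slot that may still be in use by an in-flight kernel.
constexpr std::size_t kMaxDrivers = 4;  // hypothetical pool size

struct DriverSlot {
  bool in_flight = false;  // "a kernel using this slot may still be running"
};

class DriverPool {
 public:
  // Returns the next slot, synchronizing first if it is still in flight.
  DriverSlot& acquire() {
    DriverSlot& slot = slots_[cycle_index_];
    if (slot.in_flight) {
      synchronize();  // the real backend would fence the device here
    }
    slot.in_flight = true;
    cycle_index_ = (cycle_index_ + 1) % kMaxDrivers;
    return slot;
  }

  // A fence drains all work and resets the cycle index, as in the PR.
  void fence() {
    synchronize();
    cycle_index_ = 0;
  }

  std::size_t cycle_index() const { return cycle_index_; }

 private:
  void synchronize() {
    for (auto& s : slots_) s.in_flight = false;
  }

  std::array<DriverSlot, kMaxDrivers> slots_{};
  std::size_t cycle_index_ = 0;
};
```

The payoff is that the first `kMaxDrivers` launches after a fence proceed without any device synchronization; only wrapping around the pool (or an explicit `fence()`) pays that cost.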
fence();
HIP_SAFE_CALL(hipHostFree(d_driverWorkArray));
m_maxDriverTypeSize = driverTypeSize;
if (m_maxDriverTypeSize % 128 != 0)
Where does `128` come from?
Hmm -- Leopold wrote that bit, I assume it was some attempt at padding / alignment, but I'll check w/ him. It's probably safe to remove since it's pretty arbitrary.
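If the `128` is indeed padding, the usual intent of such a check is to round the per-driver size up to the next multiple of a power-of-two granularity. A minimal sketch of that idea (`pad_to` is a hypothetical helper name, not code from this PR):

```cpp
#include <cstddef>

// Hypothetical padding helper: rounds `size` up to the next multiple of
// `alignment`, which must be a power of two. With alignment = 128 this
// would give each driver a 128-byte-aligned footprint in the work array,
// though per the discussion above the specific value may be arbitrary.
constexpr std::size_t pad_to(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}
```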
I think we need to add one more step and copy the thing to the device, i.e. have one copy on the device and issue an async memcpy from the host to the device before issuing the kernel. Otherwise every access to a member of the driver type will go to the host, won't it?

OK, hmm, my own testing doesn't show that this slows stuff down. Does this allocation get cached on the GPU?
@crtrott -- yes, allocating the host pointer as non-coherent here tells the GPU to attempt to cache the Driver in L2, so the cost should be minimal after the first load.

We could make it an async memcopy to a device pointer, but in practice we've found this is intolerably slow for small copies because the runtime is... not so great for that use-case (yet; we continue to push on them about it). It would probably work fairly well if you could actually queue up ~100 kernels w/o interruption, but right now that isn't typically the case (e.g., in LAMMPS) because there are other (sometimes hidden) synchronization points (e.g., reductions, fences, etc.). Another option we had played around with was using LargeBAR support to just write the Driver directly into device memory.

For the moment I think this is a reasonable implementation, but we might consider implementing the async and LargeBAR options, and directly benchmarking them against this cached zero-copy approach on LAMMPS, ArborX, and/or any of the other codes we know are working (more or less) in the HIP backend right now.
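For reference, the two strategies being compared look roughly like the following (a sketch under assumptions, not the PR's actual code; `Driver`, `allocate_driver_pool`, and the stream/pointer names are placeholders). `hipHostMalloc` with the `hipHostMallocNonCoherent` flag requests pinned host memory that the device is allowed to cache:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

struct Driver;  // placeholder for the backend's launch-argument struct

// Zero-copy approach used in this PR (sketched): pinned, non-coherent host
// memory that the GPU may cache in L2, making repeated Driver reads cheap
// after the first access.
Driver* allocate_driver_pool(std::size_t n_drivers, std::size_t driver_size) {
  void* ptr = nullptr;
  HIP_SAFE_CALL(hipHostMalloc(&ptr, n_drivers * driver_size,
                              hipHostMallocNonCoherent));
  return static_cast<Driver*>(ptr);
}

// The alternative raised above: keep a device-side copy and issue an async
// host-to-device memcpy before each launch (per the discussion, slower in
// practice for small copies with the current runtime):
//   HIP_SAFE_CALL(hipMemcpyAsync(d_driver, h_driver, driver_size,
//                                hipMemcpyHostToDevice, stream));
```

This sketch requires a HIP toolchain and device to run; `HIP_SAFE_CALL` is the error-checking macro already used in the diff above.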
That makes sense, thanks! The non-coherent part is what I didn't quite recognize. In this case I don't see an issue, and this definitely cuts down launch latency quite a bit.