Initial implementation of gfx942 #6358
Conversation
Change-Id: Id31ca3ba5356d021cade2abc3e3f51f9f3b4d211
Change-Id: I1454bb0b91518bfcf7a04506e40b98387cdf8ed9
Change-Id: Id9c03fe451d1d28a3c23a77f161a2600f016c7e4
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Can you say a word about the thread fences that are being added?
Technically, this was a violation of the memory model: there was no guarantee that the writes of the intermediate reduction values became visible before the last block read them in the second stage. It never bit us because the reordering was quite unlikely on the hardware we currently run on, but that may not always hold. I've tested correctness and performance of several LAMMPS benchmarks, a regular dot product, and the yAx tutorial example on MI-250, and saw essentially no impact from unconditionally including the fence.
Change-Id: Ibd028fddeedf8e0fdda50b72625ab62cee6fa71e
CUDA failure unrelated.
No description provided.