Amount of synchronization seemingly a bit hurtful for AMD GPUs #48
Hi @Epliz, thank you! HIP seems to only support 7 (!) super expensive AMD GPUs, so I'll stick with OpenCL. Maybe your findings give AMD an incentive to optimize their OpenCL runtime :)

At least one synchronization is required per time step. Otherwise, with an unlimited number of time steps, the OpenCL queue gets new entries every couple of milliseconds, but kernels can't complete fast enough, so the queue and system memory fill up within seconds, causing a crash.

Can you send me the standard benchmark results for MI50/MI100 please? From my experience, AMD GPUs are quite sensitive to box size, so maybe go with 464³ resolution instead of the standard 256³, which runs fastest on the Radeon VII. Those would be good additions to the table!

Regards,
Hey, actually I was wrong: when the kernel code is the same, there is not really any difference between HIP and OpenCL beyond the small synchronization difference.
I think the changes are correct, but to be honest I am not sure how to test them properly. I hope my changes are valid and that they can be of help. Best,
Hi @Epliz, during the last week I have experimented some more with reducing synchronization barriers in every time step. In headless mode, the performance difference on the Radeon VII is measurable but insignificant, and in interactive graphics mode it makes the graphics freeze repeatedly, as graphics kernels then don't get placed in the queue often enough. I also tried event-based synchronization in multi-GPU mode, but that isn't faster either. Removing synchronization from every time step creates more trouble than benefit. Specializing stream_collide for even and odd time steps defeats the purpose of modular code. Passing the

In my testing, the few 64-bit integer operations for array index calculation also don't significantly impact performance. When using 64-bit integers everywhere (for the global ID and for computing all the neighbor grid indices), there is a significant difference, so I stick to 32-bit for the maximum grid size and only extend to 64-bit for array accesses in higher dimensions, like in the

Regards,
Hi Moritz, thanks for trying, and I think your results make sense. I appreciate the effort in any case.
Hi,
First off, I want to say that you have made some great software, very nice both to use and to look into. I really appreciate the effort that went into making it.
I was quite surprised by the disparity between the performance that AMD cards reach in the benchmark compared to NVidia GPUs, so I tried to look into it. I ported the stream_collide kernel to HIP and saw that the performance reachable with HIP is much higher, effectively the same peak-bandwidth ratio as with NVidia GPUs. (If of interest, I can provide that HIP code.) So that led me to think that something is wrong with AMD's OpenCL runtime, such as kernel launch & synchronization overheads or something of the like.
I saw that in the case of single-GPU simulation, there was some synchronization that could be removed in FluidX3D by synchronizing once per run() call instead of once per do_time_step(). By doing so, I can see that the total simulation time gets smaller on AMD GPUs, by roughly a constant amount of time independent of the GPU. For faster GPUs, this means the improvement is relatively more significant.
Here are some numbers:
While the improvement isn't amazing, I guess it is a low-effort improvement, and the effect might be more visible on faster GPUs, maybe 10% relatively on an MI250X if the absolute time saved is the same.
Hoping you consider this improvement,
On my side I will continue trying to understand where the disparity between HIP and OpenCL comes from for funsies,
Best regards,
Epliz