
Amount of synchronization seemingly a bit hurtful for AMD GPUs #48

Closed
Epliz opened this issue Jan 15, 2023 · 4 comments
Labels
not an issue (not a problem with the software)

Comments


Epliz commented Jan 15, 2023

Hi,

First off I want to say that you have made some great software, very nice to both use and look into. I really appreciate the effort that went into making it.

I was quite surprised by the performance disparity between AMD cards and NVidia GPUs in the benchmark, so I tried to look into it. I ported the stream_collide kernel to HIP and saw that the performance reachable with HIP is much higher, effectively the same fraction of peak bandwidth as on NVidia GPUs. (If of interest, I can provide that HIP code.)

So that led me to think that something is wrong with AMD's OpenCL runtime, such as kernel launch and synchronization overheads or something of the like.
I saw that in the case of a single-GPU simulation, some synchronization could be removed in FluidX3D by synchronizing once per run() call instead of once per do_time_step().
By doing so, the total simulation time gets smaller on AMD GPUs, by roughly a constant amount of time independent of the GPU. For faster GPUs, that means the relative improvement gets more significant.
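
Here is a minimal host-side sketch of what I mean (not the actual FluidX3D code; this uses the OpenCL C++ wrapper API, and the helper name and argument index are just illustrative):

```cpp
// Sketch only: synchronize once per run() instead of once per do_time_step().
// Context, buffers and the stream_collide kernel are assumed to be set up elsewhere.
#include <CL/opencl.hpp>
#include <cstdint>

// Hypothetical helper: enqueue one LBM time step (argument index 0 for t is illustrative).
void enqueue_time_step(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                       cl::NDRange global, uint64_t t) {
    stream_collide.setArg(0, (cl_ulong)t); // time step parameter
    queue.enqueueNDRangeKernel(stream_collide, cl::NullRange, global, cl::NullRange);
}

// Before: one blocking synchronization per time step.
void run_sync_every_step(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                         cl::NDRange global, uint64_t steps) {
    for(uint64_t t=0ull; t<steps; t++) {
        enqueue_time_step(queue, stream_collide, global, t);
        queue.finish(); // host blocks until this step has completed
    }
}

// After: enqueue all steps back-to-back and synchronize once at the end of run().
void run_sync_once(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                   cl::NDRange global, uint64_t steps) {
    for(uint64_t t=0ull; t<steps; t++) {
        enqueue_time_step(queue, stream_collide, global, t);
    }
    queue.finish(); // single synchronization point for the whole run() call
}
```

The second variant is essentially all the "less sync" change amounts to for the single-GPU case.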

Here are some numbers:

| GPU | Grid size | Simulation time before change | Simulation time after change | Improvement |
| --- | --- | --- | --- | --- |
| Ryzen 5800U | 128³ | 114 s | 108 s | 1.05x faster |
| NVidia V100 16GB PCIe | 256³ | 34.2 s | 34.1 s | same |
| AMD RX 6700 XT | 256³ | 131 s | 127 s | 1.03x faster |
| AMD MI50 | 256³ | 50.7 s | 48.0 s | 1.05x faster |
| AMD MI100 | 256³ | 34.0 s | 32.0 s | 1.06x faster |

While the improvement isn't amazing, it is a low-effort one, and the effect might be more visible on faster GPUs, maybe around 10% relative on an MI250X if the absolute time saving is the same.

Hoping you consider this improvement.
On my side I will keep trying to understand where the disparity between HIP and OpenCL comes from, for funsies.

Best regards,
Epliz

ProjectPhysX (Owner) commented

Hi @Epliz,

thank you!

HIP seems to only support 7 (!) super expensive AMD GPUs, so I'll stick with OpenCL. Maybe your findings give AMD an incentive to optimize their OpenCL runtime :)

At least one synchronization is required per time step. Otherwise, with an unlimited number of time steps, the OpenCL queue gets new entries every couple of milliseconds, but kernels can't complete fast enough, so the queue and system memory fill up within seconds, causing a crash.
For multi-GPU I have already minimized the number of synchronization points to what's absolutely necessary, and disabled the additional synchronization for single-GPU if there is more than one domain.

Can you send me the standard benchmark results for MI50/MI100 please? From my experience, AMD GPUs are quite sensitive to box size, so maybe go with 464³ resolution instead of the standard 256³, which runs fastest on the Radeon VII. These would be good additions to the table!

Regards,
Moritz


Epliz commented Jan 22, 2023

Hey,

Actually, I was wrong: when the kernel code is the same, there is not really any difference between HIP and OpenCL beyond the small synchronization difference.
To illustrate what I proposed, I put my changes in a fork at https://github.com/Epliz/FluidX3D/commits/master.
I added some other small optimizations, so in total there are the following:

  1. Less synchronization
  2. Specialize the stream_collide kernel to remove the t parameter (see the sketch after this list)
  3. Use "near loads" and "near stores" whenever possible, which leverage a variant of the load/store instructions that avoids some 64-bit operations

I think the changes are correct, but to be honest I am not sure how to test them properly.
With those changes, on a 256³ grid on my MI100, performance goes from 5220 peak MLUPs (baseline from your repo) to 5572 peak MLUPs. The gains come predominantly from the synchronization optimization, to the point that you don't really need to take the other commits.
On the V100 there is no performance change at all.

I hope my changes are valid, and that they can be of help.

Best,
Epliz

ProjectPhysX (Owner) commented

Hi @Epliz,

during the last week I have experimented some more with reducing synchronization barriers in every time step. In headless mode, the performance difference on the Radeon VII is measurable but insignificant, and in interactive graphics mode it makes the graphics freeze repeatedly, as graphics kernels then don't get placed in the queue often enough. I also tried event-based synchronization in multi-GPU, but that isn't faster either. Removing synchronization from every time step creates more trouble than benefit.

Specializing stream_collide for even and odd time steps defeats the purpose of modular code. Passing the t parameter directly has no disadvantage. I couldn't measure a significant performance difference between 32-bit and 64-bit time parameter. I was considering passing time only as even/odd (0/1) or as 32-bit integer, but left it as 64-bit integer as the time step could be used as a seed for random number generation in the future.

In my testing, the few 64-bit integer operations for array index calculation also don't significantly impact performance. When using 64-bit integers everywhere (for the global ID and for computing all the neighbor grid indices), there is a significant difference, so I stick to 32-bit for the maximum grid size and only extend to 64-bit for array accesses in higher dimensions, like in the fi array. This makes sure that up to 2³² grid points work without integer overflow. The application is so bandwidth-bound that even with the few 64-bit integer operations at a 1:64 INT64:FP32 ratio, it remains well within the bandwidth limit.
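
A rough OpenCL-C-style sketch of the indexing I mean (names like def_N and the fi layout here are illustrative, not the actual kernel):

```c
// Sketch only (OpenCL C): mixed 32/64-bit indexing.
// Names (def_N, fi, fhn) and the SoA layout are illustrative, not FluidX3D's actual kernel.
kernel void stream_collide_sketch(global float* fi, const uint def_N) {
    const uint n = (uint)get_global_id(0); // 32-bit cell index, works for up to 2^32 grid points
    if(n >= def_N) return;

    float fhn[19]; // local copy of the distribution functions (D3Q19 as an example)
    for(uint i = 0u; i < 19u; i++) {
        // widen to 64-bit only for the offset into the large fi array,
        // since i*def_N + n can exceed 32 bits for big grids
        const ulong index = (ulong)i * (ulong)def_N + (ulong)n;
        fhn[i] = fi[index];
    }
    // ... collision and streaming would follow here ...
}
```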

Regards,
Moritz

ProjectPhysX closed this as not planned on Mar 12, 2023

Epliz commented Mar 12, 2023

Hi Moritz,

Thanks for trying, and I think your results make sense.
For the effect of reducing synchronization, based on Amdahl's law I would expect a bigger improvement on faster GPUs like the MI100 or MI250X, while the improvement should be smaller on the Radeon VII.
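To make that expectation concrete (rough numbers of my own, based on the table above): if the removed synchronization costs a roughly constant time $\Delta$, the relative speedup is

$$\mathrm{speedup} = \frac{T}{T-\Delta},$$

so with the MI100 numbers (T ≈ 34 s, Δ ≈ 2 s) that is about 1.06x, while a GPU finishing the same case in half the time (T ≈ 17 s, same Δ) would see about 1.13x.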
Have you had the chance to try on an MI250X?

In any case, thanks for trying.
Best,
Epliz

ProjectPhysX added the "not an issue" label on Mar 26, 2023