
Amount of synchronization seemingly a bit hurtful for AMD GPUs #48

Closed
Epliz opened this issue Jan 15, 2023 · 4 comments
Labels
not an issue (not a problem with the software)

Comments


Epliz commented Jan 15, 2023

Hi,

First off I want to say that you have made some great software, very nice to both use and look into. I really appreciate the effort that went into making it.

I was quite surprised by the performance disparity between AMD cards and NVidia GPUs in the benchmark, so I tried to look into it. I ported the stream_collide kernel to HIP and saw that the performance reachable with HIP is much higher, effectively the same fraction of peak bandwidth as on NVidia GPUs. (If of interest, I can provide that HIP code.)

So that led me to think that something is wrong with AMD's OpenCL runtime, such as kernel launch and synchronization overheads or something of the like.
I saw that in the case of a single-GPU simulation, some synchronization could be removed in FluidX3D by synchronizing once per run() call instead of once per do_time_step().
By doing so, the total simulation time gets smaller on AMD GPUs, by roughly a constant amount of time independent of the GPU. For faster GPUs, that means the relative improvement gets more significant.
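
Here is a minimal host-side sketch of what I mean (not the actual FluidX3D code; this uses the OpenCL C++ wrapper API, and the helper name and argument index are just illustrative):

```cpp
// Sketch only: synchronize once per run() instead of once per do_time_step().
// Context, buffers and the stream_collide kernel are assumed to be set up elsewhere.
#include <CL/opencl.hpp>
#include <cstdint>

// Hypothetical helper: enqueue one LBM time step (argument index 0 for t is illustrative).
void enqueue_time_step(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                       cl::NDRange global, uint64_t t) {
    stream_collide.setArg(0, (cl_ulong)t); // time step parameter
    queue.enqueueNDRangeKernel(stream_collide, cl::NullRange, global, cl::NullRange);
}

// Before: one blocking synchronization per time step.
void run_sync_every_step(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                         cl::NDRange global, uint64_t steps) {
    for(uint64_t t=0ull; t<steps; t++) {
        enqueue_time_step(queue, stream_collide, global, t);
        queue.finish(); // host blocks until this step has completed
    }
}

// After: enqueue all steps back-to-back and synchronize once at the end of run().
void run_sync_once(cl::CommandQueue& queue, cl::Kernel& stream_collide,
                   cl::NDRange global, uint64_t steps) {
    for(uint64_t t=0ull; t<steps; t++) {
        enqueue_time_step(queue, stream_collide, global, t);
    }
    queue.finish(); // single synchronization point for the whole run() call
}
```

The second variant is essentially all the "less sync" change amounts to for the single-GPU case.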

Here are some numbers:

| GPU | Grid size | Simulation time before change | Simulation time after change | Improvement |
| --- | --- | --- | --- | --- |
| Ryzen 5800U | 128³ | 114 s | 108 s | 1.05x faster |
| NVidia V100 16GB PCIe | 256³ | 34.2 s | 34.1 s | same |
| AMD RX 6700 XT | 256³ | 131 s | 127 s | 1.03x faster |
| AMD MI50 | 256³ | 50.7 s | 48.0 s | 1.05x faster |
| AMD MI100 | 256³ | 34.0 s | 32.0 s | 1.06x faster |

While the improvement isn't amazing, it is a low-effort one, and the effect might be more visible on faster GPUs, maybe around 10% relative on an MI250X if the absolute time saving is the same.

Hoping you consider this improvement.
On my side I will keep trying to understand where the disparity between HIP and OpenCL comes from, for funsies.

Best regards,
Epliz

ProjectPhysX (Owner) commented

Hi @Epliz,

thank you!

HIP seems to only support 7 (!) super expensive AMD GPUs, so I'll stick with OpenCL. Maybe your findings give AMD an incentive to optimize their OpenCL runtime :)

At least one synchronization is required per time step. Otherwise, with an unlimited number of time steps, the OpenCL queue gets new entries every couple of milliseconds, but kernels can't complete fast enough, so the queue and system memory fill up within seconds, causing a crash.
For multi-GPU I have already minimized the number of synchronization points to what's absolutely necessary, and disabled the additional synchronization for single-GPU if there is more than one domain.

Can you send me the standard benchmark results for MI50/MI100 please? From my experience, AMD GPUs are quite sensitive to box size, so maybe go with 464³ resolution instead of the standard 256³, which runs fastest on the Radeon VII. These would be good additions to the table!

Regards,
Moritz


Epliz commented Jan 22, 2023

Hey,

Actually, I was wrong: when the kernel code is the same, there is not really any difference between HIP and OpenCL beyond the small synchronization difference.
To illustrate what I proposed, I put my changes in a fork at https://github.com/Epliz/FluidX3D/commits/master.
I added some other small optimizations, so in total there are the following:

  1. Less synchronization
  2. Specialize the stream_collide kernel to remove the t parameter (see the sketch after this list)
  3. Use "near loads" and "near stores" whenever possible, which leverage a variant of the load/store instructions that avoids some 64-bit operations

I think the changes are correct, but to be honest I am not sure how to test them properly.
With those changes, on a 256³ grid on my MI100, performance goes from 5220 peak MLUPs (baseline from your repo) to 5572 peak MLUPs. The gains come predominantly from the synchronization optimization, to the point that you don't really need to take the other commits.
On the V100 there is no performance change at all.

I hope my changes are valid, and that they can be of help.

Best,
Epliz

ProjectPhysX (Owner) commented

Hi @Epliz,

during the last week I have experimented some more with reducing synchronization barriers in every time step. In headless mode, the performance difference on the Radeon VII is measurable but insignificant, and in interactive graphics mode it makes the graphics freeze repeatedly, as graphics kernels then don't get placed in the queue often enough. I also tried event-based synchronization in multi-GPU, but that isn't faster either. Removing synchronization from every time step creates more trouble than benefit.

Specializing stream_collide for even and odd time steps defeats the purpose of modular code. Passing the t parameter directly has no disadvantage. I couldn't measure a significant performance difference between 32-bit and 64-bit time parameter. I was considering passing time only as even/odd (0/1) or as 32-bit integer, but left it as 64-bit integer as the time step could be used as a seed for random number generation in the future.

In my testing, the few 64-bit integer operations for array index calculation also don't significantly impact performance. When using 64-bit integers everywhere (for the global ID and for computing all the neighbor grid indices), there is a significant difference, so I stick to 32-bit for the maximum grid size and only extend to 64-bit for array accesses in higher dimensions, like in the fi array. This makes sure that up to 2³² grid points work without integer overflow. The application is so bandwidth-bound that even with the few 64-bit integer operations at a 1:64 INT64:FP32 ratio, it remains well within the bandwidth limit.
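
A rough OpenCL-C-style sketch of the indexing I mean (names like def_N and the fi layout here are illustrative, not the actual kernel):

```c
// Sketch only (OpenCL C): mixed 32/64-bit indexing.
// Names (def_N, fi, fhn) and the SoA layout are illustrative, not FluidX3D's actual kernel.
kernel void stream_collide_sketch(global float* fi, const uint def_N) {
    const uint n = (uint)get_global_id(0); // 32-bit cell index, works for up to 2^32 grid points
    if(n >= def_N) return;

    float fhn[19]; // local copy of the distribution functions (D3Q19 as an example)
    for(uint i = 0u; i < 19u; i++) {
        // widen to 64-bit only for the offset into the large fi array,
        // since i*def_N + n can exceed 32 bits for big grids
        const ulong index = (ulong)i * (ulong)def_N + (ulong)n;
        fhn[i] = fi[index];
    }
    // ... collision and streaming would follow here ...
}
```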

Regards,
Moritz

ProjectPhysX closed this as not planned on Mar 12, 2023

Epliz commented Mar 12, 2023

Hi Moritz,

Thanks for trying, and I think your results make sense.
For the effect of reducing synchronization, based on Amdahl's law I would expect a bigger improvement on faster GPUs like the MI100 or MI250X, while the improvement should be smaller on the Radeon VII.
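To make that expectation concrete (rough numbers of my own, based on the table above): if the removed synchronization costs a roughly constant time $\Delta$, the relative speedup is

$$\mathrm{speedup} = \frac{T}{T-\Delta},$$

so with the MI100 numbers (T ≈ 34 s, Δ ≈ 2 s) that is about 1.06x, while a GPU finishing the same case in half the time (T ≈ 17 s, same Δ) would see about 1.13x.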
Have you had the chance to try on an MI250X?

In any case, thanks for trying.
Best,
Epliz

ProjectPhysX added the "not an issue" label on Mar 26, 2023