
[OpenMP] Performance degradation from devices reorganization #75677

Closed
dhruvachak opened this issue Dec 16, 2023 · 20 comments
Labels: openmp:libomptarget (OpenMP offload runtime), openmp

@dhruvachak (Contributor) commented Dec 16, 2023

@ronlieb collected performance numbers using versions of the upstream compiler that show degradations after the devices reorganization. It appears that #74397 is one of the patches causing the degradation. Currently, all available devices on a system are initialized eagerly; previously, only the devices actually used by an application were initialized.

These are the llvm-project before/after commits used for comparison.

Before SHA: bb0f162
After SHA: 77c40ea

SPEChpc 2021 benchmark 505.lbm degraded 10% when using 8 MPI ranks on a system with 8 AMD GPUs. The configuration tested is the default setting, i.e., without the env var ROCR_VISIBLE_DEVICES set.
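For context, the difference between the two behaviors can be sketched as follows. This is a minimal sketch with hypothetical types and names, not the actual libomptarget code: with lazy initialization, a device pays its setup cost only on first use, while eager initialization brings up every visible device at runtime startup.

```cpp
// Minimal sketch of lazy vs. eager device initialization (hypothetical
// types and names; not the actual libomptarget implementation).
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Device {
  explicit Device(int Id) { /* per-device plugin setup, ~10ms on this HW */ }
};

class DeviceTable {
  std::vector<std::unique_ptr<Device>> Devices;
  std::mutex Mtx;

public:
  explicit DeviceTable(std::size_t Count) : Devices(Count) {}

  // Lazy: the first use of device Id pays its init cost; devices an
  // application never touches are never initialized.
  Device &getOrInit(std::size_t Id) {
    std::lock_guard<std::mutex> Lock(Mtx);
    if (!Devices[Id])
      Devices[Id] = std::make_unique<Device>(static_cast<int>(Id));
    return *Devices[Id];
  }

  // Eager: initialize everything up front. With 8 GPUs at ~10ms each this
  // is ~80ms per process, even if the process (e.g. one MPI rank) only
  // ever uses a single GPU.
  void initAll() {
    for (std::size_t Id = 0; Id < Devices.size(); ++Id)
      (void)getOrInit(Id);
  }
};
```

With 8 MPI ranks per node and eager initialization, each rank pays the full 8-device cost, so the node performs roughly 64 device initializations at startup instead of 8.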


@jdoerfert (Member)

@dhruvachak Could you please provide a profile (LIBOMPTARGET_PROFILE) for the issue? We (= @fel-cab) tried to replicate it and it does not show up on our end. We saw a ~10% slowdown between Oct 23 and 30, maybe that is what you are seeing?
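For anyone who wants to collect this: LIBOMPTARGET_PROFILE takes a file name and makes libomptarget write a time-trace profile in JSON (Chrome tracing) format, which includes runtime events such as device initialization. A minimal program that is enough to produce one is sketched below; the build flags are illustrative and depend on your toolchain and GPU.

```cpp
// profile_test.cpp: one trivial target region, so the runtime initializes
// a device and records its events in the profile.
//
// Build (illustrative; exact flags depend on toolchain and GPU):
//   clang++ -fopenmp --offload-arch=gfx90a profile_test.cpp -o profile_test
// Run with profiling enabled; the result is a JSON time trace:
//   LIBOMPTARGET_PROFILE=profile.json ./profile_test
#include <cstdio>

int main() {
  int X = 0;
#pragma omp target map(tofrom : X)
  { X = 42; }
  std::printf("X = %d\n", X); // forces use of the offloaded result
  return 0;
}
```

Comparing the before/after traces (e.g. loaded in chrome://tracing) should make any extra device-initialization time directly visible.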

@doru1004 (Contributor)

Thanks for trying to replicate it. What system did you try it on? With how many GPUs?

@fel-cab (Contributor) commented Dec 19, 2023

We are running on Frontier: 1 node, 8 GPUs.

@doru1004 (Contributor) commented Dec 19, 2023

Did you run it with 8 MPI processes?

@fel-cab (Contributor) commented Dec 19, 2023

Yes

@doru1004 (Contributor)

And you don't have something that silently sets ROCR_VISIBLE_DEVICES, right?

@fel-cab (Contributor) commented Dec 19, 2023

Not that I know of. I've been running SPEChpc weekly with the LLVM weekly build on Frontier for about a year, and I have not used ROCR_VISIBLE_DEVICES. This is the first time I've heard of it.
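For reference, ROCR_VISIBLE_DEVICES tells the ROCm runtime which GPUs a process may see (e.g. ROCR_VISIBLE_DEVICES=3 exposes only GPU 3, which then appears as device 0), so setting it per rank would sidestep eager initialization of all 8 devices. The alternative is to select a device inside the application. The following is an illustrative sketch of that pattern, not code taken from the benchmark:

```cpp
// Sketch: bind each MPI rank to one GPU via the OpenMP device API. With
// lazy device initialization, each rank then only ever initializes the
// single device it actually uses.
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int Rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);

  // Note: with eager initialization, all visible devices are set up
  // regardless of this per-rank choice.
  int NumDevices = omp_get_num_devices();
  if (NumDevices > 0)
    omp_set_default_device(Rank % NumDevices); // rank 0 -> GPU 0, etc.

#pragma omp target
  {
    /* this rank's work runs on its assigned GPU */
  }

  MPI_Finalize();
  return 0;
}
```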

@jdoerfert (Member)

The profile will include the time we spend in deviceInit, so it will be easy to see.
If you generate the profile for both versions (before and after), we can probably pinpoint where it spends extra time, if any.

@ronlieb (Contributor) commented Dec 19, 2023

Hi @fel-cab, I will take a run at this on Frontier tomorrow sometime; kinda swamped today. Maybe I can call you at some point and we can compare notes?

@fel-cab (Contributor) commented Dec 19, 2023

Sounds good.

@doru1004 (Contributor) commented Jan 4, 2024

We saw the degradation on PCIe interconnects. That's why you're not seeing the performance degradation on your side.

@jdoerfert (Member)

> We saw the degradation on PCIe interconnects. That's why you're not seeing the performance degradation on your side.

Can you please share a profile w/ and w/o this patch (or your proposed PR)?
LIBOMPTARGET_PROFILE is sufficient.

@doru1004 (Contributor) commented Jan 5, 2024

> Can you please share a profile w/ and w/o this patch (or your proposed PR)?
> LIBOMPTARGET_PROFILE is sufficient.

I don't have any more details than the overall numbers: slight improvements with my PR on the 8-GPU PCIe system, and no real difference on the non-PCIe 8-GPU system tested. There may well be other cases where this patch makes a difference, but that's the data I have for now.

@jdoerfert (Member) commented Jan 6, 2024

The profile system is literally built in; just run with the env var set to a file name and share the results. We can then see where the extra time is spent.

Edit:
If you cannot share the full profile, feel free to diff it yourself and present the results.
My guess is that the kernels are slower due to oversubscription, but it could also be something else; we'll see.
Also, did you confirm the problem is not reproducible on Frontier?

@doru1004 (Contributor) commented Jan 9, 2024

I'm not sure why oversubscription would be a problem here. The issue should only concern the initial runtime setup time, no?

@jdoerfert (Member)

> I'm not sure why oversubscription would be a problem here. The issue should only concern the initial runtime setup time, no?

In that case, the profile will clearly show that part. Let's simply take a look.

@jdoerfert (Member)

[Screenshot: two LIBOMPTARGET_PROFILE traces, 2024-01-09]

Here is a screenshot from a system with 8 PCIe-connected MI250X cards. The upper one initializes all 8 eagerly; the lower one initializes only a single one. As before, initialization takes ~10ms per device, thus 80ms vs 10ms. Is the slowdown you see ~70ms long, or do you experience something else?

@jdoerfert (Member)

@dhruvachak Can we close this?

@dhruvachak (Contributor, Author)

Yes; I'm not sure whether this is reproducible any more.

7 participants: @doru1004, @jdoerfert, @ronlieb, @llvmbot, @dhruvachak, @fel-cab, and others