
[OpenMP] Performance degradation from devices reorganization #75677

Closed
dhruvachak opened this issue Dec 16, 2023 · 20 comments
Labels: openmp:libomptarget (OpenMP offload runtime), openmp

@dhruvachak (Contributor) commented Dec 16, 2023

@ronlieb collected performance numbers using versions of the upstream compiler that show degradations after the devices reorganization. It appears that #74397 is one of the patches causing the degradation. Currently, all available devices on a system are initialized eagerly; previously, only the devices actually used by an application were initialized.

These are the llvm-project before/after commits used for comparison.

Before SHA: bb0f162
After SHA: 77c40ea

SPEChpc 2021 benchmark 505.lbm degraded 10% when using 8 MPI ranks on a system with 8 AMD GPUs. The configuration tested is the default setting, i.e., without the env var ROCR_VISIBLE_DEVICES set.
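For context, the difference between the two behaviors can be sketched as follows. This is a minimal sketch with hypothetical types and names, not the actual libomptarget code: with lazy initialization, a device pays its setup cost only on first use, while eager initialization brings up every visible device at runtime startup.

```cpp
// Minimal sketch of lazy vs. eager device initialization (hypothetical
// types and names; not the actual libomptarget implementation).
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Device {
  explicit Device(int Id) { /* per-device plugin setup, ~10ms on this HW */ }
};

class DeviceTable {
  std::vector<std::unique_ptr<Device>> Devices;
  std::mutex Mtx;

public:
  explicit DeviceTable(std::size_t Count) : Devices(Count) {}

  // Lazy: the first use of device Id pays its init cost; devices an
  // application never touches are never initialized.
  Device &getOrInit(std::size_t Id) {
    std::lock_guard<std::mutex> Lock(Mtx);
    if (!Devices[Id])
      Devices[Id] = std::make_unique<Device>(static_cast<int>(Id));
    return *Devices[Id];
  }

  // Eager: initialize everything up front. With 8 GPUs at ~10ms each this
  // is ~80ms per process, even if the process (e.g. one MPI rank) only
  // ever uses a single GPU.
  void initAll() {
    for (std::size_t Id = 0; Id < Devices.size(); ++Id)
      (void)getOrInit(Id);
  }
};
```

With 8 MPI ranks per node and eager initialization, each rank pays the full 8-device cost, so the node performs roughly 64 device initializations at startup instead of 8.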


@jdoerfert (Member)

@dhruvachak Could you please provide a profile (LIBOMPTARGET_PROFILE) for the issue? We (= @fel-cab) tried to replicate it and it does not show up on our end. We saw a ~10% slowdown between Oct 23 and 30, maybe that is what you are seeing?
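For anyone who wants to collect this: LIBOMPTARGET_PROFILE takes a file name and makes libomptarget write a time-trace profile in JSON (Chrome tracing) format, which includes runtime events such as device initialization. A minimal program that is enough to produce one is sketched below; the build flags are illustrative and depend on your toolchain and GPU.

```cpp
// profile_test.cpp: one trivial target region, so the runtime initializes
// a device and records its events in the profile.
//
// Build (illustrative; exact flags depend on toolchain and GPU):
//   clang++ -fopenmp --offload-arch=gfx90a profile_test.cpp -o profile_test
// Run with profiling enabled; the result is a JSON time trace:
//   LIBOMPTARGET_PROFILE=profile.json ./profile_test
#include <cstdio>

int main() {
  int X = 0;
#pragma omp target map(tofrom : X)
  { X = 42; }
  std::printf("X = %d\n", X); // forces use of the offloaded result
  return 0;
}
```

Comparing the before/after traces (e.g. loaded in chrome://tracing) should make any extra device-initialization time directly visible.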

@doru1004 (Contributor)

Thanks for trying to replicate it. What system did you try it on? With how many GPUs?

@fel-cab (Contributor) commented Dec 19, 2023

We are running on Frontier: 1 node, 8 GPUs.

@doru1004 (Contributor) commented Dec 19, 2023

Did you run it with 8 MPI processes?

@fel-cab (Contributor) commented Dec 19, 2023

Yes

@doru1004 (Contributor)

And you don't have something that silently sets ROCR_VISIBLE_DEVICES, right?

@fel-cab (Contributor) commented Dec 19, 2023

Not that I know of. I've been running SPEChpc weekly with the LLVM weekly build on Frontier for about a year, and I have not used ROCR_VISIBLE_DEVICES. This is the first time I've heard of it.
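For reference, ROCR_VISIBLE_DEVICES tells the ROCm runtime which GPUs a process may see (e.g. ROCR_VISIBLE_DEVICES=3 exposes only GPU 3, which then appears as device 0), so setting it per rank would sidestep eager initialization of all 8 devices. The alternative is to select a device inside the application. The following is an illustrative sketch of that pattern, not code taken from the benchmark:

```cpp
// Sketch: bind each MPI rank to one GPU via the OpenMP device API. With
// lazy device initialization, each rank then only ever initializes the
// single device it actually uses.
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int Rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);

  // Note: with eager initialization, all visible devices are set up
  // regardless of this per-rank choice.
  int NumDevices = omp_get_num_devices();
  if (NumDevices > 0)
    omp_set_default_device(Rank % NumDevices); // rank 0 -> GPU 0, etc.

#pragma omp target
  {
    /* this rank's work runs on its assigned GPU */
  }

  MPI_Finalize();
  return 0;
}
```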

@jdoerfert (Member)

The profile will include the time we spend in deviceInit, so it will be easy to see.
If you generate the profile for both versions (before and after), we can probably pinpoint where it spends extra time, if any.

@ronlieb (Contributor) commented Dec 19, 2023

Hi @fel-cab, I will take a run at this on Frontier tomorrow sometime; kinda swamped today. Maybe I can call you at some point and we can compare notes?

@fel-cab (Contributor) commented Dec 19, 2023

Sounds good.

@doru1004 (Contributor) commented Jan 4, 2024

We saw the degradation on PCIe interconnects. That's why you're not seeing the performance degradation on your side.

@jdoerfert (Member)

> We saw the degradation on PCIe interconnects. That's why you're not seeing the performance degradation on your side.

Can you please share a profile w/ and w/o this patch (or your proposed PR)?
LIBOMPTARGET_PROFILE is sufficient.

@doru1004 (Contributor) commented Jan 5, 2024

> Can you please share a profile w/ and w/o this patch (or your proposed PR)?
> LIBOMPTARGET_PROFILE is sufficient.

I don't have any more details than the overall numbers: slight improvements with my PR on the 8-GPU PCIe system, and no real difference on the non-PCIe 8-GPU system tested. There may well be other cases where this patch makes a difference, but that's the data I have for now.

@jdoerfert (Member) commented Jan 6, 2024

The profile system is literally built in; just run with the env var set to a file name and share the results. We can then see where the extra time is spent.

Edit:
If you cannot share the full profile, feel free to diff it yourself and present the results.
My guess is that the kernels are slower due to oversubscription, but it could also be something else; we'll see.
Also, did you confirm the problem is not reproducible on Frontier?

@doru1004 (Contributor) commented Jan 9, 2024

I'm not sure why oversubscription would be a problem here. The issue should only concern the initial runtime setup time, no?

@jdoerfert (Member)

> I'm not sure why oversubscription would be a problem here. The issue should only concern the initial runtime setup time, no?

In that case, the profile will clearly show that part. Let's simply take a look.

@jdoerfert (Member)

[Screenshot: two LIBOMPTARGET_PROFILE traces, 2024-01-09]

Here is a screenshot from a system with 8 PCIe-connected MI250X cards. The upper one initializes all 8 eagerly; the lower one initializes only a single one. As before, initialization takes ~10ms per device, thus 80ms vs 10ms. Is the slowdown you see ~70ms long, or do you experience something else?

@jdoerfert (Member)

@dhruvachak Can we close this?

@dhruvachak (Contributor, Author)

Yes; I'm not sure whether this is reproducible any more.

7 participants: @doru1004, @jdoerfert, @ronlieb, @llvmbot, @dhruvachak, @fel-cab, and others