
prov/opx: 1.15.1: using always_inline with flatten __attribute__ causes huge gcc memory usage when LTO is used #7916

Closed
kloczek opened this issue Aug 2, 2022 · 15 comments

@kloczek

kloczek commented Aug 2, 2022

Describe the bug
Using always_inline together with the flatten attribute causes huge gcc memory usage when LTO is used (I've hit 110 GB of physical RAM).

All details of this case, an explanation of the issue in the libfabric code, and a patch which fixes the issue are at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106499
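
For illustration, a minimal sketch of the attribute combination at issue (hypothetical code, not the actual OPX source): always_inline forces a helper to be inlined at every call site, while flatten additionally asks GCC to inline everything reachable from the function body, so under LTO the expansion is repeated across the whole link.

```c
#include <stdint.h>

/* Hypothetical illustration of the problematic pattern, not the actual
 * OPX source. always_inline forces helper() to be inlined at every call
 * site; flatten asks GCC to inline everything reachable from hot_path().
 * Across a large call graph, with LTO redoing the expansion at link
 * time, compile/link memory use can explode. */

__attribute__((always_inline)) static inline uint64_t
helper(uint64_t x)
{
	return x * 2654435761u; /* stand-in for real work */
}

__attribute__((flatten)) uint64_t
hot_path(uint64_t x)
{
	uint64_t acc = 0;
	for (int i = 0; i < 4; i++)
		acc += helper(x + i);
	return acc;
}
```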

@kloczek kloczek added the bug label Aug 2, 2022
@shefty
Member

shefty commented Aug 2, 2022

If I understand the bug report, the problem is that the OPX provider is insane and thinks that having a 100 GB binary will somehow outperform a provider with function calls and a significantly smaller footprint. Does that seem to be the issue? Can you build libfabric using --disable-opx and see if the problem is limited to that provider?

@kloczek
Author

kloczek commented Aug 2, 2022

This +100 GB is not the binary size but the amount of memory used by gcc during linking (after it reached 110 GB of consumed RAM and 2.5 h of linking, I stopped the process, so the total CPU time and memory needed for the full link is probably even bigger).

@shefty
Member

shefty commented Aug 2, 2022

I know trying to build opx takes about 5 minutes on my system. Can you try using --disable-opx and verify that the problem goes away? I want to confirm this is related to the opx provider.

@kloczek
Author

kloczek commented Aug 2, 2022

With --disable-opx the issue does not occur.
So why is this provider enabled by default?

Also, for distribution binaries, should I build all providers as loadable modules? 🤔

@shefty
Member

shefty commented Aug 2, 2022

libfabric will try to build all providers that have the required prerequisites available on the system. I don't see a strong reason to build providers as loadable modules (unless you want to update only that provider against an older build of libfabric). Even if a provider is built into libfabric, if a loadable provider is found, it will be used if it reports a newer version number than the internal provider.
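
For reference, a hedged sketch of the configure invocations under discussion (flag names follow libfabric's usual `--enable-<provider>`/`--disable-<provider>` configure conventions; check `./configure --help` for your version):

```sh
# Build without the OPX provider entirely:
./configure --disable-opx

# Or, assuming libfabric's --enable-<provider>=dl convention, build OPX
# as a loadable module (DSO) instead of linking it into libfabric:
./configure --enable-opx=dl
```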

@shefty shefty changed the title 1.15.1: using always_inline with flatten __attribute__ causes huge gcc memory usage when LTO is used prov/opx: 1.15.1: using always_inline with flatten __attribute__ causes huge gcc memory usage when LTO is used Aug 2, 2022
@kloczek
Author

kloczek commented Aug 2, 2022

I don't see a strong reason to build providers as loadable modules (unless you want to update only that provider against an older build of libfabric). Even if a provider is built into libfabric, if a loadable provider is found,

Kind of strange logic.
If someone wants to update a provider DSO by recompiling it, why not just rebuild libfabric? 🤔
To be honest, I've packaged libfabric because it is a build dependency of other packages.
What are all those providers for?
Some diagnostics? 🤔

@shefty
Member

shefty commented Aug 2, 2022

Suppose there is an installed version of libfabric on a system (say, packaged by RedHat or SuSE). That version may have all providers built into it. Now say vendor Intel finds a bug in their provider and wants to deploy a fix. They can do this by shipping only their provider library. This way, the system-installed version of libfabric would still be used, but it can pick up the latest Intel provider.

Not everyone will necessarily be able to rebuild libfabric and re-install it on their system. This way, the other providers that might be in use (the ones built into libfabric) are not modified. This also provides a mechanism for providers that are maintained out of tree, and may not be open source, to be used with the upstream libfabric.

@kloczek
Author

kloczek commented Aug 2, 2022

Suppose there is an installed version of libfabric on a system (say, packaged by RedHat or SuSE). That version may have all providers built into it. Now say vendor Intel finds a bug in their provider and wants to deploy a fix.

We are talking about providers which are part of libfabric. Please ..
So those providers are like drivers. IMO the name is a bit misleading; they should be called just drivers and/or HW backend/platform drivers.
"Provider" suggests providing new functionality, while in this case it looks like it is about supporting specific HW.

PS. Still, always_inline should not be used together with flatten.

@shefty
Member

shefty commented Aug 2, 2022

"Provider" is the term that we use because we needed to call these plug-ins something, and at this point it's baked into the API, so I don't see that changing.

Most are not drivers in the common use of that term and are not associated with specific HW. They provide an implementation of the APIs over "something" -- shared memory, tcp sockets, udp sockets, any verbs-based device, EFA NIC, etc. -- maybe even over another provider in order to extend its functionality. Most of the API calls are function pointers that go directly into the provider. libfabric itself only exports a handful of actual APIs.
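
To illustrate the dispatch model described above, here is a generic sketch of the function-pointer pattern in C. The names are hypothetical, not the actual libfabric headers; it only shows the shape of "most API calls jump through pointers into the selected provider".

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical vtable: the core library exports few symbols, and most
 * calls go through function pointers filled in by whichever provider
 * was selected (tcp, shm, verbs, ...). */
struct provider_ops {
	const char *name;
	int (*send)(const void *buf, size_t len);
};

/* A "tcp-like" provider supplies its own implementation. */
static int tcp_send(const void *buf, size_t len)
{
	(void)buf;
	printf("tcp provider: sending %zu bytes\n", len);
	return 0;
}

static const struct provider_ops tcp_provider = {
	.name = "tcp",
	.send = tcp_send,
};

/* The application-facing call is just an indirect jump. */
static inline int ep_send(const struct provider_ops *ep,
			  const void *buf, size_t len)
{
	return ep->send(buf, len);
}

int main(void)
{
	char msg[] = "hello";
	return ep_send(&tcp_provider, msg, sizeof msg);
}
```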

@kloczek
Author

kloczek commented Aug 2, 2022

So, in other words, the opx provider code is broken because it tries to use always_inline together with `flatten`.

@timothom64
Contributor

Looking into this

@timothom64
Contributor

Fixed with 779972e

@charlesshereda
Contributor

Can we close this? Tim's removal of the flatten attribute in 779972e should have addressed this.
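
A sketch of the shape of that change (the real diff is in commit 779972e; the function name here is hypothetical): keeping always_inline on a small helper is cheap, and it was the flatten attribute that forced the whole reachable call graph to be expanded.

```c
/* before the fix (illustrative): */
/* __attribute__((always_inline, flatten)) static inline int ... */

/* after the fix: flatten dropped, forced inline of the small helper kept */
__attribute__((always_inline)) static inline int
opx_helper(int x)
{
	return x + 1;
}
```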

@kloczek
Author

kloczek commented Oct 12, 2022

If you think the issue is solved, just close the ticket.
Since there are now 60+ commits since the last release, I'll check it on the next release. 😋

@timothom64
Contributor

I don't think I have the ability to close this ticket?

@shefty shefty closed this as completed Jan 26, 2023