prov/opx: used by default instead of psm2 even though it's "beta" #7796

Closed
bartoldeman opened this issue May 31, 2022 · 7 comments

bartoldeman commented May 31, 2022

Describe the bug
The opx provider is used by default even though it is labelled BETA; psm2 is only used if you disable opx or set FI_PROVIDER=psm2.

If opx is enabled, it takes priority over psm2 since it comes before psm2 in the provider order in src/fabric.c:

        char *ordered_prov_names[] = {
                "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",

Shouldn't this be "efa", "psm2", "opx", "psm", "usnic", "gni", "bgq", "verbs", instead?
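To make the ordering question concrete, here is a minimal sketch of priority-by-position selection. It is not libfabric's actual code: `example_order`, `prov_rank` and `pick_default` are hypothetical names, the list holds only the entries quoted above (the real array continues), and ranking unlisted providers last is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative priority list: only the entries visible in the fragment
 * quoted from src/fabric.c. The real array is longer. */
static const char *const example_order[] = {
    "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",
};
#define EXAMPLE_ORDER_LEN (sizeof(example_order) / sizeof(example_order[0]))

/* Hypothetical helper: position of `name` in `order`. A lower value means
 * higher priority. Names not in the list rank after every listed name
 * (an assumption made for this sketch). */
static size_t prov_rank(const char *name, const char *const *order, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, order[i]) == 0)
            return i;
    }
    return n;
}

/* Hypothetical helper: among the providers that are available on a node,
 * return the one with the lowest rank, or NULL if none are available. */
static const char *pick_default(const char *const *avail, size_t n_avail,
                                const char *const *order, size_t n_order)
{
    const char *best = NULL;
    size_t best_rank = 0;

    for (size_t i = 0; i < n_avail; i++) {
        size_t rank = prov_rank(avail[i], order, n_order);
        if (best == NULL || rank < best_rank) {
            best = avail[i];
            best_rank = rank;
        }
    }
    return best;
}
```

With the order as quoted, a node where both opx and psm2 are available gets opx from `pick_default`; with the two entries swapped as proposed, the same call returns psm2.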

Secondly, if you force psm2 via FI_PROVIDER=psm2, all symbols dynamically linked from libpsm2.so (e.g. psm2_mq_irecv2) are duplicated by the psm3 provider inside libfabric.so, so they are not taken from libpsm2.so. As a consequence, all communication goes over Ethernet instead of Omni-Path. As far as I can see, this was fixed in libfabric 1.15.0 by commit 3f1d52d.

To Reproduce
Steps to reproduce the behavior:

  • With FI_PROVIDER unset, run a test with FI_LOG_LEVEL=trace; the log shows that opx is used on Omni-Path.

Expected behavior
  • With FI_PROVIDER unset, run a test with FI_LOG_LEVEL=trace; the log shows that psm2 is used on Omni-Path.

Environment:

$ opainfo 
hfi1_0:1                           PortGID:0xfe80000000000000:00117501017afb0e
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4         
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   LID: 0x000000a7-0x000000a7       SM LID: 0x00000005 SL: 0 
         QSFP Copper,       2m  Hitachi Metals    P/N IQSFP26C-20       Rev 00
   Xmit Data:         1200547346 MB Pkts:         255264322799
   Recv Data:         1580317452 MB Pkts:         297296790887
   Link Quality: 5 (Excellent)

Additional context

Workaround: set FI_PROVIDER=psm2

@ToddRimmer
Contributor

> duplicated by the psm3 provider

What version of libfabric are you using? This issue was corrected a couple of releases ago.

@bartoldeman
Author

> duplicated by the psm3 provider
> What version of libfabric are you using? This issue was corrected a couple of releases ago.

Yes, sorry, I conflated two issues. The issue with symbols occurs with 1.12.1 but not with 1.15.1.

Still, defaulting to opx over psm2 is surprising, so I'll edit the issue and leave that part.

@bartoldeman bartoldeman changed the title from "prov/psm2: bypassed by either psm3 or opx if enabled without dlopen" to "prov/opx: used by default instead of psm2 even though it's "beta"" on May 31, 2022
@timothom64
Contributor

I'll look into this. Can you assign this to me?

@timothom64
Contributor

Created PR #7926.

@timothom64
Contributor

This is probably fixed/closed

@timothom64
Contributor

@shefty another one that I think has been fixed

@shefty shefty closed this as completed Jan 26, 2023
@shefty
Member

shefty commented Jan 26, 2023

@timothom64 - Do you need write access for the opx provider? (and ofiwg more broadly)
