how to specify which provider to test #19

Closed
carns opened this issue Sep 22, 2022 · 9 comments · Fixed by #20
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@carns
Contributor

carns commented Sep 22, 2022

Libfabric library builds often have support for multiple providers built in. How do you control which one is tested?

Skimming through the code I don't immediately see programmatic API control over it in the C code or environment variable control over it in the job scripts.

@hyoklee
Collaborator

hyoklee commented Sep 22, 2022

I don't know. @gnuoyd , do you know? @derobins , please answer this question if you know.

@carns , how do other projects specify a provider?

@carns
Contributor Author

carns commented Sep 22, 2022

I'm not familiar enough with the libfabric API to know how to do it programmatically off the top of my head. You could maybe look at how Mercury selects providers in na_ofi.c.

I'm not sure how well it will work with the code as structured, but you can also set a runtime environment variable (e.g. in the job scripts) that restricts the set of providers libfabric will allow. This is the FI_PROVIDER environment variable, which takes a comma-separated list of providers and acts like an allow list. See https://ofiwg.github.io/libfabric/main/man/fabric.7.html.

On Polaris we want to test "verbs,rxm" (two providers are required to use verbs in reliable datagram mode), on Crusher we want "cxi", and on Theta we want "gni", for example. I guess you could try setting that (or whatever is appropriate for your test platform) and see if the tests execute.
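
A minimal sketch of how a job script could pick that allow list per platform, using the provider names listed above. The function name `provider_for_host` and the hostname patterns are assumptions made for illustration; they are not part of fabtsuite.

```shell
# Hypothetical helper: map a hostname to the FI_PROVIDER value suggested above.
# The hostname patterns are assumptions; adjust them to the real node names.
provider_for_host() {
    case "$1" in
        *polaris*) echo "verbs,rxm" ;;
        *crusher*) echo "cxi" ;;
        *theta*)   echo "gni" ;;
        *)         return 1 ;;   # unknown platform: report failure, print nothing
    esac
}

# Restrict libfabric to the chosen providers for everything launched afterwards.
# On an unknown platform FI_PROVIDER is left untouched.
if prov=$(provider_for_host "$(uname -n)"); then
    export FI_PROVIDER="$prov"
fi
```

Keeping the mapping in one function means each platform-specific script only has to call it before starting the tests.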

Based on this discussion it sounds like we really need the test output to report what provider was used (independent of what was attempted) for validation. I don't know what provider is being selected by default in the tests thus far, but if it is the tcp provider that's not really the transport we want to be testing.

hyoklee self-assigned this on Sep 22, 2022
@hyoklee
Collaborator

hyoklee commented Sep 23, 2022

When FI_PROVIDER is set to gni, I get the following error:

hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available

When FI_PROVIDER is set to cxi, the test hangs:

[hyoklee@login2.crusher build]$ export FI_PROVIDER=cxi
[hyoklee@login2.crusher build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

When FI_PROVIDER is set to verbs,rxm on Polaris, the test hangs with libfabric 1.15.0:

hyoklee@polaris-login-02:~/fabtsuite/build> export FI_PROVIDER=verbs
hyoklee@polaris-login-02:~/fabtsuite/build> ctest -I 1,1
Test project /home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

@carns
Contributor Author

carns commented Sep 23, 2022

> When FI_PROVIDER is set to gni, I get the following error:
>
> hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
> 0.000000083 capabilities not available?
> main.4785: fi_getinfo: No data available
>
> When FI_PROVIDER is set to cxi, test hangs:
>
> [hyoklee@login2.crusher build]$ export FI_PROVIDER=cxi
> [hyoklee@login2.crusher build]$ ctest -I 1,1
> Test project /ccs/home/hyoklee/fabtsuite/build
>     Start 1: single-node
>   C-c C-c

How are you building libfabric? (You can share your environment configuration if you are using Spack.) It might be easiest to debug these kinds of initialization problems by launching the server in an interactive session. You can set the FI_LOG_LEVEL=debug environment variable to get more detailed information out of libfabric.
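
As a rough sketch of that interactive debugging step, the wrapper below runs a single command with a provider allow list and debug logging, and gives up after a time limit so a hang does not block the session. The function name `run_with_provider` and its argument order are assumptions for illustration; `timeout` is the GNU coreutils utility and may be missing on some systems.

```shell
# Hypothetical wrapper: run one command with a provider allow list and
# libfabric debug logging, without changing the caller's environment.
# Usage: run_with_provider <providers> <timeout-seconds> <command> [args...]
run_with_provider() {
    prov=$1
    limit=$2
    shift 2
    # The variables are set only for this command; timeout(1) ends the
    # command if it is still running after <timeout-seconds>.
    FI_PROVIDER="$prov" FI_LOG_LEVEL=debug timeout "$limit" "$@"
}
```

For example, `run_with_provider cxi 60 ./fabtget` would start the binary shown in the transcripts above with cxi as the only allowed provider and stop it after 60 seconds.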

@hyoklee
Collaborator

hyoklee commented Sep 23, 2022

On Crusher, I use the system-provided libfabric 1.15.0. On Theta, I use spack install fabtsuite ^libfabric fabrics=gni,tcp,udp,rxd,rxm.

@hyoklee
Collaborator

hyoklee commented Oct 31, 2022

I ran the test again after adding the following lines to the test/wait.slurm script:

export FI_PROVIDER=cxi
export FI_LOG_LEVEL=debug

I used the system libfabric.
The test failed with a timeout on Crusher, and the debug log reported the error in detail:

libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: \
CQ wait objects not supported
get_state_open.4216: fi_cq_open: Function not implemented
real 0.09
user 0.00
sys 0.01
1
srun: error: crusher008: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=208067.1
0

I could also verify that the address returned by fabtget differs from the one returned with the tcp provider.

So fabtsuite does appear to be able to test a different provider.
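
In the log excerpt in this comment, the provider name (cxi) is the fourth colon-separated field of the libfabric line. Assuming that layout holds for other FI_LOG_LEVEL output, a small filter can list which providers actually emitted log messages, as a stopgap until the tests report the provider themselves. The function name `providers_in_log` is hypothetical, and the field position is an assumption drawn only from this one excerpt.

```shell
# Hypothetical helper: list the distinct provider names that appear in
# libfabric log lines read from standard input. Assumes the layout seen in
# the excerpt: libfabric:<pid>:<timestamp>:<provider>:<subsystem>:...
providers_in_log() {
    # Only lines whose first field is exactly "libfabric" are considered;
    # other program output is ignored.
    awk -F: '$1 == "libfabric" && NF >= 5 { print $4 }' | sort -u
}
```

Feeding the saved job output to it on standard input (for example `providers_in_log < job.out`, where `job.out` is a placeholder file name) prints one provider name per line.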

hyoklee added the enhancement (New feature or request) and question (Further information is requested) labels on Oct 31, 2022
@carns
Contributor Author

carns commented Nov 2, 2022

@hyoklee can you confirm that the rest of the test suite passes on cxi?

@hyoklee
Collaborator

hyoklee commented Nov 4, 2022

@carns , I tested the rest of the suite today and it worked fine. Do you want me to update the Slurm job scripts (e.g., cross.slurm) to use CXI? Or just update the documentation, such as the FAQ?

@carns
Contributor Author

carns commented Nov 10, 2022

Thanks @hyoklee .

Both if you don't mind. The script can be hardcoded to use cxi; that's likely to be the only thing we test on Crusher. The doc can describe more generically how to set the test to exercise a particular provider (cxi or otherwise).

As a side note since we have mentioned platform-specific test scripts: the .slurm etc. files would be a little clearer if the names of the files included the machine name. There are a lot of slurm, qsub, etc. systems out there but what actually needs to be executed within the script is likely platform-specific. If the current naming is important to the overall test flow then maybe just a comment at the top of each one that says something like "# test script for the Polaris system @alcf".

hyoklee added a commit that referenced this issue Nov 11, 2022
docs(faq): add FI_PROVIDER answer for #19