how to specify which provider to test #19

carns · 2022-09-22T18:49:05Z

Libfabric library builds often have support for multiple providers built in. How do you control which one is tested?

Skimming through the code I don't immediately see programmatic API control over it in the C code or environment variable control over it in the job scripts.

hyoklee · 2022-09-22T19:23:46Z

I don't know. @gnuoyd , do you know? @derobins , please answer this question if you know.

@carns , how do other projects specify a provider?

carns · 2022-09-22T20:19:53Z

I'm not familiar enough with the libfabric API to know how to do it programmatically off the top of my head. You could maybe look at how Mercury selects providers in na_ofi.c.

I'm not sure how well it will work with the code as structured, but you can also set a runtime environment variable (e.g. in the job scripts) that restricts the set of providers libfabric will allow. This is the FI_PROVIDER environment variable, which takes a comma-separated list of providers and acts like an allow list. See https://ofiwg.github.io/libfabric/main/man/fabric.7.html.
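A minimal sketch of that approach for a job script. The `select_provider` helper name is made up for illustration and is not part of fabtsuite. It assumes the `fi_info` utility that ships with libfabric may be on the PATH, and uses it only as an optional sanity check, since `fi_info` honors FI_PROVIDER and fails when nothing matches.

```shell
#!/bin/sh
# select_provider is a hypothetical helper, not part of fabtsuite.
# It exports FI_PROVIDER so libfabric only considers the listed providers.
select_provider() {
    FI_PROVIDER="$1"
    export FI_PROVIDER
    # Optional sanity check: with FI_PROVIDER exported, fi_info should only
    # report matching providers and fail if none are usable.
    # The check is skipped when fi_info is not installed.
    if command -v fi_info >/dev/null 2>&1; then
        fi_info >/dev/null 2>&1 ||
            echo "warning: libfabric found no usable provider for '$1'" >&2
    fi
}

select_provider "verbs,rxm"
echo "FI_PROVIDER=$FI_PROVIDER"
# ...launch the tests here, for example with ctest...
```

The warning is advisory only: the sketch still leaves FI_PROVIDER set so that the test itself shows how it behaves with the requested provider.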

On Polaris we want to test "verbs,rxm" (two providers are required to use verbs in reliable datagram mode), on Crusher we want "cxi", and on Theta we want "gni", for example. I guess you could try setting that (or whatever is appropriate for your test platform) and see if the tests execute.
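The per-platform choices above could be captured in one place, roughly like this. The `provider_for_host` helper is hypothetical, and the hostname patterns are guesses based on node names such as polaris-login-02, login2.crusher, and thetalogin6; compute nodes may be named differently, so treat the patterns as assumptions to adjust.

```shell
#!/bin/sh
# provider_for_host is a hypothetical helper; the patterns below are
# assumptions about how the machines name their nodes.
provider_for_host() {
    case "$1" in
        *polaris*) echo "verbs,rxm" ;;
        *crusher*) echo "cxi" ;;
        *theta*)   echo "gni" ;;
        *)         return 1 ;;   # unknown platform: keep libfabric's default
    esac
}

# uname -n prints the node name of the machine running the script.
if provider=$(provider_for_host "$(uname -n)"); then
    FI_PROVIDER="$provider"
    export FI_PROVIDER
fi
echo "FI_PROVIDER=${FI_PROVIDER:-<libfabric default>}"
```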

Based on this discussion it sounds like we really need the test output to report what provider was used (independent of what was attempted) for validation. I don't know what provider is being selected by default in the tests thus far, but if it is the tcp provider that's not really the transport we want to be testing.

hyoklee · 2022-09-23T04:10:00Z

When FI_PROVIDER is set to gni, I get the following error:

hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available

When FI_PROVIDER is set to cxi, test hangs:

[hyoklee@login2.crusher build]$ export FI_PROVIDER=cxi
[hyoklee@login2.crusher build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

When FI_PROVIDER is set to verbs,rxm on Polaris, the test hangs with libfabric 1.15.0:

hyoklee@polaris-login-02:~/fabtsuite/build> export FI_PROVIDER=verbs
hyoklee@polaris-login-02:~/fabtsuite/build> ctest -I 1,1
Test project /home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

carns · 2022-09-23T15:14:21Z

When FI_PROVIDER is set to gni, I get the following error:

hyoklee@thetalogin6:~/fabtsuite-m/build/transfer> ./fabtget
0.000000083 capabilities not available?
main.4785: fi_getinfo: No data available

When FI_PROVIDER is set to cxi, test hangs:

[hyoklee@login2.crusher build]$ export FI_PROVIDER=cxi
[hyoklee@login2.crusher build]$ ctest -I 1,1
Test project /ccs/home/hyoklee/fabtsuite/build
    Start 1: single-node
  C-c C-c

How are you building libfabric? (You can share your environment configuration if you are using Spack.) It might be easiest to debug these kinds of initialization problems by launching the server in an interactive session. You can set the FI_LOG_LEVEL=debug environment variable to get more detailed information out of libfabric.
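For an interactive session, something along these lines might help. The `run_with_fi_debug` wrapper is a made-up name, and the sketch assumes libfabric writes its log to stderr, which is its default behavior as far as I know.

```shell
#!/bin/sh
# run_with_fi_debug is a hypothetical wrapper: it runs the given command with
# libfabric debug logging enabled for that command only, and redirects stderr
# (assumed to carry the libfabric log) into a file for later inspection.
run_with_fi_debug() {
    logfile="$1"
    shift
    env FI_LOG_LEVEL=debug "$@" 2>"$logfile"
}

# Example, assuming fabtget is the program you start first:
#   run_with_fi_debug fabtget.log ./fabtget
```

Setting the variable through `env` keeps the debug level scoped to that one command instead of changing the rest of the session.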

hyoklee · 2022-09-23T15:21:15Z

On Crusher, I use the system-provided libfabric 1.15.0. On Theta, I build with spack install fabtsuite ^libfabric fabrics=gni,tcp,udp,rxd,rxm.

hyoklee · 2022-10-31T20:49:43Z

I ran the test again after adding the following lines to the test/wait.slurm script:

export FI_PROVIDER=cxi
export FI_LOG_LEVEL=debug

I used the system libfabric.
The test failed with a timeout on Crusher.
Crusher reported the following detailed error message:

libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: \
CQ wait objects not supported
get_state_open.4216: fi_cq_open: Function not implemented
real 0.09
user 0.00
sys 0.01
1
srun: error: crusher008: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=208067.1
0

I could also verify that the address returned by fabtget differs from the one returned with the tcp provider.

So I think fabtsuite is able to test a provider other than tcp.
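As a rough cross-check, the provider names can also be pulled out of a debug log. The `providers_in_log` helper below is hypothetical and assumes log lines shaped like the one quoted above, that is libfabric:<pid>:<time>:<provider>:<subsystem>:...; other libfabric versions may lay the fields out differently. It only shows which providers emitted log messages, which is a hint about what was selected rather than proof.

```shell
#!/bin/sh
# providers_in_log is a hypothetical helper. It assumes the fourth
# colon-separated field of a libfabric log line is the provider name,
# as in the cxi line from the Crusher run.
providers_in_log() {
    awk -F: '$1 == "libfabric" && NF >= 5 { print $4 }' "$@" | sort -u
}

# Reads from files given as arguments, or from stdin when none are given.
printf '%s\n' \
    'libfabric:107344:1667248504:cxi:cq:cxip_cq_verify_attr():840<warn> crusher008: CQ wait objects not supported' \
    'get_state_open.4216: fi_cq_open: Function not implemented' |
    providers_in_log
# prints: cxi
```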

carns · 2022-11-02T20:33:22Z

@hyoklee can you confirm that the rest of the test suite passes on cxi?

hyoklee · 2022-11-04T21:06:09Z

@carns , I tested the rest of the suite today and it worked fine. Do you want me to update a Slurm job script (e.g., cross.slurm) to use CXI? Or just update the documentation, such as the FAQ?

carns · 2022-11-10T20:35:37Z

Thanks @hyoklee .

Both if you don't mind. The script can be hardcoded to use cxi; that's likely to be the only thing we test on Crusher. The doc can describe more generically how to set the test to exercise a particular provider (cxi or otherwise).

As a side note since we have mentioned platform-specific test scripts: the .slurm etc. files would be a little clearer if the names of the files included the machine name. There are a lot of slurm, qsub, etc. systems out there but what actually needs to be executed within the script is likely platform-specific. If the current naming is important to the overall test flow then maybe just a comment at the top of each one that says something like "# test script for the Polaris system @alcf".

docs(faq): add FI_PROVIDER answer for #19

hyoklee self-assigned this Sep 22, 2022

hyoklee added enhancement New feature or request question Further information is requested labels Oct 31, 2022

hyoklee added a commit to hyoklee/fabtsuite that referenced this issue Nov 8, 2022

docs(faq): add FI_PROVIDER answer for mercury-hpc#19

6b74682

hyoklee mentioned this issue Nov 8, 2022

docs(faq): add FI_PROVIDER answer for #19 #20

Merged

hyoklee closed this as completed in #20 Nov 11, 2022

hyoklee added a commit that referenced this issue Nov 11, 2022

Merge pull request #20 from hyoklee/hyoklee-issue-19

a06b866

docs(faq): add FI_PROVIDER answer for #19
