UCT/TCP: Use SIOCGIFCONF ioctl when /sys/class/net is missing. #4462

Open
wants to merge 87 commits into
base: master
from

Conversation

civodul

@civodul civodul commented Nov 18, 2019

What

This change provides alternative code that uses the SIOCGIFCONF ioctl to get the names of the available TCP network interfaces.

Why?

In some cases, such as isolated build environments (as found in GNU Guix), containers, or non-Linux-based systems, /sys is missing.

How?

Using the old, portable SIOCGIFCONF ioctl.

It may be that SIOCGIFCONF can in fact replace the /sys-based code entirely, since the information returned should be the same. WDYT?

@swx-jenkins3
Collaborator

Can one of the admins verify this patch?

@civodul
Author

civodul commented Nov 18, 2019

Hi @dmitrygx,

Thanks for your feedback. I've amended the patch following your suggestions, except one:

pls, use ucs_netif_ioctl instead

AFAICS, ucs_netif_ioctl is not applicable here because if_name would be NULL. However, I've changed this bit to use ucs_socket_create instead of socket.

Let me know what you think!

@dmitrygx
Member

AFAICS, ucs_netif_ioctl is not applicable here because if_name would be NULL. However, I've changed this bit to use ucs_socket_create instead of socket.

@civodul yes, you're right

@civodul civodul force-pushed the tcp-iface-ioctl branch 2 times, most recently from d0f29b9 to 9ffc49e Compare November 18, 2019 14:23
Member

@dmitrygx dmitrygx left a comment

👍 LGTM, thank you @civodul

@dmitrygx
Member

ok to test

@mellanox-github
Contributor

Mellanox CI: FAILED on 25 of 25 workers (click for details)

Note: the logs will be deleted after 25-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W3 ❌ FAILURE
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W2 ❌ FAILURE
hpc-arm-hwi-jenkins_W3 ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W1 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-test-node-new_W0 ❌ FAILURE
hpc-test-node-new_W1 ❌ FAILURE
hpc-test-node-new_W2 ❌ FAILURE
hpc-test-node-new_W3 ❌ FAILURE
r-vmb-ppc-jenkins_W0 ❌ FAILURE
r-vmb-ppc-jenkins_W1 ❌ FAILURE
r-vmb-ppc-jenkins_W2 ❌ FAILURE
r-vmb-ppc-jenkins_W3 ❌ FAILURE

@civodul
Author

civodul commented Nov 19, 2019

Hi @dmitrygx,

Not sure I understand what the build failures are about. Let me know if you need anything else from me.

@dmitrygx
Member

Hi @dmitrygx,

Not sure I understand what the build failures are about. Let me know if you need anything else from me.

/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:646:5: error: passing argument 2 of ‘ucs_malloc’ makes pointer from integer without a cast [-Werror]
     conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");
     ^
In file included from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/sys/sys.h:19:0,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/base/uct_iface.h:21,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp.h:10,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:11:
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/debug/memtrack.h:102:7: note: expected ‘const char *’ but argument is of type ‘int’
 void *ucs_malloc(size_t size, const char *name);
       ^
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:646:5: error: too many arguments to function ‘ucs_malloc’
     conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");
     ^
In file included from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/sys/sys.h:19:0,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/base/uct_iface.h:21,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp.h:10,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:11:
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/debug/memtrack.h:102:7: note: declared here
 void *ucs_malloc(size_t size, const char *name);

The following line has to be changed
from

conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");

to

conf.ifc_req = ucs_calloc(1, conf.ifc_len, "ifreq");
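
The compiler output above shows that ucs_malloc takes a size and a name, so the three-argument call only fits the calloc-style shape. The stand-ins below, tracked_malloc and tracked_calloc, are hypothetical wrappers around the C library written purely to show the difference in arity. They are not the UCX functions and they ignore the name argument.

```c
#include <stdlib.h>

/* Two-argument shape: byte count plus a tag used for tracking */
static void *tracked_malloc(size_t size, const char *name)
{
    (void)name; /* a real tracker would record the tag */
    return malloc(size);
}

/* Three-argument shape: element count, element size, and the tag.
 * Like calloc, the returned memory is zero-filled. */
static void *tracked_calloc(size_t nmemb, size_t size, const char *name)
{
    (void)name;
    return calloc(nmemb, size);
}
```

With these shapes, `tracked_calloc(1, len, "ifreq")` compiles while `tracked_malloc(1, len, "ifreq")` does not. The zero-filled buffer is also a reasonable starting state for a structure that the kernel will fill in.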

@civodul
Author

civodul commented Nov 19, 2019

Indeed... Done, thanks.

@mellanox-github
Contributor

Mellanox CI: FAILED on 4 of 25 workers (click for details)

Note: the logs will be deleted after 26-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@civodul
Author

civodul commented Nov 20, 2019

Hello! The messages I received from Mellanox's CI system show that the 3 test failures are all about:

Fatal: transport error: Endpoint timeout

It's unclear to me how this could relate to this patch. Thoughts?

@brminich
Contributor

unrelated

bot:mlx:retest

@mellanox-github
Contributor

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 27-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@dmitrygx
Member

bot:mlx:retest

@mellanox-github
Contributor

Mellanox CI: FAILED on 3 of 25 workers (click for details)

Note: the logs will be deleted after 28-Nov-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@dmitrygx
Member

infra issues
bot:mlx:retest

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 29-Nov-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

dmitrygx
dmitrygx previously approved these changes Nov 22, 2019
yosefe and others added 29 commits August 21, 2023 17:01
…os_v1.15.x

BUILD: Readthedocs - add OS (v1.15.x)
UCS/SYS/TOPO: Added bw estimation for Sapphire Rapids family - v1.15.x
NEWS: Updated 1.15.0-rc4 section - v1.15.x
…v1.15.x

AZP: Fix user-defined var setters - v1.15.x
…rs-v1.15.x

Revert "AZP: Fix io_demo var access - v1.15.x"
…ch-v1.15.x

AZP/RELEASE: Fix launch condition-v1.15.x
…ix_1.15.x

UCP/RNDV: Do not use recv ppln with generic send buf - v1.15.x
In some cases, such as isolated build environments (containers) or
non-Linux-based systems, /sys is missing.

This change provides alternative code that uses the SIOCGIFCONF ioctl to
get the names of the available TCP network interfaces.
Labels
WIP-DNM Work in progress / Do not review