infiniband-diags: update ibportstate port enabling example #847

Closed
wants to merge 1 commit into from

Honggang-LI
Contributor

To enable a disabled port, the CA name and port number are needed.
The old port-enabling example does not work, so replace it.

Signed-off-by: Honggang Li <honli@redhat.com>

@@ -402,7 +402,7 @@ int main(int argc, char **argv)
 	    "\tmkey, mkeylease, mkeyprot\n";
 	const char *usage_examples[] = {
 		"3 1 disable\t\t\t# by lid",
-		"-G 0x2C9000100D051 1 enable\t# by guid",
+		"3 1 -C mlx4_0 -P 1 enable\t# by lid, CA name and Port number",
Contributor

Since when is enablement via GUID not working?

FWIW if a CA port is disabled it does not have a valid LID (unless running on an SM node). So I'm wondering why this is better than using a port GUID which should be unique.

Contributor Author

[root ~]$ ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand

[root ~]$ ibportstate -G 0xf4521403007be0e1 1 enable
ibwarn: [84425] ib_path_query_via: sa call path_query failed
ibportstate: iberror: failed: can't resolve destination port 0xf4521403007be0e1

Contributor Author

Since when is enablement via GUID not working?

I have not run a git bisect, but I built ibportstate from the upstream repo. The problem is that the first port is disabled, so the function 'resolve_ca_port' skips it.

/root/rdma-core/libibumad/umad.c, lines 292-300:

	for (i = 0; i <= ca.numports; i++) {
		DEBUG("checking port %d", i);
		if (!ca.ports[i])
			continue;
		if (strcmp(ca.ports[i]->link_layer, "InfiniBand") &&
		    strcmp(ca.ports[i]->link_layer, "IB"))
			continue;
		if (up < 0 && ca.ports[i]->phys_state == 5)
			up = *port = i;

(gdb) p ca.ports[0]
$11 = (umad_port_t *) 0x0
(gdb) p ca.ports[1]
$12 = (umad_port_t *) 0x6099d0
(gdb) bt
#0 resolve_ca_port (ca_name=ca_name@entry=0x611920 "mlx4_0", port=port@entry=0x7fffffffd82c) at ../libibumad/umad.c:292
#1 0x0000155554f0bea8 in resolve_ca_name (ca_in=0x0, ca_name=0x7fffffffd888, best_port=0x7fffffffd87c) at ../libibumad/umad.c:372
#2 resolve_ca_name (ca_in=, best_port=0x7fffffffd87c, ca_name=0x7fffffffd888) at ../libibumad/umad.c:334
#3 0x0000155554f0c22d in umad_open_port (ca_name=ca_name@entry=0x0, portnum=, portnum@entry=0) at ../libibumad/umad.c:701
#4 0x000015555511d718 in mad_rpc_open_port (dev_name=0x0, dev_port=0, mgmt_classes=mgmt_classes@entry=0x7fffffffdaa4, num_classes=num_classes@entry=3) at ../libibmad/rpc.c:398
#5 0x00000000004019c4 in main (argc=3, argv=0x7fffffffdeb8) at ../infiniband-diags/ibportstate.c:423

FWIW if a CA port is disabled it does not have a valid LID (unless running on an SM node). So I'm wondering why this is better than using a port GUID which should be unique.
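
To make the default choice concrete, here is a rough shell approximation of it. It is not the libibumad code: it only assumes that, with no CA name or port given, the first port reported as Active is used, which is what the "found active port" debug lines suggest. The function name first_active_port and the abbreviated ibstat text are illustrative.

```shell
#!/bin/sh
# Rough approximation of the default port choice; this is NOT libibumad code.
# Reads ibstat-style text on stdin and prints "<ca> <port>" for the first
# port whose State is Active.
first_active_port() {
    tr -d "'" | awk '
        $1 == "CA" && NF == 2   { ca = $2 }                        # CA mlx4_0
        $1 == "Port" && NF == 2 { port = $2; sub(/:$/, "", port) } # Port 1:
        $1 == "State:" && $2 == "Active" && !found {
            print ca, port
            found = 1
        }
    '
}

# Abbreviated ibstat output from this thread: port 1 disabled, port 2 active.
first_active_port <<'EOF'
CA 'mlx4_0'
    Number of ports: 2
    Port 1:
        State: Down
        Physical state: Disabled
        Port GUID: 0xf4521403007be0e1
    Port 2:
        State: Active
        Physical state: LinkUp
        Port GUID: 0xf4521403007be0e2
EOF
```

With port 1 disabled this prints "mlx4_0 2", so the request leaves through port 2, which sits on a different subnet (SM lid 1) from the port being enabled (SM lid 13).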

Contributor

I think we should fix this code then. The tool should be able to enable a local port. I don't have time right now to figure out why this changed but at some point it did work and I don't see any reason it could not be made to work again.

Contributor Author

I think we should fix this code then.
Got it. I'm closing this PR and will open a new PR to fix the code.
Thanks.

Contributor Author

Since when is enablement via GUID not working?

I tested the oldest OFED packages, built from the sources at https://downloads.openfabrics.org/management/ , on a RHEL-7.0 x86-64 machine. They do not work either.

~]$ ls *gz
infiniband-diags-1.3.2.tar.gz  libibcommon-1.0.5.tar.gz  libibumad-1.1.3.tar.gz
 libibmad-1.1.2.tar.gz     opensm-3.1.5.tar.gz    

 ~]$ rpm -q libibcommon libibumad opensm-libs libibmad infiniband-diags
libibcommon-1.0.5-1.el7.x86_64
libibumad-1.1.3-1.el7.x86_64
opensm-libs-3.1.5-1.el7.x86_64
libibmad-1.1.2-1.el7.x86_64
infiniband-diags-1.3.2-1.el7.x86_64

~]$ ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 0
	Node GUID: 0xf4521403007be160
	System image GUID: 0xf4521403007be163
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 37
		LMC: 0
		SM lid: 13
		Capability mask: 0x02514868
		Port GUID: 0xf4521403007be161
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 2
		LMC: 0
		SM lid: 1
		Capability mask: 0x02514868
		Port GUID: 0xf4521403007be162

~]$ ibportstate 37 1 disable
ibportstate: iberror: failed: smp query nodeinfo: Node type not switch

I think the ibportstate enable function needs the HCA name when multiple HCAs are available. It also needs the port number when the target HCA has more than one port. Otherwise, the first active port of the first HCA found by libibumad will be used. That port may not be in the same fabric as the target port that is to be enabled.

When enabling via GUID on a system with multiple HCAs or IB ports, if the CA name and port number are not specified, the first active port of the first HCA found by libibumad is used. The port selected by libibumad and the port specified by GUID may not be in the same fabric. When they are not, enablement via GUID will never work.

I think we need to update the help message to hint that the user should specify the CA name and port number when multiple HCAs or ports are available. In the meantime, we also need to fix enablement via GUID when the CA name and port number are specified.
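
Until the GUID path is fixed, one possible workaround is to look up the local CA name, port number and LID for a Port GUID in the ibstat output and use the LID form with explicit -C and -P. The helper below is only a sketch: the name enable_cmd_for_guid is made up, it prints the command instead of running it, and it assumes ibstat still reports a Base lid for the disabled port, as it does in the output above.

```shell
#!/bin/sh
# Sketch only: print (do not run) an ibportstate command for the local port
# that owns the given Port GUID, using the LID form with explicit -C and -P.
# Assumption: ibstat still shows a Base lid for the disabled port.
enable_cmd_for_guid() {
    # $1: Port GUID as printed by ibstat; ibstat-style text is read from stdin
    tr -d "'" | awk -v guid="$1" '
        $1 == "CA" && NF == 2        { ca = $2 }
        $1 == "Port" && NF == 2      { port = $2; sub(/:$/, "", port) }
        $1 == "Base" && $2 == "lid:" { lid = $3 }
        $1 == "Port" && $2 == "GUID:" && tolower($3) == tolower(guid) {
            print "ibportstate", lid, port, "-C", ca, "-P", port, "enable"
        }
    '
}

# Abbreviated ibstat output from this thread.
enable_cmd_for_guid 0xf4521403007be0e1 <<'EOF'
CA 'mlx4_0'
    Port 1:
        State: Down
        Physical state: Disabled
        Base lid: 38
        Port GUID: 0xf4521403007be0e1
    Port 2:
        State: Active
        Physical state: LinkUp
        Base lid: 3
        Port GUID: 0xf4521403007be0e2
EOF
```

For the sample text this prints "ibportstate 38 1 -C mlx4_0 -P 1 enable". On a live system the input would come from ibstat itself, for example: ibstat | enable_cmd_for_guid 0xf4521403007be0e1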

@Honggang-LI Honggang-LI reopened this Oct 21, 2020
@Honggang-LI
Contributor Author

I'm reopening this PR for discussion.

@Honggang-LI
Contributor Author

Test log with latest upstream rdma-core/infiniband-diags.

~]$ cat a.sh 
#!/bin/bash
set -x

rpm -q rdma-core infiniband-diags

ibstat

ibportstate 38 1 disable
sleep 10
ibstat

ibportstate -G 0xf4521403007be0e1 1 enable
sleep 10
ibstat

ibportstate  -G 0xf4521403007be0e1  1   -C mlx4_0 -P 1 enable
sleep 10
ibstat

ibportstate  38 1 -C mlx4_0 -P 1   enable
sleep 10
ibstat

~]$ sh a.sh 
+ rpm -q rdma-core infiniband-diags
rdma-core-32.0-1.el8.x86_64
infiniband-diags-32.0-1.el8.x86_64
+ ibstat
ibwarn: [168949] umad_init: umad_init
ibwarn: [168949] umad_get_ca_device_list: return 1 cas
ibwarn: [168949] umad_get_ca: ca_name mlx4_0
ibwarn: [168949] umad_get_ca: opened mlx4_0
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand
+ ibportstate 38 1 disable
ibwarn: [168950] umad_init: umad_init
ibwarn: [168950] umad_open_port: ca (null) port 0
ibwarn: [168950] umad_get_ca_device_list: return 1 cas
ibwarn: [168950] resolve_ca_name: checking ca 'mlx4_0'
ibwarn: [168950] resolve_ca_port: checking ca 'mlx4_0'
ibwarn: [168950] umad_get_ca: ca_name mlx4_0
ibwarn: [168950] umad_get_ca: opened mlx4_0
ibwarn: [168950] resolve_ca_port: checking port 0
ibwarn: [168950] resolve_ca_port: checking port 1
ibwarn: [168950] resolve_ca_port: found active port 1
ibwarn: [168950] resolve_ca_name: found ca mlx4_0 with port 1 type 1
ibwarn: [168950] resolve_ca_name: found ca mlx4_0 with active port 1
ibwarn: [168950] umad_open_port: opening mlx4_0 port 1
ibwarn: [168950] dev_to_umad_id: mapped mlx4_0 1 to 0
ibwarn: [168950] umad_open_port: opened /dev/infiniband/umad0 fd 3 portid 0
ibwarn: [168950] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168950] umad_register: fd 3 registered to use agent 0 qp 0
ibwarn: [168950] umad_register: fd 3 mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168950] umad_register: fd 3 registered to use agent 1 qp 0
ibwarn: [168950] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)
ibwarn: [168950] umad_register: fd 3 registered to use agent 2 qp 1
ibwarn: [168950] umad_set_addr: umad 0x7fff675323a0 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [168950] umad_send: fd 3 agentid 0 umad 0x7fff675323a0 timeout 1000
ibwarn: [168950] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [168950] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168950] umad_recv: fd 3 umad 0x7fff675327a0 timeout 1000
ibwarn: [168950] umad_recv: mad received by agent 0 length 320
Initial CA/RT PortInfo:
ibwarn: [168950] umad_set_addr: umad 0x7fff67532330 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [168950] umad_send: fd 3 agentid 0 umad 0x7fff67532330 timeout 1000
ibwarn: [168950] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [168950] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168950] umad_recv: fd 3 umad 0x7fff67532730 timeout 1000
ibwarn: [168950] umad_recv: mad received by agent 0 length 320
# Port info: Lid 38 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................38
SMLid:...........................13
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............14.0625 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibwarn: [168950] umad_set_addr: umad 0x7fff675323a0 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [168950] umad_send: fd 3 agentid 0 umad 0x7fff675323a0 timeout 1000
ibwarn: [168950] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [168950] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168950] umad_recv: fd 3 umad 0x7fff675327a0 timeout 1000
ibwarn: [168950] umad_recv: mad received by agent 0 length 320
# MLNX ext Port info: Lid 38 port 1
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x00
Disable may be irreversible
ibwarn: [168950] umad_set_addr: umad 0x7fff675323b0 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [168950] umad_send: fd 3 agentid 0 umad 0x7fff675323b0 timeout 1000
ibwarn: [168950] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [168950] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168950] umad_recv: fd 3 umad 0x7fff675327b0 timeout 1000
ibwarn: [168950] umad_recv: mad received by agent 0 length 320

After PortInfo set:
# Port info: Lid 38 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................38
SMLid:...........................13
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................Extended speed
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............14.0625 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibwarn: [168950] umad_close_port: closed fd 3
+ sleep 10
+ ibstat
ibwarn: [168988] umad_init: umad_init
ibwarn: [168988] umad_get_ca_device_list: return 1 cas
ibwarn: [168988] umad_get_ca: ca_name mlx4_0
ibwarn: [168988] umad_get_ca: opened mlx4_0
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand
+ ibportstate -G 0xf4521403007be0e1 1 enable
ibwarn: [168989] umad_init: umad_init
ibwarn: [168989] umad_open_port: ca (null) port 0
ibwarn: [168989] umad_get_ca_device_list: return 1 cas
ibwarn: [168989] resolve_ca_name: checking ca 'mlx4_0'
ibwarn: [168989] resolve_ca_port: checking ca 'mlx4_0'
ibwarn: [168989] umad_get_ca: ca_name mlx4_0
ibwarn: [168989] umad_get_ca: opened mlx4_0
ibwarn: [168989] resolve_ca_port: checking port 0
ibwarn: [168989] resolve_ca_port: checking port 1
ibwarn: [168989] resolve_ca_port: checking port 2
ibwarn: [168989] resolve_ca_port: found active port 2
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with port 2 type 1
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with active port 2
ibwarn: [168989] umad_open_port: opening mlx4_0 port 2
ibwarn: [168989] dev_to_umad_id: mapped mlx4_0 2 to 1
ibwarn: [168989] umad_open_port: opened /dev/infiniband/umad1 fd 3 portid 1
ibwarn: [168989] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168989] umad_register: fd 3 registered to use agent 0 qp 0
ibwarn: [168989] umad_register: fd 3 mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168989] umad_register: fd 3 registered to use agent 1 qp 0
ibwarn: [168989] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)
ibwarn: [168989] umad_register: fd 3 registered to use agent 2 qp 1
ibwarn: [168989] umad_get_port: ca_name (null) portnum 0
ibwarn: [168989] umad_get_ca_device_list: return 1 cas
ibwarn: [168989] resolve_ca_name: checking ca 'mlx4_0'
ibwarn: [168989] resolve_ca_port: checking ca 'mlx4_0'
ibwarn: [168989] umad_get_ca: ca_name mlx4_0
ibwarn: [168989] umad_get_ca: opened mlx4_0
ibwarn: [168989] resolve_ca_port: checking port 0
ibwarn: [168989] resolve_ca_port: checking port 1
ibwarn: [168989] resolve_ca_port: checking port 2
ibwarn: [168989] resolve_ca_port: found active port 2
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with port 2 type 1
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with active port 2
ibwarn: [168989] umad_release_port: port mlx4_0:2
ibwarn: [168989] umad_release_port: releasing mlx4_0:2
ibwarn: [168989] umad_get_port: ca_name (null) portnum 0
ibwarn: [168989] umad_get_ca_device_list: return 1 cas
ibwarn: [168989] resolve_ca_name: checking ca 'mlx4_0'
ibwarn: [168989] resolve_ca_port: checking ca 'mlx4_0'
ibwarn: [168989] umad_get_ca: ca_name mlx4_0
ibwarn: [168989] umad_get_ca: opened mlx4_0
ibwarn: [168989] resolve_ca_port: checking port 0
ibwarn: [168989] resolve_ca_port: checking port 1
ibwarn: [168989] resolve_ca_port: checking port 2
ibwarn: [168989] resolve_ca_port: found active port 2
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with port 2 type 1
ibwarn: [168989] resolve_ca_name: found ca mlx4_0 with active port 2
ibwarn: [168989] umad_release_port: port mlx4_0:2
ibwarn: [168989] umad_release_port: releasing mlx4_0:2
ibwarn: [168989] umad_set_addr: umad 0x7ffed38e3b80 dlid 1 dqp 1 sl 0, qkey 80010000
ibwarn: [168989] umad_send: fd 3 agentid 2 umad 0x7ffed38e3b80 timeout 1000
ibwarn: [168989] umad_dump: agent id 2 status 0 timeout 1000
ibwarn: [168989] umad_addr_dump: qpn 1 qkey 0x80010000 lid 1 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168989] umad_recv: fd 3 umad 0x7ffed38e3f80 timeout 1000
ibwarn: [168989] umad_recv: mad received by agent 2 length 320
ibwarn: [168989] ib_path_query_via: sa call path_query failed
ibportstate: iberror: failed: can't resolve destination port 0xf4521403007be0e1
+ sleep 10
+ ibstat
ibwarn: [168996] umad_init: umad_init
ibwarn: [168996] umad_get_ca_device_list: return 1 cas
ibwarn: [168996] umad_get_ca: ca_name mlx4_0
ibwarn: [168996] umad_get_ca: opened mlx4_0
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand
+ ibportstate -G 0xf4521403007be0e1 1 -C mlx4_0 -P 1 enable
ibwarn: [168997] umad_init: umad_init
ibwarn: [168997] umad_open_port: ca mlx4_0 port 1
ibwarn: [168997] umad_open_port: opening mlx4_0 port 1
ibwarn: [168997] dev_to_umad_id: mapped mlx4_0 1 to 0
ibwarn: [168997] umad_open_port: opened /dev/infiniband/umad0 fd 3 portid 0
ibwarn: [168997] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168997] umad_register: fd 3 registered to use agent 0 qp 0
ibwarn: [168997] umad_register: fd 3 mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [168997] umad_register: fd 3 registered to use agent 1 qp 0
ibwarn: [168997] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)
ibwarn: [168997] umad_register: fd 3 registered to use agent 2 qp 1
ibwarn: [168997] umad_get_port: ca_name mlx4_0 portnum 1
ibwarn: [168997] umad_release_port: port mlx4_0:1
ibwarn: [168997] umad_release_port: releasing mlx4_0:1
ibwarn: [168997] umad_get_port: ca_name mlx4_0 portnum 1
ibwarn: [168997] umad_release_port: port mlx4_0:1
ibwarn: [168997] umad_release_port: releasing mlx4_0:1
ibwarn: [168997] umad_set_addr: umad 0x7ffcf771a310 dlid 13 dqp 1 sl 0, qkey 80010000
ibwarn: [168997] umad_send: fd 3 agentid 2 umad 0x7ffcf771a310 timeout 1000
ibwarn: [168997] umad_dump: agent id 2 status 0 timeout 1000
ibwarn: [168997] umad_addr_dump: qpn 1 qkey 0x80010000 lid 13 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [168997] umad_recv: fd 3 umad 0x7ffcf771a710 timeout 1000
ibwarn: [168997] _do_madrpc: recv failed: Connection timed out
ibwarn: [168997] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 13)
ibwarn: [168997] ib_path_query_via: sa call path_query failed
ibportstate: iberror: failed: can't resolve destination port 0xf4521403007be0e1
+ sleep 10
+ ibstat
ibwarn: [169003] umad_init: umad_init
ibwarn: [169003] umad_get_ca_device_list: return 1 cas
ibwarn: [169003] umad_get_ca: ca_name mlx4_0
ibwarn: [169003] umad_get_ca: opened mlx4_0
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand
+ ibportstate 38 1 -C mlx4_0 -P 1 enable
ibwarn: [169004] umad_init: umad_init
ibwarn: [169004] umad_open_port: ca mlx4_0 port 1
ibwarn: [169004] umad_open_port: opening mlx4_0 port 1
ibwarn: [169004] dev_to_umad_id: mapped mlx4_0 1 to 0
ibwarn: [169004] umad_open_port: opened /dev/infiniband/umad0 fd 3 portid 0
ibwarn: [169004] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [169004] umad_register: fd 3 registered to use agent 0 qp 0
ibwarn: [169004] umad_register: fd 3 mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil)
ibwarn: [169004] umad_register: fd 3 registered to use agent 1 qp 0
ibwarn: [169004] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)
ibwarn: [169004] umad_register: fd 3 registered to use agent 2 qp 1
ibwarn: [169004] umad_set_addr: umad 0x7ffe9cef0950 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [169004] umad_send: fd 3 agentid 0 umad 0x7ffe9cef0950 timeout 1000
ibwarn: [169004] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [169004] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [169004] umad_recv: fd 3 umad 0x7ffe9cef0d50 timeout 1000
ibwarn: [169004] umad_recv: mad received by agent 0 length 320
Initial CA/RT PortInfo:
ibwarn: [169004] umad_set_addr: umad 0x7ffe9cef08e0 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [169004] umad_send: fd 3 agentid 0 umad 0x7ffe9cef08e0 timeout 1000
ibwarn: [169004] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [169004] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [169004] umad_recv: fd 3 umad 0x7ffe9cef0ce0 timeout 1000
ibwarn: [169004] umad_recv: mad received by agent 0 length 320
# Port info: Lid 38 port 1
LinkState:.......................Down
PhysLinkState:...................Disabled
Lid:.............................38
SMLid:...........................13
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............No Extended Speed
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibwarn: [169004] umad_set_addr: umad 0x7ffe9cef0950 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [169004] umad_send: fd 3 agentid 0 umad 0x7ffe9cef0950 timeout 1000
ibwarn: [169004] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [169004] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [169004] umad_recv: fd 3 umad 0x7ffe9cef0d50 timeout 1000
ibwarn: [169004] umad_recv: mad received by agent 0 length 320
# MLNX ext Port info: Lid 38 port 1
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x00
ibwarn: [169004] umad_set_addr: umad 0x7ffe9cef0960 dlid 38 dqp 0 sl 0, qkey 0
ibwarn: [169004] umad_send: fd 3 agentid 0 umad 0x7ffe9cef0960 timeout 1000
ibwarn: [169004] umad_dump: agent id 0 status 0 timeout 1000
ibwarn: [169004] umad_addr_dump: qpn 0 qkey 0x0 lid 38 sl 0
grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0
Gid 0x00000000000000000000000000000000
ibwarn: [169004] umad_recv: fd 3 umad 0x7ffe9cef0d60 timeout 1000
ibwarn: [169004] umad_recv: mad received by agent 0 length 320

After PortInfo set:
# Port info: Lid 38 port 1
LinkState:.......................Down
PhysLinkState:...................Polling
Lid:.............................38
SMLid:...........................13
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............No Extended Speed
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibwarn: [169004] umad_close_port: closed fd 3
+ sleep 10
+ ibstat
ibwarn: [169298] umad_init: umad_init
ibwarn: [169298] umad_get_ca_device_list: return 1 cas
ibwarn: [169298] umad_get_ca: ca_name mlx4_0
ibwarn: [169298] umad_get_ca: opened mlx4_0
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007be0e0
	System image GUID: 0xf4521403007be0e3
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 38
		LMC: 0
		SM lid: 13
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x02594868
		Port GUID: 0xf4521403007be0e2
		Link layer: InfiniBand

@Hakon-Bugge
Contributor

Hakon-Bugge commented Oct 21, 2020

@Honggang-LI A quick reading reveals:

ibportstate 38 1 disable

To disable a port, you must do so on the switch port it is connected to. Or is this deliberate negative testing?

Given lid1 and lid2, the local LIDs of the ports on an HCA, you can run:

ibtracert lid1 lid2

to find the switch port numbers these ports are connected to.

@Honggang-LI
Contributor Author

@Honggang-LI A quick reading reveals:

ibportstate 38 1 disable

To disable a port, you must do so on the switch port it is connected.

Why?

Or, is this a deliberate negative testing?

Yes. For example, when we run NVMe over IB, we flip the port state to simulate a path failure.

@Hakon-Bugge
Contributor

To disable a port, you must do so on the switch port it is connected.

Why?

Because, at least for IB HCAs, it doesn't work.

Or, is this a deliberate negative testing?

Yes. For example, while we run NVME over IB, we flip the port state to simulate path fail.

I meant negative testing of ibportstate.

Snip from the ibportstate man page:

ibportstate allows the port state and port physical state of an IB port to be queried (...), or a switch port to be disabled, enabled, or reset. It also allows the link speed/width enabled on any IB port to be adjusted.

@Honggang-LI
Contributor Author

To disable a port, you must do so on the switch port it is connected.

Why?

Because, at least for IB HCAs, it doesn't work.

In fact, it works for me on IB HCAs. Which type of HCA did you test?

I tested mlx4, mlx5, qib and Mellanox Connect-IB. They all work for me. I changed the HCA port state locally, without the knowledge of the Subnet Manager.

Please see chapter 14 of the InfiniBand Architecture Specification, Release 1.3, for details of PortInfo and local changes.

Or, is this a deliberate negative testing?

Yes. For example, while we run NVME over IB, we flip the port state to simulate path fail.

I meant negative testing of ibportstate.

OK, I got it. But locally disabling an HCA port and disabling the switch port that the HCA port is connected to are different things.

For example, the HCA port's physical state differs. When the HCA port is disabled locally, its physical state is 'Disabled'; when the switch port it is connected to is disabled, its physical state is 'Polling'.

[root@rdma01 ~]# diff -Nurp disable.ib_switch.port locally.diable.HCA.port 
--- disable.ib_switch.port	2020-10-26 06:44:23.486659205 -0400
+++ locally.diable.HCA.port	2020-10-26 06:43:10.676194121 -0400
@@ -17,7 +17,7 @@ CA 'qib0'
 		Link layer: InfiniBand
 	Port 2:
 		State: Down
-		Physical state: Polling
+		Physical state: Disabled
 		Rate: 10
 		Base lid: 4
 		LMC: 0
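
The same distinction can be checked from a script. The sketch below is illustrative only: the helper names phys_state and down_reason are made up, it assumes a single CA in the ibstat output, and the wording of the 'Polling' case reflects what was observed here (the switch port was disabled); a link partner that does not respond for any other reason would look the same.

```shell
#!/bin/sh
# Print the "Physical state" of one port from ibstat-style text on stdin.
# Assumption: the text describes a single CA.
phys_state() {
    # $1: local port number
    awk -v want="$1" '
        $1 == "Port" && NF == 2 { port = $2; sub(/:$/, "", port) }
        $1 == "Physical" && $2 == "state:" && port == want { print $3 }
    '
}

# Map a physical state to a human-readable hint.
down_reason() {
    case "$1" in
        Disabled) echo "disabled locally on the HCA" ;;
        Polling)  echo "link partner not responding (here: switch port disabled)" ;;
        LinkUp)   echo "link is up" ;;
        *)        echo "other physical state: $1" ;;
    esac
}

# Abbreviated state of qib0 after the local disable, as in the diff above.
sample="CA 'qib0'
Port 1:
Physical state: LinkUp
Port 2:
Physical state: Disabled"
down_reason "$(printf '%s\n' "$sample" | phys_state 2)"
```

For the sample text this prints "disabled locally on the HCA".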

Snip from the ibportstate man page:

ibportstate allows the port state and port physical state of an IB port to be queried (...), or a switch port to be disabled, enabled, or reset. It also allows the link speed/width enabled on any IB port to be adjusted.

It seems we also need to update the man page for ibportstate.

Here is an example of locally changing the HCA port state.
1 Two machines, each with a dual-port qib HCA; all IB ports are connected to the same IB switch.
2 Reboot both machines.
3 Start an opensm instance on the first machine. Wait two minutes and make sure all four ports are 'Active'.
4 Kill opensm on the first machine.
5 Run sminfo on the second machine. sminfo should fail because opensm was killed; the fabric now has no subnet manager.
6 Locally disable a port on the second machine and wait 30 seconds.
7 Locally re-enable the disabled port on the second machine.
8 Run ibstat to check the state of the re-enabled port. It should be in the 'Initializing' state, and its physical state should be 'LinkUp'.
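
Steps 6 to 8 can be scripted roughly as follows. This is a sketch, not a tested tool: flip_port, run and DRY_RUN are made-up names, and by default the script only prints the commands it would run. The ibportstate arguments follow the form used in the transcript (-C CA -P port, then LID and port number).

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 on a test machine to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

flip_port() {
    # $1: CA name, $2: local port number, $3: LID of that port,
    # $4: seconds to keep the port disabled
    ca="$1"; port="$2"; lid="$3"; wait_s="$4"
    run ibportstate -C "$ca" -P "$port" "$lid" "$port" disable   # step 6
    run sleep "$wait_s"
    run ibportstate -C "$ca" -P "$port" "$lid" "$port" enable    # step 7
    run ibstat                                                   # step 8
}

# Values from the qib example: CA qib0, port 1, LID 2, 30 seconds.
flip_port qib0 1 2 30
```

In dry-run mode this prints the four commands prefixed with "+ ", which makes it easy to check the arguments before touching a live port.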

[root@rdma02 ~]# sminfo 
sminfo: iberror: failed: query

[root@rdma02 ~]# ibstat
CA 'qib0'
	CA type: InfiniPath_QLE7342
	Number of ports: 2
	Firmware version: 
	Hardware version: 2
	Node GUID: 0x001175000077d708
	System image GUID: 0x001175000077d708
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 2
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d708
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d709
		Link layer: InfiniBand
[root@rdma02 ~]# ibportstate -C qib0 -P 1 2 1 disable
Initial CA/RT PortInfo:
# Port info: Lid 2 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................2
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
Disable may be irreversible

After PortInfo set:
# Port info: Lid 2 port 1
LinkState:.......................Down
PhysLinkState:...................Disabled
Lid:.............................2
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
[root@rdma02 ~]# 
[root@rdma02 ~]# ibstat
CA 'qib0'
	CA type: InfiniPath_QLE7342
	Number of ports: 2
	Firmware version: 
	Hardware version: 2
	Node GUID: 0x001175000077d708
	System image GUID: 0x001175000077d708
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 2
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d708
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d709
		Link layer: InfiniBand
[root@rdma02 ~]# 

[root@rdma02 ~]# ibportstate -C qib0 -P 1 2 1 enable
Initial CA/RT PortInfo:
# Port info: Lid 2 port 1
LinkState:.......................Down
PhysLinkState:...................Disabled
Lid:.............................2
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0

After PortInfo set:
# Port info: Lid 2 port 1
LinkState:.......................Down
PhysLinkState:...................PortConfigurationTraining
Lid:.............................2
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................2.5 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
[root@rdma02 ~]# 
[root@rdma02 ~]# ibstat
CA 'qib0'
	CA type: InfiniPath_QLE7342
	Number of ports: 2
	Firmware version: 
	Hardware version: 2
	Node GUID: 0x001175000077d708
	System image GUID: 0x001175000077d708
	Port 1:
		State: Initializing
		Physical state: LinkUp
		Rate: 40
		Base lid: 2
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d708
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 3
		LMC: 0
		SM lid: 1
		Capability mask: 0x07690868
		Port GUID: 0x001175000077d709
		Link layer: InfiniBand
[root@rdma02 ~]# 
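
Steps 5 to 8 of the session above can be condensed into a small script. This is only a sketch under stated assumptions: the CA name `qib0`, local port 1, LID 2 and the 30-second wait are taken from the log, while the `run` and `toggle_port` helpers and the `DRY_RUN` switch are hypothetical names added for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 5-8 on the second machine (no subnet manager running).
# Values below mirror the session log; override them for other hosts.
CA=${CA:-qib0}      # local CA name, passed to -C
PORT=${PORT:-1}     # used both as local port (-P) and as target port number
LID=${LID:-2}       # LID of the port being toggled
WAIT=${WAIT:-30}    # seconds to wait between disable and enable

# run: print a command, and execute it only when DRY_RUN=0.
# (Hypothetical helper; dry-run is the default so nothing is changed by accident.)
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# toggle_port: disable the port locally, wait, enable it again, then show state.
toggle_port() {
    run sminfo || echo "sminfo failed (expected when no SM is running)"
    run ibportstate -C "$CA" -P "$PORT" "$LID" "$PORT" disable
    run sleep "$WAIT"
    run ibportstate -C "$CA" -P "$PORT" "$LID" "$PORT" enable
    run ibstat
}
```

Calling `toggle_port` prints the five commands; calling it with `DRY_RUN=0` set in the environment would actually execute them, which needs root and will take the port down for the duration of the wait.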

@Hakon-Bugge
Contributor

Yep, the args you present to ibportstate actually work. Just verified on an mlx4 system with IB link layer.

@jgunthorpe
Member

Is there some conclusion here?

@weiny2
Contributor

weiny2 commented Nov 5, 2020

I thought the issue was just user error. But perhaps I misread something?

@Honggang-LI
Contributor Author

> I thought the issue was just user error. But perhaps I misread something?

The ibportstate documentation is misleading. We need to update the man page and the example in the usage message.

@weiny2
Contributor

weiny2 commented Nov 5, 2020

> > I thought the issue was just user error. But perhaps I misread something?
>
> The ibportstate documentation is misleading. We need to update the man page and the example in the usage message.

I can see that being a problem. Will you close this and open another issue?

…tstate

A host from which the enable/disable/reset command is executed may be
connected to multiple InfiniBand fabrics. When the HCA name and
port number are not specified, the libibumad library picks the
first active port it finds, which may not be the intended one. It is
recommended to specify the HCA name and port number when running
ibportstate.

On the other hand, an HCA port may be changed locally without the
knowledge of the Subnet Manager. When locally enabling a disabled HCA
port, the HCA name and port number must be specified.

Signed-off-by: Honggang Li <honli@redhat.com>
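
Since the commit message recommends always naming the HCA and port, it can help to list the local ports and their states first. The following is a rough sketch, not part of the patch: `list_ib_ports` is a hypothetical helper that reads `ibstat` output on stdin, and it assumes the output layout shown in the sessions earlier in this thread.

```shell
# list_ib_ports: read ibstat output on stdin and print one line per port:
#   <CA name> <port number> <State> <Physical state>
# Hypothetical helper; assumes the ibstat layout shown in this thread.
list_ib_ports() {
    awk '
        # CA header line: the name sits between single quotes (octal 047).
        index($0, "CA \047") == 1 {
            split($0, parts, "\047")
            ca = parts[2]
        }
        # "Port N:" starts a port block ("Port GUID:" does not match).
        /^[ \t]*Port [0-9]+:/ {
            port = $2
            sub(/:$/, "", port)
        }
        /^[ \t]*State:/ { state = $2 }
        # "Physical state:" closes the pair of fields we report.
        /^[ \t]*Physical state:/ {
            print ca, port, state, $3
        }
    '
}
```

Used as `ibstat | list_ib_ports`, each output line gives a CA name and port number that can be passed to `ibportstate` as `-C` and `-P`, including for a port whose physical state is `Disabled`.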
@Honggang-LI
Contributor Author

> > > I thought the issue was just user error. But perhaps I misread something?
> >
> > The ibportstate documentation is misleading. We need to update the man page and the example in the usage message.
>
> I can see that being a problem. Will you close this and open another issue?

New PR:
#868
