query_smp.c:199; Connection timed out #12

mglants · 2021-03-31T14:31:36Z

After commit c6eef51
I got
src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
infiniband_scrape_ok 0.0

The text was updated successfully, but these errors were encountered:

mglants · 2021-03-31T15:15:20Z

@guilbaults

guilbaults · 2021-03-31T18:17:15Z

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric ?
From the error message, it look like the error happens in ibquerryerrors and not in the parsing done by the python script.

mglants · 2021-03-31T19:20:01Z

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric ?
From the error message, it look like the error happens in ibquerryerrors and not in the parsing done by the python script.

First string is about error, the next data is ok. What is the cause of the problem?

src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
Errors for 0xb8599f0300383ec0 "ibswh2 [255-B05] HDR-40"
   GUID 0xb8599f0300383ec0 port ALL: [SymbolErrorCounter == 19817 (19.353K)] [LinkDownedCounter == 37 (37.000)] [PortXmitDiscards == 8 (8.000)] [PortLocalPhysicalErrors == 8] [PortXmitWait == 2417380683166 (2.199T)] [PortXmitData == 5044493407069892 (17.922PB)] [PortRcvData == 5044493406602707 (17.922PB)] [PortXmitPkts == 5101051455966 (4.639T)] [PortRcvPkts == 5101051441855 (4.639T)] [PortUnicastXmitPkts == 5101050102526 (4.639T)] [PortUnicastRcvPkts == 5101050102542 (4.639T)] [PortMulticastXmitPkts == 29380 (28.691K)] [PortMulticastRcvPkts == 15253 (14.896K)]
   GUID 0xb8599f0300383ec0 port 1: [LinkDownedCounter == 1 (1.000)] [PortXmitDiscards == 3 (3.000)] [PortLocalPhysicalErrors == 3] [PortXmitWait == 102856180328 (95.792G)] [PortXmitData == 225495513282581 (820.348TB)] [PortRcvData == 154914464326291 (563.576TB)] [PortXmitPkts == 229378238279 (213.625G)] [PortRcvPkts == 154134492149 (143.549G)] [PortUnicastXmitPkts == 229378226543 (213.625G)] [PortUnicastRcvPkts == 154134491212 (143.549G)] [PortMulticastXmitPkts == 11736 (11.461K)] [PortMulticastRcvPkts == 937 (937.000)]

Also output before commit c6eef51 just working fine

guilbaults · 2021-03-31T19:32:16Z

Its weird since c6eef51 did not change anything in parsing, but is simply capturing stderr and since you have something in stderr, it abort the collection cycle.

I don't have yet access to a HDR switch to verify (will do in a few months), but it seem there is a problem in ibqueryerrors since it only seem to show your local port. Some similar error can happen when the tools in the infiniband-diags rpm are not launched as root since some packets need to be sent with root (umad packets). You can try ibhosts and ibswitches, they should list everything connected in your network.

From what I know, 2 other things could block umad packets, running in a VM with SR-IOV and having some management keys installed on the fabric/subnet manager.

mglants · 2021-03-31T19:34:06Z

Found something on the internet
Suspect #2: Unresponsive node/s issue

Looking for unresponsive nodes to fabric MADs. Nodes can get to this situation if there is any issue with OS, driver or card firmware. Once identified, it is recommended that the unresponsive nodes will not participate in any job in the fabric.

If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, ethc.

Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
If there are unresponsive nodes in the fabric, you will get 1 “Connection times out” line per unresponsive node at the start of the command output, with specific direct path to the node

Example:


root # ibnetdiscover

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out


Topology file: generated on Mon Mar  2 17:19:19 2016

Initiated from node f4521403008b9a30 port f4521403008b9a31

Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device outputted in last step by port number as the last number in the direct path

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18

gabrieleiannetti · 2021-04-09T14:09:56Z

`` to see if your host can access the counters on the fabric ?

Yes, I can confirm that. We are also seeing errors caused by ibqueryerrors.

Exporter Output:

2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

STDERR output from ibqueryerrors:

src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

I would say, that the exporter works as designed with return on error implemented in c6eef51.

Previously running the exporter before commit c6eef51, we did not see any errors in the metrics exported by the exporter. So I doubt that there is no errors in the fabric, since also calling ibquererrors returns the shown errors.

guilbaults · 2021-04-09T14:20:06Z

When that error is printed in STDERR, is there any useful counters in STDOUT ?
If some counters are collected correctly, then probably the script should not abort, but still report the error in the logs and exported stats.

gabrieleiannetti · 2021-04-09T14:41:40Z

Well, I am not familiar with the output yet, but it looks like it is full of errors.

I have attached the output with anonymous host names as XXX.

ibqueryerrors_nohostnames.txt

As I mentioned before, as the exporter was running without quitting on error,
we did not see any positive values in the exported metrics.

I checked every metric. There was a lot of metrics exported with 0 values and I could identify the nodes and switches in the fabric.

But the exporter should export those errors, right?

gabrieleiannetti · 2021-04-09T15:02:57Z

2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

We have identified those 13 bad status 110; Connection timed out errors as machines that are still in the fabric, but there are switched off. Think they have power on the ports of the NICs.

gabrieleiannetti · 2021-04-09T15:06:26Z

Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery n

Hi @mglants,
can you verify that the machines are up and running or are they switched off?

guilbaults · 2021-04-09T15:26:56Z

Well, I am not familiar with the output yet, but it looks like it is full of errors.

I have attached the output with anonymous host names as XXX.

ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

mglants · 2021-04-10T16:47:27Z

Well, I am not familiar with the output yet, but it looks like it is full of errors.
I have attached the output with anonymous host names as XXX.
ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

Can we also include information about ports in down state. In my case it was due we shutdown server for maintenance. As that errors have a path to port with problem

gabrieleiannetti · 2021-04-12T11:10:27Z

Well, I am not familiar with the output yet, but it looks like it is full of errors.
I have attached the output with anonymous host names as XXX.
ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

I would like to propose, that the exporter does export a metric for that error "bad status 110; Connection timed out".

Would be also helpful to collect such errors in the fabric.

Thanks for sharing the most interesting errors.

gabrieleiannetti · 2021-04-12T13:49:25Z

Found something on the internet
Suspect #2: Unresponsive node/s issue

Looking for unresponsive nodes to fabric MADs. Nodes can get to this situation if there is any issue with OS, driver or card firmware. Once identified, it is recommended that the unresponsive nodes will not participate in any job in the fabric.

If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, ethc.

Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
If there are unresponsive nodes in the fabric, you will get 1 “Connection times out” line per unresponsive node at the start of the command output, with specific direct path to the node

Example:
root # ibnetdiscover

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out


Topology file: generated on Mon Mar  2 17:19:19 2016

Initiated from node f4521403008b9a30 port f4521403008b9a31
Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device outputted in last step by port number as the last number in the direct path

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18

Just another approach which gives a good overview of the Ports of the switch here:

iblinkinfo --node-name-map ibswitches_GUID-Name -D <direct_path_without_last_number>

In your example you should see at port 18 a line that will probably end with something like "Port not available".

* Remove return statement * Reformat code for more readability

gabrieleiannetti · 2021-04-16T12:52:11Z

@guilbaults

Hi, I would like to contribute a patch for processing an error from STDERR...

I have tried to check which errors can be returned by ibqueryerrors, but it is not clear to me.

My approach would be checking the return code first...

https://docs.oracle.com/cd/E88353_01/html/E72487/ibqueryerrors-8.html

EXIT STATUS
       -1 if scan fails.

       0 if scan succeeds without errors beyond thresholds

       1 if errors are found beyond thresholds or inconsistencies are found in
       check mode.

If the return code is -1 the exporter skips and sets the metric infiniband_scrape_ok=0.
Otherwise it continues, errors from stderr should always be checked then.

I would first implement processing of a "bad status" error.

As I figured out from the sources found here about this error (https://github.com/linux-rdma/infiniband-diags/blob/b48b4a630f54438ba2b529f40fab99c1abba3763/libibnetdisc/src/query_smp.c#L189), it looks like there is the following scheme:

	if ((status = umad_status(umad))) {
		IBND_ERROR("umad (%s Attr 0x%x:%u) bad status %d; %s\n",
			   portid2str(&smp->path), smp->rpc.attr.id,
			   smp->rpc.attr.mod, status, strerror(status));
		if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
			rc = mlnx_ext_port_info_err(engine, smp, mad,
						    smp->cb_data);
	} else if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F))) {
		IBND_ERROR("mad (%s Attr 0x%x:%u) bad status 0x%x\n",
			   portid2str(&smp->path), smp->rpc.attr.id,
			   smp->rpc.attr.mod, status);
		if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
			rc = mlnx_ext_port_info_err(engine, smp, mad,
						    smp->cb_data);

So I would suggest to add a new metric infiniband_bad_status... (data type?),
that should include the following labels:

path
status
error (could be empty, see else if)

As an example how it would look like with that data set:

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,1 Attr 0xff90:1) bad status 110; Connection timed out

path: "DR path slid 0; dlid 0; 0,1,1"
status: "110"
error: "Connection timed out"

If the exporter detects an unknown error not bad status, it could log warnings and signal that something is wrong by setting infiniband_scrape_ok=0.

What do you think - Should we go that way?

gabrieleiannetti · 2021-04-27T13:02:39Z

I have built an example how to parse and process an STDERR error for bad status:

import re

pattern = r"src\/query\_smp\.c\:[\d]+\; (?:mad|umad) \((DR path .*) Attr .*\) bad status ([\d]+); (.*)"

string = "src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out"

prog = re.compile(pattern)
result = prog.match(string)

if result:
    path = result.group(1)
    status = result.group(2)
    error = result.group(3)

else:
    pass # handle error...

I think this should do the job.

Probably I will create an pull request the next days...

* add new metric infiniband_bad_status_error * update infiniband_scrape_ok metric depending on scrape of infiniband_bad_status_error

guilbaults changed the title ~~Error after last commit~~ query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out Mar 31, 2021

guilbaults changed the title ~~query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out~~ query_smp.c:199; Connection timed out Mar 31, 2021

gabrieleiannetti mentioned this issue Apr 14, 2021

Prevents quit on errors from ibqueryerror (#12) #22

Merged

guilbaults pushed a commit that referenced this issue Apr 14, 2021

Prevents quit on errors from ibqueryerror (#12) (#22)

cd5fa05

* Remove return statement * Reformat code for more readability

guilbaults pushed a commit that referenced this issue Apr 29, 2021

Add bad status error metric (#12) (#26)

2aca02c

* add new metric infiniband_bad_status_error * update infiniband_scrape_ok metric depending on scrape of infiniband_bad_status_error

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query_smp.c:199; Connection timed out #12

query_smp.c:199; Connection timed out #12

mglants commented Mar 31, 2021

mglants commented Mar 31, 2021

guilbaults commented Mar 31, 2021

mglants commented Mar 31, 2021 •

edited

Loading

guilbaults commented Mar 31, 2021

mglants commented Mar 31, 2021 •

edited

Loading

gabrieleiannetti commented Apr 9, 2021 •

edited

Loading

guilbaults commented Apr 9, 2021

gabrieleiannetti commented Apr 9, 2021 •

edited

Loading

gabrieleiannetti commented Apr 9, 2021

gabrieleiannetti commented Apr 9, 2021

guilbaults commented Apr 9, 2021

mglants commented Apr 10, 2021

gabrieleiannetti commented Apr 12, 2021 •

edited

Loading

gabrieleiannetti commented Apr 12, 2021

gabrieleiannetti commented Apr 16, 2021 •

edited

Loading

gabrieleiannetti commented Apr 27, 2021

query_smp.c:199; Connection timed out #12

query_smp.c:199; Connection timed out #12

Comments

mglants commented Mar 31, 2021

mglants commented Mar 31, 2021

guilbaults commented Mar 31, 2021

mglants commented Mar 31, 2021 • edited Loading

guilbaults commented Mar 31, 2021

mglants commented Mar 31, 2021 • edited Loading

gabrieleiannetti commented Apr 9, 2021 • edited Loading

guilbaults commented Apr 9, 2021

gabrieleiannetti commented Apr 9, 2021 • edited Loading

gabrieleiannetti commented Apr 9, 2021

gabrieleiannetti commented Apr 9, 2021

guilbaults commented Apr 9, 2021

mglants commented Apr 10, 2021

gabrieleiannetti commented Apr 12, 2021 • edited Loading

gabrieleiannetti commented Apr 12, 2021

gabrieleiannetti commented Apr 16, 2021 • edited Loading

gabrieleiannetti commented Apr 27, 2021

mglants commented Mar 31, 2021 •

edited

Loading

mglants commented Mar 31, 2021 •

edited

Loading

gabrieleiannetti commented Apr 9, 2021 •

edited

Loading

gabrieleiannetti commented Apr 9, 2021 •

edited

Loading

gabrieleiannetti commented Apr 12, 2021 •

edited

Loading

gabrieleiannetti commented Apr 16, 2021 •

edited

Loading