Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query_smp.c:199; Connection timed out #12

Open
mglants opened this issue Mar 31, 2021 · 16 comments
Open

query_smp.c:199; Connection timed out #12

mglants opened this issue Mar 31, 2021 · 16 comments

Comments

@mglants
Copy link
Contributor

mglants commented Mar 31, 2021

After commit c6eef51
I got
src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
infiniband_scrape_ok 0.0

@mglants
Copy link
Contributor Author

mglants commented Mar 31, 2021

@guilbaults

@guilbaults
Copy link
Owner

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric ?
From the error message, it look like the error happens in ibquerryerrors and not in the parsing done by the python script.

@guilbaults guilbaults changed the title Error after last commit query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out Mar 31, 2021
@guilbaults guilbaults changed the title query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out query_smp.c:199; Connection timed out Mar 31, 2021
@mglants
Copy link
Contributor Author

mglants commented Mar 31, 2021

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric ?
From the error message, it look like the error happens in ibquerryerrors and not in the parsing done by the python script.

First string is about error, the next data is ok. What is the cause of the problem?

src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
Errors for 0xb8599f0300383ec0 "ibswh2 [255-B05] HDR-40"
   GUID 0xb8599f0300383ec0 port ALL: [SymbolErrorCounter == 19817 (19.353K)] [LinkDownedCounter == 37 (37.000)] [PortXmitDiscards == 8 (8.000)] [PortLocalPhysicalErrors == 8] [PortXmitWait == 2417380683166 (2.199T)] [PortXmitData == 5044493407069892 (17.922PB)] [PortRcvData == 5044493406602707 (17.922PB)] [PortXmitPkts == 5101051455966 (4.639T)] [PortRcvPkts == 5101051441855 (4.639T)] [PortUnicastXmitPkts == 5101050102526 (4.639T)] [PortUnicastRcvPkts == 5101050102542 (4.639T)] [PortMulticastXmitPkts == 29380 (28.691K)] [PortMulticastRcvPkts == 15253 (14.896K)]
   GUID 0xb8599f0300383ec0 port 1: [LinkDownedCounter == 1 (1.000)] [PortXmitDiscards == 3 (3.000)] [PortLocalPhysicalErrors == 3] [PortXmitWait == 102856180328 (95.792G)] [PortXmitData == 225495513282581 (820.348TB)] [PortRcvData == 154914464326291 (563.576TB)] [PortXmitPkts == 229378238279 (213.625G)] [PortRcvPkts == 154134492149 (143.549G)] [PortUnicastXmitPkts == 229378226543 (213.625G)] [PortUnicastRcvPkts == 154134491212 (143.549G)] [PortMulticastXmitPkts == 11736 (11.461K)] [PortMulticastRcvPkts == 937 (937.000)]

Also output before commit c6eef51 just working fine

@guilbaults
Copy link
Owner

Its weird since c6eef51 did not change anything in parsing, but is simply capturing stderr and since you have something in stderr, it abort the collection cycle.

I don't have yet access to a HDR switch to verify (will do in a few months), but it seem there is a problem in ibqueryerrors since it only seem to show your local port. Some similar error can happen when the tools in the infiniband-diags rpm are not launched as root since some packets need to be sent with root (umad packets). You can try ibhosts and ibswitches, they should list everything connected in your network.

From what I know, 2 other things could block umad packets, running in a VM with SR-IOV and having some management keys installed on the fabric/subnet manager.

@mglants
Copy link
Contributor Author

mglants commented Mar 31, 2021

Found something on the internet
Suspect #2: Unresponsive node/s issue

Looking for unresponsive nodes to fabric MADs. Nodes can get to this situation if there is any issue with OS, driver or card firmware. Once identified, it is recommended that the unresponsive nodes will not participate in any job in the fabric.

If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, ethc.

Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
If there are unresponsive nodes in the fabric, you will get 1 “Connection times out” line per unresponsive node at the start of the command output, with specific direct path to the node

Example:


root # ibnetdiscover

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out


Topology file: generated on Mon Mar  2 17:19:19 2016

Initiated from node f4521403008b9a30 port f4521403008b9a31

Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device outputted in last step by port number as the last number in the direct path

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18

@gabrieleiannetti
Copy link
Contributor

gabrieleiannetti commented Apr 9, 2021

`` to see if your host can access the counters on the fabric ?

Yes, I can confirm that. We are also seeing errors caused by ibqueryerrors.

Exporter Output:

2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

STDERR output from ibqueryerrors:

src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

I would say, that the exporter works as designed with return on error implemented in c6eef51.

Previously running the exporter before commit c6eef51, we did not see any errors in the metrics exported by the exporter. So I doubt that there is no errors in the fabric, since also calling ibquererrors returns the shown errors.

@guilbaults
Copy link
Owner

When that error is printed in STDERR, is there any useful counters in STDOUT ?
If some counters are collected correctly, then probably the script should not abort, but still report the error in the logs and exported stats.

@gabrieleiannetti
Copy link
Contributor

gabrieleiannetti commented Apr 9, 2021

Well, I am not familiar with the output yet, but it looks like it is full of errors.

I have attached the output with anonymous host names as XXX.

ibqueryerrors_nohostnames.txt

As I mentioned before, as the exporter was running without quitting on error,
we did not see any positive values in the exported metrics.

I checked every metric. There was a lot of metrics exported with 0 values and I could identify the nodes and switches in the fabric.

But the exporter should export those errors, right?

@gabrieleiannetti
Copy link
Contributor

2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

We have identified those 13 bad status 110; Connection timed out errors as machines that are still in the fabric, but there are switched off. Think they have power on the ports of the NICs.

@gabrieleiannetti
Copy link
Contributor

Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery n

Hi @mglants,
can you verify that the machines are up and running or are they switched off?

@guilbaults
Copy link
Owner

Well, I am not familiar with the output yet, but it looks like it is full of errors.

I have attached the output with anonymous host names as XXX.

ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

@mglants
Copy link
Contributor Author

mglants commented Apr 10, 2021

Well, I am not familiar with the output yet, but it looks like it is full of errors.
I have attached the output with anonymous host names as XXX.
ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

Can we also include information about ports in down state. In my case it was due we shutdown server for maintenance. As that errors have a path to port with problem

@gabrieleiannetti
Copy link
Contributor

gabrieleiannetti commented Apr 12, 2021

Well, I am not familiar with the output yet, but it looks like it is full of errors.
I have attached the output with anonymous host names as XXX.
ibqueryerrors_nohostnames.txt

PortXmitWait only indicate some congestion, most of them seem to be at their maximum values and need to be reset, but its not really an error. The collector can do the counter reset as needed with --can-reset-counter.

The most interresting errors are:
SymbolErrorCounter -> Problem with a cable
PortXmitDiscards -> Too much congestion and lossless feature of IB was not working, some packets were dropped
LinkDownedCounter -> Can happen with a bad cable that flap between up/down
PortRcvRemotePhysicalErrors -> Problem with a cable

Having some switched off or rebooting servers is normal, I might remove the early abort from c6eef51 if ibqueryerrors can collect correctly the other counters. I have updated my production server with the latest release, I will see if the problem happens often on our fabric.

I would like to propose, that the exporter does export a metric for that error "bad status 110; Connection timed out".

Would be also helpful to collect such errors in the fabric.

Thanks for sharing the most interesting errors.

@gabrieleiannetti
Copy link
Contributor

Found something on the internet
Suspect #2: Unresponsive node/s issue

Looking for unresponsive nodes to fabric MADs. Nodes can get to this situation if there is any issue with OS, driver or card firmware. Once identified, it is recommended that the unresponsive nodes will not participate in any job in the fabric.

If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, ethc.

Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes
If there are unresponsive nodes in the fabric, you will get 1 “Connection times out” line per unresponsive node at the start of the command output, with specific direct path to the node

Example:


root # ibnetdiscover

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out


Topology file: generated on Mon Mar  2 17:19:19 2016

Initiated from node f4521403008b9a30 port f4521403008b9a31

Identify the unresponsive node/s:
From the same node where the direct path command invoked, run:
smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device outputted in last step by port number as the last number in the direct path

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18

Just another approach which gives a good overview of the Ports of the switch here:

iblinkinfo --node-name-map ibswitches_GUID-Name -D <direct_path_without_last_number>

In your example you should see at port 18 a line that will probably end with something like "Port not available".

guilbaults pushed a commit that referenced this issue Apr 14, 2021
* Remove return statement
* Reformat code for more readability
@gabrieleiannetti
Copy link
Contributor

gabrieleiannetti commented Apr 16, 2021

@guilbaults

Hi, I would like to contribute a patch for processing an error from STDERR...

I have tried to check which errors can be returned by ibqueryerrors, but it is not clear to me.

My approach would be checking the return code first...

https://docs.oracle.com/cd/E88353_01/html/E72487/ibqueryerrors-8.html

EXIT STATUS
       -1 if scan fails.

       0 if scan succeeds without errors beyond thresholds

       1 if errors are found beyond thresholds or inconsistencies are found in
       check mode.

If the return code is -1 the exporter skips and sets the metric infiniband_scrape_ok=0.
Otherwise it continues, errors from stderr should always be checked then.

I would first implement processing of a "bad status" error.

As I figured out from the sources found here about this error (https://github.com/linux-rdma/infiniband-diags/blob/b48b4a630f54438ba2b529f40fab99c1abba3763/libibnetdisc/src/query_smp.c#L189), it looks like there is the following scheme:

	if ((status = umad_status(umad))) {
		IBND_ERROR("umad (%s Attr 0x%x:%u) bad status %d; %s\n",
			   portid2str(&smp->path), smp->rpc.attr.id,
			   smp->rpc.attr.mod, status, strerror(status));
		if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
			rc = mlnx_ext_port_info_err(engine, smp, mad,
						    smp->cb_data);
	} else if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F))) {
		IBND_ERROR("mad (%s Attr 0x%x:%u) bad status 0x%x\n",
			   portid2str(&smp->path), smp->rpc.attr.id,
			   smp->rpc.attr.mod, status);
		if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
			rc = mlnx_ext_port_info_err(engine, smp, mad,
						    smp->cb_data);

So I would suggest to add a new metric infiniband_bad_status... (data type?),
that should include the following labels:

  • path
  • status
  • error (could be empty, see else if)

As an example how it would look like with that data set:

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,1 Attr 0xff90:1) bad status 110; Connection timed out

  • path: "DR path slid 0; dlid 0; 0,1,1"
  • status: "110"
  • error: "Connection timed out"

If the exporter detects an unknown error not bad status, it could log warnings and signal that something is wrong by setting infiniband_scrape_ok=0.

What do you think - Should we go that way?

@gabrieleiannetti
Copy link
Contributor

I have built an example how to parse and process an STDERR error for bad status:

import re

pattern = r"src\/query\_smp\.c\:[\d]+\; (?:mad|umad) \((DR path .*) Attr .*\) bad status ([\d]+); (.*)"

string = "src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out"

prog = re.compile(pattern)
result = prog.match(string)

if result:
    path = result.group(1)
    status = result.group(2)
    error = result.group(3)

else:
    pass # handle error...

I think this should do the job.

Probably I will create an pull request the next days...

guilbaults pushed a commit that referenced this issue Apr 29, 2021
* add new metric infiniband_bad_status_error
* update infiniband_scrape_ok metric depending on scrape of infiniband_bad_status_error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants