query_smp.c:199; Connection timed out #12
Comments
Could you manually run |
The first line is an error; the rest of the output is fine. What is the cause of the problem?
Also, the output before commit c6eef51 was working fine.
It's weird, since c6eef51 did not change anything in the parsing; it simply captures stderr, and since you have something in stderr, it aborts the collection cycle. I don't have access to an HDR switch yet to verify (I will in a few months), but it seems there is a problem in
From what I know, two other things could block umad packets: running in a VM with SR-IOV, and having management keys installed on the fabric/subnet manager.
Found something on the internet: Looking for unresponsive nodes to fabric MADs. Nodes can get into this situation if there is any issue with the OS, driver, or card firmware. Once identified, it is recommended that the unresponsive nodes not participate in any job in the fabric. If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, etc. Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes Example:
Identify the unresponsive node(s). Example: for direct path "0,1,18", invoke "smpquery nd -D 0,1". The unresponsive device is connected to the device returned by that command, on the port given by the last number in the direct path. Example: for direct path "0,1,18", the unresponsive device is connected to port 18.
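To make the direct-path rule above concrete, here is a minimal Python sketch. The helper name `split_direct_path` is made up for illustration and is not part of infiniband-diags or the exporter; it only splits the path string into the part to pass to `smpquery nd -D` and the port where the unresponsive device hangs.

```python
from typing import Tuple


def split_direct_path(direct_path: str) -> Tuple[str, int]:
    """Split a direct path such as "0,1,18" into (parent path, exit port).

    Hypothetical helper: the parent path addresses the last responsive
    device, and the final hop is the port the unresponsive device is on.
    """
    hops = [int(hop) for hop in direct_path.split(",")]
    if len(hops) < 2:
        raise ValueError("a direct path needs at least two hops")
    parent = ",".join(str(hop) for hop in hops[:-1])
    return parent, hops[-1]


parent, port = split_direct_path("0,1,18")
print(f"smpquery nd -D {parent}")            # query the last responsive device
print(f"unresponsive device is on port {port}")
```

For the path "0,1,18" this yields the parent path "0,1" and port 18, matching the example in the quoted text.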
Yes, I can confirm that. We are also seeing errors caused by Exporter Output:
STDERR output from
I would say that the exporter works as designed, with the return on error implemented in c6eef51. Previously, running the exporter before commit c6eef51, we did not see any errors in the exported metrics. So I doubt that there are no errors in the fabric, since also calling
When that error is printed to STDERR, are there any useful counters in STDOUT?
Well, I am not familiar with the output yet, but it looks like it is full of errors. I have attached the output with host names anonymized as XXX. As I mentioned before, while the exporter was running without quitting on error, I checked every metric. There were a lot of metrics exported with 0 values, and I could identify the nodes and switches in the fabric. But the exporter should export those errors, right?
We have identified those 13 |
Hi @mglants, |
PortXmitWait only indicates some congestion; most of them seem to be at their maximum values and need to be reset, but it's not really an error. The collector can do the counter reset as needed with
The most interesting errors are:
Having some switched-off or rebooting servers is normal. I might remove the early abort from c6eef51 if ibqueryerrors can correctly collect the other counters. I have updated my production server with the latest release; I will see if the problem happens often on our fabric.
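As a rough illustration of collecting without the early abort, the sketch below captures STDOUT and STDERR separately and keeps whatever counters were printed, even when STDERR is not empty. This is not the exporter's actual code: `run_and_collect` is a hypothetical helper, and a small Python one-liner stands in for `ibqueryerrors` so the example runs anywhere. Setting `infiniband_scrape_ok` to 0.0 on any STDERR output is an assumption based on the output shown in this issue.

```python
import subprocess
import sys


def run_and_collect(cmd):
    """Run a command; return (return code, stdout, stderr) without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for the real diagnostic command (assumption for the demo):
# it prints one "counter" line to stdout and one error line to stderr.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('counter_line'); "
    "print('bad status 110; Connection timed out', file=sys.stderr)",
]

rc, out, err = run_and_collect(demo_cmd)

# Keep the counters from stdout instead of discarding the whole cycle,
# and flag the scrape as not fully OK because stderr had content.
counters = out.splitlines()
scrape_ok = 0.0 if err.strip() else 1.0
print(counters, scrape_ok)
```

Whether a partially failed scrape should report 0.0 or expose a separate error metric is a design choice for the maintainers; the sketch only shows that both streams can be kept.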
Can we also include information about ports in the down state? In my case it was because we shut down a server for maintenance. Those errors include the path to the port with the problem.
I would like to propose that the exporter export a metric for the error "bad status 110; Connection timed out". It would also be helpful to collect such errors across the fabric. Thanks for sharing the most interesting errors.
Just another approach which gives a good overview of the Ports of the switch here:
In your example you should see at port 18 a line that will probably end with something like "Port not available". |
Hi, I would like to contribute a patch for processing an error from STDERR... I have tried to check which errors can be returned by
My approach would be checking the return code first...
If the return code is
I would first implement processing of a "bad status" error. As I figured out from the sources found here about this error (https://github.com/linux-rdma/infiniband-diags/blob/b48b4a630f54438ba2b529f40fab99c1abba3763/libibnetdisc/src/query_smp.c#L189), it looks like there is the following scheme:
So I would suggest to add a new metric
As an example of how it would look with that data set:
If the exporter detects an unknown error not
What do you think? Should we go that way?
I have built an example of how to parse and process a "bad status" error from STDERR:

import re

pattern = r"src\/query\_smp\.c\:[\d]+\; (?:mad|umad) \((DR path .*) Attr .*\) bad status ([\d]+); (.*)"
string = "src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out"
prog = re.compile(pattern)
result = prog.match(string)
if result:
    path = result.group(1)
    status = result.group(2)
    error = result.group(3)
else:
    pass  # handle error...

I think this should do the job. I will probably create a pull request in the next few days...
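Building on the pattern above, a possible next step is to run it over every STDERR line and keep unmatched lines apart, so unknown errors can be handled differently. This is only a sketch: `parse_stderr` and the dictionary keys are hypothetical names, the first two sample lines are the ones quoted in this issue, and the third is a made-up placeholder for an unrecognized message.

```python
import re
from collections import Counter

# Same pattern as in the comment above, with the redundant escapes dropped.
BAD_STATUS = re.compile(
    r"src/query_smp\.c:\d+; (?:mad|umad) \((DR path .*) Attr .*\) "
    r"bad status (\d+); (.*)"
)


def parse_stderr(stderr_text):
    """Return (parsed bad-status errors, lines that did not match)."""
    errors, unknown = [], []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = BAD_STATUS.match(line)
        if match:
            errors.append({
                "path": match.group(1),
                "status": int(match.group(2)),
                "error": match.group(3),
            })
        else:
            unknown.append(line)
    return errors, unknown


sample = "\n".join([
    "src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out",
    "src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out",
    "some other message",  # made-up line standing in for an unknown error
])

errors, unknown = parse_stderr(sample)
by_status = Counter(err["status"] for err in errors)
print(by_status, unknown)
```

Counting by status is one way such errors could feed a metric; which labels the metric should carry is left open here, as in the discussion.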
After commit c6eef51
I got
src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
infiniband_scrape_ok 0.0