Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(inputs.infiniband): Handle devices without counters #14049

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

jose-d
Copy link
Contributor

@jose-d jose-d commented Oct 3, 2023

  • Updated associated README.md.
  • Wrote appropriate unit tests.
  • Pull request title or commits are in [conventional commit format]

resolves #8135

Not all rdma devices have counters parsable by GetRdmaSysfsStats.
This PR add continue statement to ignore such device (and continue with next device/port) instead of throwing an error and failing

@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Oct 3, 2023

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@powersj powersj changed the title fix(plugin: inputs.infiniband) continue after failing to GetRdmaSysfsStats fix(inputs.infiniband): Handle devices without counters Oct 3, 2023
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

Looking at the issue, it seems that there are devices which may report as infiniband capable, won't have counters, and right now this prevents any metrics. Continuing on allows devices that can correctly get collected report metrics.

Does that sound right?

@jose-d
Copy link
Contributor Author

jose-d commented Oct 3, 2023

!signed-cla

@jose-d
Copy link
Contributor Author

jose-d commented Oct 3, 2023

Looking at the issue, it seems that there are devices which may report as infiniband capable, won't have counters, and right now this prevents any metrics. Continuing on allows devices that can correctly get collected report metrics.

Does that sound right?

Almost :)

inputs.infiniband uses rdmamap library which reports all RDMA-capable (not only Infiniband) devices - and some RDMA devices (eg. ROCE) have no counters, and this stops metric collection. Exactly, as you wrote, continuing on allows devices that can correctly get collected report metrics.

@jose-d
Copy link
Contributor Author

jose-d commented Oct 3, 2023

original behavior (metric collection fails on device irdma1 and no other devices are collected:

[jose@n15 telegraf]$ telegraf --test --config /etc/telegraf/telegraf.d/inputs.infiniband.conf 
...
2023-10-03T21:09:50Z E! [inputs.infiniband] Error in plugin: open /sys/class/infiniband/irdma1/ports/1/counters: no such file or directory
2023-10-03T21:09:50Z E! [telegraf] Error running agent: input plugins recorded 1 errors
[jose@n15 telegraf]$

after suggested patch (metric collection skips irdma1 and sucessfully collects mlx5_0

[jose@n15 telegraf]$ ./telegraf --test --config /etc/telegraf/telegraf.d/inputs.infiniband.conf 
...
> infiniband,device=mlx5_0,host=n15.koios.lan,port=1 VL15_dropped=0i,excessive_buffer_overrun_errors=0i,link_downed=0i,link_error_recovery=0i,local_link_integrity_errors=0i,multicast_rcv_packets=312906i,multicast_xmit_packets=0i,port_rcv_constraint_errors=0i,port_rcv_data=10591292i,port_rcv_errors=0i,port_rcv_packets=318648i,port_rcv_remote_physical_errors=0i,port_rcv_switch_relay_errors=0i,port_xmit_constraint_errors=0i,port_xmit_data=55025i,port_xmit_discards=0i,port_xmit_packets=3736i,port_xmit_wait=0i,symbol_error=0i,unicast_rcv_packets=5742i,unicast_xmit_packets=3736i 1696367342000000000
[jose@n15 telegraf]$

Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

It seems the only time errors come from this function are when the file does not exist or we cannot read the file itself, so I think this makes sense that we only return what we can read.

@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Oct 3, 2023
@srebhan srebhan added fix pr to fix corresponding bug area/network New plugins or feature relating to Network Monitoring plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Oct 4, 2023
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @jose-d!

@srebhan srebhan merged commit 360eeec into influxdata:master Oct 4, 2023
24 checks passed
@github-actions github-actions bot added this to the v1.28.3 milestone Oct 4, 2023
powersj pushed a commit that referenced this pull request Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/network New plugins or feature relating to Network Monitoring fix pr to fix corresponding bug plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[inputs/infiniband] Plugin should not fail when iwarp driver is loaded
3 participants