Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infiniband: do not fail if board_id is not present #556

Merged
merged 1 commit into from Sep 22, 2023

Conversation

dswarbrick
Copy link
Contributor

@dswarbrick dswarbrick commented Aug 5, 2023

Whilst a large majority do, not all InfiniBand drivers expose a board_id in sysfs.

Fixes: prometheus/node_exporter#2769

Whilst a large majority do, not _all_ InfiniBand drivers expose a
board_id in sysfs.

Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
func (fs FS) parseInfiniBandDevice(name string) (*InfiniBandDevice, error) {
path := fs.sys.Path(infinibandClassPath, name)
device := InfiniBandDevice{Name: name}

for _, f := range [...]string{"board_id", "fw_ver", "hca_type"} {
// fw_ver is exposed by all InfiniBand drivers since kernel version 4.10.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically support older kernels than this. Maybe we should just make all of these files/values optional.

Copy link
Contributor Author

@dswarbrick dswarbrick Aug 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this PR is does not change the behaviour for fw_ver, which has been a mandatory requirement of this collector since it first landed in June 2019. It only makes the board_id optional, and hca_type was already optional.

I would be very surprised if anybody is running InfiniBand on kernels older than 4.10. I don't recall a single issue being logged about this collector failing due to missing fw_ver.

Digging through kernel source, the fw_ver sysfs entry appears to be universally implemented from kernel 4.8 onwards (despite docs saying 4.10), and prior to that it is implemented by selected vendors (e.g. Mellanox, which is almost certainly going to be the hardware in use from that era of kernels). The earliest mention of it in the mlx4 driver is kernel 2.6.24.

@SuperQ
Copy link
Member

SuperQ commented Aug 6, 2023

I wonder if we should make the values in the struct *string and set it to nil if the file is not found.

@dswarbrick
Copy link
Contributor Author

I wonder if we should make the values in the struct *string and set it to nil if the file is not found.

Doing so means that consumers of this need to check for non-nil before using the struct members. I'm not sure if that's a good thing or not. Given that Prometheus does not store empty-string labels, my gut feeling is that it's more convenient to simply leave these struct members as empty strings if not found, meaning that node_exporter can just blindly assign them to labels.

@mustafakemalgilor
Copy link

Related-to: prometheus/node_exporter#2769

@SuperQ SuperQ merged commit b2168a3 into prometheus:master Sep 22, 2023
7 checks passed
@dswarbrick dswarbrick deleted the infiniband branch September 22, 2023 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infiniband metrics not collected if irdma kernel module is loaded
4 participants