This exports a list of connected PCI devices so that we can track which GPUs and NICs are installed across the fleet. node-exporter doesn't support anything quite like this: for example, the ethtool metric doesn't cover NICs that aren't bound to a kernel driver, and it can't identify the attached GPU.
The new binary is 9MB and the new process has 21MB RSS. Most of that seems to be TLS/protobuf dependencies, which I guess come in via the OpenCensus libraries. They could also come in via the PCI library, since it can fetch a PCI ID database from the internet if enabled at runtime (it is not enabled in this binary).
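To illustrate why driver binding need not matter for an inventory, here is a minimal sketch that lists PCI devices straight from Linux sysfs. It is not the code in this PR, which uses a PCI library; the type `pciDevice` and the helpers `trimHexID`, `readAttr`, and `listPCIDevices` are hypothetical names invented for this example. The only assumption about the system is the standard `/sys/bus/pci/devices` layout, where each device directory has `vendor`, `device`, and `class` attribute files and a `driver` symlink only when a kernel driver is bound.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// pciDevice holds the identifiers sysfs exposes for one PCI function.
// Hypothetical type for this sketch, not taken from the PR.
type pciDevice struct {
	Address string // e.g. "0000:3b:00.0"
	Vendor  string // hex vendor ID without the "0x" prefix
	Device  string // hex device ID without the "0x" prefix
	Class   string // hex class code without the "0x" prefix
	Driver  string // bound kernel driver, empty if none is bound
}

// trimHexID strips surrounding whitespace and a leading "0x" from a
// sysfs attribute value.
func trimHexID(raw string) string {
	return strings.TrimPrefix(strings.TrimSpace(raw), "0x")
}

// readAttr reads one sysfs attribute file, returning "" if it cannot be read.
func readAttr(dir, name string) string {
	b, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return ""
	}
	return trimHexID(string(b))
}

// listPCIDevices walks root (normally /sys/bus/pci/devices) and returns
// every device found there, whether or not a driver is bound to it.
func listPCIDevices(root string) ([]pciDevice, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	devices := make([]pciDevice, 0, len(entries))
	for _, e := range entries {
		dir := filepath.Join(root, e.Name())
		d := pciDevice{
			Address: e.Name(),
			Vendor:  readAttr(dir, "vendor"),
			Device:  readAttr(dir, "device"),
			Class:   readAttr(dir, "class"),
		}
		// The "driver" symlink only exists when a kernel driver is bound.
		if target, err := os.Readlink(filepath.Join(dir, "driver")); err == nil {
			d.Driver = filepath.Base(target)
		}
		devices = append(devices, d)
	}
	return devices, nil
}

func main() {
	devices, err := listPCIDevices("/sys/bus/pci/devices")
	if err != nil {
		fmt.Fprintln(os.Stderr, "listing PCI devices:", err)
		os.Exit(1)
	}
	for _, d := range devices {
		fmt.Printf("%s vendor=%s device=%s class=%s driver=%q\n",
			d.Address, d.Vendor, d.Device, d.Class, d.Driver)
	}
}
```

A device with no bound driver still shows up here with an empty `driver` field, which is the case the ethtool metric misses.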
Contributor
Author
@Ongy ping
Ongy
requested changes
Jul 21, 2025
Contributor
Ongy
left a comment
Generally looks good. Library/Style nitpicks.
ensonic
reviewed
Jul 21, 2025
    for labels, count := range deviceCounts {
        ch <- prometheus.MustNewConstMetric(c.pciDeviceCount, prometheus.GaugeValue, count, labels[0], labels[1], labels[2], labels[3])
    }
Contributor
Why do you need the map? Are there duplicates?
Contributor
Author
Yes. The map counts up when we have two or more devices of the same type (e.g. a card with multiple NICs).
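A minimal sketch of that aggregation, using only the standard library. The type `pciLabels`, the function `countByLabels`, and the choice of four label values (vendor, device, class, driver) are assumptions made for illustration; the PR's actual label set is not shown in this thread, and the IDs below are illustrative values only. The key is a fixed-size array because arrays are comparable in Go and can be used as map keys, while slices cannot. The count is a `float64` because that is the value type `prometheus.MustNewConstMetric` takes.

```go
package main

import "fmt"

// pciLabels is a hypothetical label tuple: vendor, device, class, driver.
// A fixed-size array is comparable, so it can be used as a map key.
type pciLabels [4]string

// countByLabels tallies how many devices share each label tuple, so that
// identical devices produce one series with a count instead of
// duplicate series.
func countByLabels(devices []pciLabels) map[pciLabels]float64 {
	counts := make(map[pciLabels]float64)
	for _, d := range devices {
		counts[d]++
	}
	return counts
}

func main() {
	// Illustrative values only: two ports of one NIC plus one GPU.
	devices := []pciLabels{
		{"8086", "1572", "020000", "i40e"},
		{"8086", "1572", "020000", "i40e"},
		{"10de", "20b0", "030200", "nvidia"},
	}
	// Map iteration order is not defined, so the lines may print in any order.
	for labels, count := range countByLabels(devices) {
		fmt.Println(labels, count)
	}
}
```

Without the aggregation, two devices with identical label values would produce two samples for the same series, which the Prometheus client rejects at collection time.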
ensonic
reviewed
Jul 21, 2025
Contributor
ensonic
left a comment
Can we easily disable this for the VMs?
Contributor
Author
Not easily. This adds ~10 metrics and #549 removes ~3000, so it should get better rather than worse. Both hw-exporter and node-exporter are controlled by the "disable-prometheus" label selector, if that's in the AppRollout.
Ongy
approved these changes
Jul 21, 2025
ensonic
approved these changes
Jul 21, 2025
The extra metric load should be offset by #549.
Example metrics: