
Add hw-exporter #554

Merged
drigz merged 6 commits into main from hw-exporter
Jul 22, 2025

@drigz drigz commented Jul 16, 2025

This exports a list of connected PCI devices, so that we can track which
GPUs and NICs are installed on the fleet. node-exporter doesn't support
anything quite like this: for example, the ethtool metric doesn't handle
NICs that aren't bound to a kernel driver, nor can it identify the
attached GPU.
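
To illustrate how a device with no bound kernel driver can still be listed, here is a minimal sketch that reads Linux sysfs directly. This is an assumption for illustration only: the PR itself uses a PCI library, and the names `pciDevice`, `readAttr`, `boundDriver`, and `listPCIDevices` are hypothetical. It reports raw hex IDs and does not resolve them to names.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// pciDevice holds the raw identifiers that sysfs exposes for one PCI function.
// Hypothetical type for this sketch; not taken from the PR.
type pciDevice struct {
	Address string // e.g. "0000:00:1f.3"
	Vendor  string // hex vendor ID, e.g. "0x8086"
	Product string // hex device ID
	Class   string // hex class code, e.g. "0x020000"
	Driver  string // bound kernel driver, or "" if none
}

// readAttr reads one sysfs attribute file and strips the trailing newline.
func readAttr(dir, name string) (string, error) {
	b, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// boundDriver returns the name of the driver bound to the device in dir.
// A missing "driver" symlink means no driver is bound, which is not an error.
func boundDriver(dir string) (string, error) {
	target, err := os.Readlink(filepath.Join(dir, "driver"))
	if errors.Is(err, fs.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return filepath.Base(target), nil
}

// listPCIDevices enumerates the devices under root,
// which is normally /sys/bus/pci/devices.
func listPCIDevices(root string) ([]pciDevice, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var devs []pciDevice
	for _, e := range entries {
		dir := filepath.Join(root, e.Name())
		d := pciDevice{Address: e.Name()}
		if d.Vendor, err = readAttr(dir, "vendor"); err != nil {
			return nil, err
		}
		if d.Product, err = readAttr(dir, "device"); err != nil {
			return nil, err
		}
		if d.Class, err = readAttr(dir, "class"); err != nil {
			return nil, err
		}
		if d.Driver, err = boundDriver(dir); err != nil {
			return nil, err
		}
		devs = append(devs, d)
	}
	return devs, nil
}

func main() {
	devs, err := listPCIDevices("/sys/bus/pci/devices")
	if err != nil {
		fmt.Fprintln(os.Stderr, "listing PCI devices:", err)
		os.Exit(1)
	}
	for _, d := range devs {
		fmt.Printf("%s vendor=%s product=%s class=%s driver=%q\n",
			d.Address, d.Vendor, d.Product, d.Class, d.Driver)
	}
}
```

An unbound device shows up with an empty `Driver`, which corresponds to the `driver=""` label values in the example metrics.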

The new binary is 9MB, and the new process has 21MB RSS. Most of that
seems to be TLS/protobuf dependencies, which I guess come in via the
OpenCensus libraries. They could also come in via the PCI library, as
it can fetch a PCI ID database from the internet if enabled at runtime
(this is not enabled in this binary).

The extra metric load should be offset by #549.

Example metrics:

# HELP pci_device_count Number of PCI devices by vendor, product, class, and driver.
# TYPE pci_device_count gauge
pci_device_count{class="Bridge",driver="",product="0x7a90",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="",product="0xa700",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="pcieport",product="0x7ab6",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="pcieport",product="Alder Lake-S PCH PCI Express Root Port #13",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="pcieport",product="Alder Lake-S PCH PCI Express Root Port #8",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="pcieport",product="Raptor Lake PCI Express 5.0 Graphics Port (PEG010)",vendor="Intel Corporation"} 1
pci_device_count{class="Bridge",driver="pcieport",product="Raptor Lake PCIe 4.0 Graphics Port",vendor="Intel Corporation"} 1
pci_device_count{class="Communication controller",driver="",product="Alder Lake-S PCH Serial IO UART #0",vendor="Intel Corporation"} 1
pci_device_count{class="Communication controller",driver="mei_me",product="Alder Lake-S PCH HECI Controller #1",vendor="Intel Corporation"} 1
pci_device_count{class="Display controller",driver="nvidia",product="GA104GL [RTX A4000]",vendor="NVIDIA Corporation"} 1
pci_device_count{class="Mass storage controller",driver="ahci",product="Alder Lake-S PCH SATA Controller [AHCI Mode]",vendor="Intel Corporation"} 1
pci_device_count{class="Mass storage controller",driver="nvme",product="IX SN530 NVMe SSD (DRAM-less)",vendor="Sandisk Corp"} 3
pci_device_count{class="Memory controller",driver="",product="Alder Lake-S PCH Shared SRAM",vendor="Intel Corporation"} 1
pci_device_count{class="Multimedia controller",driver="",product="Alder Lake-S HD Audio Controller",vendor="Intel Corporation"} 1
pci_device_count{class="Multimedia controller",driver="",product="GA104 High Definition Audio Controller",vendor="NVIDIA Corporation"} 1
pci_device_count{class="Network controller",driver="atemsys_pci",product="Ethernet Connection (17) I219-LM",vendor="Intel Corporation"} 1
pci_device_count{class="Network controller",driver="igb",product="I210 Gigabit Network Connection",vendor="Intel Corporation"} 1
pci_device_count{class="Network controller",driver="intel-eth-pci",product="0x7aac",vendor="Intel Corporation"} 1
pci_device_count{class="Network controller",driver="intel-eth-pci",product="0x7aad",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH SPI Controller",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH Serial IO I2C Controller #0",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH Serial IO I2C Controller #1",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH Serial IO I2C Controller #2",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH Serial IO I2C Controller #3",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="",product="Alder Lake-S PCH Serial IO SPI Controller #1",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="i801_smbus",product="Alder Lake-S PCH SMBus Controller",vendor="Intel Corporation"} 1
pci_device_count{class="Serial bus controller",driver="xhci_hcd",product="Alder Lake-S PCH USB 3.2 Gen 2x2 XHCI Controller",vendor="Intel Corporation"} 1
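
For readers unfamiliar with the format of the lines above, the following sketch shows how one sample line is laid out: metric name, label pairs sorted by label name, then the value. It is illustrative only. The exporter in this PR relies on the Prometheus Go client for rendering, and `formatSample` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper applies the escaping that the Prometheus text format
// requires inside label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line with labels sorted by name.
// Hypothetical helper for illustration only.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	return name + "{" + strings.Join(pairs, ",") + "} " +
		strconv.FormatFloat(value, 'g', -1, 64)
}

func main() {
	fmt.Println(formatSample("pci_device_count", map[string]string{
		"class":   "Bridge",
		"driver":  "",
		"product": "0x7a90",
		"vendor":  "Intel Corporation",
	}, 1))
}
```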

drigz added 2 commits July 16, 2025 12:39
@drigz drigz requested a review from Ongy July 16, 2025 13:13
drigz commented Jul 21, 2025

@Ongy ping

@Ongy Ongy left a comment

Generally looks good. Library/Style nitpicks.

Comment thread src/go/cmd/hw-exporter/main.go Outdated
Comment thread src/go/cmd/hw-exporter/main.go Outdated
Comment thread src/go/cmd/hw-exporter/main.go
Comment thread src/go/cmd/hw-exporter/main.go Outdated
Comment thread src/go/cmd/hw-exporter/main.go Outdated

for labels, count := range deviceCounts {
	ch <- prometheus.MustNewConstMetric(c.pciDeviceCount, prometheus.GaugeValue, count, labels[0], labels[1], labels[2], labels[3])
}
@ensonic ensonic Jul 21, 2025

why do you need the map? Are there duplicates?

@drigz (Contributor Author)

This counts up if we have two devices of the same type (e.g. a card with multiple NICs).
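
The counting described in this reply can be sketched as follows. This is a minimal illustration under assumptions, not the code from the PR: the `device` and `labelKey` types and the `countDevices` function are hypothetical, and the label order (vendor, product, class, driver) is assumed from the metric's HELP text. A fixed-size array is used as the map key because Go arrays are comparable, and the counts are `float64` so they can be passed as a metric value.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// device is a hypothetical, already-resolved description of one PCI device.
type device struct {
	Vendor, Product, Class, Driver string
}

// labelKey is the label tuple (vendor, product, class, driver).
// The order is an assumption made for this sketch.
type labelKey [4]string

// countDevices groups devices that share the same label tuple, so that
// identical devices produce one series with a count instead of duplicates.
func countDevices(devs []device) map[labelKey]float64 {
	counts := make(map[labelKey]float64)
	for _, d := range devs {
		counts[labelKey{d.Vendor, d.Product, d.Class, d.Driver}]++
	}
	return counts
}

func main() {
	nvme := device{"Sandisk Corp", "IX SN530 NVMe SSD (DRAM-less)", "Mass storage controller", "nvme"}
	gpu := device{"NVIDIA Corporation", "GA104GL [RTX A4000]", "Display controller", "nvidia"}
	counts := countDevices([]device{nvme, nvme, nvme, gpu})

	// Map iteration order is random, so sort the keys for stable output.
	keys := make([]labelKey, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		return strings.Join(keys[i][:], "\x00") < strings.Join(keys[j][:], "\x00")
	})
	for _, k := range keys {
		fmt.Printf("%v -> %v\n", k, counts[k])
	}
}
```

Without the map, three identical NVMe drives would yield three samples with the same label set, which a Prometheus registry rejects as duplicates.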

@ensonic ensonic left a comment

Can we easily disable this for the VMs?

drigz commented Jul 21, 2025

> Can we easily disable this for the VMs?

Not easily. This adds ~10 metrics and #549 removes ~3000, so it should get better rather than worse.

Both hw-exporter and node-exporter are controlled by the "disable-prometheus" label selector if that's in the AppRollout.

@drigz drigz requested review from Ongy and ensonic July 21, 2025 14:30
@drigz drigz enabled auto-merge (squash) July 22, 2025 07:08
@drigz drigz merged commit eb1d54b into main Jul 22, 2025
6 checks passed
@drigz drigz deleted the hw-exporter branch July 22, 2025 07:08