zoneinfo: 'per-node stats' sections seem to confuse the parser #386

Closed
knweiss opened this issue Jun 11, 2021 · 2 comments

knweiss commented Jun 11, 2021

On RHEL 8.3 and 8.4 kernels there seems to be a parsing issue in the zoneinfo collector of node-exporter 1.1.2, which is based on procfs.

Example:

# uname -a
Linux rhel83 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
# grep -E '(managed|^Node)' /proc/zoneinfo 
Node 0, zone      DMA
        managed  3840
Node 0, zone    DMA32
        managed  580234
Node 0, zone   Normal
        managed  45882525
Node 0, zone  Movable
        managed  0
Node 0, zone   Device
        managed  0
Node 1, zone      DMA
        managed  0
Node 1, zone    DMA32
        managed  0
Node 1, zone   Normal
        managed  46688852
Node 1, zone  Movable
        managed  0
Node 1, zone   Device
        managed  0
# curl -o metrics http://localhost:9100/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  151k    0  151k    0     0  3360k      0 --:--:-- --:--:-- --:--:-- 3360k
# grep node_zoneinfo_managed metrics
# HELP node_zoneinfo_managed_pages Present pages managed by the buddy system
# TYPE node_zoneinfo_managed_pages gauge
node_zoneinfo_managed_pages{node="0",zone=""} 3840
node_zoneinfo_managed_pages{node="0",zone="DMA32"} 580234
node_zoneinfo_managed_pages{node="0",zone="Device"} 0
node_zoneinfo_managed_pages{node="0",zone="Movable"} 0
node_zoneinfo_managed_pages{node="0",zone="Normal"} 4.5882525e+07
node_zoneinfo_managed_pages{node="1",zone=""} 4.6688852e+07
node_zoneinfo_managed_pages{node="1",zone="DMA"} 0
node_zoneinfo_managed_pages{node="1",zone="DMA32"} 0
node_zoneinfo_managed_pages{node="1",zone="Device"} 0
node_zoneinfo_managed_pages{node="1",zone="Movable"} 0

Note that there is no zone="Normal" label for node 1 and no zone="DMA" label for node 0! Both series show up with an empty zone label instead.
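A quick way to check a scrape for this symptom is to look for `node_zoneinfo_*` samples whose `zone` label is empty. The following is only a sketch: `emptyZoneSeries` and `sampleMetrics` are hypothetical names, and the input is copied from the first lines of the output above rather than fetched from a live exporter.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleMetrics holds a few lines taken from the scrape shown in this issue.
const sampleMetrics = `# TYPE node_zoneinfo_managed_pages gauge
node_zoneinfo_managed_pages{node="0",zone=""} 3840
node_zoneinfo_managed_pages{node="0",zone="DMA32"} 580234
node_zoneinfo_managed_pages{node="0",zone="Normal"} 4.5882525e+07
node_zoneinfo_managed_pages{node="1",zone=""} 4.6688852e+07
node_zoneinfo_managed_pages{node="1",zone="DMA"} 0
`

// emptyZoneSeries (hypothetical helper) returns every node_zoneinfo_* sample
// line that carries an empty zone label. Comment lines start with '#' and
// therefore never match the prefix check.
func emptyZoneSeries(metrics string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "node_zoneinfo_") && strings.Contains(line, `zone=""`) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	for _, line := range emptyZoneSeries(sampleMetrics) {
		fmt.Println("empty zone label:", line)
	}
}
```

On the sample above this reports the node 0 series with value 3840 and the node 1 series with value 4.6688852e+07, which are the two zones missing from the labelled output.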

From a quick look I suspect this is caused by the "per-node stats" lines of /proc/zoneinfo. The parser resets zoneinfoElement.Zone when it sees such a line (the following numbers are from a different run):

[...]
Node 0, zone   Normal
  pages free     47098578
        min      2448054
        low      3060067
        high     3672080
        spanned  49545216
        present  49545216
        managed  46373028
        protection: (0, 0, 0, 0, 0)
[...]
Node 1, zone   Normal
  per-node stats                                                    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      nr_inactive_anon 1947
[...]
      nr_kernel_misc_reclaimable 0
  pages free     48669597
        min      2464943
        low      3081178
        high     3697413
        spanned  50331648
        present  50331648
        managed  46692949
        protection: (0, 0, 0, 0, 0)
Parser excerpt (parseZoneinfo):

func parseZoneinfo(zoneinfoData []byte) ([]Zoneinfo, error) {

        zoneinfo := []Zoneinfo{}

        zoneinfoBlocks := bytes.Split(zoneinfoData, []byte("\nNode"))
        for _, block := range zoneinfoBlocks {
                var zoneinfoElement Zoneinfo
                lines := strings.Split(string(block), "\n")
                for _, line := range lines {

                        if nodeZone := nodeZoneRE.FindStringSubmatch(line); nodeZone != nil {
                                zoneinfoElement.Node = nodeZone[1]
                                zoneinfoElement.Zone = nodeZone[2]
                                continue
                        }
                        if strings.HasPrefix(strings.TrimSpace(line), "per-node stats") {
                                zoneinfoElement.Zone = ""
                                continue
                        }
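One way to avoid losing the zone name would be to skip the "per-node stats" line without resetting anything. The sketch below is not the procfs implementation and not necessarily the fix that was merged; `zoneManaged`, `parseManaged`, `headerRE`, and `sampleZoneinfo` are hypothetical names, only the `managed` field is parsed, and the input is an abbreviated, illustrative sample built from values quoted in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// zoneManaged is a hypothetical, trimmed-down stand-in for procfs's Zoneinfo.
type zoneManaged struct {
	Node    string
	Zone    string
	Managed uint64
}

// headerRE matches block headers such as "Node 0, zone   Normal".
var headerRE = regexp.MustCompile(`^Node\s+(\d+),\s+zone\s+(\w+)`)

// sampleZoneinfo is an abbreviated, illustrative /proc/zoneinfo excerpt.
const sampleZoneinfo = `Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 1947
        managed  3840
Node 0, zone    DMA32
        managed  580234
Node 1, zone   Normal
  per-node stats
      nr_inactive_anon 1947
        managed  46692949
`

// parseManaged reads the "managed" value of every zone block. A
// "per-node stats" line is skipped and leaves the current zone name alone.
func parseManaged(data string) ([]zoneManaged, error) {
	var out []zoneManaged
	for _, line := range strings.Split(data, "\n") {
		if m := headerRE.FindStringSubmatch(line); m != nil {
			out = append(out, zoneManaged{Node: m[1], Zone: m[2]})
			continue
		}
		if strings.HasPrefix(strings.TrimSpace(line), "per-node stats") {
			continue // keep the zone name taken from the header line
		}
		fields := strings.Fields(line)
		if len(out) == 0 || len(fields) != 2 || fields[0] != "managed" {
			continue // every other line is ignored in this sketch
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", line, err)
		}
		out[len(out)-1].Managed = v
	}
	return out, nil
}

func main() {
	zones, err := parseManaged(sampleZoneinfo)
	if err != nil {
		panic(err)
	}
	for _, z := range zones {
		fmt.Printf("node=%s zone=%s managed=%d\n", z.Node, z.Zone, z.Managed)
	}
}
```

With this sample, all three blocks keep their zone names (DMA, DMA32, Normal), including the two that contain a "per-node stats" section.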
knweiss changed the title from "zoneinfo: 'per-node stats' sections seems to confuse the parser" to "zoneinfo: 'per-node stats' sections seem to confuse the parser" on Jun 11, 2021


knweiss commented Jun 15, 2021

Here's a complete /proc/zoneinfo file: zoneinfo.txt

@binjip978
Contributor

@discordianfish I can look at it

SuperQ closed this as completed Jul 23, 2021
SuperQ added a commit to prometheus/node_exporter that referenced this issue Jul 23, 2021
* [BUGFIX] Fix zoneinfo parsing prometheus/procfs#386
* [BUGFIX] Fix nvme collector log noise #2091
* [BUGFIX] Fix rapl collector log noise #2092

Signed-off-by: Ben Kochie <superq@gmail.com>
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Aug 19, 2022
Changes:
## 1.3.1 / 2021-12-01

* [BUGFIX] Handle nil CPU thermal power status on M1
* [BUGFIX] bsd: Ignore filesystems flagged as MNT_IGNORE.
* [BUGFIX] Sanitize UTF-8 in dmi collector

## 1.3.0 / 2021-10-20

NOTE: In order to support globs in the textfile collector path, filenames exposed by
      `node_textfile_mtime_seconds` now contain the full path name.

* [CHANGE] Add path label to rapl collector
* [CHANGE] Exclude filesystems under /run/credentials
* [CHANGE] Add TCPTimeouts to netstat default filter
* [FEATURE] Add lnstat collector for metrics from /proc/net/stat/
* [FEATURE] Add darwin powersupply collector
* [FEATURE] Add support for monitoring GPUs on Linux
* [FEATURE] Add Darwin thermal collector
* [FEATURE] Add os release collector
* [FEATURE] Add netdev.address-info collector
* [FEATURE] Add clocksource metrics to time collector
* [ENHANCEMENT] Support glob textfile collector directories
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering
* [ENHANCEMENT] Add flag to disable guest CPU metrics
* [ENHANCEMENT] Add DMI collector
* [ENHANCEMENT] Add threads metrics to processes collector
* [ENHANCEMENT] Reduce timer GC delays in the Linux filesystem collector
* [ENHANCEMENT] Add TCPTimeouts to netstat default filter
* [ENHANCEMENT] Use SysctlTimeval for boottime collector on BSD
* [BUGFIX] ethtool: Sanitize metric names
* [BUGFIX] Fix ethtool collector for multiple interfaces
* [BUGFIX] Fix possible panic on macOS
* [BUGFIX] Collect flag_info and bug_info only for one core
* [BUGFIX] Prevent duplicate ethtool metric names

## 1.2.2 / 2021-08-06

* [BUGFIX] Fix processes collector long int parsing

## 1.2.1 / 2021-07-23

* [BUGFIX] Fix zoneinfo parsing prometheus/procfs#386
* [BUGFIX] Fix nvme collector log noise
* [BUGFIX] Fix rapl collector log noise

## 1.2.0 / 2021-07-15

NOTE: Ignoring invalid network speed will be the default in 2.x
NOTE: Filesystem collector flags have been renamed. `--collector.filesystem.ignored-mount-points` is now `--collector.filesystem.mount-points-exclude` and `--collector.filesystem.ignored-fs-types` is now `--collector.filesystem.fs-types-exclude`. The old flags will be removed in 2.x.

* [CHANGE] Rename filesystem collector flags to match other collectors
* [CHANGE] Make node_exporter print usage to STDOUT
* [FEATURE] Add conntrack statistics metrics
* [FEATURE] Add ethtool stats collector
* [FEATURE] Add flag to ignore network speed if it is unknown
* [FEATURE] Add tapestats collector for Linux
* [FEATURE] Add nvme collector
* [ENHANCEMENT] Add ErrorLog plumbing to promhttp
* [ENHANCEMENT] Add more Infiniband counters
* [ENHANCEMENT] netclass: retrieve interface names and filter before parsing
* [ENHANCEMENT] Add time zone offset metric
* [BUGFIX] Handle errors from disabled PSI subsystem
* [BUGFIX] Fix panic when using backwards compatible flags
* [BUGFIX] Fix wrong value for OpenBSD memory buffer cache
* [BUGFIX] Only initiate collectors once
* [BUGFIX] Handle small backwards jumps in CPU idle

## 1.1.2 / 2021-03-05

* [BUGFIX] Handle errors from disabled PSI subsystem
* [BUGFIX] Sanitize strings from /sys/class/power_supply
* [BUGFIX] Silence missing netclass errors

## 1.1.1 / 2021-02-12

* [BUGFIX] Fix ineffassign issue
* [BUGFIX] Fix some noisy log lines

## 1.1.0 / 2021-02-05

NOTE: We have improved some of the flag naming conventions (PR #1743). The old names are
      deprecated and will be removed in 2.0. They will continue to work for backwards
      compatibility.

* [CHANGE] Improve filter flag names
* [CHANGE] Add btrfs and powersupplyclass to list of exporters enabled by default
* [FEATURE] Add fibre channel collector
* [FEATURE] Expose cpu bugs and flags as info metrics.
* [FEATURE] Add network_route collector
* [FEATURE] Add zoneinfo collector
* [ENHANCEMENT] Add more InfiniBand counters
* [ENHANCEMENT] Add flag to aggr ipvs metrics to avoid high cardinality metrics
* [ENHANCEMENT] Adding backlog/current queue length to qdisc collector
* [ENHANCEMENT] Include TCP OutRsts in netstat metrics
* [ENHANCEMENT] Add pool size to entropy collector
* [ENHANCEMENT] Remove CGO dependencies for OpenBSD amd64
* [ENHANCEMENT] bcache: add writeback_rate_debug stats
* [ENHANCEMENT] Add check state for mdadm arrays via node_md_state metric
* [ENHANCEMENT] Expose XFS inode statistics
* [ENHANCEMENT] Expose zfs zpool state
* [ENHANCEMENT] Added an ability to pass collector.supervisord.url via SUPERVISORD_URL environment variable
* [BUGFIX] filesystem_freebsd: Fix label values
* [BUGFIX] Fix various procfs parsing errors
* [BUGFIX] Handle no data from powersupplyclass
* [BUGFIX] udp_queues_linux.go: change upd to udp in two error strings
* [BUGFIX] Fix node_scrape_collector_success behaviour
* [BUGFIX] Fix NodeRAIDDegraded to not use a string rule expression
* [BUGFIX] Fix node_md_disks state label from fail to failed
* [BUGFIX] Handle EPERM for syscall in timex collector
* [BUGFIX] bcache: fix typo in a metric name
* [BUGFIX] Fix XFS read/write stats (prometheus/procfs#343)
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
* [BUGFIX] Fix zoneinfo parsing prometheus/procfs#386
* [BUGFIX] Fix nvme collector log noise prometheus#2091
* [BUGFIX] Fix rapl collector log noise prometheus#2092

Signed-off-by: Ben Kochie <superq@gmail.com>