memory usage stats are incorrect #46
to expand on this, looking at this: it seems the problem is that we're listing only the main PID, which obviously fails for cases like postgresql or apache (which start multiple processes) or cron jobs (which necessarily start a subprocess). so i guess it's separate from #2 in the sense that it could be fixed by implementing the above TODO and just adding up the memory of all the processes in the slice by hand, without having to reimplement everything with cgroups, which seems to be stalled in #10...
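The "add up the memory of all the processes in the slice by hand" idea could be sketched roughly like this. This is a hypothetical standalone script, not the exporter's actual code; the cgroup path at the bottom is an assumption and depends on the hierarchy in use:

```python
def parse_vmrss_kib(status_text):
    """Extract VmRSS (in KiB) from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # field is reported in kB
    return 0  # kernel threads carry no VmRSS line


def unit_rss_bytes(cgroup_procs_path):
    """Sum RSS over every PID listed in a unit's cgroup.procs file."""
    total_kib = 0
    with open(cgroup_procs_path) as f:
        for pid in f.read().split():
            try:
                with open(f"/proc/{pid}/status") as status:
                    total_kib += parse_vmrss_kib(status.read())
            except FileNotFoundError:
                pass  # process exited between listing and reading
    return total_kib * 1024


# Hypothetical usage (path assumes a unified cgroup v2 layout):
# print(unit_rss_bytes("/sys/fs/cgroup/system.slice/postgresql.service/cgroup.procs"))
```

Note that naively summing per-process RSS double-counts pages shared between the processes (e.g. postgres shared buffers), so this would still not match systemd's own cgroup accounting exactly.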
The README.md file promises "If you've chosen to pack 400 threads and 20 processes inside the mysql.service, we will only export metrics on the service unit, not on the individual tasks.". This is absolutely not true (if I were less charitable I would call it a lie).
I created a merge request #65 to fix the README.md so other people don't rely on information that the exporter does not provide.
We should probably fix this collector so it doesn't work the way it currently does. IMO, we should just delete it until it works the way users expect.
#67 is probably doing what's expected here. |
I've decided that these metrics are not worth maintaining in this exporter. cgroup-based metrics can be gathered using cAdvisor. |
I've opened #87 which exposes systemd's own memory metrics, which are a) accurate, b) cheap for us to obtain :)
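Reading systemd's own accounting can be done without touching /proc at all, e.g. via `systemctl show --property=MemoryCurrent`. A minimal sketch (the helper names here are hypothetical, not part of #87):

```python
import subprocess


def parse_memory_current(show_output):
    """Parse 'MemoryCurrent=<bytes>' from systemctl show output.

    systemd reports the max uint64 value (or '[not set]' on newer
    versions) when memory accounting is unavailable for the unit.
    """
    value = show_output.strip().split("=", 1)[1]
    if value in ("[not set]", str(2**64 - 1)):
        return None
    return int(value)


def unit_memory_current(unit):
    """Ask systemd for its cgroup-level memory usage of a unit, in bytes."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=MemoryCurrent"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_current(out)


# Hypothetical usage:
# unit_memory_current("postgresql.service")
```

Because this number comes straight from the unit's cgroup, it covers all child processes of the service, not just the main PID.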
I set up this exporter to diagnose OOM conditions on a server, but the output it gives me is inconsistent with the stats I'm getting through other systems. In particular, the memory numbers just don't add up to the actual memory usage on the machine.
I'm not sure, but I think this might be related to #2 except that I don't think this is just a small adjustment that can be made to switch to cgroups: the current stats just don't work in any meaningful way, so I think they're just buggy.
just to give an example, right now, postgres is taking up 2.3GB of memory according to systemctl:
... but the exporter is only reporting 21MB RSS and 560MB VSS, so it's obviously way off:
i used this tool to track down this issue we're facing but it seems like, unfortunately, i'll have to look elsewhere...
thanks for any clarification.