Skip to content

add support for monitoring thp, ballooning, zswap, ksm cow#15000

Merged
ktsaou merged 6 commits intonetdata:masterfrom
ktsaou:vmstat-thp
May 3, 2023
Merged

add support for monitoring thp, ballooning, zswap, ksm cow#15000
ktsaou merged 6 commits intonetdata:masterfrom
ktsaou:vmstat-thp

Conversation

@ktsaou
Copy link
Copy Markdown
Member

@ktsaou ktsaou commented May 2, 2023

Modified the vmstat module of proc module to monitor:

  • transparent huge pages (THP, many charts, including allocations, compaction, splitting, swapout, etc)
  • memory ballooning (when Linux runs as guest VM on a host that supports memory ballooning)
  • zswap operations
  • KSM copy-on-write operations (COW)

Especially the THP is very important since it seems enabled on most Linux distros by default.

@ktsaou ktsaou requested a review from thiagoftsm as a code owner May 2, 2023 22:07
@github-actions github-actions Bot added area/collectors Everything related to data collection collectors/proc labels May 2, 2023
ilyam8
ilyam8 previously approved these changes May 3, 2023
Copy link
Copy Markdown
Member

@ilyam8 ilyam8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but a small note:

Usually having "events/s" as units is a red flag because it means we are aggregating (chart level) non-aggregatable metrics - grouping by host/system w/o selecting a dimension (the Cloud UI) will give useless values. E.g.

thp_swpout
is incremented every time a huge page is swapout in one
piece without splitting.

thp_swpout_fallback
is incremented if a huge page has to be split before swapout.
Usually because failed to allocate some continuous swap space
for the huge page.

The sum of these events doesn't seem useful/correct to me.


And this PR is 15000th contribution to this repo 🎉

@ktsaou
Copy link
Copy Markdown
Member Author

ktsaou commented May 3, 2023

Usually having "events/s" as units is a red flag because it means we are aggregating (chart level) non-aggregatable metrics - grouping by host/system w/o selecting a dimension (the Cloud UI) will give useless values

How do you suggest to do it?
I am not sure why they are useless when grouped. Why they are?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented May 3, 2023

I propose to go with the current implementation, which is why I approved the PR. We do such aggregations pretty often. By "such aggregations" I mean correct when grouped by dimension, wrong when grouped by anything else without selecting a dimension (some obvious examples: system load average, system pressure).

cc @ralphm

Why they are?

I don't find them useful because the sum of:

  • huge page swapout events
  • huge page split events

doesn't seem to make sense to me, they look like 2 completely different metrics. But my understanding can be wrong because it is based only on metrics description, I am not a specialist.

I am not a specialist

If thp_swpout_fallback means a huge page swapout event and not only split, and this swapout is not accounted in thp_swpout then all good.

@ktsaou ktsaou merged commit 50183dc into netdata:master May 3, 2023
@ktsaou
Copy link
Copy Markdown
Member Author

ktsaou commented May 3, 2023

@cakrit this needs to go into the release notes. The description is in the document I shared with you about hugepages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/collectors Everything related to data collection collectors/proc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants