
Improve PSI support #25

Closed · hakavlad opened this issue May 26, 2019 · 40 comments
Labels: enhancement (New feature or request)

Comments

@hakavlad (Owner)

subj

@hakavlad (Owner Author) commented Jun 3, 2019

added psi_excess_duration

hakavlad added the enhancement label on Jun 7, 2019
@hakavlad (Owner Author)

TODO: replace the psi_path option with a psi_target option.

Current:

psi_path = /proc/pressure/memory
psi_path = /sys/fs/cgroup/unified/system.slice/memory.pressure

Proposed:

psi_target = SYSTEM_WIDE
psi_target = /system.slice
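
A minimal sketch of how a psi_target value could be resolved to an actual pressure file; the mount point, helper name, and logic here are assumptions for illustration, not nohang's actual code:

# Sketch: resolve a psi_target value to a PSI pressure file path.
# Assumes cgroup2 is mounted at /sys/fs/cgroup/unified; the helper
# name and logic are illustrative, not nohang's implementation.

CGROUP2_MOUNT = '/sys/fs/cgroup/unified'

def resolve_psi_path(psi_target):
    if psi_target == 'SYSTEM_WIDE':
        return '/proc/pressure/memory'
    # treat any other value as a cgroup-v2 path relative to the mount point
    return CGROUP2_MOUNT + psi_target + '/memory.pressure'

print(resolve_psi_path('SYSTEM_WIDE'))    # /proc/pressure/memory
print(resolve_psi_path('/system.slice'))  # .../system.slice/memory.pressure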

@hakavlad (Owner Author)

Current log output:

Jun 12 03:16:07 user-pc nohang[2157]: PSI avg (89.64) > sigterm_psi_threshold (80.0)
Jun 12 03:16:07 user-pc nohang[2157]: PSI avg exceeded psi_excess_duration (value = 60.0 sec) for 60.0 seconds

Possible improved message:

PSI value was above 80 for 60.7 sec (current value: 88.54)

With psi_target, this could become:

PSI SYSTEM_WIDE full_avg10 was above 80 for 60.7 sec (current value: 88.54)
PSI /system.slice full_avg10 was above 80 for 60.7 sec (current value: 88.54)




@dim-geo commented Aug 14, 2019

A problem with the current PSI support is that nohang relies solely on PSI to decide when to terminate/kill processes.
It killed bees (btrfs dedup) while memory and swap were mostly free.
nohang should also check whether memory usage is increasing or whether available memory is low.
The memory pressure probably happened because I was compiling a new kernel.
Also, I use the MuQSS scheduler with a high-frequency timer (1000 Hz).


2019-08-14T21:14:47+0300 gentoo systemd[1]: Started Highly configurable OOM prevention daemon.
2019-08-14T21:14:47+0300 gentoo nohang[19609]: config: /etc/nohang/nohang.conf
2019-08-14T21:14:47+0300 gentoo nohang[19609]: Monitoring has started!
2019-08-14T21:15:47+0300 gentoo nohang[19609]: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2019-08-14T21:15:47+0300 gentoo nohang[19609]: PSI avg (74.44) > soft_threshold_max_psi (60.0)
2019-08-14T21:15:47+0300 gentoo nohang[19609]: PSI avg exceeded psi_excess_duration (value = 60.0 sec) for 60.1 seconds
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Found 123 processes with existing /proc/[pid]/exe realpath
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Process with highest badness (found in 6 ms):
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   PID: 2570, Name: bees, badness: 56
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Recheck memory levels...
2019-08-14T21:15:47+0300 gentoo nohang[19609]: PSI avg (74.44) > soft_threshold_max_psi (60.0)
2019-08-14T21:15:47+0300 gentoo nohang[19609]: PSI avg exceeded psi_excess_duration (value = 60.0 sec) for 60.1 seconds
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Victim status (found in 0 ms):
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Name:      bees
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   State:     S (sleeping)
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   PID:       2570
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Ancestry:  PID 2373 (beesd) <= PID 1 (systemd)
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   EUID:      0
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   badness:   56, oom_score:  56, oom_score_adj:  0
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   VmSize:    1751 MiB
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   VmRSS:     1093 MiB  (Anon: 1089 MiB, File: 3 MiB, Shmem: 0 MiB)
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   VmSwap:       1 MiB
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   CGroup_v1:
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   CGroup_v2: /system.slice/system-beesd.slice/beesd@-----------------------------------------
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Realpath:  /usr/libexec/bees
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Lifetime:  2 h 2 min 55 sec
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Implement a corrective action:
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Send SIGTERM to the victim; total response time: 7 ms
2019-08-14T21:15:47+0300 gentoo nohang[19609]: The victim doesn't respond on corrective action in 0.026 sec
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Memory status after implementing a corrective action:
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   MemAvailable: 11238.2 MiB, SwapFree: 3154.1 MiB
2019-08-14T21:15:47+0300 gentoo nohang[19609]: Total stat (what happened in the last 1 min 0 sec):
2019-08-14T21:15:47+0300 gentoo nohang[19609]:   Send SIGTERM to bees: 1
2019-08-14T21:15:47+0300 gentoo nohang[19609]: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2019-08-14T21:16:19+0300 gentoo systemd[1]: Stopping Highly configurable OOM prevention daemon...
2019-08-14T21:16:19+0300 gentoo nohang[19609]: Signal handler called with the SIGTERM signal
2019-08-14T21:16:19+0300 gentoo nohang[19609]: Total stat (what happened in the last 1 min 31 sec):
2019-08-14T21:16:19+0300 gentoo nohang[19609]:   Send SIGTERM to bees: 1
2019-08-14T21:16:19+0300 gentoo nohang[19609]: Exit

@hakavlad (Owner Author)

Now the main problem is the lack of documentation.

@hakavlad (Owner Author)

it solely relies on it to kill/terminate processes

What do you suggest instead?

@hakavlad (Owner Author) commented Aug 14, 2019

It killed bees (btrfs dedup) while the memory & swap was mostly free.

That's OK, it's by design. The daemon responds to low memory levels and to exceeded PSI thresholds independently of each other. Nohang kills a process when a PSI threshold is exceeded; that's expected behavior.

@hakavlad (Owner Author)

I forgot to mention that PSI triggers on the desktop are quite dangerous: they require an individual approach and caution when being configured.

@hakavlad (Owner Author)

See also rfjakob/earlyoom#100 (comment)

@hakavlad (Owner Author)

Simple solution: increase the PSI thresholds and psi_excess_duration, or disable PSI checking completely (example config keys below).
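
For example, with config keys like these (the threshold values here are only illustrative, not tested recommendations):

psi_checking_enabled = True
soft_threshold_max_psi = 90
hard_threshold_max_psi = 95
psi_excess_duration = 120

Or, to disable PSI checking entirely:

psi_checking_enabled = False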

@hakavlad (Owner Author) commented Aug 14, 2019

I was compiling a new kernel.

This explains everything: compiling causes long-term memory pressure, the thresholds were exceeded, and the daemon killed a process. Not a bug, just non-optimal settings.

After this precedent, I should probably offer higher default PSI thresholds and add warnings to the docs.

@hakavlad (Owner Author)

@dim-geo Could you show me your /proc/mounts and /proc/self/cgroup, please?

@dim-geo commented Aug 16, 2019

Hi,
due to the MuQSS kernel I am not sure that memory PSI is measured correctly.
It can stay high for many minutes even though processes are idle.

The solution I propose: when PSI monitoring is activated, also check that there is actually a lack of memory before killing. If there is no lack of free swap/memory, ignore the high PSI value.

I will post the mounts and cgroup later.
Thanks!

@hakavlad (Owner Author)

due to muqss kernel I am not sure that psi of memory is measured correctly.

I don't think that this affects the correct operation of PSI.

If no lack of free swap/memory, ignore the high PSI usage

Bad idea. PSI may be high while the system is frozen with SwapFree still high.

These are two independent metrics:

MemAvailable and SwapFree may be low while PSI is about 0. The reverse is also possible: SwapFree = 60% and PSI = 90.

You may try https://github.com/hakavlad/nohang-extra/blob/master/trash/psi-trigger (SwapFree will be about 50% and PSI values will be high)
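
To illustrate that the two metrics come from different sources, here is a minimal sketch (not nohang's actual code) that samples memory PSI avg10 and MemAvailable independently:

# Minimal sketch (not nohang's actual code): read memory PSI avg10
# from /proc/pressure/memory and MemAvailable from /proc/meminfo.

def read_psi_avg10(path='/proc/pressure/memory', kind='some'):
    with open(path) as f:
        for line in f:
            # lines look like: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
            if line.startswith(kind):
                return float(line.split('avg10=')[1].split()[0])
    return None

def read_mem_available_mib():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1]) / 1024  # value is in kB
    return None

print('PSI some avg10:', read_psi_avg10())
print('MemAvailable, MiB:', read_mem_available_mib())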

@dim-geo commented Aug 16, 2019

Hi,

cat /proc/mounts 
proc /proc proc rw,relatime 0 0
none /run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=10240k,nr_inodes=2047179,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,noexec 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/sdb2 / ext4 rw,noatime,nobarrier,commit=100,stripe=32710 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=26,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16937 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /mnt/ram tmpfs rw,nosuid,nodev,relatime,size=20480k,uid=1000,gid=1000 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=1639952k,mode=700,uid=1000,gid=1000 0 0

cat /proc/self/cgroup

0::/user.slice/user-1000.slice/session-1.scope

For the time being I have deactivated PSI.
I will not test the program now because I am doing some disk activity. After that I will run it and report back.

@hakavlad (Owner Author) commented Aug 17, 2019

@dim-geo Thank you for the report! How did you disable cgroup_v1? Did you add systemd.unified_cgroup_hierarchy=1 to the boot cmdline?

@hakavlad (Owner Author)

@dim-geo Do you use SSD or HDD? Do you use zswap or zram?

@dim-geo commented Aug 20, 2019

I have both SSDs and HDDs. I use zram.
I executed the program and it indeed caused high PSI (nohang's PSI checking is still disabled).
I killed it and I am now monitoring PSI; it is still very high after 10 minutes:

some avg10=99.00 avg60=99.00 avg300=90.78 total=725036638

So I'm afraid the PSI measurements are totally off, either due to a kernel bug or MuQSS.

@hakavlad (Owner Author) commented Aug 20, 2019

Do you also use btrfs? See https://bugzilla.kernel.org/show_bug.cgi?id=196729

The problem happen much more frequently when I used BtrFS. After switching to XFS, this happen less frequently (weekly instead of daily).

@dim-geo commented Aug 21, 2019

Yes, I do. I have disabled PSI in the kernel for the time being because it sometimes increases kworker activity. Something is fishy there; I will wait for the next kernels to retest it...

@polarathene commented Jan 4, 2020

@hakavlad MuQSS may still contribute to the problem @dim-geo experiences. It apparently supports some cgroup features but not all (I'm not sure which ones), and Facebook's PSI utilizes cgroups, afaik.

It'd be better to confirm with a different kernel that has proper cgroups support.

EDIT: cgroup support in PSI is optional, but PSI might still use it if the MuQSS kernel advertises cgroups as available even though MuQSS does not fully support them. In that case PSI may not work as expected when it tries to utilize cgroup features that MuQSS lacks proper support for.

@polarathene

Also, Facebook uses PSI with their oomd project, and afaik Facebook is a big user of BTRFS too, so I'd be suspicious of blaming BTRFS.

The linked bug report has nothing to do with PSI or cgroups; besides, the report dates back to the 4.10 kernel (PSI afaik requires 4.20 at minimum?). BTRFS itself has also had significant improvements/fixes since 4.10, including gaining support for swap files.

I can see how XFS may improve things and be more stable than BTRFS, but that does not rule out MuQSS as the cause, especially since @dim-geo has not verified that changing filesystems resolves the issue.

Comment 54 on the bug report also points out the issue is unrelated to BTRFS, citing changes with 4.19 kernel patch versions as well as 5.2/5.3 kernels.

While BTRFS may be a contributing factor in that report, it also seems to potentially be due to a combination of other parts of the system. Other users chime in on that report with similar problems without using BTRFS.

@hakavlad (Owner Author) commented Jan 31, 2020

bc1bb3c

New nohang behavior with psi_checking_enabled=True: do not take corrective action if MemAvailable is above the hard/soft threshold. It works well and prevents false positives (a minimal sketch of this check is shown at the end of this comment).

Demo: https://youtu.be/Y6GJqFE_ke4.

Commands:

$ tail /dev/zero
$ stress -m 8 --vm-bytes 88G
$ for i in {1..5}; do tail /dev/zero; done
$ for i in {1..5}; do (tail /dev/zero &); done

With config keys like these:

psi_checking_enabled = True
psi_excess_duration = 1
psi_post_action_delay = 5
soft_threshold_max_psi  = 5
hard_threshold_max_psi  = 90
ignore_positive_oom_score_adj = True

MemTotal=9.6GiB, zram disksize=47.8GiB.

PSI response is now enabled by default in nohang-desktop.conf.
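
A minimal sketch of the combined check described above; the function name and threshold values are illustrative assumptions, not nohang's actual code:

# Sketch: treat a PSI excess as actionable only when MemAvailable is also
# below its own threshold. Names and values are illustrative.

def should_act(psi_avg, mem_available_mib,
               max_psi=90.0, mem_available_threshold_mib=1024):
    psi_exceeded = psi_avg > max_psi
    mem_low = mem_available_mib < mem_available_threshold_mib
    # act only if both conditions hold: high pressure AND low available memory
    return psi_exceeded and mem_low

print(should_act(95.0, 11238))  # False: PSI is high, but memory is plentiful
print(should_act(95.0, 300))    # True: PSI is high and MemAvailable is low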

@danobi commented Feb 10, 2020

It's relatively unlikely that PSI works correctly with an out of tree scheduler.

@hakavlad (Owner Author)

@danobi Thanks!

@polarathene

Both the MuQSS and BMQ task schedulers stub out the cgroup support they are responsible for (afaik), which has been known to cause issues such as unreliable CPU accounting metrics.

As stated earlier, they're the most likely cause of the issue: unless the cgroup support is explicitly disabled, PSI is probably trying to use those stubbed-out cgroup features and therefore does not work reliably itself.

These schedulers are known to not play well with other software that also utilizes cgroups for CPU activity.

@hakavlad (Owner Author)

@polarathene Thanks!

@hakavlad (Owner Author) commented Mar 7, 2020

Linux-ck with MuQSS provides incorrect PSI metrics: https://imgur.com/a/atIjhUw.
@polarathene @danobi @dim-geo @rfjakob

The CPU metric is always about 100%. After a stress test, some memory is always about 100%. full io is always about 0.

@polarathene

@hakavlad ? I thought I had stated that was the likely cause, since MuQSS and BMQ only provide cgroup stubs for CPU features that PSI utilizes?

Unless you explicitly prevent PSI from using cgroups; I'm not sure if you can do that for PSI alone, or if you have to disable cgroups completely:

In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem mounted, pressure stall information is also tracked for tasks grouped into cgroups.

Doing so would possibly cause problems/regressions for anything else that expects cgroup support (reliable or not). Afaik it's only a portion related to CPUs that is stubbed by those task schedulers, while other cgroup support works as expected, such as for block I/O schedulers like BFQ that utilize different cgroup features.

If someone looks into it further, and can selectively disable it for PSI alone, and that results in better metrics (for kernels using these task schedulers), let me know :)

@hakavlad (Owner Author)

linux-ck with MuQSS

https://wiki.archlinux.org/index.php/Linux-ck
https://aur.archlinux.org/packages/linux-ck/

$ uname -a
Linux user-pc 5.7.2-2-ck #1 SMP PREEMPT Fri, 12 Jun 2020 15:49:41 +0000 x86_64 GNU/Linux
$ zcat /proc/config.gz | grep CONFIG_SCHED
CONFIG_SCHED_MUQSS=y
$ zcat /proc/config.gz | grep PSI
CONFIG_PSI=y
CONFIG_PSI_DEFAULT_DISABLED=y

PSI support is disabled by default; it can be enabled by passing psi=1 on the kernel boot cmdline.

# dmesg | grep scheduler
[    1.044923] MuQSS CPU scheduler v0.202 by Con Kolivas.

psi-top output before load
https://gist.github.com/hakavlad/cea407a6c74b42df033d04d119293d7a

psi2log -m 2 output during load:
https://gist.github.com/hakavlad/40500c989475bc51e90b624d0843fb5b

psi2log -m 1 output after load:
https://gist.github.com/hakavlad/3d2ae89993d32a4514a06ee4d6b67bdd

psi-top output after load:
https://gist.github.com/hakavlad/8feb829e42122a81f9ea6ed3cafd27f3

psi-top -m cpu output after load:
https://gist.github.com/hakavlad/2dd25bc21cd5f7ae616bf4bb81c640c7

Screenshots:
https://imgur.com/a/kGfHhvV

@polarathene

Ah, cool. It's effectively the same as not including PSI support in the kernel though, right?

In my previous post I linked to the PSI cgroup2 interface docs, which do not mention a way to tell PSI not to use cgroups (it will use them if the kernel enabled cgroups via CONFIG_CGROUP=y).

Just to confirm, your tests are with boot parameter psi=1 yes?

@hakavlad (Owner Author)

your tests are with boot parameter psi=1 yes

Of course, psi is enabled by passing psi=1.

I don't know how to disable cgroup_v2 support.

@hakavlad (Owner Author) commented Jun 14, 2020

linux-lqx with MuQSS

https://aur.archlinux.org/packages/linux-lqx/

$ uname -a
Linux user-pc 5.6.18-lqx1-1-lqx #1 ZEN SMP PREEMPT Sat, 13 Jun 2020 09:29:49 +0000 x86_64 GNU/Linux
$ zcat /proc/config.gz | grep CONFIG_SCHED
CONFIG_SCHED_MUQSS=y
$ zcat /proc/config.gz | grep PSI
# CONFIG_PSI is not set

PSI cannot be enabled.

5.6.0-18.1-liquorix-amd64 on Ubuntu 20.04

https://liquorix.net/

$ uname -a
Linux user-pc 5.6.0-18.1-liquorix-amd64 #1 ZEN SMP PREEMPT liquorix 5.6-20ubuntu1~focal (2020-06-10) x86_64 x86_64 x86_64 GNU/Linux
$ _=`uname -r`; cat `find /boot |grep -e "/boot/config-$_"` |grep CONFIG_SCHED_
CONFIG_SCHED_MUQSS=y
$ _=`uname -r`; cat `find /boot |grep -e "/boot/config-$_"` |grep CONFIG_PSI
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
# CONFIG_PSI is not set
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set

PSI cannot be enabled by passing psi=1.

@hakavlad (Owner Author)

linux-pf with BMQ

https://aur.archlinux.org/packages/linux-pf/

$ uname -a
Linux user-pc 5.6.7-pf #1 SMP PREEMPT Sat Jun 13 15:42:08 +09 2020 x86_64 GNU/Linux
$ zcat /proc/config.gz | grep PSI
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
$ zcat /proc/config.gz | grep CONFIG_SCHED
CONFIG_SCHED_BMQ=y

No problem, PSI works fine out of the box.

@hakavlad (Owner Author)

linux-gc with BMQ

https://aur.archlinux.org/packages/linux-gc/

$ uname -a
Linux user-pc 5.7.2-1-gc #1 SMP PREEMPT Sat, 13 Jun 2020 03:25:24 +0000 x86_64 GNU/Linux
$ zcat /proc/config.gz | grep PSI
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
$ zcat /proc/config.gz | grep CONFIG_SCHED
CONFIG_SCHED_BMQ=y

No problem with PSI out of the box.

@hakavlad (Owner Author)

XanMod Kernel on Ubuntu 20.04

https://xanmod.org/

$ uname -a
Linux user-pc 5.6.18-xanmod1 #0 SMP PREEMPT Wed Jun 3 10:28:23 -03 2020 x86_64 x86_64 x86_64 GNU/Linux
$ _=`uname -r`; cat `find /boot |grep -e "/boot/config-$_"` |grep CONFIG_PSI
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
# CONFIG_PSI is not set
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
$ _=`uname -r`; cat `find /boot |grep -e "/boot/config-$_"` |grep CONFIG_SCHED_
CONFIG_SCHED_MUQSS=y

PSI works well out of the box with CONFIG_SCHED_MUQSS=y.

@hakavlad (Owner Author) commented Jun 14, 2020

@polarathene So, I have seen kernels that provide correct PSI metrics with MuQSS (the XanMod kernel) and with BMQ (linux-gc, linux-pf).

@polarathene

@hakavlad did you check CONFIG_CGROUP=y? I have mentioned it twice now. It might be the difference between linux-ck and XanMod? Or the ck patches (which are separate from MuQSS) are causing the issue? Interesting if BMQ is unaffected, as last I knew it was also known to stub out some cgroups.

For Liquorix and XanMod, the output you share shows CONFIG_PSI set twice and not set once (is "not set" defaulting to enabled?). Why are you grepping this way instead of using zcat? (You can also use zgrep PSI /proc/config.gz.)

@hakavlad (Owner Author) commented Jun 14, 2020

CONFIG_CGROUP_CPUACCT is not set in linux-ck. This is the only relevant difference from the other kernels.

CONFIG_CGROUP=y is set in all kernels.

Why are you grepping this way instead of zcat?

There is no /proc/config.gz on Ubuntu, Debian, or Fedora.

hakavlad reopened this on Jun 14, 2020
@hakavlad (Owner Author)

Memory pressure probably happened because I was compiling a new kernel.

@dim-geo Is this new kernel also built without CONFIG_CGROUP_CPUACCT=y?
