
agent gets oom-killed, manifests as virtiofsd dying #1111

Closed
c3d opened this issue Nov 13, 2020 · 7 comments · Fixed by #1114
Labels
bug (Incorrect behaviour), needs-review (Needs to be assessed by the team)

Comments

@c3d
Member

c3d commented Nov 13, 2020

Get your issue reviewed faster

To help us understand the problem more quickly, please do the following:

  1. Run the kata-collect-data.sh script, which is installed as part of Kata Containers.
    $ sudo kata-collect-data.sh > /tmp/kata.log

I will request this information from the original poster.

Description of problem

The agent can be selected by the Linux OOM killer. When this happens, the symptoms can be very hard to diagnose. In at least one instance, the most visible manifestation was virtiofsd crashing: it crashed because it lost its connection with qemu, and it lost that connection because qemu had exited as a result of the guest shutting down.

Expected result

  1. The kata-agent should be excluded from the OOM killer, so that we at least get some reasonable information in the logs.
  2. When the workload is OOM-killed, the agent should report that properly (see the sketch right after this list).
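
For item 2, a minimal sketch of the kind of check that could back such a report, assuming a cgroup v2 guest; the path and the helper name are made up for illustration and are not what the agent actually does:

use std::fs;

// Hypothetical helper: returns true if the cgroup at `cgroup_path` has
// recorded at least one OOM kill, by reading the cgroup v2 memory.events
// file (lines of the form "oom_kill <count>").
fn workload_was_oom_killed(cgroup_path: &str) -> std::io::Result<bool> {
    let events = fs::read_to_string(format!("{}/memory.events", cgroup_path))?;
    for line in events.lines() {
        let mut fields = line.split_whitespace();
        if fields.next() == Some("oom_kill") {
            if let Some(count) = fields.next().and_then(|v| v.parse::<u64>().ok()) {
                return Ok(count > 0);
            }
        }
    }
    Ok(false)
}

fn main() -> std::io::Result<()> {
    // Example path only; the real agent would use the container's own cgroup.
    let path = "/sys/fs/cgroup/kata-workload";
    if workload_was_oom_killed(path)? {
        eprintln!("workload at {} was OOM-killed", path);
    }
    Ok(())
}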

Actual result

  1. The kata-agent is killed
  2. The guest shuts down
  3. qemu quits
  4. virtiofsd crashes, which is the most visible manifestation.

Further information

After following the instructions here: https://github.com/dgibson/kata-vfio-tools/blob/main/podman.md, you get a crash of virtiofsd that looks like this:

Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000086]    unique: 5569490, success, outsize: 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000086] virtio_send_msg: elem 6: with 1 in desc of length 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Got queue event on Queue 1
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 16 out: 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Waiting for Queue 1 event
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000087] fv_queue_worker: elem 6: with 2 out desc of length 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000087] unique: 5569492, opcode: FLUSH (25), nodeid: 7894, insize: 64, pid: 19865
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000087]    unique: 5569492, success, outsize: 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000087] virtio_send_msg: elem 6: with 1 in desc of length 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Got queue event on Queue 1
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 16 out: 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Waiting for Queue 1 event
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000108] fv_queue_worker: elem 6: with 2 out desc of length 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000108] unique: 5569494, opcode: FLUSH (25), nodeid: 7895, insize: 64, pid: 19865
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000108]    unique: 5569494, success, outsize: 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000108] virtio_send_msg: elem 6: with 1 in desc of length 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Got queue event on Queue 1
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 16 out: 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Waiting for Queue 1 event
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000077] fv_queue_worker: elem 6: with 2 out desc of length 64
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000077] unique: 5569496, opcode: FLUSH (25), nodeid: 7896, insize: 64, pid: 19865
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000077]    unique: 5569496, success, outsize: 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000077] virtio_send_msg: elem 6: with 1 in desc of length 16
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Got queue event on Queue 1
Oct 22 13:18:01 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 16 out: 64
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] fv_queue_set_started: qidx=0 started=0
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000003] fv_queue_thread: kill event on queue 0 - quitting
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] fv_remove_watch: TODO! fd=9
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] fv_queue_set_started: qidx=1 started=0
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000004] fv_queue_thread: kill event on queue 1 - quitting
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] fv_remove_watch: TODO! fd=12
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] virtio_loop: Unexpected poll revents 11
Oct 22 13:18:23 virtiofsd[16470]: [ID: 00000001] virtio_loop: Exit

Investigation shows that this is really the guest agent being OOM-killed, which shuts down the guest and causes qemu to quit:


[ 3053.850497] Out of memory: Killed process 24117 (trinity-c0) total-vm:82928kB, anon-rss:2252kB, file-rss:0kB, shmem-rss:180kB, UID:1000
[ 3053.852054] oom_reaper: reaped process 24117 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:180kB
[ 3054.414057] trinity-main invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 3054.414379] CPU: 1 PID: 24118 Comm: trinity-main Tainted: G        W        --------- -  - 4.18.0-240.el8.x86_64 #1
[ 3054.414564] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.3.0+7638+07cf13d2 04/01/2014
[ 3054.414729] Call Trace:
[ 3054.414785]  dump_stack+0x5c/0x80
[ 3054.414867]  dump_header+0x51/0x308
[ 3054.414936]  oom_kill_process.cold.28+0xb/0x10
[ 3054.415047]  out_of_memory+0x1c1/0x4b0
[ 3054.415121]  __alloc_pages_slowpath+0xc24/0xd40
[ 3054.415227]  __alloc_pages_nodemask+0x245/0x280
[ 3054.415323]  filemap_fault+0x3b8/0x840
[ 3054.415408]  ? alloc_set_pte+0x203/0x480
[ 3054.415479]  ? xas_load+0x8/0x80
[ 3054.415546]  ? xas_find+0x173/0x1b0
[ 3054.415616]  ? filemap_map_pages+0x1a3/0x380
[ 3054.415709]  __do_fault+0x38/0xc0
[ 3054.415785]  do_fault+0x191/0x3c0
[ 3054.415855]  __handle_mm_fault+0x3e6/0x7c0
[ 3054.415926]  handle_mm_fault+0xc2/0x1d0
[ 3054.416007]  __do_page_fault+0x21b/0x4d0
[ 3054.416091]  do_page_fault+0x32/0x110
[ 3054.416162]  ? async_page_fault+0x8/0x30
[ 3054.416231]  async_page_fault+0x1e/0x30
[ 3054.416302] RIP: 0033:0x7f50b27b93eb
[ 3054.416372] Code: Bad RIP value.
[ 3054.416453] RSP: 002b:00007ffd82f88288 EFLAGS: 00010246
[ 3054.416544] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 3054.416685] RDX: 0000000000000000 RSI: 00007f50b28f9bc0 RDI: 0000000000000000
[ 3054.416825] RBP: 00007ffd82f887e0 R08: 00007ffd82f887a7 R09: cccccccccccccccd
[ 3054.416973] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffd82f88800
[ 3054.417114] R13: 0000000000436d08 R14: 00007ffd82f88998 R15: 0000000000000064
[ 3054.417276] Mem-Info:
[ 3054.417349] active_anon:134738 inactive_anon:40188 isolated_anon:0
[ 3054.417349]  active_file:5 inactive_file:17 isolated_file:0
[ 3054.417349]  unevictable:11 dirty:0 writeback:0 unstable:0
[ 3054.417349]  slab_reclaimable:6544 slab_unreclaimable:9099
[ 3054.417349]  mapped:12323 shmem:166773 pagetables:336 bounce:0
[ 3054.417349]  free:13211 free_pcp:0 free_cma:0
[ 3054.417918] Node 0 active_anon:538952kB inactive_anon:160752kB active_file:20kB inactive_file:68kB unevictable:44kB isolated(anon):0kB isolated(file):0kB mapped:49292kB dirty:0kB writeback:0kB shmem:667092kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 8192kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 3054.418428] Node 0 DMA free:7788kB min:372kB low:464kB high:556kB active_anon:3288kB inactive_anon:4264kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:16kB pagetables:32kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3054.418892] lowmem_reserve[]: 0 1854 1854 1854 1854
[ 3054.418990] Node 0 DMA32 free:45056kB min:44680kB low:55848kB high:67016kB active_anon:535564kB inactive_anon:156488kB active_file:64kB inactive_file:0kB unevictable:44kB writepending:0kB present:2080636kB managed:1931188kB mlocked:44kB kernel_stack:2512kB pagetables:1312kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3054.419485] lowmem_reserve[]: 0 0 0 0 0
[ 3054.419573] Node 0 DMA: 4*4kB (M) 4*8kB (UM) 6*16kB (UME) 5*32kB (UM) 1*64kB (E) 2*128kB (U) 4*256kB (UME) 2*512kB (UE) 3*1024kB (UE) 1*2048kB (M) 0*4096kB = 7792kB
[ 3054.419833] Node 0 DMA32: 303*4kB (UME) 181*8kB (UME) 146*16kB (UME) 139*32kB (UME) 47*64kB (UME) 27*128kB (UME) 9*256kB (UME) 7*512kB (UE) 3*1024kB (UME) 4*2048kB (ME) 3*4096kB (UM) = 45348kB
[ 3054.420164] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 3054.420333] Node 0 hugepages_total=512 hugepages_free=504 hugepages_surp=0 hugepages_size=2048kB
[ 3054.420488] 166802 total pagecache pages
[ 3054.420556] 0 pages in swap cache
[ 3054.420624] Swap cache stats: add 0, delete 0, find 0/0
[ 3054.420712] Free swap  = 0kB
[ 3054.420779] Total swap = 0kB
[ 3054.420847] 524157 pages RAM
[ 3054.420912] 0 pages HighMem/MovableOnly
[ 3054.420989] 37383 pages reserved
[ 3054.421071] 0 pages hwpoisoned
[ 3054.421149] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 3054.421312] [  229]     0   229    23506     2835   225280        0             0 systemd-journal
[ 3054.421479] [  242]     0   242    22160     2097   217088        0         -1000 systemd-udevd
[ 3054.421635] [  270]     0   270   479982     9279   323584        0             0 kata-agent
[ 3054.421790] [  285]     0   285     2973      670    69632        0             0 bash
[ 3054.421925] [  301]     0   301     1036      141    45056        0             0 sh
[ 3054.422080] [ 9733]     0  9733     2840      107    61440        0             0 bash
[ 3054.422227] [20159]     0 20159      629       21    45056        0             0 random
[ 3054.422359] [20160]  1000 20160      629       38    45056        0             0 random
[ 3054.422493] [20161]  1000 20161    20732     5985   180224        0             0 trinity-main
[ 3054.422648] [24118]  1000 24118    20732      569    90112        0             0 trinity-main
[ 3054.422808] [24119]  1000 24119    20732      569    90112        0             0 trinity-main
[ 3054.422972] [24120]  1000 24120    20732      569    90112        0             0 trinity-main
[ 3054.423139] [24121]  1000 24121    20732      568    90112        0             0 trinity-main
[ 3054.423297] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=0c0b732bf6cb24ce00761efc8acda4ef8d0e35d5c5bbc33a208de9271baf1bc9,mems_allowed=0,global_oom,task_memcg=/system.slice/kata-agent.service,task=kata-agent,pid=270,uid=0
[ 3054.423683] Out of memory: Killed process 270 (kata-agent) total-vm:1919928kB, anon-rss:23512kB, file-rss:4kB, shmem-rss:13600kB, UID:0
exit
[  OK  ] Stopped target Kata Containers Agent Target.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped target Timers.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Sockets.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
         Unmounting /run/kata-containers/sandbox/shm...
[ 3054.453560] systemd-journald[229]: Successfully sent stream f         ile descriptor tUnmounting Temporary Directory (/tmp)...o service manage
r.
[ 3054.454188] systemd-journald[229]: Successfully sent stream file descriptor to service manager.
         Unmounting /run/sandbox-ns/uts...
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Load Kernel Modules.
[[ 3054.458179] s  OK  ystemd-journald[] 229]: SuccessfulStopped target Slices.ly sent stream f
ile descriptor to service manager.
         Unmounting /run/sandbox-ns/ipc...
         Unmounting /run/kata-containers/shared/containers...
[ 3054.463491] systemd-journald[229]: Successfully sent stream file descriptor to service manager.
         Unmounting /etc/resolv.conf...
[  OK  ] Unmounted /run/kata-containers/sandbox/shm.
[  OK  ] Unmounted /run/sandbox-ns/uts.
[ 3054.469304] systemd-journald[229]: Successfully sent stream file descriptor to service manager.
[  OK  ] Unmounted /run/kata-containers/shared/containers.
[ 3054.472162] systemd-journald[229]: Successfully sent stream file descriptor to service manager.
[  OK  ] Unmounted /etc/resolv.conf.
[  OK  ] Unmounted /run/sandbox-ns/ipc.
[  OK  ] Unmounted Temporary Directory (/tmp).
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Swap.
[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
         Starting Power-Off...
[ 3054.546854] systemd-journald[229]: Successfully sent stream file descriptor to service manager.
[ 3054.560809] printk: systemd-shutdow: 309 output lines suppressed due to ratelimiting
[ 3054.566663] systemd-shutdown[1]: Found cgroup on /sys/fs/cgroup/systemd, legacy hierarchy
[ 3054.570635] systemd-shutdown[1]: Successfully forked off '(sd-sync)' as PID 24143.
[ 3054.570947] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 3054.571638] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 3054.575048] systemd-journald[229]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 3054.575300] systemd-journald[229]: systemd-journald stopped as pid 229
[ 3054.576507] printk: systemd-journal: 10 output lines suppressed due to ratelimiting
[ 3054.576832] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 3054.579337] systemd-shutdown[1]: Unmounting file systems.
[ 3054.580132] systemd-shutdown[1]: Successfully forked off '(sd-remount)' as PID 24144.
[ 3054.580612] [24144]: Remounting '/' read-only in with options 'size=957268k,nr_inodes=239317'.
[ 3054.581233] systemd-shutdown[1]: All filesystems unmounted.
[ 3054.581402] systemd-shutdown[1]: Deactivating swaps.
[ 3054.581551] systemd-shutdown[1]: All swaps deactivated.
[ 3054.597771] ACPI: Preparing to enter system sleep state S5
[ 3054.598415] reboot: Power down
[ 3054.605350] acpi_power_off called
@c3d added the bug (Incorrect behaviour) and needs-review (Needs to be assessed by the team) labels Nov 13, 2020
@jodh-intel jodh-intel added this to To do in Issue backlog Nov 13, 2020
@c3d
Member Author

c3d commented Nov 13, 2020

Solution suggested by Qian: set the oom_score_adj for kata-agent to prevent that from happening.
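
For reference, a minimal sketch of that idea in the agent's language (Rust): the process lowers its own OOM score adjustment at startup. The value -997 and where this would live in the agent are assumptions on my part, not necessarily what the eventual fix does, and lowering the score requires sufficient privileges (the agent runs as root in the guest):

use std::fs;
use std::process;

fn main() -> std::io::Result<()> {
    // Lower this process's OOM score so the kernel prefers other victims.
    // -1000 means "never kill"; -997 keeps the agent killable only as a
    // last resort. The exact value here is an assumption.
    let path = format!("/proc/{}/oom_score_adj", process::id());
    fs::write(&path, "-997")?;
    println!("wrote -997 to {}", path);
    Ok(())
}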

c3d added a commit to c3d/kata-containers that referenced this issue Nov 13, 2020
Under stress, the agent can be OOM-killed, which exits the sandbox.
One possible hard-to-diagnose manifestation is a virtiofsd crash.

Fixes: kata-containers#1111

Reported-by: Qian Cai <caiqian@redhat.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
@jodh-intel jodh-intel moved this from To do to In progress in Issue backlog Nov 13, 2020
@egernst
Member

egernst commented Nov 13, 2020

What pod/container spec are you running with? Are you specifying memory for the workload?

@egernst
Member

egernst commented Nov 13, 2020

If you run an unconstrained, memory-hungry workload, I’m not too surprised we see this. We don’t handle “best effort” particularly well, and deployments will usually set a default CPU/memory limit when none is specified.

@c3d
Member Author

c3d commented Nov 13, 2020

Reproducer, as requested by @egernst on Slack:

podman --runtime=kata-vfio run --security-opt label=type:container_kvm_t -it --rm --cap-add=CAP_IPC_LOCK --device=/dev/vfio/120 --device=/dev/vfio/vfio fedora sh

Now inside the container:

# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make
# ./random -x 0-100 -f     (this just runs some syscall fuzzing)

--memory was not specified in the original test, but I have asked the person who first ran into the problem for that information.

@c3d
Member Author

c3d commented Nov 13, 2020

(@egernst Sorry I initially commented on the PR instead of the issue)

@c3d
Member Author

c3d commented Nov 16, 2020

@egernst I have spent a little more time looking at this.

First, I observe a behavior that I find a little strange:

# podman run -it --runtime /usr/bin/kata-runtime fedora bash
[root@e8e5e1b93cc5 /]# cat /proc/meminfo 
MemTotal:        2025124 kB
[...]

But then if I pass the --memory 1G option I get:

# podman run -it --rm --memory 1G --runtime /usr/bin/kata-runtime fedora bash
[root@45abeb8231a7 /]# cat /proc/meminfo 
MemTotal:        3073700 kB

Even in that case, I see:

[root@45abeb8231a7 /]# cat /proc/self/cgroup 
0::/libpod/45abeb8231a7c5c8fd478c7f8270b9ae22aad3fa27a36e62fc26f5ef542ddcf9
[root@45abeb8231a7 /]# cat /sys/fs/cgroup/libpod/45abeb8231a7c5c8fd478c7f8270b9ae22aad3fa27a36e62fc26f5ef542ddcf9/cgroup.controllers 

(the file is empty)

So there are two things that are a bit unexpected to me:

  1. There is no memory cgroup set up for the workload.
  2. The 1G passed with --memory is added to the VM's base memory rather than applied as a limit.

Maybe --memory is the wrong podman option to use?
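
For completeness, a small sketch of the check I would expect to succeed if a memory cgroup were set up, assuming a cgroup v2 guest (illustration only, not how the runtime itself queries this):

use std::fs;

fn main() -> std::io::Result<()> {
    // cgroup v2: /proc/self/cgroup contains a single "0::<path>" line.
    let cgroup = fs::read_to_string("/proc/self/cgroup")?;
    let rel = cgroup.trim().strip_prefix("0::").unwrap_or("/");
    let dir = format!("/sys/fs/cgroup{}", rel);

    // A memory limit only applies if the memory controller is enabled for
    // this cgroup (it appears in cgroup.controllers) and memory.max is set
    // to something other than "max".
    let controllers =
        fs::read_to_string(format!("{}/cgroup.controllers", dir)).unwrap_or_default();
    println!("controllers: {}", controllers.trim());

    match fs::read_to_string(format!("{}/memory.max", dir)) {
        Ok(max) => println!("memory.max: {}", max.trim()),
        Err(_) => println!("memory controller not enabled for {}", dir),
    }
    Ok(())
}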

@c3d
Member Author

c3d commented Nov 16, 2020

@egernst In any case, I asked the original poster to tell me what /proc/meminfo reports in his configuration. I suspect it's the default, 2G. Unless a memory cgroup is put in place for the workload, I don't see how else we would protect the agent from being OOM-killed. And I prefer having two layers of OOM-killer protection rather than one 😄

Issue backlog automation moved this from In progress to Done Dec 2, 2020