
Kubelet doesn't recognize child process oom kill #78973

Closed
chinglinwen opened this issue Jun 13, 2019 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@chinglinwen

chinglinwen commented Jun 13, 2019

What happened:
There's a php-fpm pod which has many child processes; somehow a child process got OOM killed.

What you expected to happen:
I expected the pod to restart.

How to reproduce it (as minimally and precisely as possible):

  1. Create a pod with the command: stress --cpu 1 --io 1 --vm 1 --vm-bytes 200M --timeout 30s --backoff 3m

  2. kubectl exec into the pod and run an extra command: stress --cpu 1 --io 1 --vm 1 --vm-bytes 200M --timeout 30s --backoff 3000000
    // this extra command will be OOM killed (see journalctl -k -e -f), but the pod keeps running normally

For a normal OOM kill, run the pod with the command below:

stress --cpu 1 --io 1 --vm 1 --vm-bytes 200M --timeout 30s --backoff 3000000

stress.yaml 
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: stress
spec:
  template:
    metadata:
      labels:
        app: stress
    spec:
      nodeSelector:
        #kubernetes.io/hostname: 172.31.81.114
        #kubernetes.io/hostname: kube-test-10-236
      tolerations:
        # Allow the pod to run on the master.
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: stress
          image: progrium/stress
          command: ["sh", "-c", "stress --cpu 1 --io 1 --vm 1 --vm-bytes 200M --timeout 30s --backoff 3000000"]
          #command: [ "/bin/sh", "-c", "echo hello >&2; echo hello1; sleep 3600000" ]
          resources:
            #requests:
            #  cpu: 0.3
            #  memory: 30M
            limits:
              cpu: 1
              memory: 50M

From the test, I found it's related to the php child process: killing the main process does cause the pod to restart correctly.

Jun 13 13:28:28 online-node-81-113 kernel: php invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
Jun 13 13:28:28 online-node-81-113 kernel: php cpuset=4ed8f6475d229008a46f46d7fd1e33d7ff591f176c293cdfaa4ec240619ecddc mems_allowed=0-1
Jun 13 13:28:28 online-node-81-113 kernel: CPU: 28 PID: 25345 Comm: php Not tainted 4.14.15-1.el7.elrepo.x86_64 #1
Jun 13 13:28:28 online-node-81-113 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.3.4 11/08/2016
Jun 13 13:28:28 online-node-81-113 kernel: Call Trace:
Jun 13 13:28:28 online-node-81-113 kernel:  dump_stack+0x63/0x85
Jun 13 13:28:28 online-node-81-113 kernel:  dump_header+0x9f/0x234
Jun 13 13:28:28 online-node-81-113 kernel:  ? mem_cgroup_scan_tasks+0x96/0xf0
Jun 13 13:28:28 online-node-81-113 kernel:  oom_kill_process+0x21c/0x430
Jun 13 13:28:28 online-node-81-113 kernel:  out_of_memory+0x114/0x4a0
Jun 13 13:28:28 online-node-81-113 kernel:  mem_cgroup_out_of_memory+0x4b/0x80
Jun 13 13:28:28 online-node-81-113 kernel:  mem_cgroup_oom_synchronize+0x2f9/0x320
Jun 13 13:28:28 online-node-81-113 kernel:  ? get_mctgt_type_thp.isra.30+0xc0/0xc0
Jun 13 13:28:28 online-node-81-113 kernel:  pagefault_out_of_memory+0x36/0x7c
Jun 13 13:28:28 online-node-81-113 kernel:  mm_fault_error+0x65/0x152
Jun 13 13:28:28 online-node-81-113 kernel:  __do_page_fault+0x456/0x4f0
Jun 13 13:28:28 online-node-81-113 kernel:  do_page_fault+0x38/0x130
Jun 13 13:28:28 online-node-81-113 kernel:  ? page_fault+0x36/0x60
Jun 13 13:28:28 online-node-81-113 kernel:  page_fault+0x4c/0x60
Jun 13 13:28:28 online-node-81-113 kernel: RIP: 0033:0x7f4861d086bf
Jun 13 13:28:28 online-node-81-113 kernel: RSP: 002b:00007ffc7c9fa7d8 EFLAGS: 00010206
Jun 13 13:28:28 online-node-81-113 kernel: RAX: 00007f485b01b040 RBX: 0000000002800000 RCX: 00000000003a9208
Jun 13 13:28:28 online-node-81-113 kernel: RDX: 0000000002800000 RSI: 00007f48592d2000 RDI: 00007f485bad2000
Jun 13 13:28:28 online-node-81-113 kernel: RBP: 00007f4861f33420 R08: 0000000006440000 R09: 00007f485ec1b048
Jun 13 13:28:28 online-node-81-113 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007f485881b040
Jun 13 13:28:28 online-node-81-113 kernel: R13: 00007f485ec1b040 R14: 0000000006400000 R15: 000055e9863099e0
Jun 13 13:28:28 online-node-81-113 kernel: Task in /docker/4ed8f6475d229008a46f46d7fd1e33d7ff591f176c293cdfaa4ec240619ecddc killed as a result of limit of /docker/4ed8f6475d229008a46f46d7fd1e33d7ff591f176c293cdfaa4ec240619ecddc
Jun 13 13:28:28 online-node-81-113 kernel: memory: usage 57344kB, limit 57344kB, failcnt 28
Jun 13 13:28:28 online-node-81-113 kernel: memory+swap: usage 57344kB, limit 114688kB, failcnt 0
Jun 13 13:28:28 online-node-81-113 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Jun 13 13:28:28 online-node-81-113 kernel: Memory cgroup stats for /docker/4ed8f6475d229008a46f46d7fd1e33d7ff591f176c293cdfaa4ec240619ecddc: cache:0KB rss:57344KB rss_huge:49152KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:57344KB inactive_file:0KB active_file:0KB unevictable:0KB
Jun 13 13:28:28 online-node-81-113 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 13 13:28:28 online-node-81-113 kernel: [25345]     0 25345    42688    17068      63       3        0             0 php
Jun 13 13:28:28 online-node-81-113 kernel: Memory cgroup out of memory: Kill process 25345 (php) score 1195 or sacrifice child
Jun 13 13:28:28 online-node-81-113 kernel: Killed process 25345 (php) total-vm:170752kB, anon-rss:57176kB, file-rss:11096kB, shmem-rss:0kB
Jun 13 13:28:28 online-node-81-113 kernel: oom_reaper: reaped process 25345 (php), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

For more OOM logs see https://pastebin.com/ECdhtg2z (the OOM keeps happening repeatedly).

Cluster information:

Kubernetes version: v1.14.1
Cloud being used: (put bare-metal if not on a public cloud)
Installation method: kubeadm
Host OS: CentOS Linux release 7.4.1708, kernel 4.14.15-1.el7.elrepo.x86_64
CNI and version: kube-router:v0.3.0
CRI and version: 18.06.2-ce

@chinglinwen chinglinwen added the kind/bug Categorizes issue or PR as related to a bug. label Jun 13, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 13, 2019
@chinglinwen
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2019
@chinglinwen chinglinwen changed the title Kubelet doesn't recognize child process oom Kubelet doesn't recognize child process oom kill Jun 13, 2019
@mattjmcnaughton
Contributor

Thanks for sharing this issue!

To be sure I'm understanding, the problem you're describing is the following: the container has a main process, let's call it ProcessA, which is functioning normally. There is an additional process, let's call it ProcessB, which is also running in the container but was started through an exec. When ProcessB is killed with an OOM error and ProcessA continues to run, the pod is not restarted, but you would expect that it would be?

Does this expectation arise from the documentation? Or are you requesting that k8s work that way?

@chinglinwen
Author

I expect that the pod would be restarted (though I'm not sure; I think you have more experience with this), so I'm surfacing the issue here.

In any case, I just ran into the problem and expected the pod to be restarted so we could notice the problem earlier; otherwise the pod needs to be restarted manually (as we currently do, and it happens frequently).

I'd like to hear your thoughts.

From reading the code, is it only when the pod's main process exits that the OOM event is received by docker-ce and then passed on to the kubelet?

Will this parameter help?

--experimental-kernel-memcg-notification    If enabled, the kubelet will integrate with the kernel memcg notification to determine if memory eviction thresholds are crossed rather than polling.

@mattjmcnaughton

@mattjmcnaughton
Contributor

I would like someone else to confirm, but I believe that a container will only be considered a "failure" if its main process fails. I wonder, is there any way that we could use a liveness probe here to accomplish your aims? I think that's what k8s typically uses for custom health checks. Alternatively, I wonder if there's also a way to make the failure of the child process lead to the failure of the parent process?

I don't believe that the --experimental-kernel-memcg-notification flag will make any difference. As far as I can tell, it affects how the kernel notifies the kubelet when memory eviction thresholds are crossed, not how OOM kills of additional processes are handled.
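
To make the liveness-probe idea concrete, here is a minimal sketch of what such a probe could look like for a php-fpm container. The container name, image, probe timings, and the pgrep-based check are illustrative assumptions only (pgrep requires procps in the image, and the process pattern has to match how your php-fpm workers actually appear in the process list):

    spec:
      containers:
        - name: php-fpm                # hypothetical container name
          image: php:7.3-fpm           # hypothetical image
          livenessProbe:
            exec:
              # Fail the probe when no php-fpm worker process can be found.
              # Assumes pgrep (from procps) exists in the image.
              command: ["sh", "-c", "pgrep -f 'php-fpm: pool' > /dev/null"]
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3

Whether a process check like this is the right health signal depends on the application; the point is only that a failing exec probe makes the kubelet restart the container even though pid 1 keeps running.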

@mattjmcnaughton
Contributor

One last thought - I think that you may get the quickest answer if you close this issue and instead ask your question in the #sig-node channel on the k8s slack. I've found it to be slightly quicker for these types of problems.

@chinglinwen
Author

chinglinwen commented Jun 16, 2019

Thanks, your reply is very helpful (I'll try Slack too).

I'll consider trying such a liveness probe; from a brief look it doesn't seem easy (I don't see any clue how to do it for php-fpm, and I actually don't know PHP well; it's my colleague who writes the service in PHP).

Using a liveness probe feels somewhat like a workaround; maybe we can solve this once for all similar cases (rather than implementing a more complex liveness probe everywhere?). In other words, is getting notified of the cgroup OOM event the right way to solve it? I'm not sure though.

I would like someone else to confirm

I have the same thought, which is why I created this issue (to surface it and allow better discussion).

@BenTheElder
Member

kubectl exec is mostly a debugging tool; the main process in a container is what affects restart (besides health checks).

If your main process is forking, it should either fail when a critical child fails or restart the child itself, or you need to use a liveness probe.

It's expected that child processes might exit for various reasons and it would be a breaking change to restart pods when a child process exits.

I'm having some trouble finding a documentation page that calls this out specifically.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/ talks about the lifecycle and termination but doesn't explicitly state that this is the main process.

However, the same is true of eg docker run --restart ... and docker exec vs kubectl create deployment.yaml and kubectl exec ... 🤔
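
To illustrate the "main process should fail when a critical child fails" pattern, here is a rough sketch (the pod name, image, and worker command are placeholders, not anything from this thread) of a pod whose pid 1 shell exits with its child's status, so an OOM-killed child takes the whole container down and the kubelet restarts it:

apiVersion: v1
kind: Pod
metadata:
  name: fail-with-child                    # hypothetical name
spec:
  restartPolicy: Always
  containers:
    - name: app
      image: example.com/my-worker:latest  # placeholder image
      # The shell is pid 1: it starts the worker as a background child and
      # exits with the worker's exit status. If the worker is OOM killed
      # (exit code 137), the container fails and is restarted under
      # restartPolicy: Always.
      command: ["sh", "-c", "my-worker & wait $!"]
      resources:
        limits:
          memory: 50M

Note that this only covers children spawned by the main process itself; anything started via kubectl exec stays outside the container's restart semantics, as described above.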

@BenTheElder
Member

see also this previous issue #50632 (comment)

@chinglinwen
Author

I'd be curious how this is going:

a) how to surface the event when a particular container breached its memory limit and had processes killed (if the pod doesn't terminate itself from the oom-killed process) so admins know why things aren't working.

// maybe some way of using an annotation to change the OOM behavior?

#50632 (comment)

@kellycampbell @BenTheElder

@BenTheElder
Member

@kubernetes/sig-node-feature-requests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 25, 2019
@desaintmartin
Member

This is potentially related to containerd/cgroups#74 as well.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Akiqqqqqqq

Akiqqqqqqq commented Jun 28, 2022

I encountered the same issue: a child process got OOM killed by the Linux OOM killer, but the pid=1 process kept running. I repeated the experiment a couple of times, only to find that sometimes the child process got killed (the same as above), and sometimes the pid=1 process got killed and the pod was OOMKilled, which is pretty weird.
So I started wondering whether this issue is really about the kubelet at all. I went through the kubelet and containerd source code and found nothing beyond epoll notification of OOM events from the cgroup.
I started to suspect the Linux OOM killer, then I found these articles:
https://cloud.tencent.com/developer/article/1169107
https://news.ycombinator.com/item?id=20620545

My kernel is 4.4, and it's still not clear to me after reading the oom_kill.c code.
It seems that which process gets killed isn't deterministic. Reply to me if I'm wrong, thanks.
