Nodes may die due to possible memory leak caused by emptyDir.medium.Memory #72759

Open
eytan-avisror opened this Issue Jan 10, 2019 · 2 comments


eytan-avisror commented Jan 10, 2019

/sig node

Possibly related to:
#45419
#72294

What happened:
Summary:
It seems that when using emptyDir with medium: Memory, some memory may be leaked, eventually causing the Docker daemon to hang and the node to become NotReady with PLEG issues.

volumes:
- emptyDir:
    medium: Memory
    sizeLimit: "10Mi"

This happens if a pod with such a mount keeps crashing over a long period of time (days).

Use case:
Before the main container starts, an init container pulls application secrets and places them in /etc/secrets (which is backed by /dev/shm on the node; the same volume is also mounted into the main container).
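
For context, this is roughly how one can check from inside a container that the secrets volume is memory-backed (a rough sketch; the exact mount options and size shown depend on the kubelet and kernel):

# Inside the init or main container: the emptyDir with medium: Memory
# appears as a tmpfs mount at the declared mountPath.
mount | grep /etc/secrets
df -h /etc/secrets
# Note: the reported tmpfs size may be the kernel default (roughly half of
# the node's RAM) rather than the pod-level sizeLimit; see the discussion
# of #63641 further down.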

Problem:

If for some reason the pod starts crashing, the node will eventually die.

Reproduce:

I was able to reproduce this issue by using dd to write into the shared-memory volume from the init container. In the repro example, the pod sometimes gets evicted (due to sizeLimit) and other times gets OOMKilled or goes into CrashLoopBackOff (not sure why evictions are not consistent).
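
For clarity, this is roughly what the repro's init container executes (see the full spec below); a single write already exceeds the 10Mi sizeLimit:

# Equivalent of the init container command in the repro spec below: write
# zeroes into the memory-backed emptyDir mounted at /etc/secrets.
dd if=/dev/zero of=/etc/secrets/zero bs=60M
# The data lands in tmpfs (i.e. node memory); depending on timing, the pod
# is either evicted for exceeding the 10Mi sizeLimit or OOMKilled first.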

Example:

repro-limited-pub-f45bcfd77-55ltc                                      0/1       Evicted                 0          14m
repro-limited-pub-f45bcfd77-6gqrl                                      0/1       Init:CrashLoopBackOff   1          17m
repro-limited-pub-f45bcfd77-6ts2w                                      0/1       Init:OOMKilled          0          31s
repro-limited-pub-f45bcfd77-7qs2q                                      0/1       Evicted                 0          4m
repro-limited-pub-f45bcfd77-7vxmx                                      0/1       Evicted                 0          3m
repro-limited-pub-f45bcfd77-89ggp                                      0/1       Init:OOMKilled          0          28s
repro-limited-pub-f45bcfd77-8d5kj                                      0/1       Evicted                 0          8m
repro-limited-pub-f45bcfd77-8zxpb                                      0/1       Evicted                 0          11m
repro-limited-pub-f45bcfd77-9bmmh                                      0/1       Evicted                 0          8m

We believe this is unrelated to init containers; it will also happen with a main container if it crashes for long enough. Using a crashing init container simply reproduces the issue faster, since there is no backoff for init containers.

In our pre-prod environment we saw this issue after pods had crashed thousands of times over a period of days; some setups can reproduce it faster, as it also seems to depend on how much data is written to /dev/shm.

Pre crash symptoms:

System OOMs encountered on node

Warning  SystemOOM                18m (x2 over 18m)  kubelet, ip-xxx-xxx-xxx-xxx.us-west-2.compute.internal  System OOM encountered

The pause container process gets killed by the OOM killer before dd, the process that is actually consuming the memory!

kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
kernel: [23551]     0 23551      254        1       4        0          -998 pause
kernel: [23844]     0 23844    16464    15500      38        0          -998 dd
kernel: Memory cgroup out of memory: Kill process 23551 (pause) score 0 or sacrifice child
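
For anyone triaging similar symptoms, a hedged sketch of commands to inspect the crashing pod's memory cgroup and OOM score adjustments on the node (cgroup v1 and the Docker runtime assumed, as on the nodes above; the pod UID path segment is a placeholder):

# Check the OOM score adjustment of the pause and dd processes; both run at
# -998 in the kernel log above, so either can be picked first.
for pid in $(pgrep -x pause; pgrep -x dd); do
  printf '%s %s oom_score_adj=%s\n' "$pid" \
    "$(cat /proc/$pid/comm)" "$(cat /proc/$pid/oom_score_adj)"
done

# Inspect the pod-level memory cgroup (cgroup v1 layout; fill in the real pod
# UID, and add a burstable/ path segment if the pod is not Guaranteed QoS).
POD_CGROUP="/sys/fs/cgroup/memory/kubepods/pod<pod-uid>"   # placeholder path
cat "$POD_CGROUP/memory.usage_in_bytes" "$POD_CGROUP/memory.limit_in_bytes"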

Post crash symptoms:

The Docker daemon on the node is hung, and PLEG makes the node NotReady:

  Ready            False   Wed, 09 Jan 2019 15:50:48 -0800   Wed, 09 Jan 2019 15:42:27 -0800   KubeletNotReady              PLEG is not healthy: pleg was last seen active 11m23.505775528s ago; threshold is 3m0s

Garbage Collector reports failure to clean up:

Warning  ContainerGCFailed  2m (x3 over 8m)   kubelet, ip-xxx-xxx-xxx-xxx.us-west-2.compute.internal  rpc error: code = DeadlineExceeded desc = context deadline exceeded
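
A hedged sketch of the checks we run once the node is in this state (assumes the Docker runtime used on these nodes and working kubectl access; nothing here is specific to this bug):

# From a machine with cluster access: confirm the node is NotReady and why.
kubectl get nodes
kubectl describe node ip-xxx-xxx-xxx-xxx.vpc.internal | grep -A8 'Conditions:'

# On the affected node: a hung Docker daemon typically makes these block or
# time out rather than return promptly.
timeout 30 docker ps
timeout 30 docker info
journalctl -u docker --since '1 hour ago' | tail -n 50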

What you expected to happen:
Crashing pods should not take down the node.

How to reproduce it (as minimally and precisely as possible):

Run the following spec:
MAKE SURE TO CHANGE THE nodeAffinity TO POINT TO YOUR NODE.
Using 6 replicas on a single node (testing on an instance with 32GB of RAM) reproduces the issue faster. You may also need to change the main container's resource requests if you are using a smaller instance type.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: repro-limited-pub
spec:
  selector:
    matchLabels:
      app: repro-limited-pub
  replicas: 6
  template: 
    metadata:
      name: repro-limited-pub
      labels:
        app: repro-limited-pub
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-xxx-xxx-xxx-xxx.vpc.internal
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: "10Mi"
        name: secrets
      initContainers:
      - name: init
        image: busybox
        resources: {}
        volumeMounts:
        - mountPath: /etc/secrets
          name: secrets
        command: ["dd"]
        args: 
        - "if=/dev/zero"
        - "of=/etc/secrets/zero"
        - "bs=60M"
      containers:
      - name: main
        image: busybox
        resources:
          limits:
            memory: 4Gi
            cpu: 1
          requests:
            cpu: 1
            memory: 4Gi
        volumeMounts:
        - mountPath: /etc/secrets
          name: secrets

On an instance type with 32GB of RAM (t2.2xlarge) running 6 Guaranteed replicas (4Gi each), it takes about 1-2 hours to kill the node.
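
Roughly how we apply and watch the repro, assuming the spec above is saved as repro-limited-pub.yaml and the nodeAffinity hostname has been edited:

# Deploy the repro and watch pods cycle through Evicted / Init:OOMKilled /
# Init:CrashLoopBackOff while node memory climbs.
kubectl apply -f repro-limited-pub.yaml
kubectl get pods -l app=repro-limited-pub -w

# Watch the target node's conditions for MemoryPressure / NotReady.
kubectl get node ip-xxx-xxx-xxx-xxx.vpc.internal -w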

Node Memory Usage:
Query: heapster.node.memory.working_set / heapster.node.memory.node_capacity * 100
(screenshot: node memory usage graph, 2019-01-09)
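
If heapster is not available, a rough node-side approximation of the same ratio (a sketch; the numbers will not match the heapster working-set metric exactly):

# Sample node memory usage vs. capacity every 30 seconds while the repro runs.
while true; do
  printf '%s ' "$(date +%T)"
  free -m | awk 'NR==2 {printf "used %dMi / %dMi (%.1f%%)\n", $3, $2, $3/$2*100}'
  sleep 30
done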

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Also reproduced on:
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    AWS t2.2xlarge (non EKS)

  • OS (e.g. from /etc/os-release):

NAME="Red Hat Enterprise Linux Server"
VERSION="7.4 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.4"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.4 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.4:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.4"
  • Kernel (e.g. uname -a):
Linux ip-xxx-xxx-xxx-xxx.vpc.internal 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Kops

krmayankk (Contributor) commented Jan 10, 2019

/sig node

dims (Member) commented Jan 15, 2019

@eytan-avisror seen this? #63641 — we don't really honor the sizeLimit and use default Linux kernel behaviors
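
For anyone else hitting this: a hedged way to see what @dims describes on an affected node; the tmpfs backing the memory-medium emptyDir is sized by the kernel default rather than the declared sizeLimit:

# On the node: locate the tmpfs mounts the kubelet created for memory-medium
# emptyDir volumes. The size column reflects the kernel's tmpfs default
# (typically about half of the node's RAM), not the pod's 10Mi sizeLimit.
mount -t tmpfs | grep 'kubernetes.io~empty-dir'
df -h | grep 'kubernetes.io~empty-dir'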
