metadata-proxy not able to handle too many pods/node #55695

Closed
shyamjvs opened this Issue Nov 14, 2017 · 7 comments

shyamjvs commented Nov 14, 2017

Kubemark-500 started failing continuously ~10 days ago (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce?before=9575).
On digging in a bit, I found that the hollow-nodes are quite frequently failing to create/delete pods, and there are many logs like these from the hollow-kubelet:

I1114 08:08:24.861541       7 metadata.go:212] Failed to Get service accounts from gce metadata server: Get http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1114 08:08:25.114072       7 metadata.go:212] Failed to Get service accounts from gce metadata server: Get http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1114 08:12:51.032702       7 metadata.go:212] Failed to Get service accounts from gce metadata server: Get http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/: dial tcp 169.254.169.254:80: getsockopt: connection refused

(taken from https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/9697/artifacts/kubemark-500-minion-group-5f7n/kubelet-hollow-node-68krn.log)
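
As a rough sanity check, the frequency of these failures can be counted per hollow-kubelet log with something like the sketch below (the error string is taken from the logs above; the file glob is an assumption):

```sh
# Count metadata-server failures in the hollow-kubelet logs (file glob is an assumption)
grep -c 'Failed to Get service accounts from gce metadata server' kubelet-hollow-node-*.log
```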

This seems clearly related to the metadata-proxy changes, more specifically to #54653 in the commit diff.

I checked the currently running kubemark-500 cluster and found that the metadata-proxy pods are crashing frequently:

$ kubectl get pods -n kube-system -l k8s-app=metadata-proxy
NAME                        READY     STATUS             RESTARTS   AGE
metadata-proxy-v0.1-6tnk7   2/2       Running            11         1h
metadata-proxy-v0.1-fmkbd   1/2       CrashLoopBackOff   11         1h
metadata-proxy-v0.1-gp8zp   2/2       Running            11         1h
metadata-proxy-v0.1-hvqlt   1/2       CrashLoopBackOff   11         1h
metadata-proxy-v0.1-p45p5   2/2       Running            7          1h
metadata-proxy-v0.1-pkn6m   2/2       Running            7          1h
metadata-proxy-v0.1-scmxs   2/2       Running            0          1h
metadata-proxy-v0.1-sdccv   2/2       Running            11         1h

And the reason they're crashing seems to be OOM-killing:

$ kubectl describe pods -n kube-system metadata-proxy-v0.1-gp8zp
Name:           metadata-proxy-v0.1-gp8zp
Namespace:      kube-system
Node:           kubemark-500-minion-group-g6h7/10.142.0.6
Start Time:     Tue, 14 Nov 2017 09:26:12 +0000
...
Containers:
  metadata-proxy:
    Container ID:   docker://c52477c5fa28cb12d9b3b2dd5dd196654f8a2e291a64ce54910e0fa479372634
    Image:          gcr.io/google_containers/metadata-proxy:v0.1.4
    Image ID:       docker-pullable://gcr.io/google_containers/metadata-proxy@sha256:9ca02cc773e286b28942103e2482ce74e281ef2c18936610d379b26ecb89c533
    Port:           <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
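
For reference, a quick way to surface the last termination reason across all of these pods would be something like the following sketch (it assumes metadata-proxy is the first entry in containerStatuses):

```sh
# List each metadata-proxy pod with the last termination reason of its first container
kubectl get pods -n kube-system -l k8s-app=metadata-proxy \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```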

cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-misc @porridge

shyamjvs commented Nov 14, 2017

I'm noticing this issue only on kubemark-500 and NOT on kubemark-100 or kubemark-5000. My hypothesis is that it's because the number of hollow-nodes running per real node is higher in kubemark-500 than in kubemark-100 and kubemark-5000:

for kubemark-100, #nodes=3 => ~33 hollow-nodes per node
for kubemark-500, #nodes=7 => ~72 hollow-nodes per node
for kubemark-5000, #nodes=80 => ~63 hollow-nodes per node

This means that a single metadata-proxy pod is not able to handle the load coming from ~72 pods trying to access metadata. In other words, metadata-proxy isn't scaling with the number of pods per node that are trying to access metadata.

I'll try to confirm this hypothesis by tweaking the number of real nodes and checking.
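
To check the per-node distribution directly, something like the sketch below should work (the kubemark namespace and the hollow-node label are assumptions, and the NODE column position in `-o wide` output may vary by kubectl version):

```sh
# Count hollow-node pods per real node (namespace and label are assumptions)
kubectl get pods -n kubemark -l name=hollow-node -o wide --no-headers \
  | awk '{print $7}' | sort | uniq -c | sort -rn
```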

shyamjvs commented Nov 14, 2017

Seems like my hypothesis is right. I started a kubemark 1000-node cluster locally on 15 nodes (i.e. ~67 hollow-nodes/node) and am seeing those errors on all nodes. But on 16 nodes (i.e. ~63 hollow-nodes/node) they're only seen on one node, which has more than the average number of hollow-nodes (66). With 17 nodes, I'm not seeing them at all.

A few options we have here:

  • revert the offending PR #54653 and fix metadata-proxy before bringing the change back: IMO this is the best option
  • increase the memory request/limit of metadata-proxy (see the sketch after this list): this reduces node allocatable capacity and might be undesirable for small nodes
  • increase the number of nodes for the kubemark-500 job: I don't think we should do this, as it doesn't solve the real problem and we'd still be violating "k8s supports 100 pods/node"
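
For the second option, the bump could look roughly like the following (a sketch only; the DaemonSet and container names are assumed from the pod names above, and the values are illustrative):

```sh
# Raise the metadata-proxy container's memory request/limit (names and values are illustrative)
kubectl -n kube-system set resources daemonset metadata-proxy-v0.1 \
  -c metadata-proxy --requests=memory=32Mi --limits=memory=32Mi
```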

shyamjvs commented Nov 14, 2017

porridge commented Nov 14, 2017

shyamjvs commented Nov 14, 2017

Yes, I agree. We need load-testing of the metadata proxy.
I sent out the revert PR to restore the healthy state, so we don't need to live with the regression until a well-tested fix rolls in.
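
As a very rough illustration of what such a load test could exercise, one could hammer the proxied endpoint from a pod on the node along these lines (purely a sketch; the request count and concurrency are arbitrary):

```sh
# Fire 100 concurrent requests at the metadata path seen in the hollow-kubelet logs above
for i in $(seq 1 100); do
  curl -s -H 'Metadata-Flavor: Google' \
    http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/ > /dev/null &
done
wait
```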

shyamjvs commented Nov 14, 2017

I took a quick look at #54653. It seems that earlier, metadata-proxy had the whole 32MB memory request & limit to itself; after the prometheus-to-sd sidecar was introduced, that was split evenly across both containers (i.e. 16MB each). This is likely the reason metadata-proxy is being memory-starved now.

Re-adjusting the split could help here. Thoughts?
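
To double-check the current per-container split on a live cluster, something like the following should do (the DaemonSet name is assumed from the pod names above):

```sh
# Print each container's name and resource requests/limits from the DaemonSet spec
kubectl -n kube-system get daemonset metadata-proxy-v0.1 \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
```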

k8s-merge-robot added a commit that referenced this issue Nov 15, 2017

Merge pull request #55715 from shyamjvs/fix-prom-to-sd-sidecar-in-metadata-proxy

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix prometheus-to-sd sidecar in metadata proxy

Ref #55695 (comment)

This makes 2 changes:
- restore the resource requests and limits of the metadata-proxy container to what they were before, and remove them for the prom-to-sd sidecar (best effort), like everywhere else
- pass pod name and namespace args to the prom-to-sd sidecar (something I just noticed was missing)

/cc @ihmccreery @loburm @crassirostris - Does this make sense?
shyamjvs commented Nov 15, 2017

Seems like PR #55715 fixed this. I filed a separate issue #55797 for load-testing metadata-proxy.
/close

k8s-merge-robot added a commit that referenced this issue Dec 16, 2017

Merge pull request #55813 from ihmccreery/prom-to-sd-resource-limits

Automatic merge from submit-queue (batch tested with PRs 56650, 55813, 56911, 56921, 56871). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Reintroduce memory limits removed in #55715

**What this PR does / why we need it**: Reintroduce the memory limits removed in #55715, in order to give metadata-proxy the Guaranteed QoS class. Xref #55695.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #55797

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
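
For completeness, once that change rolls out, the resulting QoS class of a metadata-proxy pod can be verified along these lines (pod name taken from the earlier listing; yours will differ):

```sh
# Print the QoS class of one metadata-proxy pod (pod name taken from the listing above)
kubectl -n kube-system get pod metadata-proxy-v0.1-6tnk7 -o jsonpath='{.status.qosClass}'
```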