Kubelet OOM killing in 'g1-small' node during huge-cluster perf test #47865
/assign @dchen1107
If you can confirm that the reason is due to not having large-enough nodes, we can rerun with larger ones.
This basically means that between 1.6 and 1.7, resource usage on Nodes grew enough to cause widespread OOMs on 1.7GB machines when they're running ~30 pause Pods.
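For anyone reproducing this, a minimal sketch for eyeballing memory pressure on an affected node (assuming kubectl access; the node name used here is the one quoted later in this thread):

```sh
# Check memory conditions, allocatable memory, and allocated resources on a suspect node.
kubectl describe node e2e-enormous-cluster-minion-group-nxl2 | grep -iE 'MemoryPressure|Allocatable|memory'
```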
Seems like there are nodes crashing from time to time (even a bit more in the load test, I guess):
Most of them are OOMs. Let's try with bigger machines tomorrow and see if the problem still persists.
Yep. There are many add-on pods, so it's hard to guess which one uses more resources in 1.7 without a side-by-side comparison. There was a new daemonset (ip-masq-agent) added too, so the increase in resource usage may be expected. kube-proxy was using quite a lot of memory, but I assumed that's by design since the test created ~13k services.
The only thing that caught my attention is that ip-masq-agent got OOM-killed because it exceeded its own memory limit. I think the limit might be too small for the load? /cc @dnardo
The limits for ip-masq-agent are pretty small, so I doubt it's taking too many resources. If it was OOM-killed then yeah, maybe the limit was too small; that said, I doubled it from a 24-hour max, so I'm a bit surprised.
@dnardo, what is the current limit? I have asked this in another issue/PR, but no one answered my question.
@matchstick Are we sure we want to enable this by default for the 1.7 release? I raised my concern about this before at #46651 (comment).
@davidopp This is the concern I was talking to you about yesterday regarding the 1.7 release: newly added daemonsets on every node. This one can totally make the node useless. We need to make sure the node is large enough to include all those default daemons & daemonsets. Your spreadsheet can help answer this question.
@kubernetes/kubernetes-release-managers We should include this information in our release notes.
The memory limit is only 8MB, which is pretty small. I didn't mean to say that the new daemon is the culprit. Any existing daemon on the node could have had a significant increase in resource usage, or all of them could have collectively pushed memory over the limit. It's hard to pinpoint the exact cause without a baseline (1.6) to compare against.
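For reference, a quick way to check what the addon actually requests and limits on a live cluster (a sketch; it assumes the DaemonSet is named ip-masq-agent in kube-system, which may differ per deployment):

```sh
# Print the resources stanza of the ip-masq-agent DaemonSet's first container.
kubectl -n kube-system get daemonset ip-masq-agent \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```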
I'm less concerned about ip-masq-agent than I am about this:

Jun 21 17:07:46.960055 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables-restor invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=996

Why is iptables-restore being killed? Wouldn't that be kube-proxy calling iptables-restore? ip-masq-agent doesn't call that. Lastly, even if ip-masq-agent was killed, it wouldn't have caused any issues. It would have run at least once, and that would have set up the ip-masq rules; they would never have needed to change after that.
Why is iptables-restore's oom_score_adj so high (996)? If it is a child process of kube-proxy, it should inherit the oom_score_adj from kube-proxy, which I set to a much lower value back in the 1.4 release as a temporary workaround before we have the full story for #22212. Is there a regression in this release? Did we change kube-proxy's oom_score_adj when making it a critical static pod? cc/ @vishh
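A minimal sketch for checking this on a node: compare the oom_score_adj of kube-proxy with that of any iptables-restore process running at the time (run on the node itself; process names assumed):

```sh
# List oom_score_adj for kube-proxy and iptables-restore processes on the node.
for pid in $(pgrep -f 'kube-proxy|iptables-restore'); do
  printf '%s\t%s\toom_score_adj=%s\n' "$pid" "$(cat /proc/$pid/comm)" "$(cat /proc/$pid/oom_score_adj)"
done
```

If the values don't match, the iptables-restore in the kernel log was likely not forked from kube-proxy; 996 looks like the value kubelet would assign to a burstable pod container with a small memory request, which would point at another pod's cgroup.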
Automatic merge from submit-queue (batch tested with PRs 42252, 42251, 42249, 47512, 47887)

Bump the memory request/limit for ip-masq-daemon.

**Which issue this PR fixes**: fixes #47865
We'll run a test using n1-standard-1s to see if they have enough memory.
This also happens on n1-standard-1 Nodes, which seems bad. Ref. #47899
@dnardo iptables-restore was in the ip-masq-agent's cgroup. That's what caused ip-masq-agent to be OOM-killed. From the numbers above, the new limit would not be enough.
I think what might be happening is that when ip-masq-agent writes out its rules, it may be reading all the iptables rules that are currently configured. That may explain the usage here. Let me take a look and see.
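A rough way to sanity-check that theory on a node (a sketch; run on a node where kube-proxy has programmed the service rules):

```sh
# How big is the iptables dump that iptables-save/iptables-restore has to process?
iptables-save | wc -l   # number of rule lines
iptables-save | wc -c   # bytes the agent has to buffer while handling them
```

With ~13k services, this output should scale roughly with the number of service/endpoint rules, which would explain an 8MB limit being too tight.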
Thanks @dnardo
@gmarek Do we have the apiserver logs available for some 5k-node run for 1.6 somewhere? I can't find them anywhere, and they'd be useful for my debugging work. Also, any way to verify if 1.6 scale tests ran with/without services?
We had several discussions offline related to this. Here is a summary of the decisions and action items: cc/ @kubernetes/kubernetes-release-managers
- The decision is to disable ip-masq-agent by default for the OSS k8s 1.7 release. @dnardo has a pending PR for this.
- @dnardo and the network team are also working on how to reduce the overhead. They have several proposals already.
@dchen1107 Thanks a lot for the detailed update!
Automatic merge from submit-queue

Remove limits from ip-masq-agent for now and disable ip-masq-agent in GCE.

ip-masq-agent, when issuing an iptables-save, will read any iptables rules configured on the node. This means that ip-masq-agent's memory requirements grow with the number of iptables rules (i.e. services) on the node.

**Which issue this PR fixes**: fixes #47865
FYI, I've uploaded the logs for the current run of gce-enormous-cluster to GCS (available here) and brought down the cluster. Re-kicked a new job with services and ip-masq-agent disabled this time. Let's see how much this helps.
@shyamjvs started the test at ~10 PM PDT (thanks a lot!). The load test should finish in ~12h, i.e. Friday 10 am PDT.
@dchen1107 - Load test passed. It's highly likely that Density test will pass as well, which means we're golden. We'll try running those tests with services enabled, but that's not a blocker for release.
Here are the resource usage stats for both 1.6.6 and 1.7.0.
Yup, both load and density tests passed, and with no high-latency requests at that. List-pods 99%ile latency fell all the way from 6s to ~1.5s. I'll verify this weekend whether just the ip-masq-agent created this mischief or services did too.
@shyamjvs and @gmarek Thanks for the test results. Please share the results with services enabled later. From looking at @yguo0905's data, there is not much change in the memory usage footprint of either Kubelet or docker (same 1.11.2 anyway) between 1.6.6 and 1.7.0-beta3. I am closing the issue. Thanks everyone!
Automatic merge from submit-queue (batch tested with PRs 47993, 47892, 47591, 47469, 47845)

Use a different env var to enable the ip-masq-agent addon. We shouldn't mix setting the non-masq-cidr with enabling the addon.

#47865
And.. the load test failed with services enabled. We are seeing high qps, like before (similar to #47899 (comment)). Disabling the ip-masq-agent did help by removing some OOMs and the pod-status/events update requests from kubelet arising due to them. But now fluentd seems to be doing something similar (it was also there before iirc, but ip-masq-agent dominated). Out of 9800 qps of 429s, 7k are from kubelet and the rest from npd. Half of those 7k requests are due to fluentd OOM-killing (which kubelets respond to by sending PUT pod-status and PATCH events). The other half are PATCH node-status calls (same for npd), but that's just a consequence iiuc. From the kernel logs on the nodes, fluentd and event-exporter seem to be OOM-killed frequently. From the fluentd logs, it seems it's not able to handle the log volume:
|
We can either try running fluentd with higher memory limits or try finding and reducing the source of this high log traffic. The only difference between this run and the last run (which passed) is enabling services, so kube-proxy should be the one doing the mischief. We have the logging verbosity level set to v1 (https://github.com/kubernetes/test-infra/blob/master/jobs/ci-kubernetes-e2e-gce-enormous-cluster.env#L23) and kube-proxy logs are still huge.
kube-proxy.log was 920B without services and ~6-7 GB (rotated logs included) with services. It's mainly because of printing out iptables rules, which is too much to log for large clusters with many services.
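For anyone checking this on their own nodes, a small sketch (it assumes the default GCE log location /var/log/kube-proxy.log; adjust the path if needed):

```sh
# Total size of kube-proxy logs, including rotated files.
du -ch /var/log/kube-proxy.log* | tail -1

# Lengths of the longest log lines, to spot multi-MB iptables dumps being logged.
awk '{ print length }' /var/log/kube-proxy.log | sort -rn | head -5
```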
Those are different problems. The quota issues are expected to go away in the coming week; the OOM issues are expected under high load (more than 200KB/sec).
Filed an issue.
If that's the case, we are sure to thrash fluentd even on moderately big clusters with a fair number of service endpoints, since just this one line in kube-proxy can create a log line of multiple MBs.
/remove-priority P0
FYI, we are now running the test with fluentd disabled but services still enabled to check if there's any problem with kube-proxy. |
While running scalability tests today (as part of #47344) on a 4000-node GCE cluster, this happened during density test termination. Currently, the load test is running.
It failed due to some density pod's condition not being updated; on digging a bit, it turned out a couple of kubelets (one of them on the node where the pod was running) crashed:
From the kernel logs:
The cluster is still running; to reach the node:
gcloud compute ssh e2e-enormous-cluster-minion-group-nxl2 --project kubernetes-scale --zone us-east1-a
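Once on the node, a minimal way to pull the OOM evidence out of the kernel log (a sketch; either command should work depending on the node image):

```sh
# List OOM-killer invocations and their victims from the kernel ring buffer / journal.
dmesg -T | grep -iE 'invoked oom-killer|Killed process'
journalctl -k | grep -i 'oom'
```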
cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-misc @dchen1107 @yujuhong @gmarek