
pull-kubernetes-e2e-gce-etcd3: hack/e2e-internal/e2e-down.sh command killed by keyboard interrupt #47446

Closed
dcbw opened this issue Jun 13, 2017 · 16 comments
Labels: kind/flake (Categorizes issue or PR as related to a flaky test.) · priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.) · sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.) · sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
Milestone: v1.7

Comments

dcbw (Member) commented Jun 13, 2017

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46823/pull-kubernetes-e2e-gce-etcd3/35722/

I0613 08:53:55.962] Ginkgo ran 1 suite in 32m39.916549611s
I0613 08:53:55.963] Test Suite Passed
I0613 08:53:56.003] Shutting down test cluster in background.
W0613 08:53:56.103] 2017/06/13 08:53:55 util.go:131: Step './hack/ginkgo-e2e.sh --ginkgo.skip=\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]' finished in 32m40.6674528s
W0613 08:53:56.104] 2017/06/13 08:53:55 util.go:129: Running: ./hack/e2e-internal/e2e-down.sh
W0613 08:53:56.104] Project: k8s-jkns-pr-gce-etcd3
W0613 08:53:56.104] Zone: us-central1-f
W0613 08:54:13.272] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-e2e-35722-http-alt].
W0613 08:54:24.064] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-e2e-35722-nodeports].
I0613 08:54:24.165] Bringing down cluster using provider: gce
W0613 08:54:24.844] Project: k8s-jkns-pr-gce-etcd3
W0613 08:54:24.845] Zone: us-central1-f
W0613 08:54:26.479] INSTANCE_GROUPS=e2e-35722-minion-group
W0613 08:54:26.480] NODE_NAMES=e2e-35722-minion-group-57b8 e2e-35722-minion-group-c0px e2e-35722-minion-group-s7tg e2e-35722-minion-group-v8gg
I0613 08:54:26.580] Bringing down cluster
W0613 08:55:18.958] 2017/06/13 08:55:18 util.go:155: Interrupt testing after 55m0s timeout. Will terminate in another 15m
W0613 08:55:18.959] 
W0613 08:55:18.959] 
W0613 08:55:18.959] Command killed by keyboard interrupt
W0613 08:55:18.959] 
W0613 08:55:18.959] 2017/06/13 08:55:18 util.go:131: Step './hack/e2e-internal/e2e-down.sh' finished in 1m22.993032121s
W0613 08:55:18.959] 2017/06/13 08:55:18 util.go:129: Running: ./hack/e2e-internal/e2e-down.sh
W0613 08:55:18.992] Project: k8s-jkns-pr-gce-etcd3
W0613 08:55:18.993] Zone: us-central1-f
I0613 08:55:19.093] Shutting down test cluster in background.
I0613 08:55:20.207] Bringing down cluster using provider: gce
W0613 08:55:20.884] Project: k8s-jkns-pr-gce-etcd3
W0613 08:55:20.884] Zone: us-central1-f
I0613 08:55:22.526] Bringing down cluster
W0613 08:55:22.627] INSTANCE_GROUPS=e2e-35722-minion-group
W0613 08:55:22.627] NODE_NAMES=e2e-35722-minion-group-57b8 e2e-35722-minion-group-c0px e2e-35722-minion-group-s7tg e2e-35722-minion-group-v8gg
W0613 08:55:24.930] ERROR: (gcloud.compute.instance-groups.managed.delete) Some requests did not succeed:
W0613 08:55:24.931]  - The resource 'projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instanceGroupManagers/e2e-35722-minion-group' is not ready
W0613 08:55:24.931] 
W0613 08:55:24.972] Failed to delete instance group(s).
W0613 08:55:26.699] ERROR: (gcloud.compute.instance-templates.delete) Could not fetch resource:
W0613 08:55:26.700]  - The instance_template resource 'projects/k8s-jkns-pr-gce-etcd3/global/instanceTemplates/e2e-35722-minion-template' is already being used by 'projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instanceGroupManagers/e2e-35722-minion-group'
W0613 08:55:26.700] 
I0613 08:55:30.056] Removing etcd replica, name: e2e-35722-master, port: 2379, result: 0
I0613 08:55:31.044] Removing etcd replica, name: e2e-35722-master, port: 4002, result: 0
W0613 08:55:37.289] Updated [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instances/e2e-35722-master].
W0613 08:56:45.139] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instances/e2e-35722-master].
W0613 08:57:03.664] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-master-https].
W0613 08:57:09.487] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-master-etcd].
W0613 08:57:10.020] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-all].
@dcbw dcbw added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 13, 2017
k8s-github-robot commented

@dcbw There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 13, 2017
@dcbw dcbw added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2017
@janetkuo janetkuo added sig/release Categorizes an issue or PR as relevant to SIG Release. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jun 13, 2017
@janetkuo janetkuo added this to the v1.7 milestone Jun 13, 2017
@roberthbailey roberthbailey removed the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jun 14, 2017
roberthbailey (Contributor) commented

Both of the linked runs failed after a timeout:

Interrupt testing after 55m0s timeout.

This is not a cluster lifecycle issue.

dchen1107 (Member) commented

As @roberthbailey mentioned above, the test simply timed out; the longest distributed test set took 1960.667636731s. All tests passed, but the job hit the timeout.

Should we bump the timeout value for this test suite?

Also, #47479 doubled the namespace deletion timeout but did not increase the overall timeout for this test job, which might introduce more timeout issues.

I also spent some time looking at which test suites in this job use the most time. The top three are:

  • StatefulSet: 884.680s
  • StatefulSet: 646s
  • PersistentVolume: 575s

Each of the other test suites takes less than 300s, and most take less than 100s.

My question is whether some of the StatefulSet tests should be marked as [Slow] and skipped in this test job. @kow3ns @janetkuo
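
For reference, a minimal Ginkgo sketch (illustrative names only, not the real StatefulSet specs) of how the [Slow] tag works: the marker is just part of the spec's description string, so the --ginkgo.skip=\[Slow\]|... regex this job already passes (see the log at the top of this issue) would exclude the tagged spec without any other change.

```go
// Illustrative only: a sketch of how tagging works in Ginkgo-based e2e suites.
// The [Slow] marker is simply part of the spec's description string, so a job
// whose skip regex contains \[Slow\] filters the first spec out automatically.
package e2e_sketch

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("StatefulSet", func() {
	// Tagged [Slow]: skipped by any job whose --ginkgo.skip regex matches \[Slow\].
	It("should perform a long rolling update [Slow]", func() {
		// placeholder for the long-running assertions
	})

	// Untagged: still runs in the presubmit job.
	It("should create a basic StatefulSet", func() {
		// placeholder for the fast assertions
	})
})
```

Tagging a spec this way only moves it out of jobs that skip [Slow]; it would still run in jobs whose focus regex includes [Slow].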

@dchen1107 dchen1107 added sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed sig/release Categorizes an issue or PR as relevant to SIG Release. labels Jun 14, 2017
dchen1107 (Member) commented

@fejta @spxtr Another question I have: how do we capture the time spent on each test?
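
To partly answer the timing question: the e2e runs already upload JUnit-style junit_*.xml artifacts (this is what Gubernator/Testgrid read), and each <testcase> element records its duration in a time attribute, so per-test times can be pulled from there. A minimal Go sketch, assuming a local junit_01.xml with that conventional layout (the path and exact schema here are assumptions, not a documented contract):

```go
// Sketch: list the slowest e2e specs from a JUnit-style results file.
// Assumes a junit_01.xml in the working directory whose <testcase> elements
// carry name and time (seconds) attributes.
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"sort"
)

type testCase struct {
	Name string  `xml:"name,attr"`
	Time float64 `xml:"time,attr"` // seconds
}

type testSuite struct {
	Cases []testCase `xml:"testcase"`
}

func main() {
	data, err := os.ReadFile("junit_01.xml") // hypothetical artifact path
	if err != nil {
		panic(err)
	}
	var suite testSuite
	if err := xml.Unmarshal(data, &suite); err != nil {
		panic(err)
	}
	// Sort specs by duration, longest first, and print the top ten.
	sort.Slice(suite.Cases, func(i, j int) bool {
		return suite.Cases[i].Time > suite.Cases[j].Time
	})
	for i, tc := range suite.Cases {
		if i == 10 {
			break
		}
		fmt.Printf("%8.1fs  %s\n", tc.Time, tc.Name)
	}
}
```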

liggitt (Member) commented Jun 14, 2017

even before #47479, we were bumping up against the 55m timeout

dchen1107 (Member) commented

A quick update on this:

  • @krzyzacy helped increase the timeout (+10m) for pull-kubernetes-e2e-gce-etcd3 to stop the bleeding for now: Increase timeout for etcd3 suite for 10min (test-infra#3064)

  • By carefully comparing pull-kubernetes-e2e-gce-etcd3 with https://k8s-testgrid.appspot.com/google-gce#gce-etcd3, which runs the same set of tests on GCE, I believe the performance discrepancy is introduced by the test configuration / infra, since gce-etcd3 looks fine. The differences between the two test jobs are listed below:

    • The pull test was configured to time out at 55 minutes (before @krzyzacy's PR), while gce-etcd3 uses 50 minutes.
    • The pull job requires extra build and stage time; gce-etcd3 does not. ==> This should be in the 5-minute range.
    • The pull test is configured to run on 4 nodes with 30 concurrent tests at any given time, while gce-etcd3 runs on only 3 nodes with 25 concurrent tests. This should theoretically make the pull test run faster; on the other hand, it might take longer to bring up and tear down a slightly bigger cluster. ==> Either way, the difference shouldn't be this significant. The nodes in both jobs use a Debian-based CVM image with the same kernel version and configuration, and the node machine type is the same in both test suites.
    • The pull test has the cache mutation detector enabled, while gce-etcd3 does not. "Add ability to enable cache mutation detector in GCE" (#41326) introduced the cache mutation detector in GCE. @ncdc, what is the cache mutation detector for? Can we disable it?

I am removing sig/apps since the same set of tests is OK on gce-etcd3.

dchen1107 (Member) commented

#43996 was the first PR to enable the cache mutation detector. @wojtek-t @gmarek, any ideas?

liggitt (Member) commented Jun 15, 2017

It's required to be confident that shared informers are not having their cached objects mutated by listeners, which would cause data inconsistency issues. I would have expected it to be on in the post-merge job as well, actually.
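
For readers following along, here is a conceptual sketch of what such a detector does. This is not client-go's actual implementation, just the idea described above: keep a private deep copy of every object handed to the shared cache, then periodically compare the live cached object against the snapshot and blow up if a listener mutated it. The DeepCopyable interface and the polling interval are illustrative assumptions.

```go
// Conceptual sketch of a cache mutation detector (not client-go's real code).
// It snapshots objects as they enter a shared cache and periodically verifies
// that nobody has mutated the cached originals in place.
package mutationdetector

import (
	"fmt"
	"reflect"
	"sync"
	"time"
)

// DeepCopyable is a stand-in for Kubernetes API objects, which provide DeepCopy methods.
type DeepCopyable interface {
	DeepCopy() DeepCopyable
}

type cacheEntry struct {
	cached   DeepCopyable // the object listeners share (must never be mutated)
	snapshot DeepCopyable // our private copy taken at insertion time
}

type MutationDetector struct {
	mu      sync.Mutex
	entries []cacheEntry
}

// AddObject snapshots the object as it enters the shared cache.
func (d *MutationDetector) AddObject(obj DeepCopyable) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.entries = append(d.entries, cacheEntry{cached: obj, snapshot: obj.DeepCopy()})
}

// Run compares every cached object against its snapshot on each tick and
// panics on the first divergence, pointing at the offending listener bug.
func (d *MutationDetector) Run(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			d.mu.Lock()
			for _, e := range d.entries {
				if !reflect.DeepEqual(e.cached, e.snapshot) {
					panic(fmt.Sprintf("cached object was mutated: %#v != %#v", e.cached, e.snapshot))
				}
			}
			d.mu.Unlock()
		}
	}
}
```

The extra deep copies plus the repeated DeepEqual scans over everything in the informer caches are the kind of overhead that could plausibly account for the slower runs measured later in this thread.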

liggitt (Member) commented Jun 15, 2017

@ncdc @deads2k ^

dchen1107 (Member) commented

@liggitt, yes, I read #43996.

Here is a list of jobs with this enabled: https://github.com/kubernetes/test-infra/search?utf8=%E2%9C%93&q=ENABLE_CACHE_MUTATION_DETECTOR&type=

One can see that only pull-* jobs have this enabled. This discrepancy might also explain why pull-federation-* uses more time than ci-federation-*? cc/ @csbell

On the other hand, from my comment at #47446 (comment):

The pull test is configured to run on 4 nodes with 30 concurrent tests at any given time, while gce-etcd3 runs on only 3 nodes with 25 concurrent tests. This should theoretically make the pull test run faster; on the other hand, it might take longer to bring up and tear down a slightly bigger cluster. ==> Either way, the difference shouldn't be this significant. The nodes in both jobs use a Debian-based CVM image with the same kernel version and configuration, and the node machine type is the same in both test suites.

The above test configuration might still be a contributor to the API performance degradation you reported at #47135 (comment), since the pull test runs more tests concurrently, which means more API requests to the API server. We might be hitting some threshold or limit. cc/ @kubernetes/sig-api-machinery-misc

@krzyzacy, how safe is it to clone the CI test with the same configuration as the pull one, but without the cache mutation detector enabled? This would be temporary (~1 day) for an A/B comparison. My only concern is whether we would hit the quota limit for our test infra.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 15, 2017
krzyzacy (Member) commented

I can make one with cache-mutation and one without and we can compare the performance.

liggitt (Member) commented Jun 15, 2017

The pull job requires extra build and stage time; gce-etcd3 does not. ==> This should be in the 5-minute range.

The pull job has GINKGO_TOLERATE_FLAKES=y; the CI job does not. If any flake is encountered, that can lengthen the test run time by the duration of the re-run test.

Also, time to reach the "Running Suite: Kubernetes e2e suite" line from the start of build-log.txt:

  • ci job: ~8 minutes
  • pull job: ~21 minutes

ianchakeres (Contributor) commented

Yesterday I encountered some test flakes in e2e-gce-etcd3 on PR #46792. All the failures mention timeouts, and I think they relate to this test flake (#47446). The tests did pass last night, after the timeout was increased.

Here is the output of the failed test runs:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/36002/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/35975/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/35842/

dchen1107 (Member) commented

@liggitt, we started the etcd3-pr-validate job, which has settings identical to the pull job except for the cache mutation detector: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/

Comparing the time spent on those runs vs. ci-kubernetes-e2e-gce-etcd3 (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/, please ignore the new flakiness):

I saw roughly similar runtimes to finish those tests. @krzyzacy is going to enable the cache mutation detector to see if there is a performance hit.

dchen1107 (Member) commented Jun 17, 2017

After enabling the cache mutation detector at 15:00 today, the latency to finish the etcd3-pr-validate job increased noticeably. From https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/:

38 2017-06-16 18:43 PDT took 46m11s 
37 2017-06-16 17:53 PDT took 49m23s 
36 2017-06-16 17:20 PDT took 31m31s 
35 2017-06-16 16:43 PDT took 35m14s 
34 2017-06-16 16:01 PDT took 41m26s 
33 2017-06-16 15:20 PDT took 38m58s 
====================== ABOVE: cache mutation detector ON, with 1) GINKGO_TOLERATE_FLAKES=y and 2) running on 4 nodes with 30 concurrent tests at any given time
====================== BELOW: cache mutation detector OFF, but with 1) GINKGO_TOLERATE_FLAKES=y and 2) running on 4 nodes with 30 concurrent tests at any given time
32 2017-06-16 14:48 PDT took 27m41s 
31 2017-06-16 14:18 PDT took 29m16s 
30 2017-06-16 13:47 PDT took 28m47s 
29 2017-06-16 13:16 PDT took 26m4s 
28 2017-06-16 12:45 PDT took 27m26s 
27 2017-06-16 12:14 PDT took 1m23s 
26 2017-06-16 11:43 PDT took 28m30s 
25 2017-06-16 11:12 PDT took 28m10s 
24 2017-06-16 10:41 PDT took 26m22s 
23 2017-06-16 10:10 PDT took 25m53s 
22 2017-06-16 09:36 PDT took 32m41s 
21 2017-06-16 09:31 PDT 
20 2017-06-16 09:00 PDT took 24m57s 
19 2017-06-16 08:29 PDT took 26m7s 
18 2017-06-16 07:52 PDT took 35m50s 
17 2017-06-16 07:21 PDT took 27m34s 
16 2017-06-16 06:50 PDT took 26m42s 
15 2017-06-16 06:19 PDT took 25m14s 
14 2017-06-16 05:48 PDT took 25m54s 
13 2017-06-16 05:15 PDT took 32m7s 
12 2017-06-16 04:44 PDT took 26m29s 
11 2017-06-16 04:13 PDT took 28m3s 
10 2017-06-16 03:42 PDT took 24m49s 
9 2017-06-16 03:10 PDT took 31m3s 
8 2017-06-16 02:39 PDT took 27m3s 
7 2017-06-16 02:08 PDT took 25m30s 
6 2017-06-16 01:37 PDT took 1m13s 
5 2017-06-16 01:06 PDT took 26m30s 
4 2017-06-16 00:35 PDT took 27m23s 
3 2017-06-16 00:04 PDT took 28m48s 
2 2017-06-15 23:33 PDT took 25m7s 
1 2017-06-15 23:02 PDT took 27m39s 

Excluding the aborted short runs, the six runs with the detector enabled average roughly 40 minutes, while the runs without it average roughly 27 minutes. I believe the mystery is resolved, and am closing this issue.
