
pull-kubernetes-e2e-gce-etcd3: hack/e2e-internal/e2e-down.sh command killed by keyboard interrupt #47446

Closed
dcbw opened this issue Jun 13, 2017 · 16 comments
Labels: kind/flake (Categorizes issue or PR as related to a flaky test.) · priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.) · sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.) · sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
Milestone: v1.7

Comments

dcbw (Member) commented Jun 13, 2017

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46823/pull-kubernetes-e2e-gce-etcd3/35722/

I0613 08:53:55.962] Ginkgo ran 1 suite in 32m39.916549611s
I0613 08:53:55.963] Test Suite Passed
I0613 08:53:56.003] Shutting down test cluster in background.
W0613 08:53:56.103] 2017/06/13 08:53:55 util.go:131: Step './hack/ginkgo-e2e.sh --ginkgo.skip=\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]' finished in 32m40.6674528s
W0613 08:53:56.104] 2017/06/13 08:53:55 util.go:129: Running: ./hack/e2e-internal/e2e-down.sh
W0613 08:53:56.104] Project: k8s-jkns-pr-gce-etcd3
W0613 08:53:56.104] Zone: us-central1-f
W0613 08:54:13.272] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-e2e-35722-http-alt].
W0613 08:54:24.064] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-e2e-35722-nodeports].
I0613 08:54:24.165] Bringing down cluster using provider: gce
W0613 08:54:24.844] Project: k8s-jkns-pr-gce-etcd3
W0613 08:54:24.845] Zone: us-central1-f
W0613 08:54:26.479] INSTANCE_GROUPS=e2e-35722-minion-group
W0613 08:54:26.480] NODE_NAMES=e2e-35722-minion-group-57b8 e2e-35722-minion-group-c0px e2e-35722-minion-group-s7tg e2e-35722-minion-group-v8gg
I0613 08:54:26.580] Bringing down cluster
W0613 08:55:18.958] 2017/06/13 08:55:18 util.go:155: Interrupt testing after 55m0s timeout. Will terminate in another 15m
W0613 08:55:18.959] 
W0613 08:55:18.959] 
W0613 08:55:18.959] Command killed by keyboard interrupt
W0613 08:55:18.959] 
W0613 08:55:18.959] 2017/06/13 08:55:18 util.go:131: Step './hack/e2e-internal/e2e-down.sh' finished in 1m22.993032121s
W0613 08:55:18.959] 2017/06/13 08:55:18 util.go:129: Running: ./hack/e2e-internal/e2e-down.sh
W0613 08:55:18.992] Project: k8s-jkns-pr-gce-etcd3
W0613 08:55:18.993] Zone: us-central1-f
I0613 08:55:19.093] Shutting down test cluster in background.
I0613 08:55:20.207] Bringing down cluster using provider: gce
W0613 08:55:20.884] Project: k8s-jkns-pr-gce-etcd3
W0613 08:55:20.884] Zone: us-central1-f
I0613 08:55:22.526] Bringing down cluster
W0613 08:55:22.627] INSTANCE_GROUPS=e2e-35722-minion-group
W0613 08:55:22.627] NODE_NAMES=e2e-35722-minion-group-57b8 e2e-35722-minion-group-c0px e2e-35722-minion-group-s7tg e2e-35722-minion-group-v8gg
W0613 08:55:24.930] ERROR: (gcloud.compute.instance-groups.managed.delete) Some requests did not succeed:
W0613 08:55:24.931]  - The resource 'projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instanceGroupManagers/e2e-35722-minion-group' is not ready
W0613 08:55:24.931] 
W0613 08:55:24.972] Failed to delete instance group(s).
W0613 08:55:26.699] ERROR: (gcloud.compute.instance-templates.delete) Could not fetch resource:
W0613 08:55:26.700]  - The instance_template resource 'projects/k8s-jkns-pr-gce-etcd3/global/instanceTemplates/e2e-35722-minion-template' is already being used by 'projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instanceGroupManagers/e2e-35722-minion-group'
W0613 08:55:26.700] 
I0613 08:55:30.056] Removing etcd replica, name: e2e-35722-master, port: 2379, result: 0
I0613 08:55:31.044] Removing etcd replica, name: e2e-35722-master, port: 4002, result: 0
W0613 08:55:37.289] Updated [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instances/e2e-35722-master].
W0613 08:56:45.139] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/zones/us-central1-f/instances/e2e-35722-master].
W0613 08:57:03.664] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-master-https].
W0613 08:57:09.487] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-master-etcd].
W0613 08:57:10.020] Deleted [https://www.googleapis.com/compute/v1/projects/k8s-jkns-pr-gce-etcd3/global/firewalls/e2e-35722-minion-all].
@dcbw dcbw added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 13, 2017
k8s-github-robot commented

@dcbw There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 13, 2017
@dcbw dcbw added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2017
@janetkuo janetkuo added sig/release Categorizes an issue or PR as relevant to SIG Release. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jun 13, 2017
@janetkuo janetkuo added this to the v1.7 milestone Jun 13, 2017
@roberthbailey roberthbailey removed the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jun 14, 2017
roberthbailey (Contributor) commented

Both of the linked runs failed after a timeout:

Interrupt testing after 55m0s timeout.

This is not a cluster lifecycle issue.

dchen1107 (Member) commented

As @roberthbailey mentioned above, the test simply timed out; the longest distributed test set took 1960.667636731s. All tests passed, but the job hit the timeout.

Should we bump the timeout value for this test suite?

Also, #47479 doubled the namespace deletion timeout but did not increase the overall timeout for this test job, which might introduce more timeout issues.

I also spent some time looking at which test suites in this job use the most time. The top three are:

  • StatefulSet: 884.680s
  • StatefulSet: 646s
  • PersistentVolume: 575s

Each of the other test suites takes less than 300s, and most take less than 100s.

My question is whether some of the StatefulSet tests should be marked as [Slow] and skipped in this test job. @kow3ns @janetkuo
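
For reference, a minimal Ginkgo sketch (illustrative names only, not the real StatefulSet specs) of how the [Slow] tag works: the marker is just part of the spec's description string, so the --ginkgo.skip=\[Slow\]|... regex this job already passes (see the log at the top of this issue) would exclude the tagged spec without any other change.

```go
// Illustrative only: a sketch of how tagging works in Ginkgo-based e2e suites.
// The [Slow] marker is simply part of the spec's description string, so a job
// whose skip regex contains \[Slow\] filters the first spec out automatically.
package e2e_sketch

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("StatefulSet", func() {
	// Tagged [Slow]: skipped by any job whose --ginkgo.skip regex matches \[Slow\].
	It("should perform a long rolling update [Slow]", func() {
		// placeholder for the long-running assertions
	})

	// Untagged: still runs in the presubmit job.
	It("should create a basic StatefulSet", func() {
		// placeholder for the fast assertions
	})
})
```

Tagging a spec this way only moves it out of jobs that skip [Slow]; it would still run in jobs whose focus regex includes [Slow].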

@dchen1107 dchen1107 added sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed sig/release Categorizes an issue or PR as relevant to SIG Release. labels Jun 14, 2017
dchen1107 (Member) commented

@fejta @spxtr Another question I have: how do we capture the time spent on each test?
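
To partly answer the timing question: the e2e runs already upload JUnit-style junit_*.xml artifacts (this is what Gubernator/Testgrid read), and each <testcase> element records its duration in a time attribute, so per-test times can be pulled from there. A minimal Go sketch, assuming a local junit_01.xml with that conventional layout (the path and exact schema here are assumptions, not a documented contract):

```go
// Sketch: list the slowest e2e specs from a JUnit-style results file.
// Assumes a junit_01.xml in the working directory whose <testcase> elements
// carry name and time (seconds) attributes.
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"sort"
)

type testCase struct {
	Name string  `xml:"name,attr"`
	Time float64 `xml:"time,attr"` // seconds
}

type testSuite struct {
	Cases []testCase `xml:"testcase"`
}

func main() {
	data, err := os.ReadFile("junit_01.xml") // hypothetical artifact path
	if err != nil {
		panic(err)
	}
	var suite testSuite
	if err := xml.Unmarshal(data, &suite); err != nil {
		panic(err)
	}
	// Sort specs by duration, longest first, and print the top ten.
	sort.Slice(suite.Cases, func(i, j int) bool {
		return suite.Cases[i].Time > suite.Cases[j].Time
	})
	for i, tc := range suite.Cases {
		if i == 10 {
			break
		}
		fmt.Printf("%8.1fs  %s\n", tc.Time, tc.Name)
	}
}
```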

liggitt (Member) commented Jun 14, 2017

even before #47479, we were bumping up against the 55m timeout

dchen1107 (Member) commented

A quick update on this:

  • @krzyzacy helped increase the timeout (+10m) for pull-kubernetes-e2e-gce-etcd3 to stop the bleeding for now: Increase timeout for etcd3 suite for 10min (test-infra#3064)

  • By carefully comparing pull-kubernetes-e2e-gce-etcd3 with https://k8s-testgrid.appspot.com/google-gce#gce-etcd3, which runs the same set of tests on GCE, I believe the performance discrepancy is introduced by the test configuration / infra, since gce-etcd3 looks fine. The differences between the two test jobs are listed below:

    • The pull test was configured to time out at 55 minutes (before @krzyzacy's PR), while gce-etcd3 uses 50 minutes.
    • The pull job requires extra build and stage time; gce-etcd3 does not. ==> This should be in the 5-minute range.
    • The pull test is configured to run on 4 nodes with 30 concurrent tests at any given time, while gce-etcd3 runs on only 3 nodes with 25 concurrent tests. This should theoretically make the pull test run faster; on the other hand, it might take longer to bring up and tear down a slightly bigger cluster. ==> Either way, the difference shouldn't be this significant. The nodes in both jobs use a Debian-based CVM image with the same kernel version and configuration, and the node machine type is the same in both test suites.
    • The pull test has the cache mutation detector enabled, while gce-etcd3 does not. "Add ability to enable cache mutation detector in GCE" (#41326) introduced the cache mutation detector in GCE. @ncdc, what is the cache mutation detector for? Can we disable it?

I am removing sig/apps since the same set of tests is OK on gce-etcd3.

dchen1107 (Member) commented

#43996 was the first PR to enable the cache mutation detector. @wojtek-t @gmarek, any ideas?

liggitt (Member) commented Jun 15, 2017

It's required to be confident that shared informers are not having their cached objects mutated by listeners, which would cause data inconsistency issues. I would have expected it to be on in the post-merge job as well, actually.
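
For readers following along, here is a conceptual sketch of what such a detector does. This is not client-go's actual implementation, just the idea described above: keep a private deep copy of every object handed to the shared cache, then periodically compare the live cached object against the snapshot and blow up if a listener mutated it. The DeepCopyable interface and the polling interval are illustrative assumptions.

```go
// Conceptual sketch of a cache mutation detector (not client-go's real code).
// It snapshots objects as they enter a shared cache and periodically verifies
// that nobody has mutated the cached originals in place.
package mutationdetector

import (
	"fmt"
	"reflect"
	"sync"
	"time"
)

// DeepCopyable is a stand-in for Kubernetes API objects, which provide DeepCopy methods.
type DeepCopyable interface {
	DeepCopy() DeepCopyable
}

type cacheEntry struct {
	cached   DeepCopyable // the object listeners share (must never be mutated)
	snapshot DeepCopyable // our private copy taken at insertion time
}

type MutationDetector struct {
	mu      sync.Mutex
	entries []cacheEntry
}

// AddObject snapshots the object as it enters the shared cache.
func (d *MutationDetector) AddObject(obj DeepCopyable) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.entries = append(d.entries, cacheEntry{cached: obj, snapshot: obj.DeepCopy()})
}

// Run compares every cached object against its snapshot on each tick and
// panics on the first divergence, pointing at the offending listener bug.
func (d *MutationDetector) Run(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			d.mu.Lock()
			for _, e := range d.entries {
				if !reflect.DeepEqual(e.cached, e.snapshot) {
					panic(fmt.Sprintf("cached object was mutated: %#v != %#v", e.cached, e.snapshot))
				}
			}
			d.mu.Unlock()
		}
	}
}
```

The extra deep copies plus the repeated DeepEqual scans over everything in the informer caches are the kind of overhead that could plausibly account for the slower runs measured later in this thread.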

liggitt (Member) commented Jun 15, 2017

@ncdc @deads2k ^

dchen1107 (Member) commented

@liggitt, yes, I read #43996.

Here is a list of jobs with this enabled: https://github.com/kubernetes/test-infra/search?utf8=%E2%9C%93&q=ENABLE_CACHE_MUTATION_DETECTOR&type=

One can see that only pull-* jobs have this enabled. This discrepancy might also explain why pull-federation-* uses more time than ci-federation-*? cc/ @csbell

On the other hand, from my comment at #47446 (comment):

The pull test is configured to run on 4 nodes with 30 concurrent tests at any given time, while gce-etcd3 runs on only 3 nodes with 25 concurrent tests. This should theoretically make the pull test run faster; on the other hand, it might take longer to bring up and tear down a slightly bigger cluster. ==> Either way, the difference shouldn't be this significant. The nodes in both jobs use a Debian-based CVM image with the same kernel version and configuration, and the node machine type is the same in both test suites.

The above test configuration might still be a contributor to the API performance degradation you reported at #47135 (comment), since the pull test runs more tests concurrently, which means more API requests to the API server. We might be hitting some threshold or limit. cc/ @kubernetes/sig-api-machinery-misc

@krzyzacy, how safe is it to clone the CI test with the same configuration as the pull one, but without the cache mutation detector enabled? This would be temporary (~1 day) for an A/B comparison. My only concern is whether we would hit the quota limit for our test infra.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 15, 2017
krzyzacy (Member) commented

I can make one with cache-mutation and one without and we can compare the performance.

liggitt (Member) commented Jun 15, 2017

The pull job requires extra build and stage time; gce-etcd3 does not. ==> This should be in the 5-minute range.

The pull job has GINKGO_TOLERATE_FLAKES=y; the CI job does not. If any flake is encountered, that can lengthen the test run time by the duration of the re-run test.

Also, time to reach the "Running Suite: Kubernetes e2e suite" line from the start of build-log.txt:

  • ci job: ~8 minutes
  • pull job: ~21 minutes

ianchakeres (Contributor) commented

Yesterday I encountered some test flakes in e2e-gce-etcd3 on PR #46792. All the failures mention timeouts, and I think they relate to this test flake (#47446). The tests did pass last night, after the timeout was increased.

Here is the output of the failed test runs:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/36002/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/35975/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46792/pull-kubernetes-e2e-gce-etcd3/35842/

dchen1107 (Member) commented

@liggitt, we started the etcd3-pr-validate job, which has settings identical to the pull job except for the cache mutation detector: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/

Comparing the time spent on those runs vs. ci-kubernetes-e2e-gce-etcd3 (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/, please ignore the new flakiness):

I saw roughly similar runtimes to finish those tests. @krzyzacy is going to enable the cache mutation detector to see if there is a performance hit.

dchen1107 (Member) commented Jun 17, 2017

After enabling the cache mutation detector at 15:00 today, the latency to finish the etcd3-pr-validate job increased noticeably. From https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/:

38 2017-06-16 18:43 PDT took 46m11s 
37 2017-06-16 17:53 PDT took 49m23s 
36 2017-06-16 17:20 PDT took 31m31s 
35 2017-06-16 16:43 PDT took 35m14s 
34 2017-06-16 16:01 PDT took 41m26s 
33 2017-06-16 15:20 PDT took 38m58s 
====================== ABOVE: cache mutation detector ON, with 1) GINKGO_TOLERATE_FLAKES=y and 2) running on 4 nodes with 30 concurrent tests at any given time
====================== BELOW: cache mutation detector OFF, but with 1) GINKGO_TOLERATE_FLAKES=y and 2) running on 4 nodes with 30 concurrent tests at any given time
32 2017-06-16 14:48 PDT took 27m41s 
31 2017-06-16 14:18 PDT took 29m16s 
30 2017-06-16 13:47 PDT took 28m47s 
29 2017-06-16 13:16 PDT took 26m4s 
28 2017-06-16 12:45 PDT took 27m26s 
27 2017-06-16 12:14 PDT took 1m23s 
26 2017-06-16 11:43 PDT took 28m30s 
25 2017-06-16 11:12 PDT took 28m10s 
24 2017-06-16 10:41 PDT took 26m22s 
23 2017-06-16 10:10 PDT took 25m53s 
22 2017-06-16 09:36 PDT took 32m41s 
21 2017-06-16 09:31 PDT 
20 2017-06-16 09:00 PDT took 24m57s 
19 2017-06-16 08:29 PDT took 26m7s 
18 2017-06-16 07:52 PDT took 35m50s 
17 2017-06-16 07:21 PDT took 27m34s 
16 2017-06-16 06:50 PDT took 26m42s 
15 2017-06-16 06:19 PDT took 25m14s 
14 2017-06-16 05:48 PDT took 25m54s 
13 2017-06-16 05:15 PDT took 32m7s 
12 2017-06-16 04:44 PDT took 26m29s 
11 2017-06-16 04:13 PDT took 28m3s 
10 2017-06-16 03:42 PDT took 24m49s 
9 2017-06-16 03:10 PDT took 31m3s 
8 2017-06-16 02:39 PDT took 27m3s 
7 2017-06-16 02:08 PDT took 25m30s 
6 2017-06-16 01:37 PDT took 1m13s 
5 2017-06-16 01:06 PDT took 26m30s 
4 2017-06-16 00:35 PDT took 27m23s 
3 2017-06-16 00:04 PDT took 28m48s 
2 2017-06-15 23:33 PDT took 25m7s 
1 2017-06-15 23:02 PDT took 27m39s 

Excluding the aborted short runs, the six runs with the detector enabled average roughly 40 minutes, while the runs without it average roughly 27 minutes. I believe the mystery is resolved, and am closing this issue.
