
Multiple jobs flaking in CI #82533

Closed
alejandrox1 opened this issue Sep 10, 2019 · 9 comments


@alejandrox1 (Contributor) commented Sep 10, 2019

Since yesterday, we have been seeing multiple CI jobs fail for a variety of reasons.

Many of the jobs fail at the overall stage with an error message such as the one in https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-gci-kubectl-skew/1171375532827217930#0:build-log.txt%3A4097

W0910 11:10:24.362] 2019/09/10 11:10:24 main.go:314: [Boskos] Fail To Release: 1 error occurred:
W0910 11:10:24.363] 
W0910 11:10:24.363] * status 401 Unauthorized, statusCode 401 releasing k8s-jkns-gci-gce-reboot-1-3, kubetest err: <nil>
W0910 11:10:24.366] Traceback (most recent call last):
W0910 11:10:24.366]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 778, in <module>
W0910 11:10:24.367]     main(parse_args())
W0910 11:10:24.367]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 626, in main
W0910 11:10:24.367]     mode.start(runner_args)
W0910 11:10:24.367]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0910 11:10:24.367]     check_env(env, self.command, *args)
W0910 11:10:24.367]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0910 11:10:24.367]     subprocess.check_call(cmd, env=env)
W0910 11:10:24.367]   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
W0910 11:10:24.367]     raise CalledProcessError(retcode, cmd)
W0910 11:10:24.368] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--check-leaked-resources', '--check-version-skew=false', '--extract=ci/k8s-stable1', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=25', '--skew', '--test_args=--ginkgo.focus=Kubectl --ginkgo.skip=\\[Serial\\] --minStartupPods=8', '--timeout=120m')' returned non-zero exit status 1
E0910 11:10:24.377] Command failed
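
For context, that "[Boskos] Fail To Release" line is kubetest handing its leased GCP project back to Boskos at the end of the run, and Boskos rejecting the request with 401. A minimal sketch of what such a release call looks like over plain HTTP is below; the endpoint path, query parameters, and in-cluster URL are assumptions for illustration, not kubetest's actual client code.

# Hedged sketch of a Boskos "release" request, the call failing with 401 above.
# Endpoint, parameters, and URL are assumed; adjust for a real deployment.
import urllib.parse
import urllib.request

BOSKOS_URL = "http://boskos.test-pods.svc.cluster.local"  # hypothetical in-cluster address

def release(name: str, owner: str, dest: str = "dirty") -> int:
    """Return a leased project to Boskos, marking it for janitor cleanup."""
    query = urllib.parse.urlencode({"name": name, "owner": owner, "dest": dest})
    req = urllib.request.Request(f"{BOSKOS_URL}/release?{query}", method="POST")
    # urlopen raises urllib.error.HTTPError for 401/500 responses, which is
    # roughly the failure surfaced in the log above.
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Hypothetical usage mirroring the failing job:
# release("k8s-jkns-gci-gce-reboot-1-3", owner="ci-kubernetes-e2e-gce-master-new-gci-kubectl-skew")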

Another common error message (see https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-gci-kubectl-skew/1171360182899314693#0:build-log.txt%3A108 ):

W0910 09:50:59.574] 2019/09/10 09:50:59 main.go:319: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: status 500 Internal Server Error, status code 500
W0910 09:50:59.578] Traceback (most recent call last):
W0910 09:50:59.579]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 778, in <module>
W0910 09:50:59.579]     main(parse_args())
W0910 09:50:59.579]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 626, in main
W0910 09:50:59.579]     mode.start(runner_args)
W0910 09:50:59.579]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0910 09:50:59.579]     check_env(env, self.command, *args)
W0910 09:50:59.580]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0910 09:50:59.580]     subprocess.check_call(cmd, env=env)
W0910 09:50:59.580]   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
W0910 09:50:59.580]     raise CalledProcessError(retcode, cmd)
W0910 09:50:59.581] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--check-leaked-resources', '--check-version-skew=false', '--extract=ci/k8s-stable1', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=25', '--skew', '--test_args=--ginkgo.focus=Kubectl --ginkgo.skip=\\[Serial\\] --minStartupPods=8', '--timeout=120m')' returned non-zero exit status 1
E0910 09:50:59.587] Command failed
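
This second failure is the other end of the same lifecycle: before bringing a cluster up, kubetest asks Boskos to lease a free GCP project, and here that acquire call is coming back 500. A hedged sketch of such an acquire, with a simple retry on server errors, is below; the endpoint, parameters, and retry policy are assumptions for illustration, not kubetest's actual logic.

# Hedged sketch of a Boskos "acquire" request, the call failing with 500 above.
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BOSKOS_URL = "http://boskos.test-pods.svc.cluster.local"  # hypothetical in-cluster address

def acquire(resource_type: str, owner: str, retries: int = 3) -> dict:
    """Lease a clean resource of the given type, retrying on transient 5xx errors."""
    query = urllib.parse.urlencode(
        {"type": resource_type, "state": "free", "dest": "busy", "owner": owner})
    for attempt in range(retries):
        try:
            req = urllib.request.Request(f"{BOSKOS_URL}/acquire?{query}", method="POST")
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)  # e.g. {"name": "k8s-jkns-...", "type": "gce-project", ...}
        except urllib.error.HTTPError as err:
            if err.code >= 500 and attempt < retries - 1:
                time.sleep(2 ** attempt)  # back off, then retry the server error
                continue
            raise  # client error or final attempt: give up, as the job above did

# Hypothetical usage mirroring the failing job:
# acquire("gce-project", owner="ci-kubernetes-e2e-gce-cos-k8sbeta-slow")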

Both of these examples were taken from https://k8s-testgrid.appspot.com/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce and have also been seen in

The other class of failures that started around the same time involves timeouts, such as in https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/1171329981431681026 , where the run is filled with error messages of this sort:

I0910 11:11:54.038] ERROR: get pod list in disruptive-7341: Get https://35.247.0.170/api/v1/namespaces/disruptive-7341/pods: dial tcp 35.247.0.170:443: connect: connection refused

See https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial for more examples.

/milestone v1.16
/priority critical-urgent
/sig testing

@alejandrox1 (Contributor Author) commented Sep 10, 2019

/assign @lachie83 @Katharine
ptal

@alejandrox1 (Contributor Author) commented Sep 10, 2019

Adding to this... 1.16-blocking also has a couple of runs with 500 errors (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sbeta-slow/1171423096985358336#0:build-log.txt%3A108 ):

W0910 14:00:46.469] 2019/09/10 14:00:46 main.go:319: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: status 500 Internal Server Error, status code 500

@alejandrox1 changed the title from "Multiple jobs failing in CI" to "Multiple jobs flaking in CI" on Sep 10, 2019

@Katharine (Member) commented Sep 10, 2019

/assign @chases2

adding current oncall

@alejandrox1 (Contributor Author) commented Sep 10, 2019

This is the PR that went into test-infra around the same time the flakes started happening: kubernetes/test-infra#14221

From https://kubernetes.slack.com/archives/C2C40FMNF/p1568131379007500 :

aside from bumping prow, it contained a kubekins bump and a boskos rejig
Some unused PVs were removed and it became a Deployment instead of a StatefulSet.
neither of which are things that should matter

@alejandrox1 added this to Under investigation (prioritized) in 1.16 CI Signal on Sep 10, 2019

@alejandrox1 (Contributor Author) commented Sep 10, 2019

https://k8s-testgrid.appspot.com/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce just started failing, and the errors seem pretty close to what we have seen thus far:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-gci-kubectl-skew/1171498343428263936#0:build-log.txt%3A4044

W0910 19:21:13.324] 2019/09/10 19:21:13 main.go:314: [Boskos] Fail To Release: 1 error occurred:
W0910 19:21:13.325] 
W0910 19:21:13.325] * status 401 Unauthorized, statusCode 401 releasing k8s-boskos-gce-project-08, kubetest err: encountered 1 errors: [Error: 2 leaked resources
W0910 19:21:13.326] +NAME                            ADDRESS/RANGE  TYPE      PURPOSE  NETWORK  REGION    SUBNET  STATUS
W0910 19:21:13.326] +e2e-9f9b4391e5-abe28-master-ip  34.82.183.169  EXTERNAL                    us-west1          IN_USE]
W0910 19:21:13.326] Traceback (most recent call last):
W0910 19:21:13.327]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 778, in <module>
W0910 19:21:13.327]     main(parse_args())
W0910 19:21:13.327]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 626, in main
W0910 19:21:13.327]     mode.start(runner_args)
W0910 19:21:13.327]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0910 19:21:13.328]     check_env(env, self.command, *args)
W0910 19:21:13.328]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0910 19:21:13.328]     subprocess.check_call(cmd, env=env)
W0910 19:21:13.328]   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
W0910 19:21:13.328]     raise CalledProcessError(retcode, cmd)
W0910 19:21:13.329] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--check-leaked-resources', '--check-version-skew=false', '--extract=ci/k8s-stable1', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=25', '--skew', '--test_args=--ginkgo.focus=Kubectl --ginkgo.skip=\\[Serial\\] --minStartupPods=8', '--timeout=120m')' returned non-zero exit status 1

@alejandrox1 (Contributor Author) commented Sep 10, 2019

Another interesting set of failures is in https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-slow :
From https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/1171487776680448000#0:build-log.txt%3A807 :

W0910 18:47:48.689] 2019/09/10 18:47:48 main.go:754: [Boskos] Update of k8s-jkns-e2e-gce-gci-serial failed with status 401 Unauthorized, status code 401 updating k8s-jkns-e2e-gce-gci-serial
W0910 18:47:54.359] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:47:54.363] (kubectl failed, will retry 2 times)
W0910 18:47:55.542] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:47:55.545] (kubectl failed, will retry 1 times)
W0910 18:47:56.743] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:47:56.748] ('kubectl get nodes --no-headers' failed, giving up)
W0910 18:48:11.907] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:11.910] (kubectl failed, will retry 2 times)
W0910 18:48:13.073] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:13.076] (kubectl failed, will retry 1 times)
W0910 18:48:14.267] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:14.272] ('kubectl get nodes --no-headers' failed, giving up)
W0910 18:48:29.465] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:29.470] (kubectl failed, will retry 2 times)
W0910 18:48:30.614] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:30.619] (kubectl failed, will retry 1 times)
W0910 18:48:31.774] The connection to the server localhost:8080 was refused - did you specify the right host or port?
W0910 18:48:31.778] ('kubectl get nodes --no-headers' failed, giving up)
W0910 18:48:31.782] 2019/09/10 18:48:31 process.go:155: Step './hack/e2e-internal/e2e-down.sh' finished in 30m19.309711368s
W0910 18:48:31.782] 2019/09/10 18:48:31 process.go:96: Saved XML output to /workspace/_artifacts/junit_runner.xml.
I0910 18:48:31.883]  Failed to get nodes.
W0910 18:48:51.489] 2019/09/10 18:48:51 main.go:314: [Boskos] Fail To Release: 1 error occurred:
W0910 18:48:51.489] 
W0910 18:48:51.489] * status 401 Unauthorized, statusCode 401 releasing k8s-jkns-e2e-gce-gci-serial, kubetest err: error tearing down previous cluster: error during ./hack/e2e-internal/e2e-down.sh: exit status 1

@liggitt (Member) commented Sep 12, 2019

was this cleaned up by kubernetes/test-infra#14285?

@alejandrox1 (Contributor Author) commented Sep 12, 2019

It was!
Thank you @Katharine @michelle192837 and @chases2
/close

@k8s-ci-robot (Contributor) commented Sep 12, 2019

@alejandrox1: Closing this issue.

In response to this:

It was!
Thank you @Katharine @michelle192837 and @chases2
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1.16 CI Signal automation moved this from Under investigation (prioritized) to Observing (observe test failure/flake before marking as resolved) Sep 12, 2019

@alejandrox1 moved this from Observing (observe test failure/flake before marking as resolved) to Resolved (week Sep 9) in 1.16 CI Signal on Sep 12, 2019
