gke 1.3 test cluster failing bringup because build images not found #28364

Closed
bprashanth opened this issue Jul 1, 2016 · 37 comments
Labels
area/test-infra priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@bprashanth
Contributor

http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-ingress-release-1.3/720/consoleFull
http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-1.1-1.3-upgrade-cluster/

09:59:55 Jul  1 09:59:55.355: INFO: At {2016-07-01 09:48:14 -0700 PDT} - event for kube-proxy-gke-jenkins-e2e-default-pool-8ea565a8-7yf8: {kubelet gke-jenkins-e2e-default-pool-8ea565a8-7yf8} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "kube-proxy" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b\""

gce-1.3 seems fine
@kubernetes/goog-testing
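
For anyone looking at a live repro, the same events can be pulled up by hand. A minimal sketch, assuming kubectl is pointed at the broken test cluster (the pod name is just the one from the log above):

$ kubectl get events --namespace=kube-system    # recent kube-system events, including the ImagePullBackOff ones
$ kubectl describe pod kube-proxy-gke-jenkins-e2e-default-pool-8ea565a8-7yf8 --namespace=kube-system    # details and events for the failing pod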

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/test-infra labels Jul 1, 2016
@erictune erictune added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 1, 2016
@erictune
Member

erictune commented Jul 1, 2016

Raising priority as this is blocking the release.

Dawn said that she has seen this problem before, and thinks it may be a race in the test infra, or a GCR.io issue.

@erictune
Member

erictune commented Jul 1, 2016

@david-mcmahon

@dchen1107
Member

Yes, I saw this before due to a version skew issue in the test infra.

@bprashanth
Contributor Author

@kubernetes/goog-gke since http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-release-1.3/ seems ok

@erictune
Member

erictune commented Jul 1, 2016

This seems like it could be an infra failure (build or GCR) because tests initially passed with the latest commits on the branch but then started failing.

@erictune
Member

erictune commented Jul 1, 2016

@david-mcmahon pointed this out.

@david-mcmahon
Contributor

224+27e212baad161c landed in Job #969 on http://kubekins.dls.corp.google.com/view/CI%201.3/job/kubernetes-e2e-gke-release-1.3/ and stayed green until Job #974, then fell over (between 6:25pm and 6:44pm PDT).

@erictune
Member

erictune commented Jul 1, 2016

Specifically, from http://kubekins.dls.corp.google.com/view/CI%201.3/
I see these failures:

  • kubernetes-e2e-aws-release-1.3 : failing since before we cut beta.3
  • kubernetes-e2e-gce-master-on-cvm : I don't understand what version this is running
  • kubernetes-e2e-gke-reboot-release-1.3 : success was with version v1.3.0-beta.2.224+27e212baad161c, which is the release candidate (based on 27e212b).
  • kubernetes-e2e-gke-slow-release-1.3 : passed with release candidate.
  • kubernetes-e2e-gke-release-1.3 : passed with release candidate.
  • kubernetes-e2e-gke-serial-release-1.3 : last passed with previous candidate 😦
  • kubernetes-soak-continuous-e2e-gce-1.3 : has not passed with the current release candidate, but takes a long time to run, so could be due to infra failure.

@david-mcmahon
Contributor

david-mcmahon commented Jul 1, 2016

So if we assume all of the HEAD failures that occurred on the branch after 224+27e212baad161c can be discounted (I'd still like to figure this out), then we're dealing with just kubernetes-e2e-gke-serial-release-1.3 looking bad with 224+27e212baad161c and it's a long list of failures:
http://kubekins.dls.corp.google.com/view/CI%201.3/job/kubernetes-e2e-gke-serial-release-1.3/146/

@erictune
Member

erictune commented Jul 1, 2016

What is weird is that at least one test passes at 224+27e212baad161c but fails at 1.3.0-beta.3.

@erictune
Member

erictune commented Jul 1, 2016

@erictune
Member

erictune commented Jul 1, 2016

So, the only difference is changing the version tag.

@erictune
Member

erictune commented Jul 1, 2016

Was this caused by asking @ixdy to restart those jenkinses?

@roberthbailey
Contributor

Have there been any commits to the release branch since we cut the beta3 release last night? In the past, we've sometimes seen issues where tests will fail until the first commit is pushed to the branch after a release has been cut.

@erictune
Member

erictune commented Jul 1, 2016

That must relate to the image pull errors.

@ixdy
Member

ixdy commented Jul 1, 2016

CI builds aren't pushed to gcr.io - so something like gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b is side-loaded, at least for GCE. I don't know how cluster deployment for GKE CI works.
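
For anyone with node access who wants to confirm the side-loading, a minimal check (run on a CI node over SSH; assumes sudo access to the Docker daemon there) is to list the locally loaded kube-proxy images and their tags:

$ sudo docker images gcr.io/google_containers/kube-proxy    # tags present locally on the node, without touching the registry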

@erictune
Member

erictune commented Jul 1, 2016

That explains why:

$ docker pull gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b
Tag c7973d967d2273e7856e5471b8c82a7b not found in repository gcr.io/google_containers/kube-proxy

when I try from my laptop.

@ixdy
Member

ixdy commented Jul 1, 2016

Right, we don't push those tags to gcr.io.

@ixdy
Member

ixdy commented Jul 1, 2016

Did anything change on GKE side yesterday?

@ixdy
Member

ixdy commented Jul 1, 2016

hm, though the other GKE CI builds (on master) are unaffected. This is strange.

@mml
Contributor

mml commented Jul 1, 2016

We don't log the gcloud commands we issue. It looks like, among other things, the tag change affects the --cluster-version flag we hand to gcloud container clusters create.
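
For context, the create call presumably looks something along these lines; apart from --cluster-version, the flag values here are illustrative guesses rather than what the harness actually passes:

$ gcloud container clusters create jenkins-e2e \
    --project=k8s-jkns-gke-serial-1-3 \
    --zone=us-central1-f \
    --num-nodes=3 \
    --cluster-version=1.3.0-beta.3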

I don't have permission to access the project, but I wonder if we can simply examine the objects created by the current in-progress test to see if they all make sense.

@erictune
Member

erictune commented Jul 1, 2016

@maisem tells me that the way that images are loaded may be different on GKE (vs GCE) for beta vs x.y.z releases.

Also, someone in the GKE meeting just now said that there is sometimes a problem with the test infra when there is a release tag with no commits after it. We might need to push a whitespace change to the branch.
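
If we do go that route, an empty commit is probably the cleanest stand-in for a whitespace change. A sketch only, assuming push access and the usual release-1.3 branch and upstream remote names:

$ git checkout release-1.3
$ git commit --allow-empty -m "kick CI after cutting v1.3.0-beta.3"
$ git push upstream release-1.3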

@ixdy
Member

ixdy commented Jul 1, 2016

kube-proxy looks OK:

$ kubectl get pods --namespace=kube-system 
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-1q0d   1/1       Running   0          9m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-noo6   1/1       Running   0          9m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-uj0g   1/1       Running   0          10m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-1q0d              1/1       Running   0          9m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-noo6              1/1       Running   0          9m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-uj0g              1/1       Running   0          10m

We're missing other pods?

@bprashanth
Contributor Author

Where is that from?

The last run of http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-release-1.3/1010/consoleFull

has

12:14:10 Jul  1 12:13:46.062: INFO: At {2016-07-01 12:02:27 -0700 PDT} - event for kube-proxy-gke-jenkins-e2e-default-pool-92b9c9c0-jcxd: {kubelet gke-jenkins-e2e-default-pool-92b9c9c0-jcxd} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "kube-proxy" with ErrImagePull: "Tag c7973d967d2273e7856e5471b8c82a7b not found in repository gcr.io/google_containers/kube-proxy"

@ixdy
Member

ixdy commented Jul 1, 2016

That was from the k8s-jkns-gke-serial-1-3 project, though the cluster just turned down.

I think those error messages are red herrings; they only appear while the cluster is still being turned up and images are being loaded.

@ixdy
Member

ixdy commented Jul 1, 2016

Looking at gke-slow-1.3 now. Same thing:

$ kubectl get pods --namespace=kube-system
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-nod9   1/1       Running   0          8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-o217   1/1       Running   0          8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-zln1   1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-nod9              1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-o217              1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-zln1              1/1       Running   0          8m

@ixdy
Member

ixdy commented Jul 1, 2016

Also curious:

$ kubectl get services --namespace=kube-system
$

Compare to the GKE CI job on master branch:

$ kubectl get pods --namespace=kube-system                                      
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-dz77   1/1       Running   0          5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nda6   1/1       Running   0          5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nlup   1/1       Running   0          5m
heapster-v1.1.0-2096339923-6zmjs                                   2/2       Running   0          4m
kube-dns-v18-012sz                                                 3/3       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-dz77              1/1       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nda6              1/1       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nlup              1/1       Running   0          5m
kubernetes-dashboard-v1.1.0-8tvwt                                  1/1       Running   0          5m
l7-default-backend-v1.0-ve0cr                                      1/1       Running   0          5m
$ kubectl get services --namespace=kube-system
NAME                   CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
default-http-backend   10.183.240.64    nodes         80/TCP          6m
heapster               10.183.245.239   <none>        80/TCP          6m
kube-dns               10.183.240.10    <none>        53/UDP,53/TCP   6m
kubernetes-dashboard   10.183.247.185   <none>        80/TCP          6m

@bprashanth
Contributor Author

It's only waiting for 6 pods, and at the end I see 6 pods running/ready, so the tests should start without addons (though some might fail without DNS). Still, figuring out why the addons weren't created might be a clue.

@roberthbailey
Contributor

thanks @ixdy -- that clearly points the finger at the addons as the problem. I think the messages about side-loading kube-proxy may have been a red herring.

@ixdy
Member

ixdy commented Jul 1, 2016

It's actually waiting for at least 8 pods, which is why it's failing:

INFO: Waiting up to 10m0s for all pods (need at least 8) in namespace 'kube-system' to be running and ready
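
A rough by-hand equivalent of that check, for anyone poking at the failing cluster (this only looks at the STATUS column, so it approximates rather than reproduces the e2e "running and ready" condition):

$ kubectl get pods --namespace=kube-system --no-headers | awk '$3 == "Running"' | wc -l    # the check wants at least 8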

@bprashanth
Contributor Author

bprashanth commented Jul 1, 2016

Oh, I guess we made it wait for 8 in https://github.com/kubernetes/test-infra/pull/122/files#diff-a4ca333487b05827cb35ea441b36c00dR22, but that logic is flawed, because 6 of those are static pods. @girishkalele

(Either way, that doesn't explain what happened to the remaining pods in kube-system.)

@ixdy
Member

ixdy commented Jul 1, 2016

Though that check had been working fine up till last night:

18:31:54 Jun 30 18:31:24.583: INFO: Waiting up to 10m0s for all pods (need at least 8) in namespace 'kube-system' to be running and ready
18:31:54 Jun 30 18:31:24.634: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]
18:31:54 Jun 30 18:31:24.645: INFO: The status of Pod heapster-v1.1.0-2096339923-g1g3f is Pending, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:24.645: INFO: The status of Pod kube-dns-v17-ab3dj is Running, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:24.645: INFO: 9 / 11 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
18:31:54 Jun 30 18:31:24.645: INFO: expected 3 pod replicas in namespace 'kube-system', 2 are Running and Ready.
18:31:54 Jun 30 18:31:24.645: INFO: POD                               NODE                                        PHASE    GRACE  CONDITIONS
18:31:54 Jun 30 18:31:24.645: INFO: heapster-v1.1.0-2096339923-g1g3f  gke-jenkins-e2e-default-pool-f224c8f0-pu77  Pending         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT} ContainersNotReady containers with unready status: [heapster heapster-nanny]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT}  }]
18:31:54 Jun 30 18:31:24.645: INFO: kube-dns-v17-ab3dj                gke-jenkins-e2e-default-pool-f224c8f0-p8wv  Running         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT} ContainersNotReady containers with unready status: [kubedns]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  }]
...
18:31:54 Jun 30 18:31:52.660: INFO: The status of Pod kube-dns-v17-ab3dj is Running, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:52.770: INFO: 9 / 10 pods in namespace 'kube-system' are running and ready (28 seconds elapsed)
18:31:54 Jun 30 18:31:52.770: INFO: expected 3 pod replicas in namespace 'kube-system', 2 are Running and Ready.
18:31:54 Jun 30 18:31:52.770: INFO: POD                 NODE                                        PHASE    GRACE  CONDITIONS
18:31:54 Jun 30 18:31:52.770: INFO: kube-dns-v17-ab3dj  gke-jenkins-e2e-default-pool-f224c8f0-p8wv  Running         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT} ContainersNotReady containers with unready status: [kubedns]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  }]
18:31:54 Jun 30 18:31:52.770: INFO: 
18:31:54 Jun 30 18:31:54.660: INFO: 10 / 10 pods in namespace 'kube-system' are running and ready (30 seconds elapsed)
18:31:54 Jun 30 18:31:54.660: INFO: expected 3 pod replicas in namespace 'kube-system', 3 are Running and Ready.
18:31:54 Jun 30 18:31:54.666: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]

@bprashanth
Contributor Author

Yeah, sorry, my last comment was slightly off topic. The check was not checking what we wanted it to. It doesn't explain why the addons are suddenly not being created.

@girishkalele

@bprashanth

The check for 8 pods will actually catch the "failure to launch addons" failure, since 2 out of the 6 are created by the addon manager.
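
One quick way to see which of those are missing is to list the kube-system pods with their labels and compare against the healthy master-branch output earlier in the thread (heapster, kube-dns, kubernetes-dashboard, and l7-default-backend are the addon-created pods we'd expect to show up):

$ kubectl get pods --namespace=kube-system --show-labels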

@ixdy
Member

ixdy commented Jul 1, 2016

Since this appears to be GKE-specific, do we want to take discussion to an internal bug?

@ixdy
Member

ixdy commented Jul 1, 2016

Opened internal bug 29942775.

@ixdy
Member

ixdy commented Jul 6, 2016

Root cause fixed with kubernetes/release#28, and @david-mcmahon manually fixed v1.3.0-beta.3 and v1.3.0.

@ixdy ixdy closed this as completed Jul 6, 2016