gke 1.3 test cluster failing bringup because build images not found #28364

Closed
bprashanth opened this issue Jul 1, 2016 · 37 comments
Labels
area/test-infra priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@bprashanth
Contributor

http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-ingress-release-1.3/720/consoleFull
http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-1.1-1.3-upgrade-cluster/

09:59:55 Jul  1 09:59:55.355: INFO: At {2016-07-01 09:48:14 -0700 PDT} - event for kube-proxy-gke-jenkins-e2e-default-pool-8ea565a8-7yf8: {kubelet gke-jenkins-e2e-default-pool-8ea565a8-7yf8} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "kube-proxy" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b\""

gce-1.3 seems fine
@kubernetes/goog-testing
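
For anyone looking at a live repro, the same events can be pulled up by hand. A minimal sketch, assuming kubectl is pointed at the broken test cluster (the pod name is just the one from the log above):

$ kubectl get events --namespace=kube-system    # recent kube-system events, including the ImagePullBackOff ones
$ kubectl describe pod kube-proxy-gke-jenkins-e2e-default-pool-8ea565a8-7yf8 --namespace=kube-system    # details and events for the failing pod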

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/test-infra labels Jul 1, 2016
@erictune erictune added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 1, 2016
@erictune
Member

erictune commented Jul 1, 2016

Raising priority as this is blocking the release.

Dawn said that she has seen this problem before, and thinks it may be a race in the test infra, or a GCR.io issue.

@erictune
Member

erictune commented Jul 1, 2016

@david-mcmahon

@dchen1107
Member

Yes, I saw this before due to a version skew issue in the test infra.

@bprashanth
Contributor Author

@kubernetes/goog-gke since http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-release-1.3/ seems ok

@erictune
Member

erictune commented Jul 1, 2016

This seems like it could be an infra failure (build or GCR) because tests initially passed with the latest commits on the branch but then started failing.

@erictune
Member

erictune commented Jul 1, 2016

@david-mcmahon pointed this out.

@david-mcmahon
Contributor

224+27e212baad161c landed in Job #969 on http://kubekins.dls.corp.google.com/view/CI%201.3/job/kubernetes-e2e-gke-release-1.3/ and stayed green until Job #974, then fell over (between 6:25pm and 6:44pm PDT).

@erictune
Member

erictune commented Jul 1, 2016

Specifically, from http://kubekins.dls.corp.google.com/view/CI%201.3/
I see these failures:

  • kubernetes-e2e-aws-release-1.3 : failing since before we cut beta.3
  • kubernetes-e2e-gce-master-on-cvm : I don't understand what version this is running
  • kubernetes-e2e-gke-reboot-release-1.3 : success was with version v1.3.0-beta.2.224+27e212baad161c, which is the release candidate (based on 27e212b).
  • kubernetes-e2e-gke-slow-release-1.3 : passed with release candidate.
  • kubernetes-e2e-gke-release-1.3 : passed with release candidate.
  • kubernetes-e2e-gke-serial-release-1.3 : last passed with previous candidate 😦
  • kubernetes-soak-continuous-e2e-gce-1.3 : has not passed with the current release candidate, but takes a long time to run, so could be due to infra failure.

@david-mcmahon
Contributor

david-mcmahon commented Jul 1, 2016

So if we assume all of the HEAD failures that occurred on the branch after 224+27e212baad161c can be discounted (I'd still like to figure this out), then we're dealing with just kubernetes-e2e-gke-serial-release-1.3 looking bad with 224+27e212baad161c and it's a long list of failures:
http://kubekins.dls.corp.google.com/view/CI%201.3/job/kubernetes-e2e-gke-serial-release-1.3/146/

@erictune
Member

erictune commented Jul 1, 2016

What is weird is that at least one test passes at 224+27e212baad161c but fails at 1.3.0-beta.3.

@erictune
Member

erictune commented Jul 1, 2016

@erictune
Member

erictune commented Jul 1, 2016

So, the only difference is changing the version tag.

@erictune
Member

erictune commented Jul 1, 2016

Was this caused by asking @ixdy to restart those jenkinses?

@roberthbailey
Contributor

Have there been any commits to the release branch since we cut the beta3 release last night? In the past, we've sometimes seen issues where tests will fail until the first commit is pushed to the branch after a release has been cut.

@erictune
Member

erictune commented Jul 1, 2016

That must relate to the image pull errors.

@ixdy
Member

ixdy commented Jul 1, 2016

CI builds aren't pushed to gcr.io - so something like gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b is side-loaded, at least for GCE. I don't know how cluster deployment for GKE CI works.
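
For anyone with node access who wants to confirm the side-loading, a minimal check (run on a CI node over SSH; assumes sudo access to the Docker daemon there) is to list the locally loaded kube-proxy images and their tags:

$ sudo docker images gcr.io/google_containers/kube-proxy    # tags present locally on the node, without touching the registry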

@erictune
Member

erictune commented Jul 1, 2016

That explains why:

$ docker pull gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b
Tag c7973d967d2273e7856e5471b8c82a7b not found in repository gcr.io/google_containers/kube-proxy

when I try from my laptop.

@ixdy
Member

ixdy commented Jul 1, 2016

Right, we don't push those tags to gcr.io.

@ixdy
Member

ixdy commented Jul 1, 2016

Did anything change on GKE side yesterday?

@ixdy
Member

ixdy commented Jul 1, 2016

hm, though the other GKE CI builds (on master) are unaffected. This is strange.

@mml
Contributor

mml commented Jul 1, 2016

We don't log the gcloud commands we issue. It looks like, among other things, the tag change affects the --cluster-version flag we hand to gcloud container clusters create.
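
For context, the create call presumably looks something along these lines; apart from --cluster-version, the flag values here are illustrative guesses rather than what the harness actually passes:

$ gcloud container clusters create jenkins-e2e \
    --project=k8s-jkns-gke-serial-1-3 \
    --zone=us-central1-f \
    --num-nodes=3 \
    --cluster-version=1.3.0-beta.3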

I don't have permission to access the project, but I wonder if we can simply examine the objects created by the current in-progress test to see if they all make sense.

@erictune
Member

erictune commented Jul 1, 2016

@maisem tells me that the way that images are loaded may be different on GKE (vs GCE) for beta vs x.y.z releases.

Also, someone in the GKE meeting just now said that there is sometimes a problem with the test infra when there is a release tag with no commits after it. We might need to push a whitespace change to the branch.
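
If we do go that route, an empty commit is probably the cleanest stand-in for a whitespace change. A sketch only, assuming push access and the usual release-1.3 branch and upstream remote names:

$ git checkout release-1.3
$ git commit --allow-empty -m "kick CI after cutting v1.3.0-beta.3"
$ git push upstream release-1.3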

@ixdy
Member

ixdy commented Jul 1, 2016

kube-proxy looks OK:

$ kubectl get pods --namespace=kube-system 
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-1q0d   1/1       Running   0          9m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-noo6   1/1       Running   0          9m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-f2ebe5b6-uj0g   1/1       Running   0          10m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-1q0d              1/1       Running   0          9m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-noo6              1/1       Running   0          9m
kube-proxy-gke-jenkins-e2e-default-pool-f2ebe5b6-uj0g              1/1       Running   0          10m

We're missing other pods?

@bprashanth
Contributor Author

Where is that from?

The last run of http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-release-1.3/1010/consoleFull

has

12:14:10 Jul  1 12:13:46.062: INFO: At {2016-07-01 12:02:27 -0700 PDT} - event for kube-proxy-gke-jenkins-e2e-default-pool-92b9c9c0-jcxd: {kubelet gke-jenkins-e2e-default-pool-92b9c9c0-jcxd} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "kube-proxy" with ErrImagePull: "Tag c7973d967d2273e7856e5471b8c82a7b not found in repository gcr.io/google_containers/kube-proxy"

@ixdy
Member

ixdy commented Jul 1, 2016

That was from the k8s-jkns-gke-serial-1-3 project, though the cluster just turned down.

I think those error messages are red herrings; they only appear while the cluster is still being turned up and images are being loaded.

@ixdy
Member

ixdy commented Jul 1, 2016

Looking at gke-slow-1.3 now. Same thing:

$ kubectl get pods --namespace=kube-system
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-nod9   1/1       Running   0          8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-o217   1/1       Running   0          8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-zln1   1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-nod9              1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-o217              1/1       Running   0          8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-zln1              1/1       Running   0          8m

@ixdy
Member

ixdy commented Jul 1, 2016

Also curious:

$ kubectl get services --namespace=kube-system
$

Compare to the GKE CI job on master branch:

$ kubectl get pods --namespace=kube-system                                      
NAME                                                               READY     STATUS    RESTARTS   AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-dz77   1/1       Running   0          5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nda6   1/1       Running   0          5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nlup   1/1       Running   0          5m
heapster-v1.1.0-2096339923-6zmjs                                   2/2       Running   0          4m
kube-dns-v18-012sz                                                 3/3       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-dz77              1/1       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nda6              1/1       Running   0          5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nlup              1/1       Running   0          5m
kubernetes-dashboard-v1.1.0-8tvwt                                  1/1       Running   0          5m
l7-default-backend-v1.0-ve0cr                                      1/1       Running   0          5m
$ kubectl get services --namespace=kube-system
NAME                   CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
default-http-backend   10.183.240.64    nodes         80/TCP          6m
heapster               10.183.245.239   <none>        80/TCP          6m
kube-dns               10.183.240.10    <none>        53/UDP,53/TCP   6m
kubernetes-dashboard   10.183.247.185   <none>        80/TCP          6m

@bprashanth
Contributor Author

It's only waiting for 6 pods, and at the end I see 6 pods running/ready, so the tests should start without addons (though some might fail without DNS). Still, figuring out why the addons weren't created might be a clue.

@roberthbailey
Contributor

thanks @ixdy -- that clearly points the finger at the addons as the problem. I think the messages about side-loading kube-proxy may have been a red herring.

@ixdy
Member

ixdy commented Jul 1, 2016

It's actually waiting for at least 8 pods, which is why it's failing:

INFO: Waiting up to 10m0s for all pods (need at least 8) in namespace 'kube-system' to be running and ready
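
A rough by-hand equivalent of that check, for anyone poking at the failing cluster (this only looks at the STATUS column, so it approximates rather than reproduces the e2e "running and ready" condition):

$ kubectl get pods --namespace=kube-system --no-headers | awk '$3 == "Running"' | wc -l    # the check wants at least 8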

@bprashanth
Contributor Author

bprashanth commented Jul 1, 2016

Oh, I guess we made it wait for 8 in https://github.com/kubernetes/test-infra/pull/122/files#diff-a4ca333487b05827cb35ea441b36c00dR22, but that logic is flawed, because 6 of those are static pods. @girishkalele

(Either way, that doesn't explain what happened to the remaining pods in kube-system.)

@ixdy
Member

ixdy commented Jul 1, 2016

Though that check had been working fine up till last night:

18:31:54 Jun 30 18:31:24.583: INFO: Waiting up to 10m0s for all pods (need at least 8) in namespace 'kube-system' to be running and ready
18:31:54 Jun 30 18:31:24.634: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]
18:31:54 Jun 30 18:31:24.645: INFO: The status of Pod heapster-v1.1.0-2096339923-g1g3f is Pending, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:24.645: INFO: The status of Pod kube-dns-v17-ab3dj is Running, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:24.645: INFO: 9 / 11 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
18:31:54 Jun 30 18:31:24.645: INFO: expected 3 pod replicas in namespace 'kube-system', 2 are Running and Ready.
18:31:54 Jun 30 18:31:24.645: INFO: POD                               NODE                                        PHASE    GRACE  CONDITIONS
18:31:54 Jun 30 18:31:24.645: INFO: heapster-v1.1.0-2096339923-g1g3f  gke-jenkins-e2e-default-pool-f224c8f0-pu77  Pending         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT} ContainersNotReady containers with unready status: [heapster heapster-nanny]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:22 -0700 PDT}  }]
18:31:54 Jun 30 18:31:24.645: INFO: kube-dns-v17-ab3dj                gke-jenkins-e2e-default-pool-f224c8f0-p8wv  Running         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT} ContainersNotReady containers with unready status: [kubedns]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  }]
...
18:31:54 Jun 30 18:31:52.660: INFO: The status of Pod kube-dns-v17-ab3dj is Running, waiting for it to be either Running or Failed
18:31:54 Jun 30 18:31:52.770: INFO: 9 / 10 pods in namespace 'kube-system' are running and ready (28 seconds elapsed)
18:31:54 Jun 30 18:31:52.770: INFO: expected 3 pod replicas in namespace 'kube-system', 2 are Running and Ready.
18:31:54 Jun 30 18:31:52.770: INFO: POD                 NODE                                        PHASE    GRACE  CONDITIONS
18:31:54 Jun 30 18:31:52.770: INFO: kube-dns-v17-ab3dj  gke-jenkins-e2e-default-pool-f224c8f0-p8wv  Running         [{Initialized True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  } {Ready False {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT} ContainersNotReady containers with unready status: [kubedns]} {PodScheduled True {0001-01-01 00:00:00 +0000 UTC} {2016-06-30 18:31:14 -0700 PDT}  }]
18:31:54 Jun 30 18:31:52.770: INFO: 
18:31:54 Jun 30 18:31:54.660: INFO: 10 / 10 pods in namespace 'kube-system' are running and ready (30 seconds elapsed)
18:31:54 Jun 30 18:31:54.660: INFO: expected 3 pod replicas in namespace 'kube-system', 3 are Running and Ready.
18:31:54 Jun 30 18:31:54.666: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]

@bprashanth
Contributor Author

Yeah, sorry, my last comment was slightly off topic. The check was not checking what we wanted it to. It doesn't explain why the addons are suddenly not being created.

@girishkalele

@bprashanth

The check for 8 pods will actually catch the "failure to launch addons" failure, since 2 out of the 6 are created by the addon manager.
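
One quick way to see which of those are missing is to list the kube-system pods with their labels and compare against the healthy master-branch output earlier in the thread (heapster, kube-dns, kubernetes-dashboard, and l7-default-backend are the addon-created pods we'd expect to show up):

$ kubectl get pods --namespace=kube-system --show-labels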

@ixdy
Member

ixdy commented Jul 1, 2016

Since this appears to be GKE-specific, do we want to take discussion to an internal bug?

@ixdy
Member

ixdy commented Jul 1, 2016

Opened internal bug 29942775.

@ixdy
Member

ixdy commented Jul 6, 2016

Root cause fixed with kubernetes/release#28, and @david-mcmahon manually fixed v1.3.0-beta.3 and v1.3.0.

@ixdy ixdy closed this as completed Jul 6, 2016