gke 1.3 test cluster failing bringup because build images not found #28364
Raising prio as this is blocking release. Dawn said that she has seen this problem before, and thinks it may be a race in the test infra. |
Yes, I saw this before due to some version skew issue in the test infra. |
@kubernetes/goog-gke since http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-release-1.3/ seems ok |
This seems like it could be an infra failure (build or GCR) because tests initially passed with the latest commits on the branch but then started failing. |
@david-mcmahon pointed this out. |
224+27e212baad161c landed in Job #969 on http://kubekins.dls.corp.google.com/view/CI%201.3/job/kubernetes-e2e-gke-release-1.3/ and stayed green until Job #974 and then it fell over (between 6:25pm and 6:44pm PDT) |
Specifically, from http://kubekins.dls.corp.google.com/view/CI%201.3/
|
So if we assume all of the HEAD failures that occurred on the branch after |
What is weird is that at least one test passes at 224+27e212baad161c but fails at 1.3.0-beta3 |
Compare:
|
So, the only difference is the version tag. |
Was this caused by asking @ixdy to restart those jenkinses? |
Have there been any commits to the release branch since we cut the beta3 release last night? In the past, we've sometimes seen issues where tests will fail until the first commit is pushed to the branch after a release has been cut. |
That must relate to the image pull errors. |
CI builds aren't pushed to gcr.io - so something like |
That explains why
$ docker pull gcr.io/google_containers/kube-proxy:c7973d967d2273e7856e5471b8c82a7b
fails with "Tag c7973d967d2273e7856e5471b8c82a7b not found in repository gcr.io/google_containers/kube-proxy" when I try from my laptop. |
Right, we don't push those tags to gcr.io. |
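For reference, one quick way to confirm which tags actually exist for a public gcr.io image is to query the registry's tag list directly (a sketch, assuming the standard Docker Registry v2 tags endpoint is exposed for this repository):
$ curl -s https://gcr.io/v2/google_containers/kube-proxy/tags/list
# returns a JSON document with a "tags" array; a CI-only tag such as
# c7973d967d2273e7856e5471b8c82a7b would simply be absent, matching the pull failure above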
Did anything change on GKE side yesterday? |
hm, though the other GKE CI builds (on master) are unaffected. This is strange. |
We don't log the … I don't have permission to access the project, but I wonder if we can simply examine the objects created by the currently in-progress test to see whether they all make sense. |
@maisem tells me that the way that images are loaded may be different on GKE (vs GCE) for beta vs x.y.z releases. Also, someone in the GKE meeting just now said that there is sometimes a problem with the test infra when there is a release tag with no commits after it. We might need to push a whitespace change to the branch. |
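If a dummy commit really is what's needed to unwedge the branch, a minimal sketch (assuming push rights on the release-1.3 branch; the remote name "upstream" is just a placeholder) is an empty commit rather than an actual whitespace change:
$ git checkout release-1.3
$ git commit --allow-empty -m "Kick CI after cutting v1.3.0-beta.3"   # no file changes needed
$ git push upstream release-1.3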
kube-proxy looks OK:
We're missing other pods? |
Where is that from? The last run of http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-release-1.3/1010/consoleFull has
|
That was from the k8s-jkns-gke-serial-1-3 project, though that cluster was just turned down. I think those error messages are red herrings; they only appear while the cluster is still being turned up and images are being loaded. |
Looking at gke-slow-1.3 now. Same thing: $ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-nod9 1/1 Running 0 8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-o217 1/1 Running 0 8m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-ab08f859-zln1 1/1 Running 0 8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-nod9 1/1 Running 0 8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-o217 1/1 Running 0 8m
kube-proxy-gke-jenkins-e2e-default-pool-ab08f859-zln1 1/1 Running 0 8m |
Also curious: $ kubectl get services --namespace=kube-system
returns nothing at all. Compare to the GKE CI job on the master branch: $ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-dz77 1/1 Running 0 5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nda6 1/1 Running 0 5m
fluentd-cloud-logging-gke-jenkins-e2e-default-pool-7b3029b2-nlup 1/1 Running 0 5m
heapster-v1.1.0-2096339923-6zmjs 2/2 Running 0 4m
kube-dns-v18-012sz 3/3 Running 0 5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-dz77 1/1 Running 0 5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nda6 1/1 Running 0 5m
kube-proxy-gke-jenkins-e2e-default-pool-7b3029b2-nlup 1/1 Running 0 5m
kubernetes-dashboard-v1.1.0-8tvwt 1/1 Running 0 5m
l7-default-backend-v1.0-ve0cr 1/1 Running 0 5m
$ kubectl get services --namespace=kube-system
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default-http-backend 10.183.240.64 nodes 80/TCP 6m
heapster 10.183.245.239 <none> 80/TCP 6m
kube-dns 10.183.240.10 <none> 53/UDP,53/TCP 6m
kubernetes-dashboard 10.183.247.185 <none> 80/TCP 6m |
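To make the comparison concrete, one way to see exactly which pods are missing is to diff the kube-system pod names between the two clusters (a sketch; the "broken-1.3" and "healthy-master" context names are placeholders for whatever kubeconfig contexts point at the failing and passing clusters):
$ kubectl --context=broken-1.3 get pods --namespace=kube-system -o name | sort > broken.txt
$ kubectl --context=healthy-master get pods --namespace=kube-system -o name | sort > healthy.txt
$ diff broken.txt healthy.txt
# only the addon-managed pods (heapster, kube-dns, kubernetes-dashboard, l7-default-backend)
# should show up as missing on the broken cluster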
It's only waiting for 6 pods, and at the end I see 6 pods running/ready, so the tests should start even without the addons (though some might fail without DNS). Still, figuring out why the addons weren't created might be a clue. |
thanks @ixdy -- that clearly points the finger at the addons as the problem. I think the messages about side-loading kube-proxy may have been a red herring. |
It's actually waiting for at least 8 pods, which is why it's failing:
|
Oh, I guess we made it wait for 8 in https://github.com/kubernetes/test-infra/pull/122/files#diff-a4ca333487b05827cb35ea441b36c00dR22, but that logic is flawed then, because 6 of them are static pods @girishkalele (either way, that doesn't explain what happened to the remaining pods in kube-system) |
Though that check had been working fine up till last night:
|
Yeah, sorry, my last comment was slightly off topic. The check was not checking what we wanted it to. It doesn't explain why the addons suddenly stopped being created. |
The check for 8 pods will actually catch the "failure to launch addons" failure, since at least 2 of the 8 have to be created by the addon manager. |
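Roughly, the readiness gate being discussed is just a count of Running pods in kube-system, something like the sketch below (not the actual test-infra code; MINIMUM_PODS=8 is the value from the PR linked above):
$ MINIMUM_PODS=8
$ RUNNING=$(kubectl get pods --namespace=kube-system --no-headers | grep -c ' Running ')
$ [ "${RUNNING}" -ge "${MINIMUM_PODS}" ] || echo "only ${RUNNING}/${MINIMUM_PODS} kube-system pods Running; cluster not ready"
If only the 6 static pods (fluentd + kube-proxy) ever come up, a check like this blocks the suite, which matches the failure mode above.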
Since this appears to be GKE-specific, do we want to take discussion to an internal bug? |
Opened internal bug 29942775. |
Root cause fixed with kubernetes/release#28, and @david-mcmahon manually fixed v1.3.0-beta.3 and v1.3.0. |
http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-ingress-release-1.3/720/consoleFull
http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-1.1-1.3-upgrade-cluster/
gce-1.3 seems fine
@kubernetes/goog-testing