[Failing Test] sig-release-1.16-blocking (gce-cos-k8sstable1-alphafeatures) #84995

Closed
tpepper opened this issue Nov 8, 2019 · 11 comments

@tpepper (Contributor) commented Nov 8, 2019

Which jobs are failing:
sig-release-1.16-blocking (gce-cos-k8sstable1-alphafeatures)

Which test(s) are failing:
BeforeSuite, Overall, Test

Since when has it been failing:
Nov. 4, 2019

Testgrid link:
https://testgrid.k8s.io/sig-release-1.16-blocking#gce-cos-k8sstable1-alphafeatures

Reason for failure:
Possibly relevant log bits:

I1104 23:29:37.767] Nov  4 23:29:37.760: INFO: At 2019-11-04 23:17:55 +0000 UTC - event for fluentd-gcp-v3.2.0-xqjrd: {kubelet test-86ab064463-minion-group-nt8z} FailedMount: MountVolume.SetUp failed for volume "config-volume" : failed to sync configmap cache: timed out waiting for the condition
I1104 23:29:37.771] Nov  4 23:29:37.770: INFO: At 2019-11-04 23:17:55 +0000 UTC - event for metrics-server-v0.3.4-76dd449d4d-htqkn: {default-scheduler } FailedScheduling: 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
I1104 23:29:42.760] panic: Get https://10.0.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.760] 
I1104 23:29:42.760] goroutine 1 [running]:
I1104 23:29:42.760] main.main()
I1104 23:29:42.761] 	/go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b
I1104 23:29:42.761] 
I1104 23:29:42.761] ENDLOG for container kube-system:metrics-server-v0.3.4-76dd449d4d-htqkn:metrics-server
I1104 23:29:42.896] Nov  4 23:29:42.895: INFO: Logs of kube-system/metrics-server-v0.3.4-76dd449d4d-htqkn:metrics-server-nanny on node test-86ab064463-minion-group-nt8z
I1104 23:29:42.897] Nov  4 23:29:42.895: INFO:  : STARTLOG
I1104 23:29:42.897] ERROR: logging before flag.Parse: I1104 23:18:17.017541       1 pod_nanny.go:65] Invoked by [/pod_nanny --config-dir=/etc/config --cpu=40m --extra-cpu=0.5m --memory=40Mi --extra-memory=4Mi --threshold=5 --deployment=metrics-server-v0.3.4 --container=metrics-server --poll-period=300000 --estimator=exponential --minClusterSize=16]
I1104 23:29:42.899] ERROR: logging before flag.Parse: I1104 23:18:17.017691       1 pod_nanny.go:81] Watching namespace: kube-system, pod: metrics-server-v0.3.4-76dd449d4d-htqkn, container: metrics-server.
I1104 23:29:42.899] ERROR: logging before flag.Parse: I1104 23:18:17.017707       1 pod_nanny.go:82] storage: MISSING, extra_storage: 0Gi
I1104 23:29:42.899] ERROR: logging before flag.Parse: I1104 23:18:17.022317       1 pod_nanny.go:109] cpu: 40m, extra_cpu: 0.5m, memory: 40Mi, extra_memory: 4Mi
I1104 23:29:42.900] ERROR: logging before flag.Parse: I1104 23:18:17.022351       1 pod_nanny.go:138] Resources: [{Base:{i:{value:40 scale:-3} d:{Dec:<nil>} s:40m Format:DecimalSI} ExtraPerNode:{i:{value:5 scale:-4} d:{Dec:<nil>} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:41943040 scale:0} d:{Dec:<nil>} s: Format:BinarySI} ExtraPerNode:{i:{value:4194304 scale:0} d:{Dec:<nil>} s:4Mi Format:BinarySI} Name:memory}]
I1104 23:29:42.901] ERROR: logging before flag.Parse: E1104 23:18:47.022952       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.901] ERROR: logging before flag.Parse: E1104 23:19:27.023622       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.902] ERROR: logging before flag.Parse: E1104 23:20:07.024440       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.902] ERROR: logging before flag.Parse: E1104 23:20:47.025220       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.902] ERROR: logging before flag.Parse: E1104 23:21:27.026231       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.903] ERROR: logging before flag.Parse: E1104 23:22:07.116361       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.903] ERROR: logging before flag.Parse: E1104 23:22:47.117376       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.903] ERROR: logging before flag.Parse: E1104 23:23:27.216727       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.903] ERROR: logging before flag.Parse: E1104 23:24:07.316309       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.904] ERROR: logging before flag.Parse: E1104 23:24:47.317087       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.904] ERROR: logging before flag.Parse: E1104 23:25:27.317833       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.904] ERROR: logging before flag.Parse: E1104 23:26:07.318587       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.905] ERROR: logging before flag.Parse: E1104 23:26:47.416466       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.905] ERROR: logging before flag.Parse: E1104 23:27:27.516207       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.905] ERROR: logging before flag.Parse: E1104 23:28:07.516999       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.905] ERROR: logging before flag.Parse: E1104 23:28:47.517814       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.906] ERROR: logging before flag.Parse: E1104 23:29:27.518607       1 nanny_lib.go:87] Get https://10.0.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I1104 23:29:42.906] 
I1104 23:29:42.908] ENDLOG for container kube-system:metrics-server-v0.3.4-76dd449d4d-htqkn:metrics-server-nanny
I1104 23:29:43.008] Nov  4 23:29:42.990: FAIL: Error waiting for all pods to be running and ready: 3 / 27 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s

Anything else we need to know:
The test is consistently red, and the failure logs consistently show the same pattern.

The 1.16 branch had three cherry picks in the failure range:

/sig apps
/sig storage

  • #84175: This didn't look to me like it should be related, but @jsafrane, can you review this PR against the failure please? It also landed on 1.14 and 1.15 and did not fail there.

/sig scheduling
/sig api-machinery

  • #84465: Again, this didn't look to me like it should be related, and @yastij has reviewed the error log and doesn't believe this small scheduling change would have caused this issue. Of the three commits on 1.16 in the range, this is the only one that landed on 1.16 and not also on 1.15 and 1.14.

/sig api-machinery
/sig instrumentation

  • #84215: Again, this doesn't look like it should be related, but @RainbowMango @logicalhan, can you review this PR against the failure please? It also landed on 1.14 and 1.15 and did not fail there.

/sig testing
There have also been a lot of test-infra changes in the bf6674209..22ed1adda range where the failure cropped up. I'm not seeing any that clearly stand out as related and would value @kubernetes/sig-testing-pr-reviews double-checking.

I'm labelling this urgent as we're aiming for a 1.16.3 release from this branch on Nov. 13, 2019 and want this release-blocking test failure clearly understood or fixed ahead of that.
/priority critical-urgent
/area release-eng
/cc @kubernetes/patch-release-team

@tpepper (Contributor, Author) commented Nov 8, 2019

The #84215 metrics change has been checked by @deads2k, who reports it looks pretty unlikely to be related.

@yastij (Member) commented Nov 8, 2019

From the logs:

Nov  8 10:06:39.297: INFO: At 2019-11-08 09:55:40 +0000 UTC - event for kube-dns-autoscaler-584b9b9fff-b268w: {default-scheduler } FailedScheduling: 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

Events are being emitted here, so #84465 shouldn't have an impact. cc @liggitt @wojtek-t

@liggitt (Member) commented Nov 8, 2019

the last successful run of that job was v1.15.6-beta.0.14+0f71449be978a5 - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures/1191462599863046146

the first failed run of that job was v1.16.3-beta.0.36+3d9fbfdd606aaa - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures/1191493075101290503

Is 11/4 when we cut the 1.17 branch and when the ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures job switched which branch it runs against?

@liggitt (Member) commented Nov 8, 2019

I see this in the logs of the failed 1.16 run:

kube-apiserver:

--runtime-config="batch/v2alpha1=true,extensions/v1beta1=,scheduling.k8s.io/v1alpha1=,settings.k8s.io/v1alpha1="true"

kube-proxy:

E1104 23:19:25.460338 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1alpha1.EndpointSlice: the server could not find the requested resource (get endpointslices.discovery.k8s.io)
I1104 23:19:26.460581 1 reflector.go:158] Listing and watching *v1alpha1.EndpointSlice from k8s.io/client-go/informers/factory.go:134

Enabling the endpoint slice feature without enabling the corresponding API isn't going to work well.
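
As a rough illustration (not taken from the job's actual configuration), serving the alpha EndpointSlice API alongside its feature gate would require kube-apiserver flags along the lines of:

--feature-gates=EndpointSlice=true
--runtime-config=discovery.k8s.io/v1alpha1=true

With only the feature gate on, kube-proxy tries to list discovery.k8s.io/v1alpha1 EndpointSlices and gets "the server could not find the requested resource", which matches the kube-proxy log above.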

I think the test config for these jobs is at https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/generated/generated.yaml

ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures, ci-kubernetes-e2e-gce-cos-k8sstable2-alphafeatures, and ci-kubernetes-e2e-gce-cos-k8sstable3-alphafeatures enable all alpha features but do not enable all APIs.

ci-kubernetes-e2e-gce-cos-k8sbeta-alphafeatures enables all alpha features and all APIs, as I would expect.
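
As a sketch of that distinction (the authoritative configuration is in the generated.yaml linked above), the stable{1,2,3} jobs effectively bring up the apiserver with only the feature gates:

--feature-gates=AllAlpha=true

while the beta job pairs that with the matching API enablement:

--feature-gates=AllAlpha=true
--runtime-config=api/all=true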

@liggitt (Member) commented Nov 8, 2019

looks like https://github.com/kubernetes/test-infra/pull/15023/files#diff-b2787d91828c393fb737a765222c471aR4012-R4013 moved that job into the 1.16-blocking dashboard as of 10/30

That said, the job references --extract=ci/k8s-stable1, and I'm not sure how to track which release that symbol refers to and when it changed from meaning "1.15.x" to "1.16.x"

@liggitt (Member) commented Nov 8, 2019

Opened kubernetes/test-infra#15193 to at least fix the API enablement issue.

I still would like to know which alpha job was running against the 1.16 branch prior to 11/4

@liggitt (Member) commented Nov 8, 2019

Notably, that means that between 10/30 (when ci-kubernetes-e2e-gce-cos-k8sbeta-alphafeatures stopped running against 1.16.x) and 11/4 (when ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures started running against 1.16.x), we didn't have coverage of the 1.16 branch. I would have expected the jobs that refer to beta/stable1/stable2/stable3 to all switch at the same time.

@liggitt (Member) commented Nov 8, 2019

Found one more difference between the beta job and the stable{1,2,3} jobs; opened kubernetes/test-infra#15196.

@alejandrox1 (Contributor) commented Nov 8, 2019

> Notably, that means that between 10/30 (when ci-kubernetes-e2e-gce-cos-k8sbeta-alphafeatures stopped running against 1.16.x) and 11/4 (when ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures started running against 1.16.x), we didn't have coverage of the 1.16 branch. I would have expected the jobs that refer to beta/stable1/stable2/stable3 to all switch at the same time.

/cc @justaugustus @cpanato

@liggitt closed this Nov 8, 2019
@justaugustus added this to Done (1.17) in Release Engineering on Nov 11, 2019