[sig-cluster-lifecycle] Reboot [Disruptive] [Feature:Reboot] each node by ordering unclean reboot and ensure they function upon restart #69786

Closed
mortent opened this Issue Oct 14, 2018 · 18 comments

mortent commented Oct 14, 2018

Reboot tests have been failing since 10/12

Dashboard: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gke-cos-master-reboot

Sample Failure: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-reboot/18153

This test failure will block the 1.13 beta, so let's fix it before then. Thanks!

/sig cluster-lifecycle
/kind failing-test
/priority important-soon

jberkus commented Oct 15, 2018

Note: this failure is similar to #63612 but is not the same failure, according to @yujuhong

This is a consistent failure rather than a flake, and all nodes are failing to reboot.

It's also been failing consistently since 10/11.

/cc @msau42

msau42 commented Oct 15, 2018

Copying my notes here:

Looking at one recent test run:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-reboot/18231/build-log.txt

The test failed because kubelet never became "not ready":

I1015 19:33:01.392] Oct 15 19:33:01.390: INFO: Waiting up to 2m0s for node gke-e2e-18231-344e3-default-pool-2b8d22c7-njwh condition Ready to be false
I1015 19:33:01.393] Oct 15 19:33:01.393: INFO: Condition Ready of node gke-e2e-18231-344e3-default-pool-2b8d22c7-njwh is true instead of false. Reason: KubeletReady, message: kubelet is posting ready status. AppArmor enabled
...
I1015 19:34:59.825] Oct 15 19:34:59.825: INFO: Condition Ready of node gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr is true instead of false. Reason: KubeletReady, message: kubelet is posting ready status. AppArmor enabled
I1015 19:34:59.826] Oct 15 19:34:59.825: INFO: Condition Ready of node gke-e2e-18231-344e3-default-pool-2b8d22c7-q5g8 is true instead of false. Reason: KubeletReady, message: kubelet is posting ready status. AppArmor enabled
I1015 19:34:59.826] Oct 15 19:34:59.825: INFO: Condition Ready of node gke-e2e-18231-344e3-default-pool-2b8d22c7-njwh is true instead of false. Reason: KubeletReady, message: kubelet is posting ready status. AppArmor enabled
I1015 19:35:01.825] Oct 15 19:35:01.825: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr didn't reach desired Ready condition status (false) within 2m0s
I1015 19:35:01.826] Oct 15 19:35:01.825: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-q5g8 didn't reach desired Ready condition status (false) within 2m0s
I1015 19:35:01.830] Oct 15 19:35:01.826: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-njwh didn't reach desired Ready condition status (false) within 2m0s
I1015 19:35:01.830] Oct 15 19:35:01.826: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr failed reboot test.
I1015 19:35:01.830] Oct 15 19:35:01.826: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-njwh failed reboot test.
I1015 19:35:01.830] Oct 15 19:35:01.826: INFO: Node gke-e2e-18231-344e3-default-pool-2b8d22c7-q5g8 failed reboot test.
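
For reference, here is a minimal illustrative sketch (not the actual e2e helper; the name waitForNodeNotReady and the 2s poll interval are assumptions) of the kind of poll behind the "Waiting up to 2m0s for node ... condition Ready to be false" lines above, using the pre-1.18 client-go Get signature that matches this era:

package reboot

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeNotReady polls a node until its Ready condition turns False, or
// fails with wait.ErrWaitTimeout after the given timeout (2m0s in these runs).
func waitForNodeNotReady(cs kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		node, err := cs.CoreV1().Nodes().Get(name, metav1.GetOptions{}) // pre-1.18 client-go signature
		if err != nil {
			return false, err
		}
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady {
				// In the failing runs this stayed ConditionTrue for the whole
				// window, so the poll timed out and the node "failed" the test.
				return cond.Status == v1.ConditionFalse, nil
			}
		}
		return false, nil
	})
}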

When looking at one of the kubelet logs, I do see that kubelet had problems making Node updates to the api server:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-reboot/18231/artifacts/gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr/kubelet.log

E1015 19:33:26.150380    1303 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr": Get https://35.202.63.115/api/v1/nodes/gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr?resourceVersion=0&timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I1015 19:33:29.552319    1303 cloud_request_manager.go:113] Node addresses from cloud provider for node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr" not collected
E1015 19:33:36.150912    1303 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr": Get https://35.202.63.115/api/v1/nodes/gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I1015 19:33:41.553161    1303 cloud_request_manager.go:113] Node addresses from cloud provider for node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr" not collected
E1015 19:33:46.152035    1303 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr": Get https://35.202.63.115/api/v1/nodes/gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I1015 19:33:53.554253    1303 cloud_request_manager.go:113] Node addresses from cloud provider for node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr" not collected
E1015 19:33:56.152555    1303 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr": Get https://35.202.63.115/api/v1/nodes/gke-e2e-18231-344e3-default-pool-2b8d22c7-01rr?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Errors continue until kubelet is restarted 8 minutes later (the test times out at 2 minutes):

Stopping Kubernetes kubelet...
Stopped Kubernetes kubelet.
Started Kubernetes kubelet.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --max-pods has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I1015 19:41:30.505574    1169 flags.go:33] FLAG: --address="0.0.0.0"

msau42 commented Oct 15, 2018

So it seems like even though kubelet is unable to update its Node status, the Node is still showing up as Ready

jberkus commented Oct 15, 2018

Adding sig-node to this discussion:

/sig node

msau42 commented Oct 15, 2018

I'm trying to repro on a dev 1.11/1.12 cluster so we can get master logs for the node controller

msau42 commented Oct 15, 2018

AishSundar commented Oct 16, 2018

/cc @yujuhong

adding @yujuhong from sig-node

timothysc commented Oct 16, 2018

/assign @roberthbailey

yujuhong commented Oct 16, 2018

88e50c9...23a8477

#69241 was the only PR that went in during that time (if we don't count infra or gke changes).
This PR touches the node controller and changes the node heartbeat, so it could be the culprit.

I was able to grab the logs from the master while it was still up.
A snippet from the kube-controller-manager.log showed a lot of errors:

E1016 00:06:50.369424       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:51.370891       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:52.372252       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:53.375598       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:54.377153       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:55.378642       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:56.380080       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:57.381627       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:58.383074       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:06:59.384675       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:00.386113       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:01.387608       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:02.389167       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:03.449532       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:04.451502       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:05.452984       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:06.454406       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:07.455937       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)
E1016 00:07:08.459592       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.Lease: the server could not find the requested resource (get leases.coordination.k8s.io)

The PR author @wangzhen127 is out for a couple of days, so he cannot investigate. I can look at it later today maybe. If anyone from GKE wants to help, I can send you all the logs.
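
Those reflector errors indicate the apiserver is not serving the coordination.k8s.io/v1beta1 group at all. As a rough illustration (a standalone sketch, not part of the test suite; the kubeconfig path is an assumption), the same condition can be checked with client-go's discovery client:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// If the group/version is not served, this returns an error much like the
	// "could not find the requested resource" messages in the controller log.
	if _, err := dc.ServerResourcesForGroupVersion("coordination.k8s.io/v1beta1"); err != nil {
		fmt.Println("coordination.k8s.io/v1beta1 is not served:", err)
		return
	}
	fmt.Println("coordination.k8s.io/v1beta1 is served; Lease objects can be listed")
}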

yujuhong commented Oct 16, 2018

The node controller doesn't seem to be working at all as it wasn't marking the nodes not ready.

@msau42 suggested that the API group may not have been whitelisted in GKE, so I checked and that seemed to be true. We can whitelist it, but an alpha feature should be properly gated and not affect production.

We can revert PR #69241 to unblock release, and wait for @wangzhen127 to re-submit a patched PR.

/cc @wojtek-t

wangzhen127 commented Oct 16, 2018

Thanks for debugging this! Just got time to take a quick look. Does it mean creating the lease lister should also be put behind the feature gate?

wojtek-t commented Oct 17, 2018

I think reverting is a good option to unblock stuff.

@wangzhen127 - creating the lister may not be necessary, but setting HasSynced is (so if we have to put it behind the feature gate, we should put both).
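
To make that concrete, here is a minimal sketch of the shape being discussed, not the actual patch in #69933: it assumes the gate is the NodeLease feature gate and uses a cut-down stand-in for the node lifecycle controller. The lister and the HasSynced function are only wired up when the gate is on; otherwise the lease cache is reported as synced, so the controller never blocks waiting on an API group the server may not serve.

package nodelifecycle

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	coordinformers "k8s.io/client-go/informers/coordination/v1beta1"
	coordlisters "k8s.io/client-go/listers/coordination/v1beta1"
	"k8s.io/kubernetes/pkg/features"
)

// Controller is a cut-down stand-in for the node lifecycle controller.
type Controller struct {
	leaseLister         coordlisters.LeaseLister
	leaseInformerSynced func() bool
}

// setupLeaseInformer wires up the Lease informer only when the NodeLease gate is on.
func (nc *Controller) setupLeaseInformer(leaseInformer coordinformers.LeaseInformer) {
	if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
		nc.leaseLister = leaseInformer.Lister()
		nc.leaseInformerSynced = leaseInformer.Informer().HasSynced
	} else {
		// With the gate off, skip the lister and treat the lease cache as always
		// synced, so clusters without coordination.k8s.io are unaffected.
		nc.leaseInformerSynced = func() bool { return true }
	}
}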

wangzhen127 commented Oct 17, 2018

Is there any guideline on how to revert my original PR #69241? I haven't done it before. Just revert in my fork and submit a new PR?

I also just put up a fix PR #69933. I am fine with either submitting the fix PR or reverting my original PR first.

wangzhen127 commented Oct 17, 2018

/reopen
/assign @wangzhen127
I will manually close this after verifying the bots.

k8s-ci-robot commented Oct 17, 2018

@wangzhen127: Reopening this issue.

In response to this:

/reopen
/assign @wangzhen127
I will manually close this after verifying the bots.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

AishSundar commented Oct 17, 2018

This is passing in the latest test run. Thanks @wangzhen127 for the quick turnaround

wangzhen127 commented Oct 17, 2018

Nice! Closing this now.
/close

k8s-ci-robot commented Oct 17, 2018

@wangzhen127: Closing this issue.

In response to this:

Nice! Closing this now.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
