[sig-cluster-lifecycle] Reboot [Disruptive] [Feature:Reboot] each node by ordering unclean reboot and ensure they function upon restart #69786
Reboot tests have been failing since 10/12
This test failure will block 1.13 beta, so please let's fix it before then. Thanks!
Copying my notes here:
Looking at one recent test run:
The test failed because kubelet never became "not ready":
When looking at one of the kubelet logs, I do see that kubelet had problems making Node updates to the api server:
Errors continue until kubelet is restarted 8 minutes later (the test times out at 2 minutes):
#69241 was the only PR that went in during that time (if we don't count infra or gke changes).
I was able to grab the logs from the master as it was still up.
The PR author @wangzhen127 is out for a couple of days, so he cannot investigate. I can look at it later today maybe. If anyone from GKE wants to help, I can send you all the logs.
The node controller doesn't seem to be working at all as it wasn't marking the nodes not ready.
@msau42 suggested that the API group may not have been whitelisted in GKE, so I checked and that seems to be true. We can whitelist it, but an alpha feature should be properly gated and should not affect production.
@wangzhen127: Reopening this issue.