Upgrade k8s-infra prow build clusters from v1.14 to v1.15 #1120
We're sitting on 1.14.10-gke.42 for the control plane, and 1.14.10-gke.37 for the main node pool
We're still sitting on 1.14. I am waiting until we've drained the bulk of outstanding v1.20 PRs (ref: https://groups.google.com/g/kubernetes-dev/c/YXGBa6pxLzo/discussion) before explicitly triggering this. There's still a chance it'll happen when we're not watching, though.
Upgrading k8s-infra-prow-build's control plane from 1.14.10-gke.42 to 1.15.12-gke.17
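For reference, a control-plane-only upgrade like this is roughly the following (a sketch; the `--region` value is a placeholder, not taken from this thread):

```sh
# Upgrade only the control plane (master) to the target version;
# node pools are upgraded separately
gcloud container clusters upgrade k8s-infra-prow-build \
  --master \
  --cluster-version 1.15.12-gke.17 \
  --region us-central1
```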
Control plane upgraded, next the greenhouse nodepool. This may disrupt bazel-based jobs, though they should fall back to not using the cache when it's unavailable.
Greenhouse nodepool upgraded. Waiting until kubernetes/test-infra#19182 (comment) is resolved before proceeding with the main nodepool
/assign
Upgrading k8s-infra-prow-build's default node pool from 1.14.10-gke.42 to 1.15.12-gke.17
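The in-place node pool upgrade is the same command pointed at a pool instead of the master (a sketch; the pool name and region are placeholders):

```sh
# Upgrade the nodes in one pool; GKE drains and recreates them one at a time
gcloud container clusters upgrade k8s-infra-prow-build \
  --node-pool pool1 \
  --cluster-version 1.15.12-gke.17 \
  --region us-central1
```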
29/41 nodes done
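A progress count like that can be pulled straight from kubectl, which reports each node's kubelet version in the VERSION column:

```sh
# Nodes already on the new version...
kubectl get nodes --no-headers | grep -c 'v1.15.12-gke.17'
# ...out of the total
kubectl get nodes --no-headers | wc -l
```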
And everything is at 1.15.12-gke.17 after ~4 hours ... so I think that node-by-node upgrade and cluster-autoscaling aren't the best of friends. I suspect this would have gone more quickly if we had spun up an entirely new nodepool that was on 1.15.12 to begin with, and cordoned the old nodepool. I would like to move up to v1.16 next, but I think we'll leave it here for the weekend.
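That alternative migration path would look something like this (a sketch; pool names, sizing, and autoscaler bounds are placeholders):

```sh
# 1. Create a replacement pool already at the target version
gcloud container node-pools create pool-v115 \
  --cluster k8s-infra-prow-build \
  --node-version 1.15.12-gke.17 \
  --num-nodes 3 --enable-autoscaling --min-nodes 1 --max-nodes 40

# 2. Stop scheduling onto the old pool, then drain it
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=pool1 -o name); do
  kubectl cordon "$node"
done
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=pool1 -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# 3. Delete the old pool once it is empty
gcloud container node-pools delete pool1 --cluster k8s-infra-prow-build
```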
As a rough guess of "how disruptive was this?" I filtered down to kubernetes/kubernetes jobs on prow.k8s.io. Accepting that some presubmits are just gonna fail (on top of whatever flakiness may be out there), this looks reasonably non-disruptive. Specifically, I'm looking at the left half of the graph (now - 6h ago). Another guess: look at the plank dashboard. No increase in jobs hitting failure state.
/close
@spiffxp: Closing this issue.
/reopen
https://prow.k8s.io/tide-history?repo=kubernetes%2Fkubernetes&branch=master shows tide last issued a TRIGGER action around 1:50pm PT
@spiffxp: Reopened this issue.
Looking at that node e2e job, I get this for the prowjob YAML: https://prow.k8s.io/prowjob?prowjob=a56a5463-f464-11ea-a7c8-9eb8089ce657. Pulling some useful fields from that:
let's go look at the build cluster
let's see, are there other pods stuck in Terminating status?
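One way to check (note that "Terminating" is not an actual pod phase; it's what kubectl prints for pods that have a deletionTimestamp set, so grepping the human-readable output works):

```sh
kubectl get pods --all-namespaces | grep Terminating
```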
Possible mitigations: […]
Also: […]
Definitely think migrating to a new node pool is the upgrade path to use next time
12:26pm: pod create call. What's trying to delete it?
ok, does manually deleting do any better?
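Trying something along these lines (a sketch; the pod name is a placeholder, and test-pods as the job namespace is an assumption):

```sh
# Force removal without waiting for kubelet confirmation
kubectl delete pod <pod-name> -n test-pods --grace-period=0 --force
```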
no
How about that node?
12:37pm: node status is NodeNotReady
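The node's conditions can be dumped directly (the node name is a placeholder):

```sh
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```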
Seems like we're running into kubernetes/kubernetes#72226. Issuing a /test command will create a new pod. Still not clear how to get rid of the pods stuck in Terminating.
Would it be reasonable for these kinds of workloads to go down the […] route?
Issuing that /test command caused the pod to disappear. Last entry in the log for that pod:
So then I manually edited the finalizer for another pod:
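That edit amounts to clearing the pod's finalizer list, which lets the API server finish the deletion. As a sketch (pod name is a placeholder, test-pods namespace is an assumption):

```sh
kubectl patch pod <pod-name> -n test-pods \
  --type merge -p '{"metadata":{"finalizers":null}}'
```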
@alejandrox1 I tried that (see #1120 (comment)) and it didn't delete
Everything that had a deletionTimestamp was hung (this was intended for a markdown table, but the formatting looked worse, so fixed-width it is):
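A listing like that can be produced with kubectl and jq (a sketch; the test-pods namespace is an assumption):

```sh
# List pods that have a deletionTimestamp, plus the node they were on
kubectl get pods -n test-pods -o json | jq -r \
  '.items[] | select(.metadata.deletionTimestamp != null)
   | [.metadata.name, .spec.nodeName, .metadata.deletionTimestamp] | @tsv'
```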
Patched in empty finalizers for everything that had a deletionTimestamp.
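As a sketch, that bulk patch could look like the following loop (same assumptions as above; this is a reconstruction, not the exact command used):

```sh
# For every pod with a deletionTimestamp, clear its finalizers
for pod in $(kubectl get pods -n test-pods -o json \
    | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'); do
  kubectl patch pod "$pod" -n test-pods --type merge -p '{"metadata":{"finalizers":null}}'
done
```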
Nothing stuck in terminating anymore
Looks like most of the pods that were stuck in Terminating are now running. Calling it a night, we'll see if prow/tide end up picking up their results later.
(also definitely cordoning and migrating to a new pool next time)
/close
@spiffxp: Closing this issue.
v1.14 deprecation was announced here: https://cloud.google.com/kubernetes-engine/docs/release-notes#coming-soon-20200722
The clusters should have been automatically upgraded: https://cloud.google.com/kubernetes-engine/docs/release-notes#scheduled_automatic_upgrades
This issue is to confirm whether they have been, and if not, to initiate such an upgrade.
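Confirming could be as simple as checking the cluster's reported versions (a sketch; the region is a placeholder):

```sh
# Report current control-plane and node versions for the cluster
gcloud container clusters describe k8s-infra-prow-build --region us-central1 \
  --format='value(currentMasterVersion,currentNodeVersion)'
```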
/area prow
/wg k8s-infra
/sig testing