
clarify kubelet upgrade process #12326

Closed
1 task
Tracked by #12329
liggitt opened this issue Jan 22, 2019 · 24 comments · Fixed by #26098
Labels: kind/feature, language/en, priority/backlog, sig/node, sig/storage, triage/accepted, wg/lts

Comments

liggitt (Member) commented Jan 22, 2019

Follow up from #11060, tracked in #12329

Upgrade process for kubelet is not sufficiently clear in user-facing documentation:

  • add details to kubelet upgrade procedure (whether drain is required, whether skip-level kubelet upgrades are supported, etc) @kubernetes/sig-node-pr-reviews @kubernetes/sig-storage-pr-reviews

Page to Update:
https://kubernetes.io/docs/setup/version-skew-policy/

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jan 22, 2019
roberthbailey (Contributor)

Is this about in-place updates of the kubelet? Do the kubeadm upgrade tests cover that scenario? The kube-up GCE upgrade tests just replace machines running the old kubelet with machines running a newer one, bypassing the in-place upgrade questions here.

Until we have testing for in-place upgrades, the conservative answer is that the upgrade process for a kubelet is to provision a new machine with the desired kubelet version.

liggitt (Member, Author) commented Feb 6, 2019

Is this about in-place updates of the kubelet?

Yes

Do the kubeadm upgrade tests cover that scenario?

I don't know. @kubernetes/sig-cluster-lifecycle, @kubernetes/sig-testing?

neolit123 (Member) commented Feb 6, 2019

Do the kubeadm upgrade tests cover that scenario?

Yes, but our tests are failing due to problems with the upgrade framework in k/k (e.g. ginkgo skipping), and possibly other reasons too. The current tests are largely unmaintained, and there are plans to replace them with something else next cycle, hopefully.

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade-1-13/#drain-control-plane-and-worker-nodes

We do recommend draining in our 1.12 -> 1.13 upgrade process.
Skipping minor versions is documented as unsupported.

liggitt (Member, Author) commented Feb 6, 2019

cordon -> drain -> upgrade kubelet -> uncordon is the informal guidance I've seen until now. There may be other deployment-specific reasons to destroy and rebuild nodes (node-level improvements that only take effect for new nodes, etc), but from the kubelet's perspective, drain has been sufficient, as far as I know.
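The cordon -> drain -> upgrade kubelet -> uncordon guidance can be sketched as shell commands. This is a minimal sketch, assuming a systemd-managed kubelet; the node name `node-1` is a placeholder and the drain flags shown are common choices, not mandated by this thread. The commands are printed rather than executed, since they require a live cluster:

```shell
# Placeholder node name; substitute your own node.
NODE="node-1"

# Build the per-node upgrade plan as text (these commands need a live
# cluster and node access, so they are only printed here).
UPGRADE_PLAN=$(cat <<EOF
kubectl cordon ${NODE}
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data
# upgrade the kubelet package on ${NODE} (distro-specific), then:
sudo systemctl restart kubelet
kubectl uncordon ${NODE}
EOF
)

echo "$UPGRADE_PLAN"
```

Running the printed commands in order takes the node out of service before the kubelet binary changes, which is the conservative flow discussed here.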

imkin commented Mar 15, 2019

/wg lts

@k8s-ci-robot k8s-ci-robot added the wg/lts Categorizes an issue or PR as relevant to WG LTS. label Mar 15, 2019
sftim (Contributor) commented Jun 4, 2019

/language en

@k8s-ci-robot k8s-ci-robot added the language/en Issues or PRs related to English language label Jun 4, 2019
fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 2, 2019
sftim (Contributor) commented Sep 10, 2019

/kind feature
/priority backlog

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 10, 2019
bowei (Member) commented Sep 10, 2019

cc: @freehan

sftim (Contributor) commented Sep 10, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2019
liggitt (Member, Author) commented Oct 28, 2019

There is evidence of users upgrading kubelets between minor versions without draining pods (kubernetes/kubernetes#84443). If draining is required, the kubelet upgrade docs need to make that explicit.

BenTheElder (Member)

Is it required? Do we have a formal decision on that?

dlipovetsky (Contributor)

Please let us also address whether drain is required when upgrading between patch versions, e.g. 1.17.0 to 1.17.1. These upgrades are arguably more frequent than upgrades between minor versions, so users have a greater incentive to skip the drain.

neolit123 (Member)

With respect to workload stability, the process is the same for patch releases, so a drain would be required there too. One potential difference from minor updates is that the kubelet's CPU checkpoint format is not supposed to change between patch releases.

fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2020
detiber (Member) commented Jun 17, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 17, 2020
liggitt (Member, Author) commented Sep 9, 2020

/assign @derekwaynecarr @dchen1107

Routing to SIG Node leads. If we require draining nodes before a minor version upgrade or reconfiguration (and as far as I can tell, we do), that needs to be made explicit.

sjenning (Contributor) commented Oct 6, 2020

In my opinion, a cordon -> drain -> upgrade -> uncordon path is the safest thing to document for all situations. We should be able to do patch-level upgrades (z-stream, the z in x.y.z) without draining, but, in my view, there is no point in complicating the guidance.

dchen1107 (Member)

cordon -> drain -> upgrade kubelet -> uncordon is the only path supported by SIG Node today. In the past there were efforts at in-place kubelet upgrades, including the containerized kubelet from the CoreOS team, to simplify the upgrade flow, but none of them is officially supported by the community.

Let's make the upgrade flow explicit in the doc for now, while remaining open to the enhancement.

dlipovetsky (Contributor)

Now that it's official, I'll work up a docs PR 🙂

sftim (Contributor) commented Oct 8, 2020

/triage accepted

fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2021
sftim (Contributor) commented Jan 6, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2021
liggitt (Member, Author) commented Jan 14, 2021

cordon -> drain -> upgrade kubelet -> uncordon is the only path supported by SIG Node today

Opened #26098 to update the doc.

The cluster-upgrade doc already includes this information:

For each node in your cluster, drain
that node and then either replace it with a new node that uses the {{< skew latestVersion >}}
kubelet, or upgrade the kubelet on that node and bring the node back into service.
