
Add Kubernetes Yearly Support Period KEP #1497

Open · wants to merge 7 commits into base: master from youngnick:one-year-support-window

Conversation

@youngnick commented Jan 22, 2020

This KEP is about implementing the recommendations from WG-LTS to extend Kubernetes' support period to one year. It does not require code changes, but we (WG-LTS) would very much like it to begin with Kubernetes 1.19 (originally 1.18).

@youngnick (Author) commented Jan 22, 2020

/assign wg-lts

@k8s-ci-robot (Contributor) commented Jan 22, 2020

@youngnick: GitHub didn't allow me to assign the following users: wg-lts.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign wg-lts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from bgrant0607 and mattfarina Jan 22, 2020
@youngnick (Author) commented Jan 22, 2020

/assign @tpepper
/assign @liggitt

@youngnick (Author) commented Jan 22, 2020

/sig wg-lts
/sig sig-architecture
/sig sig-testing
/sig sig-release

@k8s-ci-robot (Contributor) commented Jan 22, 2020

@youngnick: The label(s) sig/wg-lts, sig/sig-architecture, sig/sig-testing, sig/sig-release cannot be applied, because the repository doesn't have them

In response to this:

/sig wg-lts
/sig sig-architecture
/sig sig-testing
/sig sig-release

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@youngnick (Author) commented Jan 22, 2020

/assign @youngnick


* Adjust behavior in the kube-apiserver that assumes a component lifetime of ~9 months
* Lifetime of the certificate used for kube-apiserver loopback connections (#86552)
* Expand supported kubelet/apiserver skew to 12 months divided by release cadence minus 1, i.e. currently (12/3)-1=3 releases.
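For concreteness, a minimal sketch of the skew arithmetic in the last bullet (illustrative Go with made-up names, not Kubernetes source code):

```go
package main

import "fmt"

// supportedSkew illustrates the formula above: the number of kubelet/apiserver
// minor releases of skew is (support window / release cadence) - 1, with both
// quantities expressed in months.
func supportedSkew(supportMonths, cadenceMonths int) int {
	return supportMonths/cadenceMonths - 1
}

func main() {
	fmt.Println(supportedSkew(12, 3)) // proposed: (12/3)-1 = 3 releases of skew
	fmt.Println(supportedSkew(9, 3))  // current:  (9/3)-1  = 2 releases of skew
}
```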

@youngnick (Author) Jan 22, 2020

This item in particular needs feedback from SIG-Architecture and SIG-Node.

* Expand supported kubelet/apiserver skew to 12 months divided by release cadence minus 1, i.e. currently (12/3)-1=3 releases.
* Currently, supported skew (2 versions) was chosen to allow the oldest kubelet to work against newest apiserver
* Implementation should include adding a test to cover the added skew variations, making it release-informing, watching for any errors, fixing them, and promoting the test to release-blocking. This is already a gap today and can be treated as a project need orthogonal to this KEP.
* Expand supported kubectl/apiserver skew

@youngnick (Author) Jan 22, 2020

This item also needs feedback from SIG-CLI and SIG-API-Machinery (client libraries)


* Staff patch releases for one more branch for ~3-4 more months
* Feedback from SIG Release received, see [SIG Release meeting discussion](https://docs.google.com/document/d/1Fu6HxXQu8wl6TwloGUEOXVzZ1rwZ72IAhglnaAMCPqA/edit#bookmark=id.mi11nk75iohl).
* Maintain CI for one more branch for ~3-4 more months

@youngnick (Author) Jan 22, 2020

This item needs feedback from SIG-Testing.

@justaugustus (Member) commented Jan 22, 2020

/assign
@youngnick -- please add me as a reviewer here and an approver on the KEP.

@justaugustus justaugustus added this to In progress in SIG PM via automation Jan 22, 2020
@justaugustus justaugustus added this to In progress in SIG Release via automation Jan 22, 2020
@justaugustus justaugustus added this to In progress in Release Team via automation Jan 22, 2020
@justaugustus justaugustus added this to In progress in Release Engineering via automation Jan 22, 2020
@justaugustus (Member) commented Jan 22, 2020

@kubernetes/sig-release -- for eyes on and reviews

@youngnick youngnick force-pushed the youngnick:one-year-support-window branch from f3bc212 to 2d03ef5 Jan 23, 2020
@k8s-ci-robot (Contributor) commented Jan 23, 2020

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: youngnick
To complete the pull request process, please assign justaugustus
You can assign the PR to them by writing /assign @justaugustus in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@youngnick youngnick force-pushed the youngnick:one-year-support-window branch 3 times, most recently from 0ffa140 to 9ea5e55 Jan 23, 2020
@youngnick youngnick force-pushed the youngnick:one-year-support-window branch from 9ea5e55 to 3729e4f Jan 23, 2020
@youngnick (Author) commented Jan 23, 2020

I'm also unsure who the reviewers should be on this one; guidance very much welcomed.


### Graduation Criteria

WG LTS proposes this KEP be implemented as of the 1.18.0 release.

@liggitt (Member) Jan 23, 2020

Agree 1.18 is too aggressive to get feedback and work through implications with all participating sigs. If the conversations start now then targeting 1.19 seems possible.

@youngnick (Author) Jan 23, 2020

I've updated the description accordingly.

@youngnick (Author) commented Jan 23, 2020

Also, I'm online for another three hours or so, but then only intermittently online until the 28th of Jan, Sydney time. So yeah, no way is this making any 1.18 release deadline; sorry for any upset caused there.

@BenTheElder (Member) left a comment

Weighing in a little as an interested-but-un-titled SIG Testing / Release / ... participant :-)

I think testing is being a little overlooked here if we are in fact expanding client skew support?

* Currently, supported skew (2 versions) was chosen to allow the oldest kubelet to work against newest apiserver
* Implementation should include adding a test to cover the added skew variations, making it release-informing, watching for any errors, fixing them, and promoting the test to release-blocking. This is already a gap today and can be treated as a project need orthogonal to this KEP.
* Expand supported kubectl/apiserver skew
* Currently supported skew (+/- 1 minor version) theoretically allows using an n-1 kubectl to work against all supported apiservers (n-2/n-1/n).

@BenTheElder (Member) Jan 23, 2020

this matrix could get interesting to test effectively 🙃, any proposals regarding this?

@youngnick (Author) Feb 4, 2020

We talked a bit with @liggitt about this one, so I'll leave it to him to comment.

* Staff patch releases for one more branch for ~3-4 more months
* Feedback from SIG Release received, see [SIG Release meeting discussion](https://docs.google.com/document/d/1Fu6HxXQu8wl6TwloGUEOXVzZ1rwZ72IAhglnaAMCPqA/edit#bookmark=id.mi11nk75iohl).
* Maintain CI for one more branch for ~3-4 more months
* Modified test job and infrastructure patterns may need to be deprecated on a slower schedule, or backwards compatibility with older-style tests in older branches may need to be maintained longer. This is a problem already today (e.g.: bootstrap.py went out of support, but was still in use in release-1.14 branch CI at the end of that branch's lifetime).

@BenTheElder (Member) Jan 23, 2020

bootstrap.py has been "out of support" per a scary error message in the logs for a long time, in an attempt to get people to move off of it, but it is not in fact actually unsupported. It is used in the majority of jobs and we cannot afford to break it currently.

might want a different concrete example

@neolit123 (Member) Feb 4, 2020

possibly better to exclude bootstrap.py as an example and just explain that images and jobs for another branch should be maintained in test-infra.

@youngnick (Author) Feb 4, 2020

I'm not familiar with bootstrap.py, but that sounds like it's really a deprecation warning rather than an out of support thing?

@tpepper mentioned that bootstrap.py caused a problem with the 1.14 release, I think that's where this came from.

Happy to remove, but I'll need a hand with a suggestion for a concrete example.

@BenTheElder (Member) Feb 5, 2020

not familiar with what happened in 1.14, but bootstrap.py should be fully independent of Kubernetes versions. The actual test configs and the test images are version-dependent though.


### Upgrade / Downgrade Strategy

Unchanged from existing project policy.

@BenTheElder (Member) Jan 23, 2020

so I get LTS on some version for a year, but then I need to upgrade three times in a row to catch up? hmm..

@liggitt (Member) Jan 23, 2020

Yes. One problem at a time.

@BenTheElder (Member) Jan 23, 2020

fair enough, but I'm not sure I understand the motivation if I still need to use 4 versions / year with potentially non-trivial upgrades, qualification / certification, etc. If this is intended to be mitigated in the future, that makes sense though, I suppose?

@liggitt (Member) Jan 23, 2020

batching upgrades allows a single rotation of your node pools, a single node certification a year, etc. you can upgrade your control plane to the latest version (currently stepping through each version), then drain and replace your nodes to the latest version in a single step
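A rough sketch of the batching described here (hypothetical helper, not Kubernetes tooling): the control plane steps through each intermediate minor version, while drained node pools rejoin at the target version in a single hop.

```go
package main

import "fmt"

// upgradePath illustrates batched upgrades: the control plane must pass
// through every intermediate minor version, while nodes can be drained and
// replaced at the target version in one step.
func upgradePath(currentMinor, targetMinor int) (controlPlane, nodes []string) {
	for m := currentMinor + 1; m <= targetMinor; m++ {
		controlPlane = append(controlPlane, fmt.Sprintf("1.%d", m))
	}
	nodes = []string{fmt.Sprintf("1.%d", targetMinor)}
	return controlPlane, nodes
}

func main() {
	cp, nodes := upgradePath(16, 19)
	fmt.Println("control plane steps:", cp)   // [1.17 1.18 1.19]
	fmt.Println("node pool rotation:", nodes) // [1.19]
}
```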

@jberkus Feb 4, 2020

You may not understand the motivation (I don't either), but we based this on the fact that some 1/4 to 1/3 of our users are doing exactly this.

@BenTheElder (Member) Feb 5, 2020

to be clear, I'm not intending to block the KEP in any way on this point; I would love to understand what the trade-offs are here though. Batching upgrades is an interesting thought.

@BenTheElder (Member) Feb 5, 2020

[also if we want to change that requirement (I seem to remember some mention of version skew changing?) then we need to call that out as a potentially substantial change]


An increase to the support period of Kubernetes also increases the human maintenance burden on patch management (Release Engineering), test-infra, k8s-infra, sig-pm/release, and regular SIG functioning.
Some of these impacts have already been recognized and sub-projects have been set up to mitigate them.
Still, there will be an unavoidable increased dollar cost for infrastructure resources kept online for 3 additional months.

@BenTheElder (Member) Jan 23, 2020

in addition to the engineering cost of patching critical bugs in more versions.

@youngnick (Author) Feb 4, 2020

Fair enough, I'll add this.

- Change the release cadence. It is currently 3 months per release or 4 releases per year. See “Alternatives” below for related work on release cadence.
- Establish an “LTS” release process in the sense perhaps known from other projects (e.g.: pick one release every year or two, give patch support on that release for multiple years).
- Declare that Kubernetes’ dependencies must also have a similar support lifetime per release.
- Align Kubernetes releases in time with those of its dependencies.

@liggitt (Member) Jan 23, 2020

We do need to consider what we will do if there is a security-related issue in a dependency that is only fixed in versions that are significant/risky to update to in older Kubernetes releases. To be clear, we already have this issue occasionally, but adding another Kubernetes release makes it more likely. Outlining how we currently align with our biggest dependencies (Go, etcd, cAdvisor, cloud provider libraries, etc.) would be helpful in understanding if/how this proposal would impact current alignment.

@youngnick (Author) Feb 4, 2020

I agree that needs to be done, but I'm not sure how to fold it in here.

@jberkus Feb 4, 2020

agreed; but the most we can do is have a clearer policy for this. It's already a problem.


## Proposal

The proposal is to extend the Kubernetes support cadence from 9 months to 14 months.

@timothysc (Member) Feb 3, 2020

simpler is more good-er-er. Can't we keep all policies the same and just do 3 releases a year and split the difference?

@tpepper (Contributor) Feb 4, 2020

That's certainly a possibility. We discussed both speeding up the release cadence (e.g.: monthly? twelve releases a year?) and slowing down the cadence (e.g.: three releases a year spaced four months apart? annual releases? twice-yearly releases?). You proposed a hybrid also. There was no consensus in the community on which of these modes was preferred. More conversation is required.

In the meantime, in order to address the portion of end-user desire for longer support lifetime we chose to focus first on the support side, and leave the cadence for later conversations and KEPs.

@timothysc (Member) Feb 4, 2020

FWIW, (2,3 is >> 4), because we always rush the fourth release due to the holidays anyway.

@BenTheElder (Member) Feb 5, 2020

the 4th release is definitely always smaller, the simplicity of slowing down the minor releases is pretty appealing to me...

@thockin (Member) Feb 18, 2020

Slowing releases has the unfortunate side-effect of accumulating more risk on each release. We have historically not had enough engineering rigor to make that very palatable, IMO. This is what makes bi-modal releases (commonly called tick-tock, but that's not really right) attractive, but that requires a different sort of rigor and different rules per-release. Linux, for example, moved away from this.

I'm not sure I am advocating for anything in particular, but "slowing down" is not necessarily less risky.

@jberkus Feb 18, 2020

Going over our notes, we discussed making releases every 4 months, but there are contributors who want more frequent releases ... including @timothysc. As such, the WG didn't feel that a 3-releases-per-year plan was feasible. If the folks who want more frequent releases have changed perspective, then we might be able to revisit this.

@youngnick youngnick force-pushed the youngnick:one-year-support-window branch from a8b1297 to 964c10b Feb 4, 2020
@youngnick youngnick force-pushed the youngnick:one-year-support-window branch from 964c10b to 888ac59 Feb 4, 2020
* Implementation should include adding a test to cover the added skew variations, making it release-informing, watching for any errors, fixing them, and promoting the test to release-blocking. This is already a gap today and can be treated as a project need orthogonal to this KEP.
* Expand supported kubectl/apiserver skew
* Currently supported skew (+/- 1 minor version) theoretically allows using an n-1 kubectl to work against all supported apiservers (n-2/n-1/n).
* Ideally, the latest kubectl would support speaking to all supported apiservers (+0/-3 minor versions).
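As a rough sketch of what these two windows mean (hypothetical check, not the project's actual version-handling code), the current +/-1 policy and the ideal +0/-3 window differ only in how far the client may lead or trail the server:

```go
package main

import "fmt"

// withinSkew reports whether a kubectl at clientMinor may talk to an
// apiserver at serverMinor, given how many minor versions the client is
// allowed to trail (maxBehind) or lead (maxAhead) the server.
func withinSkew(clientMinor, serverMinor, maxBehind, maxAhead int) bool {
	diff := clientMinor - serverMinor
	return diff >= -maxBehind && diff <= maxAhead
}

func main() {
	// Current policy: +/-1 minor, e.g. a 1.17 kubectl with a 1.18 apiserver.
	fmt.Println(withinSkew(17, 18, 1, 1)) // true
	// Ideal above: kubectl may lead apiservers by up to 3 minors (+0/-3).
	fmt.Println(withinSkew(18, 15, 0, 3)) // true
	fmt.Println(withinSkew(18, 14, 0, 3)) // false: apiserver 4 minors older
}
```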

@tallclair (Member) Feb 5, 2020

Alternatively, adopt something like kubectl-dispatcher: https://github.com/GoogleCloudPlatform/kubectl-dispatcher

@thockin thockin self-requested a review Feb 13, 2020
@thockin (Member) left a comment

I wrote comments last week but forgot to send them.


![Kubernetes versions in production](./20200122-versions-in-production.png)

This, and other responses from the survey, suggest that this 30% of users would better be able to keep their deployments on supported versions if the patch support period were extended to 12-14 months.

@thockin (Member) Feb 18, 2020

What makes us believe that it's the non-annual nature of the cycle that results in this, rather than users who simply don't/can't make the time to upgrade. In other words, what prevents us from doing this survey next year and finding that 30% of users are now 15 months (12+3) out of date instead of 12 (9+3)?

@tpepper (Contributor) Feb 18, 2020

We did a lot of thinking, discussing, researching on this, but it's fallible and fuzzy. We know:

  1. Some will never upgrade (can't fix)
  2. Some upgrade only when they must to remain in support (this KEP enables them to delay)
  3. Some businesses do annual upgrades (this KEP helps them): we have concrete reports from end users that this is how they operate and it's not unbelievable that this might be a thing in industry
  4. Some upgrade faster (this KEP doesn't impact them)

As a project we don't have strong incentives in place to get people in bucket 2 to move instead to buckets 3 or 4. We already see people running releases outside of the community support lifetime. The weighting of users running older releases may be increased because of this KEP. Regardless of this KEP, my suspicion is the bigger nudge in that direction will be the project becoming more production ready and having more stable/ga v1 APIs.

In this balance, it is undeniable that as a project we have limited resources. There's clear resistance from our developer base to going to longer and longer support lifetimes, even as users ask for longer and longer, alongside a recognition that annual support is not unrealistic to ask for and to consider delivering. The gap is something vendors can and will address if there is value in addressing it.

The underlying question is for how long should a project like ours offer patch release support, component interoperability, and upgrade-ability?

@tpepper (Contributor) Feb 18, 2020

Also linked in "Alternatives" is a @timothysc discussion document around "bi-modal" or "tick-tock" with monthly/annual modes.

Our thinking was: just as our current three-month cycle is relatively arbitrary, shifting to monthly dev cycles plus a second mode of annual stable cycles is also arbitrary. That's why we left that for separate discussion. Regardless of how the project chooses to deliver stable features (monthly, quarterly, three times a year, annually?), how do end users receive that, and how long can they run it once deployed? There's no one right answer.

@mattfarina (Member) Mar 12, 2020

Two things come to mind on this...

  1. Many people use the version provided by their public cloud. Those versions are typically kept at N-2 (the last supported version in the current scheme). They also often default to one version older than that, i.e. the latest version that is no longer supported. If people follow defaults, or upgrade when the public cloud providers provide it, they will often be out of support or just barely in it. We should understand why cloud providers are so slow to provide newer versions.
  2. There are companies that have certain cycles. For example, I don't expect many shopping businesses to update during the shopping season in the fall. There are other companies that have to deal with tax season. Companies have cycles, and a year appeared to be a good support window. On Helm we looked at something shorter than a year for Helm v2 when v3 came out. We were asked to extend it to accommodate businesses' longer planning and action cycles. If we hold another survey of operators it might be worth asking about their cycles for upgrades and changes.


@thockin (Member) commented Feb 18, 2020

jberkus and others added 2 commits Feb 20, 2020
Add some new test plan steps, and documentation locations.

Delete some duplicate text.
Straighten out documentation and testing sections for the KEP.
@youngnick (Author) commented Feb 20, 2020

@jberkus has made some changes to the wording around the testing; we would appreciate some more eyes on that please, everyone. @BenTheElder, under the "you commented, you own it" rule, I invoke your name. 😂

@mattfarina (Member) left a comment

I like the idea of extending the support window.

Improving the upgrade experience may help alleviate some of the issues here (but not all). With so many not having the option to upgrade more quickly (e.g., using what's provided by public clouds), they are pushed to use an older version. Why aren't public clouds offering newer versions more quickly? How can that be improved?


@BenTheElder (Member) commented Mar 12, 2020

@BenTheElder, under the "you commented, you own it" rule, I invoke your name. 😂

no no no, I own enough things 🙃
when SIGs KEP features, release & testing tell them they must have tests :+) I am not terribly active in wg-lts.

If we're going to change support cycles though, this will include our project infra. It sounds(?) like we're not intending to change the skew policy, which would mean we wouldn't need any new testing there.

@jberkus commented Mar 13, 2020

If we're going to change support cycles though, this will include our project infra. It sounds(?) like we're not intending to change the skew policy, which would mean we wouldn't need any new testing there.

What I put in the proposal was that we are changing the skew policy. It's N-1 now, no? This would require it to be N-2.

Unless I'm not getting the current policy?

@BenTheElder (Member) commented Mar 13, 2020
