Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage upgrade mechanism #52185

Open
lavalamp opened this issue Sep 8, 2017 · 29 comments
Open

Storage upgrade mechanism #52185

lavalamp opened this issue Sep 8, 2017 · 29 comments
Assignees
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@lavalamp
Copy link
Member

lavalamp commented Sep 8, 2017

Before we remove an API object version, we need to be certain that all stored objects have been upgraded to a version that will be readable in the future. The old plan for that was to run the cluster/upgrade-storage-objects.sh script after each upgrade. Unfortunately, there are a number of problems with this:

  • The script is old and incomplete, as api authors haven't been consistently adding their API objects to it (no doubt due to a lack of documentation in the right places).
  • Instructions to run the script in the release notes have been spotty.
  • We believe not all distributions have been running the script after upgrades.

Therefore, we need to design a robust solution for this problem, which works in the face of HA installations, user-provided apiservers, rollbacks, patch version releases, etc. Probably this won't look much like a script that you run after an upgrade, instead it may look more like a system where apiservers come to consensus on the desired storage version, plus a controller that does the job that script was supposed to do after a consensus change.

In the mean time, it is not safe to remove API object versions, so we are instating a moratorium on API object version removal until this system is in place. (Deprecations are still fine.)

@lavalamp lavalamp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Sep 8, 2017
@thockin
Copy link
Member

thockin commented Sep 8, 2017

Thanks, Daniel. Good summary. To echo:

it is not safe to remove API object versions, so we are instating a moratorium on API object version removal

Emphasis on removal. It sucks to carry old APIs forward, but this is a liability I think we can't ignore. We've gotten lucky this far.

@kubernetes/sig-network-api-reviews @cmluciano @dcbw @danwinship

@kubernetes/api-approvers @kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Sep 8, 2017
@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 8, 2017

For context the openshift approach is we force a migration before every upgrade starts, and the migration has to complete. Migration is a kubectl cmd that reads all objects and writes a no-op change (which forces storage turn over). We use this for protobuf migration, field defaulting, new versions, storage versions conversion, self link fixup, etc.

It's basically a production version of upgrade-storage-objects.sh - depending on how complicated we want to get, i'd say it's the minimum viable path, while the cluster consensus is probably fine but more complicated.

https://github.com/openshift/origin/blob/master/pkg/oc/admin/migrate/storage/storage.go

@thockin
Copy link
Member

thockin commented Sep 8, 2017

Is this blocking the removal of Alpha APIs, too?

@lavalamp
Copy link
Member Author

lavalamp commented Sep 8, 2017

@smarterclayton Is that an upstream command?

@lavalamp
Copy link
Member Author

lavalamp commented Sep 8, 2017

@thockin I think it depends on how the alpha API is storing its objects.

@smarterclayton
Copy link
Contributor

Downstream, it is a bunch of utility code for a migration framework (we have other migrators for things like RBAC, alpha annotations to beta fields, image references) and then a set of simple commands. I don't know that I think of this as kubectl as much as kubeadm or a separate command for admins (in openshift, oc adm is all admin focused commands and oc is all end user focused, but we don't quite have the equiv in kube)

@lavalamp
Copy link
Member Author

lavalamp commented Sep 8, 2017

@smarterclayton Gotcha. I will take a look at it but that still doesn't sound like an ideal way to solve the problem. Like, it's good if you're going to run one or two of these things with super-well-trained humans doing the upgrade, but Kubernetes isn't installed like that :)

@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 9, 2017 via email

@enj
Copy link
Member

enj commented Sep 13, 2017

I have worked closely with OpenShift's installer/upgrade team on their use of the oc adm migrate storage command. I will say running it automatically pre and post upgrade is the trivial part. The command is surprisingly good at finding random edge cases. It could be turned into a controller of sort, though I am not sure what state the controller would try to drive the system from/to.

@timothysc
Copy link
Member

Per conversation on sig-api-machinery, there was general consensus that it makes more sense to have this as a controllers responsibility with some set of strategies.

@bgrant0607
Copy link
Member

We need to figure out downgrades, also, which are similar. Right now, a downgrade will strand storage of any new APIs that are created.

@liggitt
Copy link
Member

liggitt commented Sep 14, 2017

We need to figure out downgrades, also, which are similar. Right now, a downgrade will strand storage of any new APIs that are created.

Brand new resources will get orphaned on downgrade, but should be inert.

New versions of existing resources shouldn't start persisting with the new version until n+1 versions from when they are released (or rolling HA API upgrades won't work). If we only support single version downgrade, there shouldn't be any issues.

@bgrant0607
Copy link
Member

Related: #4855, #3562, #46073, #23233

@bgrant0607
Copy link
Member

@liggitt Inert resources have some of the timebomb characteristics of annotations discussed in #30819, upon re-upgrade.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@enj
Copy link
Member

enj commented Jan 9, 2018

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 9, 2018
@lavalamp lavalamp self-assigned this Jan 9, 2018
@thockin
Copy link
Member

thockin commented Jan 17, 2018

Any status update here? We'd like to remove the deprecated v1beta1 NetworkPolicy.

@liggitt
Copy link
Member

liggitt commented Jan 17, 2018

I haven't seen a better suggestion than a no-op read/write cycle via the API on the affected resources.

Enforcing that is done seems like the responsibility of the deployment mechanism, so it can be staged after all apiservers in a HA deployment are upgraded.

@liggitt
Copy link
Member

liggitt commented Jan 17, 2018

In the meantime, you can start disabling serving the deprecated object by default. The types still remain, so they can be read from storage, and we still have the in tree debt, but it pushes people to use the new types as new clusters are deployed.

@bgrant0607 bgrant0607 mentioned this issue Jan 22, 2018
4 tasks
@bgrant0607
Copy link
Member

Speaking of HA, today each apiserver has its own independent default storage versions, which are imposed immediately upon apiserver upgrade. That doesn't seem desirable.

@bgrant0607 bgrant0607 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2018
@caesarxuchao
Copy link
Member

@liggitt liggitt added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 14, 2019
@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@athenabot
Copy link

@lavalamp @caesarxuchao
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@caesarxuchao
Copy link
Member

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label May 6, 2019
@dims
Copy link
Member

dims commented Jan 24, 2022

long-term-issue (note to self)

@andrewsykim
Copy link
Member

/assign

@shaneutt
Copy link
Member

This is marked as priority/important-soon but given the last updates, that wouldn't seem to be the case. Can someone catch me up on what the status is here? Is this simply stale since we don't have anyone to move it forward?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 4, 2024
@enj
Copy link
Member

enj commented Apr 5, 2024

/lifecycle frozen

The foundational pieces are being worked on here:

kubernetes/enhancements#2339
kubernetes/enhancements#4192

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 5, 2024
@enj
Copy link
Member

enj commented Apr 5, 2024

The #sig-api-machinery-storageversion-dev slack channel is available as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network.
Projects
None yet
Development

No branches or pull requests