Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PRR questionnaire section to New Event API KEP #1772

Merged
merged 2 commits into from
May 26, 2020

Conversation

chelseychen
Copy link
Contributor

A new PRR questionnaire section was added to the KEP template as a required section in #1620, so this pull request is to add this section to align with that requirement and answer all questions regarding the fact that we're targeting GA graduation in 1.19.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @chelseychen. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 14, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels May 14, 2020
@chelseychen
Copy link
Contributor Author

chelseychen commented May 14, 2020

/assign @wojtek-t

Please take a look and feel free to add new reviewers, thanks!

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a first, high-level pass. I will have more comments further, but those are the high-level ones.

- [x] Other
- Describe the mechanism:

The new Event API can be enabled / disabled by switching to use the new / old Event proto and libs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to separate two things here:
(1) the API itself
(2) the use of the API.

(1) We've never used feature-gates for the API itself. Though, you can enable/disabled specific APIs at kube-apiserver level using --runtime-config flag.
(2) Regarding using the library. We should have added a feature-gate a gate it based on that, but we didn't. My fault.
So for now, there is no way to disable it.
[We only have a fallback, that if the API is disabled, we simply fallback to the old libraries. And we probably should continue with this pattern, given we're already graduating to GA, so featurage-gate will already be locked to true.]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should put something along those lines in the doc. And then answering the questions below is not needed (as it's kind of answered in what I described above).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let me pull in others opinions.

@johnbelamaric - given my explanation above (ignoring for a moment a fact that it was a bad decision and we shouldn't do this that way), I'm willing towards saying that "rollback is not supported". Or should we interpret it as "if you rollback to the release where component X wasn't migrated to use the new API, it will be rolled back".

[FTR - currently only scheduler is migrated to the new API]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things. I think, "rollback supported" is probably a bad term. Using "rollback supported" to describe the ability to disable the feature (i.e, roll back the enablement as described below) without rolling back the version I think is a little confusing. We probably should call it "disable-supported".

In this case, it sounds like you are saying, the way to disable this is not via a feature gate but instead by disabling the API, in which case the library is smart enough to use the prior API. IIUC, this causes no data loss on existing events created with the new API, and presumably we round-trip those events so you'll still be able to query events created with the new API after disabling it, by using the old API. If that's correct, then I would say "rollback supported" is true. It's not feature gate based, but it is possible and works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should call it "disable-supported".

+1

If that's correct, then I would say "rollback supported" is true. It's not feature gate based, but it is possible and works.

Yes - what yo wrote is the right summary.
Fair enough - I think that makes sense to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt - IIRC we discussed a while ago fallbacking to core. once we move move to GA I think we should still fallback for safety. might be worth considering for other APIs (if we're not already) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the use. the event recorder is used by a broad range of clients, many out of tree, with potentially different skew support statements than kube-scheduler/kube-controller-manager, etc.

The broader the use, the more we should accomodate running against a range of server versions, so fallback for event recorder makes sense to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so fallback for event recorder makes sense to me.

Agree (at least for the next couple releases).

Also - currently this logic is purely baked into scheduler:
https://github.com/kubernetes/kubernetes/blob/96e13de777a9eb57f87889072b68ac40467209ac/cmd/kube-scheduler/app/server.go#L281
We should probably make that part of the library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the description of the mechanism, not sure if I got it correct. Let me know if it needs any changes.

Try to be as paranoid as possible - e.g. what if some components will restart
in the middle of rollout?

A rollout could fail if some components restart in the middle of the rollout. Then those components will continue using the old Event API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I don't understand this one - can you clarify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Friendly ping - I think I still don't understand it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking maybe this should also be separated into two cases? (1) The rollout of API won't affect already running workloads; (2) The rollout of the API migration of individual components might fail and impact running workloads. e.g., the component restarts causes the rollout didn't finish, so new Event API didn't take effect. Not sure if I understand this correctly.

@wojtek-t
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2020

* **Will enabling / using this feature result in introducing new API types?**

Yes, a new API type "eventsv1.Event" is being introduced. The migration of Event API will cause creation of new types of Event objects. The number of Event objects depends on cluster state, which theoretically won't be too large due to deduplication logic and reasonable-cardinality of objects in the system.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself - we need better answer here.

@chelseychen chelseychen force-pushed the event branch 5 times, most recently from 309816c to 26fb915 Compare May 19, 2020 21:31
Try to be as paranoid as possible - e.g. what if some components will restart
in the middle of rollout?

A rollout could fail if some components restart in the middle of the rollout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this one - can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replied in your previous comment but you might miss it:
I'm thinking maybe this should also be separated into two cases? (1) The rollout of API won't affect already running workloads; (2) The rollout of the API migration of individual components might fail and impact running workloads. e.g., the component restarts causes the rollout didn't finish, so new Event API didn't take effect. Not sure if I understand this correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with (1).
Re (2) - I still don't fully understand it. So let me maybe rephrase it. Is that what you meant for (2):
The rollout of the API migration of individual components may cause components fail to initialize in case of bug. It will not affect running workloads though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. Thanks for rewording it more clearly. I've updated the answer.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 26, 2020
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels May 26, 2020
@chelseychen chelseychen force-pushed the event branch 2 times, most recently from d36e697 to 9071e38 Compare May 26, 2020 14:30
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 26, 2020
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chelseychen - thanks a lot, two more comments, but I think we're pretty close

Try to be as paranoid as possible - e.g. what if some components will restart
in the middle of rollout?

A rollout could fail if some components restart in the middle of the rollout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with (1).
Re (2) - I still don't fully understand it. So let me maybe rephrase it. Is that what you meant for (2):
The rollout of the API migration of individual components may cause components fail to initialize in case of bug. It will not affect running workloads though.

@wojtek-t
Copy link
Member

/lgtm

/assign @dashpole
For approval from the SIG.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2020
@dashpole
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chelseychen, dashpole

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2020
@k8s-ci-robot k8s-ci-robot merged commit 1b92f28 into kubernetes:master May 26, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants