Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Next steps after Events API v1beta1: migrate all components; event style guide; pipeline for persistent event storage #76674

Open
shoucongc opened this issue Apr 16, 2019 · 24 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@shoucongc
Copy link

shoucongc commented Apr 16, 2019

Overview:
After kubernetes/enhancements#383 is implemented (#65782), new v1beta1 events API will be ready to use. (Latest approved design doc)
However, to achieve the end goal - "make events useful and safe", we still have couple steps to take.
This is the umbrella tracking issue to cover all next steps we have in our mind now.

Proposed next steps:

  1. Migrate all k8s components to use new events API.

Any k8s components emitting events now need to be migrated to use new event API.
Since old event package lives in "k8s.io/client-go/tools/record". A quick code search finds ~235 places depending on this package. It includes controller-manager, kube-scheduler, kubelet, etc.

  1. Propose a "K8s events style guide", get community adoption and enforce an API-like review process for future major event updates.

The approved design mentioned "all new Events will need to go through API-like review (process will be described in the style guide)", but didn't have concrete proposal. We need to finish this event style guide and establish a review process for event change.
Without this, events are informational and can't be used reliably in prod. Needless to say, they won't be able to be digested and used by downstream monitoring & logging products.

  1. Build events publisher / subscriber infrastructure for downstream consumption.

With this mechanism, people (OSS / cloud vendors) can build persistent event storage, use events as signals for troubleshooting products, etc.

Contributors
Above proposals are based on several offline discussions among @yastij, @shoucongc, and @wojtek-t

@shoucongc shoucongc added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 16, 2019
@yastij
Copy link
Member

yastij commented Apr 16, 2019

/sig scalability
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2019
@shoucongc
Copy link
Author

/assign shoucongc
/assign yastij

@yastij
Copy link
Member

yastij commented Apr 16, 2019

cc @wojtek-t

@shoucongc
Copy link
Author

cc @gmarek @piosz

@liggitt liggitt added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 20, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2019
@wojtek-t
Copy link
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 21, 2019
@jfbai
Copy link
Contributor

jfbai commented Sep 18, 2019

/assign

I'd like to take controller-manager part.

@yastij
Copy link
Member

yastij commented Sep 18, 2019

xref #76674

to track leaderelection migration

@liggitt
Copy link
Member

liggitt commented Sep 23, 2019

should we be switching clients from using stable APIs (v1) to using beta APIs (events/v1beta1)? that means that clusters currently running only stable APIs would no longer work properly, right?

@yastij
Copy link
Member

yastij commented Sep 23, 2019

@liggitt - #49112 was introduced a while ago.
IIRC it introduced conversions, so the rollout/rollback shouldn't introduce any data loss (the v1 API contains all the fields of added in v1beta1). Perhaps you were talking about something else ?

@liggitt
Copy link
Member

liggitt commented Sep 23, 2019

Perhaps you were talking about something else ?

Yes, a cluster that chooses to only enable stable APIs. Currently, the core v1 events API is what components use to report events. I would not expect the core kubernetes components to switch to a less stable API for event reporting (i.e. they should wait until events.k8s.io/v1 is available to switch to it)

@yastij
Copy link
Member

yastij commented Sep 23, 2019

@liggitt - in term of API both are aligned (core/v1 and events.k8s.io/v1beta1). we don't want to switch all the clients. The plan would be to switch the kubelet, move to events.k8s.io to v1, then switch other clients.

Last release #78447 moved the scheduler to use events.k8s.io/v1beta1 which would help us to graduate the API

cc @wojtek-t

@wojtek-t
Copy link
Member

Hmm - I was actually basing it on the KEP; https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20190131-event-series.md but now I realized that's not actually what it is saying.

That said, I also wanted to point out that we already migrated scheduler to new events (in 1.16).

@yastij
Copy link
Member

yastij commented Sep 23, 2019

@liggitt @wojtek-t - we can revisit the KEP to make the kubelet migration gate the GA graduation ?

this would enable us to get a signal early in the release (and we can always revert it if we find issues). Once this is done, we can move to GA then migrate the other clients, thoughts ?

@liggitt
Copy link
Member

liggitt commented Sep 23, 2019

I also wanted to point out that we already migrated scheduler to new events (in 1.16).

We probably should not have done that unconditionally. Moving from a stable API to a require a beta API is not desired. It could be reasonable for components to use the beta API if available (since it is more expressive), falling back to the stable one.

@yastij
Copy link
Member

yastij commented Sep 24, 2019

@liggitt - makes sense. Also note that this API wasn't used by anyone before AFAIK, due to the missing implementation, so I'm not sure anyone disabled it (e.g. via runtime-config).

For the next PRs do you prefer us:

  • to fallback on core/v1 if events.k8s.io/v1beta1 is not available ?
  • add an action required to avoid disabling events.k8s.io/v1beta1 ?

@wojtek-t
Copy link
Member

My personal opinion is that:

  • we should add some action required in 1.16 release notes
  • promote the API to v1 in 1.17

@liggitt - does that sounds reasonable to you?

@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

  • we should add some action required in 1.16 release notes

I'd prefer to fix the v1.16 scheduler to use v1 events if v1beta1 Events are unavailable. How hard would it be to make #78447 conditional and pick that to v1.16?

promote the API to v1 in 1.17

If there's good evidence the API in its current shape meets our needs, that seems reasonable. Note that the scheduler/controllers should not switch to using the new v1 API until the release after the v1 API is available.

@yastij
Copy link
Member

yastij commented Sep 24, 2019

@liggitt - shouldn't be hard, we'd have check in #78447 if events.k8s.io/v1beta1 is served and fallback on core/v1 if not

@wojtek-t
Copy link
Member

@liggitt - shouldn't be hard, we'd have check in #78447 if events.k8s.io/v1beta1 is served and fallback on core/v1 if not

We should built in that check into recorder.

If there's good evidence the API in its current shape meets our needs, that seems reasonable.

I think we believe it is.
@yastij - I think we should start working on it.

Note that the scheduler/controllers should not switch to using the new v1 API until the release after the v1 API is available.

If we create a fallback for not-registered beta API, is it also needed?
And switch the recorder to use new API in 1.18?

@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

We should built in that check into recorder.

Agree.

If we create a fallback for not-registered beta API, is it also needed?

As long as event recorders fall back to core v1 for a single release if the new events.k8s.io/{v1,v1beta1} APIs aren't available, that's sufficient

@yastij
Copy link
Member

yastij commented Sep 24, 2019

@liggitt - should this gate 1.16.1 ?

@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

I would expect a bug open in the 1.16 milestone, and a known issue added to the 1.16 release notes. I'd like to see a fix make 1.16.1, but as long as the work is prioritized, and the issue is well communicated, 1.16.2 seems acceptable.

@liggitt
Copy link
Member

liggitt commented Oct 23, 2019

Moving from a stable API to a require a beta API is not desired. It could be reasonable for components to use the beta API if available (since it is more expressive), falling back to the stable one.

xref kubernetes/enhancements#1332

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Projects
None yet
Development

No branches or pull requests

7 participants