
Redesign Event API #383

Open
gmarek opened this issue Aug 4, 2017 · 142 comments
Assignees
Labels
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status

Comments

@gmarek

gmarek commented Aug 4, 2017

Feature Description

  • One-line feature description (can be used as a release note): Add more structure to Event API and change deduplication logic so Events won't overload the cluster
  • Primary contact (assignee): @gmarek
  • Responsible SIGs: instrumentation
  • KEP: new-event-api-ga-graduation
  • Design proposal link (community repo): design google doc - design discussions in GitHub are too painful for me
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred: @timothysc @wojtek-t
  • Approver (likely from SIG/area to which feature belongs): @bgrant0607 @thockin @countspongebob
  • Feature target (which target equals to which milestone):
    • Beta: 1.8 [done]
    • GA: 1.19 [done]
@idvoretskyi
Member

idvoretskyi commented Aug 9, 2017

@gmarek the feature submission deadline has passed (Aug 1). Please submit a feature exception (https://github.com/kubernetes/features/blob/master/EXCEPTIONS.md) to have this feature included in the 1.8 release.

@idvoretskyi idvoretskyi added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Aug 9, 2017
@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek as you probably saw in the other feature comments, I am trying to understand how some features didn't get into the features repo before the deadline. This is only for the purpose of improving our release process and notifications for next time, not for blaming or pointing fingers. We're also trying to understand if there was prior work done on the feature, or if it was created after the freeze date.

@gmarek
Author

gmarek commented Aug 16, 2017

Yup, there's quite a lot of work done on this, with a (big) design doc shared with kubernetes-dev and an in-depth discussion in SIG Scalability. For this release we'll probably keep it disabled by default, as there's not enough time to let it soak. Is it possible to have a 'quiet' release for things like that? @jdumars

@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek that's an interesting question. My personal opinion is to provide as much transparency as possible, so we maintain a bond of trust with our user community. Since you get to write the release notes, you can add something short there about it. And thanks for clarifying the feature itself.

@countspongebob
Member

countspongebob commented Aug 16, 2017

Personal perspective on this, largely repeating comments I've made before. But, as this is a case in point...

  • SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.
  • There is continued confusion over what is a feature. I echo @jbeda in calling for these to be renamed "efforts". The implication would be 100% coverage, but see my first point.

We had a discussion in SIG Scalability, especially about point #2, with no clear resolution. A few of us lobbied @gmarek to do the feature submission notwithstanding the points above, and he agreed to do so.

@idvoretskyi
Member

idvoretskyi commented Aug 18, 2017

@jdumars @countspongebob @gmarek the main point to discuss here is the formal dates and deadlines, and what happens if someone misses them. We have agreed that the feature freeze for 1.8 (https://github.com/kubernetes/features/blob/master/release-1.8/release-1.8.md) is August 1, so all features have to be submitted to the features repo before this date.

If the people responsible for the release, and the overall community, feel that this deadline is not mandatory, it can be discussed and removed. From our (PM group) standpoint, the feature freeze is necessary from a high-level point of view (including planning of the roadmap, marketing activities, etc.). But if there are reasons why we shouldn't have a feature freeze, again, let's discuss them.

PS. This has been a long-discussed question in the community, even before SIG PM was established. Now might be a good time to solve it.

@idvoretskyi
Member

idvoretskyi commented Aug 18, 2017

@countspongebob

SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.

SIG PM is not empowered, but the release team is. SIG PM is responsible for managing features at a high level, so that we can provide the release team with the clearest and most transparent information about each feature.

@gmarek
Author

gmarek commented Aug 18, 2017

@idvoretskyi - IIUC the exception process is a SIG-PM thing. I haven't heard complaints from the release team about developing features that are not enabled and don't impact current behavior (plus it's highly unlikely this will be finished in the 1.8 timeframe). I'm happy to discuss it as soon as any doubts appear.

Please correct me if I'm wrong - the goal is to track features that will ship in the current release, not the development process that may span multiple releases. If I'm not mistaken, this means that "features" (for lack of a better word) that are disabled and not ready to be enabled don't need to be tracked, right?

@gmarek
Author

gmarek commented Aug 18, 2017

Also note that it's not clear what constitutes a 'feature', or where the border lies between a new feature and an 'improvement' that doesn't need a features repo issue.

Slightly OT, but related to shipping features - it was widely acknowledged that @kubernetes/sig-scalability-misc has the power to block features that cause performance degradation bad enough to make Kubernetes clusters break our performance SLOs (this is of course decided together with the release team). This happens close to the release dates, once scale tests on a given release have finished. I'm saying this to make clear that the features repo can't be treated as a source of truth about the features that will ship in a given release.

@idvoretskyi
Member

idvoretskyi commented Oct 2, 2017

@gmarek any plans to continue development of this item for 1.9?

@idvoretskyi idvoretskyi added this to the next-milestone milestone Oct 2, 2017
@bgrant0607
Member

bgrant0607 commented Oct 2, 2017

@idvoretskyi
Member

idvoretskyi commented Oct 2, 2017

@bgrant0607 perfect. Updating the milestone.

@idvoretskyi idvoretskyi modified the milestones: next-milestone, 1.9 Oct 2, 2017
@gmarek
Author

gmarek commented Oct 3, 2017

PR is also ready for review (not started because of 1.8): kubernetes/kubernetes#49112

@idvoretskyi idvoretskyi added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Nov 13, 2017
@idvoretskyi
Member

idvoretskyi commented Nov 13, 2017

@gmarek can you confirm that it's alpha for 1.9?

@zacharysarah
Contributor

zacharysarah commented Nov 22, 2017

@gmarek 👋 Please indicate in the 1.9 feature tracking board whether this feature needs documentation. If yes, please open a PR and add a link to the tracking spreadsheet. Thanks in advance!

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2021
@tomkerkhove

tomkerkhove commented Oct 13, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2021
@k8s-triage-robot

k8s-triage-robot commented Jan 11, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2022
@wojtek-t
Member

wojtek-t commented Jan 11, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2022
@dgrisonnet
Member

dgrisonnet commented Jan 11, 2022

/assign

@k8s-triage-robot

k8s-triage-robot commented Apr 11, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2022
@dgrisonnet
Member

dgrisonnet commented Apr 12, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2022
@logicalhan
Contributor

logicalhan commented May 12, 2022

We need to find an owner for this. This is actually not GA because it's not completely migrated, and now we are in a weird position where we use two Events APIs.

@wojtek-t
Member

wojtek-t commented May 12, 2022

This is actually not GA because it's not completely migrated and now we are in a weird position where we use two events APIs.

They are round-trippable across each other, so it's not a big deal from that POV.
But I agree we should finally finalize this migration...
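
For illustration, here is a rough sketch of how the two representations map onto each other. Field names follow the published core/v1 and events/v1 Go types; the real round-trip conversion lives in the apiserver's generated conversion functions, so treat this as an approximation rather than the actual implementation:

```go
package eventsketch

import (
	corev1 "k8s.io/api/core/v1"
	eventsv1 "k8s.io/api/events/v1"
)

// coreToEventsV1 sketches the core/v1 -> events/v1 field mapping. The legacy
// fields survive as deprecated* fields, which is what makes the two
// representations round-trippable.
func coreToEventsV1(in *corev1.Event) *eventsv1.Event {
	return &eventsv1.Event{
		ObjectMeta:          in.ObjectMeta,
		EventTime:           in.EventTime,
		Type:                in.Type,
		Reason:              in.Reason,
		Action:              in.Action,
		Regarding:           in.InvolvedObject, // renamed: involvedObject -> regarding
		Related:             in.Related,
		Note:                in.Message, // renamed: message -> note
		ReportingController: in.ReportingController,
		ReportingInstance:   in.ReportingInstance,
		// Legacy fields are preserved rather than dropped:
		DeprecatedSource:         in.Source,
		DeprecatedFirstTimestamp: in.FirstTimestamp,
		DeprecatedLastTimestamp:  in.LastTimestamp,
		DeprecatedCount:          in.Count,
		// in.Series would map to an eventsv1.EventSeries; omitted here since
		// the two EventSeries types differ slightly between API groups.
	}
}
```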

@k8s-triage-robot

k8s-triage-robot commented Aug 10, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@dgrisonnet
Member

dgrisonnet commented Aug 31, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 31, 2022
@jdumars
Member

jdumars commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up? That may help us identify people who can work on this.

@lavalamp
Member

lavalamp commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up?

+1, there are like 300 comments on this and it's impossible to read now. I'd actually like to close this in favor of a follow-up issue, or edit the OP with the remaining needs. I don't know the state either.

@dgrisonnet
Member

dgrisonnet commented Sep 1, 2022

The KEP is stable and the new Events API has also graduated to stable, but not all Kubernetes components have been migrated from the core/v1 API to the events/v1 one, which is why we are keeping this issue alive. Out of all the components, I think only the kube-scheduler was migrated. There were some concerns in the past that weren't addressed, which kind of paused the full migration.

For example the kube-controller-manager migration was halted: kubernetes/kubernetes#82834

I was just discussing that topic with @liggitt today, since downstream we are often hit by misconfigured clients spamming the apiserver with similar events, which often results in the creation of tens of thousands of Event objects in the apiserver. The go-to way to solve this problem is to migrate to the new Events API, since it was designed to reduce the number of calls made to the apiserver when identical Events are emitted in a short span of time.
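
As a concrete illustration, a minimal sketch of what emitting through the new API looks like with client-go's k8s.io/client-go/tools/events package (signatures per client-go v0.25; the "my-controller" component name and the sample event are made up):

```go
package eventsketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/events"
)

// newRecorder wires an events/v1 recorder to the apiserver sink. Similar
// events emitted within the aggregation window are folded into one Event
// object with a growing EventSeries instead of N separate objects.
func newRecorder(client kubernetes.Interface, stopCh <-chan struct{}) events.EventRecorder {
	broadcaster := events.NewBroadcaster(&events.EventSinkImpl{Interface: client.EventsV1()})
	broadcaster.StartRecordingToSink(stopCh)
	return broadcaster.NewRecorder(scheme.Scheme, "my-controller")
}

// emit shows the events/v1 call shape: note the extra "action" argument
// that core/v1 recorders don't have.
func emit(recorder events.EventRecorder, pod *corev1.Pod) {
	recorder.Eventf(pod, nil, corev1.EventTypeWarning, "FailedScheduling", "Scheduling", "no nodes available")
}
```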

So far we've found two areas where the current implementation of the new Events API could be improved in order to make the migration a lot safer for all the components.

  1. The current UX of the events/v1 API is very different from the core/v1 one. For instance, when an event is emitted with the core/v1 API, it is directly created in the apiserver, but the events/v1 API behaves completely differently. The new API introduces the notion of an EventSeries. A series is a run of similar events that occur within 6 minutes of one another. When an event is first emitted, a call is sent to the apiserver to create the Event object. If another similar event is emitted within 6 minutes of the first one, instead of being created on the apiserver directly, an EventSeries is started and the event's counter is incremented in the cache. But the Event object is only updated when the EventSeries finishes or when the event broadcaster hits a 30-minute heartbeat. So if a series of similar events occurs, which is often the case for pod startup issues, the user will only learn that the event occurred multiple times after ~30 minutes, which is far from ideal (see the first sketch after this list).
    To improve that, we want to replace the 30-minute heartbeat mechanism with a backoff-based one on series, which would allow a more responsive API without spamming the apiserver too much.

  2. The second aspect is rate limiting. There was a time when the kubelet was creating an insane number of unique events, which impacted the kube-apiserver massively. To prevent that, a rate-limiting mechanism was added. I know that there is one that limits the size of the event queue to 1000 objects before dropping them (https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/event.go#L41), but Jordan mentioned that there might be something else. I still need to investigate that part, but as of today I wouldn't be able to say with confidence that the new API has mechanisms to protect against a surge of unique events, which is something we definitely want to verify before migrating all the components (see the second sketch after this list).
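
To make point 1 concrete, here is a toy model of the aggregation behavior described above. This is an illustrative sketch with made-up names, not the actual client-go broadcaster code:

```go
package eventsketch

import "time"

const (
	seriesWindow    = 6 * time.Minute  // gap beyond which a new series starts
	heartbeatPeriod = 30 * time.Minute // how rarely an ongoing series is flushed
)

type seriesState struct {
	count    int32
	lastSeen time.Time
	lastSync time.Time // last time the apiserver copy was refreshed
}

// observe records one occurrence of an event key and reports whether the
// apiserver should be called now (initial create or heartbeat flush).
// Between flushes the count only grows in the local cache, which is why
// users can wait ~30 minutes to see that an event repeated.
func observe(cache map[string]*seriesState, key string, now time.Time) bool {
	s, ok := cache[key]
	if !ok || now.Sub(s.lastSeen) > seriesWindow {
		cache[key] = &seriesState{count: 1, lastSeen: now, lastSync: now}
		return true // first occurrence: create the Event object
	}
	s.count++
	s.lastSeen = now
	if now.Sub(s.lastSync) >= heartbeatPeriod {
		s.lastSync = now
		return true // heartbeat: the stored Event finally gets the new count
	}
	return false // counted in memory only; the stored Event stays stale
}
```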

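For point 2, here is a toy version of the drop-on-overflow behavior implied by the linked maxQueuedEvents constant. Again an illustrative sketch, not the actual tools/record implementation (which builds on a watch.Broadcaster):

```go
package eventsketch

import corev1 "k8s.io/api/core/v1"

// maxQueuedEvents mirrors the limit linked above: at most 1000 events are
// buffered before new ones are dropped instead of blocking the caller.
const maxQueuedEvents = 1000

type droppingSink struct {
	queue chan *corev1.Event
}

func newDroppingSink() *droppingSink {
	return &droppingSink{queue: make(chan *corev1.Event, maxQueuedEvents)}
}

// enqueue never blocks: when the buffer is full the event is silently
// dropped, trading completeness for protection of the emitting component
// (and, indirectly, the apiserver).
func (s *droppingSink) enqueue(e *corev1.Event) bool {
	select {
	case s.queue <- e:
		return true
	default:
		return false
	}
}
```
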
I've only started looking at this again yesterday, so there might be more things to think about, but for now my plan is to prepare a new KEP detailing the migration process and the different safeguards that are in place to prevent any issues post-migration.

akhilerm pushed a commit to akhilerm/apimachinery that referenced this issue Sep 20, 2022
Automatic merge from submit-queue (batch tested with PRs 55952, 49112, 55450, 56178, 56151). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

New API group for Events.

Fix kubernetes/enhancements#383

cc @shyamjvs 

```release-note
Add events.k8s.io api group with v1beta1 API containing redesigned Event type.
```

Kubernetes-commit: 60c20901911c710491a57eb8b9c48850cdbab054
@k8s-triage-robot

k8s-triage-robot commented Nov 30, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@dgrisonnet
Member

dgrisonnet commented Nov 30, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022