Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated cherry pick of #114237: tools/events: retry on AlreadyExist for Series #114236: tools/events: fix data race when emitting series #112334: events: fix EventSeries starting count discrepancy #119375

Conversation

dgrisonnet
Copy link
Member

@dgrisonnet dgrisonnet commented Jul 17, 2023

Cherry pick of #114237 #114236 #112334 on release-1.26.

#114237: tools/events: retry on AlreadyExist for Series
#114236: tools/events: fix data race when emitting series
#112334: events: fix EventSeries starting count discrepancy

For details on the cherry pick process, see the cherry pick requests page.

Special notes for your reviewer:

#112334 is the real user-facing bug fix that I am trying to backport here. All Kubernetes users are affected by this bug and could be seeing the following error in their kube-scheduler logs:

'Event "apiserver-7bc6866f9c-2vk6r.17123d9ef072db8c" is invalid: [series.count: Invalid value: "": should be at least 2]' (will not retry!)

In practice this error will cause the first repetition of an Event to be skipped which means that users will be aware that an Event occured but they will only know that it kept repeating itself after 30 minutes which in the case of the kube-scheduler might cause the users to be unaware of scheduling failures for quite a long time.

#114236 #114237 are side fixes that are required by #112334 integration test. They fix two different bugs that allow sending the same event consecutively without discarding any of them. e.g:
https://github.com/kubernetes/kubernetes/pull/112334/files#diff-4e1bcdea5320a17d795eb29ce0a180740058e0d09f27d4a21613c9e59b75a93fR132-R133

Update the Event series starting count when emitting isomorphic events from 1 to 2.

When attempting to record a new Event and a new Serie on the apiserver
at the same time, the patch of the Serie might happen before the Event
is actually created. In that case, we handle the error and try to create
the Event. But the Event might be created during that period of time and
it is treated as an error today. So in order to handle that scenario, we
need to retry when a Create call for a Serie results in an AlreadyExist
error.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
There was a data race in the recordToSink function that caused changes
to the events cache to be overriden if events were emitted
simultaneously via Eventf calls.

The race lies in the fact that when recording an Event, there might be
multiple calls updating the cache simultaneously. The lock period is
optimized so that after updating the cache with the new Event, the lock
is unlocked until the Event is recorded on the apiserver side and then
the cache is locked again to be updated with the new value returned by
the apiserver.

The are a few problem with the approach:

1. If two identical Events are emitted successively the changes of the
   second Event will override the first one. In code the following
   happen:
   1. Eventf(ev1)
   2. Eventf(ev2)
   3. Lock cache
   4. Set cache[getKey(ev1)] = &ev1
   5. Unlock cache
   6. Lock cache
   7. Update cache[getKey(ev2)] = &ev1 + Series{Count: 1}
   8. Unlock cache
   9. Start attempting to record the first event &ev1 on the apiserver side.

   This can be mitigated by recording a copy of the Event stored in
   cache instead of reusing the pointer from the cache.

2. When the Event has been recorded on the apiserver the cache is
   updated again with the value of the Event returned by the server.
   This update will override any changes made to the cache entry when
   attempting to record the new Event since the cache was unlocked at
   that time. This might lead to some inconsistencies when dealing with
   EventSeries since the count may be overriden or the client might even
   try to record the first isomorphic Event multiple time.

   This could be mitigated with a lock that has a larger scope, but we
   shouldn't want to reflect Event returned by the apiserver in the
   cache in the first place since mutation could mess with the
   aggregation by either allowing users to manipulate values to update
   a different cache entry or even having two cache entries for the same
   Events.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
The kube-apiserver validation expects the Count of an EventSeries to be
at least 2, otherwise it rejects the Event. There was is discrepancy
between the client and the server since the client was iniatizing an
EventSeries to a count of 1.

According to the original KEP, the first event emitted should have an
EventSeries set to nil and the second isomorphic event should have an
EventSeries with a count of 2. Thus, we should matcht the behavior
define by the KEP and update the client.

Also, as an effort to make the old clients compatible with the servers,
we should allow Events with an EventSeries count of 1 to prevent any
unexpected rejections.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Jul 17, 2023
@k8s-ci-robot k8s-ci-robot added do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 17, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 17, 2023
@dgrisonnet
Copy link
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jul 17, 2023
@dgrisonnet
Copy link
Member Author

/cc @alculquicondor

@alculquicondor
Copy link
Member

Excellent!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 04c78f2c2c1a88b9209d42ffe6e91fabbbb49b2b

@jiahuif
Copy link
Member

jiahuif commented Jul 20, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 20, 2023
@alculquicondor
Copy link
Member

/assign @aojea

@wojtek-t
Copy link
Member

/lgtm
/approve

@kubernetes/release-managers for cherrypick approval

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2023
@alculquicondor
Copy link
Member

/cc @kubernetes/release-managers

@k8s-ci-robot k8s-ci-robot requested a review from a team July 26, 2023 14:15
Copy link
Member

@xmudrii xmudrii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For RelEng:
/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, wojtek-t, xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xmudrii xmudrii added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 2, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Aug 2, 2023
@xmudrii
Copy link
Member

xmudrii commented Aug 2, 2023

/test pull-kubernetes-unit

@k8s-ci-robot k8s-ci-robot merged commit 694c7d3 into kubernetes:release-1.26 Aug 2, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants