Automated cherry pick of #114237: tools/events: retry on AlreadyExist for Series #114236: tools/events: fix data race when emitting series #112334: events: fix EventSeries starting count discrepancy #119375
Conversation
When attempting to record a new Event and a new Series on the apiserver at the same time, the patch of the Series might happen before the Event is actually created. In that case, we handle the error and try to create the Event. But the Event might be created during that window, and today that is treated as an error. So in order to handle that scenario, we need to retry when a Create call for a Series results in an AlreadyExists error. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
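The retry flow described above can be sketched as follows. This is a minimal, self-contained reconstruction, not the real client-go code: `fakeSink`, `patchSeries`, `createEvent`, and the sentinel errors are all illustrative stand-ins for the apiserver's NotFound and AlreadyExists status reasons.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for the apiserver's NotFound and
// AlreadyExists status reasons (illustrative names, not the real API).
var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// fakeSink simulates the race: the Event does not exist when the
// Series is first patched, but appears before our Create lands.
type fakeSink struct {
	eventExists bool
	patches     int
}

func (s *fakeSink) patchSeries() error {
	s.patches++
	if !s.eventExists {
		return errNotFound
	}
	return nil
}

func (s *fakeSink) createEvent() error {
	// A concurrent writer already created the Event.
	s.eventExists = true
	return errAlreadyExists
}

// recordSeries captures the gist of the fix: AlreadyExists from
// Create is no longer fatal; it means the Event now exists, so the
// Series patch is simply retried.
func recordSeries(s *fakeSink) error {
	err := s.patchSeries()
	if err == nil || !errors.Is(err, errNotFound) {
		return err
	}
	if err := s.createEvent(); err != nil {
		if errors.Is(err, errAlreadyExists) {
			return s.patchSeries() // retry instead of failing
		}
		return err
	}
	return nil
}

func main() {
	s := &fakeSink{}
	fmt.Println("err:", recordSeries(s), "patch attempts:", s.patches)
	// → err: <nil> patch attempts: 2
}
```

Before the fix, the `errAlreadyExists` branch would have surfaced the error to the caller even though the Series could now be patched successfully.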
There was a data race in the recordToSink function that caused changes to the events cache to be overridden if events were emitted simultaneously via Eventf calls. The race lies in the fact that when recording an Event, there might be multiple calls updating the cache simultaneously. The lock period was optimized so that after updating the cache with the new Event, the lock is released until the Event is recorded on the apiserver side, and then the cache is locked again to be updated with the new value returned by the apiserver. There are a few problems with this approach:

1. If two identical Events are emitted successively, the changes of the second Event will override the first one. In code, the following happens:
   1. Eventf(ev1)
   2. Eventf(ev2)
   3. Lock cache
   4. Set cache[getKey(ev1)] = &ev1
   5. Unlock cache
   6. Lock cache
   7. Update cache[getKey(ev2)] = &ev1 + Series{Count: 1}
   8. Unlock cache
   9. Start attempting to record the first event &ev1 on the apiserver side.

   This can be mitigated by recording a copy of the Event stored in the cache instead of reusing the pointer from the cache.

2. When the Event has been recorded on the apiserver, the cache is updated again with the value of the Event returned by the server. This update will override any changes made to the cache entry while attempting to record the new Event, since the cache was unlocked at that time. This might lead to inconsistencies when dealing with EventSeries, since the count may be overridden or the client might even try to record the first isomorphic Event multiple times. This could be mitigated with a lock that has a larger scope, but we shouldn't reflect the Event returned by the apiserver in the cache in the first place, since mutation could mess with the aggregation by either allowing users to manipulate values to update a different cache entry or even having two cache entries for the same Event.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
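The copy-based mitigation described above can be illustrated with a stripped-down cache. This is a sketch under assumed names (`event`, `eventCache`, `record` are not the real client-go types): the key point is that the recorder works on a value copy taken under the lock, so later Eventf calls mutating the shared entry cannot override the Event being sent.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a stripped-down stand-in for the client's cached Event;
// Count mimics the EventSeries count.
type event struct {
	Key   string
	Count int
}

type eventCache struct {
	mu sync.Mutex
	m  map[string]*event
}

// record updates the cache under the lock and hands back a *copy*
// of the cached entry. The recording goroutine operates on the copy,
// so a subsequent Eventf call bumping the shared entry can no longer
// race with the in-flight request, and the server's response is never
// written back into the cache.
func (c *eventCache) record(key string) event {
	c.mu.Lock()
	defer c.mu.Unlock()
	ev, ok := c.m[key]
	if !ok {
		ev = &event{Key: key}
		c.m[key] = ev
	}
	ev.Count++
	return *ev // value copy, not the shared pointer
}

func main() {
	c := &eventCache{m: map[string]*event{}}
	first := c.record("ev")
	second := c.record("ev") // isomorphic event bumps the count
	// first still holds Count 1: the second update did not mutate
	// the snapshot handed to the recorder.
	fmt.Println(first.Count, second.Count)
	// → 1 2
}
```

Had `record` returned the shared `*event` instead, step 7 of the sequence above would have rewritten the first event's fields before it was sent to the apiserver.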
The kube-apiserver validation expects the Count of an EventSeries to be at least 2, otherwise it rejects the Event. There was a discrepancy between the client and the server, since the client was initializing an EventSeries with a count of 1. According to the original KEP, the first event emitted should have an EventSeries set to nil and the second isomorphic event should have an EventSeries with a count of 2. Thus, we should match the behavior defined by the KEP and update the client. Also, in an effort to keep old clients compatible with the servers, we should allow Events with an EventSeries count of 1 to prevent any unexpected rejections. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
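The counting rule from the KEP can be sketched like this. The types and the `nextEvent` helper are hypothetical stand-ins for corev1.Event and corev1.EventSeries, just to show where the count starts after the fix.

```go
package main

import "fmt"

// Minimal stand-ins for the real Event/EventSeries types.
type eventSeries struct{ Count int32 }

type event struct {
	Note   string
	Series *eventSeries
}

// nextEvent mirrors the KEP behavior after the fix: the first
// occurrence carries no Series, and the second isomorphic event
// starts the Series at 2 (the original occurrence plus this
// repetition), satisfying the apiserver validation that rejects
// counts below 2.
func nextEvent(prev *event, note string) *event {
	switch {
	case prev == nil:
		// First occurrence: Series stays nil.
		return &event{Note: note}
	case prev.Series == nil:
		// Second isomorphic occurrence: start counting at 2.
		return &event{Note: note, Series: &eventSeries{Count: 2}}
	default:
		return &event{Note: note, Series: &eventSeries{Count: prev.Series.Count + 1}}
	}
}

func main() {
	var ev *event
	for i := 0; i < 3; i++ {
		ev = nextEvent(ev, "FailedScheduling")
		if ev.Series == nil {
			fmt.Println("series: nil")
		} else {
			fmt.Println("series count:", ev.Series.Count)
		}
	}
	// → series: nil
	// → series count: 2
	// → series count: 3
}
```

The buggy client started the Series at 1 instead, which the server-side validation rejected and which caused the first repetition to be dropped.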
/kind bug
/cc @alculquicondor
Excellent! /lgtm
LGTM label has been added. Git tree hash: 04c78f2c2c1a88b9209d42ffe6e91fabbbb49b2b
/triage accepted |
/assign @aojea |
/lgtm
@kubernetes/release-managers for cherrypick approval
/cc @kubernetes/release-managers |
For RelEng:
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgrisonnet, wojtek-t, xmudrii. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubernetes-unit
Cherry pick of #114237 #114236 #112334 on release-1.26.
#114237: tools/events: retry on AlreadyExist for Series
#114236: tools/events: fix data race when emitting series
#112334: events: fix EventSeries starting count discrepancy
For details on the cherry pick process, see the cherry pick requests page.
Special notes for your reviewer:
#112334 is the real user-facing bug fix that I am trying to backport here. All Kubernetes users are affected by this bug and could be seeing the following error in their kube-scheduler logs:
In practice, this error causes the first repetition of an Event to be skipped. This means users will be aware that an Event occurred, but they will only learn that it kept repeating after 30 minutes, which in the case of the kube-scheduler might leave them unaware of recurring scheduling failures for quite a long time.
#114236 #114237 are side fixes that are required by the #112334 integration test. They fix two different bugs so that the same event can be sent consecutively without any occurrence being discarded, e.g.:
https://github.com/kubernetes/kubernetes/pull/112334/files#diff-4e1bcdea5320a17d795eb29ce0a180740058e0d09f27d4a21613c9e59b75a93fR132-R133