Fix concurrent map writes error in kube-apiserver #106045
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @chenlinx17!
Hi @chenlinx17. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/release-note-none
The CLA has been signed.
/cc @hzxuzhonghu @sttts
/priority important-soon
/assign @tkashem
We have concurrent execution overlap of the handler chain, so ideally we want to move the lock inside audit.LogAnnotation so that the annotations map is always protected, not just from multiple deletion goroutines. But this is not trivial; we can address it in a follow-up PR.
kubernetes/staging/src/k8s.io/apiserver/pkg/audit/request.go
Lines 246 to 259 in 904e972
// LogAnnotation fills in the Annotations according to the key value pair.
func LogAnnotation(ae *auditinternal.Event, key, value string) {
	if ae == nil || ae.Level.Less(auditinternal.LevelMetadata) {
		return
	}
	if ae.Annotations == nil {
		ae.Annotations = make(map[string]string)
	}
	if v, ok := ae.Annotations[key]; ok && v != value {
		klog.Warningf("Failed to set annotations[%q] to %q for audit:%q, it has already been set to %q", key, value, ae.AuditID, ae.Annotations[key])
		return
	}
	ae.Annotations[key] = value
}
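For illustration, a reduced, self-contained reproduction sketch (hypothetical stand-in types, not code from this PR) of how several DeleteCollection workers writing through this unguarded path hit the runtime fault:

package main

import "sync"

// event stands in for auditinternal.Event; only the Annotations map matters here
type event struct {
	Annotations map[string]string
}

// logAnnotation mirrors the unguarded write path of audit.LogAnnotation above
func logAnnotation(ae *event, key, value string) {
	if ae.Annotations == nil {
		ae.Annotations = make(map[string]string)
	}
	ae.Annotations[key] = value
}

func main() {
	ae := &event{} // one audit event shared by the whole request
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // five goroutines play the delete workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				logAnnotation(ae, "authorization.k8s.io/decision", "allow")
			}
		}()
	}
	wg.Wait() // typically dies with "fatal error: concurrent map writes"
}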
@@ -28,6 +29,10 @@ import (
type auditHandler struct {
	Interface
	ae *auditinternal.Event
	// Indicate whether concurrency protection is required
	concurrencyProtection bool
I think it's not very costly to always have concurrency protection on, so I would recommend removing the bool.
I have checked all the call points of audit.LogAnnotation. Currently, only calls from auditHandler trigger this problem.
If I move the lock inside audit.LogAnnotation, theoretically it would be better to declare the lock in auditinternal.Event, but I haven't seen a sync.Mutex declared in types.go before. Should I declare the lock as a global variable?
but I haven't seen a sync.Mutex declared in types.go before. Should I declare the lock as a global variable?
You are right, the types do not have any internal state. Moving the lock inside audit.LogAnnotation is not trivial, so let's address it in a separate PR.
I think what you have right now should resolve the panic we are seeing. We also need to backport this PR to supported releases, so let's keep it short so it is backportable.
should I declare the lock as a global variable?
That would not be request-scoped, so all requests in flight would fight for this lock, which is not good :)
I was looking at the test.
I've added a testcase to simulate a store.DeleteCollection scenario.
So, when you run delete collection with multiple goroutines, each delete worker is going to add the same annotations into the audit event, so it looks like you should see the following warning:
kubernetes/staging/src/k8s.io/apiserver/pkg/audit/request.go
Lines 231 to 234 in ea07644
Have you run into this when you ran with multiple goroutines in your dev environment?
@@ -28,6 +29,8 @@ import (
type auditHandler struct {
	Interface
	ae *auditinternal.Event
nit: you are protecting ae, so as a convention I would suggest moving the lock:

type auditHandler struct {
	...
	mutex sync.Mutex
	ae    *auditinternal.Event
}
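For context, a minimal sketch of how the guarded write path could then look under that convention (the method shape is assumed for illustration, not lifted from the PR):

// every write to the audit event's Annotations goes through the handler's
// mutex, so the goroutines spawned for one request serialize here
func (handler *auditHandler) logAnnotation(key, value string) {
	handler.mutex.Lock()
	defer handler.mutex.Unlock()
	audit.LogAnnotation(handler.ae, key, value)
}

Because the mutex lives on the per-request auditHandler rather than at package level, it stays request-scoped: only the workers sharing one audit event contend for it, which avoids the global-lock concern raised above.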
@@ -28,6 +29,8 @@ import (
type auditHandler struct {
	Interface
	ae *auditinternal.Event
	// Mutex protects the annotations of ae
nit: to protect the 'Annotations' map of the audit event from concurrent writes
// Simulate the scenario store.DeleteCollection
workers := 5
toProcess := make(chan int, 2*workers)
go func() {
If we have one goroutine that reads ae.Annotations and another goroutine that writes to it via auditHandler.Admit, shouldn't that be enough for showing the data race? Just thinking out loud about a simpler version of the test that will expose the data race.
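A reduced sketch of that simpler shape, with a plain map standing in for ae.Annotations and the handler wiring elided, so it only illustrates the read/write race that go test -race would report:

package audit_test

import "testing"

func TestAnnotationsReadWriteRace(t *testing.T) {
	annotations := make(map[string]string) // stands in for ae.Annotations

	done := make(chan struct{})
	go func() { // writer, playing the role of auditHandler.Admit
		defer close(done)
		for i := 0; i < 1000; i++ {
			annotations["authorization.k8s.io/decision"] = "allow"
		}
	}()
	for i := 0; i < 1000; i++ { // concurrent reader on the test goroutine
		_ = annotations["authorization.k8s.io/decision"]
	}
	<-done
}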
Since my lock protection is in auditHandler.logAnnotations, the test requires both goroutines to read/write ae.Annotations via auditHandler.Admit or auditHandler.Validate.
Yes, I did. In fact, when I delete nodes in batches, there is a high probability that this bug occurs. I think the reason is that I added a webhook admission for node update requests, which will trigger …
looks good to me, please add the TODO requested.
@@ -27,7 +28,9 @@ import (
// auditHandler logs annotations set by other admission handlers
type auditHandler struct {
	Interface
	ae *auditinternal.Event
	// to protect the 'Annotations' map of the audit event from concurrent writes
TODO: move the lock near the Annotations field of the audit event so it is always protected from concurrent access.
/lgtm Thanks!
(for approval) /assign @sttts
@chenlinx17 once this merges, I think we need to backport it.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chenlinx17, sttts. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Cherry pick #106045 to 1.22: Fix concurrent map writes error in kube-apiserver
Cherry pick #106045 to 1.21: Fix concurrent map writes error in kube-apiserver
Cherry pick #106045 to 1.20: Fix concurrent map writes error in kube-apiserver
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR adds a sync.Mutex in auditHandler to protect the map audit.Event.Annotations from concurrent writes.

Which issue(s) this PR fixes:
Fixes a concurrent map writes bug when recording audit events in the delete-collection scenario.
Special notes for your reviewer:
The bug's trigger conditions:
1. Audit logging enabled for kube-apiserver
2. Multiple goroutines enabled for delete collection (--delete-collection-workers > 1)
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: