Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race in informer transformers #124344

Merged
merged 1 commit into from Apr 24, 2024

Conversation

wojtek-t
Copy link
Member

Fix #124337

Fix a race condition in transforming informer happening when objects were accessed during Resync operation

/kind bug
/priority critical-urgent
/sig api-machinery

/assign @liggitt
/cc @linxiulei

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Apr 17, 2024
@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Apr 17, 2024
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Apr 17 07:56:14 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 17, 2024
}

// queueReplaceActionLocked appends to the delta list for the object.
// Called must lock first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Called must lock first.
// Caller must lock first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// queueReplaceActionLocked appends to the delta list for the object.
// Caller must lock first.
func (f *DeltaFIFO) queueReplaceActionLocked(actionType DeltaType, isReplace bool, obj interface{}) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's confusing to have queueReplaceAction take an isReplace bool param... if I understand correctly, we need this so Replace can indicate the internal action type even when emitDeltaTypeReplaced is false?

if so, would something like this be clearer?

diff --git a/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go b/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
index 7160bb1ee72..5ab888d7c4d 100644
--- a/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
+++ b/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
@@ -440,22 +440,37 @@ func isDeletionDup(a, b *Delta) *Delta {
 // queueActionLocked appends to the delta list for the object.
 // Caller must lock first.
 func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {
+	return f.queueInternalActionLocked(actionType, actionType, obj)
+}
+
+// queueActionLocked appends to the delta list for the object.
+// The actionType is emitted, and must honor emitDeltaTypeReplaced.
+// The internalActionType is only used within this function, and must ignore emitDeltaTypeReplaced.
+// Caller must lock first.
+func (f *DeltaFIFO) queueInternalActionLocked(actionType, internalActionType DeltaType, obj interface{}) error {
 	id, err := f.KeyOf(obj)
 	if err != nil {
 		return KeyError{obj, err}
 	}
 
	...

 	if f.transformer != nil {
-		var err error
-		obj, err = f.transformer(obj)
-		if err != nil {
-			return err
+		_, isTombstone := obj.(DeletedFinalStateUnknown)
+		isSyncAction := internalActionType == Sync
 		 	...
 		}
 	}
 
@@ -638,7 +653,7 @@ func (f *DeltaFIFO) Replace(list []interface{}, _ string) error {
 			return KeyError{item, err}
 		}
 		keys.Insert(key)
-		if err := f.queueActionLocked(action, item); err != nil {
+		if err := f.queueInternalActionLocked(action, Replaced, item); err != nil {
 			return fmt.Errorf("couldn't enqueue object: %v", err)
 		}
 	}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree it's cleaner - changed

@wojtek-t
Copy link
Member Author

Cluster failed to create master

/retest

Comment on lines +150 to +152
// It's recommended for the TransformFunc to be idempotent.
// It MUST be idempotent if objects already present in the cache are passed to
// the Replace() to avoid re-mutating them. Default informers do not pass
// existing objects to Replace though.
Copy link

@tpantelis tpantelis Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little wary of this constraint. Consumers of the TransformFunc will not be aware of nor have control over how or where the object they receive came from. Nor should they care - that's an implementation detail of the DeltaFIFO. Also the DeltaFIFO itself does not have control over the origin of the objects from the upstream informers. I think it would be ideal for the DeltaFIFO to provide protection for the TransformFunc from potential data races. If an object already exists in the knownObjects cache and, more specifically, points to the same instance then make a copy of it before passing to the TransformFunc. The TransformFunc would then truly be free to mutate the object w/o any additional idempotency constraint. In fact we could go a step further and elide invoking the TransformFunc altogether in this case since we know the object came from the cache and was already transformed. Then we wouldn't really even need to special-case the tombstone or Sync action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that preventing more races is desired, however these for now these don't expose default usage of it. We should protect it further, but preventing existing issues seems more important to start with.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however these for now these don't expose default usage of it.

I'm not clear on what you mean by this.

I think the text above would be a bit vague and confusing to users of the API. It would be for me if I hadn't studied the code and understand the issue and what it's trying say. With the changes in this PR, the potential race issue with passing cached objects to the DeltaFIFO could only occur with a custom informer and it's probably unlikely anyone is doing that so I'd omit this text. In fact, this should really documented as part of the DeltaFIFO API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however these for now these don't expose default usage of it.

I'm not clear on what you mean by this.

After the fix in this PR to omit Sync events and DeletedFinalStateUnknown objects, none of the uses of DeltaFIFO that client-go sets up (default informers / clients, etc) feed objects already present in knownObjects back into the transformer.

That means the only way a DeltaFIFO user would encounter that is if they wired up a non-default usage of DeltaFIFO to pass the same objects into DeltaFIFO multiple times as if they were new objects.

Someone doing that is presumably the same person wiring up the TransformFunc, and they need to understand the implications of what they are doing. Doing this effectively already breaks the TransformFunc sees the object before any other actor, and it is now safe to mutate the object in place instead of making a copy. statement / guarantee, so the TransformFunc author must be aware of the implications of mutating objects it gets when they are being provided by something with different behavior than default client-go clients / informers.

If an object already exists in the knownObjects cache and, more specifically, points to the same instance then make a copy of it before passing to the TransformFunc

The tracking / comparison cost to safeguard TransformFunc in that case is too high, and even if we did, TransformFunc would have to be aware of and protect against disrupting a non-default object provider.

Copy link

@tpantelis tpantelis Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the reasoning. My point is that the TransformFunc is part of a higher-level API and is much more widely used and visible then the DeltaFIFO. The New*Informer functions are really the external-facing aspect of the cache module and I'm sure most users of these aren't even aware of DeltaFIFO and its semantics (I wasn't until I investigated this issue). As you mentioned, the author of a non-default usage of DeltaFIFO, if there are any, is "presumably the same person wiring up the TransformFunc", therefore document the constraints and implications re: caching and the TransformFunc in the DeltaFIFO API rather than here. This makes sense b/c the constraint that the TransformFunc be idempotent only applies if a non-default DeltaFIFO producer is used. As I said, I think the paragraph above will only confuse most people. Anyway, that's my 2 cents...

// through the transform object separately (when it was added / updated prior
// to the delete), so the TransformFunc can likely safely ignore such objects
// (i.e., just return the input object).
// TransformFunc (similarly to ResourceEventHandler functions).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can delete this sentence fragment, the DeletedFinalStateUnknown bit no longer applies

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +150 to +152
// It's recommended for the TransformFunc to be idempotent.
// It MUST be idempotent if objects already present in the cache are passed to
// the Replace() to avoid re-mutating them. Default informers do not pass
// existing objects to Replace though.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however these for now these don't expose default usage of it.

I'm not clear on what you mean by this.

After the fix in this PR to omit Sync events and DeletedFinalStateUnknown objects, none of the uses of DeltaFIFO that client-go sets up (default informers / clients, etc) feed objects already present in knownObjects back into the transformer.

That means the only way a DeltaFIFO user would encounter that is if they wired up a non-default usage of DeltaFIFO to pass the same objects into DeltaFIFO multiple times as if they were new objects.

Someone doing that is presumably the same person wiring up the TransformFunc, and they need to understand the implications of what they are doing. Doing this effectively already breaks the TransformFunc sees the object before any other actor, and it is now safe to mutate the object in place instead of making a copy. statement / guarantee, so the TransformFunc author must be aware of the implications of mutating objects it gets when they are being provided by something with different behavior than default client-go clients / informers.

If an object already exists in the knownObjects cache and, more specifically, points to the same instance then make a copy of it before passing to the TransformFunc

The tracking / comparison cost to safeguard TransformFunc in that case is too high, and even if we did, TransformFunc would have to be aware of and protect against disrupting a non-default object provider.

// place to call the transform func.
//
// If obj is a DeletedFinalStateUnknown tombstone or the action is a Sync,
// then the object have already done through the transformer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// then the object have already done through the transformer.
// then the object have already gone through the transformer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@cici37
Copy link
Contributor

cici37 commented Apr 23, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 23, 2024
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt - PTAL

// place to call the transform func.
//
// If obj is a DeletedFinalStateUnknown tombstone or the action is a Sync,
// then the object have already done through the transformer.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// through the transform object separately (when it was added / updated prior
// to the delete), so the TransformFunc can likely safely ignore such objects
// (i.e., just return the input object).
// TransformFunc (similarly to ResourceEventHandler functions).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@liggitt
Copy link
Member

liggitt commented Apr 24, 2024

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 24, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5e221e38115bde0dd943e48a3b1848a3289453be

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0b15f8c into kubernetes:master Apr 24, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DeltaFIFO TransformFunc can result in a data race on re-sync
6 participants