fix: fix data race in integration test #45
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
```go
// generateRawContent strips all the unnecessary fields to prepare the objects for dispatch.
func generateRawContent(object *unstructured.Unstructured) ([]byte, error) {
	// Make a deep copy of the object as we are modifying it.
	object = object.DeepCopy()
```
Who else reads/writes it?
The informer. I put the story in the PR description.
The client-go informer returns a deep copy of the object, so there is no race condition between those.
The informer lister lists resources from the indexer, which is backed by the underlying ThreadSafeStore. Reading the doc (https://pkg.go.dev/k8s.io/client-go/tools/cache#ThreadSafeStore): "TL;DR caveats: you must not modify anything returned by Get or List as it will break the indexing feature in addition to not being thread safe." So the objects must be kept read-only. I don't see a deep copy anywhere as I dig through the code path.
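To illustrate that contract, here is a minimal sketch (placeholder names and wiring, not kubefleet code) of the pattern the ThreadSafeStore doc requires when an object fetched from the cache is going to be modified:

```go
package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
)

// getForMutation fetches an object from an informer-backed indexer and returns
// a private copy. Anything returned by Get/List/GetByKey lives in the shared
// cache and must be treated as read-only, so we deep-copy before any mutation.
func getForMutation(indexer cache.Indexer, key string) (*unstructured.Unstructured, error) {
	item, exists, err := indexer.GetByKey(key)
	if err != nil {
		return nil, err
	}
	if !exists {
		return nil, fmt.Errorf("object %q not found in cache", key)
	}
	// item points into the informer cache; only the copy may be modified.
	return item.(*unstructured.Unstructured).DeepCopy(), nil
}
```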
michaelawyu left a comment:
LGTM ;)
Description of your changes
Recently we have seen some integration test flakiness due to a data race; the race stack looks like this:
Stack Trace
As can be seen from the stack, the race happens when the v1alpha1 CRP controller reconciles some resource and updates it while the shared informer reads it during resource deletion.
This race was introduced by PR #14: when calling `selectResources` (v1alpha1 CRP controller) or `selectResourcesForPlacement` (v1beta1/v1 CRP controller), we removed a deepCopy. And in the `generateRawContent()` function, we modify the object directly. These objects are pointers directly returned from the informer: https://github.com/kubefleet-dev/kubefleet/blob/main/pkg/controllers/clusterresourceplacement/resource_selector.go#L318C3-L332C4. So the controller and the informer are essentially touching the same memory.
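To make the shape of the problem concrete, here is a minimal sketch of the function before and after the fix. The specific field being stripped is only an assumption for illustration; it is not the actual kubefleet field list.

```go
package example

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// Before the fix: the function mutates the object it is handed. When that
// pointer comes straight from the informer cache, the write below races with
// any concurrent read by the informer (for example during resource deletion).
func generateRawContentBefore(object *unstructured.Unstructured) ([]byte, error) {
	object.SetManagedFields(nil) // writes into shared informer-cache memory
	return object.MarshalJSON()
}

// After the fix: the function first takes a deep copy, so all modifications
// happen on private memory and the informer's reads can no longer overlap
// with a write to the same object.
func generateRawContentAfter(object *unstructured.Unstructured) ([]byte, error) {
	object = object.DeepCopy()
	object.SetManagedFields(nil)
	return object.MarshalJSON()
}
```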
Why does the race only show up in the v1alpha1 integration test?
Both the v1alpha1 controller and the v1beta1 controller should have the same issue, as they share the same `generateRawContent()` function. The flaky test is https://github.com/kubefleet-dev/kubefleet/blob/main/test/integration/cluster_placement_test.go#L274, where the test dynamically creates an endpointSlice resource and makes sure the CRP picks it up. There is a `DeferCleanup()` to delete the resource after the test, and there is also an `AfterEach()` where we unjoin the member clusters and delete the CRP: https://github.com/kubefleet-dev/kubefleet/blob/main/test/integration/cluster_placement_test.go#L111C2-L117C4. The race happens when the controller is trying to reconcile the CRP (not yet deleted) because a member cluster is leaving: it re-selects the resources while the test calls deletion on the endpointSlice object. See the test log: 0dad88b#diff-81bbeec7975586def83fe739a55b057049a1c8f852dab38aa16337c8797b53e2
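A simplified skeleton of the test shape described above (the helper functions are placeholders so the sketch compiles, not the actual kubefleet test code):

```go
package integration_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("cluster placement", func() {
	AfterEach(func() {
		// Unjoining the member clusters triggers an asynchronous CRP
		// reconcile in the controller while the CRP still exists.
		unjoinMemberClusters()
		deleteCRP()
	})

	It("picks up a dynamically created endpointSlice", func() {
		createEndpointSlice()
		DeferCleanup(func() {
			// This deletion runs during teardown and can overlap with the
			// reconcile above, which is the window where the race showed up.
			deleteEndpointSlice()
		})
		Eventually(crpSelectsEndpointSlice).Should(BeTrue())
	})
})

// Placeholder helpers; the real test talks to the integration-test API server.
func unjoinMemberClusters()         {}
func deleteCRP()                    {}
func createEndpointSlice()          {}
func deleteEndpointSlice()          {}
func crpSelectsEndpointSlice() bool { return true }
```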
In other tests, we either `DeferCleanup` a resource that is not being picked up by the CRP, or we wait for the CRP to be cleaned up completely before removing any additional resource, so other tests do not observe this race.
How to verify that it actually fixes the race?
I did some local testing. Before the fix, I observed the data race on the 3rd run of the integration test. With the fix, I ran the integration test 20 times locally and did not see the race again.
Fixes #
I have:
Run `make reviewable` to ensure this PR is ready for review.
How has this code been tested
Special notes for your reviewer