Fix GC sync race condition #56446
@@ -173,21 +173,14 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 		// Get the current resource list from discovery.
 		newResources := GetDeletableResources(discoveryClient)
 
-		// Detect first or abnormal sync and try again later.
-		if oldResources == nil || len(oldResources) == 0 {
-			oldResources = newResources
-			return
-		}
-
 		// Decide whether discovery has reported a change.
 		if reflect.DeepEqual(oldResources, newResources) {
 			glog.V(5).Infof("no resource updates from discovery, skipping garbage collector sync")
 			return
 		}
 
-		// Something has changed, so track the new state and perform a sync.
+		// Something has changed, time to sync.
 		glog.V(2).Infof("syncing garbage collector with updated resources from discovery: %v", newResources)
-		oldResources = newResources
 
 		// Ensure workers are paused to avoid processing events before informers
 		// have resynced.
@@ -212,9 +205,19 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 			utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
 			return
 		}
+		// TODO: WaitForCacheSync can block forever during normal operation. Could
+		// pass a timeout channel, but we have to consider the implications of
+		// un-pausing the GC with a partially synced graph builder.
 		if !controller.WaitForCacheSync("garbage collector", stopCh, gc.dependencyGraphBuilder.IsSynced) {
 			utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync"))
 			return
 		}
+
+		// Finally, keep track of our new state. Do this after all preceding steps
+		// have succeeded to ensure we'll retry on subsequent syncs if an error
+		// occurred.
+		oldResources = newResources
+		glog.V(2).Infof("synced garbage collector")
 	}, period, stopCh)
 }

Review comment (on "// Finally, keep track of our new state."): Add a V(2) log message here saying that the sync completed. I want to be able to pair it with the log message from line 183 to know the resync completed.
Reply: Done.

Review comment (on "oldResources = newResources"): Although it hasn't been reported anywhere (that I know of), @liggitt noticed this potential bug while reviewing the original patch. If there are no objections, I'd like to bundle the fix in this PR.
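The ordering change in this diff (recording oldResources only after the resync has succeeded) can be illustrated with a small standalone sketch. It is not the garbage collector code: the names resourceSet, syncer, syncOnce, resync, and errTransient are hypothetical stand-ins, and resync here represents the monitor resync plus the cache-sync wait.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// errTransient is a hypothetical error used to simulate a failed resync.
var errTransient = errors.New("transient failure")

// resourceSet is a hypothetical stand-in for the set of deletable resources
// returned by discovery.
type resourceSet map[string]struct{}

// syncer mimics the shape of the sync loop body. resync is a hypothetical
// hook standing in for the monitor resync and the cache-sync wait.
type syncer struct {
	old    resourceSet
	resync func(resourceSet) error
}

// syncOnce runs one iteration and reports whether a resync was attempted.
func (s *syncer) syncOnce(discovered resourceSet) (attempted bool, err error) {
	if reflect.DeepEqual(s.old, discovered) {
		return false, nil // discovery reported no change, so skip
	}
	if err := s.resync(discovered); err != nil {
		// The recorded state is deliberately left untouched, so the next
		// iteration still sees a difference and retries.
		return true, err
	}
	// Record the new state only after every preceding step has succeeded.
	s.old = discovered
	return true, nil
}

func main() {
	calls := 0
	s := &syncer{resync: func(resourceSet) error {
		calls++
		if calls == 1 {
			return errTransient // fail the first attempt only
		}
		return nil
	}}
	res := resourceSet{"apps/v1/deployments": {}}
	for i := 1; i <= 3; i++ {
		attempted, err := s.syncOnce(res)
		fmt.Printf("iteration %d: attempted=%v err=%v\n", i, attempted, err)
	}
}
```

If the state were recorded before the resync, as in the removed lines, the second iteration would compare equal and skip the retry even though the first attempt failed.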
When will it block forever?
If newResources contains GVKs that are being removed and are gone by the time we get here, I think WaitForCacheSync blocks until the informers for newResources are all synced, which will never happen for the now-missing GVKs.
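The TODO in the diff mentions passing a timeout channel as one possible way to bound this wait. A rough sketch of that idea follows. It assumes a simplified polling wait, so waitForSync below is a stand-in and not the Kubernetes implementation, and waitForSyncWithTimeout is a hypothetical helper. It does not address the open question from the TODO about un-pausing the GC with a partially synced graph builder.

```go
package main

import (
	"fmt"
	"time"
)

// waitForSync is a simplified stand-in for a cache-sync wait: it polls the
// given predicates until all of them return true or stopCh is closed.
func waitForSync(stopCh <-chan struct{}, synced ...func() bool) bool {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		all := true
		for _, fn := range synced {
			if !fn() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

// waitForSyncWithTimeout (hypothetical) gives up when the parent stop channel
// closes or the timeout elapses, whichever happens first.
func waitForSyncWithTimeout(parent <-chan struct{}, timeout time.Duration, synced ...func() bool) bool {
	stopCh := make(chan struct{})
	done := make(chan struct{})
	defer close(done) // releases the goroutine once the wait has returned
	go func() {
		defer close(stopCh)
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		select {
		case <-parent:
		case <-timer.C:
		case <-done:
		}
	}()
	return waitForSync(stopCh, synced...)
}

func main() {
	// Simulates an informer for a GVK that has disappeared: it never syncs.
	never := func() bool { return false }
	ok := waitForSyncWithTimeout(make(chan struct{}), 100*time.Millisecond, never)
	fmt.Println("synced:", ok)
}
```

With a predicate that never reports synced, the plain wait would block until the parent stop channel closes, while the bounded variant returns false once the timeout elapses.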
Shall we plumb a flag into the reflector to instruct it to not retry listing if the error is 404?
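To make the proposal concrete, here is a minimal sketch of a list-retry loop with such a flag. Everything in it is hypothetical: errNotFound stands in for an HTTP 404 from the API server, listFunc for the reflector's list call, and stopOnNotFound for the proposed flag. The attempt limit exists only to keep the example finite and does not model how a real reflector retries.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for an HTTP 404.
var errNotFound = errors.New("resource not found (404)")

// listFunc is a hypothetical list call.
type listFunc func() error

// listWithRetry calls list up to maxAttempts times. When stopOnNotFound is
// set, a not-found error ends the loop immediately instead of being retried.
func listWithRetry(list listFunc, maxAttempts int, stopOnNotFound bool) (attempts int, err error) {
	for attempts < maxAttempts {
		attempts++
		err = list()
		if err == nil {
			return attempts, nil
		}
		if stopOnNotFound && errors.Is(err, errNotFound) {
			return attempts, err
		}
	}
	return attempts, err
}

func main() {
	// Simulates listing a resource type that has been removed from the server.
	gone := func() error { return fmt.Errorf("list widgets: %w", errNotFound) }
	for _, stop := range []bool{false, true} {
		attempts, err := listWithRetry(gone, 5, stop)
		fmt.Printf("stopOnNotFound=%v attempts=%d err=%v\n", stop, attempts, err)
	}
}
```

Whether giving up on a 404 is safe for the garbage collector is exactly the question the following replies leave open.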
I don't know... requires more thought. @ironcladlou can you open an issue to track this to make sure it stays high on our radar?
Is there an open issue for this? Based on the logs, I think it is the cause of #60037
@liggitt @ironcladlou I couldn't find the issue tracking this race. I am wondering if something like #61057 will be enough to fix or mitigate this race condition.
@jennybuckley looks like I neglected to create the followup issue for this. I've been slammed the past couple weeks with things unrelated to Kube but I'll try to take a look at #61057 today. Thank you!