
Fix GC sync race condition #56446

Merged 4 commits on Nov 28, 2017.
19 changes: 11 additions & 8 deletions pkg/controller/garbagecollector/garbagecollector.go
```diff
@@ -173,21 +173,14 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 		// Get the current resource list from discovery.
 		newResources := GetDeletableResources(discoveryClient)
 
-		// Detect first or abnormal sync and try again later.
-		if oldResources == nil || len(oldResources) == 0 {
-			oldResources = newResources
-			return
-		}
-
 		// Decide whether discovery has reported a change.
 		if reflect.DeepEqual(oldResources, newResources) {
 			glog.V(5).Infof("no resource updates from discovery, skipping garbage collector sync")
 			return
 		}
 
-		// Something has changed, so track the new state and perform a sync.
+		// Something has changed, time to sync.
 		glog.V(2).Infof("syncing garbage collector with updated resources from discovery: %v", newResources)
-		oldResources = newResources
 
 		// Ensure workers are paused to avoid processing events before informers
 		// have resynced.
```
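
The effect of this first hunk is easier to see in isolation: the old code updated `oldResources` before any of the sync work ran, so if a later step failed, the next tick's `reflect.DeepEqual` check passed and the failure was never retried until discovery changed again. A minimal sketch of the commit-after-success pattern the PR moves to (`syncOnce`, `doSync`, and the string-keyed maps are illustrative stand-ins, not code from the PR):

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// syncOnce mimics one tick of the GC's wait.Until loop and returns the state
// to carry into the next tick. Committing cur only after doSync succeeds is
// what makes a failed tick retryable: on error the stale state is kept, so
// the change is re-detected on the next tick.
func syncOnce(old, cur map[string]struct{}, doSync func() error) map[string]struct{} {
	if reflect.DeepEqual(old, cur) {
		return old // discovery reported no change; nothing to do
	}
	if err := doSync(); err != nil {
		return old // keep stale state so the next tick retries the sync
	}
	return cur // every step succeeded; commit the new state
}

func main() {
	old := map[string]struct{}{"pods": {}}
	cur := map[string]struct{}{"pods": {}, "widgets": {}}

	old = syncOnce(old, cur, func() error { return errors.New("monitors failed to sync") })
	fmt.Println(reflect.DeepEqual(old, cur)) // false: the change will be retried

	old = syncOnce(old, cur, func() error { return nil })
	fmt.Println(reflect.DeepEqual(old, cur)) // true: the change is fully processed
}
```
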
```diff
@@ -212,9 +205,19 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 		utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
 		return
 	}
+	// TODO: WaitForCacheSync can block forever during normal operation. Could
```
Member: When will it block forever?

Member: If newResources contains GVKs that are being removed, and they are gone by the time we get here, I think this blocks until the informers for newResources are all synced, which will never happen successfully for the now-missing GVKs:

  1. add new resource (CRD, add-on APIService, etc)
  2. GC GetDeletableResources sees new resource, notices something has changed
  3. delete new resource
  4. GC resyncMonitors sets up new monitors/informers for newResources, including for the now-gone resource
  5. GC WaitForCacheSync waits until the informers are all synced, which will never succeed for the now-gone resource
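
A small runnable illustration of step 5, assuming nothing beyond client-go's exported `cache.WaitForCacheSync` (the two `HasSynced` stand-ins and the two-second shutdown are hypothetical): the call returns true only once every sync func reports true, and returns false only when the stop channel closes, so a single informer whose list can never succeed pins the wait for as long as the controller keeps running.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

func main() {
	stopCh := make(chan struct{})

	liveSynced := func() bool { return true }  // healthy informer
	goneSynced := func() bool { return false } // informer for a deleted GVK: its list fails forever

	// Without this goroutine the wait below would never return: it only ends
	// when all funcs report true or stopCh closes. Closing stopCh stands in
	// for controller shutdown.
	go func() {
		time.Sleep(2 * time.Second)
		close(stopCh)
	}()

	ok := cache.WaitForCacheSync(stopCh, liveSynced, goneSynced)
	fmt.Println("synced:", ok) // "synced: false", and only because stopCh closed
}
```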

Member: Shall we plumb a flag into the reflector to instruct it not to retry listing if the error is a 404?

Member: I don't know... requires more thought. @ironcladlou can you open an issue to track this to make sure it stays high on our radar?
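
For concreteness, one purely hypothetical shape the floated reflector change could take; none of these names exist in client-go, and the thread above deliberately leaves the design open. The idea is just that a NotFound on list marks the resource type itself as gone, so the reflector stops instead of retrying forever:

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// listOnce stands in for a single reflector list attempt against one resource.
type listOnce func() error

// giveUpOnNotFound is a hypothetical decorator: a 404 means the resource type
// is no longer served, so signal a permanent stop rather than retrying.
func giveUpOnNotFound(list listOnce, stop chan<- struct{}) listOnce {
	return func() error {
		err := list()
		if apierrors.IsNotFound(err) {
			close(stop)
		}
		return err
	}
}

func main() {
	stop := make(chan struct{})
	gone := schema.GroupResource{Group: "example.io", Resource: "widgets"}
	list := giveUpOnNotFound(func() error {
		return apierrors.NewNotFound(gone, "") // the API server no longer serves this type
	}, stop)

	_ = list()
	select {
	case <-stop:
		fmt.Println("reflector stopped: resource type no longer served")
	default:
		fmt.Println("reflector would retry")
	}
}
```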

Is there an open issue for this? Based on the logs, I think it is the cause of #60037

@liggitt @ironcladlou I couldn't find the issue tracking this race. I am wondering if something like #61057 will be enough to fix/mitigate this race condition.

Contributor Author: @jennybuckley looks like I neglected to create the follow-up issue for this. I've been slammed the past couple weeks with things unrelated to Kube, but I'll try to take a look at #61057 today. Thank you!

```diff
+	// pass a timeout channel, but we have to consider the implications of
+	// un-pausing the GC with a partially synced graph builder.
 	if !controller.WaitForCacheSync("garbage collector", stopCh, gc.dependencyGraphBuilder.IsSynced) {
 		utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync"))
 		return
 	}
+
+	// Finally, keep track of our new state. Do this after all preceding steps
```
Member: add a V(2) log message here that sync completed... want to be able to pair with the log message from line 183 to know resync completed

Contributor Author: Done

```diff
+	// have succeeded to ensure we'll retry on subsequent syncs if an error
+	// occurred.
+	oldResources = newResources
```
Contributor Author: Although it hasn't been reported anywhere (that I know of), @liggitt noticed this potential bug during the course of reviewing the original patch. If there are no objections, I'd like to bundle the fix in this PR.

```diff
+	glog.V(2).Infof("synced garbage collector")
 	}, period, stopCh)
 }
```
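
The TODO above names one escape hatch: pass a timeout channel. A minimal sketch of that idea, assuming only client-go's exported `cache.WaitForCacheSync` (the `waitWithTimeout` helper and the one-second timeout are illustrative, and the TODO's caveat still applies: returning early means deciding what to do with a partially synced graph builder):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitWithTimeout derives a channel that closes when stopCh closes or when
// timeout elapses, whichever comes first, so the wait cannot block forever
// even if some informer never syncs.
func waitWithTimeout(stopCh <-chan struct{}, timeout time.Duration, synced ...cache.InformerSynced) bool {
	combined := make(chan struct{})
	go func() {
		defer close(combined)
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
	}()
	return cache.WaitForCacheSync(combined, synced...)
}

func main() {
	stopCh := make(chan struct{}) // never closed in this demo
	neverSynced := func() bool { return false }

	if !waitWithTimeout(stopCh, time.Second, neverSynced) {
		fmt.Println("gave up waiting; the GC should stay paused rather than run partially synced")
	}
}
```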
