
Prevent garbage collector from attempting to sync with 0 resources #61201

Merged
merged 2 commits into kubernetes:master from jennybuckley:fix-gc-empty-map Mar 16, 2018

Conversation

@jennybuckley
Contributor

jennybuckley commented Mar 14, 2018

What this PR does / why we need it:
As of #55259 we enabled garbagecollector.GetDeletableResources to return partial discovery results (including an empty set of discovery results).
This had the unintended consequence of allowing the garbage collector to enter a blocked state that can only be fixed by restarting.

From this comment:

  1. The Sync function periodically calls GetDeletableResources

  2. According to the comment above GetDeletableResources ("All discovery errors are considered temporary. Upon encountering any error, GetDeletableResources will log and return any discovered resources it was able to process (which may be none)."), a discovery error means no resources in the cluster are discovered; but instead of failing and returning an error, GetDeletableResources simply logs "garbagecollector.go:601] failed to discover preferred resources: the server was unable to return a response in the time allotted, but may still be processing the request" and returns an empty list of resources.

  3. The Sync function, upon receiving an empty resource list from discovery, detects that the resources have changed and calls resyncMonitors, which calls dependencyGraphBuilder.syncMonitors with map[] as its argument (logged as "garbagecollector.go:189] syncing garbage collector with updated resources from discovery: map[]"). This sets the list of monitors to an empty list, because it thinks there are no resources to monitor.

  4. Lastly, the Sync function calls controller.WaitForCacheSync, which calls cache.WaitForCacheSync, which continually retries garbagecollector.IsSynced until it returns true; but IsSynced always returns false, because len(gb.monitors) is 0.
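The deadlock in step 4 is easy to see in isolation. Below is a minimal, self-contained sketch of the monitor bookkeeping; graphBuilder and isSynced are hypothetical simplifications, not the actual types from k8s.io/kubernetes/pkg/controller/garbagecollector:

```go
package main

import "fmt"

// graphBuilder is a hypothetical stand-in for the dependencyGraphBuilder's
// monitor bookkeeping: one HasSynced func per monitored resource.
type graphBuilder struct {
	monitors map[string]func() bool
}

// isSynced mirrors the behavior described above: with zero monitors it
// reports "not synced", so a WaitForCacheSync loop polling it retries forever.
func (gb *graphBuilder) isSynced() bool {
	if len(gb.monitors) == 0 {
		return false // no monitors is treated as "never synced"
	}
	for _, hasSynced := range gb.monitors {
		if !hasSynced() {
			return false
		}
	}
	return true
}

func main() {
	gb := &graphBuilder{monitors: map[string]func() bool{}}
	// After syncMonitors(map[]) has wiped the monitor list, this stays
	// false forever, and WaitForCacheSync never returns:
	fmt.Println(gb.isSynced()) // false
}
```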

This PR prevents that specific race condition from arising.
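The prevention amounts to a guard at the start of the sync step: treat an empty discovery result as a transient failure and skip the cycle, so the existing monitors are never replaced with an empty set. A minimal sketch with hypothetical simplified types (gvr, shouldResync), not the actual controller code:

```go
package main

import "fmt"

// gvr is a hypothetical stand-in for schema.GroupVersionResource.
type gvr struct{ group, version, resource string }

// shouldResync sketches the guard: an empty discovery result is assumed to be
// a transient discovery failure, so the sync cycle is skipped entirely
// instead of resyncing the monitors down to an empty list.
func shouldResync(discovered map[gvr]struct{}) (bool, string) {
	if len(discovered) == 0 {
		return false, "no resources reported by discovery, skipping garbage collector sync"
	}
	return true, "resyncing monitors with discovered resources"
}

func main() {
	pods := gvr{"", "v1", "pods"}

	// A transient discovery failure yields an empty result set: skip.
	ok, msg := shouldResync(map[gvr]struct{}{})
	fmt.Println(ok, msg)

	// A healthy discovery pass resyncs as before.
	ok, msg = shouldResync(map[gvr]struct{}{pods: {}})
	fmt.Println(ok, msg)
}
```

Skipping leaves the previous (possibly stale) monitor set running, which is strictly better than the alternative: a stale monitor set still collects garbage for known resources, while an empty one blocks the collector forever.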

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60037

Release note:

Fix bug allowing garbage collector to enter a broken state that could only be fixed by restarting the controller-manager.
@jennybuckley


Contributor

jennybuckley commented Mar 14, 2018

/sig api-machinery
/kind bug
/priority critical-urgent

@jennybuckley


Contributor

jennybuckley commented Mar 14, 2018

@lavalamp


Member

lavalamp commented Mar 14, 2018

Looks awesome, and given the failure mode, looks like an equally awesome unit test shouldn't be too hard to add? :)

@jennybuckley


Contributor

jennybuckley commented Mar 14, 2018

@lavalamp I have a unit test running locally which fails before this commit and passes after, but it only works by modifying the code for the monitors so they always say they are synced. Once I figure out how to fake that part, or get them to actually sync, I will add it here.
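The faking described here can be done by making HasSynced injectable. A small self-contained sketch under that assumption; monitor and allSynced are hypothetical names, not the real garbage collector types:

```go
package main

import "fmt"

// monitor is a hypothetical stand-in for the garbage collector's per-resource
// monitor; in the real controller, HasSynced comes from an informer cache.
type monitor struct {
	HasSynced func() bool
}

// allSynced reports whether there is at least one monitor and every monitor
// claims to be synced.
func allSynced(monitors []monitor) bool {
	for _, m := range monitors {
		if !m.HasSynced() {
			return false
		}
	}
	return len(monitors) > 0
}

func main() {
	// In a unit test, injecting a stub that always returns true lets the
	// sync logic run without real informers ever listing or watching.
	fake := monitor{HasSynced: func() bool { return true }}
	fmt.Println(allSynced([]monitor{fake})) // true
}
```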

@k8s-ci-robot k8s-ci-robot added size/L and removed size/XS labels Mar 15, 2018

@jennybuckley


Contributor

jennybuckley commented Mar 16, 2018

Resubmitting with just the test to show that it fails; after that I'll submit the fix to show that it passes.


@lavalamp lavalamp added this to the v1.10 milestone Mar 16, 2018

@lavalamp


Member

lavalamp commented Mar 16, 2018

I am a little worried that the test will be racy due to the time.Sleep()s, but I really want this fix. Maybe you can work on a followup?

/label status/approved-for-milestone

/lgtm
/approved

@k8s-ci-robot


Contributor

k8s-ci-robot commented Mar 16, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jennybuckley, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot


Contributor

k8s-merge-robot commented Mar 16, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jennybuckley @lavalamp

Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help
@k8s-merge-robot


Contributor

k8s-merge-robot commented Mar 16, 2018

Automatic merge from submit-queue (batch tested with PRs 61284, 61119, 61201). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit f8f67da into kubernetes:master Mar 16, 2018

14 checks passed

Submit Queue: Queued to run github e2e tests a second time.
cla/linuxfoundation: jennybuckley authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@janetkuo


Member

janetkuo commented Mar 17, 2018

This is awesome, thanks @jennybuckley! Since this was introduced in 1.9, can we cherry-pick this to 1.9?

@jennybuckley


Contributor

jennybuckley commented Mar 17, 2018

Yes. The bug-introducing PR was also cherry-picked to 1.8, so I think we should cherry-pick this fix to both 1.8 and 1.9.

jpbetz added a commit that referenced this pull request Mar 20, 2018

Merge pull request #61317 from jennybuckley/automated-cherry-pick-of-#…
…61201-upstream-release-1.9

Automated cherry pick of #61201: Prevent garbage collector from attempting to sync with 0 resources

xmudrii added a commit to xmudrii/kubernetes that referenced this pull request Mar 27, 2018

Merge pull request kubernetes#61201 from jennybuckley/fix-gc-empty-map
Automatic merge from submit-queue (batch tested with PRs 61284, 61119, 61201). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Prevent garbage collector from attempting to sync with 0 resources


apiextensions-apiserver: finalization test

k8s-merge-robot added a commit that referenced this pull request Mar 29, 2018

Merge pull request #61318 from jennybuckley/automated-cherry-pick-of-#…
…61201-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #61201: Prevent garbage collector from attempting to sync with 0 resources

Cherry pick of #61201 on release-1.8.

#61201: Prevent garbage collector from attempting to sync with 0 resources