Fix GC sync race condition #56446
Conversation
Remove faulty diff detection logic from the GC sync which leads to a race condition: if the GC's discovery client returns a fully up-to-date view of server resources during the very first GC sync, the sync function will never sync monitors or reset the REST mapper unless discovery changes again. This causes REST mapping to fail for any custom types already present in discovery.
So far no flakes over 300.
Bug was introduced in 3d6d57a.
Only track the last synced resources when all preceding steps have completed to ensure that failures will be correctly retried.
```go
// Finally, keep track of our new state. Do this after all preceding steps
// have succeeded to ensure we'll retry on subsequent syncs if an error
// occurred.
oldResources = newResources
```
Although it hasn't been reported anywhere (that I know of), @liggitt noticed this potential bug during the course of reviewing the original patch. If there are no objections, I'd like to bundle the fix in this PR.
```go
	}

	// Finally, keep track of our new state. Do this after all preceding steps
```
add a V(2) log message here that sync completed... want to be able to pair with the log message from line 183 to know resync completed
Done
```diff
@@ -212,9 +205,19 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 		utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
 		return
 	}
 	// TODO: WaitForCacheSync can block forever during normal operation. Could
```
When will it block forever?
If newResources contains GVKs that are being removed, and they are gone by the time we get here, I think this blocks until the informers for newResources are all synced, which will never happen successfully for the now-missing GVKs:
- add new resource (CRD, add-on APIService, etc)
- GC GetDeletableResources sees new resource, notices something has changed
- delete new resource
- GC resyncMonitors sets up new monitors/informers for newResources, including for the now-gone resource
- GC WaitForCacheSync waits until the informers are all synced, which will never succeed for the now-gone resource
Shall we plumb a flag into the reflector to instruct it to not retry listing if the error is 404?
I don't know... requires more thought. @ironcladlou can you open an issue to track this to make sure it stays high on our radar?
Is there an open issue for this? Based on the logs, I think it is the cause of #60037
@liggitt @ironcladlou I couldn't find the issue tracking this race
I am wondering if something like #61057 will be enough to fix/mitigate this race condition
@jennybuckley looks like I neglected to create the followup issue for this. I've been slammed the past couple weeks with things unrelated to Kube but I'll try to take a look at #61057 today. Thank you!
The patch lgtm. I asked a question regarding the todo.
/lgtm
/status approved-for-milestone
/status in-progress
You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.
😛
/status in-progress
[MILESTONENOTIFIER] Milestone Pull Request: Current @caesarxuchao @ironcladlou @lavalamp @liggitt
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, ironcladlou, liggitt

Associated issue: 56262

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these OWNERS files. You can indicate your approval by writing `/approve` in a comment.

Automatic merge from submit-queue (batch tested with PRs 56446, 56437). If you want to cherry-pick this change to another branch, please follow the instructions here.
Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.

I just noticed that the bug this fixes is also in 1.8. Would it be useful to cherrypick this back to 1.8?
Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Avoid deadlock in gc resync if available resources change during sync

Retry GC sync if waiting for cache sync times out, without unpausing workers. Viewing ignoring whitespace reveals the actual change: https://github.com/kubernetes/kubernetes/pull/64235/files?w=1

xref #61057 #56446 (comment)

```release-note
fixes a potential deadlock in the garbage collection controller
```
Kubernetes-commit: 9fceab1d83fa327fa31fc1ea733483c05d576cb8
Remove faulty diff detection logic from GC sync which leads to a race
condition: If the GC's discovery client is returning a fully up to date
view of server resources during the very first GC sync, the sync
function will never sync monitors or reset the REST mapper unless
discovery changes again. This causes REST mapping to fail for any custom
types already present in discovery.
Fixes #56262.
/cc @liggitt @caesarxuchao