Fix GC sync race condition #56446

Merged
merged 4 commits into from
Nov 28, 2017

Conversation

ironcladlou
Contributor

@ironcladlou ironcladlou commented Nov 27, 2017

Remove faulty diff detection logic from GC sync which leads to a race
condition: If the GC's discovery client is returning a fully up to date
view of server resources during the very first GC sync, the sync
function will never sync monitors or reset the REST mapper unless
discovery changes again. This causes REST mapping to fail for any custom
types already present in discovery.

Fixes #56262.
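
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not the controller's actual code: `resourceSet`, `syncer`, and `runSyncs` are hypothetical names, and it assumes the faulty diff detection amounted to comparing each discovery result against a baseline that was itself seeded from discovery before the first sync.

```go
package main

import (
	"fmt"
	"reflect"
)

// resourceSet is a hypothetical stand-in for the set of deletable resources
// reported by discovery, keyed here by a "group/version/resource" string.
type resourceSet map[string]struct{}

// syncer is a toy model of the sync loop's state, not the real controller.
type syncer struct {
	lastSynced resourceSet // view the monitors were last synced against
	syncCount  int         // how many times monitors were actually resynced
}

// syncOnce resyncs only when discovery differs from the last synced view.
func (s *syncer) syncOnce(discovered resourceSet) {
	if reflect.DeepEqual(s.lastSynced, discovered) {
		return // no diff detected: monitors and REST mapper are left untouched
	}
	s.syncCount++
	s.lastSynced = discovered
}

// runSyncs runs the loop for a fixed number of rounds against an unchanging
// discovery view and reports how many resyncs happened.
func runSyncs(initial, discovered resourceSet, rounds int) int {
	s := &syncer{lastSynced: initial}
	for i := 0; i < rounds; i++ {
		s.syncOnce(discovered)
	}
	return s.syncCount
}

func main() {
	discovered := resourceSet{"example.com/v1/widgets": {}}

	// Assumed faulty variant: the baseline already matches an up-to-date
	// discovery view, so the first sync is skipped and never happens.
	seeded := runSyncs(resourceSet{"example.com/v1/widgets": {}}, discovered, 3)

	// Empty baseline: the first round always sees a diff and syncs once.
	empty := runSyncs(resourceSet{}, discovered, 3)

	fmt.Println(seeded, empty) // prints: 0 1
}
```

With the seeded baseline, a custom type that is already present in discovery never gets a monitor in this model, which matches the symptom reported in the description.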

NONE

/cc @liggitt @caesarxuchao

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 27, 2017
@ironcladlou
Contributor Author

So far no flakes over 300 stress runs.

@ironcladlou
Contributor Author

Bug was introduced in 3d6d57a

Only track the last synced resources when all preceding steps have
completed to ensure that failures will be correctly retried.
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 27, 2017
// Finally, keep track of our new state. Do this after all preceding steps
// have succeeded to ensure we'll retry on subsequent syncs if an error
// occurred.
oldResources = newResources
Contributor Author

Although it hasn't been reported anywhere (that I know of), @liggitt noticed this potential bug during the course of reviewing the original patch. If there are no objections, I'd like to bundle the fix in this PR.
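
The ordering concern behind this second fix can be sketched in isolation. The following is a simplified model under stated assumptions, not the GC code itself: `tracker`, its `sync` method, and `failFirst` are hypothetical, and a plain string stands in for the discovered resource view. The point it shows is that recording the new state only after every step succeeds leaves the diff in place, so a failed sync is retried on the next pass.

```go
package main

import (
	"errors"
	"fmt"
)

// tracker is a toy stand-in for the sync loop's bookkeeping.
type tracker struct {
	lastSynced string // identifier of the last view that was fully synced
	attempts   int    // number of sync attempts that got past the diff check
}

// sync runs every step and records newView only after all of them succeed.
func (t *tracker) sync(newView string, steps ...func() error) error {
	if t.lastSynced == newView {
		return nil // nothing changed since the last successful sync
	}
	t.attempts++
	for _, step := range steps {
		if err := step(); err != nil {
			// lastSynced is left untouched, so the next call still sees
			// a difference and tries again.
			return err
		}
	}
	// Finally, keep track of the new state.
	t.lastSynced = newView
	return nil
}

// failFirst returns a step that fails on its first n calls, then succeeds.
func failFirst(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return errors.New("failed to sync resource monitors")
		}
		return nil
	}
}

func main() {
	t := &tracker{}
	step := failFirst(1)

	fmt.Println(t.sync("v2", step))       // first attempt fails
	fmt.Println(t.sync("v2", step))       // same view is retried and succeeds
	fmt.Println(t.attempts, t.lastSynced) // prints: 2 v2
}
```

Had `lastSynced` been assigned before the steps ran, the second call in this model would have returned early and the failure would never have been retried.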

}

// Finally, keep track of our new state. Do this after all preceding steps
Member

add a V(2) log message here that sync completed... want to be able to pair with the log message from line 183 to know resync completed

Contributor Author

Done

@@ -212,9 +205,19 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
return
}
// TODO: WaitForCacheSync can block forever during normal operation. Could
Member

When will it block forever?

Member

If newResources contains GVKs that are being removed and are gone by the time we get here, I think WaitForCacheSync blocks until the informers for newResources are all synced, which will never happen for the now-missing GVKs.

  1. add new resource (CRD, add-on APIService, etc)
  2. GC GetDeletableResources sees new resource, notices something has changed
  3. delete new resource
  4. GC resyncMonitors sets up new monitors/informers for newResources, including for the now-gone resource
  5. GC WaitForCacheSync waits until the informers are all synced, which will never succeed for the now-gone resource
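
The sequence above can be modelled without client-go to show why an unbounded wait hangs and how a bounded wait would let the caller retry. This is only a sketch: `syncedFunc`, `waitForSync`, and `waitWithTimeout` are hypothetical helpers written for illustration, and the polling interval and timeout are arbitrary values, not ones taken from the garbage collector.

```go
package main

import (
	"fmt"
	"time"
)

// syncedFunc reports whether one informer's cache has completed its
// initial list (hypothetical stand-in for an informer's sync check).
type syncedFunc func() bool

// waitForSync polls until every func reports synced, or until stop is
// closed. If stop is never closed and one func never reports synced,
// this call never returns.
func waitForSync(stop <-chan struct{}, poll time.Duration, synced ...syncedFunc) bool {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		all := true
		for _, s := range synced {
			if !s() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		select {
		case <-stop:
			return false
		case <-ticker.C:
		}
	}
}

// waitWithTimeout bounds the wait by closing the stop channel after
// timeout, so the caller gets false back and can attempt a fresh sync.
func waitWithTimeout(timeout, poll time.Duration, synced ...syncedFunc) bool {
	stop := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stop) })
	defer timer.Stop()
	return waitForSync(stop, poll, synced...)
}

func main() {
	healthy := func() bool { return true }
	// Models step 5: the informer for a resource that was deleted after
	// discovery saw it, so its list never succeeds.
	gone := func() bool { return false }

	fmt.Println(waitWithTimeout(50*time.Millisecond, 5*time.Millisecond, healthy))
	fmt.Println(waitWithTimeout(50*time.Millisecond, 5*time.Millisecond, healthy, gone))
}
```

The first call returns true immediately; the second returns false once the timeout elapses instead of blocking, which is the property a retry loop around the sync would need.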

Member

Shall we plumb a flag into the reflector to instruct it to not retry listing if the error is 404?

Member

I don't know... requires more thought. @ironcladlou can you open an issue to track this to make sure it stays high on our radar?

Is there an open issue for this? Based on the logs, I think it is the cause of #60037

@liggitt @ironcladlou I couldn't find the issue tracking this race

I am wondering if something like #61057 will be enough to fix/mitigate this race condition

Contributor Author

@jennybuckley looks like I neglected to create the followup issue for this. I've been slammed the past couple weeks with things unrelated to Kube but I'll try to take a look at #61057 today. Thank you!

@caesarxuchao
Member

The patch lgtm. I asked a question regarding the todo.

@liggitt liggitt added this to the v1.9 milestone Nov 28, 2017
@liggitt liggitt added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. queue/fix sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Nov 28, 2017
@liggitt
Member

liggitt commented Nov 28, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2017
@spiffxp
Member

spiffxp commented Nov 28, 2017

/status approved-for-milestone
getting rid of the flakiest PR flake? sure

@ironcladlou
Contributor Author

/status in-progress

@k8s-ci-robot
Contributor

You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.

@ironcladlou
Contributor Author

ironcladlou commented Nov 28, 2017

> You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.

😛

@liggitt
Member

liggitt commented Nov 28, 2017

/status in-progress

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Current

@caesarxuchao @ironcladlou @lavalamp @liggitt

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@caesarxuchao
Member

/approve

@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, ironcladlou, liggitt

Associated issue: 56262

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 56446, 56437). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@jennybuckley

I just noticed that the bug this fixes is also in 1.8. Would it be useful to cherry-pick this back to 1.8?

k8s-github-robot pushed a commit that referenced this pull request Jun 5, 2018
Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Avoid deadlock in gc resync if available resources change during sync

retry GC sync if waiting for cache sync times out, without unpausing workers

viewing ignoring whitespace reveals the actual change:
https://github.com/kubernetes/kubernetes/pull/64235/files?w=1

xref #61057 #56446 (comment)

```release-note
fixes a potential deadlock in the garbage collection controller
```
k8s-publishing-bot added a commit to kubernetes/apimachinery that referenced this pull request Jun 5, 2018
Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Avoid deadlock in gc resync if available resources change during sync

retry GC sync if waiting for cache sync times out, without unpausing workers

viewing ignoring whitespace reveals the actual change:
https://github.com/kubernetes/kubernetes/pull/64235/files?w=1

xref kubernetes/kubernetes#61057 kubernetes/kubernetes#56446 (comment)

```release-note
fixes a potential deadlock in the garbage collection controller
```

Kubernetes-commit: 9fceab1d83fa327fa31fc1ea733483c05d576cb8