Fix GC sync race condition #56446
Conversation
Remove faulty diff detection logic from the GC sync which leads to a race condition: if the GC's discovery client returns a fully up-to-date view of server resources during the very first GC sync, the sync function will never sync monitors or reset the REST mapper unless discovery changes again. This causes REST mapping to fail for any custom types already present in discovery.
So far no flakes over 300.
Bug was introduced in 3d6d57a.
Only track the last synced resources when all preceding steps have completed to ensure that failures will be correctly retried.
```go
// Finally, keep track of our new state. Do this after all preceding steps
// have succeeded to ensure we'll retry on subsequent syncs if an error
// occurred.
oldResources = newResources
```
Although it hasn't been reported anywhere (that I know of), @liggitt noticed this potential bug during the course of reviewing the original patch. If there are no objections, I'd like to bundle the fix in this PR.
```go
	}

	// Finally, keep track of our new state. Do this after all preceding steps
```
add a V(2) log message here that sync completed... want to be able to pair with the log message from line 183 to know resync completed
Done
```diff
@@ -212,9 +205,19 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.DiscoveryInterface, p
 		utilruntime.HandleError(fmt.Errorf("failed to sync resource monitors: %v", err))
 		return
 	}
 	// TODO: WaitForCacheSync can block forever during normal operation. Could
```
When will it block forever?
If newResources contains GVKs that are being removed, and they are gone by the time we get here, I think this blocks until the informers for newResources are all synced, which will never happen successfully for the now-missing GVKs:
- add new resource (CRD, add-on APIService, etc)
- GC GetDeletableResources sees new resource, notices something has changed
- delete new resource
- GC resyncMonitors sets up new monitors/informers for newResources, including for the now-gone resource
- GC WaitForCacheSync waits until the informers are all synced, which will never succeed for the now-gone resource
Shall we plumb a flag into the reflector to instruct it to not retry listing if the error is 404?
I don't know... requires more thought. @ironcladlou can you open an issue to track this to make sure it stays high on our radar?
Is there an open issue for this? Based on the logs, I think it is the cause of #60037
@liggitt @ironcladlou I couldn't find the issue tracking this race
I am wondering if something like #61057 will be enough to fix/mitigate this race condition
@jennybuckley looks like I neglected to create the followup issue for this. I've been slammed the past couple weeks with things unrelated to Kube but I'll try to take a look at #61057 today. Thank you!
The patch lgtm. I asked a question regarding the todo.
/lgtm
/status approved-for-milestone
/status in-progress
You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to add status labels.
😛
/status in-progress
[MILESTONENOTIFIER] Milestone Pull Request: Current @caesarxuchao @ironcladlou @lavalamp @liggitt
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, ironcladlou, liggitt

Associated issue: 56262

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these OWNERS files. You can indicate your approval by writing `/approve` in a comment.

Automatic merge from submit-queue (batch tested with PRs 56446, 56437). If you want to cherry-pick this change to another branch, please follow the instructions here.
Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.

I just noticed that the bug this fixes is also in 1.8. Would it be useful to cherrypick this back to 1.8?
Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Avoid deadlock in gc resync if available resources change during sync

Retry GC sync if waiting for cache sync times out, without unpausing workers. Viewing ignoring whitespace reveals the actual change: https://github.com/kubernetes/kubernetes/pull/64235/files?w=1

xref #61057 #56446 (comment)

```release-note
fixes a potential deadlock in the garbage collection controller
```
Kubernetes-commit: 9fceab1d83fa327fa31fc1ea733483c05d576cb8
Remove faulty diff detection logic from GC sync which leads to a race
condition: If the GC's discovery client is returning a fully up to date
view of server resources during the very first GC sync, the sync
function will never sync monitors or reset the REST mapper unless
discovery changes again. This causes REST mapping to fail for any custom
types already present in discovery.
Fixes #56262.
/cc @liggitt @caesarxuchao