ReplicaSetController can miss handling the deletion of a ReplicaSet #69376
Comments
/kind flake
/remove-kind failing-test
I was about to file this. I spent yesterday triaging it and have determined the root cause. The issue is that the `ReplicaSetController` can "miss" handling the deletion of a `ReplicaSet` if things happen quickly enough. Here's the flow:

1. `ReplicaSetController` sees the new rs and starts working on creating pods
2. `ReplicaSetController`'s `rsInformer` sees the rs deletion and calls `rsc.enqueueReplicaSet`, which adds the namespace/name of the rs to the work queue
3. `ReplicaSetController`'s `rsInformer` sees the rs addition and calls `rsc.enqueueReplicaSet`, which adds the namespace/name of the rs to the work queue (again)
4. `ReplicaSetController`'s sync handler processes the entry from the queue
5. When `syncReplicaSet` calls the `rsLister` to get the rs, it's there (it's the 2nd one)

This is a timing issue. The `ReplicaSetController` doesn't check the rs's UID, and if the order of operations is "just right", the controller's sync handler won't "see" the deletion, so it never calls `rsc.expectations.DeleteExpectations(key)` to reflect that the rs was deleted.

/sig apps
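To make the race concrete, here is a minimal, self-contained sketch (illustrative only; the `event` type and the map standing in for the workqueue are invented for this example, not controller code) of why a namespace/name key cannot distinguish a delete-then-recreate from a plain add:

```go
package main

import "fmt"

// event mimics the informer notifications the controller receives.
type event struct {
	kind string // "add" or "delete"
	key  string // "$namespace/$name" -- what enqueueReplicaSet uses
	uid  string // identifies the concrete object; dropped by the key
}

func main() {
	events := []event{
		{"add", "default/rs", "uid-1"},    // rs1 created
		{"delete", "default/rs", "uid-1"}, // rs1 deleted
		{"add", "default/rs", "uid-2"},    // rs2 created with the same name
	}

	// A set mimicking the dedup behavior of client-go's workqueue:
	// enqueueing the same key several times before it is processed
	// yields a single work item.
	queue := map[string]bool{}
	for _, e := range events {
		queue[e.key] = true
	}
	fmt.Println(len(queue), "work item(s) for 3 events") // prints: 1 work item(s) for 3 events

	// By the time the sync handler pops that one item, the lister
	// already holds rs2, so the deletion of rs1 (uid-1) is never seen.
}
```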
For a bit more clarity: when the rs informer sees a new rs, it calls `rsc.enqueueReplicaSet`, and when the rs informer sees a deleted rs, it also calls `rsc.enqueueReplicaSet`. The net effect is that both events put the same `$namespace/$name` key on the work queue. Once a key (`$namespace/$name`) is in the queue, `syncReplicaSet` looks the rs up in the lister and only clears the expectations if the lister reports it as not found (kubernetes/pkg/controller/replicaset/replica_set.go, lines 582 to 587 in 7bc48ba). The problem here is that there's not necessarily a guarantee that the sync handler processes the deletion before the same-named rs is re-created. When the test fails, the delete and the re-create are both observed before the sync handler runs, so the lookup returns the second rs and the first one's expectations are never deleted. I confirmed this by adding some additional print statements to the controller.
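For reference, the region that comment links to looks roughly like this (a paraphrase of that era of `replica_set.go`, not a verbatim copy of commit 7bc48ba; `rsc`, `namespace`, `name`, and `key` are the sync handler's locals):

```go
// Assumed import: errors "k8s.io/apimachinery/pkg/api/errors".
rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
if errors.IsNotFound(err) {
	// This is the only place the stale expectations get cleared. If a
	// same-name rs was re-created before this sync ran, Get succeeds
	// and this branch never executes.
	rsc.expectations.DeleteExpectations(key)
	return nil
}
```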
/milestone v1.13

Yeah, this is definitely becoming more of a problem: the daily failure rate for pull-integration-test is over 50%. It shows up on triage, but I can't point to when it started becoming more flaky: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration
/priority critical-urgent

@kubernetes/sig-apps-bugs do you think the controller should change to differentiate between two ReplicaSets with the same name but different UIDs?

The flaky test is actually testing kubectl, so an option is to split this into two issues: we can fix the test to avoid this problem, which would fix the flakiness for people working on other things, and then separately come up with a fix for the issue in the ReplicaSet controller. I created a PR that updates the test to avoid creating ReplicaSets with the same name: #69739
Alternatively, perhaps we could fix this by adding the UID to the key used in the DeltaFIFO?
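For illustration, a UID-aware key function could look something like this (purely hypothetical; `uidAwareKeyFunc` is not an existing helper, and DeltaFIFO today keys on namespace/name via `cache.MetaNamespaceKeyFunc`):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// uidAwareKeyFunc is a hypothetical alternative key function that
// appends the object's UID, so a deleted rs and its same-name
// replacement map to different keys instead of colliding.
func uidAwareKeyFunc(obj interface{}) (string, error) {
	m, err := meta.Accessor(obj)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s/%s/%s", m.GetNamespace(), m.GetName(), m.GetUID()), nil
}

func main() {
	rs := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{
		Namespace: "default", Name: "rs", UID: types.UID("uid-2"),
	}}
	key, _ := uidAwareKeyFunc(rs)
	fmt.Println(key) // default/rs/uid-2
}
```

The trade-off is that every consumer of the key would have to tolerate the new format, which is presumably why this was floated as an alternative rather than an obvious fix.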
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration#3e7dba9374ac0dbc9c11 @kow3ns so flakes are definitely down now that we're working around this (thanks @mortent!), but do we have an issue for the "proper" fix @lavalamp proposed?

/assign janetkuo

We should get a fix in for the next release.

As per @kow3ns' comment, moving this to 1.14. /milestone v1.14
When the ReplicaSet controller fetches a fresh version of the ReplicaSet from the server, it validates that the UID of the fresh copy matches the one in the cache. If the UIDs do not match, the race condition outlined in this issue must have occurred, and the controller should interpret the call to syncReplicaSet as a delete. This also reverts the merge that temporarily changed the test that would occasionally fail as a result of this bug.
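The shape of that check might be something like the following (a hedged sketch, not the actual contents of #72927, written against the modern client-go `Get` signature; `ctx`, `rs`, and `key` are assumed locals of `syncReplicaSet`):

```go
// Assumed imports: apierrors "k8s.io/apimachinery/pkg/api/errors",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
fresh, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).Get(ctx, rs.Name, metav1.GetOptions{})
switch {
case apierrors.IsNotFound(err):
	// Gone from the server too: a plain deletion.
	rsc.expectations.DeleteExpectations(key)
	return nil
case err != nil:
	return err
case fresh.UID != rs.UID:
	// Same namespace/name but a different object: the cached rs was
	// deleted and re-created, so treat this sync as the deletion.
	rsc.expectations.DeleteExpectations(key)
	return nil
}
```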
(tracking update: in progress in #72927)

Hi! We are starting the code freeze for 1.15 tomorrow EOD. Just checking in to see if this issue is still planned for the 1.15 cycle.

I think #72927 is pretty close; I left a few comments.

/milestone v1.16

Hello! I'm part of the bug triage team for the 1.16 release cycle. Since this issue is tagged for 1.16 but hasn't been updated in a long time, I'd like to check its status. The code freeze starts on August 29th (about 1.5 weeks from now), which means a PR should be ready (and merged) by then. Do we still target this issue to be fixed for 1.16? If not, please re-tag the issue to the planned milestone.

A change this low-level needs more soak time. Moving to 1.17. /milestone v1.17
I think fixing the handlers and handling expectations in them will fix this issue: #82572. /assign
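As I understand that direction, the delete handler would clear the expectations itself rather than deferring to the sync loop's cache lookup. A rough sketch under that assumption (not the code in #82572; the `cache.DeletedFinalStateUnknown` tombstone handling is elided):

```go
// deleteRS sketches a delete handler that treats the delete event as
// authoritative, so a same-name re-creation can no longer hide the
// deletion from the syncer.
func (rsc *ReplicaSetController) deleteRS(obj interface{}) {
	rs, ok := obj.(*apps.ReplicaSet)
	if !ok {
		return // real code must unwrap cache.DeletedFinalStateUnknown here
	}
	key, err := controller.KeyFunc(rs)
	if err != nil {
		return
	}
	// Drop the expectations now, while we still know this is a delete.
	rsc.expectations.DeleteExpectations(key)
	rsc.queue.Add(key)
}
```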
Bug triage for 1.17 here. This issue has been open for a significant amount of time, and since it is tagged for milestone 1.17, we want to let you know that the 1.17 code freeze is coming in less than one month, on Nov. 14th. Will this issue be resolved before then?

I hope it will; PR #82572 is there, just a matter of getting a review/tag.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
@ncdc's description hoisted from #69376 (comment):
I was about to file this. I spent yesterday triaging it and have determined the root cause. The issue is that the `ReplicaSetController` can "miss" handling the deletion of a `ReplicaSet` if things happen quickly enough. Here's the flow:

1. `ReplicaSetController` sees the new rs and starts working on creating pods
2. `ReplicaSetController`'s `rsInformer` sees the rs deletion and calls `rsc.enqueueReplicaSet`, which adds the namespace/name of the rs to the work queue
3. `ReplicaSetController`'s `rsInformer` sees the rs addition and calls `rsc.enqueueReplicaSet`, which adds the namespace/name of the rs to the work queue (again)
4. `ReplicaSetController`'s sync handler processes the entry from the queue
5. When `syncReplicaSet` calls the `rsLister` to get the rs, it's there (it's the 2nd one)

This is a timing issue. The `ReplicaSetController` doesn't check the rs's UID, and if the order of operations is "just right", the controller's sync handler won't "see" the deletion, so it never calls `rsc.expectations.DeleteExpectations(key)` to reflect that the rs was deleted.

/sig apps
Original description follows:
In #69344 I had to re-run the integration test 4 times to get it to work.
Log outputs: