
ReplicaSetController can miss handling the deletion of a ReplicaSet #69376

Closed

zegl opened this issue Oct 3, 2018 · 29 comments · Fixed by #69739 or #82572
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@zegl
Contributor

zegl commented Oct 3, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

@ncdc's description hoisted from #69376 (comment):

I was about to file this. I spent yesterday triaging it and have determined the root cause. The issue is that the ReplicaSetController can "miss" handling the deletion of a ReplicaSet if things happen quickly enough. Here's the flow:

  1. Client creates rs
  2. ReplicaSetController sees new rs, starts working on creating pods
  3. Client deletes rs
  4. ReplicaSetController's rsInformer sees the rs deletion and calls rsc.enqueueReplicaSet, which adds the namespace/name of the rs to the work queue
  5. Client recreates rs with the exact same name as before
  6. ReplicaSetController's rsInformer sees the rs addition and calls rsc.enqueueReplicaSet, which adds the namespace/name of the rs to the work queue (again)
  7. ReplicaSetController's sync handler processes the entry from the queue
  8. Because the rs was recreated with the same name, when syncReplicaSet calls rsLister to get the rs, it's there (it's the 2nd one)

This is a timing issue. The ReplicaSetController doesn't check the rs's UID, and if the order of operations is "just right", the controller's sync handler won't "see" the deletion, so it never calls rsc.expectations.DeleteExpectations(key) to reflect that the rs was deleted.
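To make that concrete, here is a minimal standalone sketch (not code from this issue) showing that the work-queue key carries only namespace and name, so a deleted ReplicaSet and a same-named replacement are indistinguishable to the sync handler. It uses client-go's cache.MetaNamespaceKeyFunc, the same namespace/name keying scheme the controllers build on:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Two distinct objects: the ReplicaSet that was deleted and its same-named replacement.
	oldRS := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "111"}}
	newRS := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "222"}}

	oldKey, _ := cache.MetaNamespaceKeyFunc(oldRS)
	newKey, _ := cache.MetaNamespaceKeyFunc(newRS)

	// Both keys are "default/rs": the UID is not part of the key, so the sync
	// handler has no way to tell the recreated rs apart from the deleted one.
	fmt.Println(oldKey, newKey, oldKey == newKey)
}
```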

/sig apps

Original description follows:

In #69344 I had to re-run the integration test 4 times to get it to work.

Log outputs:

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 3, 2018
@neolit123
Member

neolit123 commented Oct 4, 2018

/kind flake
/sig testing

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/testing Categorizes an issue or PR as relevant to SIG Testing. kind/flake Categorizes issue or PR as related to a flaky test. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 4, 2018
@neolit123
Member

/remove-kind failing-test

@k8s-ci-robot k8s-ci-robot removed the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 4, 2018
@ncdc ncdc changed the title from "pull-integration-test is flakey" to "pull-integration-test (run_rs_tests) is flakey" Oct 5, 2018
@ncdc
Member

ncdc commented Oct 5, 2018

I was about to file this. I spent yesterday triaging it and have determined the root cause. The issue is that the ReplicaSetController can "miss" handling the deletion of a ReplicaSet if things happen quickly enough. Here's the flow:

  1. Client creates rs
  2. ReplicaSetController sees new rs, starts working on creating pods
  3. Client deletes rs
  4. ReplicaSetController's rsInformer sees the rs deletion and calls rsc.enqueueReplicaSet, which adds the namespace/name of the rs to the work queue
  5. Client recreates rs with the exact same name as before
  6. ReplicaSetController's rsInformer sees the rs addition and calls rsc.enqueueReplicaSet, which adds the namespace/name of the rs to the work queue (again)
  7. ReplicaSetController's sync handler processes the entry from the queue
  8. Because the rs was recreated with the same name, when syncReplicaSet calls rsLister to get the rs, it's there (it's the 2nd one)

This is a timing issue. The ReplicaSetController doesn't check the rs's UID, and if the order of operations is "just right", the controller's sync handler won't "see" the deletion, so it never calls rsc.expectations.DeleteExpectations(key) to reflect that the rs was deleted.

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Oct 5, 2018
@ncdc
Member

ncdc commented Oct 5, 2018

For a bit more clarity:

When the rs informer sees a new rs, it calls

```go
AddFunc: rsc.enqueueReplicaSet,
```

When the rs informer sees a deleted rs, it calls

```go
DeleteFunc: rsc.enqueueReplicaSet,
```

The net effect is that an add and a delete of the same ReplicaSet both enqueue the same namespace/name key.

Once a key ($namespace/$name) is in the queue, syncReplicaSet pops it off and then does

```go
rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
if errors.IsNotFound(err) {
	glog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
	rsc.expectations.DeleteExpectations(key)
	return nil
}
```

The problem is that there's no guarantee that the order of operations is always like this:

  1. AddFunc -> enqueueReplicaSet -> pop -> rsLister.Get() -> normal processing
  2. DeleteFunc -> enqueueReplicaSet -> pop -> rsLister.Get() -> not found -> delete expectations
  3. AddFunc -> enqueueReplicaSet -> pop -> rsLister.Get() -> normal processing

When the test fails, this is what happens:

  1. AddFunc -> enqueueReplicaSet -> pop -> rsLister.Get() -> normal processing
  2. DeleteFunc -> enqueueReplicaSet (note the change here - no pop!)
  3. AddFunc -> enqueueReplicaSet -> pop -> rsLister.Get() -> normal processing

I confirmed this by adding some additional print statements to AddFunc and DeleteFunc. The DeleteFunc is always called at the "right" time. But sometimes syncReplicaSet isn't scheduled fast enough, so the new rs is already visible by the time syncReplicaSet is executed.
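The "no pop" case falls out of how the work queue deduplicates keys that have not been processed yet. A minimal standalone sketch of that behaviour, using client-go's workqueue package directly rather than the controller's own queue (the key string here is illustrative):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	key := "default/my-rs" // same namespace/name key for the delete and the re-add

	q.Add(key) // DeleteFunc -> enqueueReplicaSet
	q.Add(key) // AddFunc for the recreated rs -> enqueueReplicaSet

	// Items that have not been picked up yet are deduplicated, so only one
	// sync will run for both events.
	fmt.Println("queue length:", q.Len()) // 1

	item, _ := q.Get()
	q.Done(item)
}
```

Because the delete and the re-add collapse into a single work item, the one sync that does run already finds the recreated rs in the lister.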

@spiffxp
Member

spiffxp commented Oct 9, 2018

/milestone v1.13
FYI @jberkus

Yeah, this is definitely becoming more of a problem; the daily failure rate for pull-integration-test is over 50%.

It shows up on triage, but I can't point to when it started becoming more flaky: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Oct 9, 2018
@spiffxp
Member

spiffxp commented Oct 9, 2018

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 9, 2018
@ncdc
Member

ncdc commented Oct 11, 2018

@kubernetes/sig-apps-bugs do you think the controller should be changed to differentiate between two ReplicaSets with the same name but different UIDs?

@mattfarina
Contributor

@mortent
Member

mortent commented Oct 12, 2018

The flaky test is actually testing kubectl, so an option is to split this into two issues. We can fix the test to avoid this problem, which would remove the flakiness for people working on other things, and then separately come up with a fix for the issue in the ReplicaSet controller.

I created a PR that updates the test to avoid creating replicasets with the same name: #69739

@lavalamp
Member

The replica set controller needs a fix. Thanks for the clear description, @ncdc.

/assign @kow3ns

@lavalamp
Member

Alternatively, perhaps we could fix this by adding UID to the key used in the delta fifo?
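For illustration only, a UID-qualified key might look like the hypothetical helper below; uidScopedKey is not an existing client-go function, just a sketch of the idea:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// uidScopedKey is a hypothetical key function: it folds the UID into the usual
// namespace/name key, so a deleted object and a same-named replacement can no
// longer collapse into a single work item.
func uidScopedKey(obj metav1.Object) string {
	return obj.GetNamespace() + "/" + obj.GetName() + "/" + string(obj.GetUID())
}

func main() {
	oldRS := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "111"}}
	newRS := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "222"}}
	fmt.Println(uidScopedKey(oldRS)) // default/rs/111
	fmt.Println(uidScopedKey(newRS)) // default/rs/222
}
```

With keys like these, the delete of the old rs and the add of its replacement would enqueue distinct work items, though everything that parses keys would then also need to understand the UID suffix.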

@spiffxp
Member

spiffxp commented Oct 17, 2018

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration#3e7dba9374ac0dbc9c11 @kow3ns so flakes are definitely down now that we're working around this (thanks @mortent!), but do we have an issue for the "proper" fix @lavalamp proposed?

@kow3ns
Member

kow3ns commented Nov 14, 2018

/assign janetkuo

@kow3ns
Member

kow3ns commented Nov 14, 2018

We should get a fix in for the next release

@AishSundar
Contributor

As per @kow3ns' comment, moving this to 1.14

/milestone v1.14

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.13, v1.14 Nov 16, 2018
runyontr added a commit to runyontr/kubernetes that referenced this issue Jan 15, 2019
When the ReplicaSet controller fetches a fresh version of the ReplicaSet from the server, it validates that the UID of the fresh copy matches the one in the cache. If the UIDs do not match, the race condition outlined in this issue must have occurred, and the controller should interpret the call to syncReplicaSet as a delete.

Reverts the merge that temporarily fixed the test that would occasionally fail as a result of this bug.
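A rough sketch of the idea in that commit message, with illustrative names (isSameInstance, cachedUID) rather than the actual code from PR #72927: compare the UID of the freshly fetched ReplicaSet against the UID the controller has been tracking, and treat a mismatch as a deletion of the old object.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// isSameInstance reports whether the freshly listed ReplicaSet is the same object
// the controller has been tracking; a UID mismatch means the tracked rs was
// deleted and a new one with the same name was created in between.
func isSameInstance(fresh *appsv1.ReplicaSet, cachedUID types.UID) bool {
	return fresh.UID == cachedUID
}

func main() {
	cachedUID := types.UID("111") // UID of the rs the controller created pods for
	recreated := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "222"}}

	if !isSameInstance(recreated, cachedUID) {
		// Treat this sync as the deletion of the old rs: its expectations should
		// be cleared before the new object is handled.
		fmt.Println("UID mismatch: the tracked rs was deleted and recreated")
	}
}
```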
@nikopen
Contributor

nikopen commented Mar 1, 2019

(tracking update - in progress #72927)

@nikopen
Contributor

nikopen commented Mar 11, 2019

This is an important bug but the fix in #72927 still needs work @runyontr

Shifting milestones
/milestone v1.15

@kow3ns @spiffxp

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.14, v1.15 Mar 11, 2019
@timmycarr

Hi! We are starting the code freeze for 1.15 tomorrow EOD. Just checking in to see if this issue is still planned for the 1.15 cycle?

@lavalamp
Member

I think #72927 is pretty close; I left a few comments.

@soggiest

soggiest commented Jun 3, 2019

/milestone v1.16

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.15, v1.16 Jun 3, 2019
@Pothulapati
Contributor

Hello! I'm part of the bug triage team for the 1.16 release cycle, and since this issue is tagged for 1.16 but hasn't been updated in a long time, I'd like to check its status. The code freeze starts on August 29th (about 1.5 weeks from now), which means there should be a PR ready (and merged) by then.

Do we still intend to fix this issue in 1.16? If not, please re-tag the issue with the planned milestone.

@xmudrii
Member

xmudrii commented Sep 5, 2019

@lavalamp @liggitt @janetkuo This issue and the relevant PR have been open for a long time now. Should we move this to the next milestone or should we remove it from the milestone entirely?

@liggitt
Member

liggitt commented Sep 5, 2019

A change this low-level needs more soak time. Moving to 1.17

/milestone v1.17

@tnozicka
Contributor

tnozicka commented Sep 12, 2019

I think fixing the handlers and handling expectations in them will fix this issue: #82572

/assign
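A minimal sketch of the direction #82572 describes, assuming a simplified expectations store and enqueue function rather than the real controller types (this is not the code in #82572): clear the expectations inside the delete handler itself, so the cleanup no longer depends on the sync handler observing a not-found ReplicaSet.

```go
package main

import (
	"fmt"
	"sync"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// expectations is a simplified stand-in for the controller's expectations store,
// keyed by namespace/name.
type expectations struct {
	mu   sync.Mutex
	data map[string]struct{}
}

func (e *expectations) Delete(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.data, key)
}

// onDelete builds a DeleteFunc-style handler that clears expectations as soon as
// the deletion event is observed, instead of waiting for the sync handler to
// notice that the object is gone.
func onDelete(exp *expectations, enqueue func(string)) func(obj interface{}) {
	return func(obj interface{}) {
		// Deletions may arrive as tombstones when the watch missed the final state.
		if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			obj = tombstone.Obj
		}
		key, err := cache.MetaNamespaceKeyFunc(obj)
		if err != nil {
			return
		}
		exp.Delete(key)
		enqueue(key)
	}
}

func main() {
	exp := &expectations{data: map[string]struct{}{"default/rs": {}}}
	deleteFn := onDelete(exp, func(key string) { fmt.Println("enqueued", key) })

	// Simulate the informer delivering the deletion of default/rs.
	rs := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "rs", UID: "111"}}
	deleteFn(rs)
	fmt.Println("expectations left:", len(exp.data)) // 0
}
```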

@josiahbjorgaard
Contributor

josiahbjorgaard commented Oct 20, 2019

Bug triage for 1.17 here. This issue has been open for a significant amount of time, and since it is tagged for the 1.17 milestone, we want to let you know that the 1.17 code freeze is coming in less than one month, on Nov. 14th. Will this issue be resolved before then?

@tnozicka
Contributor

I hope it will; PR #82572 is there, it's just a matter of getting a review/tag.
