Controller manager restart default initialization failure. #10030

Closed
timothysc opened this Issue Jun 18, 2015 · 27 comments

timothysc commented Jun 18, 2015

While validating HA fail-over on the density test, we verified that the controller manager appears to act in an uninitialized fashion, i.e. it does not gather complete state before acting. It over-corrects, which causes a fair amount of unnecessary churn on the system.

It eventually recovers and converges, but it fails density tests by overshooting.

Reproduction:

  1. Configure HA
  2. run density
  3. Trigger a fail-over
  4. Overshoot - test failure.

This can also be seen by watching the pending count spike after a systemctl restart kube-controller-manager issued in the middle of ramping up a wide replication controller.

/cc @brendandburns, @rrati, @davidopp, @jayunit100, @eparis, @bprashanth

bprashanth commented Jun 18, 2015

I get the feeling this can happen if the rc manager runs its first loop after the restart before the watch has populated the store (but it will correct itself, like you noted). We've been meaning to audit controller framework use cases for things like restarts (#9026), but if we really need a fix we'd need a way to communicate to the rc manager that the watch goroutines have run.
@lavalamp

timothysc commented Jun 18, 2015

@bprashanth or gate the run loops on a watch sync.
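
For illustration, a minimal sketch of what gating the run loops on a watch sync can look like. The helper names here (cache.InformerSynced, cache.WaitForCacheSync, wait.Until) are today's client-go names, assumed for the sketch rather than the 2015 controller framework:

```go
// Sketch: block the worker goroutines until every informer store has replayed
// its initial LIST/WATCH, so the first sync loop never diffs against an empty
// store after a restart.
package rcmanager

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

type Controller struct {
	podsSynced cache.InformerSynced // e.g. podInformer.Informer().HasSynced
	rcsSynced  cache.InformerSynced
	worker     func() // drains one item from the work queue per call
}

func (c *Controller) Run(workers int, stopCh <-chan struct{}) error {
	// Gate on the watch sync before any worker is allowed to act.
	if !cache.WaitForCacheSync(stopCh, c.podsSynced, c.rcsSynced) {
		return fmt.Errorf("timed out waiting for caches to sync")
	}
	for i := 0; i < workers; i++ {
		go wait.Until(c.worker, time.Second, stopCh)
	}
	<-stopCh
	return nil
}
```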

@satnam6502 satnam6502 added this to the v1.0-candidate milestone Jun 18, 2015

lavalamp commented Jun 19, 2015

The problem is that the RC manager holds state that prevents the overshoot. That state isn't shared, so the standby doesn't know how many creations/deletions are queued and is likely to double-create or double-delete.
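
For context, a simplified stand-in for the kind of in-flight bookkeeping being described here (the `expectations` type and its method names are made up for the sketch, not the actual implementation); none of this state survives a restart or is visible to a standby, which is exactly the gap being discussed:

```go
// Sketch of per-RC "how many creates/deletes have I issued but not yet seen"
// bookkeeping. Purely in-memory, so a restarted or standby manager starts
// from zero and cannot know what the previous active manager had in flight.
package rcmanager

import "sync"

type expectations struct {
	mu   sync.Mutex
	adds map[string]int // rc key -> creates issued but not yet observed
	dels map[string]int // rc key -> deletes issued but not yet observed
}

func newExpectations() *expectations {
	return &expectations{adds: map[string]int{}, dels: map[string]int{}}
}

// ExpectCreations is recorded just before the manager fires n create requests.
func (e *expectations) ExpectCreations(rcKey string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[rcKey] += n
}

// CreationObserved is called from the pod watch handler when a create shows up.
func (e *expectations) CreationObserved(rcKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.adds[rcKey] > 0 {
		e.adds[rcKey]--
	}
}

// Satisfied gates the resize logic: skip diff-and-act while earlier
// requests remain unaccounted for.
func (e *expectations) Satisfied(rcKey string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[rcKey] == 0 && e.dels[rcKey] == 0
}
```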

lavalamp commented Jun 19, 2015

It should only be a problem if a fail-over is triggered while a large operation is in progress.

Hm, but if as @bprashanth says, it doesn't gate actions on the store being filled, that's a different and more serious problem.

bprashanth commented Jun 19, 2015

Yeah, we should either prepopulate that store or communicate that it has finished a first sync; otherwise a restart of the binary could cause overshooting followed by correction.

lavalamp commented Jun 19, 2015

If @bprashanth is right, then restarting the rc manager with big but stable RCs in the system should be a big problem.

We've added a method that will tell you if it's had at least one full sync, btw.

bprashanth commented Jun 20, 2015

Didn't see that method go in, but now that we have it, #10147 fixes the issue I mentioned.
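
One plausible way such a "has synced at least once" signal gets consumed in the sync path (a sketch of the idea only; the actual change in #10147 may be structured differently, and `podStoreSynced`/`enqueueAfter` are assumed names):

```go
// Sketch: if the pod store has not finished its initial list/watch replay,
// defer the RC instead of acting on an incomplete view of the world.
package rcmanager

import "time"

type rcManager struct {
	podStoreSynced func() bool                         // the "at least one full sync" method
	enqueueAfter   func(rcKey string, d time.Duration) // assumed requeue helper
}

func (rm *rcManager) syncReplicationController(rcKey string) error {
	if !rm.podStoreSynced() {
		// Acting now would look like "no pods exist yet" and trigger a burst
		// of creates; wait for the initial watch replay and retry instead.
		rm.enqueueAfter(rcKey, 100*time.Millisecond)
		return nil
	}
	// ... normal diff-and-act logic would follow here ...
	return nil
}
```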

bprashanth commented Jun 22, 2015

Replying to @davidopp's comment on the PR, since the comment will be lost once the PR is closed.

Does this fully fix #10030?

There's still the case of the standby not knowing how many requests are in flight if the primary goes down during a resize (e.g. manager 1 creates 1000 replicas, dies, manager 2 takes over, starts up a watch, sees 900 replicas because 100 are still in flight, and decides to create 100 more). It will end up overshooting and then correcting a maximum of 500 replicas (more like 20, since that's the QPS on the connection).

Do we want to consider a more complete fix for 1.0?
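
Working through the arithmetic from that example (illustrative only, using the figures quoted above: 100 requests still in flight, a 500-replica per-sync burst, and roughly 20 qps on the connection):

```go
// The standby's worst-case overshoot is bounded by how many of the old
// manager's requests were still in flight, which is itself capped by the
// per-sync burst, and in practice by the client QPS times the hand-off window.
package rcmanager

func maxOvershoot(inFlight, burstReplicas int) int {
	if inFlight > burstReplicas {
		return burstReplicas
	}
	return inFlight
}

// For the example above: maxOvershoot(100, 500) == 100, and at ~20 qps the
// old manager rarely has more than a few tens of requests outstanding.
```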

timothysc commented Jun 22, 2015

@bprashanth, why can't we gather the details of the pending items as well? It appears to only see (x) running, but why not gather the pending count and add it in as well?

bprashanth commented Jun 22, 2015

Not sure I understand what you mean. In the case I'm talking about, the requests are in flight (sent by the old manager, not yet written to etcd).

lavalamp commented Jun 22, 2015

@timothysc It does see pending pods; it just won't see pods that have been created but not yet propagated. There should really not be very many pods in this category, and for this reason I think it is probably not a problem at the moment. #10147 therefore at least knocks this issue down to p3 and post-1.0.
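
To make that concrete, a simplified version of the diff the manager computes from its local store (not the actual replication controller code): pending pods are counted like any others; only creates that have not yet propagated to the watch are invisible.

```go
// Sketch: the manager diffs desired replicas against every pod it can see for
// the RC, regardless of phase. Pods whose POST has not yet propagated into
// the store are the only ones it can miss.
package rcmanager

type podEntry struct {
	ControllerKey string
	Phase         string // "Pending", "Running", ...
}

// replicaDiff returns how many pods to create (positive) or delete (negative).
func replicaDiff(store []podEntry, rcKey string, desired int) int {
	observed := 0
	for _, p := range store {
		if p.ControllerKey == rcKey { // phase does not matter here
			observed++
		}
	}
	return desired - observed
}
```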

davidopp commented Jun 24, 2015

Moving to 1.0-post now that #10147 has been merged. Thanks for the report, @timothysc, and @bprashanth and @lavalamp for working on this.

@davidopp davidopp modified the milestones: v1.0-post, v1.0 Jun 24, 2015

jayunit100 commented Jul 14, 2015

This bug, I assume, only surfaces if you restart kube-controller-manager at the end of the test; otherwise, I guess, the overshoot might be corrected before it occurs. Correct? I'm working on a local reproducer and trying to make it robust.

timothysc commented Jul 14, 2015

Restart in the middle of density... density has a check on the total (pending + running); if the total count far exceeds the expected number, density will fail.

lavalamp commented Jul 14, 2015

Yeah, restart in the middle to test. But this should be 98% fixed with @bprashanth's change. Is there still a problem?

bprashanth commented Jul 14, 2015

If you time it right (e.g. while it's creating replicas), you can get it to overshoot; this is known. It should always correct itself, and it should not overshoot if restarted in the steady state.

jayunit100 commented Jul 14, 2015

Cool! So... maybe it's fixed in 98% of the scenarios, as per @bprashanth's change? Is this now virtually impossible to reproduce?


Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:56:07.679790472 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:56:17.680058695 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:56:27.680314095 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown 

KILL KCM

INFO: 2015-07-14 16:56:37.680653983 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:56:47.680985771 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:56:57.681283125 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:57:07.681461051 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown 

RESTART KCM

INFO: 2015-07-14 16:57:17.681642482 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 9 running, 21 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:57:27.681929083 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 10 running, 20 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:57:37.682268394 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 13 running, 17 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:57:47.682679824 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 13 running, 17 pending, 0 waiting, 0 inactive, 0 unknown 

KILL KCM

INFO: 2015-07-14 16:57:57.683031949 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 21 running, 9 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:58:07.683384456 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 21 running, 9 pending, 0 waiting, 0 inactive, 0 unknown 

RESTART KCM

INFO: 2015-07-14 16:58:17.68379428 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 26 running, 4 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:58:27.684272592 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 26 running, 4 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: 2015-07-14 16:58:37.684671371 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 30 running, 0 pending, 0 waiting, 0 inactive, 0 unknown 
INFO: E2E startup time for 30 pods: 2m50.010079055s

I tried killing/restarting several times, couldn't get an overshoot.

timothysc commented Jul 15, 2015

Let's try on our beasty cluster in the a.m.

jayunit100 commented Jul 15, 2015

@lavalamp (just confirming) are you referring to #10147 as the final fix (with the caveat that un-propagated creates can still rarely trip the bug)? Just want to make sure we're all on the same page.

bprashanth commented Jul 15, 2015

yes

lavalamp commented Jul 15, 2015

Yes.

bprashanth commented Jul 16, 2015

@jayunit100, to be clearer about the cases where you should expect overshoot:

  1. With a single rc manager: you restart the manager right after it fires a bunch of POST requests for pod creates, but before the POSTs can complete. So, as a realistic scenario, if the apiserver is loaded, these POSTs take a second each to reach etcd, and the manager restarts within that second, there could be overshoot.
  2. With 2 rc managers in HA: you kill the active manager after the POSTs have been submitted and/or completed but before the passive manager's watch has observed them. The passive doesn't know to wait on the watch; it has inconsistent local state and will overshoot.

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015

davidopp commented Dec 15, 2015

@bprashanth I assume this issue is still open because we don't have fixes for the two issues in your last comment? How hard would it be to fix?

timothysc commented Dec 16, 2015

Last I remember, they still existed but were lower priority because of the caps we had in place. If we ever have high throughput, it could be a real concern.

timothysc commented Jun 17, 2016

I'm going to close this one, but we should be mindful of these legacy issues in the reflector re-design.

/cc @hongchaodeng

@timothysc timothysc closed this Jun 17, 2016

hongchaodeng commented Jun 17, 2016

Is there a list of legacy issues? Are they all covered by tests? If not, could we add tests for them?

sureshpalemoni commented Oct 3, 2017

Oct 03 19:08:40 openstack-master.novalocal systemd[1]: Failed to start Atomic OpenShift Master API.
Oct 03 19:08:40 openstack-master.novalocal systemd[1]: Unit origin-master-api.service entered failed state.
Oct 03 19:08:40 openstack-master.novalocal systemd[1]: origin-master-api.service failed.

Any solution for this?
