
Controller manager restart default initialization failure. #10030

Closed
timothysc opened this issue Jun 18, 2015 · 27 comments
Assignees
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@timothysc
Member

In validating HA fail-over on density, we verified that the controller manager appears to act in an uninitialized fashion, i.e. it doesn't gather complete state before acting. It over-corrects, which causes a fair amount of unnecessary churn on the system.

It eventually recovers, and converges, but it fails density tests by overshooting.

Reproduction:

  1. Configure HA
  2. run density
  3. Trigger a fail-over
  4. Overshoot - test failure.

This can also be seen by watching the pending count spike on a systemctl restart kube-controller-manager in the middle of a ramping-up on a wide replication controller.

/cc @brendandburns, @rrati, @davidopp, @jayunit100, @eparis, @bprashanth

@bprashanth
Contributor

I get the feeling this can happen if the rc manager runs its first loop after the restart before the watch has populated the store (but it will correct itself, like you noted). We've been meaning to audit controller framework use cases for things like restarts (#9026), but if we really need a fix we'd need to communicate to the rc manager that the watch goroutines have run.
@lavalamp

@timothysc
Member Author

@bprashanth or gate the run loops on a watch sync.
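The gating suggested here can be sketched as a store that reports whether its initial list has completed, with the sync loop refusing to act until then. A minimal self-contained sketch; the `store`, `populate`, and `syncLoop` names are illustrative, not the actual controller framework API:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// store is a toy stand-in for the controller's local cache of pods.
type store struct {
	synced int32 // set to 1 once the initial list has been populated
	pods   []string
}

func (s *store) HasSynced() bool { return atomic.LoadInt32(&s.synced) == 1 }

// populate simulates the watch goroutine filling the store from the
// initial list, then marking the store as synced.
func (s *store) populate(initial []string) {
	s.pods = append(s.pods, initial...)
	atomic.StoreInt32(&s.synced, 1)
}

// syncLoop refuses to act until the store reports at least one full sync,
// so the manager never computes a diff against an empty cache.
func syncLoop(s *store, desired int) (creations int) {
	for !s.HasSynced() {
		time.Sleep(time.Millisecond)
	}
	if missing := desired - len(s.pods); missing > 0 {
		creations = missing
	}
	return creations
}

func main() {
	s := &store{}
	go s.populate([]string{"pod-1", "pod-2", "pod-3"})
	// With gating, the loop waits for the initial sync, sees 3 existing
	// pods, and creates only the 2 genuinely missing replicas.
	fmt.Println(syncLoop(s, 5))
}
```

Without the `HasSynced` gate, the first loop would see zero pods and issue 5 creates, which is exactly the restart overshoot described above.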

@satnam6502 satnam6502 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 18, 2015
@satnam6502 satnam6502 added this to the v1.0-candidate milestone Jun 18, 2015
@lavalamp
Member

The problem is that RC manager holds state that prevents the overshoot. That state isn't shared, so the standby doesn't know how many creations/deletions are queued and is likely to double-create/delete.

@lavalamp
Member

It should only be a problem if a fail-over is triggered while a large operation is in progress.

Hm, but if as @bprashanth says, it doesn't gate actions on the store being filled, that's a different and more serious problem.

@bprashanth
Contributor

Yeah, we should either prepopulate that store or communicate that it has finished a first sync; otherwise a restart of the binary could cause overshooting and correction.

@lavalamp
Member

If @bprashanth is right, then restarting the rc manager with big but stable RCs in the system should be a big problem.

We've added a method that will tell you if it's had at least one full sync, btw.

@davidopp davidopp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jun 20, 2015
@davidopp davidopp modified the milestones: v1.0, v1.0-candidate Jun 20, 2015
@davidopp davidopp assigned kprobst and bprashanth and unassigned kprobst Jun 20, 2015
@bprashanth
Contributor

Didn't see that method go in, but now that we have it, #10147 fixes the issue I mentioned.

@bprashanth
Contributor

Replying to @davidopp's comment on the PR, since the comment will be lost once the PR is closed.

Does this fully fix #10030?

There's still the case of the standby not knowing how many requests are in flight if the primary goes down during a resize (eg: manager 1 creates 1000 replicas, it dies, manager 2 takes over and starts up a watch, sees 900 replicas because 100 are still in flight and decides to create 100 more). It will end up overshooting and correcting a maximum of 500 replicas (more like 20 since that's the qps on the connection).

Do we want to consider a more complete fix for 1.0?
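One way to bound this kind of double-create is for a manager to track its own in-flight requests and discount them from the diff. A minimal sketch of the idea only; the `expectations` type and `neededCreates` are hypothetical names, not the actual change in #10147:

```go
package main

import "fmt"

// expectations tracks creates this manager has issued but not yet
// observed through the watch. Note a restarted or standby manager starts
// with zero expectations, which is exactly why it can still overshoot:
// it cannot see the old manager's in-flight requests.
type expectations struct{ pendingCreates int }

// neededCreates computes how many creates to issue, discounting requests
// already in flight so the same replica is not created twice.
func (e *expectations) neededCreates(desired, observed int) int {
	n := desired - observed - e.pendingCreates
	if n < 0 {
		return 0
	}
	e.pendingCreates += n
	return n
}

func main() {
	e := &expectations{}
	// First pass: 1000 desired, 900 observed -> issue 100 creates.
	fmt.Println(e.neededCreates(1000, 900)) // 100
	// Second pass before the watch catches up: still 900 observed, but
	// 100 creates are pending, so nothing more is issued.
	fmt.Println(e.neededCreates(1000, 900)) // 0
}
```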

@timothysc
Member Author

@bprashanth, why can't we gather the details of the pending items as well? It appears to only see (x) running; why not gather and += the pending count?

@bprashanth
Contributor

Not sure I understand what you mean. In the case I'm talking about, the requests are in flight (sent by the old manager, not yet written to etcd).

@lavalamp
Member

@timothysc It does see pending pods; it just won't see pods that have been created but not yet propagated. There should really not be very many pods in this category, and for this reason I think it is probably not a problem at the moment. #10147 therefore at least knocks this issue down to p3 and post-1.0.

@davidopp
Copy link
Member

Moving to 1.0-post now that #10147 has been merged. Thanks for the report, @timothysc, and @bprashanth and @lavalamp for working on this.

@davidopp davidopp modified the milestones: v1.0-post, v1.0 Jun 24, 2015
@jayunit100
Member

This bug, I assume, only surfaces if you restart kube-controller-manager at the end of the test; otherwise, I guess, the overshoot might be corrected before it occurs. Correct? I'm working on a local reproducer and trying to make it robust.

@timothysc
Member Author

Restart in the middle of density. Density has a check for total (pending + running); the total count far exceeds the expected value, and density will fail.

@lavalamp
Member

Yeah, restart in the middle to test. But this should be 98% fixed with @bprashanth's change. Is there still a problem?

@bprashanth
Contributor

If you time it right (e.g. while it's creating replicas), you can get it to overshoot; this is known. It should always correct itself, and it should not overshoot if restarted in the steady state.

@jayunit100
Member

Cool! So maybe it's fixed in 98% of the scenarios, as per @bprashanth's change? Is this now virtually impossible to reproduce?


```
Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:56:07.679790472 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:56:17.680058695 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:56:27.680314095 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 0 running, 30 pending, 0 waiting, 0 inactive, 0 unknown

KILL KCM

INFO: 2015-07-14 16:56:37.680653983 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:56:47.680985771 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:56:57.681283125 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:57:07.681461051 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 5 running, 25 pending, 0 waiting, 0 inactive, 0 unknown

RESTART KCM

INFO: 2015-07-14 16:57:17.681642482 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 9 running, 21 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:57:27.681929083 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 10 running, 20 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:57:37.682268394 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 13 running, 17 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:57:47.682679824 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 13 running, 17 pending, 0 waiting, 0 inactive, 0 unknown

KILL KCM

INFO: 2015-07-14 16:57:57.683031949 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 21 running, 9 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:58:07.683384456 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 21 running, 9 pending, 0 waiting, 0 inactive, 0 unknown

RESTART KCM

INFO: 2015-07-14 16:58:17.68379428 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 26 running, 4 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:58:27.684272592 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 26 running, 4 pending, 0 waiting, 0 inactive, 0 unknown
INFO: 2015-07-14 16:58:37.684671371 -0400 EDT density30-b3c38835-2a6a-11e5-91f7-3c970e4b8bbd Pods: 30 out of 30 created, 30 running, 0 pending, 0 waiting, 0 inactive, 0 unknown
INFO: E2E startup time for 30 pods: 2m50.010079055s
```

I tried killing/restarting several times; couldn't get an overshoot.

@timothysc
Member Author

Let's try on our beasty cluster in the a.m.

@jayunit100
Member

@lavalamp (just confirming) are you referring to #10147 as the final fix (with the caveat that un-propagated creates can still rarely trip the bug)? Just want to make sure we're all on the same page.

@bprashanth
Contributor

yes

@lavalamp
Member

Yes.

@bprashanth
Contributor

@jayunit100 to be more clear on the cases where you should expect overshoot:

  1. With a single rc manager: you restart the manager right after it sends a bunch of POST requests for pod creates, but before the POSTs can complete. A realistic scenario: if the apiserver is loaded and these POSTs take a second each to reach etcd, and the manager restarts within that second, there could be overshoot.
  2. With 2 rc managers in HA: you kill the active after the POSTs have been submitted and/or completed, but before the passive manager's watch has observed them. The passive doesn't know to wait on the watch; it has inconsistent local state and will overshoot.
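Scenario 2 reduces to simple arithmetic; a toy model of the take-over (all names and numbers are illustrative):

```go
package main

import "fmt"

// takeover models the HA case: the active manager dies after issuing
// creates, and the passive takes over before its watch has observed them.
// visible is the replica count the passive's watch shows; inFlight is the
// number of creates that were submitted but are not yet visible to it.
func takeover(desired, visible, inFlight int) (extraCreated int) {
	// The passive believes only `visible` replicas exist and tops up to
	// `desired`, duplicating the in-flight creates.
	created := desired - visible
	total := visible + inFlight + created
	return total - desired
}

func main() {
	// 1000 desired, watch shows 900, 100 creates still propagating: the
	// passive creates 100 more and overshoots by 100 before correcting.
	fmt.Println(takeover(1000, 900, 100)) // 100

	// Steady state (nothing in flight): no overshoot on take-over.
	fmt.Println(takeover(1000, 1000, 0)) // 0
}
```

This also illustrates why gating on the watch sync alone isn't sufficient in HA: the passive's store can be fully synced and still be missing the in-flight creates.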

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@davidopp
Member

@bprashanth I assume this issue is still open because we don't have fixes for the two issues in your last comment? How hard would it be to fix?

@timothysc
Member Author

Last I remember they still existed but were lower priority b/c of the caps we had in place. If we ever have high throughput it could be a real concern.

@timothysc
Member Author

I'm going to close this one but we should be mindful of the legacy in the reflector re-design.

/cc @hongchaodeng

@hongchaodeng
Member

Is there a list of legacy issues? Are they all covered by tests? If not, we could add tests for them.

@sureshpalemoni

```
Oct 03 19:08:40 openstack-master.novalocal systemd[1]: Failed to start Atomic OpenShift Master API.
Oct 03 19:08:40 openstack-master.novalocal systemd[1]: Unit origin-master-api.service entered failed state.
Oct 03 19:08:40 openstack-master.novalocal systemd[1]: origin-master-api.service failed.
```

Any solution for this?
