
Scheduler should terminate on losing leader lock #81306

Conversation

@ravisantoshgudimetla
Contributor

commented Aug 12, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug

What this PR does / why we need it:
As of now, we're not terminating the scheduler on losing the leader lock, which lets the scheduler keep watching pod and node objects and keep scheduling pods. While this may not cause a problem in a single-scheduler topology, it will certainly cause problems if we have HA enabled for the control plane components. This PR ensures that the scheduler terminates instead of silently ignoring the loss of the leader lock and proceeding with scheduling. Without this patch, we'd notice that in an HA deployment, schedulers race to schedule pods (with or without the leader lock), causing multiple bind failures because each scheduler has a stale cache.

Thanks to @gnufied for identifying the issue.

/cc @bsalamat @Huang-Wei @ahg-g @damemi @ingvagabund
Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Scheduler should terminate when it loses the leader lock.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@ravisantoshgudimetla

Contributor Author

commented Aug 12, 2019

/sig scheduling
/priority important

@ravisantoshgudimetla ravisantoshgudimetla changed the title Terminate on leader election Scheduler should terminate on losing leader lock Aug 12, 2019
@ravisantoshgudimetla

Contributor Author

commented Aug 12, 2019

/priority important-soon

@@ -262,7 +261,7 @@ func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, regis
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			utilruntime.HandleError(fmt.Errorf("lost master"))
+			klog.Fatalf("leaderelection lost")
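For context, a minimal sketch of how the scheduler's leader-election wiring looks with this change applied, paraphrased from the diff above and the server.go excerpt quoted later in this thread (surrounding code may differ slightly):

if cc.LeaderElection != nil {
	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() {
			// Previously this only logged via utilruntime.HandleError, so the
			// process kept running (and scheduling) without the lock. Fatalf
			// terminates the scheduler so it cannot act on a stale cache.
			klog.Fatalf("leaderelection lost")
		},
	}
	leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
	if err != nil {
		return fmt.Errorf("couldn't create leader elector: %v", err)
	}
	leaderElector.Run(ctx)
	return fmt.Errorf("lost lease")
}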

@Huang-Wei

Huang-Wei Aug 12, 2019

Member

Looks good. Now we're consistent with our neighbours:

cmd/kube-controller-manager/app/controllermanager.go
281:			OnStoppedLeading: func() {
282-				klog.Fatalf("leaderelection lost")

cmd/kube-scheduler/app/server.go
264:			OnStoppedLeading: func() {
265-				utilruntime.HandleError(fmt.Errorf("lost master"))

cmd/cloud-controller-manager/app/controllermanager.go
209:			OnStoppedLeading: func() {
210-				klog.Fatalf("leaderelection lost")
@bsalamat

Member

left a comment

/lgtm
/approve

Thanks, @ravisantoshgudimetla for fixing this important bug. We must backport this to older versions (1.13+) of K8s.

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 12, 2019
@k8s-ci-robot

Contributor

commented Aug 12, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei

Member

commented Aug 13, 2019

/retest

@gaorong

Contributor

commented Aug 13, 2019

/hold
(if I have permission)

@gaorong

Contributor

commented Aug 13, 2019

Sorry to bother you, but could you please provide some real cases and how to reproduce this issue?
I have tested this and the scheduler can exit successfully.
Steps as below:

  1. Start two schedulers, say scheduler1 and scheduler2.
  2. scheduler1 becomes the leader.
  3. Break the connection between scheduler1 and the apiserver using iptables: iptables -I INPUT -p tcp --dport 8083 -j DROP.
  4. scheduler1 cannot schedule pods because of the broken connection, and scheduler2 successfully becomes the leader.
  5. Remove the iptables rule from step 3 so the connection recovers; scheduler1 then finds it has lost the lease and EXITs successfully.
  6. The node agent supervisor (e.g. systemd) restarts scheduler1.

Logs as below (ignore the unsupported 1.10 version; this snippet hasn't changed since then):

I0813 09:57:56.190610    3114 server.go:562] Version: v1.10.11
I0813 09:57:56.191312    3114 server.go:581] starting healthz server on 0.0.0.0:10251
I0813 09:57:57.091874    3114 controller_utils.go:1027] Waiting for caches to sync for scheduler controller
I0813 09:57:57.191994    3114 controller_utils.go:1034] Caches are synced for scheduler controller
I0813 09:57:57.192019    3114 leaderelection.go:175] attempting to acquire leader lease  kube-system/kube-scheduler...
I0813 09:58:15.926175    3114 leaderelection.go:184] successfully acquired lease kube-system/kube-scheduler
I0813 09:59:10.889224    3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0813 09:59:10.889265    3114 server.go:619] lost master
error: lost lease

It seems everything works fine.
IIUC, if the scheduler loses the lease, it will return from leaderElector.Run(ctx) and then invoke return fmt.Errorf("lost lease") to exit.

if cc.LeaderElection != nil {
	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() {
			utilruntime.HandleError(fmt.Errorf("lost master"))
		},
	}
	leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
	if err != nil {
		return fmt.Errorf("couldn't create leader elector: %v", err)
	}
	leaderElector.Run(ctx)
	return fmt.Errorf("lost lease")

If we want to be consistent with our neighbors, we can merge this, but it should be treated as a cleanup (not a bugfix, so no need to backport).

@ravisantoshgudimetla

Contributor Author

commented Aug 13, 2019

I0813 09:59:10.889224 3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition

It depends on how long you waited before the leader loses the lock. In my case the scheduler was able to communicate with the apiserver before it timed out waiting for the condition. The scenario I am talking about is a network condition where we lose connectivity to the apiserver for 30 seconds and then the scheduler is able to communicate again; it's not a permanent network failure, which is what would cause the scenario you mentioned above.

@gaorong

Contributor

commented Aug 13, 2019

It depends on how long you waited before the leader loses the lock

I think no matter how long the scheduler waits, once the connection recovers, the scheduler will detect that it has lost the lock and exit.

@ravisantoshgudimetla

Contributor Author

commented Aug 13, 2019

@gaorong - Thanks for testing the above scenario. What type of resourceLocks are you using? configmaps or endpoints? Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

It's interesting that the scheduler is terminating for you. I looked more into the 1.10 code base and I think the leader election code is different.

Following are the logs from v1.14.0+0faddd8

I0810 18:54:37.287283       1 leaderelection.go:263] failed to renew lease openshift-kube-scheduler/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0810 18:54:37.287313       1 server.go:248] lost master
E0810 18:54:38.278507       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0810 18:54:38.278654       1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:234: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E0810 18:54:38.278787       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
I0810 18:55:22.386588       1 scheduler.go:417] Failed to bind pod: test-insights/insights-operator-9868bfd69-kklx8
E0810 18:55:22.386635       1 factory.go:660] Error scheduling test-insights/insights-operator-9868bfd69-kklx8: Operation cannot be fulfilled on pods/binding "insights-operator-9868bfd69-kklx8": pod insights-operator-9868bfd69-kklx8 is already assigned to node "test-hlb8q-master-0"; retrying

You can clearly see that after "lost master", we're not exiting the scheduler and are continuing with scheduling.

What's happening?

The code path is slightly different in the leader election code: the scheduler in your case is failing at the lock acquisition state while renewing, and the channel is closed right after that here - https://github.com/kubernetes/kubernetes/blob/release-1.10/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go#L162

We moved from channels to contexts in #57932 (1.12+), and I think this line won't always cancel the parent context, which is what would cause an explicit exit in the component that calls the leader election code.
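To illustrate the point, here is a toy, self-contained sketch (not scheduler code): if the "stopped leading" callback only logs, the rest of the process keeps running; if it calls klog.Fatalf, the whole process terminates even though the caller never observes the leader-election loop returning.

package main

import (
	"time"

	"k8s.io/klog"
)

func main() {
	onStoppedLeading := func() {
		// Pre-patch behaviour: log and carry on, e.g.
		//   utilruntime.HandleError(fmt.Errorf("lost master"))
		// With this PR: terminate the process.
		klog.Fatalf("leaderelection lost")
	}

	// Simulate the leader-election machinery invoking the callback from its
	// own goroutine once the lease is lost.
	go func() {
		time.Sleep(2 * time.Second)
		onStoppedLeading()
	}()

	// The rest of the component keeps doing its work; with Fatalf above it
	// never gets past the simulated lease loss.
	for {
		klog.Info("still scheduling...")
		time.Sleep(time.Second)
	}
}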

@gaorong

Contributor

commented Aug 14, 2019

What type of resourceLocks are you using? configmaps or endpoints?

the default: endpoints

Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

Yes, the scheduler can still terminate as before.
I ran a v1.15.0 scheduler as a foreground process with minimal configuration.

Logs as below:

[root@k8s-test-master01 ]#  /usr/bin/kube-scheduler  --v=0 --master=http://127.0.0.1:8080
I0814 10:17:08.141925   11894 serving.go:319] Generated self-signed cert in-memory
W0814 10:17:08.313407   11894 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0814 10:17:08.313424   11894 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0814 10:17:08.313439   11894 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0814 10:17:08.314082   11894 server.go:142] Version: v1.15.0
I0814 10:17:08.314117   11894 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0814 10:17:08.315223   11894 authorization.go:47] Authorization is disabled
W0814 10:17:08.315234   11894 authentication.go:55] Authentication is disabled
I0814 10:17:08.315244   11894 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0814 10:17:08.315639   11894 secure_serving.go:116] Serving securely on [::]:10259
I0814 10:17:09.218792   11894 leaderelection.go:235] attempting to acquire leader lease  kube-system/kube-scheduler...
I0814 10:17:24.964249   11894 leaderelection.go:245] successfully acquired lease kube-system/kube-scheduler
E0814 10:19:09.392164   11894 leaderelection.go:324] error retrieving resource lock kube-system/kube-scheduler: Get http://127.0.0.1:8080/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0814 10:19:09.392245   11894 event.go:296] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'k8s-test-master01_bc8bdc92-a9d8-423d-9177-e48bc60b3f80 stopped leading'
I0814 10:19:09.392515   11894 leaderelection.go:281] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0814 10:19:09.392538   11894 server.go:254] lost master
lost lease
[root@k8s-test-master01 ]# 

It's weird to have different behavior. I think leader election should stay compatible with the previous behavior (in v1.10).
Do you have any special configuration?

@ravisantoshgudimetla

Contributor Author

commented Aug 14, 2019

I ran a v1.15.0 scheduler as a foreground process with minimal configuration.

Do you have connectivity to the apiserver from the scheduler? Looking at the logs, it seems the apiserver is not running. The scenario is specific to the situation where the scheduler acquired the lock and then releases it.

Do you have any special configuration?

I am using configmaps instead of endpoints.

@gaorong

Contributor

commented Aug 14, 2019

Do you have connectivity to the apiserver from the scheduler? Looking at the logs, it seems the apiserver is not running

The apiserver is listening on port 8080; I drop all TCP packets sent to the apiserver with iptables: iptables -I INPUT -p tcp --dport 8080 -j DROP. Maybe this is equivalent to the apiserver being down. As shown in the scheduler flags in the example above, the scheduler connects to the apiserver via localhost, so when breaking one apiserver's connectivity, the other schedulers can still successfully connect to the apiserver on their own host and acquire the lock.

The scenario is specific to the situation where the scheduler acquired the lock and then releases it.

How can we simulate this scenario?

@gaorong

Contributor

commented Aug 14, 2019

The scenario is specific to the situation where the scheduler acquired the lock and then releases it.

Did you set ReleaseOnCancel to true by changing the scheduler code? By default the scheduler would not release the lock proactively.

// ReleaseOnCancel should be set true if the lock should be released
// when the run context is cancelled. If you set this to true, you must
// ensure all code guarded by this lease has successfully completed
// prior to cancelling the context, or you may have two processes
// simultaneously acting on the critical path.
ReleaseOnCancel bool
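For illustration only, roughly where that field sits when building the leader-election config (field names from client-go's leaderelection.LeaderElectionConfig; the lock value, durations, and callbacks below are placeholders, and this PR does not set ReleaseOnCancel):

lec := leaderelection.LeaderElectionConfig{
	Lock:            lock, // a resourcelock.Interface (endpoints, configmaps, or leases)
	LeaseDuration:   15 * time.Second,
	RenewDeadline:   10 * time.Second,
	RetryPeriod:     2 * time.Second,
	ReleaseOnCancel: true, // release the lock proactively when the run context is cancelled
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() { klog.Fatalf("leaderelection lost") },
	},
}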

@yqwang-ms

Contributor

commented Aug 14, 2019

@ravisantoshgudimetla I am not familiar with kube-scheduler in HA mode, but I have a few small questions; could you please help:

Shouldn't the previous leader be fenced when the new leader is selected?

It seems the previous leader just exits as soon as possible, which still cannot totally avoid your race condition of multiple schedulers scheduling pods at the same time, right?
For example, if a binding request from the previous leader is being sent to the apiserver, exiting the process may not be able to cancel the send (at least once the request is already on the network).

So, can I consider this PR still a best effort to avoid the race condition above, but it largely reduces the chance, right?
BTW, it seems kube-scheduler does not need to totally avoid it (by fencing etc.)?

@ravisantoshgudimetla

Contributor Author

commented Aug 14, 2019

so when breaking one apiserver's connectivity, the other schedulers can still successfully connect to the apiserver on their own host and acquire the lock.

How are the other schedulers connecting to the apiserver? Are they running on the same host? If you block inbound traffic on port 8080, wouldn't they also lose connectivity?

How can we simulate this scenario?

I am currently using a 3-node control plane cluster, where the scheduler on every node connects to the apiserver running locally. As of now, I am bouncing the apiserver one node at a time to simulate this scenario.

So, can I consider this PR still a best effort to avoid the race condition above, but it largely reduces the chance, right?
BTW, it seems kube-scheduler does not need to totally avoid it (by fencing etc.)?

Good question. While I am not certain that scenario can happen in the above situation, there is a very good chance the apiserver received the request just before the scheduler exited. The other scheduler, which gained the lock, won't send the request again, because it has to wait for its informer caches to sync initially (actually even before acquiring the lock). The informers talk to the apiserver and get the latest state of the pods before building the local cache and starting to schedule. The main problem in the situation I mentioned here is that a scheduler without the lock keeps sending binding requests to the apiserver, or keeps trying to reschedule pods that were already scheduled by the right scheduler (the one which holds the leader lock).

But in general, Kubernetes is based on eventual consistency: failures can happen for various reasons, but the system should correct itself after some time.
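As a rough sketch of the ordering described above (hypothetical variable names; the real scheduler wires this up through its informer factory), the newly elected leader waits for its informer caches to sync against the apiserver before starting the scheduling loop, so it sees any bindings made by the previous leader:

if !cache.WaitForCacheSync(stopCh, podInformer.Informer().HasSynced, nodeInformer.Informer().HasSynced) {
	klog.Fatalf("timed out waiting for informer caches to sync")
}
// Only after the caches reflect the apiserver's current state does the
// scheduling loop start.
run(ctx)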

@gaorong

Contributor

commented Aug 15, 2019

Are they running on the same host?

No, they aren't. My environment is the same as yours: the scheduler on every node connects to the apiserver running locally.

@k82cn

Member

commented Aug 15, 2019

@gaorong, @ravisantoshgudimetla, is there any impact if we terminate the scheduler in this case, similar to the other components? IMO, terminating the scheduler seems a safer way to handle this kind of case :)

@gaorong

Contributor

commented Aug 15, 2019

@k82cn Agree. The main concern here is whether this is a confirmed bug fix or just a cleanup. Could we backport a cleanup?

Anyway, I agree with merging this first.
/hold cancel

@yqwang-ms

Contributor

commented Aug 15, 2019

@ravisantoshgudimetla Thanks for your detailed explanation. I totally agree we should be eventually consistent, and would add one more principle: the "eventual" duration should be as short as possible. :)

@fejta-bot


commented Aug 15, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@bsalamat

Member

commented Aug 15, 2019

This change is fine and we can merge it, but I agree with @gaorong that we don't need to backport it if it is a cleanup.

@alculquicondor

Member

commented Aug 15, 2019

I would consider this a bug and +1 on backporting

@k8s-ci-robot k8s-ci-robot merged commit de3aed8 into kubernetes:master Aug 15, 2019
23 checks passed

cla/linuxfoundation ravisantoshgudimetla authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 15, 2019