Scheduler should terminate on losing leader lock #81306
What type of PR is this?
What this PR does / why we need it:
Thanks to @gnufied for identifying the issue.
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
[APPROVALNOTIFIER] This PR is APPROVED
Sorry to bother you, but could you please provide some real cases and describe how to reproduce this issue?
Logs as below (ignore the unsupported 1.10 version; this snippet hasn't changed since then):
It all seems to work fine.
If we want to be consistent with our neighbors, we can merge this, but it should be treated as a cleanup (not a bugfix), so there's no need to backport.
It depends on how long you waited before the leader loses the lock. In my case the scheduler was able to communicate with the apiserver before it timed out waiting for the condition. The scenario I am talking about is a network condition where we lose connectivity to the apiserver for 30 seconds and then the scheduler is able to communicate again; it's not a permanent network failure, which is what would cause the scenario you mentioned above.
@gaorong - Thanks for testing the above scenario. What type of resourceLock are you using, configmaps or endpoints? Can you run the same test with kube 1.12+ and see if you're able to reproduce this?
It's interesting that the scheduler is terminating for you. I looked more into the 1.10 code base, and I think the leader election code is different.
Following are the logs from v1.14.0+0faddd8
You can clearly see that after
The code path is slightly different in the leader election code. The scheduler in your case is failing at the lock acquisition stage while renewing, and the channel is closed right after that, here - https://github.com/kubernetes/kubernetes/blob/release-1.10/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go#L162
We moved from channels to contexts in #57932 (1.12+), and I think this line
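Not the actual kube-scheduler wiring, but for context, here is a minimal sketch of how client-go's context-based leader election is typically hooked up, assuming roughly the 1.14-era API, hypothetical lock/namespace names, and a stand-in runScheduler loop. The relevant piece for this PR is the OnStoppedLeading callback, which terminates the process once the lease is lost instead of letting the scheduler keep running without leadership.

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog"
)

// runScheduler is a stand-in for the real scheduling loop; it simply blocks
// until the leader election context is cancelled.
func runScheduler(ctx context.Context) { <-ctx.Done() }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("building client config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "my-scheduler" and the identity below are placeholders for illustration only.
	lock, err := resourcelock.New(
		resourcelock.EndpointsResourceLock, // the default discussed above; configmaps also works
		"kube-system", "my-scheduler",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	)
	if err != nil {
		klog.Fatalf("creating resource lock: %v", err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			// The context passed here is cancelled when leadership is lost.
			OnStartedLeading: runScheduler,
			OnStoppedLeading: func() {
				// Once the lease is lost (e.g. after ~30s of apiserver
				// unreachability), exit instead of continuing to bind pods.
				klog.Fatalf("lost leader lease, exiting")
			},
		},
	})
}
```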
The default: endpoints.
Yes, the scheduler can still terminate as before.
Logs as below:
It's weird to have different behavior. I think leader election should be compatible with the previous behavior (in v1.10).
Do you have connectivity to the apiserver from the scheduler? Looking at the logs, it seems the apiserver is not running. The scenario is specific to the situation where the scheduler has acquired the lock and is releasing it back.
I am using configmaps instead of endpoints.
The apiserver is listening on port 8080; I drop all TCP packets sent to the apiserver with iptables:
How can we simulate this scenario?
@ravisantoshgudimetla I am not familiar with the kube-scheduler in HA mode, but I have a few small questions; could you please help:
Shouldn't the previous leader be fenced when the new leader is selected?
It seems the previous leader just exits as soon as possible, but that still cannot totally avoid the race condition in which multiple schedulers are scheduling pods at the same time, right?
So, can I consider this PR to still be a best effort to avoid the race condition above, one that largely reduces the chance of it, right?
How are the other schedulers connecting to the apiserver? Are they running on the same host? If you block inbound traffic on port 8080, they'd also lose connectivity, wouldn't they?
I am currently using a 3-node control-plane cluster, where the scheduler on each node connects to the apiserver running locally. As of now, I am bouncing the apiserver one node at a time to simulate this scenario.
Good question. While I am not certain that scenario can happen in the situation above, there is a very good chance that the apiserver received the request just before the scheduler exited. The other scheduler that gained the lock won't send the request again, because it has to wait for the informer caches to sync initially (actually even before acquiring the lock). The informers talk to the apiserver and get the latest state of the pods before building the local cache and starting to schedule. The main problem in the situation I described here is that a scheduler without the lock keeps sending binding requests to the apiserver, or keeps trying to reschedule pods that were already scheduled by the right scheduler (the one holding the leader lock).
But in general, Kubernetes is based on eventual consistency: failures can happen for various reasons, but the system should correct itself after some time.
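To illustrate the informer-cache point above: this is a minimal sketch, not the scheduler's actual startup code, of waiting for the shared informer caches to sync before any scheduling work starts, using client-go's shared informer factory. A newly elected leader built this way sees pods already bound by the previous leader and will not try to schedule them again.

```go
package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("building client config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx := context.Background()
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	factory.Start(ctx.Done())

	// Block until the local cache reflects the apiserver's current state,
	// including pods bound by the previous leader.
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
		klog.Fatalf("timed out waiting for caches to sync")
	}

	// Only now would the scheduling loop (and, in practice, leader election) start.
}
```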