
scheduler: fix flaky test TestPreemptionRaces #77990

Merged
merged 2 commits into kubernetes:master from Huang-Wei:fix-TestPreemptionRaces May 18, 2019

Conversation

@Huang-Wei
Member

commented May 16, 2019

What type of PR is this?

/kind bug
/kind flake
/sig scheduling
/priority important-soon
/assign @bsalamat

What this PR does / why we need it:

In some cases, an Update event with no "NominatedNode" present is received right
after a node ("NominatedNode") is reserved for this pod in memory.

If we go ahead and update it (delete and add), that actually un-reserves the node,
since the newPod doesn't carry Status.NominatedNodeName.

During this window, other low-priority pods have a chance to take the space that was reserved for the nominated pod.
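The failure mode can be sketched with a minimal toy model of the scheduler's in-memory nominated-pod map (the types and method here are simplified, hypothetical stand-ins, not the actual kube-scheduler code): an unconditional delete-then-add on such an update drops the reservation.

```go
package main

import "fmt"

// Simplified stand-ins for the scheduler's types (hypothetical, for illustration).
type pod struct {
	UID               string
	NominatedNodeName string // mirrors Status.NominatedNodeName
}

// nominatedPodToNode is the in-memory reservation: pod UID -> reserved node.
type nominatedPodMap struct {
	nominatedPodToNode map[string]string
}

// buggyUpdate models the pre-fix behavior: unconditional delete + add.
func (m *nominatedPodMap) buggyUpdate(oldPod, newPod *pod) {
	delete(m.nominatedPodToNode, oldPod.UID)
	if newPod.NominatedNodeName != "" {
		m.nominatedPodToNode[newPod.UID] = newPod.NominatedNodeName
	}
}

func main() {
	// "node-a" was just reserved in memory for pod uid-1...
	m := &nominatedPodMap{nominatedPodToNode: map[string]string{"uid-1": "node-a"}}
	// ...but an Update event arrives whose pod object carries no NominatedNodeName yet.
	m.buggyUpdate(&pod{UID: "uid-1"}, &pod{UID: "uid-1"})
	// The reservation is gone; lower-priority pods can now take the space.
	fmt.Printf("reserved: %q\n", m.nominatedPodToNode["uid-1"]) // reserved: ""
}
```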

Which issue(s) this PR fixes:

Fixes #74931.

Special notes for your reviewer:

The flake is reproducible in my env, and the analysis and solution above are based on the real execution path. That said, we can brainstorm a better solution following the same rationale.

BTW: it's actually more of a bug, and the integration test did an impressive job of revealing it :)

Does this PR introduce a user-facing change?:

Fixed a scheduler race to ensure low-priority pods are unschedulable on the node(s) where high-priority pods have `NominatedNodeName` set to the node(s).
@k8s-ci-robot

Contributor

commented May 16, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei

Member Author

commented May 16, 2019

/retest

// the newPod doesn't carry sped.status.nominatedNode.
// In this case, during this time other low-priority pods have chances to take space which
// was reserved for the nominatedPod.
if len(oldPod.Status.NominatedNodeName) == 0 && len(newPod.Status.NominatedNodeName) == 0 {

@bsalamat

bsalamat May 16, 2019

Member

What you have found is an interesting case. The only problem is that a pod pointer could change in an update. If we bypass this update, we may not receive the new pointer.

@Huang-Wei

Huang-Wei May 16, 2019

Author Member

That's true. The good thing is that in both npm.delete() and npm.add(), we actually check whether pod.UID exists, rather than directly checking the pointer's presence. And this change skips only one case: when neither the old nor the new pod has NominatedNodeName set.
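The point about UID-keyed bookkeeping can be checked with a toy example (hypothetical names, not the scheduler's code): the map is keyed on pod.UID, so a changed pod pointer still hits the same entry.

```go
package main

import "fmt"

type pod struct{ UID string }

func main() {
	reservations := map[string]string{} // keyed by UID, not by pointer
	p1 := &pod{UID: "uid-1"}
	reservations[p1.UID] = "node-a"

	p2 := &pod{UID: "uid-1"}          // a different pointer for the same pod
	fmt.Println(p1 == p2)             // false: the pointers differ
	fmt.Println(reservations[p2.UID]) // node-a: the UID lookup still matches
	delete(reservations, p2.UID)      // deleting via the new pointer works too
	fmt.Println(len(reservations))    // 0
}
```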

@bsalamat

bsalamat May 16, 2019

Member

I understand, but with the existing logic, if a pod pointer is updated, this update function updates the pod pointer. The condition you have added skips that pointer update. Your point is valid, though, that this skips only when both NominatedNodeNames are empty, which can be true only for pods that have just been nominated and whose NominatedNode status update we have not yet received.
This may solve the problem, but it does not look great and feels like a hack to me. Let me think about it a bit more and see if I can find a cleaner solution.

@Huang-Wei

Huang-Wei May 16, 2019

Author Member

That's fair. We can brainstorm for a better solution.

@Huang-Wei

Huang-Wei May 17, 2019

Author Member

@bsalamat A new solution is proposed. PTAL.

// In some cases, an Update event with no "NominatedNode" present is received right
// after a node("NominatedNode") is reserved for this pod in memory.
// If we go updating (delete and add) it, it actually un-reserves the node since
// the newPod doesn't carry sped.status.nominatedNode.

@bsalamat

bsalamat May 16, 2019

Member

s/sped.status.nominatedNode/Status.NominatedNodeName/

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:fix-TestPreemptionRaces branch from f510b0b to 5fc6bb1 May 16, 2019

Huang-Wei added some commits May 16, 2019

scheduler: fix flaky test TestPreemptionRaces
In some cases, an Update event with no "NominatedNode" present is received right
after a node("NominatedNode") is reserved for this pod in memory.
If we go updating (delete and add) it, it actually un-reserves the node since
the newPod doesn't carry sped.status.nominatedNode.
In this case, during this time other low-priority pods have chances to take space which
was reserved for the nominatedPod.

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:fix-TestPreemptionRaces branch from 5fc6bb1 to 29195fa May 17, 2019

@Huang-Wei

Member Author

commented May 17, 2019

/retest

// (2) NominatedNode info is updated
// (3) NominatedNode info is removed
if NominatedNodeName(oldPod) == "" && NominatedNodeName(newPod) == "" {
if nnn, ok := npm.nominatedPodToNode[oldPod.UID]; ok {

@Huang-Wei

Huang-Wei May 17, 2019

Author Member

A neat way is nodeName = npm.nominatedPodToNode[oldPod.UID]. If oldPod.UID doesn't exist, an empty string is returned.
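To be precise, indexing a Go map with a missing key returns the value type's zero value, which for string is the empty string "". A quick check:

```go
package main

import "fmt"

func main() {
	nominatedPodToNode := map[string]string{"uid-1": "node-a"}

	// A missing key yields the zero value ("" for string), so the comma-ok
	// form is only needed when "" could also be a legitimately stored value.
	nodeName := nominatedPodToNode["uid-unknown"]
	fmt.Printf("%q\n", nodeName) // ""

	nnn, ok := nominatedPodToNode["uid-1"]
	fmt.Println(nnn, ok) // node-a true
}
```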

@bsalamat

bsalamat May 17, 2019

Member

This change makes the logic more robust, but nodeName should not be initialized with an empty string. It should be initialized by NominatedNodeName(newPod). Otherwise, all usual updates that do not come inside this if statement would remove nominated pods.

@Huang-Wei

Huang-Wei May 17, 2019

Author Member

Initializing nodeName as an empty string should be fine because in nominatedPodMap.add(), it checks whether the incoming nodeName is empty; if it is, it falls back to NominatedNodeName(p):

func (npm *nominatedPodMap) add(p *v1.Pod, nodeName string) {
    // always delete the pod if it already exists, to ensure we never store more than
    // one instance of the pod.
    npm.delete(p)
    nnn := nodeName
    if len(nnn) == 0 {
        nnn = NominatedNodeName(p)
        if len(nnn) == 0 {
            return
        }
    }
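Putting the pieces together, here is a self-contained sketch of the merged approach under simplified, hypothetical types (not the actual kube-scheduler code): update carries a still-in-memory reservation through the delete+add cycle, and add falls back to the pod's own NominatedNodeName when no explicit node name is passed.

```go
package main

import "fmt"

type pod struct {
	UID               string
	NominatedNodeName string
}

type nominatedPodMap struct {
	nominatedPodToNode map[string]string // pod UID -> nominated node
}

func (m *nominatedPodMap) del(p *pod) {
	delete(m.nominatedPodToNode, p.UID)
}

// add falls back to the pod's own NominatedNodeName when nodeName is empty,
// which is why initializing nodeName as "" in update is safe.
func (m *nominatedPodMap) add(p *pod, nodeName string) {
	m.del(p) // never store more than one instance of the pod
	nnn := nodeName
	if nnn == "" {
		nnn = p.NominatedNodeName
		if nnn == "" {
			return
		}
	}
	m.nominatedPodToNode[p.UID] = nnn
}

func (m *nominatedPodMap) update(oldPod, newPod *pod) {
	nodeName := ""
	// Only when both old and new pods lack NominatedNodeName might there be
	// an in-memory reservation the API objects don't reflect yet.
	if oldPod.NominatedNodeName == "" && newPod.NominatedNodeName == "" {
		if nnn, ok := m.nominatedPodToNode[oldPod.UID]; ok {
			nodeName = nnn // carry the reservation forward
		}
	}
	m.del(oldPod)
	m.add(newPod, nodeName)
}

func main() {
	m := &nominatedPodMap{nominatedPodToNode: map[string]string{"uid-1": "node-a"}}
	// An update whose objects carry no NominatedNodeName no longer wipes the reservation.
	m.update(&pod{UID: "uid-1"}, &pod{UID: "uid-1"})
	fmt.Println(m.nominatedPodToNode["uid-1"]) // node-a
}
```

Usual updates where the new pod does carry a NominatedNodeName bypass the guard entirely and pick up the new value through add's fallback.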

@bsalamat

bsalamat May 17, 2019

Member

You are right.

@Huang-Wei

Member Author

commented May 17, 2019

/retest

@Huang-Wei

Member Author

commented May 17, 2019

/test pull-kubernetes-e2e-gce-100-performance
/test pull-kubernetes-kubemark-e2e-gce-big

@Huang-Wei

Member Author

commented May 17, 2019

/retest

@bsalamat
Member

left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 17, 2019

@Huang-Wei

Member Author

commented May 17, 2019

/retest


@k8s-ci-robot k8s-ci-robot merged commit 82ac5c3 into kubernetes:master May 18, 2019

20 checks passed

cla/linuxfoundation: Huang-Wei authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.

@Huang-Wei Huang-Wei deleted the Huang-Wei:fix-TestPreemptionRaces branch May 20, 2019

@KJTsanaktsidis


commented Jul 3, 2019

This is in the 1.15.0-alpha3 changelog as:

Fixed a scheduler racing issue to ensure low priority pods to be unschedulable on the node(s) where high priority pods have NominatedNodeName set to the node(s)

I believe I may have run into this issue with 1.13.7 on GKE. I was testing pod pre-emption based on priority, and to do so I made the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bsdeploy
  namespace: connect-server
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: bsdeploy
  replicas: 37
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
      maxSurge: 0
  template:
    metadata:
      labels:
        app.kubernetes.io/name: bsdeploy
    spec:
      nodeName: gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3
      priorityClassName: normal-workload-priority
      containers:
      - name: bs
        image: busybox
        command: [ '/bin/sleep', '10000000' ]
        resources:
          requests:
            cpu: 205m

i.e., it binds 37 copies of bsdeploy to the node gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3 with priority class normal-workload-priority (set to 100), carefully calibrated to fill up its available CPU.

I then deployed a daemonset (consul-agent) to my cluster with a higher priority class (1000). I see that the daemonset does actually pre-empt the bsdeploy pods - a bunch of them are terminated with an event:

  Normal  Preempted  95s   default-scheduler                                             by cluster-baseline/consul-agent-vlcr4 on node gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3

However, the actual daemonset pod consul-agent-vlcr4 is still unscheduled, even though the status on it has nominatedNodeName set to gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3 (i.e. where we want it to go). Instead, the deployment for bsdeploy keeps making more Pods, and they end up getting scheduled.

Is this evidence that perhaps this fix should be backported to 1.13/1.14 as well?

Thanks!

@KJTsanaktsidis


commented Jul 3, 2019

Disregard the above... I figured out my issue was totally separate.

The issue was that I had my bsdeploy using nodeName: to put it on the node I wanted it to get evicted from.

What I think was happening was that:

  • New consul-agent pod of higher priority was created in etcd
  • Scheduler tries to put it somewhere, only place it can go is on the gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3 node (since it has to go everywhere)
  • Since that node is full, and the daemonset is higher priority, it tries to evict some stuff. It deletes some Pods of bsdeploy on gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3, then sets nominatedNodeName on the daemonset pod to the node it should go to.
  • The deployment notices that one of its Pods got deleted, so it makes another one
  • But the pod template has nodeName on it, so the new Pod actually just gets assigned to gke-app-cluster-noautoscalev1-a3fc8c3b-vjt3 without going through the scheduler at all
  • When the scheduler tries again to schedule the daemonset, it can't, because the empty space from its eviction was actually already filled up by the replacement bsdeploy pod.

I fixed my test setup by using a node affinity on the desired node instead of directly setting nodeName in the template. Then everything works as expected.

So, maybe this change should be backported, maybe not, but definitely not for this reason :)
