fix a scheduler panic due to internal cache inconsistency #71063

Merged
merged 1 commit into kubernetes:master from Huang-Wei:nodeinfo-clone-panic on Nov 17, 2018

Conversation

@Huang-Wei
Member

Huang-Wei commented Nov 15, 2018

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a node is deleted and genericScheduler.cachedNodeInfoMap hasn't been repopulated yet (that population happens in Schedule()/Preempt()), invoking nodeInfo.Clone() on a "stripped" NodeInfo panics.

"stripped" means nodeInfo (the pointer) in schedulerCache side is taken off:

func (cache *schedulerCache) RemoveNode(node *v1.Node) error {

and most of its contents are set to nil:

func (n *NodeInfo) RemoveNode(node *v1.Node) error {
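
For illustration, here is a minimal, self-contained sketch of the failure mode (Resource and NodeInfo below are simplified stand-ins for the real types in pkg/scheduler/cache, not the actual code): cloning dereferences the resource fields, so a stripped NodeInfo - or a nil *NodeInfo returned from a map lookup - panics on the same line the stack trace points to.

package main

import "fmt"

// Resource is a simplified stand-in for the scheduler cache's Resource type.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// Clone dereferences its receiver, so calling it on a nil *Resource panics.
func (r *Resource) Clone() *Resource {
	res := *r
	return &res
}

// NodeInfo is a simplified stand-in for schedulercache.NodeInfo.
type NodeInfo struct {
	requestedResource *Resource
}

// Clone mirrors the pattern implicated in the stack trace: it clones
// requestedResource without a nil check.
func (n *NodeInfo) Clone() *NodeInfo {
	return &NodeInfo{
		// a nil n panics here; a nil field panics inside Resource.Clone
		requestedResource: n.requestedResource.Clone(),
	}
}

func main() {
	defer func() {
		fmt.Println("recovered:", recover()) // invalid memory address or nil pointer dereference
	}()
	stripped := &NodeInfo{} // "stripped": requestedResource is nil
	_ = stripped.Clone()
}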

Which issue(s) this PR fixes:

Fixes #70450.

Special notes for your reviewer:

Generally speaking, the root cause is that we don't handle cache population properly internally. schedulerCache watches Node Add/Update/Delete events and updates its nodeInfoMap immediately, but a scheduling cycle only syncs that state when snapshot() is called from Schedule()/Preempt().

An ideal solution is to either (1) do the cache "population" in real time, or (2) ask whether we really need two nodeInfoMaps at all. We can revisit this in our next refactoring iteration. cc @misterikkit.

This issue is likely to occur frequently in the auto-scaler case, where nodes are added and deleted frequently.

Does this PR introduce a user-facing change?:

Fix a scheduler panic due to internal cache inconsistency

/sig scheduling
/assign @bsalamat

@Huang-Wei

Member

Huang-Wei commented Nov 15, 2018

/priority important-soon

@Huang-Wei

Member

Huang-Wei commented Nov 15, 2018

/retest

@@ -984,6 +983,10 @@ func selectVictimsOnNode(
	queue internalqueue.SchedulingQueue,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
	if nodeInfo.Node() == nil {

@justinsb

justinsb Nov 15, 2018

Member

Which line of Clone do we think is causing the panic here? Looking at the nodeInfo.Clone() method body, I couldn't immediately spot which copy would panic on a stripped node, unless nodeInfo itself was nil.

@Huang-Wei

Huang-Wei Nov 15, 2018

Member

@justinsb from the stack trace, it's:

requestedResource: n.requestedResource.Clone(),

but honestly I can't find any code that sets requestedResource to nil, or that builds a NodeInfo without initializing requestedResource.

And from the trace, it seems nodeInfo itself is not nil.

@bsalamat

bsalamat Nov 16, 2018

Contributor

@Huang-Wei I think the fix is to change NodeInfo.Clone() code to check for nil before calling Clone() on requestedResource, nonzeroRequest, and allocatableResource.
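
A minimal, self-contained sketch of the guard suggested here, using the same simplified stand-in types as the sketch near the top of this page (not the actual node_info.go code); the same check would apply to nonzeroRequest and allocatableResource:

package main

import "fmt"

type Resource struct {
	MilliCPU int64
	Memory   int64
}

func (r *Resource) Clone() *Resource {
	res := *r
	return &res
}

type NodeInfo struct {
	requestedResource *Resource
}

// Clone checks for nil before calling Clone() on each resource field, so a
// stripped NodeInfo is copied as-is instead of panicking.
func (n *NodeInfo) Clone() *NodeInfo {
	clone := &NodeInfo{}
	if n.requestedResource != nil {
		clone.requestedResource = n.requestedResource.Clone()
	}
	return clone
}

func main() {
	stripped := &NodeInfo{}
	fmt.Printf("%+v\n", stripped.Clone()) // &{requestedResource:<nil>}, no panic
}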

@Huang-Wei

Huang-Wei Nov 16, 2018

Member

Yes, we can do that - that's a safer approach.

My point about checking whether NodeInfo.node is nil was: if "nodeInfo.node == nil" is related to "nodeInfo.requestedResource == nil" (we haven't been able to confirm that yet), then we can return directly, since doing calculations on a NodeInfo copy whose node is nil sounds error prone, and we might see other panics.

For now, I think we can make both changes: (1) check whether NodeInfo.node is nil, and (2) check whether requestedResource/nonzeroRequest/allocatableResource are nil. Thoughts?

@bsalamat

bsalamat Nov 16, 2018

Contributor

I have no reason to believe that checking whether NodeInfo.node is nil helps mitigate the crash. I would limit this to checking the resources. Alternatively, we could dig in to find where we initialize a NodeInfo object without initializing the "resources".

@bsalamat

bsalamat Nov 16, 2018

Contributor

How does it check both? Besides, I am not sure if we should actually check nodeInfo.node == nil.

@Huang-Wei

Huang-Wei Nov 16, 2018

Member
func (n *NodeInfo) Node() *v1.Node {
	if n == nil {
		return nil
	}
	return n.node
}

nodeInfo.node == nil may not happen here, as nodeInfo is a clone; setting nodeInfo.node to nil can only happen in schedulerCache when it receives a RemoveNode() event, and that doesn't affect the clone.

So @bsalamat @ravisantoshgudimetla, I can change it to only check whether nodeInfo == nil, if that seems more reasonable to you.
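
(For readers following the "How does it check both?" question above: Go permits calling a method on a nil pointer receiver as long as the method doesn't dereference it, which is exactly what the Node() accessor quoted above relies on. A self-contained sketch with simplified stand-in types:)

package main

import "fmt"

type Node struct {
	Name string
}

// NodeInfo is a simplified stand-in; Node() mirrors the accessor quoted above.
type NodeInfo struct {
	node *Node
}

func (n *NodeInfo) Node() *Node {
	if n == nil {
		return nil
	}
	return n.node
}

func main() {
	var missing *NodeInfo            // e.g. the zero value returned for an unsynced map key
	stripped := &NodeInfo{node: nil} // a NodeInfo whose node field was cleared
	fmt.Println(missing.Node() == nil, stripped.Node() == nil) // true true
}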

@bsalamat

bsalamat Nov 16, 2018

Contributor

I see your point, Wei. You are right, but I still feel checking nodeInfo == nil is more readable for those who, like myself, may be confused by nodeInfo.node == nil.

@Huang-Wei

Huang-Wei Nov 16, 2018

Member

Sure, will update.

@ravisantoshgudimetla

ravisantoshgudimetla Nov 16, 2018

Contributor

Yeah, at first I found it confusing too, but after looking at the code for a bit I understood it.

@justinsb

Member

justinsb commented Nov 16, 2018

So a question from trying to understand:

I don't see locking around cachedNodeInfoMap (and checkNode is spawned in a goroutine here, and I believe nodeNameToInfo is the same map as cachedNodeInfoMap).

So is it possible that we concurrently deleted the node, and thus it is nil?

Technically we should be reading the map with a lock, because concurrent map access is unsafe. However, given the likely perf impact and the slow rate of change of Node objects, my personal vote (if I am correct that we should be using a lock) would be to not address it until 1.14. But should we check the case where the node is not in nodeNameToInfo?

(There are likely other options too: we could use a copy-on-write map, or concurrent reading may be safe if we only write from a thread that holds a write lock and never change the set of keys, etc.)
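
If locking along those lines were added, the usual pattern is a sync.RWMutex guarding the map. A self-contained sketch (the wrapper type and its fields below are illustrative, not the scheduler's actual code):

package main

import (
	"fmt"
	"sync"
)

// nodeInfoCache is an illustrative wrapper: reads take the read lock, so they
// are safe against concurrent writers, and many readers can proceed at once.
type nodeInfoCache struct {
	mu    sync.RWMutex
	infos map[string]string // stand-in for map[string]*NodeInfo
}

func (c *nodeInfoCache) get(name string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	info, ok := c.infos[name]
	return info, ok
}

func (c *nodeInfoCache) set(name, info string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.infos[name] = info
}

func main() {
	c := &nodeInfoCache{infos: map[string]string{}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.set("node-1", "info") }()
	go func() { defer wg.Done(); _, _ = c.get("node-1") }()
	wg.Wait()
	fmt.Println(c.get("node-1"))
}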

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

So is it possible that we concurrently deleted the node, and thus it is nil?

That's not really possible. But it reminds me of another possibility: when a new node is added, cachedNodeInfoMap may not have that entry yet. Why? Let me explain:

As I mentioned in the description of this PR, schedulerCache is always up to date because it uses informers to watch changes to pods/nodes/etc. But the sync from schedulerCache to cachedNodeInfoMap is not realtime - it happens in snapshot(), which is called by Schedule() and Preempt().

The issue occurs in Preempt():

  • on L251, it calls snapshot() to populate the nodeInfoMap; so far so good, since snapshot() does a bi-directional comparison and is safe

  • but on L259, it uses NodeLister to list all available nodes, and that list is up to date! In other words, between L251 and L259, it's possible that a new node was added to the cluster

allNodes, err := nodeLister.List()

  • and on L266, it computes potentialNodes, which filters out nodes that failed in the previous scheduling run - and a brand-new node certainly won't be filtered out

potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)

  • then we reach where the panic happens: it calls selectNodesForPreemption(), in which checkNode() iterates over every potential node name and calls nodeNameToInfo[nodeName].Clone() directly:

nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,

checkNode := func(i int) {
	nodeName := potentialNodes[i].Name
	var metaCopy algorithm.PredicateMetadata
	if meta != nil {
		metaCopy = meta.ShallowCopy()
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)
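
Given that window, one illustrative mitigation (a sketch only - the exact shape of the merged change is in the diff above, around selectVictimsOnNode) is to look the entry up defensively inside checkNode, using the identifiers from the snippet just quoted:

	nodeInfo, ok := nodeNameToInfo[nodeName]
	if !ok || nodeInfo == nil || nodeInfo.Node() == nil {
		// The node was listed by NodeLister after the last snapshot(), or its
		// entry was stripped; skip it instead of dereferencing nil in Clone().
		return
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeInfo, predicates, queue, pdbs)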


@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Nov 16, 2018

/milestone 1.13

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 16, 2018

@ravisantoshgudimetla: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone 1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:nodeinfo-clone-panic branch from 2068665 to 25e1f4c Nov 16, 2018

@AishSundar

Contributor

AishSundar commented Nov 16, 2018

/milestone v1.13

Adding to 1.13. But do note that code freeze is at 5pm today. If this is critical for 1.13 and needs to get in, please switch the priority to critical-urgent.

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 16, 2018

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/priority critical-urgent

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

I added a unit test to simulate the panic and confirm my analysis in #71063 (comment):

Running tool: /usr/local/bin/go test -timeout 30s k8s.io/kubernetes/pkg/scheduler/core -run ^TestSelectNodesForPreemption$ -v -count=1

=== RUN   TestSelectNodesForPreemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_does_not_fit_on_any_machine
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_with_no_preemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_on_one_machine_with_no_preemption
E1116 12:10:41.760498   60306 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:522
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:513
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:82
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/signal_unix.go:390
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/cache/node_info.go:437
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:990
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:909
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:65
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:1333

and interestingly, it points to the same line as the original stack trace :)

requestedResource: n.requestedResource.Clone(),
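
For reference, the same scenario can be reproduced in miniature outside the scheduler (simplified stand-in types and a hypothetical node name - not the actual TestSelectNodesForPreemption case): a node name that was never snapshotted yields a nil *NodeInfo from the map, and cloning it panics.

package demo

import "testing"

type Resource struct {
	MilliCPU int64
	Memory   int64
}

type NodeInfo struct {
	requestedResource *Resource
}

func (n *NodeInfo) Clone() *NodeInfo {
	cpy := *n.requestedResource // panics when n (or the field) is nil
	return &NodeInfo{requestedResource: &cpy}
}

// TestMissingCacheEntryPanics: a lookup for a node that was never snapshotted
// returns a nil *NodeInfo, and cloning it panics, mirroring the output above.
func TestMissingCacheEntryPanics(t *testing.T) {
	nodeNameToInfo := map[string]*NodeInfo{} // no entry for "new-node"
	defer func() {
		if recover() == nil {
			t.Fatal("expected a nil-pointer panic")
		}
	}()
	_ = nodeNameToInfo["new-node"].Clone()
}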

@ravisantoshgudimetla

Thanks @Huang-Wei for finding and fixing this bug.

/lgtm
/approve

/cc @bsalamat

@k8s-ci-robot k8s-ci-robot requested a review from bsalamat Nov 16, 2018

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 16, 2018

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Nov 16, 2018

@Huang-Wei - Can you please create backports to 1.12 and 1.11?

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

@randomvariable that's what I'm going to do :)

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:nodeinfo-clone-panic branch from 25e1f4c to a86ba8b Nov 16, 2018

@k8s-ci-robot k8s-ci-robot removed the lgtm label Nov 16, 2018

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Nov 16, 2018

@bsalamat any other comments before I re-lgtm?

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 16, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 16, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/retest

2 similar comments
@Huang-Wei

Member

Huang-Wei commented Nov 16, 2018

/retest

@Huang-Wei

Member

Huang-Wei commented Nov 17, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7e621cc into kubernetes:master Nov 17, 2018

18 checks passed

cla/linuxfoundation: Huang-Wei authorized
pull-kubernetes-bazel-build: Job succeeded
pull-kubernetes-bazel-test: Job succeeded
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded
pull-kubernetes-e2e-gce-100-performance: Job succeeded
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-integration: Job succeeded
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded
pull-kubernetes-typecheck: Job succeeded
pull-kubernetes-verify: Job succeeded
tide: In merge pool

@Huang-Wei Huang-Wei deleted the Huang-Wei:nodeinfo-clone-panic branch Nov 17, 2018

k8s-ci-robot added a commit that referenced this pull request Nov 28, 2018

Merge pull request #71195 from Huang-Wei/automated-cherry-pick-of-#71063-upstream-release-1.12

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency

k8s-ci-robot added a commit that referenced this pull request Dec 3, 2018

Merge pull request #71196 from Huang-Wei/automated-cherry-pick-of-#71063-upstream-release-1.11

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency