
scheduler: fix panic while removing node from imageStates cache #66224

Merged
merged 1 commit into kubernetes:master on Jul 16, 2018

Conversation

nikhita
Member

@nikhita nikhita commented Jul 16, 2018

Currently, when I run hack/local-up-cluster.sh, the scheduler encounters a panic. From /tmp/kube-scheduler.log:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x15e9988]

goroutine 55 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x4242880, 0x877d870)
	/usr/local/go/src/runtime/panic.go:502 +0x229
k8s.io/kubernetes/pkg/scheduler/cache.(*schedulerCache).removeNodeImageStates(0xc4203dfe50, 0xc420ae3b80)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/cache/cache.go:510 +0xe8
k8s.io/kubernetes/pkg/scheduler/cache.(*schedulerCache).UpdateNode(0xc4203dfe50, 0xc420ae3b80, 0xc420415340, 0x0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/cache/cache.go:442 +0xcd
k8s.io/kubernetes/pkg/scheduler/factory.(*configFactory).updateNodeInCache(0xc420d2ca00, 0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/factory/factory.go:794 +0x9a
k8s.io/kubernetes/pkg/scheduler/factory.(*configFactory).(k8s.io/kubernetes/pkg/scheduler/factory.updateNodeInCache)-fm(0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/factory/factory.go:248 +0x52
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(0xc4209fc8f0, 0xc4209fc900, 0xc4209fc910, 0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:202 +0x5d
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x42cf8f, 0xc4215035a0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552 +0x18a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0xc4214addf0, 0x42cad9, 0xc421598f30)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203 +0x9c
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x81
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc421503768)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4214adf68, 0xdf8475800, 0x0, 0x40fdd01, 0xc4215d2360)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc421503768, 0xdf8475800, 0xc4215d2360)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc420187100)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x78
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).(k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.run)-fm()
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc4209b4840, 0xc42025e370)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

#65745, which was merged recently, introduced:

state, ok := cache.imageStates[name]
if ok {
	state.nodes.Delete(node.Name)
}
if len(state.nodes) == 0 {

If !ok, i.e. state is nil, the unguarded len(state.nodes) check dereferences a nil pointer and panics.
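One way to guard against this, and roughly the shape of the change in this PR, is to move the length check inside the ok branch so that a missing imageStates entry becomes a no-op. All identifiers below already appear in the snippet above; this is a sketch, not the exact merged diff.

state, ok := cache.imageStates[name]
if ok {
	state.nodes.Delete(node.Name)
	// Only consult state when the image entry actually exists;
	// if the entry is missing there is nothing to clean up.
	if len(state.nodes) == 0 {
		delete(cache.imageStates, name)
	}
}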

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 16, 2018
@nikhita
Member Author

nikhita commented Jul 16, 2018

/cc @silveryfu @resouer @bsalamat @ravisantoshgudimetla

@k8s-ci-robot
Contributor

@nikhita: GitHub didn't allow me to request PR reviews from the following users: silveryfu.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @silveryfu @resouer @bsalamat @ravisantoshgudimetla

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment


Thanks @nikhita for finding and working on this. This LGTM.

/lgtm

// imageStates represents the total number of different
// images on all nodes
delete(cache.imageStates, name)
if len(state.nodes) == 0 {
Contributor


I think the previous line doesn't make much sense now: calling state.nodes.Delete(node.Name) when len(state.nodes) == 0. You could change the order. I'd prefer a !ok pattern: at least log something in the !ok block, and then handle this condition in the else branch.
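For illustration, a minimal sketch of the !ok pattern suggested here, assuming the rest of removeNodeImageStates is unchanged (the glog call is illustrative, not code from this PR):

state, ok := cache.imageStates[name]
if !ok {
	// Nothing to clean up for this image; log for visibility.
	glog.V(4).Infof("image %q not found in imageStates cache", name)
} else {
	state.nodes.Delete(node.Name)
	if len(state.nodes) == 0 {
		delete(cache.imageStates, name)
	}
}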

Member Author


I think there are still some nits and tests to be fixed as a follow-up for #65745. Maybe this can be incorporated into that, or I'll fix it as a follow-up later. 👍

@ravisantoshgudimetla
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 16, 2018
@k82cn
Member

k82cn commented Jul 16, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, nikhita, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 66203, 66224). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 72440a1 into kubernetes:master Jul 16, 2018
@nikhita nikhita deleted the fix-scheduler-panic branch July 16, 2018 16:30
@bsalamat
Member

Thanks @nikhita for fixing the issue. Any fix like this should be accompanied by a test that reproduces the bug. That would verify that the fix does not miss corner cases and also raise the probability of catching similar bugs in other areas of the code.
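For illustration only, a self-contained sketch of the kind of regression test suggested here. It does not exercise the real schedulerCache; it models imageStates as a plain map, and imageStateSummary and removeImageState are hypothetical stand-ins that mirror the guarded cleanup pattern.

package cache

import "testing"

// imageStateSummary is a hypothetical stand-in for the scheduler's
// per-image bookkeeping: the set of node names that have the image.
type imageStateSummary struct {
	nodes map[string]struct{}
}

// removeImageState mirrors the guarded cleanup: only touch state
// when the image name is actually present in the cache.
func removeImageState(imageStates map[string]*imageStateSummary, imageName, nodeName string) {
	state, ok := imageStates[imageName]
	if !ok {
		return // image unknown to the cache; nothing to clean up
	}
	delete(state.nodes, nodeName)
	if len(state.nodes) == 0 {
		delete(imageStates, imageName)
	}
}

// TestRemoveMissingImageDoesNotPanic reproduces the shape of the bug:
// cleaning up an image name that is absent from the cache must not
// dereference a nil state.
func TestRemoveMissingImageDoesNotPanic(t *testing.T) {
	imageStates := map[string]*imageStateSummary{}
	removeImageState(imageStates, "nginx:latest", "node-1")

	// The normal path should still remove an image once no node references it.
	imageStates["nginx:latest"] = &imageStateSummary{nodes: map[string]struct{}{"node-1": {}}}
	removeImageState(imageStates, "nginx:latest", "node-1")
	if _, ok := imageStates["nginx:latest"]; ok {
		t.Fatal("expected image entry to be removed once no nodes reference it")
	}
}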
