reconciler: updateDevicePath() panic:invalid memory address or nil pointer dereference #86722

Closed
h4ghhh opened this issue Dec 30, 2019 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@h4ghhh
Contributor

h4ghhh commented Dec 30, 2019

What happened:
The kubelet panics when the reconciler starts.
4773 reconciler.go:154] Reconciler: start to sync state
E1211 00:37:16.826560 84773 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:563
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:600
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:419
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:330
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:155
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:143
/usr/local/go/src/runtime/asm_amd64.s:1333

What you expected to happen:
No panic.
How to reproduce it (as minimally and precisely as possible):
I don't know...
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.13
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig node
/sig storage
/kind bug

@h4ghhh h4ghhh added the kind/bug Categorizes issue or PR as related to a bug. label Dec 30, 2019
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Dec 30, 2019
@tedyu
Contributor

tedyu commented Dec 30, 2019

1.13 is very old.

Is it possible to try a recent release and see if the panic persists?

Thanks

@neolit123
Member

neolit123 commented Dec 30, 2019

/remove-kind bug
/priority awaiting-more-evidence

1.13

what patch version is this?
latest one is v1.13.12.

is your API server older than 1.13? try matching the kubelet and API server versions.
(see these comments: #78970 (comment))

the minimum version in the support skew is 1.15.x, so please upgrade.

there are only two changes in reconciler.go between 1.13 and 1.15:
https://github.com/kubernetes/kubernetes/commits/release-1.13/pkg/kubelet/volumemanager/reconciler/reconciler.go
https://github.com/kubernetes/kubernetes/commits/release-1.15/pkg/kubelet/volumemanager/reconciler/reconciler.go

@k8s-ci-robot k8s-ci-robot added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 30, 2019
@k8s-ci-robot
Contributor

@neolit123: Those labels are not set on the issue: kind/bug


@h4ghhh
Contributor Author

h4ghhh commented Jan 3, 2020

The version is 1.13.0, with bug fixes merged from 1.13.6.
The panic happened after the kubelet restarted.

@neolit123
Member

< 1.15 is out of support at this point, so it is best to help us confirm if the issue is still present in versions in the support skew.

anything else you can tell us?

  • when does this happen?
  • any special conditions, volumes?
  • how to reproduce this?

given this call passes:

rc.updateDevicePath(volumesNeedUpdate)

abdda3f == v1.13.6

i'm going to assume that what is causing the panic on this line:

node, fetchErr := rc.kubeClient.CoreV1().Nodes().Get(string(rc.nodeName), metav1.GetOptions{})

is a kubeClient == nil, which is quite odd.

looking at the backtrace in the OP and the history of the file i don't see any changes that could have fixed such a panic.
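
The suspected failure mode can be reproduced in isolation. The following is a minimal sketch, not the kubelet code: `nodeGetter` and `GetNode` are hypothetical stand-ins for the real clientset interface, and the struct keeps only the field relevant to the trace. It shows that calling a method through a nil interface field produces the same runtime error reported in the OP.

```go
package main

import "fmt"

// nodeGetter is a hypothetical, reduced stand-in for the client interface
// that the reconciler holds in its kubeClient field.
type nodeGetter interface {
	GetNode(name string) (string, error)
}

// reconciler keeps only the fields relevant to the panic.
type reconciler struct {
	kubeClient nodeGetter // left nil when no API client was configured
	nodeName   string
}

// updateDevicePath calls through kubeClient without checking it for nil.
func (rc *reconciler) updateDevicePath() {
	_, _ = rc.kubeClient.GetNode(rc.nodeName)
}

// callAndRecover runs updateDevicePath and returns the recovered panic
// message, or an empty string if no panic occurred.
func callAndRecover(rc *reconciler) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	rc.updateDevicePath()
	return ""
}

func main() {
	rc := &reconciler{nodeName: "node-1"} // kubeClient is nil
	fmt.Println(callAndRecover(rc))
}
```

This only demonstrates the mechanism. It does not explain why the client would be nil in a running kubelet.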

@kubernetes/sig-storage-bugs

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 3, 2020
@tedyu
Contributor

tedyu commented Jan 3, 2020

In kubelet.go, kubeClient is passed from:

	klet := &Kubelet{
		hostname:                                hostname,
		hostnameOverridden:                      len(hostnameOverride) > 0,
		nodeName:                                nodeName,
		kubeClient:                              kubeDeps.KubeClient,

Earlier there is check:

	if kubeDeps.KubeClient != nil {

I am thinking of the following fix:

diff --git a/pkg/kubelet/volumemanager/reconciler/reconciler.go b/pkg/kubelet/volumemanager/reconciler/reconciler.go
index adb2e2038cd..0ca13912ee9 100644
--- a/pkg/kubelet/volumemanager/reconciler/reconciler.go
+++ b/pkg/kubelet/volumemanager/reconciler/reconciler.go
@@ -598,7 +598,10 @@ func (rc *reconciler) reconstructVolume(volume podVolume) (*reconstructedVolume,
 }

 // updateDevicePath gets the node status to retrieve volume device path information.
-func (rc *reconciler) updateDevicePath(volumesNeedUpdate map[v1.UniqueVolumeName]*reconstructedVolume) {
+func (rc *reconciler) updateDevicePath(volumesNeedUpdate map[v1.UniqueVolumeName]*reconstructedVolume) error {
+       if rc.kubeClient == nil {
+               return fmt.Errorf("kubeClient is nil")
+       }
        node, fetchErr := rc.kubeClient.CoreV1().Nodes().Get(string(rc.nodeName), metav1.GetOptions{})
        if fetchErr != nil {
                klog.Errorf("updateStates in reconciler: could not get node status with error %v", fetchErr)
@@ -611,6 +614,7 @@ func (rc *reconciler) updateDevicePath(volumesNeedUpdate map[v1.UniqueVolumeName
                        }
                }
        }
+       return nil
 }

 // getDeviceMountPath returns device mount path for block volume which
@@ -630,7 +634,9 @@ func getDeviceMountPath(volume *reconstructedVolume) (string, error) {

 func (rc *reconciler) updateStates(volumesNeedUpdate map[v1.UniqueVolumeName]*reconstructedVolume) error {
        // Get the node status to retrieve volume device path information.
-       rc.updateDevicePath(volumesNeedUpdate)
+       if err := rc.updateDevicePath(volumesNeedUpdate); err != nil {
+               return err
+       }

        for _, volume := range volumesNeedUpdate {
                err := rc.actualStateOfWorld.MarkVolumeAsAttached(
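
The shape of the proposed guard can be tried outside the kubelet tree. This is a self-contained sketch under stated assumptions: `nodeStatusGetter`, `DevicePaths`, `fakeClient`, and the `map[string]string` volume type are hypothetical simplifications, and unlike the diff above it also returns the fetch error instead of only logging it.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatusGetter is a hypothetical, reduced stand-in for the API client.
type nodeStatusGetter interface {
	DevicePaths(nodeName string) (map[string]string, error)
}

type reconciler struct {
	kubeClient nodeStatusGetter
	nodeName   string
}

var errNilKubeClient = errors.New("kubeClient is nil")

// updateDevicePath fills in device paths for the given volumes.
// It returns an error instead of panicking when no client is available.
func (rc *reconciler) updateDevicePath(volumes map[string]string) error {
	if rc.kubeClient == nil {
		return errNilKubeClient
	}
	paths, err := rc.kubeClient.DevicePaths(rc.nodeName)
	if err != nil {
		// Assumption: propagate the error; the original code only logs it.
		return fmt.Errorf("could not get node status: %w", err)
	}
	for name := range volumes {
		if p, ok := paths[name]; ok {
			volumes[name] = p
		}
	}
	return nil
}

// updateStates propagates the error from updateDevicePath to its caller.
func (rc *reconciler) updateStates(volumes map[string]string) error {
	if err := rc.updateDevicePath(volumes); err != nil {
		return err
	}
	// Marking volumes as attached would follow here.
	return nil
}

// fakeClient is a test double that returns fixed device paths.
type fakeClient map[string]string

func (f fakeClient) DevicePaths(string) (map[string]string, error) { return f, nil }

func main() {
	vols := map[string]string{"vol-a": ""}
	fmt.Println((&reconciler{nodeName: "node-1"}).updateStates(vols))

	rc := &reconciler{kubeClient: fakeClient{"vol-a": "/dev/sdb"}, nodeName: "node-1"}
	fmt.Println(rc.updateStates(vols), vols["vol-a"])
}
```

One caveat with this pattern: the `== nil` comparison only catches a nil interface value. An interface that wraps a typed nil pointer would pass the guard and could still panic inside the method.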

@neolit123
Member

@tedyu
having the nil guard is OK, but i think the question is why did the client end up as nil in the first place.

@tedyu
Contributor

tedyu commented Jan 3, 2020

@h4ghhh
Do you see the following log?

		kubeDeps.KubeClient, err = clientset.NewForConfig(clientConfig)
		if err != nil {
			return fmt.Errorf("failed to initialize kubelet client: %v", err)

It seems kubeDeps.KubeClient might be nil in case of error.

@h4ghhh
Contributor Author

h4ghhh commented Jan 3, 2020

I can't find such a log.

The whole system was being upgraded. The API server was not working at that time, but the kubelet had already started running.

@mattjmcnaughton
Contributor

Ah, is it possible that the Kubelet was running in standalone mode at that time?

Re https://sourcegraph.com/github.com/kubernetes/kubernetes@master/-/blob/cmd/kubelet/app/server.go#L551

If the Kubelet wasn't running in standalone mode, I'm not seeing how KubeClient could be nil. As far as I can tell, the Kubelet doesn't actually start running when NewForConfig returns an error, as the cmd/kubelet/app/server.go#run func immediately returns an error (see the code @tedyu linked above).
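
To make the two startup paths concrete, here is a rough sketch of the control flow as described in this comment. It is an illustration only: `apiClient`, `dependencies`, `newClient`, and this `run` signature are hypothetical and much simpler than the real cmd/kubelet/app/server.go. In standalone mode the client is left nil and startup continues; otherwise a client construction error aborts startup before anything runs.

```go
package main

import (
	"errors"
	"fmt"
)

// apiClient and dependencies are hypothetical, reduced types.
type apiClient struct{ host string }

type dependencies struct{ kubeClient *apiClient }

// newClient stands in for client construction and fails on an empty host.
func newClient(host string) (*apiClient, error) {
	if host == "" {
		return nil, errors.New("empty API server host")
	}
	return &apiClient{host: host}, nil
}

// run sketches the two startup paths discussed above.
func run(standalone bool, host string) (*dependencies, error) {
	deps := &dependencies{}
	if standalone {
		// Standalone mode: no API server, so kubeClient stays nil
		// and startup still proceeds.
		return deps, nil
	}
	c, err := newClient(host)
	if err != nil {
		// Non-standalone mode: a client error stops startup, so the
		// kubelet never runs with a nil client on this path.
		return nil, fmt.Errorf("failed to initialize kubelet client: %v", err)
	}
	deps.kubeClient = c
	return deps, nil
}

func main() {
	d, err := run(true, "")
	fmt.Println(d.kubeClient == nil, err)

	_, err = run(false, "")
	fmt.Println(err)
}
```

Under these assumptions, a nil client can only reach later components such as the reconciler through the standalone path.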

@Pingan2017
Member

+1

@h4ghhh
Contributor Author

h4ghhh commented Jan 4, 2020

Yes, kubelet was running in standalone mode, then?

@tedyu
Contributor

tedyu commented Jan 4, 2020

Only rc.updateDevicePath() uses KubeClient.
It seems adding the nil check would stop the panic.

I am open to not running reconciler if KubeClient is nil (over #86795).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 3, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.



7 participants