Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Structured Logging] when klog is passed nil, it causes components to panic while logging #100155

Closed
ehashman opened this issue Mar 12, 2021 · 12 comments · Fixed by #100156 or kubernetes/klog#226
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ehashman
Copy link
Member

What happened:

When klog.KObj is passed a nil ref while logging, it panics:

Mar 10 23:23:57 kind-worker kubelet[241]: E0310 23:23:57.580249 241 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Mar 10 23:23:57 kind-worker kubelet[241]: goroutine 4504 [running]:
Mar 10 23:23:57 kind-worker kubelet[241]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41cc2c0, 0x7377320)
Mar 10 23:23:57 kind-worker kubelet[241]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
Mar 10 23:23:57 kind-worker kubelet[241]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Mar 10 23:23:57 kind-worker kubelet[241]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
Mar 10 23:23:57 kind-worker kubelet[241]: panic(0x41cc2c0, 0x7377320)
Mar 10 23:23:57 kind-worker kubelet[241]: /usr/local/go/src/runtime/panic.go:965 +0x1b9
Mar 10 23:23:57 kind-worker kubelet[241]: k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).GetName(0x0, 0x40a4c60, 0x73f4568)
Mar 10 23:23:57 kind-worker kubelet[241]: :1 +0x5
Mar 10 23:23:57 kind-worker kubelet[241]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.KObj(0x5191890, 0x0, 0x9c, 0xa, 0xc000f09b00, 0x40)
Mar 10 23:23:57 kind-worker kubelet[241]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1587 +0x31
Mar 10 23:23:57 kind-worker kubelet[241]: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult.func1(0xc0016ba1d0, 0xc0007f02c0, 0x0, 0x0, 0xc000a3f320, 0xc00090c380)
Mar 10 23:23:57 kind-worker kubelet[241]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:677 +0x2ae
Mar 10 23:23:57 kind-worker kubelet[241]: created by k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).killContainersWithSyncResult
Mar 10 23:23:57 kind-worker kubelet[241]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:670 +0x105

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/100050/pull-kubernetes-e2e-kind/1369786002343727104/artifacts/logs/kind-worker/kubelet.log

What you expected to happen:

nil or an empty reference should be logged.

How to reproduce it (as minimally and precisely as possible):

klog.KObj(nil)

Anything else we need to know?:

I have fixed this in kubernetes/klog#225 so we will need to bump klog.

ht @pacoxu for reporting in #100109

/priority critical-urgent
/sig instrumentation

@ehashman ehashman added the kind/bug Categorizes issue or PR as related to a bug. label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2021
@pacoxu
Copy link
Member

pacoxu commented Mar 12, 2021

need update klog dependencies later.
milestone 1.21?

@ehashman
Copy link
Member Author

/milestone v1.21

Will set the milestone for the issue and let people review the PR during regular burndown.

@ehashman
Copy link
Member Author

Scope check: potentially affected components are:

  • Kubelet
  • Controller manager
  • Scheduler

(grep for KObj: https://cs.k8s.io/?q=KObj&i=nope&files=&excludeFiles=&repos=)

@serathius
Copy link
Contributor

serathius commented Mar 12, 2021

I looked in current implementation of handling nil object. Please remember how checking interface{} works.

Current fix should work on simple nil (https://play.golang.org/p/T5974AlFXc8):

package main

import (
	"k8s.io/klog/v2"
)

func main() {
	klog.InfoS("message", "pod", klog.KObj(nil))
}

but it will fail on structs implementing KMetadata interface, but not handling nil pointer (https://play.golang.org/p/2EL5nNhgY8E).

package main

import (
	"k8s.io/klog/v2"
)

type A struct {
	name, namespace string
}

func (a *A) GetName() string {
	return a.name
}

func (a *A) GetNamespace() string {
	return a.namespace
}

func main() {
	var a *A
	klog.InfoS("message", "a", klog.KObj(a))
}

Reason is that interface with nil value is not nil. Resulting call to 'Log' function will panic on (nil).GetName() call.

In stacktrace presented above it was k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).GetName(0x0, 0x40a4c60, 0x73f4568) function that panicked which leads to thinking that the problem is in (*Pod).GetName() function which panics if pointer is empty.

Potential solutions:

  • Replace all klog.KObj calls with klog.KRef until we solve this issue
  • Fix this particular case where pod can be nil (solves only local problem, but not the long term issue that every log call can be a panic)
  • Use reflection to verify if object under interface is an nil (expected overhead EDIT: got results only 7% increased cpu usage) Fix cases where KMetadata interface has nil object klog#226
  • Fix all GetName and GetNamespace functions to check for nil (not enough time before the freeze, some overhead on calls)
  • Do runtime checks for K8s types (would require klog to import api types)

Looking at all options I think that first one makes most sense as all other have bad impact on whole k/k repo

@serathius
Copy link
Contributor

serathius commented Mar 12, 2021

Did benchmarking for using reflections, looks like performance overhead is not as significant as I previously anticipated, around 20%.

I think we could consider using reflections.

@serathius
Copy link
Contributor

serathius commented Mar 12, 2021

/cc @thockin @44past4 @DirectXMan12
For feedback

@serathius
Copy link
Contributor

/cc @dims

@ehashman
Copy link
Member Author

Potential solutions:

* Replace all `klog.KObj` calls with `klog.KRef` until we solve this issue

...
Looking at all options I think that first one makes most sense as all other have bad impact on whole k/k repo

There are hundreds of instances of klog.KObj in k/k now, and every single one would need to be carefully checked when obtaining the pod name/namespace, and we'd need a top-level approver to do the replacement. I think this really needs to be fixed in klog.

@serathius
Copy link
Contributor

With benchmark results being not as bad as first estimated I think we could use reflection in klog like proposed in kubernetes/klog#226

@ehashman
Copy link
Member Author

/reopen

needs to get bumped in k/k to close

@k8s-ci-robot
Copy link
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen

needs to get bumped in k/k to close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Mar 12, 2021
@ehashman
Copy link
Member Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
No open projects
4 participants