Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apimachinery & apiserver: use stacktrace in the stdlib #75853

Merged

Conversation

@roycaihw
Copy link
Member

commented Mar 29, 2019

use runtime.Stack() in the stdlib to log stack trace instead of using customized code, so that third party lib can parse the panic (like panicparse)

e.g. if we have a panic here, with the change, the tool is able to parse all three goroutines

without the change, only the first goroutine (finishRequest) gets parsed. The second goroutine (timeoutHandler) is missed because we drop a trailing newline in the one goroutine before. The third goroutine (HandleCrash) is missed because getCaller() is non-standard which only contains filename + line number.

Does this PR introduce a user-facing change?:

Use stdlib to log stack trace when a panic occurs

/cc @logicalhan @sttts

@liggitt

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

I think the size limit was to prevent excessively large logs (stdlib http server panic handling does the same thing)

I don't know why the custom caller code was added initially. @smarterclayton, do you know?

@logicalhan
Copy link
Contributor

left a comment

I think the size limit was to prevent excessively large logs (stdlib http server panic handling does the same thing)

That makes sense. Maybe we should add a comment if we revert to that behavior?

Personally I like this change. I find the callers thing quite odd.

@@ -92,23 +92,22 @@ func (t *timeoutHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
return
}

errCh := make(chan interface{})
// resultCh is used as both errCh and stopCh

This comment has been minimized.

Copy link
@logicalhan

logicalhan Mar 29, 2019

Contributor

Nice, I like the rename. It makes more sense.

@@ -62,27 +63,14 @@ func HandleCrash(additionalHandlers ...func(interface{})) {

// logPanic logs the caller tree when a panic occurs.
func logPanic(r interface{}) {
callers := getCallers(r)
callers := debug.Stack()

This comment has been minimized.

Copy link
@logicalhan

logicalhan Mar 29, 2019

Contributor

probably better to rename this to stacktrace?

@@ -155,10 +143,10 @@ func GetCaller() string {
// handlers to handle errors and panics the same way.
func RecoverFromPanic(err *error) {
if r := recover(); r != nil {
callers := getCallers(r)
callers := debug.Stack()

This comment has been minimized.

Copy link
@logicalhan

logicalhan Mar 29, 2019

Contributor

same.

@smarterclayton

This comment has been minimized.

Copy link
Contributor

commented Mar 29, 2019

I think this code was added before Go had an easy way to get a stack dump. There may have been a security concern too (avoiding showing some portions of the stack). But it’s been too long.

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Mar 29, 2019

2012: runtime/debug.Stack was deprecated (golang/go#4070, golang/go@e49a183)
2012: stdlib http server switched from runtime/debug.Stack to runtime.Stack with 4KB buf size (golang/go#4060, golang/go@dd43bf8)
2014: the http server increased buf size from 4KB to 64KB golang/go@645a341
2015: runtime/debug.Stack was re-implemented and un-deprecated golang/go@2d697b2
2017-2019: some conversation comparing the two APIs golang/go#3845 (comment)

I can go with manually allocating buf size if excessively large log is a concern. Will update the PR soon

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from 943a02b to 0e61b77 Apr 1, 2019

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Apr 1, 2019

updated. I'm working on adding a test to capture what's printed in klog

@logicalhan
Copy link
Contributor

left a comment

Thanks for adding the test!

/lgtm

Show resolved Hide resolved staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime_test.go Outdated
callers := getCallers(r)
// Same as stdlib http server code. Manually allocate stack trace buffer size
// to prevent excessively large logs
const size = 64 << 10

This comment has been minimized.

Copy link
@logicalhan

logicalhan Apr 1, 2019

Contributor

I wonder if it would be worth extracting this logic somewhere, since we do the same thing in handlers/rest.go. Probably not though..

This comment has been minimized.

Copy link
@roycaihw

roycaihw Apr 2, 2019

Author Member

I agree. I'm not sure on this

on one hand, I'd imagine extracting this logic would result in one more function call in the call stack

on the other hand, I guess there are more places in our code base need this logic (I was about to add the logic here until I saw #73137 (comment))

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 1, 2019

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from 739be66 to 826cb90 Apr 1, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Apr 1, 2019

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from 826cb90 to d5ca73a Apr 2, 2019

@sttts

This comment has been minimized.

Copy link
Contributor

commented Apr 2, 2019

Looks good.

Can you post the difference in output from before and after this PR?

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from d5ca73a to a2c59f5 Apr 2, 2019

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Apr 2, 2019

Can you post the difference in output from before and after this PR?

@sttts https://gist.github.com/roycaihw/b0977279000fb3f3048c3b77df89bec5

to reproduce: include this commit, bring up apiserver and look for the last occurrence of "Observed a panic" in apiserver log

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Apr 3, 2019

/retest

@sttts

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2019

@roycaihw do I read the gist correctly that we print all 3249 go routine stack traces now, while only one before?

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Apr 3, 2019

@sttts we print stacktraces for 3 goroutines in both cases

after:

  • goroutine 3247 handlers.finishRequest
  • goroutine 3246 filters.(*timeoutHandler).ServeHTTP
  • goroutine 3277 runtime.logPanic

before:

  • goroutine 3277 handlers.finishRequest (without trailing newline)
  • goroutine 3276 filters.(*timeoutHandler).ServeHTTP
  • filename+line# (runtime.logPanic)

/retest

@roycaihw

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

@liggitt could you approve this PR?

/retest

if match, _ := regexp.MatchString(`logPanic(.*)`, lines[2]); !match {
t.Errorf("mismatch symbolized function name: %s", lines[2])
}
if match, _ := regexp.MatchString(`runtime\.go:69 \+0x`, lines[3]); !match {

This comment has been minimized.

Copy link
@liggitt

liggitt Apr 5, 2019

Member

this seems like a fragile test... is this format and line number stable across releases?

This comment has been minimized.

Copy link
@liggitt

liggitt Apr 5, 2019

Member

same question for the [running] format, etc

This comment has been minimized.

Copy link
@roycaihw

roycaihw Apr 5, 2019

Author Member

line #: it's fragile. I meant to have people update the number when changing our runtime.go, but it's not necessary

[running]: it comes from one of the statuses. Should be stable in this test

updated both to use regexp. The rest of the test should be stable enough

also added documentation about if golang stdlib changes its format


// captureStderr redirects stderr to result string, and then restore stderr from backup
func captureStderr(f func()) (string, error) {
bak := os.Stderr

This comment has been minimized.

Copy link
@liggitt

liggitt Apr 5, 2019

Member

defer func(){ os.Stderr = bak }()

This comment has been minimized.

Copy link
@roycaihw

roycaihw Apr 5, 2019

Author Member

updated

tried to defer closing the writable file before, and it behaved weirdly (stuck on flushing the buffer?)

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from a2c59f5 to ca8fcfc Apr 5, 2019

@roycaihw roycaihw force-pushed the roycaihw:fix/use-standard-stacktrace branch from ca8fcfc to 999a02c Apr 5, 2019

@liggitt

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

/lgtm
/approve
/retest

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f8d0b21 into kubernetes:master Apr 6, 2019

17 checks passed

cla/linuxfoundation roycaihw authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.