Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: new test TestLabelSystemstack is flaky #50007

Closed
prattmic opened this issue Dec 6, 2021 · 12 comments
Closed

runtime/pprof: new test TestLabelSystemstack is flaky #50007

prattmic opened this issue Dec 6, 2021 · 12 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@prattmic
Copy link
Member Author

prattmic commented Dec 6, 2021

It looks like this is exposing a legitimate bug. Every one of these logs I've looked at looks something like this:

        1: 0x80ed151 (fmt.Fprintf:203) 0x817f28c (runtime/pprof.labelHog:1500) 0x817ec3b (runtime/pprof.TestLabelSystemstack.func2.1.1:1481) labels: map[]
        1: 0x80ece1b (fmt.newPrinter:140) 0x80ed14d (fmt.Fprintf:203) 0x817f28c (runtime/pprof.labelHog:1500) 0x817ec3b (runtime/pprof.TestLabelSystemstack.func2.1.1:1481) labels: map[key:[value]]
        1: 0x80f2ac0 (fmt.(*pp).doPrintf:1026) 0x80ed181 (fmt.Fprintf:204) 0x817f28c (runtime/pprof.labelHog:1500) 0x817ec3b (runtime/pprof.TestLabelSystemstack.func2.1.1:1481) labels: map[key:[value]]
        1: 0x80f355c (fmt.(*pp).doPrintf:1009) 0x80ed181 (fmt.Fprintf:204) 0x817f28c (runtime/pprof.labelHog:1500) 0x817ec3b (runtime/pprof.TestLabelSystemstack.func2.1.1:1481) labels: map[key:[value]]
...

Notice that the first entry is missing the label. Every runs seems to have the first sample missing the label. Perhaps we are dropping that somewhere?

@bcmills bcmills added this to the Go1.18 milestone Dec 6, 2021
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 6, 2021
@felixge
Copy link
Contributor

felixge commented Dec 6, 2021

😞 Sorry this is flaky again. I'll do some debugging now to see if I can figure out what's going on. Thanks for the details you provided @prattmic . Looks like there are no linux/amd64 failures so far?

@prattmic prattmic self-assigned this Dec 6, 2021
@prattmic
Copy link
Member Author

prattmic commented Dec 6, 2021

I am looking as well to see if I can find the discrepancy.

@prattmic
Copy link
Member Author

prattmic commented Dec 6, 2021

Found the bug!

profBuf.write uses an index in b.tags for each entry, even if that entry has no tag (that slice entry just remains 0). profBuf.read similarly returns a tags slice with exactly as many entries as there are in data ("entries" in data may use multiple uint64s in that slice).

profileBuilder.addCPUData iterates through the tags in lockstep with the data entries. Except in the special case of the first record, where it forgets to increment tags. Thus the first read of profiling data has all tags off-by-one.

I will send a CL shortly.

@felixge
Copy link
Contributor

felixge commented Dec 6, 2021

@prattmic wow, that was fast! I only got as far as verifying that the tagPtr is still intact when cpuprof.add(tagPtr, stk[:n]) gets called for the first sample :).

Given what you found, do you think this this will fix the test flakiness? Or would it just help with the tag accuracy of the first batch of samples?

@prattmic
Copy link
Member Author

prattmic commented Dec 6, 2021

Given what you found, do you think this this will fix the test flakiness? Or would it just help with the tag accuracy of the first batch of samples?

I'm not certain. I hope it is the only bug and perhaps could even resolve this TODO, but I don't know yet.

@gopherbot
Copy link

Change https://golang.org/cl/369741 mentions this issue: runtime/pprof: consume tag for first CPU record

@felixge
Copy link
Contributor

felixge commented Dec 6, 2021

and perhaps could even resolve this TODO, but I don't know yet.

Just took your patch for a spin, and it looks very promising! I'd be happy to send a change for removing that TODO once your patch lands. Or feel free to make it part of your change if you think it's worth a shot.

https://gist.github.com/felixge/b3a2436035ccfc24d3a0aa0fe98b3c57

@gopherbot
Copy link

Change https://golang.org/cl/369744 mentions this issue: runtime/pprof: assert that labelHog samples are always labeled

@gopherbot
Copy link

Change https://golang.org/cl/369984 mentions this issue: runtime/pprof: assert that labels never appear on unexpected samples

@prattmic
Copy link
Member Author

prattmic commented Dec 7, 2021

https://golang.org/cl/369744 is expected to fix this. Will file a new bug if more issues come up.

@prattmic prattmic closed this as completed Dec 7, 2021
gopherbot pushed a commit that referenced this issue Dec 7, 2021
profBuf.write uses an index in b.tags for each entry, even if that entry
has no tag (that slice entry just remains 0). profBuf.read similarly
returns a tags slice with exactly as many entries as there are records
in data.

profileBuilder.addCPUData iterates through the tags in lockstep with the
data records. Except in the special case of the first record, where it
forgets to increment tags. Thus the first read of profiling data has all
tags off-by-one.

To help avoid regressions, addCPUData is changed to assert that tags
contains exactly the correct number of tags.

For #50007.

Change-Id: I5f32f93003297be8d6e33ad472c185d924a63256
Reviewed-on: https://go-review.googlesource.com/c/go/+/369741
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Dec 7, 2021
With https://golang.org/issue/50007 resolved, there are no known issues
with pprof labels remaining. Thus, the 10% allowed error in
TestLabelSystemstack should not be required.

Drop it in favor of an explicit assertion that all samples containing
labelHog are properly labeled.

This is no flaky in my local testing. It is possible that other bugs
will appear at larger testing scale, in which case this CL will be
reverted, but then at least we will be aware of additional failure
modes.

For #50007.

Change-Id: I1ef530c303bd9a01af649b8b08d4b35505e8aada
Reviewed-on: https://go-review.googlesource.com/c/go/+/369744
Reviewed-by: Bryan Mills <bcmills@google.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Copy link
Member

bcmills commented Dec 8, 2021

Will file a new bug if more issues come up.

Looks like there is at least one failure mode remaining — filed #50050.

gopherbot pushed a commit that referenced this issue Jan 19, 2022
This makes TestLabelSystemstack much more strict, enabling it to detect
any misplacement of labels.

Unfortunately, there are several edge cases where we may not have an
obviously correct stack trace, so we generally except the runtime
package, with the exception of background goroutines that we know should
not be labeled.

For #50007
For #50032

Change-Id: I8dce7e7da04f278ce297422227901efe52782ca0
Reviewed-on: https://go-review.googlesource.com/c/go/+/369984
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
This makes TestLabelSystemstack much more strict, enabling it to detect
any misplacement of labels.

Unfortunately, there are several edge cases where we may not have an
obviously correct stack trace, so we generally except the runtime
package, with the exception of background goroutines that we know should
not be labeled.

For golang#50007
For golang#50032

Change-Id: I8dce7e7da04f278ce297422227901efe52782ca0
Reviewed-on: https://go-review.googlesource.com/c/go/+/369984
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

4 participants