Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: tests start failing since migration of AIX builder #45170

Open
Helflym opened this issue Mar 23, 2021 · 17 comments
Open

runtime/pprof: tests start failing since migration of AIX builder #45170

Helflym opened this issue Mar 23, 2021 · 17 comments

Comments

@Helflym
Copy link
Contributor

@Helflym Helflym commented Mar 23, 2021

We recently moved the AIX VM into a Power9 server, in order to allow P9 builds in a near future.
For now, it's still using P8 instructions only, thus nothing should have change for now.

However, runtime/pprof tests start failing:
https://build.golang.org/log/aa79ac6509bea68e2e34cb50ac83b454544bdcd6
https://build.golang.org/log/9b24ed78ebd9a7e3bc3fcecb4e7504b6865b545b
...

I have to investigate them.

@laboger Did you already add this kind of bug or it's AIX related ?

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 23, 2021

This is related to a change in the configuration we made. We increased the number of possible threads by CPU (smt) from 4 to 8. It seems that pprof is loosing some samples because of that.

@laboger do you have an equivalent on Linux Power of SMT ? I guess yes, and if this case do you have any problems with 8 threads by CPU ?

Note that I've revert temporary to SMT 4 for the builder, to avoid blocking builds too much.

@cagedmantis cagedmantis added this to the Backlog milestone Mar 23, 2021
@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Mar 23, 2021

@aixtools
Copy link

@aixtools aixtools commented Mar 24, 2021

Hi. Just a small idea - as I saw in the stats: total 0 CPU profile samples collected

In the past, when using a tool such as hpmstat that uses processor_type bound stats - changing to POWER8 from POWER7 caused me to review the differences (especially new names) - using pmlist -l and `pmlist -p POWERX -c -1)

  • The output is not really legible - so just a line-count - for an indication of what it might be:
# bash
# bash-4.4# diff -u <(pmlist -p POWER8 -c -1) <(pmlist -p POWER9 -c -1) | wc -l
3573
@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 24, 2021

I'm not sure the Golang runtime/pprof package is using the same technology than hpmstat. If I remember correctly, it's using the SIGPROF signal and DWARF information to get where we are.

@aixtools
Copy link

@aixtools aixtools commented Mar 24, 2021

It was just a guess - but when I saw 0 samples my first thought was on names/numbers of stats. Sadly, this is an area where binary compatibility has failed (imho).

It was also an issue with PowerVP getting stats from the VIOS (when it was AIX 6.1 based) and moved from POWERX to POWERX+1.

In short: performance stats being broken is a recurring thing when PROC_TYPE goes up.

@laboger
Copy link
Contributor

@laboger laboger commented Mar 24, 2021

@Helflym I built and tested on a Linux P9 LE system with SMT=8 and it works without error. I understand SMT=8 is the default for P9 so most likely that is what the P9 LE builders have.

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 24, 2021

Ok, we might remove Power8 compat mode and add SMT=8 again. But I just want to be sure that with POWER8 compat mode and SMT=4 it's working too.
My local test are ok, but I'm waiting for a few builds made by the farm to be sure.

@laboger
Copy link
Contributor

@laboger laboger commented Mar 24, 2021

/cc @pmur

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 25, 2021

Even with SMT=4 it seems to be broken but only for remote builds. My local builds seems to be passing without any troubles.
We will reboot in Power9 mode to see if there is any improvements.
Otherwise, any help would be welcomed as I have no idea about what's happening.

@laboger
Copy link
Contributor

@laboger laboger commented Mar 25, 2021

@Helfsym From the output it almost seems like the timer interval is not being set correctly or is being set in a way so it results in fewer samples. I see that AIX has a different way to do the setitimer than linux. I don't know if that helps.

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 26, 2021

@laboger yes but that's normal. AIX syscalls are a bit special as we must pass through libc calls.

Anyway, we have move back to Power9 and the tests are still failing.
Moreover, it looks like the problem only appears when several tests are run in parallel. I can't manage to make TestMathBigDivide failing when launching it alone unlike when it's run as part of the whole runtime/pprof tests. Is the sample area used by the runtime common to all tests?

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 30, 2021

I've been able to investigate a bit more what's going on.
All of the SIGPROF are being delivered correctly, but they appear to be deliver to a sleeping thread. The pc passed to sigprof is part of nsleep syscall. In sigprof, they ends up in https://github.com/golang/go/blob/master/src/runtime/proc.go#L4444.
This happens a lot of more when the VM isn't busy doing something else. When I've another build of Go running in parallel, most of the tests are passing (I don't want say "every tests" but it's very unlikely to have a test failing when the VM is busy.)

Moreover, looking at the syscalls being launched, I've a lot (like 1/3, 1/2 of the whole syscall for one test) being osyield. These calls might be behind the nsleep caught in sigprof.
I don't know what it's calling them, as it's a bit difficult to reproduce the same execution queue inside gdb. It looks like some are coming for runtime.suspendG in runtime.markroot. Linux/ppc64 have a few of them but not as much as AIX, for what I've checked.

I'll continue to investigate a bit. But if I'm finding nothing relevant, I'll push a temporary patch skipping runtime/pprof tests to release the builder.

@laboger
Copy link
Contributor

@laboger laboger commented Mar 30, 2021

@Helflym I don't know if this could be relevant, but have you looked at the implementation of procyield in runtime/asm_ppc64x.s? Carlos put that in before AIX was added to Go. Maybe those settings are affecting AIX differently on P9.

@Helflym
Copy link
Contributor Author

@Helflym Helflym commented Mar 30, 2021

@laboger yeap, I've seen that and it seems to be common to everyone. But I've asked for a confirmation if this is indeed working in AIX too.

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 1, 2021

Change https://golang.org/cl/306489 mentions this issue: runtime/pprof: skip tests for AIX

gopherbot pushed a commit that referenced this issue Apr 4, 2021
Most of the time, the pprof tests are passing, except
for the builder. The reason is still unknown but I'd rather release
the builder to avoid missing other more important bugs.

Updates #45170

Change-Id: I667543ee1ae309b7319c5b3676a0901b4d0ecf2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/306489
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
@gopherbot
Copy link

@gopherbot gopherbot commented May 5, 2021

Change https://golang.org/cl/317369 mentions this issue: [release-branch.go1.15] runtime/pprof: skip tests for AIX

@gopherbot
Copy link

@gopherbot gopherbot commented May 5, 2021

Change https://golang.org/cl/317297 mentions this issue: [release-branch.go1.16] runtime/pprof: skip tests for AIX

gopherbot pushed a commit that referenced this issue May 5, 2021
Most of the time, the pprof tests are passing, except
for the builder. The reason is still unknown but I'd rather release
the builder to avoid missing other more important bugs.

Updates #45170

Change-Id: I667543ee1ae309b7319c5b3676a0901b4d0ecf2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/306489
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
(cherry picked from commit 7bfd681)
Reviewed-on: https://go-review.googlesource.com/c/go/+/317297
Trust: Heschi Kreinick <heschi@google.com>
Run-TryBot: Heschi Kreinick <heschi@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit that referenced this issue May 5, 2021
Most of the time, the pprof tests are passing, except
for the builder. The reason is still unknown but I'd rather release
the builder to avoid missing other more important bugs.

Updates #45170

Change-Id: I667543ee1ae309b7319c5b3676a0901b4d0ecf2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/306489
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
(cherry picked from commit 7bfd681)
Reviewed-on: https://go-review.googlesource.com/c/go/+/317369
Reviewed-by: Heschi Kreinick <heschi@google.com>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants