os/signal: TestAtomicStop failing on Illumos #35085
I was just looking at this. It doesn't, as far as I can tell, always fail -- just most of the time. Having watched the buildlet a bit in the last few days while it was getting going, the load averages in the zone get up pretty high. I think this is because the If |
On Linux, Perhaps the fix here is just to fix the Illumos implementation of NumCPU to be aware of the zone's CPU limit. |
In fact, I dealt with this just recently in https://go-review.googlesource.com/c/build/+/201637/7/cmd/rundockerbuildlet/rundockerbuildlet.go#162 where I had to restrict a Docker container's Perhaps we need a new knob that lets us control test parallelism without hard-coding it to mean /cc @bcmills |
If we could set, say |
@jclulow, you should probably still fix |
The hard-coded go/src/os/signal/signal_test.go Lines 449 to 456 in 606019c
See previously #29046. |
I don't see a compelling need to limit the delay here to 2 seconds, given that in the normal case the test will pass on every attempt. |
FWIW, we see this failure intermittently on gccgo for both ppc64le & ppc64. The limit was put at 2s to hopefully fix that problem but it still happens. We don't see the error in golang. |
If we pull in the fix for #28135, then we could fairly easily plumb the test deadline down to |
Change https://golang.org/cl/203499 mentions this issue: |
Previously, TestAtomicStop used a hard-coded 2-second timeout. That empirically is not long enough on certain builders. Rather than adjusting it to a different arbitrary value, use a slice of the overall timeout for the test binary. If everything is working, we won't block nearly that long anyway.

Updates #35085

Change-Id: I7b789388e3152413395088088fc497419976cf5c
Reviewed-on: https://go-review.googlesource.com/c/go/+/203499
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Still failing, but the failures take longer now:
Maybe an Or is it just overloaded enough that the signals take longer than ~15s each to deliver? (Do we need to set a timeout scale factor on this builder to give it more time?) |
I doubt it is a timeout issue. But if it is, we don't need to be fancy about the timeouts. Just set the timeout to one minute. |
It would seem that #35199 hasn't improved things. I tried running the test suite manually on the builder, using I've set up some basic tracing of signal-related system calls to get a better idea of what happens next time one of these runs and fails on the builder. |
This is proving difficult to debug. The only time the test seems to fail is when run under the buildlet at the behest of the coordinator. I have been trying, but I don't believe I can make it fail when run manually under I've tried using DTrace at various times to monitor the timing of signal-related system calls and delivery events, and it seems that the very slight probe effect is enough to make the situation worse -- but only in the buildlet context where the failures are pretty reliably happening now. Applying the tracing to standalone execution of the test cases doesn't seem to induce the failure there. I'd appreciate any advice for how folks would normally debug what seems like a pretty tight race in the runtime like this one. Also, as noted in #35261 it seems like the build is currently broken as of the switch to the new timer infrastructure, which is making it a bit difficult to make forward progress. |
Sometimes tests like this can be made to fail locally by putting the local system under heavy load while running the test. I don't have any reliable mechanism, though. I've looked at this test several times and I've never been able to find anything wrong with the code. It would be really great if you could figure this out. Thanks for looking at it. |
If it matters, the buildlet (the parent process that runs signal.Notify(c, syscall.SIGINT, syscall.SIGTERM) Would that affect the test? |
Thanks, but I can't think of any reason why that would matter. The test does its own |
I think fixing this is going to take a while. Can I submit a CL that skips it specifically on the current illumos buildlet so that we can at least see the general health of the rest of the test suite more clearly? |
OK, I'm pretty sure I know what this is now. The
The buildlet is running under SMF and as with many service supervision systems there is no controlling terminal or job control in that context. The SMF service for the buildlet is operating with the Examining the running buildlet, we can see that indeed
Looking at exec(2) we see:
I suspect that

#include <stdlib.h>
#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>

int
main(int argc, char *argv[])
{
	if (argc < 2) {
		errx(1, "usage: %s PROGRAM [ARGS...]", argv[0]);
	}

	if (argv[1][0] != '/') {
		errx(1, "PROGRAM must be fully qualified");
	}

	/*
	 * Ignore SIGINT so that the new process also ignores it.
	 */
	if (signal(SIGINT, SIG_IGN) == SIG_ERR) {
		err(1, "signal");
	}

	if (execv(argv[1], argv + 1) != 0) {
		err(1, "execv");
	}
}

For now, I'm going to try inserting the opposite of this wrapper program into the buildlet script; i.e., resetting the disposition of |
OK I've adjusted the SMF service to use the
A few test runs have now had a pass from |
I don't see why it should matter whether In general Go is prepared for |
It's fine to send a CL to skip the test on Illumos. Use |
I'm pretty sure this problem occurs on Linux too when

package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// atomicStopTestProgram is run in a subprocess by TestAtomicStop.
// It tries to trigger a signal delivery race. This function should
// either catch a signal or die from it.
func atomicStopTestProgram() {
	status := 0
	const tries = 10

	timeout := 2 * time.Second

	pid := syscall.Getpid()
	printed := false
	for i := 0; i < tries; i++ {
		cs := make(chan os.Signal, 1)
		signal.Notify(cs, syscall.SIGINT)

		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			signal.Stop(cs)
		}()

		syscall.Kill(pid, syscall.SIGINT)

		// At this point we should either die from SIGINT or
		// get a notification on cs. If neither happens, we
		// dropped the signal. It is given 2 seconds to
		// deliver, as needed for gccgo on some loaded test systems.

		select {
		case <-cs:
		case <-time.After(timeout):
			status = 1
			if !printed {
				fmt.Print("lost signal on tries:")
				printed = true
			}
			fmt.Printf(" %d", i)
		}

		wg.Wait()
	}
	if printed {
		fmt.Print("\n")
	}

	os.Exit(status)
}

func main() {
	atomicStopTestProgram()
}

I created a simple test harness to run the program over and over and report when it emits the error message:

#!/bin/bash

dir=$(cd "$(dirname "$0")" && pwd)

echo $dir/three

fail=0
pass=0
lastprint=$SECONDS

trap 'exit 1' INT

while :; do
	x=$($dir/ignore_sigint $dir/three 2>&1)
	if [[ $x =~ 'lost signal on tries:' ]]; then
		printf '    %s\n' "$x"
		(( fail += 1 ))
	else
		(( pass += 1 ))
	fi

	if (( SECONDS != lastprint )); then
		printf '%s pass %-4d fail %-4d\n' "$(date -u +%FT%TZ)" \
		    "$pass" "$fail"
		pass=0
		fail=0
		lastprint=$SECONDS
	fi
done

The On an illumos system, this obviously fails pretty readily:
But it also fails (though less often) on a Linux system:
|
Ah, I see it now. My apologies. When |
Thanks very much for digging into this. |
Change https://golang.org/cl/207081 mentions this issue: |
The newly revived Illumos builder (run by @jclulow now, on different host/OS probably) is now failing with:
https://build.golang.org/log/47c0329f33a7c7bd68dc98c35021160b03a3c6a5
/cc @ianlancetaylor @bcmills