Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: TestAtomicStop failing on Illumos #35085

Closed
bradfitz opened this issue Oct 22, 2019 · 27 comments

Comments

@bradfitz
Copy link
Member

@bradfitz bradfitz commented Oct 22, 2019

The newly revived Illumos builder (run by @jclulow now, on different host/OS probably) is now failing with:

https://build.golang.org/log/47c0329f33a7c7bd68dc98c35021160b03a3c6a5

ok  	os	0.809s
ok  	os/exec	0.869s
--- FAIL: TestAtomicStop (2.15s)
    signal_test.go:400: iteration 1: output lost signal on tries: 2
    signal_test.go:408: iteration 1: lost signal
FAIL
FAIL	os/signal	6.493s
ok  	os/user	0.020s
ok  	path	0.025s
FAIL
2019/10/22 14:36:24 Failed: exit status 1

/cc @ianlancetaylor @bcmills

@bradfitz bradfitz added this to the Go1.14 milestone Oct 22, 2019
@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Oct 22, 2019

I was just looking at this. It doesn't, as far as I can tell, always fail -- just most of the time. Having watched the buildlet a bit in the last few days while it was getting going, the load averages in the zone get up pretty high.

I think this is because the -p flag for job parallelism defaults to the number of CPUs, so the test runner starts that many tests in parallel. Because this is a zone (i.e., a kind of container) you can actually see (and schedule on!) every CPU in the host; there are 48 in all. Though all of the CPUs are visible, the zone is capped in the amount of CPU time it is able to use -- in effect, it's able to use 2 CPUs worth of CPU time per second.

If TestAtomicStop is timing sensitive, it's possible that when it fails it's running alongside rather a lot of other tests, and the ~24X over subscription might mean this test isn't on CPU to run soon enough to hit the timing window it's expecting. When I run the test suite on a standalone machine which doesn't have this property (i.e., a small system on my desk where I'm the only user) it seems to pass every time.

@bradfitz

This comment has been minimized.

Copy link
Member Author

@bradfitz bradfitz commented Oct 22, 2019

On Linux, runtime.NumCPU() looks at the number of actually usable CPUs.

Perhaps the fix here is just to fix the Illumos implementation of NumCPU to be aware of the zone's CPU limit.

@bradfitz

This comment has been minimized.

Copy link
Member Author

@bradfitz bradfitz commented Oct 22, 2019

In fact, I dealt with this just recently in https://go-review.googlesource.com/c/build/+/201637/7/cmd/rundockerbuildlet/rundockerbuildlet.go#162 where I had to restrict a Docker container's --cpuset-cpus value just to make runtime.NumCPU happy, even though it would've been nice to give the containers as much extra CPU as they wanted to fight over.

Perhaps we need a new knob that lets us control test parallelism without hard-coding it to mean NumCPU.

/cc @bcmills

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Oct 22, 2019

If we could set, say GO_BUILD_MAXJOBS, and have that override the -p default, I could definitely set that in the environment for the buildlet service to a value that makes sense for the specific memory and CPU caps in the particular zone.

@bradfitz

This comment has been minimized.

Copy link
Member Author

@bradfitz bradfitz commented Oct 22, 2019

@jclulow, you should probably still fix runtime.NumCPU meanwhile, though, if it's not too much work.

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Oct 23, 2019

The hard-coded time.After in this test is suspect:

// At this point we should either die from SIGINT or
// get a notification on cs. If neither happens, we
// dropped the signal. It is given 2 seconds to
// deliver, as needed for gccgo on some loaded test systems.
select {
case <-cs:
case <-time.After(2 * time.Second):

See previously #29046.

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Oct 23, 2019

I don't see a compelling need to limit the delay here to 2 seconds, given that in the normal case the test will pass on every attempt.

@laboger

This comment has been minimized.

Copy link
Contributor

@laboger laboger commented Oct 23, 2019

FWIW, we see this failure intermittently on gccgo for both ppc64le & ppc64. The limit was put at 2s to hopefully fix that problem but it still happens. We don't see the error in golang.

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Oct 23, 2019

If we pull in the fix for #28135, then we could fairly easily plumb the test deadline down to atomicStopTestProgram and remove the hard-coded timeout entirely. I'm not sure whether that would address the problem for gccgo or make the failure mode more frustrating, though.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 25, 2019

Change https://golang.org/cl/203499 mentions this issue: os/signal: derive TestAtomicStop timeout from overall test timeout

gopherbot pushed a commit that referenced this issue Oct 25, 2019
Previously, TestAtomicStop used a hard-coded 2-second timeout.

That empirically is not long enough on certain builders. Rather than
adjusting it to a different arbitrary value, use a slice of the
overall timeout for the test binary. If everything is working, we
won't block nearly that long anyway.

Updates #35085

Change-Id: I7b789388e3152413395088088fc497419976cf5c
Reviewed-on: https://go-review.googlesource.com/c/go/+/203499
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Oct 25, 2019

Still failing, but the failures take longer now:

--- FAIL: TestAtomicStop (30.66s)
    signal_test.go:420: iteration 4: output lost signal on tries: 3
    signal_test.go:428: iteration 4: lost signal
    signal_test.go:420: iteration 7: output lost signal on tries: 4
    signal_test.go:428: iteration 7: lost signal
FAIL
FAIL	os/signal	35.166s

Maybe an illumos kernel bug? (Is it dropping signals under load?)

Or is it just overloaded enough that the signals take longer than ~15s each to deliver? (Do we need to set a timeout scale factor on this builder to give it more time?)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 26, 2019

I doubt it is a timeout issue. But if it is, we don't need to be fancy about the timeouts. Just set the timeout to one minute.

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Oct 27, 2019

@jclulow, you should probably still fix runtime.NumCPU meanwhile, though, if it's not too much work.

I've opened #35199 for this.

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Oct 29, 2019

It would seem that #35199 hasn't improved things. I tried running the test suite manually on the builder, using make.bash and then run.bash and the tests appear to pass -- including os/signal.

I've set up some basic tracing of signal-related system calls to get a better idea of what happens next time one of these runs and fails on the builder.

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Nov 1, 2019

This is proving difficult to debug. The only time the test seems to fail is when run under the buildlet at the behest of the coordinator.

I have been trying, but I don't believe I can make it fail when run manually under all.bash or run.bash, or when using go tool dist test to run the os/signal test more directly, or when using curl to attempt to replicate (on an isolated buildlet) what I believe the coordinator is asking us to do. I also tried lifting the atomicStopTestProgram() routine out into a standalone program (trimming out the test harness-related bits) and I don't believe I've been able to observe that failing either, despite running it many thousands of times under different levels of system load.

I've tried using DTrace at various times to monitor the timing of signal-related system calls and delivery events, and it seems that the very slight probe effect is enough to make the situation worse -- but only in the buildlet context where the failures are pretty reliably happening now. Applying the tracing to standalone execution of the test cases doesn't seem to induce the failure there.

I'd appreciate any advice for how folks would normally debug what seems like a pretty tight race in the runtime like this one. Also, as noted in #35261 it seems like the build is currently broken as of the switch to the new timer infrastructure, which is making it a bit difficult to make forward progress.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 1, 2019

Sometimes tests like this can be made to fail locally by putting the local system under heavy load while running the test. I don't have any reliable mechanism, though.

I've looked at this test several times and I've never been able to find anything wrong with the code. It would be really great if you could figure this out. Thanks for looking at it.

@bradfitz

This comment has been minimized.

Copy link
Member Author

@bradfitz bradfitz commented Nov 1, 2019

If it matters, the buildlet (the parent process that runs go tool dist test XXXXX) does:

        signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)

Would that affect the test?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 1, 2019

Thanks, but I can't think of any reason why that would matter. The test does its own signal.Notify call and that should be all that matters for it.

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Nov 4, 2019

I think fixing this is going to take a while. Can I submit a CL that skips it specifically on the current illumos buildlet so that we can at least see the general health of the rest of the test suite more clearly?

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Nov 4, 2019

OK, I'm pretty sure I know what this is now. The stage0 binary is being run by a simple shell script that just starts it over and over again. That shell script is using bash as the shell. From the SIGNALS section of bash(1):

When job control is not in effect, asynchronous commands ignore SIGINT and SIGQUIT in addition to these inherited handlers.

The buildlet is running under SMF and as with many service supervision systems there is no controlling terminal or job control in that context. The SMF service for the buildlet is operating with the contract lifetime, and expects a command to execute that will terminate once the service has been started. As stage0 does not daemonise itself, we've taken the simple option of running the script that starts stage0 in a loop as buildlet & -- the & makes it an asynchronous command in bash parlance.

Examining the running buildlet, we can see that indeed SIGINT is ignored:

[root@gobuild3 ~]# pgrep -fl buildlet.exe
99253 ./buildlet.exe --halt=false --reverse-type=host-illumos-amd64-jclulow --coordin
[root@gobuild3 ~]# pgrep buildlet.exe | xargs -t psig | grep ignored
psig 99253 
INT     ignored

Looking at exec(2) we see:

Signals set to the default action (SIG_DFL) in the calling process image are set to the default action in the new process image (see signal(3C)). Signals set to be ignored (SIG_IGN) by the calling process image are set to be ignored by the new process image.

I suspect that SIGINT being ignored by default is contrary to the expectations of at least this test, if not the Go signal handling machinery in general. Using a simple C program I can now reproduce the failure:

#include <stdlib.h>
#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>


int
main(int argc, char *argv[])
{
        if (argc < 2) {
                errx(1, "usage: %s PROGRAM [ARGS...]", argv[0]);
        }

        if (argv[1][0] != '/') {
                errx(1, "PROGRAM must be fully qualified");
        }

        /*
         * Ignore SIGINT so that the new process also ignores it.
         */
        if (signal(SIGINT, SIG_IGN) == SIG_ERR) {
                err(1, "signal");
        }

        if (execv(argv[1], argv + 1) != 0) {
                err(1, "execv");
        }
}
$ cd /ws/go/master/src
$ PATH=/ws/go/master/bin:$PATH /ws/go/signals/ignore_sigint $PWD/run.bash -run go_test:os/signal 

##### Building packages and commands.

##### Testing packages.
--- FAIL: TestAtomicStop (16.04s)
    signal_test.go:420: iteration 6: output lost signal on tries: 1
    signal_test.go:428: iteration 6: lost signal
FAIL
FAIL    os/signal       20.574s
FAIL
2019/11/03 18:03:31 Failed: exit status 1

For now, I'm going to try inserting the opposite of this wrapper program into the buildlet script; i.e., resetting the disposition of SIGINT to SIG_DFL prior to calling execv().

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Nov 4, 2019

OK I've adjusted the SMF service to use the child (aka wait) duration for the service, and we're executing the stage0 binary directly. The wait duration seems a better fit for the way stage0 is intended to work anyway; from svc.startd(1M):

"Wait" model services are restarted whenever the child process associated
with the service exits. A child process that exits is not considered an
error for "wait" model services, and repeated failures do not lead to a
transition to maintenance state. However, a wait service which is
repeatedly exiting with an error that exceeds the default rate (5
failures/second) will be throttled back so that the service only restarts
once per second.

A few test runs have now had a pass from os/signal.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 5, 2019

I don't see why it should matter whether SIGINT is ignored when the process starts. The program run by TestAtomicStop (atomicStopTestProgram) calls Notify(cs, syscall.SIGINT), which should enable the signal even if it were previously ignored.

In general Go is prepared for SIGINT to be ignored when the program starts. This is checked explicitly in sigInstallGoHandler in runtime/signal_unix.go.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 5, 2019

It's fine to send a CL to skip the test on Illumos. Use testenv.SkipFlaky from the internal/testenv package.

@jclulow

This comment has been minimized.

Copy link
Contributor

@jclulow jclulow commented Nov 9, 2019

I'm pretty sure this problem occurs on Linux too when SIGINT is ignored, though it happens less often. I've lifted atomicStopTestProgram into a separate program:

package main

import (
        "fmt"
        "os"
        "os/signal"
        "sync"
        "syscall"
        "time"
)

// atomicStopTestProgram is run in a subprocess by TestAtomicStop.
// It tries to trigger a signal delivery race. This function should
// either catch a signal or die from it.
func atomicStopTestProgram() {
        status := 0
        const tries = 10

        timeout := 2 * time.Second

        pid := syscall.Getpid()
        printed := false
        for i := 0; i < tries; i++ {
                cs := make(chan os.Signal, 1)
                signal.Notify(cs, syscall.SIGINT)

                var wg sync.WaitGroup
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        signal.Stop(cs)
                }()

                syscall.Kill(pid, syscall.SIGINT)

                // At this point we should either die from SIGINT or
                // get a notification on cs. If neither happens, we
                // dropped the signal. It is given 2 seconds to
                // deliver, as needed for gccgo on some loaded test systems.

                select {
                case <-cs:
                case <-time.After(timeout):
                        status = 1
                        if !printed {
                                fmt.Print("lost signal on tries:")
                                printed = true
                        }
                        fmt.Printf(" %d", i)
                }

                wg.Wait()
        }
        if printed {
                fmt.Print("\n")
        }

        os.Exit(status)
}

func main() {
        atomicStopTestProgram()
}

I created a simple test harness to run the program over and over and report when it emits the error message:

#!/bin/bash

dir=$(cd "$(dirname "$0")" && pwd)
echo $dir/three

fail=0
pass=0
lastprint=$SECONDS

trap 'exit 1' INT

while :; do
        x=$($dir/ignore_sigint $dir/three 2>&1)
        if [[ $x =~ 'lost signal on tries:' ]]; then
                printf '    %s\n' "$x"
                (( fail += 1 ))
        else
                (( pass += 1 ))
        fi

        if (( SECONDS != lastprint )); then
                printf '%s pass %-4d fail %-4d\n' "$(date -u +%FT%TZ)" \
                    "$pass" "$fail"
                pass=0
                fail=0
                lastprint=$SECONDS
        fi
done

The ignore_sigint program is the C program from earlier in the thread which ignores SIGINT before starting the nominated program. The binary three above is the atomicStopTestProgram program.

On an illumos system, this obviously fails pretty readily:

$ uname -a
SunOS newcastle 5.11 illumos-c5c712a873 i86pc i386 i86pc

$ ./harness 
/ws/go/signals/three
    lost signal on tries: 1
2019-11-09T16:52:08Z pass 5    fail 1   
    lost signal on tries: 4
2019-11-09T16:52:10Z pass 0    fail 1   
    lost signal on tries: 1
2019-11-09T16:52:12Z pass 16   fail 1   
    lost signal on tries: 2
2019-11-09T16:52:14Z pass 6    fail 1   
    lost signal on tries: 2
2019-11-09T16:52:16Z pass 14   fail 1   
    lost signal on tries: 2 5
2019-11-09T16:52:20Z pass 1    fail 1   
    lost signal on tries: 2
2019-11-09T16:52:22Z pass 1    fail 1   
^C

But it also fails (though less often) on a Linux system:

$ uname -a
Linux leftstation 5.0.0-31-generic #33-Ubuntu SMP Mon Sep 30 18:51:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ ./harness 
/var/tmp/signals/three
2019-11-09T16:49:23Z pass 127  fail 0   
2019-11-09T16:49:24Z pass 138  fail 0   
2019-11-09T16:49:25Z pass 139  fail 0   
2019-11-09T16:49:26Z pass 481  fail 0   
2019-11-09T16:49:27Z pass 665  fail 0   
2019-11-09T16:49:28Z pass 642  fail 0   
2019-11-09T16:49:29Z pass 667  fail 0   
2019-11-09T16:49:30Z pass 664  fail 0   
2019-11-09T16:49:31Z pass 669  fail 0   
2019-11-09T16:49:32Z pass 669  fail 0   
2019-11-09T16:49:33Z pass 645  fail 0   
2019-11-09T16:49:34Z pass 664  fail 0   
2019-11-09T16:49:35Z pass 668  fail 0   
2019-11-09T16:49:36Z pass 663  fail 0   
2019-11-09T16:49:37Z pass 670  fail 0   
2019-11-09T16:49:38Z pass 661  fail 0   
2019-11-09T16:49:39Z pass 667  fail 0   
2019-11-09T16:49:40Z pass 668  fail 0   
2019-11-09T16:49:41Z pass 664  fail 0   
2019-11-09T16:49:42Z pass 671  fail 0   
2019-11-09T16:49:43Z pass 663  fail 0   
2019-11-09T16:49:44Z pass 668  fail 0   
2019-11-09T16:49:45Z pass 669  fail 0   
2019-11-09T16:49:46Z pass 669  fail 0   
2019-11-09T16:49:47Z pass 663  fail 0   
2019-11-09T16:49:48Z pass 670  fail 0   
2019-11-09T16:49:49Z pass 669  fail 0   
2019-11-09T16:49:50Z pass 671  fail 0   
2019-11-09T16:49:51Z pass 650  fail 0   
2019-11-09T16:49:52Z pass 654  fail 0   
2019-11-09T16:49:53Z pass 653  fail 0   
2019-11-09T16:49:54Z pass 640  fail 0   
2019-11-09T16:49:55Z pass 556  fail 0   
2019-11-09T16:49:56Z pass 576  fail 0   
2019-11-09T16:49:57Z pass 535  fail 0   
    lost signal on tries: 2
2019-11-09T16:49:59Z pass 351  fail 1   
2019-11-09T16:50:00Z pass 122  fail 0   
2019-11-09T16:50:01Z pass 500  fail 0   
2019-11-09T16:50:02Z pass 501  fail 0   
2019-11-09T16:50:03Z pass 517  fail 0   
2019-11-09T16:50:04Z pass 626  fail 0   
2019-11-09T16:50:05Z pass 658  fail 0   
2019-11-09T16:50:06Z pass 657  fail 0   
2019-11-09T16:50:07Z pass 659  fail 0   
2019-11-09T16:50:08Z pass 641  fail 0   
2019-11-09T16:50:09Z pass 660  fail 0   
2019-11-09T16:50:10Z pass 623  fail 0   
    lost signal on tries: 1
2019-11-09T16:50:12Z pass 65   fail 1   
2019-11-09T16:50:13Z pass 316  fail 0   
2019-11-09T16:50:14Z pass 480  fail 0   
2019-11-09T16:50:15Z pass 550  fail 0   
2019-11-09T16:50:16Z pass 594  fail 0   
2019-11-09T16:50:17Z pass 578  fail 0   
2019-11-09T16:50:18Z pass 597  fail 0   
2019-11-09T16:50:19Z pass 588  fail 0   
2019-11-09T16:50:20Z pass 626  fail 0   
2019-11-09T16:50:21Z pass 640  fail 0   
2019-11-09T16:50:22Z pass 640  fail 0   
2019-11-09T16:50:23Z pass 603  fail 0   
2019-11-09T16:50:24Z pass 594  fail 0   
2019-11-09T16:50:25Z pass 631  fail 0   
2019-11-09T16:50:26Z pass 638  fail 0   
2019-11-09T16:50:27Z pass 630  fail 0   
2019-11-09T16:50:28Z pass 569  fail 0   
2019-11-09T16:50:29Z pass 634  fail 0   
2019-11-09T16:50:30Z pass 654  fail 0   
^C
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 14, 2019

Ah, I see it now. My apologies. When SIGINT is ignored at program start, then when we close the channel we reset the SIGINT handler back to SIG_IGN. If we manage to do that before sending the signal, then it will be ignored.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 14, 2019

Thanks very much for digging into this.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Nov 14, 2019

Change https://golang.org/cl/207081 mentions this issue: os/signal: don't ignore SIGINT in TestAtomicStop child process

@gopherbot gopherbot closed this in 3f21c23 Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.