build: all.bash on current Go tip can sometimes delete the entire repo on Linux #13789
It also happened to me. I think there are two separate issues here: 1) ../misc/cgo/testcarchive is flaky, and 2) when it fails, the entire repo directory is deleted (bye bye pending CLs ...). Or perhaps moved elsewhere although I did not find where.
Based on some tedious testing, I think this was introduced in commit baa928a, 'cmd/dist: run various one-off tests in parallel'. There is clearly some kind of race here, as this often doesn't reproduce for me if the machine is under some other additional load.
I confirm this is a side effect of commit baa928a, although I think the root cause is related to
the way testcarchive, testcshared and testshared are written.
The one failing is testcarchive (sometimes), but because of commit baa928a,
testcshared and testshared are already running in parallel at that time. When all.bash stops, reporting the failure for testcarchive, there are still running processes from testcshared and testshared. It is the cleaning mechanism of these test cases which can wipe out the whole repository.
I can reproduce the problem at will by tweaking dist to skip some tests (in order to accelerate each run), and creating a symbolic link on the repository. The purpose of the link is to keep the directory after the problem happens (i.e. only the link is deleted, instead of the whole directory).
Still, the investigation is painful. Adding traces or using monitoring tools tend to prevent the race condition to occur. Initially, I suspected the testshared test case, since it contains some scary lines such as:
defer os.RemoveAll(gorootInstallDir)
But suprinsingly, it happens the repository is deleted by the test.bash script in testcshared.
function cleanup() {
rm -f libgo.$libext libgo2.$libext libgo4.$libext libgo.h libgo4.h
rm -f testp testp2 testp3 testp4
rm -rf pkg $(go env GOROOT)/${installdir}
if [ "$goos" == "android" ]; then
adb shell rm -rf $androidpath
fi
}
trap cleanup EXIT
It appears that, when running concurrently with the other test cases, the go env GOROOT command sometimes returns:
/home/vagrant/local/go ok
instead of
/home/vagrant/local/go
The third rm command is therefore expanded as:
rm -rf pkg /home/vagrant/local/go ok/pkg/linux_amd64_testcshared_shared
which results in the repository being wiped out.
I don't understand why the output of go env is altered, but it really is. I checked this point by replacing the rm by an echo.
I will prepare a CL to make this test case a bit more robust, at least to avoid the whole repository to be deleted. It does not address the flakiness of testcarchive.
I don't believe that go env GOROOT can ever return a value like /home/vagrant/local/go ok (unless of course the GOROOT environment variable is somehow set to /home/vagrant/local/go ok). I don't think you've isolated the root cause of this problem.
Let's try to find out exactly which go is running in $(go env GOROOT). I don't know where the "ok" is coming from, and I'm not comfortable just ignoring that.
@dspezia Thank you for digging into this, and apologies to anyone for deleted work.
When all.bash stops, reporting the failure for testcarchive, there are still running processes from testcshared and testshared.
This is not supposed to be true. I will send a CL fixing this. all.bash (go tool dist test) is supposed to wait for all its subprocesses before exiting.
I still don't know why go env GOROOT would be printing the wrong output. Either the go command being run is badly broken or $GOROOT is not set properly.
I agree that this is still very mysterious (at least to me).
In misc/cgo/testcshared/test.bash, I have replaced:
rm -rf pkg $(go env GOROOT)/${installdir}
by:
echo >>didier $(go env GOROOT)
echo >>didier $(go env GOROOT)
echo >>didier $(go env GOROOT)
echo >>didier $(go env GOROOT)
echo >>didier $(which go)
echo >>didier GOROOT=$GOROOT
Furthermore, I have the following patch to ensure testcarchive does fail:
diff --git a/misc/cgo/testcarchive/main2.c b/misc/cgo/testcarchive/main2.c
index 39e39c4..4bfdd8f 100644
--- a/misc/cgo/testcarchive/main2.c
+++ b/misc/cgo/testcarchive/main2.c
@@ -181,5 +181,6 @@ int main(int argc, char** argv) {
TestSEGV();
printf("PASS\n");
+ exit(EXIT_FAILURE);
return 0;
}
And here is the content of file "didier" after all.bash:
/home/vagrant/local/go ok
/home/vagrant/local/go
/home/vagrant/local/go
/home/vagrant/local/go
/home/vagrant/local/go/bin/go
GOROOT=/home/vagrant/local/go
I said that the go env output is altered, but actually, it could also be a bash expansion bug. The fact this command runs in a shell trap function could also be a factor.
I made more tests, but it makes less and less sense.
Case A - with:
# echo >>didier cleanup
# ( echo '>>>' ; go env GOROOT ; echo '<<<' ) >>didier
# ( echo '>>>TTT' ; go env GOROOT ; echo '<<<' ) | tee >>didier
# go env GOROOT | cat >>didier
echo >>didier xxx $(go env GOROOT)
echo >>didier xxx $(which go)
echo >>didier xxx GOROOT=$GOROOT
I can reproduce (and consistently, on several runs):
xxx /home/vagrant/local/go ok
xxx /home/vagrant/local/go/bin/go
xxx GOROOT=/home/vagrant/local/go
Case B - with:
echo >>didier cleanup
# ( echo '>>>' ; go env GOROOT ; echo '<<<' ) >>didier
# ( echo '>>>TTT' ; go env GOROOT ; echo '<<<' ) | tee >>didier
# go env GOROOT | cat >>didier
echo >>didier xxx $(go env GOROOT)
echo >>didier xxx $(which go)
echo >>didier xxx GOROOT=$GOROOT
I cannot reproduce:
cleanup
xxx /home/vagrant/local/go
xxx /home/vagrant/local/go/bin/go
xxx GOROOT=/home/vagrant/local/go
Case C - with:
echo >>didier cleanup
# ( echo '>>>' ; go env GOROOT ; echo '<<<' ) >>didier
# ( echo '>>>TTT' ; go env GOROOT ; echo '<<<' ) | tee >>didier
go env GOROOT | cat >>didier
echo >>didier xxx $(go env GOROOT)
echo >>didier xxx $(which go)
echo >>didier xxx GOROOT=$GOROOT
I cannot reproduce:
cleanup
/home/vagrant/local/go
xxx /home/vagrant/local/go
xxx /home/vagrant/local/go/bin/go
xxx GOROOT=/home/vagrant/local/go
Case D - with:
echo >>didier cleanup
( echo '>>>' ; go env GOROOT ; echo '<<<' ) >>didier
( echo '>>>TTT' ; go env GOROOT ; echo '<<<' ) | tee >>didier
go env GOROOT | cat >>didier
echo >>didier xxx $(go env GOROOT)
echo >>didier xxx $(which go)
echo >>didier xxx GOROOT=$GOROOT
I cannot reproduce:
cleanup
>>>
/home/vagrant/local/go
<<<
>>>TTT
/home/vagrant/local/go
<<<
/home/vagrant/local/go
xxx /home/vagrant/local/go
xxx /home/vagrant/local/go/bin/go
xxx GOROOT=/home/vagrant/local/go
The first echo (cleanup) seems to prevent the problem to happen. But when I remove it, I have yet another strange behavior: nothing is printed anymore in the file (file size = 0 at the end) for case B, C, and D.
In all my runs, when I can reproduce the problem, the extra characters are always " ok", not some random garbage..
I can also produce the problem by adding a sleep of a few seconds just before the call to go env. Apparently, the starting time of the trap function does not matter. I don't really see how the problem could be due to a bad interaction with some other test case running in parallel, or a race condition (as I initially supposed).
So far, my only explanation is a weird bash bug.
I thought I had a nice simple convincing explanation for this, then I made the mistake of doing more tests. Now I have no idea what is going on and what Bash is doing, and unfortunately I can't reproduce this under strace so I can't tell what is really happening.
The trigger condition for this seems to be testcshared/test.bash running and not having done the $(go env GOROOT) before the main test driver process exits. Introducing an artificial sleep before that point makes it fail reliably (the sleep can be in cleanup() or earlier in the main body). It doesn't have to be go env running here specifically; I've substituted running a C program that simply prints out a hardcoded path and the same failure happens. Only the first $(...) is affected; additional ones give the correct results.
In the current setup of testcshared/test.bash, the spurious ok actually comes from the echo "ok" at the end of the script that is used to mark success. You can verify this by changing the echo to something else and having the something else show up as the spurious output. However, if I change the plain echo (which invokes the bash builtin echo) to eg /bin/echo, I once got recycled output from elsewhere entirely, in my case from bits of testcarchive's failure output. I don't seem to be able to reproduce this now; instead /bin/echo just makes everything work.
(If the output was always from within testcshared/test.bash, I could convince myself that Bash was just incorrectly recycling IO buffers under some weird situation. However, it at least once seemed to be getting IO data from a completely different process and context through some crazy method. Or this stubbornly refuses to reproduce under strace.)go env GOROOT is, I suppose; I can't tell one way or another without reliable strace results, and
Doing a great many changes to testcshared/test.bash perturb this issue into vanishing. Producing redirected output early on in the script seems to. Adding a trap "" SIGHUP to the start of cleanup() also seemed to make things go away (at one point I was operating under the theory that go tool dist test exiting was causing the lingering test.bash to get a SIGHUP while it was already in cleanup(), confusing Bash).
The whole situation worries me, but I don't know if there's anything more that can be done besides hardening the scripts against crazy input and making sure that all running test processes stop before the main process exits. (I suspect that the Bash people are not going to be interested in looking at this and trying to figure out what's going on.)
We cannot run bash script on windows. So we would have to get rid of the script if we want this functionality tested on windows anyway.
Alex
@siebenmann I want to clear about this: the extraneous "ok" you are seeing in the cleanup function in misc/cgo/testcshared/test.bash is actually coming from the echo "ok" at the bottom of that same script?
That seems especially incomprehensible, because then I don't see why the bug would be triggered by the fact that there are other scripts running at the same time in different processes.
@ianlancetaylor Yes, definitely. If I change the echo "ok" to eg echo "HI THERE" I see HI THERE in the cleanup function instead of ok. Since this bug doesn't trigger unless testcarchive fails, I'm assuming that it's something that happens to the bash process when it's left running after the go tool dist test process exits. Some testing suggests that test.bash is not executing anything past the echo "ok" in this case. Maybe it's getting an EPIPE on the write and things are becoming tangled?
In fact yes. I've finally managed to capture a strace, and what appears to happen is this:
- bash tries to write the
echo "ok"to stdout in the main process. It gets a SIGPIPE and this immediately triggerscleanup()being called. However, this write to stdout appears to be flagged as incomplete and pending. - in
cleanup(), bash forks to run$(...)in a child. This child appears to inherit the incomplete and needs-to-be-flushed stdout write, because at the end (after runninggoand just before it exits) the child doeswrite(1, "ok\n", 3). This succeeds since the child's fd 1 is now connected to the main bash as part of a pipe. - the main process reads
/some/path\nandok\nfrom the pipe for$(...)(in two separate reads). - the main process sets up to write to my logfile from my addition of
echo >>/tmp/logfile xxx $(go env GOROOT)incleanup(). Just before it makes the fd for /tmp/logfile into fd 1 temporarily, it also trieswrite(1, "ok\n", 3); however, as fd 1 is still connected to the (closed) stdout pipe, this gets EPIPE. No further SIGPIPE is generated, presumably because it's been masked at this point.
After 4 nothing attempts any further write(1, "ok\n", 3) calls, presumably because the pending buffer has been flushed (via an error) even in the main process. One sign that this theory is correct is that if you do two $(...) invocations before doing any writes to stdout in the main process, you get the extra ok in each of them, which is exactly what you'd expect if children are inheriting an unflushed buffer and then each of them flushes it.
If it helps, I've had this happen on Bash 4.3.42 (Fedora 22 and 23 builds) and Bash 4.3.11 (Ubuntu 14.04 LTS), all on 64-bit Linux.
Here is the simple reproduction of this to show that it is purely a bash bug:
#!/bin/bash
function cleanup() {
r1=$(/bin/echo one)
r2=$(/bin/echo two)
echo >>/tmp/logout $r1 '!' $r2
}
trap cleanup EXIT
sleep 1
echo finalPut this in, say, /tmp/repro and make it executable, then run /tmp/repro | false. What winds up in /tmp/logout is the clearly incorrect one final ! two final. If you make the echo be to standard error via 1>&2, you actually get an extra final written before the actual echo output.
I will see about reporting this to the Bash people.
Very nice analysis. Thanks for pushing it through. Closing the Go issue.
@siebenmann Did you ever notify the bash developers? If so, do you have a link to the ticket?
I submitted it to the bug-bash mailing list; the resulting thread is here. Based on Chet Ramey's messages, fixes have been put into the current Bash source, although I don't know if that's made it to any released version.
If I repeatedly build the current git tip with all.bash, some but not all of the time it will wind up deleting the entire repo. The specific point at which this seems to happen during testing is:
I think that the first commit that this started happening at is 8d5ff2e, but I'm not completely sure since this only happens some of the time.
My machine is 64-bit Fedora 22 Linux with a quad core processor.