build: all.bash on current Go tip can sometimes delete the entire repo on Linux #13789
Comments
|
/cc @rsc who did that commit. |
|
It also happened to me. I think there are two separate issues here: 1) ../misc/cgo/testcarchive is flaky, and 2) when it fails, the entire repo directory is deleted (bye bye pending CLs ...). Or perhaps moved elsewhere although I did not find where. |
|
Based on some tedious testing, I think this was introduced in commit baa928a, 'cmd/dist: run various one-off tests in parallel'. There is clearly some kind of race here, as this often doesn't reproduce for me if the machine is under some other additional load. |
|
I confirm this is a side effect of commit baa928a, although I think the root cause is related to The one failing is testcarchive (sometimes), but because of commit baa928a, I can reproduce the problem at will by tweaking dist to skip some tests (in order to accelerate each run), and creating a symbolic link to the repository. The purpose of the link is to keep the directory after the problem happens (i.e. only the link is deleted, instead of the whole directory). Still, the investigation is painful. Adding traces or using monitoring tools tends to prevent the race condition from occurring. Initially, I suspected the testshared test case, since it contains some scary lines such as: But surprisingly, it turns out that the repository is deleted by the test.bash script in testcshared. It appears that, when running concurrently with the other test cases, the go env GOROOT command sometimes returns: The third rm command is therefore expanded as: which results in the repository being wiped out. I don't understand why the output of go env is altered, but it really is. I checked this point by replacing the rm with an echo. I will prepare a CL to make this test case a bit more robust, at least so that the whole repository is not deleted. It does not address the flakiness of testcarchive. |
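The expansion described above can be illustrated without running rm at all. The following is a minimal sketch, not the actual test.bash code: the path and the trailing ` ok` are hypothetical stand-ins for a corrupted `go env GOROOT` result, and `show_args` is a helper invented for this illustration.

```shell
# show_args (hypothetical helper) prints how many arguments it received,
# then each one in brackets, so that word splitting becomes visible.
show_args() {
    echo "$# argument(s)"
    for a in "$@"; do
        echo "[$a]"
    done
}

# Hypothetical corrupted value: a plausible GOROOT followed by stray text.
root="/tmp/example-goroot ok"

# Unquoted: the shell splits on the space, so a command such as rm would
# receive the bare repository path as an argument of its own.
show_args $root/pkg/example

# Quoted: the whole string stays one (nonexistent) path.
show_args "$root/pkg/example"
```

With the unquoted form the first argument is the bare root path, which is exactly what a recursive rm would then delete. With the quoted form the single argument names a path that does not exist, so nothing would be removed.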
|
CL https://golang.org/cl/18173 mentions this issue. |
|
I don't believe that |
|
Let's try to find out exactly which |
Following the parallelization of some tests, a race condition can occur in testcarchive, testshared and testcshared. In some cases, it can result in the go env GOROOT command returning corrupted data, which is then passed to an rm command. Make the shell script more robust by not trusting the result of the go env GOROOT command. This does not really fix the issue, but it at least prevents the entire repository from being deleted.
Updates #13789
Change-Id: Iaf04a7bd078ed3a82e724e35c4b86e6f756f2a2f
Reviewed-on: https://go-review.googlesource.com/18173
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
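One way to avoid trusting the command's output is to validate it before it reaches rm. The sketch below is an assumption about what such a guard could look like, not the change actually made in CL 18173. `safe_remove_under` and its arguments are hypothetical names.

```shell
# safe_remove_under ROOT SUBPATH (hypothetical helper)
# Removes ROOT/SUBPATH only if ROOT is an absolute path to an existing
# directory and SUBPATH is non-empty. Every expansion is quoted so that
# stray whitespace in ROOT cannot split it into several arguments.
safe_remove_under() {
    root=$1
    sub=$2
    case $root in
        /*) ;;  # absolute path: keep checking
        *) echo "refusing: '$root' is not absolute" >&2; return 1 ;;
    esac
    if [ ! -d "$root" ]; then
        echo "refusing: '$root' is not a directory" >&2
        return 1
    fi
    if [ -z "$sub" ]; then
        echo "refusing: empty subpath" >&2
        return 1
    fi
    rm -rf -- "$root/$sub"
}
```

A script might call it as `safe_remove_under "$(go env GOROOT)" "pkg/example"`, where `pkg/example` is a placeholder. A value with trailing garbage would normally fail the directory check and be refused rather than passed to rm.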
|
@dspezia Thank you for digging into this, and apologies to anyone for deleted work.
This is not supposed to be true. I will send a CL fixing this. all.bash (go tool dist test) is supposed to wait for all its subprocesses before exiting. I still don't know why go env GOROOT would be printing the wrong output. Either the go command being run is badly broken or $GOROOT is not set properly. |
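The real logic lives in Go inside cmd/dist. The following shell sketch only illustrates the general principle of a parent waiting for every background job before it exits, even when one of them fails. `run_all` is a hypothetical helper, not part of all.bash.

```shell
# run_all CMD... (hypothetical helper)
# Starts every command string in the background, then waits for each one.
# Returns non-zero if any command failed, but only after all of them have
# finished, so no child is still running when the caller exits.
run_all() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

The point of collecting the failure status instead of returning at the first error is that a slow sibling job, such as a cleanup trap in another test script, cannot outlive the parent.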
|
I agree that this is still very mysterious (at least to me). In misc/cgo/testcshared/test.bash, I have replaced: by: Furthermore, I have the following patch to ensure testcarchive does fail: And here is the content of file "didier" after all.bash: I said that the go env output is altered, but actually, it could also be a bash expansion bug. The fact that this command runs in a shell trap function could also be a factor. |
|
Thanks. Can you change your logging lines to: and see if you can still reproduce the bad output? I too was wondering if Russ |
|
CL https://golang.org/cl/18233 mentions this issue. |
|
I ran more tests, but it makes less and less sense. Case A - with: I can reproduce (and consistently, on several runs): Case B - with: I cannot reproduce: Case C - with: I cannot reproduce: Case D - with: I cannot reproduce: The first echo (cleanup) seems to prevent the problem from happening. But when I remove it, I get yet another strange behavior: nothing is printed in the file anymore (file size = 0 at the end) for cases B, C, and D. In all my runs, when I can reproduce the problem, the extra characters are always " ok", not some random garbage. I can also reproduce the problem by adding a sleep of a few seconds just before the call to go env. Apparently, the starting time of the trap function does not matter. I don't really see how the problem could be due to a bad interaction with some other test case running in parallel, or a race condition (as I initially supposed). So far, my only explanation is a weird bash bug. |
|
I do think it's a weird bash bug. As I said before, if bash is doing its
job this cannot happen. It's possible that something is writing " ok\n" to
a closed file descriptor but I can't see why whatever process that is would
magically have access to the output pipe from the $(go env GOROOT).
The quotes around the rm argument should have been there from the beginning
and would have avoided the problem. Perhaps waiting for the other tests to
finish (CL pending) will also help.
If @ianlancetaylor is willing to close this, I certainly am.
|
|
I thought I had a nice simple convincing explanation for this, then I made the mistake of doing more tests. Now I have no idea what is going on and what Bash is doing, and unfortunately I can't reproduce this under The trigger condition for this seems to be testcshared/test.bash running and not having done the In the current setup of testcshared/test.bash, the spurious (If the output was always from within testcshared/test.bash, I could convince myself that Bash was just incorrectly recycling IO buffers under some weird situation. However, it at least once seemed to be getting IO data from a completely different process and context through some crazy method.) A great many changes to testcshared/test.bash perturb this issue into vanishing. Producing redirected output early on in the script seems to. Adding a The whole situation worries me, but I don't know if there's anything more that can be done besides hardening the scripts against crazy input and making sure that all running test processes stop before the main process exits. (I suspect that the Bash people are not going to be interested in looking at this and trying to figure out what's going on.) |
|
We cannot run bash scripts on Windows, so we would have to get rid of the script anyway if we want this functionality tested on Windows. Alex |
For #13789.
Change-Id: I83973298a35afcf55627f0a72223098306a51f4b
Reviewed-on: https://go-review.googlesource.com/18233
Reviewed-by: Ian Lance Taylor <iant@golang.org>
|
@dspezia Which version of bash do you have installed? |
|
@siebenmann I want to be clear about this: the extraneous "ok" you are seeing in the cleanup function in misc/cgo/testcshared/test.bash is actually coming from the That seems especially incomprehensible, because then I don't see why the bug would be triggered by the fact that there are other scripts running at the same time in different processes. |
|
@ianlancetaylor Yes, definitely. If I change the In fact yes. I've finally managed to capture a strace, and what appears to happen is this:
After 4 nothing attempts any further If it helps, I've had this happen on Bash 4.3.42 (Fedora 22 and 23 builds) and Bash 4.3.11 (Ubuntu 14.04 LTS), all on 64-bit Linux. |
|
Here is a simple reproduction of this to show that it is purely a bash bug:

#!/bin/bash
function cleanup() {
    r1=$(/bin/echo one)
    r2=$(/bin/echo two)
    echo >>/tmp/logout $r1 '!' $r2
}
trap cleanup EXIT
sleep 1
echo final

Put this in, say, I will see about reporting this to the Bash people. |
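Given the script above, each command substitution should capture only the output of its own echo, so a clean run appends the line `one ! two` to /tmp/logout. A small checker along these lines could flag a corrupted line. `check_log` is a hypothetical helper, and it makes no assumption about how the corruption is triggered.

```shell
# check_log [FILE] (hypothetical helper)
# Reads the last line of FILE (default: /tmp/logout, the path used by the
# reproduction script) and reports whether it is exactly "one ! two".
check_log() {
    file=${1:-/tmp/logout}
    last=$(tail -n 1 "$file")
    if [ "$last" = "one ! two" ]; then
        echo "clean"
    else
        echo "corrupted: [$last]"
        return 1
    fi
}
```

Printing the offending line inside brackets makes any stray leading or trailing text easy to spot.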
|
Very nice analysis. Thanks for pushing it through. Closing the Go issue. |
|
@siebenmann Did you ever notify the bash developers? If so, do you have a link to the ticket? |
|
I submitted it to the bug-bash mailing list; the resulting thread is here. Based on Chet Ramey's messages, fixes have been put into the current Bash source, although I don't know if that's made it to any released version. |
If I repeatedly build the current git tip with all.bash, some but not all of the time it will wind up deleting the entire repo. The specific point at which this seems to happen during testing is:
I think that the first commit that this started happening at is 8d5ff2e, but I'm not completely sure since this only happens some of the time.
My machine is 64-bit Fedora 22 Linux with a quad core processor.