Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: TestSetuidEtc failures on linux-ppc64-buildlet #42462

Closed
bcmills opened this issue Nov 9, 2020 · 17 comments
Closed

syscall: TestSetuidEtc failures on linux-ppc64-buildlet #42462

bcmills opened this issue Nov 9, 2020 · 17 comments
Labels
arch-ppc64x FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux release-blocker
Milestone

Comments

@bcmills
Copy link
Member

bcmills commented Nov 9, 2020

--- FAIL: TestSetuidEtc (0.01s)
    syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
    syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
    syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0" (bad)
    syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
    syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
    syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
FAIL
FAIL	syscall	0.309s

2020-11-06T19:42:05-362d25f/linux-ppc64-buildlet
2020-10-30T00:03:40-01efc9a/linux-ppc64-buildlet
2020-10-29T19:03:09-f2c0c2b/linux-ppc64-buildlet
2020-10-23T23:01:52-64dc25b/linux-ppc64-buildlet
2020-10-23T20:54:25-5f616a6/linux-ppc64-buildlet

See previously #42178 (@AndrewGMorgan, @laboger, @ianlancetaylor)

@bcmills bcmills added OS-Linux NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-ppc64x labels Nov 9, 2020
@bcmills bcmills added this to the Go1.16 milestone Nov 9, 2020
@bcmills
Copy link
Member Author

bcmills commented Nov 9, 2020

Marking as release-blocker at least until we determine whether this is due to an API change in Go 1.16 (per #42178 (comment)).

@AndrewGMorgan
Copy link
Contributor

Please assign this bug report to me.

@laboger
Copy link
Contributor

laboger commented Nov 9, 2020

I noticed this on the dashboard. It seems like there is some kind of timing issue in reading the /proc//status file within one process. I briefly tried to reproduce it and was not able to but didn't try too hard. Both threads are trying to read the same file. Odd that it only happens on ppc64 and not ppc64le.

@AndrewGMorgan
Copy link
Contributor

Reviewing the failure logs, these three all share the same failure characteristics (root cause currently unknown)

Looking at build.golang.org, this failure mode appears to be rare.

These failures are believed to have been resolved in 3a819e8 11 days ago (ie., the failure signature matches something explicitly fixed with that commit):

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Nov 9, 2020

Looking for patterns, the three failures are listed below. The Commonality visible in the first and third log appears to be that, as Lynn points out, the listed thread id is common to the failing test case.

I'm struck by how close together the three listed thread IDs are: 22053, 21921 and 21923. Is this to be expected? For example, is the build run in a container launched just to run the build tests, so the number of processes consumed by the time the test runs is generally about the same? Or is the available process range pretty small in this container and this is just evidence of ID reuse?

From debugging #42178 we determined that ppc64 (sans "le") runs without cgo, so this code is executing the syscall.AllThreadsSyscall() function behind the scenes when the test calls syscall.Set*(). This may or may not be important. I wonder if similar sporadic failures have been seen in the nocgo build tests on other platforms?

--- FAIL: TestSetuidEtc (0.01s)
    syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
    syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
    syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0" (bad)
    syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
    syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
    syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
--- FAIL: TestSetuidEtc (0.01s)
    syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/21921/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
--- FAIL: TestSetuidEtc (0.04s)
    syscall_linux_test.go:632: [1] "Setegid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t1\t1\t1\t1" (bad)
    syscall_linux_test.go:632: [5] "Setgid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/21923/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
    syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/21923/status" got:"Groups:" want:"Groups:\t0" (bad)
    syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t102\t102\t102" (bad)
    syscall_linux_test.go:632: [11] "Setregid(0,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)

@AndrewGMorgan
Copy link
Contributor

Looking at this output using the following method, we see that there is evidence that the threads individually stop behaving as part of the collective process. In the first case, this is immediately - the Setegid(1) fails to take. In the third case that call succeeds, but the next Seteuid(1) call fails. Looking at the code for the test

https://github.com/golang/go/blob/master/src/syscall/syscall_linux_test.go#L596

All of the test cases after the first failing one also fail, unless the test is looking for the stuck state. What also seems to happen is that the test stops failing at some point before getting to the last test (which should be numbered [20]). This suggests that whatever is causing the test to fail, cleans itself up before the test reaches the end.

My suspicion is that this test is, in some way, racing thread termination on this architecture's test environment.

"grep uid" on the first and third of these yields:

    syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)

and

    syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)

and "grep gid" yields:

    syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
    syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
    syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
    syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)

and

    syscall_linux_test.go:632: [1] "Setegid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t1\t1\t1\t1" (bad)
    syscall_linux_test.go:632: [5] "Setgid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t102\t102\t102" (bad)
    syscall_linux_test.go:632: [11] "Setregid(0,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)

@AndrewGMorgan
Copy link
Contributor

Racing with thread termination would also explain the second failure.

@AndrewGMorgan
Copy link
Contributor

This passes for me (total running time about 1m26s) on linux-ppc64-buildlet:

go/bin/go test syscall -v -run=TestSetuidEtc -count=10000

Given the rarity of this failure mode, I'm going to have to think a bit more about how to fix it.

@AndrewGMorgan
Copy link
Contributor

Without the -v it runs quicker, but similarly no failures. I'm going to explore modifying the test to see if I can make it fail quicker, and/or more verbosely.

@AndrewGMorgan
Copy link
Contributor

Based on the theory that the issue is possibly a race somewhere with terminating a thread, I modified src/syscall/syscall_linux_test.go with the following to aggressively terminate OS threads [*]:

diff --git a/src/syscall/syscall_linux_test.go b/src/syscall/syscall_linux_test.go
index 41ae8cc5a1..051eb3f204 100644
--- a/src/syscall/syscall_linux_test.go
+++ b/src/syscall/syscall_linux_test.go
@@ -550,13 +550,13 @@ func compareStatus(filter, expect string) error {
        for _, f := range fs {
                tf := fmt.Sprintf("/proc/%s/status", f.Name())
                d, err := ioutil.ReadFile(tf)
-               if os.IsNotExist(err) {
-                       // We are racing against threads dying, which
-                       // is out of our control, so ignore the
-                       // missing file and skip to the next one.
-                       continue
-               }
                if err != nil {
+                       if os.IsNotExist(err) || strings.HasSuffix(err.Error(), ": no such process") {
+                               // We are racing against threads dying, which
+                               // is out of our control, so ignore the
+                               // missing file and skip to the next one.
+                               continue
+                       }
                        return fmt.Errorf("unable to read %q: %v", tf, err)
                }
                lines := strings.Split(string(d), "\n")
@@ -574,6 +574,14 @@ func compareStatus(filter, expect string) error {
        return nil
 }
 
+// killAThread locks the go routine to an OS thread and exits; this
+// causes a thread to terminate.
+func killAThread(c <-chan struct{}) {
+       runtime.LockOSThread()
+       <-c
+       return
+}
+
 // TestSetuidEtc performs tests on all of the wrapped system calls
 // that mirror to the 9 glibc syscalls with POSIX semantics. The test
 // here is considered authoritative and should compile and run
@@ -624,6 +632,9 @@ func TestSetuidEtc(t *testing.T) {
        }
 
        for i, v := range vs {
+               c := make(chan struct{})
+               go killAThread(c)
+               close(c)
                if err := v.fn(); err != nil {
                        t.Errorf("[%d] %q failed: %v", i, v.call, err)
                        continue

[*] Note, this patch includes a fix that is unrelated to the bug in question, and fixes an issue with the last test fix... I was able to get some failures on my non-ppc64 workstation, but those turned out to be due to the fact that os.IsNotExist(err) is not always the correct way to recognize a thread ending on Linux.

Using that I generated some similar failures on linux-ppc64-buildlet:

$ [..magic to run remotely..] go/bin/go test syscall -run=TestSetuidEtc -count=1000
--- FAIL: TestSetuidEtc (0.09s)
    syscall_linux_test.go:643: [14] "Setreuid(0,0)" comparison: "/proc/1795/status" got:"Uid:\t0\t2\t2\t2" want:"Uid:\t0\t0\t0\t0" (bad)
    syscall_linux_test.go:643: [15] "Setresgid(101,0,102)" comparison: "/proc/1795/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
--- FAIL: TestSetuidEtc (0.03s)
    syscall_linux_test.go:643: [12] "Setreuid(1,0)" comparison: "/proc/2462/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
    syscall_linux_test.go:643: [13] "Setreuid(0,2)" comparison: "/proc/2462/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
--- FAIL: TestSetuidEtc (0.13s)
    syscall_linux_test.go:643: [19] "Setresuid(0,2,1)" comparison: "/proc/3144/status" got:"Uid:\t1\t0\t2\t0" want:"Uid:\t0\t2\t1\t2" (bad)
--- FAIL: TestSetuidEtc (0.14s)
    syscall_linux_test.go:643: [7] "Setgroups(nil)" comparison: "/proc/8741/status" got:"Groups:\t0 1 2 3" want:"Groups:" (bad)
    syscall_linux_test.go:643: [8] "Setgroups([]int{0})" comparison: "/proc/8741/status" got:"Groups:\t0 1 2 3" want:"Groups:\t0" (bad)
    syscall_linux_test.go:643: [9] "Setregid(101,0)" comparison: "/proc/8741/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
    syscall_linux_test.go:643: [10] "Setregid(0,102)" comparison: "/proc/8741/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
FAIL
FAIL    syscall 30.086s
FAIL
Error running run: exit status 1

@gopherbot
Copy link

Change https://golang.org/cl/268717 mentions this issue: syscall: improve TestSetuidEtc() /proc/ parsing against races

@AndrewGMorgan
Copy link
Contributor

The issue appears to be another race with interpreting the /proc/ filesystem in the testing code:

https://go-review.googlesource.com/c/go/+/268717

@AndrewGMorgan
Copy link
Contributor

Note, in passing, that the master branch just threw up another failure for linux-ppc-buildlet:

https://build.golang.org/log/650f97099ef1a84e84bb7f2d17f63a20d4c04559

This doesn't yet include the https://golang.org/cl/268717 fix.

@AndrewGMorgan
Copy link
Contributor

So, the first version of the changes unexpectedly failed:

2 of 23 SlowBots failed:
Failed on linux-amd64: https://storage.googleapis.com/go-build-log/fd4e8e86/linux-amd64_038a24f9.log
Failed on linux-386: https://storage.googleapis.com/go-build-log/fd4e8e86/linux-386_a3c5af8b.log

My plan is to revert the thread termination part of the change to misc/cgo/test/issue1435.go and retry. I'm not yet clear if this is a newly uncovered bug, or something related to what I thought I was fixing.

@AndrewGMorgan
Copy link
Contributor

Some investigation later, there were two issues:

  1. I had missed some needed bits from the syscall_linux_test.go -> issue1435.go test upgrade.
  2. as suspected the forced thread termination is somehow not working as expected.

My plan is to submit a fix for this bug in two stages - fix the original bug as reported upto and including (1) of this comment.

Then investigate why (2) is failing. It could be some subtlety of glibc's thread implementation, since launching C.pthread_create() is the only way the issue1435.go test differs from the one in syscall_linux_test.go.

@AndrewGMorgan
Copy link
Contributor

These two test continue to fail:

2 of 23 SlowBots failed:
Failed on linux-amd64: https://storage.googleapis.com/go-build-log/d73fad79/linux-amd64_0f499959.log
Failed on linux-386: https://storage.googleapis.com/go-build-log/d73fad79/linux-386_1a608de2.log

All evidence of (1) is gone now, so I'll focus on what is up with the cgo failures now.

The net change with this fix "shouldn't" have triggered any new failures, so I suspect we're tripping over what must have been let's say a heisenbug lurking all along with the cgo side of the feature... [Here's hoping it is not a glibc issue.]

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Nov 10, 2020

I've decided to split this issue into two (see #42494 for the aspect of the issue I plan not to resolve in this present bug). For this present bug, I want to get the checked in source back to testing stability by addressing the issue with the ppc64 build testing.

@golang golang locked and limited conversation to collaborators Jan 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-ppc64x FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux release-blocker
Projects
None yet
Development

No branches or pull requests

5 participants
@bcmills @gopherbot @laboger @AndrewGMorgan and others