Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

all: failures with EBADF on Linux-based platforms since 2024-02-21 #65857

Closed
gopherbot opened this issue Feb 21, 2024 · 15 comments
Closed

all: failures with EBADF on Linux-based platforms since 2024-02-21 #65857

gopherbot opened this issue Feb 21, 2024 · 15 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux release-blocker
Milestone

Comments

@gopherbot
Copy link

gopherbot commented Feb 21, 2024

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21" && date < "2024-02-24"

Issue created automatically to collect these failures.

Example (log):

testing: copying pipe: read |0: bad file descriptor

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 21, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == ""
2024-02-21 21:26 android-amd64-emu go@750738b5 os (log)
testing: copying pipe: read |0: bad file descriptor

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == ""
2024-02-22 16:40 android-amd64-emu go@55bb3d1c os (log)
panic: test timed out after 3m0s

goroutine 680 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:2366 +0x385
created by time.goFunc
	/workdir/go/src/time/sleep.go:177 +0x2d

watchflakes

@bcmills bcmills changed the title os: unrecognized failures all: failures with EBADF on Android since 2024-02-21 Feb 23, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- goos == "android" && log ~ `bad file descriptor`
2024-02-21 20:50 android-amd64-emu pkgsite@463d7c94 go@7fd62ba8 x/pkgsite/cmd/internal/pkgsite.TestServer (log)
2024/02/22 05:47:02 Info: go/packages.Load(["all"]) loaded 1 packages from /data/local/tmp/go_android_exec/pkgsite.test-11393/TestServer1734583538/001 in 182.70968ms
2024/02/22 05:47:02 Info: FetchDataSource: fetching example.com/testmod@v0.0.0
2024/02/22 05:47:02 Info: FetchDataSource: fetched example.com/testmod@v0.0.0 using *fetch.goPackagesModuleGetter in 556.75µs with error <nil>
2024/02/22 05:47:02 Info: FetchDataSource: fetching std@latest
2024/02/22 05:47:02 Info: FetchDataSource: fetched std@latest using *fetch.modCacheModuleGetter in 57.74µs with error modCacheGetter.Info("std", "latest"): modCacheModuleGetter.latestVersion("std"): path: malformed module path "std": missing dot in first path element: invalid argument
2024/02/22 05:47:02 Info: FetchDataSource: fetching example.com/testmod@latest
2024/02/22 05:47:02 Info: FetchDataSource: fetched example.com/testmod@latest using *fetch.goPackagesModuleGetter in 470.63µs with error <nil>
2024/02/22 05:47:02 Info: FetchDataSource: fetching example.com/testmod@v0.0.0
2024/02/22 05:47:02 Info: FetchDataSource: fetched example.com/testmod@v0.0.0 using *fetch.goPackagesModuleGetter in 435.24µs with error <nil>
2024/02/22 05:47:02 Error: Server.GetLatestInfo: FetchDataSource.GetLatestInfo(ctx, "example.com/testmod", "example.com/testmod"): proxy.Client.Info("example.com/testmod", "latest"): Client.readBody("example.com/testmod", "latest", "info"): executeRequest(ctx, "https://127.0.0.1:55571/example.com/testmod/@latest"): "404 page not found\n": not found
...
2024/02/22 05:47:05 Info: returning 424 (Failed Dependency) for error 424 (Failed Dependency): <nil> (epage=&{{   <nil> false false   false false    }  <h3 class="Error-message">This page is not supported by this datasource.</h3> <nil>})
2024/02/22 05:47:05 Info: go/packages.Load(["all"]) loaded 1 packages from /data/local/tmp/go_android_exec/pkgsite.test-11393/TestServer1734583538/001 in 52.15007ms
2024/02/22 05:47:05 Info: FetchDataSource: fetching std@latest
2024/02/22 05:47:05 Info: FetchDataSource: fetched std@latest using *fetch.modCacheModuleGetter in 54.26µs with error modCacheGetter.Info("std", "latest"): modCacheModuleGetter.latestVersion("std"): path: malformed module path "std": missing dot in first path element: invalid argument
2024/02/22 05:47:05 Info: FetchDataSource: fetching example.com/testmod@v0.0.0
2024/02/22 05:47:05 Info: FetchDataSource: fetched example.com/testmod@v0.0.0 using *fetch.goPackagesModuleGetter in 524.76µs with error <nil>
2024/02/22 05:47:05 Info: returning 424 (Failed Dependency) for error 424 (Failed Dependency): <nil> (epage=&{{   <nil> false false   false false    }  <h3 class="Error-message">This page is not supported by this datasource.</h3> <nil>})
--- FAIL: TestServer (3.36s)
    --- FAIL: TestServer/search_not_found (0.01s)
        server_test.go:200: failed to load any module(s) at map[/data/local/tmp/go_android_exec/pkgsite.test-11393/TestServer1734583538/001:[{example.com/testmod /data/local/tmp/go_android_exec/pkgsite.test-11393/TestServer1734583538/001}]]
2024-02-21 21:28 android-386-emu go@05c05796 go/build.TestDependencies (log)
--- FAIL: TestDependencies (4.69s)
    deps_test.go:742: reading p256.go: read /data/local/tmp/go_android_exec/goroot/src/crypto/internal/nistec/p256.go: bad file descriptor
2024-02-21 21:28 android-amd64-emu go@05c05796 os.TestSpliceFile (log)
--- FAIL: TestSpliceFile (0.07s)
    --- FAIL: TestSpliceFile/Limited (0.05s)
        --- FAIL: TestSpliceFile/Limited/Half-TCP (0.02s)
            --- FAIL: TestSpliceFile/Limited/Half-TCP/32769 (0.01s)
                readfrom_linux_test.go:337: random data seed: 1708551370
                readfrom_linux_test.go:357: write /data/local/tmp/go_android_exec/os.test-91484/TestSpliceFileLimitedHalf-TCP327692049923806/001/dst-splice-file-test1262597403: splice: bad file descriptor
2024-02-21 22:46 android-386-emu tools@c111c4df go@507d1b22 x/tools/go/analysis/passes/asmdecl.Test (log)
--- FAIL: Test (21.72s)
    --- FAIL: Test/linux/mips64 (0.02s)
        analysistest.go:344: loading [a]: err: signal: broken pipe: stderr: 
    --- FAIL: Test/linux/mips (0.02s)
        analysistest.go:344: loading [a]: err: copying stdout: read |0: bad file descriptor: stderr: 

watchflakes

@bcmills
Copy link
Contributor

bcmills commented Feb 23, 2024

This appears to have started as of https://go.dev/cl/528438 (attn @kolyshkin @prattmic @ianlancetaylor).

@bcmills
Copy link
Contributor

bcmills commented Feb 23, 2024

The android builders are just a variety of linux, but may have different timing behaviors that cause them to expose data races that are present but harder to detect on other linux variants.

I found one such plausible race and commented on it in https://go.dev/cl/c/go/+/528438/23#message-7776b03e24828c38c7e462ff45280105fc677d83.

I suggest that we treat this as a regression possibly affecting all linux ports, which makes it release-blocker until/unless we find evidence that the problem is somehow specific to Android.

@bcmills bcmills changed the title all: failures with EBADF on Android since 2024-02-21 all: failures with EBADF since 2024-02-21 Feb 23, 2024
@bcmills bcmills added this to the Go1.23 milestone Feb 23, 2024
@gopherbot

This comment was marked as off-topic.

@bcmills bcmills changed the title all: failures with EBADF since 2024-02-21 all: failures with EBADF on Linux-based platforms since 2024-02-21 Feb 23, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21"
2024-02-22 22:32 android-amd64-emu pkgsite@094b90f1 go@ad377e90 x/pkgsite/cmd/internal/pkgsite.TestServer (log)
2024/02/23 08:46:38 Info: go/packages.Load(["all"]) loaded 1 packages from /data/local/tmp/go_android_exec/pkgsite.test-11846/TestServer3924697756/001 in 127.65439ms
2024/02/23 08:46:38 Info: FetchDataSource: fetching example.com/testmod@v0.0.0
2024/02/23 08:46:38 Info: FetchDataSource: fetched example.com/testmod@v0.0.0 using *fetch.goPackagesModuleGetter in 1.26911ms with error <nil>
2024/02/23 08:46:38 Info: FetchDataSource: fetching std@latest
2024/02/23 08:46:38 Info: FetchDataSource: fetched std@latest using *fetch.modCacheModuleGetter in 140.84µs with error modCacheGetter.Info("std", "latest"): modCacheModuleGetter.latestVersion("std"): path: malformed module path "std": missing dot in first path element: invalid argument
2024/02/23 08:46:38 Info: FetchDataSource: fetching example.com/testmod@latest
2024/02/23 08:46:38 Info: FetchDataSource: fetched example.com/testmod@latest using *fetch.goPackagesModuleGetter in 665.75µs with error <nil>
2024/02/23 08:46:38 Info: FetchDataSource: fetching example.com/testmod@v0.0.0
2024/02/23 08:46:38 Info: FetchDataSource: fetched example.com/testmod@v0.0.0 using *fetch.goPackagesModuleGetter in 413.23µs with error <nil>
2024/02/23 08:46:38 Error: Server.GetLatestInfo: FetchDataSource.GetLatestInfo(ctx, "example.com/testmod", "example.com/testmod"): proxy.Client.Info("example.com/testmod", "latest"): Client.readBody("example.com/testmod", "latest", "info"): executeRequest(ctx, "https://127.0.0.1:35410/example.com/testmod/@latest"): "404 page not found\n": not found
...
panic: test timed out after 10m0s
running tests:
	TestServer (10m0s)
	TestServer/no_symbol_search (9m56s)

testing.(*T).Run(0xc0000dbba0, {0x76fe1584751b?, 0x1?}, 0xc004ce0020)
	/workdir/go/src/testing/testing.go:1750 +0x3ab
golang.org/x/pkgsite/cmd/internal/pkgsite.TestServer(0xc0000dbba0)
	/workdir/gopath/src/golang.org/x/pkgsite/cmd/internal/pkgsite/server_test.go:197 +0xe25
testing.tRunner(0xc0000dbba0, 0x76fe15fe8ae0)
2024-02-22 22:32 android-amd64-emu pkgsite@094b90f1 go@ad377e90 x/pkgsite/cmd/pkgsite (log)
SIGQUIT: quit
PC=0x470a6c m=0 sigcode=0

goroutine 1 gp=0xc000006380 m=0 mp=0x5f2240 [syscall]:
syscall.Syscall6(0xf7, 0x3, 0xb, 0xc0001317e8, 0x4, 0xc0000b4120, 0x0)
	/workdir/go/src/syscall/syscall_linux.go:91 +0x39 fp=0xc0001317a8 sp=0xc000131748 pc=0x494ad9
os.(*Process).pidfdWait(0xc00001a0f0)
	/workdir/go/src/os/pidfd_linux.go:94 +0x135 fp=0xc0001318c8 sp=0xc0001317a8 pc=0x4aba15
os.(*Process).wait(0xc00001a0f0)
	/workdir/go/src/os/exec_unix.go:27 +0x26 fp=0xc000131940 sp=0xc0001318c8 pc=0x4a9c26
...
r12    0xc0000b4120
r13    0x0
r14    0xc000006380
r15    0x1fffffffffffffff
rip    0x470a6c
rflags 0x206
cs     0x33
fs     0x0
gs     0x0
*** Test killed with quit: ran too long (11m0s).
2024-02-22 22:32 android-amd64-emu pkgsite@094b90f1 go@ad377e90 x/pkgsite/devtools/cmd/csphash (log)
SIGQUIT: quit
PC=0x470a6c m=0 sigcode=0

goroutine 1 gp=0xc000006380 m=0 mp=0x5f2240 [syscall]:
syscall.Syscall(0x49, 0x3, 0x2, 0x0)
	/workdir/go/src/syscall/syscall_linux.go:69 +0x25 fp=0xc000104a40 sp=0xc0001049e0 pc=0x494a45
syscall.Flock(0xc00001a060?, 0x25?)
	/workdir/go/src/syscall/zsyscall_linux_amd64.go:453 +0x25 fp=0xc000104a70 sp=0xc000104a40 pc=0x493b65
main.runMain()
	/workdir/go/misc/go_android_exec/main.go:105 +0x158 fp=0xc000104f10 sp=0xc000104a70 pc=0x4dc338
...
r12    0xc000104960
r13    0x0
r14    0xc000006380
r15    0x7ffffffffffffff
rip    0x470a6c
rflags 0x212
cs     0x33
fs     0x0
gs     0x0
*** Test killed with quit: ran too long (11m0s).

watchflakes

@gopherbot
Copy link
Author

Change https://go.dev/cl/566477 mentions this issue: Revert "os: make use of pidfd on linux"

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21"
2024-02-23 08:34 android-amd64-emu go@ad377e90 net.TestBuffers_WriteTo (log)
--- FAIL: TestBuffers_WriteTo (0.02s)
    --- FAIL: TestBuffers_WriteTo/WriteTo/1025 (0.00s)
        writev_test.go:118: accept tcp4 127.0.0.1:46781: accept4: bad file descriptor
        writev_test.go:118: dial tcp 127.0.0.1:46781: connect: connection reset by peer

watchflakes

@kolyshkin
Copy link
Contributor

testing: copying pipe: read |0: bad file descriptor

This looks very much like pidfdRelease closes fd 0 (which is an uninitialized handle). Probably specific to Android.

Will work on that next week.

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21"
2024-02-23 05:06 linux-riscv64-jsing go@faf2a841 os.TestCopyFSWithSymlinks (log)
--- FAIL: TestCopyFSWithSymlinks (0.00s)
    os_test.go:3470: Mkdir: mkdir /home/gopher/build/go/src/os/testdata/copyfs: permission denied

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21"
2024-02-23 00:19 linux-riscv64-jsing go@d9be6097 os.TestCopyFSWithSymlinks (log)
--- FAIL: TestCopyFSWithSymlinks (0.00s)
    os_test.go:3470: Mkdir: mkdir /home/gopher/build/go/src/os/testdata/copyfs: permission denied

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (goos == "android" || goos == "linux") && (pkg == "os" || log ~ `bad file descriptor`) && date >= "2024-02-21"
2024-02-22 18:17 android-386-emu tools@054c06df go@8a0fbd75 x/tools/go/analysis/passes/asmdecl.Test (log)
--- FAIL: Test (22.71s)
    --- FAIL: Test/linux/mips64 (0.03s)
        analysistest.go:344: loading [a]: err: copying stdout: read |0: bad file descriptor: stderr: 
    --- FAIL: Test/linux/mips (0.02s)
        analysistest.go:344: loading [a]: err: signal: broken pipe: stderr: 

watchflakes

@kolyshkin
Copy link
Contributor

The reason for EBADF on Android was this code in https://go-review.googlesource.com/c/go/+/528438:

if runtime.GOOS == "linux" {
	h = getPidfd(sysattr.Sys)
}

On android, runtime.GOOS is "android", yet the linux build tag is set, which resulted in h (Process.handle) not initialized (set to 0). Subsequently, pidfdRelease was closing fd 0.

@gopherbot
Copy link
Author

Change https://go.dev/cl/570036 mentions this issue: os: make use of pidfd on linux

gopherbot pushed a commit that referenced this issue May 17, 2024
Use Process.handle field to store pidfd, and make use of it. Only use
pidfd functionality if all the needed syscalls are available.

1. Add/use pidfdWorks, which checks that all needed pidfd-related
   functionality works.

2. os.StartProcess: obtain the pidfd from the kernel, if possible, using
   the functionality added by CL 520266. Note we could not modify
   syscall.StartProcess to return pidfd directly because it is a public
   API and its callers do not expect it, so we have to use ensurePidfd
   and getPidfd.

3. (*Process).Kill: use pidfdSendSignal, if available and the pidfd is
   known. Otherwise, fall back to the old implementation.

4. (*Process).Wait: use pidfdWait, if available, otherwise fall back to
   using waitid/wait4. This is more complicated than expected due to
   struct siginfo_t idiosyncrasy.

NOTE pidfdSendSignal and pidfdWait are used without a race workaround
(blockUntilWaitable and sigMu, added by CL 23967) because with pidfd,
PID recycle issue doesn't exist (IOW, pidfd, unlike PID, is guaranteed
to refer to one particular process) and thus the race doesn't exist
either.

Rework of CL 528438 (reverted in CL 566477 because of #65857).

For #62654.
Updates #13987.

Change-Id: If5ef8920bd8619dc428b6282ffe4fb8c258ca224
Reviewed-on: https://go-review.googlesource.com/c/go/+/570036
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Cherry Mui <cherryyz@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux release-blocker
Projects
Status: Done
Development

No branches or pull requests

3 participants