Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/race: update race windows syso to new LLVM version #53539

Open
thanm opened this issue Jun 24, 2022 · 19 comments
Open

runtime/race: update race windows syso to new LLVM version #53539

thanm opened this issue Jun 24, 2022 · 19 comments
Assignees
Labels
compiler/runtime NeedsInvestigation
Milestone

Comments

@thanm
Copy link
Contributor

@thanm thanm commented Jun 24, 2022

The syso that supports the race detector for windows is currently still at an old version, since we had been waiting for new windows builders with updated compilers:

https://go.googlesource.com/go/+/6bad7e82430bb1eb927a2901f44f9664637db27d/src/runtime/race/README#12

I took a stab at doing the update today but ran into some additional problems; filing this issue to note the problems and track the work needed to do the update.

The LLVM support library (llvm-project/compiler-rt/lib/tsan and related) has undergone a fair number of changes since we last did an update, and it looks as though there are now some calls in the windows version to new synchronization routines. Specifically if you run racebuild with the new builders and then race.bat, you get unsatisfied symbols.

With a slightly modified version of racebuild.go that doesn't use GCC 5.3 for the build.

./racebuild --goroot=/ssd2/xgo --rev=41cb504b7c4b18ac15830107431a0c1eec73a6b2 \
  --gorev=851ecea4cc99ab276109493477b2c7e30c253ea8 --platforms=windows/amd64
...
# go install -race std
# runtime/race
..../x86_64-w64-mingw32/bin/ld.exe: runtime\race\race_windows_amd64.syso:gotsan.cpp:(.text+0x3bfa): undefined reference to `WaitOnAddress'
...
..../x86_64-w64-mingw32/bin/ld.exe: runtime\race\race_windows_amd64.syso:gotsan.cpp:(.text+0x91cf): undefined reference to `WakeByAddressSingle'
...
...../x86_64-w64-mingw32/bin/ld.exe: runtime\race\race_windows_amd64.syso:gotsan.cpp:(.text+0x137e2): undefined reference to `WakeByAddressAll'

Looks like this is coming from this code:

https://github.com/llvm/llvm-project/blob/41cb504b7c4b18ac15830107431a0c1eec73a6b2/compiler-rt/lib/sanitizer_common/sanitizer_win.cpp#L48

This implies that if we want to support -race + internal linking on windows we need to teach the Go linker to import this lib.

@thanm thanm added the NeedsInvestigation label Jun 24, 2022
@thanm thanm added this to the Go1.20 milestone Jun 24, 2022
@thanm thanm self-assigned this Jun 24, 2022
@gopherbot
Copy link

@gopherbot gopherbot commented Jun 24, 2022

Change https://go.dev/cl/413817 mentions this issue: cmd/link: link against libsynchronization.a for -race on windows

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 24, 2022

OK, looks like CL 413817 helps with that, but we're on to a new problem:

C:\workdir\go\pkg\tool\windows_amd64\link.exe: running gcc failed: exit status 1
c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/11.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: C:\Users\gopher\AppData\Local\Temp\go-link-3661115345\000002.o:gotsan.cpp:(.text+0xf949): undefined reference to `__sanitizer::DumpProcessMap()'
collect2.exe: error: ld returned 1 exit status

gopherbot pushed a commit that referenced this issue Jun 27, 2022
As of LLVM rev 41cb504b7c4b18ac15830107431a0c1eec73a6b2, the
race detector runtime now refers to things in the windows
synchronization library, hence when doing windows internal
linking, at that library to the list of host archives that
we visit. The tsan code that makes the reference is here:

https://github.com/llvm/llvm-project/blob/41cb504b7c4b18ac15830107431a0c1eec73a6b2/compiler-rt/lib/sanitizer_common/sanitizer_win.cpp#L48
https://github.com/llvm/llvm-project/blob/41cb504b7c4b18ac15830107431a0c1eec73a6b2/compiler-rt/lib/sanitizer_common/sanitizer_win.cpp#L834

Note that libsynchronization.a is not guaranteed to be available on
all windows systems, so in the external linking case, check for its
existence before adding "-lsynchronization" to the external linker
args.

Updates #53539.

Change-Id: I433c95c869915693d59e9c1082d5b8a11da1fc8c
Reviewed-on: https://go-review.googlesource.com/c/go/+/413817
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
@gopherbot
Copy link

@gopherbot gopherbot commented Jun 27, 2022

Change https://go.dev/cl/414514 mentions this issue: runtime/race: update race_windows_amd64.syso

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 27, 2022

Update: I sent D128641 to LLVM upstream with a fix, now submitted as 13fb97d68821.

Still having trouble getting a clean race.bat run after running racebuild. My latest run is:

./racebuild --goroot=/ssd2/xgo --rev=13fb97d68821e1948d176057ebee94f50cb05b62 --gorev=68289f39f0019ff5c03e047d68e5b7f6a9f9e9e2 --platforms=windows/amd64

in combination with CL 414475, which picks GCC 11.2 instead of 5.3. The resulting race.bat run is still not passing however. Representative error:

##### Testing packages.
ok  	archive/tar	1.173s
ok  	archive/zip	3.534s
ok  	bufio	1.519s
==2580==ERROR: ThreadSanitizer failed to allocate 0x020000000000 (2199023255552) bytes at 0x010000000000 (error code: 487)
failed to reset shadow memory
FAIL	bytes	1.190s
ok  	compress/bzip2	1.807s
ok  	compress/flate	7.184s
ok  	compress/gzip	1.653s
ok  	compress/lzw	1.783s
ok  	compress/zlib	6.271s

I get this same symptom on both windows-amd64-race and windows-amd64-race-newcc.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Jun 28, 2022

This error:

==2580==ERROR: ThreadSanitizer failed to allocate 0x020000000000 (2199023255552) bytes at 0x010000000000 (error code: 487)

comes from here:
https://github.com/llvm/llvm-project/blob/241557fb0600fcef67bdd0698fbbd2fb9a3459e2/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L200
when we try to zero the whole shadow memory range.

Previously we did not do it on Windows since it was optional, but now it's required to zero the whole range.

But since we map shadow on Windows lazily (otherwise we would get the same error earlier):
https://github.com/llvm/llvm-project/blob/241557fb0600fcef67bdd0698fbbd2fb9a3459e2/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L562-L570

We don't need to remap the whole range, we only need to remap what was mapped by MapShadow function.

We probably need to add mapped_shadow_begin/end for Go, update them in MapShadow and re-map on that part in DoReset.

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 30, 2022

We probably need to add mapped_shadow_begin/end for Go, update them in MapShadow and re-map on that part in DoReset.

Thanks, I tried coding up this suggestion. Tentative/experimental CL at https://reviews.llvm.org/D128909.

This does seem to eliminate the "failed to allocate 0x020000000000" problem, but now I seem to be on to a different failure mode, e.g.

##### Testing packages.
==3348==ERROR: ThreadSanitizer failed to allocate 0x000000501000 (5246976) bytes at 0x100eccf98e000 (error code: 87)
FAIL	archive/tar	0.054s
==2036==ERROR: ThreadSanitizer failed to allocate 0x0000004d1000 (5050368) bytes at 0x100edaccf6000 (error code: 87)
FAIL	archive/zip	0.056s
==3584==ERROR: ThreadSanitizer failed to allocate 0x0000003ad000 (3854336) bytes at 0x100eca5bfe000 (error code: 87)
FAIL	bufio	0.046s
==3440==ERROR: ThreadSanitizer failed to allocate 0x0000003cd000 (3985408) bytes at 0x100ef1b2da000 (error code: 87)
FAIL	bytes	0.054s
==660==ERROR: ThreadSanitizer failed to allocate 0x00000037d000 (3657728) bytes at 0x100eeb308e000 (error code: 87)
FAIL	compress/bzip2	0.055s
...

This actually looks pretty similar to some previous issues, e.g. 48231 and 46099.

I see from this comment that the problem may be due to a new compiler that defaults to a PIE binary. I can pursue this angle (maybe with Go linker CL). @dvyukov , can you confirm that we need to avoid PIE in order for the race detector to function properly on windows/amd64? Thanks.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Jun 30, 2022

87 is ERROR_INVALID_PARAMETER 87 (0x57) The parameter is incorrect.
https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-

I see the addresses are not 64K aligned, that's probably what causes the error.
What line in tsan runtime does cause this?

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 30, 2022

I see the addresses are not 64K aligned, that's probably what causes the error. What line in tsan runtime does cause this?

Ah, ok, I guess this would make sense. I'll see if I can collect a stack trace.

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 30, 2022

Looks like the offending call is from MapShadow itself (e.g. https://github.com/llvm/llvm-project/blob/13fb97d68821e1948d176057ebee94f50cb05b62/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp#L568).

64K: I tried upping the alignment to 64k, but that doesn't seem to help (I still get the same error).

@thanm
Copy link
Contributor Author

@thanm thanm commented Jun 30, 2022

OK, looks like the most recent batch of problems all go away (at least for the hand tests I tried) when PIE is turned off. I will work on putting together a linker CL that disables PIE when -race is in effect. Stay tuned.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Jul 1, 2022

Looks like the offending call is from MapShadow itself

Right, we only align it to 4K (page size), but not to 64K.

64K: I tried upping the alignment to 64k, but that doesn't seem to help (I still get the same error).

Humm... this is strange. You aligned both start address and size, right?
The other reason for this I can think of is that since we extend the region while rounding it to 64K, we can ask the kernel to allocate overlapping ranges. Maybe that fails?
Since we will need to track mapped_shadow_begin/end for DoReset, it should be relatively easy to avoid double mapping.

OK, looks like the most recent batch of problems all go away (at least for the hand tests I tried) when PIE is turned off.

The only way I see PIE can affect things is that some ranges become only 4K aligned, whereas w/o PIE they are 64K aligned.
If so, then carefully tracking mapped_shadow_begin/end, aligning everything to 64K and avoiding double mapping of ranges may solve all problems (w/o prohibiting PIE).

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 1, 2022

Change https://go.dev/cl/415674 mentions this issue: cmd/racebuild: support cherry-picking CL into Go repo

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 1, 2022

Change https://go.dev/cl/415675 mentions this issue: cmd/racebuild: add -copyonfail debugging flag

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 1, 2022

Change https://go.dev/cl/414475 mentions this issue: cmd/racebuild: update compilers used for windows syso build

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 1, 2022

Change https://go.dev/cl/415676 mentions this issue: cmd/link: explicitly disable PIE for windows/amd64 -race mode

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 6, 2022

Change https://go.dev/cl/416174 mentions this issue: cmd/go: default to "exe" build mode for windows -race

gopherbot pushed a commit that referenced this issue Jul 7, 2022
This patch changes the default build mode from "pie" to "exe" when
building programs on windows with "-race" in effect. The Go command
already issues an error if users explicitly ask for -buildmode=pie in
combination with -race on windows, but wasn't revising the default
"pie" build mode if a specific buildmode was not requested.

Updates #53539.
Updates #35006.

Change-Id: I2f81a41a1d15a0b4f5ae943146175c5a1202cbe0
Reviewed-on: https://go-review.googlesource.com/c/go/+/416174
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Than McIntosh <thanm@google.com>
gopherbot pushed a commit that referenced this issue Jul 7, 2022
Turn off PIE explicitly for windows/amd64 when -race is in effect,
since at the moment the race detector runtime doesn't seem to handle
PIE binaries correctly. Note that newer C compilers on windows
produce PIE binaries by default, so the Go linker needs to explicitly
turn off PIE when invoking the external linker in this case.

Updates #53539.

Change-Id: Ib990621f22cf61a5fa383584bab81d3dfd7552e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415676
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
@thanm
Copy link
Contributor Author

@thanm thanm commented Jul 7, 2022

For tracking purposes (this is more of a mental note / reminder), the updated windows race syso is dependent on functions from libsynchronization.a (which are not available on our existing Go windows builders), meaning that we can't check that in until we've transitioned over to the new ones. Otherwise we'll get unsats on WakeByAddressSingle/WakeByAddressAll/WaitOnAddress.

@gopherbot gopherbot added the compiler/runtime label Jul 7, 2022
@bcmills
Copy link
Member

@bcmills bcmills commented Jul 11, 2022

Is this dashboard failure related?

2022-07-06T18:36:43-d69bac6-77cc1c0/windows-amd64-race:

==4712==ERROR: ThreadSanitizer failed to allocate 0x000000040000 (262144) bytes at 0x056157aa0000 (error code: 1455)
FATAL: ThreadSanitizer can not mmap thread trace (0x056157aa0000/0x000000040000)
FAIL	golang.org/x/tools/refactor/importgraph	0.276s

@thanm
Copy link
Contributor Author

@thanm thanm commented Jul 11, 2022

Interesting, I haven't seem one of those before. Error code 1455 is 'The paging file is too small for this operation to complete'.

I would guess that it's unrelated, the errors I am working on at the moment for V3 are 487 ("Attempt to access invalid address").

thanm added a commit to llvm/llvm-project that referenced this issue Jul 26, 2022
Capture the computed shadow begin/end values at the point where the
shadow is first created and reuse those values on reset. Introduce new
windows-specific function "ZeroMmapFixedRegion" for zeroing out an
address space region previously returned by one of the MmapFixed*
routines; call this function (on windows) from DoResetImpl
tsan_rtl.cpp instead of MmapFixedSuperNoReserve.

See golang/go#53539 (comment)
for context; intended to help with updating the syso for Go's
windows/amd64 race detector.

Differential Revision: https://reviews.llvm.org/D128909
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime NeedsInvestigation
Projects
Status: Todo
Development

No branches or pull requests

4 participants