Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: -race leaks memory on Go >= 1.19 #63276

Open
walles opened this issue Sep 28, 2023 · 28 comments
Open

runtime: -race leaks memory on Go >= 1.19 #63276

walles opened this issue Sep 28, 2023 · 28 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Milestone

Comments

@walles
Copy link

walles commented Sep 28, 2023

What version of Go are you using (go version)?

root@4fa353104d14:/go# go version
go version go1.21.1 linux/arm64
root@4fa353104d14:/go#

Inside of docker run -it --rm golang:1.21.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
root@4fa353104d14:/go# go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1393083128=/tmp/go-build -gno-record-gcc-switches'
root@4fa353104d14:/go#

What did you do?

  • Save this program as main.go.
  • With Go 1.19 or later (1.18 does not leak, or at least a lot less)...
  • go run -race main.go (without -race there is no leak)

Also, just having the HTTP server part is enough to leak, but I wanted a repro that is standalone so the client code and the memory reporting code are there for that.

package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"

	"net/http"

	_ "net/http/pprof"
)

func main() {
	printMemUsage()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {})
	server := http.Server{
		Addr:        ":5019",
		ReadTimeout: 5 * time.Second,
	}

	go func() {
		log.Println("Starting service", "http://localhost:5019/")
		if err := server.ListenAndServe(); err != nil { // blocks
			log.Fatalln("Error while running service:", err)
		}
	}()

	// Give the server time to start
	time.Sleep(2 * time.Second)

	log.Println("Now flooding the server with requests...")
	nextMemLog := time.Now()
	for {
		// Trigger the handler function
		response, err := http.Get("http://localhost:5019/")
		if err != nil {
			log.Fatalln("Error while making request:", err)
		}
		if response.StatusCode != http.StatusOK {
			log.Fatalln("Unexpected status code:", response.StatusCode)
		}
		if response.Body != nil {
			response.Body.Close()
		}

		if time.Now().After(nextMemLog) {
			nextMemLog = time.Now().Add(5 * time.Second)
			printMemUsage()
		}
	}
}

func printMemUsage() {
	out, err := exec.Command("ps", "-o", "rss", "-p", strconv.Itoa(os.Getpid())).Output()
	if err != nil {
		log.Fatalln("Error while running ps command, is it installed?", err)
	}

	// Split output into "RSS" and "1234"
	rss := strings.Fields(string(out))

	log.Println("Memory usage:", rss[1], "kB")
}

What did you expect to see?

Stable memory usage

What did you see instead?

2023/09/27 12:54:27 RssAnon:	   14172 kB
2023/09/27 12:54:27 Starting service http://localhost:5019/
2023/09/27 12:54:29 Now flooding the server with requests...
2023/09/27 12:54:29 RssAnon:	   15724 kB
2023/09/27 12:54:34 RssAnon:	   54000 kB
2023/09/27 12:54:39 RssAnon:	   63400 kB
2023/09/27 12:54:44 RssAnon:	   72044 kB
2023/09/27 12:54:49 RssAnon:	   80796 kB
2023/09/27 12:54:54 RssAnon:	   89384 kB
2023/09/27 12:54:59 RssAnon:	   99744 kB
2023/09/27 12:55:04 RssAnon:	  108400 kB
2023/09/27 12:55:09 RssAnon:	  116744 kB
2023/09/27 12:55:14 RssAnon:	  124632 kB
2023/09/27 12:55:19 RssAnon:	  132332 kB
[...]

Notes

@walles walles changed the title affected/package: -race leaks memory on Go >= 1.19 -race leaks memory on Go >= 1.19 Sep 28, 2023
@seankhliao seankhliao changed the title -race leaks memory on Go >= 1.19 runtime: -race leaks memory on Go >= 1.19 Sep 28, 2023
@bcmills bcmills added RaceDetector compiler/runtime Issues related to the Go compiler and/or runtime. labels Sep 28, 2023
@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 28, 2023
@thanm
Copy link
Contributor

thanm commented Sep 28, 2023

@golang/compiler @golang/runtime

@thanm
Copy link
Contributor

thanm commented Sep 28, 2023

Thanks for the report.

@walles you write that this happens "On Linux (not on macOS)..." -- does that mean you've tried it on a mac and the leak doesn't happen, or just that the program is linux-specific?

@walles
Copy link
Author

walles commented Sep 28, 2023

@thanm I have tried it on a mac and the leak didn't happen.

It leaks on macOS as well, sorry about that.

I updated the repro case at the start of this ticket to use ps to measure its own RAM usage, so now the repro case can be run on both macOS and Linux.

@mknyszek
Copy link
Contributor

mknyszek commented Oct 4, 2023

In triage, we're a bit unsure as to what the cause could be. It seems like bisecting the toolchain might be the best path forward to try and identify the root cause.

@mknyszek mknyszek added this to the Backlog milestone Oct 4, 2023
@mknyszek
Copy link
Contributor

mknyszek commented Oct 4, 2023

@walles Would you be willing to do the bisection, since you have a reproducer readily available in your environment?

@walles
Copy link
Author

walles commented Oct 4, 2023

@mknyszek What happened when you tried the repro?

Did go run -race main.go reproduce the leak for you?

@aclements
Copy link
Member

@walles , we might not get a chance to bisect this for a while. If you have a chance to bisect it, that might help us a lot with moving this issue forward more quickly.

@walles
Copy link
Author

walles commented Oct 10, 2023

Do you think you could at least validate the repro case now?

I have only run it myself, so if at least one other person could verify that it shows a leak that would be super!

@nakulbajaj
Copy link
Contributor

Can confirm I am seeing a leak with the same script on macOS

GOARCH='arm64'
GOOS='darwin'
Output
~/open-source » go run -race test.go
2023/10/11 16:19:36 Memory usage: 24416 kB
2023/10/11 16:19:36 Starting service http://localhost:5019/
2023/10/11 16:19:38 Now flooding the server with requests...
2023/10/11 16:19:38 Memory usage: 28624 kB
2023/10/11 16:19:43 Memory usage: 76256 kB
2023/10/11 16:19:48 Memory usage: 85456 kB
2023/10/11 16:19:53 Memory usage: 97936 kB
2023/10/11 16:19:58 Memory usage: 107552 kB
2023/10/11 16:20:03 Memory usage: 116432 kB
2023/10/11 16:20:08 Memory usage: 124512 kB
2023/10/11 16:20:13 Memory usage: 134416 kB
2023/10/11 16:20:18 Memory usage: 136544 kB
2023/10/11 16:20:23 Memory usage: 150528 kB
2023/10/11 16:20:28 Memory usage: 160464 kB
2023/10/11 16:20:33 Memory usage: 168704 kB
2023/10/11 16:20:38 Memory usage: 177744 kB
2023/10/11 16:20:43 Memory usage: 186848 kB
2023/10/11 16:20:48 Memory usage: 195184 kB
2023/10/11 16:20:53 Memory usage: 205136 kB

@cuonglm
Copy link
Member

cuonglm commented Oct 12, 2023

git bisect points to https://go-review.googlesource.com/c/go/+/333529

@thanm
Copy link
Contributor

thanm commented Oct 12, 2023

Thank you @cuonglm for the bisect. Suspect CL seems plausible to me. I will dig into it a bit more.

@jgbooks
Copy link

jgbooks commented Dec 4, 2023

I didn't add -race. When I ran it for 2 hours, I noticed that the memory was growing slowly.
@walles @thanm

go version go1.21.0 linux/amd64

cat /etc/*release:
Red Hat Enterprise Linux Server release 6.1 (Santiago)

uname -a :
Linux bogon 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

go build main.go

What did you see instead?

image-20231204160904564

After 2 hours :

image-20231204160819248

@JOJO0527
Copy link

Also found memory leaks in go1.17.13 while race on , maybe it is not dependence on go version
image

@klaus-ziegert
Copy link

klaus-ziegert commented Apr 23, 2024

go version go1.22.2 linux/amd64

image

@jared-mackey
Copy link

jared-mackey commented Jun 18, 2024

Before/after enabling race detection in my staging environment.
image

Here's a more recent screenshot showing similar behavior.
image

Here's my production environment where I do not have race detection enabled.
image

Go 1.22.2 linux/amd64

@jared-mackey
Copy link

@thanm Anything we can do to assist you? Need anymore info to track this down and get it resolved? Happy to be a test subject 😀

@thanm
Copy link
Contributor

thanm commented Jul 11, 2024

I spent a little while looking at this.

I built a fresh copy of runtime/race/internal/amd64v1/race_linux.syso from LLVM source at dffa28fa3a907a3e0c64c9b06a46b55fc5cea40a, e.g.

$ git clone https://github.com/llvm/llvm-project.git
$ cd llvm-project/compiler-rt/lib/tsan/go
$ git checkout dffa28fa3a907a3e0c64c9b06a46b55fc5cea40a
...
$ git switch -c "work"
$ sh buildgo.sh
...
$ cp race_debug_linux_amd64.syso `go env GOROOT`/src/runtime/race/internal/amd64v1/race_linux.syso
$

I then hacked the TSAN runtime to expose the memory profiling hooks (which are normally stubbed out for Go). Here are the changes I made:

diff --git a/compiler-rt/lib/tsan/go/tsan_go.cpp b/compiler-rt/lib/tsan/go/tsan_go.cpp
index c689a51fb5e1..0846ddba5728 100644
--- a/compiler-rt/lib/tsan/go/tsan_go.cpp
+++ b/compiler-rt/lib/tsan/go/tsan_go.cpp
@@ -287,5 +287,10 @@ void __tsan_report_count(u64 *pn) {
   *pn = ctx->nreported;
 }
 
+void __tsan_mem_profile() {
+  extern void MemoryProfiler(u64 uptime);
+  MemoryProfiler(0);
+}
+
 }  // extern "C"
 }  // namespace __tsan
diff --git a/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp b/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
index e5ebb65754b3..84850a4722a9 100644
--- a/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
@@ -428,7 +428,6 @@ ThreadState::ThreadState(Tid tid)
   shadow_stack_end = shadow_stack + kInitStackSize;
 }
 
-#if !SANITIZER_GO
 void MemoryProfiler(u64 uptime) {
   if (ctx->memprof_fd == kInvalidFd)
     return;
@@ -460,6 +459,7 @@ static bool InitializeMemoryProfiler() {
   return true;
 }
 
+#if !SANITIZER_GO
 static void *BackgroundThread(void *arg) {
   // This is a non-initialized non-user thread, nothing to see here.
   // We don't use ScopedIgnoreInterceptors, because we want ignores to be
@@ -735,6 +735,7 @@ void Initialize(ThreadState *thr) {
   __ubsan::InitAsPlugin();
 #endif
 
+  InitializeMemoryProfiler();
 #if !SANITIZER_GO
   Symbolizer::LateInitialize();
   if (InitializeMemoryProfiler() || flags()->force_background_thread)

Then finally I modified the original program to call into TSAN's memory profiler as well:

package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"

	"net/http"

	_ "net/http/pprof"
	_ "unsafe"
)

func main() {
	printMemUsage()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {})
	server := http.Server{
		Addr:        ":5019",
		ReadTimeout: 5 * time.Second,
	}

	go func() {
		log.Println("Starting service", "http://localhost:5019/")
		if err := server.ListenAndServe(); err != nil { // blocks
			log.Fatalln("Error while running service:", err)
		}
	}()

	// Give the server time to start
	time.Sleep(2 * time.Second)

	log.Println("Now flooding the server with requests...")
	nextMemLog := time.Now()
	iter := 0

	for iter < 20 {
		// Trigger the handler function
		response, err := http.Get("http://localhost:5019/")
		if err != nil {
			log.Fatalln("Error while making request:", err)
		}
		if response.StatusCode != http.StatusOK {
			log.Fatalln("Unexpected status code:", response.StatusCode)
		}
		if response.Body != nil {
			response.Body.Close()
		}

		if time.Now().After(nextMemLog) {
			nextMemLog = time.Now().Add(5 * time.Second)
			printMemUsage()
			iter++
		}
		TsanMemProf()
	}
	if os.Getenv("GOTRACEBACK") == "crash" {
		panic("crashing now")
	}
}

func printMemUsage() {
	out, err := exec.Command("ps", "-o", "rss", "-p", strconv.Itoa(os.Getpid())).Output()
	if err != nil {
		log.Fatalln("Error while running ps command, is it installed?", err)
	}

	// Split output into "RSS" and "1234"
	rss := strings.Fields(string(out))

	log.Println("Memory usage:", rss[1], "kB")
}

//go:linkname TsanMemProf __tsan_mem_profile
func TsanMemProf()

and finally did the run using this:

$ go build -ldflags="-linkmode=external" -race -trimpath main.go
$ GORACE=profile_memory=/tmp/raceprof.p ./main
$

Here's an excerpt of the generated profile:

==763403== 0s [1]: RSS 15 MB: shadow:0 meta:0 file:1 mmap:11 heap:0 other:2 intalloc:0 memblocks:0 syncobj:0 trace:0 stacks=0 threads=1/1
==763403== 0s [1]: RSS 31 MB: shadow:3 meta:0 file:7 mmap:4 heap:0 other:5 intalloc:0 memblocks:0 syncobj:0 trace:2 stacks=9 threads=13/22
==763403== 0s [1]: RSS 31 MB: shadow:3 meta:0 file:7 mmap:4 heap:0 other:5 intalloc:0 memblocks:0 syncobj:0 trace:2 stacks=9 threads=13/23
==763403== 0s [1]: RSS 31 MB: shadow:3 meta:0 file:7 mmap:4 heap:0 other:5 intalloc:0 memblocks:0 syncobj:0 trace:2 stacks=9 threads=13/24
...
==763403== 0s [1]: RSS 69 MB: shadow:12 meta:2 file:7 mmap:0 heap:0 other:29 intalloc:0 memblocks:0 syncobj:0 trace:10 stacks=9 threads=66/17344
==763403== 0s [1]: RSS 69 MB: shadow:12 meta:2 file:7 mmap:0 heap:0 other:29 intalloc:0 memblocks:0 syncobj:0 trace:10 stacks=9 threads=66/17345
==763403== 0s [1]: RSS 69 MB: shadow:12 meta:2 file:7 mmap:0 heap:0 other:29 intalloc:0 memblocks:0 syncobj:0 trace:10 stacks=9 threads=66/17346
==763403== 0s [1]: RSS 69 MB: shadow:12 meta:2 file:7 mmap:0 heap:0 other:29 intalloc:0 memblocks:0 syncobj:0 trace:10 stacks=9 threads=66/17347
==763403== 0s [1]: RSS 69 MB: shadow:12 meta:2 file:7 mmap:0 heap:0 other:29 intalloc:0 memblocks:0 syncobj:0 trace:10 stacks=9 threads=66/17350

So in essence it seems that we have a lot of dead threads on our hands.

Given the way TSAN works, it is not clear to me whether these dead threads are actually created by TSAN or just created somewhere else and registered with TSAN via interceptor?

I will spent a little time in the debugger to see if I can learn more.

@thanm
Copy link
Contributor

thanm commented Jul 11, 2024

From what I can see in the debugger, the threads in question are being created in Go. For example, here's one call stack:

Thread 10 "main" hit Breakpoint 1, __sanitizer::ThreadRegistry::CreateThread (this=0x1668848 <__tsan::ctx_placeholder+10485896>, user_id=0, detached=true, parent_tid=1, arg=0x7fff963eece0) at ../../sanitizer_common/sanitizer_thread_registry.cpp:138
(gdb) up
#1  0x00000000008925f5 in __tsan::ThreadCreate (thr=0x7ffff7c03900, pc=8355325, uid=0, detached=true) at ../rtl/tsan_rtl_thread.cpp:130
(gdb) 
#2  0x000000000086ef9b in __tsan::__tsan_go_start (parent=0x7ffff7c03900, pthr=0x7fff963eed98, pc=0x7f7dfd <net/http.(*Transport).startDialConnForLocked+509>) at /w/llvm-project/llvm-project/compiler-rt/lib/tsan/go/tsan_go.cpp:223
(gdb) p pc
$2 = (void *) 0x7f7dfd <net/http.(*Transport).startDialConnForLocked+509>
(gdb)

At least from a sample that I took, all the CreateThread calls are resulting from the Go side (the one above looks like it is from this line: https://go.googlesource.com/go/+/b788e91badd523e5bb0fc8d50cd76b8ae04ffb20/src/net/http/transport.go#1544

Initially I thought that perhaps we were somehow missing a callback from Go to TSAN when a goroutine exits, but in fact that seems to be working just fine based on what I see in the debugger: I set breakpoints in both __tsan_go_start and __tsan_go_end, put a high "ignore" count on both, then ran for 30 seconds or so, and when I stopped the number of times each was hit was approximately the same.

I also took a closer look at the TSAN thread registry and it looks as though the large number of "dead" threads is actually expected given the way this program works (e.g. creates lots of goroutines); the number being reported is total threads created, including those that are reaped properly with __tsan_go_end.

I think the next thing to do is figure out some way to collect a heap profile from the program to see where the memory is going. It seems likely at this point that the TSAN runtime is what's holding the memory, just not clear which part.

@thanm
Copy link
Contributor

thanm commented Jul 11, 2024

Well, profiling with https://gperftools.github.io/gperftools/heapprofile.html was largely a waste of time -- it is clear that whatever memory growth is happening it is not via the regular heap, but rather via anonymous mmaps.

@jared-mackey
Copy link

I'm not sure if this is useful info or not (probably obvious to you). We are using DD's continuous profiler which I believe hooks into runtime/trace and it is not reporting any of this increased heap usage. The graphs for all of our staging environments are flat according to that. So at least according to runtime/trace it's invisible. Not sure if that helps narrow it down for you at all. 😄

@thanm
Copy link
Contributor

thanm commented Jul 11, 2024

Definitely not Go heap or C heap usage, it looks like the space is coming from the anonymous mmaps that the TSAN runtime does. I will spend some more time with that code to see if I can figure out what the issue is. The main job is going to be determining which TSAN data structures are the major culprits.

@thanm
Copy link
Contributor

thanm commented Jul 12, 2024

I wrote some throw-away code to profile internal mmaps (e.g. instrumented this function: https://github.com/llvm/llvm-project/blob/dffa28fa3a907a3e0c64c9b06a46b55fc5cea40a/compiler-rt/lib/sanitizer_common/sanitizer_posix.cpp#L401) and did a length run to see how mmap'd data grows for the example program.

Turns out to be a bit tricky, since the memory profiler itself actually allocates a fair amount of memory, but eventually I was able to gather a little data. At the start of a run here's what mmap usage looks like (for each line the number on the left is in bytes, tag at the right is the name passed to MmapNamed):

           12288 InternalMmapVector
         5373952 TracePart
         1052672 TwoLevelMap
           65536 Allocate
        16777216 SizeClassAllocator
         8388608 StackStore
          180224 sync allocator

and at the end of the run:

         2101248 InternalMmapVector
        13631488 TracePart
         1052672 TwoLevelMap
           65536 Allocate
        98566144 SizeClassAllocator
         8388608 StackStore
          811008 sync allocator

Not sure if this really tells us anything however; InternalMmapVector and SizeClassAllocator are very generic (basically lower level allocation mechanisms, hard to tell what is actually being allocated with these).

At this point given what I've seen from the numbers (Go heap / C++ heap numbers, mmap numbers from /proc/self/smaps) it is pretty clear that the increased memory is being used by the TSAN runtime... just not sure exactly how. I think it would probaby be a good point here to engage someone from the LLVM community e.g. a TSAN expert.

@thanm
Copy link
Contributor

thanm commented Jul 12, 2024

@dvyukov do you think you might be able to offer some advice here? Thanks

@dvyukov
Copy link
Member

dvyukov commented Jul 13, 2024

I see you already did a great deal of debugging.

Re the profiles in the previous comment, it would be useful to capture several profiles during the run, rather than just start/end. They should show what components are constantly growing w/o bound, and which were just subject to some initial growth.
Second question: are these constantly growing components proportional to the total RSS leakage we see. If yes, then we pin pointed all of it.

@dvyukov
Copy link
Member

dvyukov commented Jul 13, 2024

So in essence it seems that we have a lot of dead threads on our hands.

This is suspicious and what I would check first.
Sanitizers keep some info about finished threads as well for reporting purposes. But in C/C++ this is usually bounded number of them.

I see that one part that disappeared in v3 is this kThreadQuarantineSize/kMaxTidReuse:
llvm/llvm-project@b332134921b42#diff-39729b589bbf9d353b9558f115735f4ec75c3f56539af5d1a2ff1c998d9fb20eR361

kThreadQuarantineSize was supposed to set bound of number of memorized dead threads. Probably my idea was to remove all dead threads in DoResetImpl, but maybe I never did that. I am not sure how easy/safe is it to reset all dead threads in DoResetImpl now, so probably the easiest solution is to restore the quarantine (at least for Go).

@thanm
Copy link
Contributor

thanm commented Jul 15, 2024

Thanks, that seems like a promising direction to follow. I'll read up on the old quarantine code and see what I can do there.

@thanm
Copy link
Contributor

thanm commented Jul 16, 2024

Following up on the suggestion in this comment, I tried this patch to restore the thread quarantine limit:

diff --git a/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp b/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
index e5ebb65754b3..4a0cba953087 100644
--- a/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
@@ -381,13 +381,20 @@ void SlotUnlock(ThreadState* thr) {
   thr->slot->mtx.Unlock();
 }
 
+
+#if SANITIZER_GO
+static const u32 kThreadQuarantineSize = 16;
+#else
+static const u32 kThreadQuarantineSize = UINT32_MAX
+#endif
+
 Context::Context()
     : initialized(),
       report_mtx(MutexTypeReport),
       nreported(),
       thread_registry([](Tid tid) -> ThreadContextBase* {
         return new (Alloc(sizeof(ThreadContext))) ThreadContext(tid);
-      }),
+      }, UINT32_MAX, kThreadQuarantineSize, 0),
       racy_mtx(MutexTypeRacy),
       racy_stacks(),
       fired_suppressions_mtx(MutexTypeFired),

This did not seem to go well. The Go testcase fails fairly quickly with a failed assertion:

ThreadSanitizer: CHECK failed: tsan_rtl.h:714 "((pos)) >= ((&current->events[0]))" (0x0, 0x7fade9c2a028) (tid=13648)

A bit more from the debugger:

(gdb) where 5
#0  __sanitizer::CheckFailed (file=0xa28b64 "../rtl/tsan_rtl.h", line=714, 
    cond=0xa28b78 "((pos)) >= ((&current->events[0]))", v1=0, 
    v2=140737341030440) at ../../sanitizer_common/sanitizer_termination.cpp:70
#1  0x000000000087276b in __tsan::TraceAcquire<__tsan::EventFunc> (
    ev=0x7fffffffd5f8, thr=0x7ffff7c01e00) at ../rtl/tsan_rtl.h:714
#2  __tsan::TryTraceFunc (pc=4707297, thr=0x7ffff7c01e00)
    at ../rtl/tsan_rtl.h:753
#3  __tsan::FuncEntry (pc=4707297, thr=0x7ffff7c01e00)
    at ../rtl/tsan_rtl.h:784
#4  __tsan::__tsan_func_enter (thr=0x7ffff7c01e00, 
    pc=0x47d3e1 <runtime.goexit+1>)
    at /w/llvm-project/llvm-project/compiler-rt/lib/tsan/go/tsan_go.cpp:202
(More stack frames follow...)
(gdb) up
#1  0x000000000087276b in __tsan::TraceAcquire<__tsan::EventFunc> (
    ev=0x7fffffffd5f8, thr=0x7ffff7c01e00) at ../rtl/tsan_rtl.h:714
(gdb) p pos
$10 = (__tsan::Event *) 0x0
(gdb) 

Code:

ALWAYS_INLINE WARN_UNUSED_RESULT bool TraceAcquire(ThreadState *thr,
                                                   EventT **ev) {
  // TraceSwitchPart accesses shadow_stack, but it's called infrequently,
  // so we check it here proactively.
  DCHECK(thr->shadow_stack);
  Event *pos = reinterpret_cast<Event *>(atomic_load_relaxed(&thr->trace_pos));
#if SANITIZER_DEBUG
  // TraceSwitch acquires these mutexes,
  // so we lock them here to detect deadlocks more reliably.
  { Lock lock(&ctx->slot_mtx); }
  { Lock lock(&thr->tctx->trace.mtx); }
  TracePart *current = thr->tctx->trace.parts.Back();
  if (current) {
    DCHECK_GE(pos, &current->events[0]);            <<<<----- failing assert here

@dvyukov
Copy link
Member

dvyukov commented Jul 17, 2024

I unfortunately don't have any suggestions off the top of my head. Things are a bit elaborate there. Probably there is some latent bug, or unimplemented part somewhere, but things worked fine b/c thread reuse was never enabled in the new version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. RaceDetector
Projects
Development

No branches or pull requests