proposal: testing/synctest: new package for testing concurrent code #67434
Comments
I really like how simple this API is.
How does time work when goroutines aren't idle? Does it stand still, or does it advance at the usual rate? If it stands still, it seems like that could break software that assumes time will advance during computation (though maybe that's rare in practice). If it advances at the usual rate, it seems like that reintroduces a source of flakiness. E.g., in your example, the 1 second sleep will advance time by 1 second, but then on a slow system the checking thread may still not execute for a long time. What are the bounds of the fake time implementation? Presumably if you're making direct system calls that interact with times or durations, we're not going to do anything about that. Are we going to make any attempt at faking time in the file system?
What if a goroutine is blocked on a channel that goes outside the group? This came to mind in the context of whether this could be used to coordinate a multi-process client/server test, though I think it would also come up if there's any sort of interaction with a background worker goroutine or pool.
What happens if multiple goroutines in a group call Wait? I think the options are to panic or to consider all of them idle, in which case they would all wake up when every other goroutine in the group is idle. What happens if you have nested groups, say group A contains group B, and a goroutine in B is blocked in Wait, and then a goroutine in A calls Wait? I think your options are to panic (though that feels wrong), wake up both if all of the goroutines in group A are idle, or wake up just B if all of the goroutines in B are idle (but this would block waking up A until nothing is calling Wait in group B).
Time stands still, except when all goroutines in a group are idle. (Same as the playground behaves, I believe.) This would break software that assumes time will advance. You'd need to use something else to test that case.
The bounds are the time package. Faking time in the filesystem seems complicated and highly specialized, so I don't think we should try. Code which cares about file timestamps will need to use a testable filesystem abstraction.
As proposed, this would count as an idle goroutine. If you fail to isolate the system under test this will probably cause problems, so don't do that.
As proposed, none of them ever wake up and your test times out, or possibly panics if we can detect that all goroutines are blocked in that case. Having them all wake at the same time would also be reasonable.
Oh, I didn't think of that. Nested groups are too complicated; let's not support them.
This is a very interesting proposal! I feel worried that the idle-wait synchronization it introduces could be tempting to rely on outside of tests. Assuming that's a valid concern (if it isn't then I'll retract this entire comment!), I could imagine mitigating it in two different ways:
(I apologize in advance if I misunderstood any part of the proposal or if I am missing something existing that's already similarly convenient.)
I think using idle-wait synchronization outside of tests is always going to be a mistake. It's fragile and fiddly, and you're better served by explicit synchronization. (This prompts the question: Isn't this fragile and fiddly inside tests as well? It is, but using a fake clock removes many of the sources of fragility, and tests often have requirements that make the fiddliness a more worthwhile tradeoff. In the expiring cache example, for example, non-test code will never need to guarantee that a cache entry expires precisely at the nanosecond defined.) So while perhaps we could offer a standalone synchronization primitive outside of synctest, I don't think we should. As for passing a testing.T to Run: ...
Interesting proposal. I like that it allows for waiting for a group of goroutines, as opposed to all goroutines in my proposal (#65336), though I do have some concerns:
One of the goals of this proposal is to minimize the amount of unnatural code required to make a system testable. Mock time implementations require replacing calls to idiomatic time package functions with a testable interface. Putting fake time in the standard library would let us just write the idiomatic code without compromising testability. For timeouts, the ... Also, it would be pointless for ...
I wanted to evaluate practical usage of the proposed API. I wrote a version of Run and Wait based on parsing the output of runtime.Stack. Wait calls runtime.Gosched in a loop until all goroutines in the current group are idle. I also wrote a fake time implementation. Combined, these form a reasonable facsimile of the proposed synctest package, with some limitations: The code under test needs to be instrumented to call the fake time functions, and to call a marking function after creating new goroutines. Also, you need to call a synctest.Sleep function in tests to advance the fake clock.

I then added this instrumentation to net/http. The synctest package does not work with real network connections, so I added an in-memory net.Conn implementation to the net/http tests. I also added an additional helper to net/http's tests, which simplifies some of the experimentation below:

```go
var errStillRunning = errors.New("async op still running")
// asyncResult is the result of an asynchronous operation.
type asyncResult[T any] struct {}
// runAsync runs f in a new goroutine,
// and returns an asyncResult which is populated with the result of f when it finishes.
// runAsync calls synctest.Wait after running f.
func runAsync[T any](f func() (T, error)) *asyncResult[T]
// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool
// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error)
```

One of the longest-running tests in the net/http package is TestServerShutdownStateNew (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/serve_test.go#5611). This test creates a server, opens a connection to it, and calls Server.Shutdown. It asserts that the server, which is expected to wait 5 seconds for the idle connection to close, shuts down in no less than 2.5 seconds and no more than 7.5 seconds. This test generally takes about 5-6 seconds to run in both HTTP/1 and HTTP/2 modes.

The portion of this test which performs the shutdown is:

```go
shutdownRes := make(chan error, 1)
go func() {
shutdownRes <- ts.Config.Shutdown(context.Background())
}()
readRes := make(chan error, 1)
go func() {
_, err := c.Read([]byte{0})
readRes <- err
}()
// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second
t0 := time.Now()
select {
case got := <-shutdownRes:
d := time.Since(t0)
if got != nil {
t.Fatalf("shutdown error after %v: %v", d, err)
}
if d < expectTimeout/2 {
t.Errorf("shutdown too soon after %v", d)
}
case <-time.After(expectTimeout * 3 / 2):
t.Fatalf("timeout waiting for shutdown")
}
// Wait for c.Read to unblock; should be already done at this point,
// or within a few milliseconds.
if err := <-readRes; err == nil {
t.Error("expected error from Read")
}
```

I wrapped the test in a synctest.Run call and changed it to use the in-memory connection. I then rewrote this section of the test:

```go
shutdownRes := runAsync(func() (struct{}, error) {
return struct{}{}, ts.Config.Shutdown(context.Background())
})
readRes := runAsync(func() (int, error) {
return c.Read([]byte{0})
})
// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second
synctest.Sleep(expectTimeout - 1)
if shutdownRes.done() {
t.Fatal("shutdown too soon")
}
synctest.Sleep(2 * time.Second)
if _, err := shutdownRes.result(); err != nil {
t.Fatalf("Shutdown() = %v, want complete", err)
}
if n, err := readRes.result(); err == nil || err == errStillRunning {
t.Fatalf("Read() = %v, %v; want error", n, err)
}
```

The test exercises the same behavior it did before, but it now runs instantaneously. (0.01 seconds on my laptop.)

I made an interesting discovery after converting the test: The server does not actually shut down in 5 seconds. In the initial version of this test, I checked for shutdown exactly 5 seconds after calling Shutdown. The test failed, reporting that the Shutdown call had not completed. Examining the Shutdown function revealed that the server polls for closed connections during shutdown, with a maximum poll interval of 500ms, and therefore shutdown can be delayed slightly past the point where connections have shut down.

I changed the test to check for shutdown after 6 seconds. But once again, the test failed. Further investigation revealed this code (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/server.go#3041):

```go
st, unixSec := c.getState()
// Issue 22682: treat StateNew connections as if
// they're idle if we haven't read the first request's
// header in over 5 seconds.
if st == StateNew && unixSec < time.Now().Unix()-5 {
st = StateIdle
}
```

The comment states that new connections are considered idle after 5 seconds, but thanks to the low granularity of Unix timestamps a connection may be considered idle after as little as 4 or as much as 6 seconds. Combined with the 500ms poll interval (and ignoring any added scheduler delay), Shutdown may take up to 6.5 seconds to complete, not 5. Using a fake clock rather than a real one not only speeds up this test dramatically, but it also allows us to more precisely test the behavior of the system under test.

Another slow test is TestTransportExpect100Continue (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/transport_test.go#1188). This test sends an HTTP request containing an "Expect: 100-continue" header, which indicates that the client is waiting for the server to indicate that it wants the request body before the client sends it. In one variation, the server does not send a response; after a 2 second timeout, the client gives up waiting and sends the request. This test takes 2 seconds to execute, thanks to this timeout. In addition, the test does not validate the timing of the client sending the request body; in particular, tests pass even if the client waits for the wrong amount of time before sending it.

The portion of the test which sends the request is:

```go
resp, err := c.Do(req)
```

I changed this to:

```go
rt := runAsync(func() (*Response, error) {
return c.Do(req)
})
if v.timeout {
synctest.Sleep(expectContinueTimeout-1)
if rt.done() {
t.Fatalf("RoundTrip finished too soon")
}
synctest.Sleep(1)
}
resp, err := rt.result()
if err != nil {
t.Fatal(err)
}
```

This test now executes instantaneously. It also verifies that the client does or does not wait for the ExpectContinueTimeout as expected.

I made one discovery while converting this test. The synctest.Run function blocks until all goroutines in the group have exited. (In the proposed synctest package, Run will panic if all goroutines become blocked (deadlock), but I have not implemented that feature in the test version of the package.) The test was hanging in Run, due to a leaked goroutine. I tracked this down to a missing net.Conn.Close call, which was leaving an HTTP client reading indefinitely from an idle and abandoned server connection. In this case, Run's behavior caused me some confusion, but it ultimately led to the discovery of a real (if fairly minor) bug in the test. (I'd probably have experienced less confusion if I hadn't initially assumed this was a bug in the implementation of Run.)

At one point during this exercise, I accidentally called testing.T.Run from within a synctest.Run group. This results in, at the very best, quite confusing behavior. I think we would want to make it possible to detect when running within a group, and have testing.T.Run panic in this case.

My experimental implementation of the synctest package includes a synctest.Sleep function by necessity: It was much easier to implement with an explicit call to advance the fake clock. However, I found in writing these tests that I often want to sleep and then wait for any timers to finish executing before continuing. I think, therefore, that we should have one additional convenience function:

```go
package synctest
// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
time.Sleep(d)
Wait()
}
```

The net/http package was not designed to support testing with a fake clock. This has served as an obstacle to improving the state of the package's tests, many of which are slow, flaky, or both. Converting net/http to be testable with my experimental version of synctest required a small number of minor changes. A runtime-supported synctest would have required no changes at all to net/http itself.

Converting net/http tests to use synctest required adding an in-memory net.Conn. (I didn't attempt to use net.Pipe, because its fully-synchronous behavior tends to cause problems in tests.) Aside from this, the changes required were very small.

My experiment is in https://go.dev/cl/587657.
This proposal has been added to the active column of the proposals project.
Commenting here due to @rsc's request: Relative to my proposal #65336, I have the following concerns:
Regarding overriding the time package rather than using a testable interface:

The ... In contrast, we can test code which calls ...

Time is fundamentally different in that there is no way to use real time in a test without making the test flaky and slow. Time is also different from an ...

Since we can't use real time in tests, we can insert a testable wrapper around the clock. In addition, if we define a standard testable wrapper around the clock, we are essentially declaring that all public packages which deal with time should provide a way to plumb in a clock. (Some packages do this already, of course; crypto/tls.Config.Time is an example in the standard library.) That's an option, of course. But it would be a very large change to the Go ecosystem as a whole.
The pprof.SetGoroutineLabels API disagrees.
It doesn't try to hide goroutine identity; rather, it tries to keep people from relying on goroutine numbers.
If I understood the proposal correctly, it will wait for any goroutine (recursively) that was started using a go statement from within the Run group.
Yes: if you call Wait, it waits for every goroutine started, directly or transitively, by the function passed to Run.
Given that there's more precedent for goroutine identity than I had previously thought, and seeing how ... However, I'm still a little ambivalent about goroutine groups affecting ... That being said, I agree that plumbing a time/clock interface through existing code is indeed tedious, and having ...
Thanks for doing the experiment. I find the results pretty compelling.
I don't quite understand this function. Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?
Partly I was wondering about nested groups because I've been scheming about other things that the concept of a goroutine group could be used for. Though it's true that, even if we have groups for other purposes, it may make sense to say that synctest groups cannot be nested, even if in general groups can be nested.
You're right that sleeping past the deadline of a timer is sufficient. It's fairly natural to sleep to the exact instant of a timer, however. If a cache entry expires in some amount of time, it's easy to sleep for that exact amount of time, possibly using the same constant that the cache timeout was initialized with, rather than adding a nanosecond. Adding nanoseconds also adds a small but real amount of confusion to a test in various small ways: The time of logged events drifts off the integer second, rate calculations don't come out as cleanly, and so on. Plus, if you forget to add the necessary adjustment or otherwise accidentally sleep directly onto the instant of a timer's expiry, you get a race condition. Cleaner, I think, for the test code to always resynchronize after poking the system under test. This doesn't have to be a function in the synctest package, of course; a test can always call time.Sleep followed by Wait itself.
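To make the race concrete, here is a minimal sketch, assuming the proposed testing/synctest API and that time.AfterFunc timers use the bubble's fake clock; the test name and variables are mine, not from the proposal:

```go
package example

import (
	"testing"
	"testing/synctest" // proposed package path
	"time"
)

func TestSleepToTimerInstant(t *testing.T) {
	synctest.Run(func() {
		expired := false
		time.AfterFunc(1*time.Second, func() { expired = true })

		// Sleep to the exact instant the timer fires. When the fake clock
		// advances to T+1s, both this goroutine and the timer's goroutine
		// become runnable; without a Wait, their order is unspecified.
		time.Sleep(1 * time.Second)

		synctest.Wait() // resynchronize: the AfterFunc callback has now run
		if !expired {
			t.Error("timer callback did not run")
		}
	})
}
```

Dropping the Wait and checking immediately after the Sleep is the race being described; sleeping one nanosecond past the timer's instant would also avoid it.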
I'm very intrigued! I've just about convinced myself that there's a useful general-purpose synchronization API hiding in here, but I'm not sure what it is or what it's useful for.
For what it's worth, I think it's a good thing that virtual time is included in this, because it makes sure that this package isn't used in production settings. It makes it only suitable for tests (and very suitable).
It sounds like the API is still Run(f func()) and Wait().
Damien suggested also adding Sleep, which is time.Sleep followed by Wait.
The difference between time.Sleep and synctest.Sleep seems subtle enough that it seems like you should have to spell out the Wait at the call sites where you need it. The only time you really need Wait is if you know someone else is waking up at that very moment. But then if they've both done the Sleep+Wait form then you still have a problem. You really only want some of the call sites (maybe just one) to use the Sleep+Wait form. I suppose that the production code will use time.Sleep since it's not importing synctest, so maybe it's clear that the test harness is the only one that will call Sleep+Wait. On the other hand, fixing a test failure by changing s/time.Sleep/synctest.Sleep/ will be a strange-looking bug fix. Better to have to add synctest.Wait instead. If we really need this, it could be synctest.SleepAndWait but that's what statements are for. Probably too subtle and should just limit the proposal to Run and Wait.
Some additional suggestions for the description of the idle state:
Additionally, for "mutex operation", let's list out the exact operations considered, for implementation/testing completeness:
The API looks simple, and that is excellent. What I am worried about is unexpected failure modes that lead to undetected regressions, which might need tight support in the testing package to detect. Imagine you unit test your code but are unable to mock out a dependency, maybe due to lack of experience or the design of existing code you have to work with. That dependency suddenly starts calling a syscall (e.g. it lazily tries to tune the library using a sync.Once with a timeout, instead of doing so at init time). Without support in testing you will never detect that; your tests will just suddenly time out after an innocent minor dependency update.
Orthogonally to the previous comment, may I suggest limiting this package to the standard library at first, to gather more experience with the approach before committing to it? That would also allow sketching out integration with the testing package, in addition to finding more pitfalls.
Can you expand more on what you mean by undetected regressions? If the code under test (either directly, or through a dependency) unexpectedly calls a blocking syscall, the goroutine making that call is not considered idle, so the test will hang in Wait and eventually time out, which seems detectable.
What kind of support are you thinking of?
What does this do: a call to Run whose function does nothing but call Wait, with no other goroutines in the group?
Does it succeed or panic? It's not clear to me from the API docs because:
This is obviously a degenerate case, but I think it also applies if a test wanted to get the fake time features when testing otherwise non-concurrent code.
In this case, the goroutine calling Wait is the only goroutine in the group, so there is nothing else to wait for.
This should succeed. Perhaps the documentation should say that Wait returns when every goroutine in the group other than the caller is idle.
This also resolves the question of what happens if two goroutines call Wait at the same time.
If multiple goroutines call Wait, ... Some possible alternative behaviors:
What's the use case for supporting multiple simultaneous calls to Wait?
@magical Thank you. I've updated, and encountered a deadlock. It's probably something I'm misunderstanding, but sharing in case it's helpful:

```go
package x
import (
"context"
"testing"
"testing/synctest"
"time"
)
// go test -count=10
func TestSyncTestContextDeadline(t *testing.T) {
synctest.Run(func() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
t.Logf("before sleep: %s", time.Now())
time.Sleep(10 * time.Second)
synctest.Wait()
t.Logf("after sleep: %s", time.Now())
t.Logf("ctx.Err(): %v", ctx.Err())
})
}
```
It's subtle, but in this case you shouldn't need a synctest.Wait because the context expires 5s after the test starts and the sleep is for 10s. Expected behavior is for the runtime to detect that all goroutines are idle, advance the synthetic clock to the time of the next event (T+5s), wait for all goroutines to become idle again, and then advance the clock again to T+10s.

I think this test is running afoul of a bug in CL 591997. I've uploaded a new version of the CL with a couple fixes that make this test pass consistently for me, although there are some other bugs in there that (optimistically) I haven't worked out yet.

Surprising-to-me edge cases I've discovered in implementation so far:
(So far, the bugs I've encountered in the implementation all seem to indicate that this feature is tricky to implement, but they are not problems with the user-facing API.)
I too am now able to get 1000 runs without fail. Excellent! This is a very exciting feature/addition. I'm confident you'll suss out all the edge cases. :)
Agreed. The API seems solid.
Ah, interesting - thanks for the clarification. (In fact I see that you and Austin already discussed this. My bad.) If I understand correctly, it sounds like we can break Sleep's behaviour down into three cases, for a Sleep that wakes at time u while a timer fires at time t:

u < t: the timer has not fired when Sleep returns.
u > t: the timer is guaranteed to have fired before Sleep returns.
u = t: the timer and the Sleep wake at the same instant, so the ordering is unspecified.
The u=t case seems like a sharp edge, since it's the only time when the user would actually have to call Wait. As Austin pointed out, sleeping even 1ns past t seems to fill the same role as Sleep+Wait. They asked if that makes Wait unnecessary, because it could be replaced by Sleep(1ns). I would turn it around and ask: why is that extra nanosecond necessary? Is there any reason why Sleep can't always Wait afterwards? (I imagine maybe there is a good reason, like it would inject too much synchronization into the code under test, but I haven't completely thought it through.)

Continuing along the same vein...

```go
synctest.Run(func(){
time.AfterFunc(5*time.Second, func() { println(5) } )
time.AfterFunc(10*time.Second, func() { println(10) } )
time.AfterFunc(15*time.Second, func() { println(15) } )
time.Sleep(5*time.Second) // +5
time.Sleep(5*time.Second) // +10
time.Sleep(5*time.Second) // +15
time.Sleep(5*time.Second) // +20
})
```

```go
synctest.Run(func(){
time.AfterFunc(5*time.Second, func() { println(5) } )
time.AfterFunc(10*time.Second, func() { println(10) } )
time.AfterFunc(15*time.Second, func() { println(15) } )
time.Sleep(20*time.Second)
})
```

In the first one, the Sleep and AfterFunc times line up perfectly, so we should only encounter the u=t and u>t cases, so there shouldn't be any implicit Wait - unless Sleep unconditionally Waits when it starts? Is Sleep(0) equivalent to Wait?
What happens if two Sleep calls happen at the same time? How do we decide which one waits for the other? We could say that Sleep is lower priority than timers: First wake non-Sleep timers, and then wake Sleep calls when all other timers have finished. But that seems more subtle than an explicit Wait, and we still have the question of multiple Sleeps happening at the same time.

Another possibility might be to say that the goroutine started by Run has a lower priority than any other: It only wakes when every other goroutine is idle. But again, this is subtle.
I love this proposal so far. I have run into the problem of long-running tests that just call time.Sleep.

Maybe this is out of scope, but I am looking at this from the perspective of Communicating Sequential Processes (CSP): in CSP, a concurrent program is correct iff each trace (possible sequence of events) is correct. E.g. when we call ...

This would, of course, complicate the implementation, since we will not be able to use the "idle" state as a global synchronization point, but we'd have a DAG of "idle" states many of which would be independent from others. Would it be possible to generate all possible traces of goroutines (using "idle" as the "event" in the context of CSP) and test whether or not all of them result in correct behavior (i.e. pass the test)?
This may be irrelevant, but the go playground docs mention the runtime changes that implement its fake time.
It looks like these changes may have since been abandoned?
Those changes still exist and can be enabled using the faketime build tag.
I think the draft CL (https://go.dev/cl/591997) is approaching something real. Still needs more tests, but I think I've squashed all the race conditions I know about.

Currently open questions/suggestions:

@rsc suggests (#67434 (comment)) that moving a timer between bubbles (what I've been calling a "synctest group") should crash. I agree, and the current version of CL 591997 implements this. I believe we can make channel operations and coroutine switches that cross bubble boundaries crash as well. Should we? Specifically, we would disallow:
@narqo suggests (#67434 (comment)) putting the API in the testing package. Perhaps something like:
I don't have particularly strong feelings about this, but my inclination is to stick with a separate synctest package. The testing package API is quite large, and a separate package gives us a good place to document these features in isolation. I suggest adding one more function to synctest, to report whether the current goroutine is in a bubble or not:
Hello. If you have components which contain time-bound logic, and testing them is difficult without ending up with slow-running tests, or you need to keep a consistent timeline between components that each expect a time-related dependency, then the package I made for a similar purpose might help. I'm not opposing the proposal, just wanted to share it; it's a bit more generic to use. (More of a ...)

The following cases are covered:

clock.Now()
clock.Sleep
clock.After
clock.NewTicker
The example from the proposal would look like:

```go
func TestCacheEntryExpires(t *testing.T) {
count := 0
c := NewCache(2 * time.Second, func(key string) string {
count++
return fmt.Sprintf("%v:%v", key, count)
})
// Get an entry from the cache.
if got, want := c.Get("k"), "k:1"; got != want {
t.Errorf("c.Get(k) = %q, want %q", got, want)
}
// Verify that we get the same entry when accessing it before the expiry.
timecop.Travel(t, 1 * time.Second)
if got, want := c.Get("k"), "k:1"; got != want {
t.Errorf("c.Get(k) = %q, want %q", got, want)
}
// back to the future
timecop.Travel(t, 3 * time.Second)
if got, want := c.Get("k"), "k:2"; got != want {
t.Errorf("c.Get(k) = %q, want %q", got, want)
}
}
```
Could we consider a more flattened API such as:

```go
func TestXXX(t *testing.T) {
synctest.Enable(t)
// or
stop := synctest.Start()
defer stop()
}
```

This approach allows helper functions to handle test arrangements/setups. The API would remain straightforward, and we could defer the teardown process using t.Cleanup:

```go
func TestXXX(t *testing.T) {
t.Cleanup(synctest.Start())
}
```
I don't think this offers any significant advantages over synctest.Run, and it has some disadvantages. With Run, a goroutine is always either part of a synctest group or not. Within a goroutine, time.Now either returns a fake time value or it does not. The extent of Run's influence is clearly delineated, and there's no possibility of accidentally leaking that influence somewhere unintended. For example, if we use a switch like Enable or Start, the testing package now needs to consider the possibility that the goroutine running a test returns from the test function while still using a fake clock.

I don't see how Enable or Start would be needed by testing frameworks either. A testing helper function that uses synctest can accept a function to run inside a synctest.Run environment.
Is it important to distinguish whether it unparks? Or would disallowing the nonblocking case make it too restrictive? The original question was about timers. Would the new restriction apply to all types of values? (This probably makes sense, otherwise we'd have to distinguish which values could contain time and which couldn't.)
I don't know how to detect any other cases. We can detect when a channel operation unparks a goroutine in a separate bubble: when waking the blocked goroutine, we just check to see whether the goroutine performing the channel operation and the one being unparked are in the same bubble. But unless we associate each channel with a bubble, I don't know how to detect the case where a goroutine in one bubble closes a channel or performs some other non-blocking operation, and at a later time a goroutine in a different bubble observes the result of the operation.

We could associate channels with bubbles, but that's an added field on runtime.hchan. I've been trying to avoid increasing the size of any runtime structures, aside from one additional pointer on g to record bubble membership. (g is already pretty big, so hopefully one more word isn't a meaningful cost.)

I think a reasonable compromise is to say that channel operations aren't allowed to cross bubble boundaries, to panic when we can detect that this restriction has been violated, but to not worry about cases where we can't easily detect a violation.
Yes. The original question (I think) was about operations on a time.Timer. For example, if a goroutine in one bubble creates a time.Timer (which will use the bubble's fake clock) and a goroutine in a different bubble resets the timer, which clock is the timer now using? Panicking seems reasonable here, since this doesn't seem like the sort of thing one should be doing intentionally.
I think the panicking really needs to be only about timer channels, not all channels. Consider a goroutine serving as a mutex protecting some shared state. That goroutine can be created during func init (outside any bubbles) and then code in a bubble can send and receive to and from that goroutine to access the shared state. Surely that should be allowed, just as code in a bubble can lock and unlock a global mutex. (The same is true for a mutex that may or may not park or unpark across a bubble.) Restricting to timer channels also makes the panic behavior independent of buffering, so that you get consistent panics as opposed to panics that depend on scheduling details.
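For reference, a minimal sketch of the pattern being described; the names and details here are assumptions, not taken from the comment: a goroutine started during package init owns some shared state, and code inside a bubble accesses it by sending and receiving on channels.

```go
package example

// counterReq is created and served outside any bubble, during package init.
var counterReq = make(chan chan int)

func init() {
	go func() {
		counter := 0 // shared state owned by this goroutine
		for reply := range counterReq {
			counter++
			reply <- counter
		}
	}()
}

// NextID asks the owning goroutine for the next counter value. Code running
// inside a synctest bubble should be able to call this, just as it can lock
// and unlock a global sync.Mutex.
func NextID() int {
	reply := make(chan int)
	counterReq <- reply
	return <-reply
}
```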
@prattmic raises a very interesting (in the worst sense of the word) point in review https://go.dev/cl/591997, which I'm copying here for visibility: A goroutine blocked in sync.Mutex.Lock is idle. Run panics if all goroutines in the bubble are idle (deadlock). crypto/rand.Reader is a global with a mutex. reflect.FuncOf has a global mutex protecting a cache. There are probably other cases of global mutexes in std.

In general, a synctest goroutine can't safely access any resource guarded by a global lock, because it will appear idle if it blocks in lock acquisition. That's a problem for common functions with a hidden lock, like reflect.FuncOf.

One possible way to resolve this would be to say that goroutines blocked in mutex acquisition are not idle. For the common case of a mutex held for a brief period of time while accessing some state, this would cause no issues for synctest users. This change would make synctest unsuitable for testing code which holds a mutex for a long time across a blocking operation, however. That may be an uncommon enough case to not be a problem. (I don't think this limitation would affect synctest's usability in testing net/http, for example.)

Another possibility might be to distinguish between mutexes acquired by a goroutine within a synctest bubble and ones that are not. We might be able to set a bit on Mutex.sema indicating that it's held by a bubbled goroutine, and only consider a goroutine idle if it's blocked on a mutex with that bit set. I haven't worked this option through in much detail.

Either approach would still leave the same problem with a global channel serving as a mutex, as @rsc describes in the previous comment. We could say we don't cover this case; a synctest goroutine cannot communicate with a goroutine outside its bubble using channels. Another possibility would be to associate channels with bubbles: If a channel is created by a goroutine within a bubble, it is tied to the bubble. A goroutine blocked on a channel within its bubble is idle, but one blocked on a channel from outside its bubble is not. This would be a fairly straightforward change to make (although I'm not immediately sure how to do it without increasing the size of runtime.hchan, which would be unfortunate).
@neild Is it critical for Run to panic when all goroutines in the bubble are idle?
It isn't strictly critical for Run to panic if all goroutines in a bubble are idle, but if goroutines can become non-durably "idle" in, for example, a call to reflect.FuncOf then the bubble's fake clock may be advanced prematurely and Wait calls may return before every active goroutine has truly idled. Having slept on this, I propose the following amendments to the proposal:
As an implementation detail, to avoid increasing the size of runtime.hchan, we'll only track whether a channel is in some bubble (requires one bit, and hchan has free bits) rather than which bubble it is in (would require one word).

We could make idle detection work with mutexes: When locking a mutex, record whether it was locked by a goroutine in a synctest bubble or not. A goroutine blocked on a mutex is idle if the mutex is held by a bubbled goroutine, and non-idle otherwise. However, that's much more complexity than I want to propose adding to the mutex code, and I don't think it justifies the cost.

I'll try amending the implementation with this change to see how it works in practice.
Plus critical sections don't really have an owner. That is, Unlock does not have to be called by the same goroutine as Lock. This "feature" has stopped many of our clever ideas...
I don't think that would be a problem here.
I don't think this requires any inefficiencies for non-synctest code. It lets us identify mutexes held by a goroutine within a bubble, and goroutines blocked on a mutex held within a bubble. It prevents weird cases where a mutex is locked inside a bubble and unlocked outside it or vice-versa.

The rules are a bit complicated to lay out, but you wouldn't need to think about them in practice: If you're using synctest, Run idles when all goroutines in the bubble are blocked waiting for other goroutines in the bubble, and it doesn't idle when goroutines are blocked waiting for something outside the bubble.

It's probably not even all that complicated to implement. But I'm doubtful that it's necessary. The fact that mutex lock operations spin before parking makes mutexes strictly inferior to channels for the case where you expect the lock to block for an extended period of time, which is the only case where excluding mutexes from synctest's consideration would be a problem.
Why is the solution for channels different from the solution for mutexes? Why shouldn't they both have the in/out-of-bubble check?
I have updated https://go.dev/cl/591997. The current package documentation from that CL is:
Changes from the previous version:
To recap the rationale for the changes to channels and mutexes: A goroutine in a synctest bubble can become temporarily blocked while acquiring some globally synchronized resource. For example, crypto/rand.Read performs lazy initialization of its randomness source and/or guards access to that source with a mutex. reflect.FuncOf contains a mutex-guarded cache. My initial thought when proposing synctest was that code under test can and should avoid accessing global mutexes. However, I think the case of reflect.FuncOf shows that this isn't practical: There are functions in the standard library which acquire a global mutex, and where that fact isn't something the user can be expected to be aware of.

To avoid this problem, we define the rules on when a goroutine in a synctest bubble is "idle" to be more limited: A goroutine is idle when we can say with a high degree of certainty that it is blocked on another goroutine in its bubble. A goroutine blocked on something outside the bubble is not idle. This is the same principle that says that a bubbled goroutine blocked on a network operation or syscall is not idle: We can't say with confidence whether the goroutine is durably idle, or will be woken by some event from outside the bubble.

For channel operations, we handle this by associating channels with the bubble (if any) that creates the channel. A channel created within a bubble may only be used by goroutines in that bubble, and so a goroutine blocked on a bubbled channel can only be woken by some other goroutine in the bubble.

Mutexes are not explicitly created, so we cannot associate a sync.Mutex with a bubble at creation time. We could instead mark a mutex at Lock time as held by a bubbled goroutine, and then apply similar rules as for channels: A bubbled goroutine blocked on a bubbled mutex is idle, a bubbled goroutine blocked on an unbubbled mutex is not idle, and it is a fatal error for an unbubbled goroutine to unlock a bubbled mutex. However, mutexes are generally not well-suited for cases where the lock operation blocks for an extended period of time. Mutex locks spin before parking, making acquisition inefficient in this case. There is no way to interrupt a blocked lock operation on timeout or cancelation. I hypothesize that there is little to no benefit to taking on more complexity in the mutex implementation to support synctest, and so the current proposal just assumes goroutines blocked on mutex acquisition are not durably idle.

sync.Cond is a different case: Condition variables are intended to be waited on for longer periods of time. A Cond is explicitly constructed with NewCond, so we could choose to associate Conds with the bubble (if any) they are constructed in. For simplicity of implementation, I've instead implemented this as a prohibition against an unbubbled goroutine waking a bubbled goroutine from Cond.Wait.

One final implementation detail: For simplicity and efficiency, the current CL does not attempt to track which bubble a channel is associated with. This means we will not detect cases where a channel created in one bubble is used in a different one. I don't think this situation is likely to come up often.

Proposal committee: Do these changes seem acceptable?
One big difference is that channels are explicitly created, but mutexes are not. We can't distinguish mutexes created within a bubble from ones created outside one. We could, as I describe above, distinguish mutexes locked by a bubble. There is no association between a mutex and the goroutine that locked it, but we could set a bit on a mutex that was locked by a bubbled goroutine. This would be sufficient to make mutexes "just work" with synctest.

The argument against is that this is more complexity in mutexes, which are already complicated and also highly performance critical. I don't think it's a lot of complexity, but I'm also not certain that it's necessary.

Proposal committee, do you have opinions?
I suspect that this will be OK in practice, though there is a small worry in the back of my head about lazy initialization of a global channel occurring inside the bubble and thus panicking when used outside. A theoretical example of this would be a sync.Once to initialize a channel inside of some opaque API, but I don't have concrete examples.
Edit: FWIW, I don't understand why this function is implemented this way. The select implementation ends up using ...
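A sketch of the lazy-initialization worry mentioned above; the names are hypothetical, and the behavior described in the comments assumes the proposed all-bubbled-channels restriction:

```go
package example

import "sync"

var (
	initOnce sync.Once
	readyCh  chan struct{}
)

// ready lazily creates a package-level channel. If the first call happens to
// come from a goroutine inside a synctest bubble, readyCh is created inside
// that bubble, and a later use from outside the bubble would panic under the
// proposed rules.
func ready() chan struct{} {
	initOnce.Do(func() {
		readyCh = make(chan struct{})
	})
	return readyCh
}
```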
@rsc suggested above in #67434 (comment) that we restrict the panic behavior to timer channels. That would eliminate @prattmic's concern. The updated proposal/CL applies the restriction to all bubbled channels, regardless of timers or data type. While the new update does solve the issue about buffering and (mostly) the issue about global channels, is there a good reason to apply it to all channels instead of just timer channels? Or would timer channels suffice for the intended testing cases?
"Just timers" does seem better if we can make it work. Someone can implement what is semantically a mutex using (a) sync.Mutex, (b) channels, or (c) sync.Cond. |
To recap, the proposed panic behavior is: If a non-bubbled goroutine operates on a bubbled channel, panic. The rationale for having a distinction between bubbled and unbubbled channels is to allow a bubbled goroutine to access channel-synchronized resources from outside its bubble. For example, let's imagine a simple case where a channel is being used as a lock:
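The original snippet is not shown above; a minimal sketch of the pattern, using the lockChan and Get names from the surrounding text (everything else is assumed), might look like:

```go
package example

var (
	lockChan = make(chan struct{}, 1) // buffered channel used as a lock
	shared   int                      // state guarded by lockChan
)

// Get acquires the lock by sending on lockChan, reads the shared state,
// and releases the lock by receiving.
func Get() int {
	lockChan <- struct{}{} // blocks while another goroutine holds the lock
	v := shared
	<-lockChan
	return v
}
```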
If we didn't have a distinction between bubbled and unbubbled channels, then a bubbled goroutine calling Get while the lock is held from outside the bubble would run into problems:
Get blocks writing to lockChan, the only goroutine in the bubble is now idle, and Run panics because all goroutines are deadlocked. Making a distinction between bubbled and unbubbled channels means that instead, when Get blocks on lockChan, synctest can recognize that it is blocked on something outside the bubble and not actually idle.

If lockChan is lazily created, however, it might be inadvertently created within a synctest bubble. Now we fall back to the previous behavior: Some goroutine outside the bubble acquires lockChan, Get blocks on lockChan, lockChan is in the bubble, Run panics. But things are much more confusing than before, because the behavior depends on when lockChan was created. To avoid this confusion, we panic when the unbubbled goroutine writes to lockChan. An unbubbled goroutine accessing a bubbled channel indicates that something has gone wrong. The fix to the problem will probably be to ensure any lazy initialization happens outside the bubble.

I think that if we distinguish between bubbled and unbubbled chans, then we need to prevent unbubbled goroutines from accessing bubbled chans to avoid confusion. If we don't distinguish between bubbled and unbubbled chans, then the overall model is simpler, but bubbled goroutines can't access global resources synchronized by a channel, which is unfortunate.
I've greatly enjoyed using the preview of this addition. It's been very useful in my work. However, now that synctest is internal, it's become challenging to test and use outside the standard library. Would it be possible to make it accessible through a less restrictive means than a stdlib internal package? Perhaps a GOEXPERIMENT flag could work? The functionality is valuable enough that I'd consider vendoring it if it weren't so tightly coupled with Go's internals.

This is such an awesome new addition. I'm eager to keep using it, even in an experimental state. :)
Out of curiosity, I tried implementing the sync.Mutex semantics I described above. To be clear, I am not currently proposing we do this. This is just an experiment to see how intrusive the changes to sync.Mutex might be. To recap, this adds the following rules:
This essentially means that a mutex used within a bubble counts for idleness detection, but a global mutex shared with goroutines outside the bubble (such as the reflect type cache mutexes) does not.

The changes required to sync.Mutex are not huge, but are not entirely trivial either. I'm still not convinced the value of mutex support in synctest is worth changing such a performance-critical type.
This is a proposal for a new package to aid in testing concurrent code.
This package has two main features:

It allows a test to run concurrent code under a fake clock, so timers fire without real-world delays and time-dependent behavior can be tested quickly and deterministically.
It allows a test to wait until all goroutines in a group are idle, providing a well-defined point at which to observe the effects of background work.
As an example, let us say we are testing an expiring concurrent cache:
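The cache snippet is not included above; here is a minimal sketch of the kind of cache being described, using the NewCache and Get names that appear in examples later in this thread (the details are assumptions):

```go
package cache

import (
	"sync"
	"time"
)

// Cache is an expiring cache: entries are removed after they expire and
// recomputed by f on the next access.
type Cache struct {
	mu     sync.Mutex
	expiry time.Duration
	f      func(key string) string
	items  map[string]string
}

// NewCache returns a cache whose entries expire after the given duration.
// f computes the value for a key that is missing or has expired.
func NewCache(expiry time.Duration, f func(key string) string) *Cache {
	return &Cache{
		expiry: expiry,
		f:      f,
		items:  make(map[string]string),
	}
}

// Get returns the cached value for key, computing it with f if necessary.
func (c *Cache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.items[key]; ok {
		return v
	}
	v := c.f(key)
	c.items[key] = v
	// Expire the entry with a timer. This background work against the
	// real clock is what makes the cache awkward to test.
	time.AfterFunc(c.expiry, func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		delete(c.items, key)
	})
	return v
}
```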
A naive test for this cache might look something like this:
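The naive test is also not reproduced above; the following is a sketch reconstructed from the description that follows and from the timecop variant shown later in this thread:

```go
package cache

import (
	"fmt"
	"testing"
	"time"
)

func TestCacheEntryExpires(t *testing.T) {
	count := 0
	c := NewCache(2*time.Second, func(key string) string {
		count++
		return fmt.Sprintf("%v:%v", key, count)
	})

	// Get an entry from the cache.
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Verify that we get the same entry when accessing it before the expiry.
	time.Sleep(1 * time.Second)
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Wait for the entry to expire and verify that we now get a new one.
	time.Sleep(3 * time.Second)
	if got, want := c.Get("k"), "k:2"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}
}
```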
This test has a couple problems. It's slow, taking four seconds to execute. And it's flaky, because it assumes the cache entry will not have expired one second before its deadline and will have expired one second after. While computers are fast, it is not uncommon for an overloaded CI system to pause execution of a program for longer than a second.
We can make the test less flaky by making it slower, or we can make the test faster at the expense of making it flakier, but we can't make it fast and reliable using this approach.
We can design our Cache type to be more testable. We can inject a fake clock to give us control over time in tests. When advancing the fake clock, we will need some mechanism to ensure that any timers that fire have executed before progressing the test. These changes come at the expense of additional code complexity: We can no longer use time.Timer, but must use a testable wrapper. Background goroutines need additional synchronization points.
The synctest package simplifies all of this. Using synctest, we can write:
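The synctest version of the test is likewise not reproduced above; per the description below, it is the naive test wrapped in synctest.Run with two synctest.Wait calls added, so a sketch (assuming the proposed testing/synctest API) would be:

```go
package cache

import (
	"fmt"
	"testing"
	"testing/synctest" // proposed package path
	"time"
)

func TestCacheEntryExpires(t *testing.T) {
	synctest.Run(func() {
		count := 0
		c := NewCache(2*time.Second, func(key string) string {
			count++
			return fmt.Sprintf("%v:%v", key, count)
		})

		// Get an entry from the cache.
		if got, want := c.Get("k"), "k:1"; got != want {
			t.Errorf("c.Get(k) = %q, want %q", got, want)
		}

		// Verify that we get the same entry when accessing it before the expiry.
		time.Sleep(1 * time.Second)
		synctest.Wait()
		if got, want := c.Get("k"), "k:1"; got != want {
			t.Errorf("c.Get(k) = %q, want %q", got, want)
		}

		// Wait for the entry to expire and verify that we now get a new one.
		time.Sleep(3 * time.Second)
		synctest.Wait()
		if got, want := c.Get("k"), "k:2"; got != want {
			t.Errorf("c.Get(k) = %q, want %q", got, want)
		}
	})
}
```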
This is identical to the naive test above, wrapped in synctest.Run and with the addition of two calls to synctest.Wait. However:

The test is fast: because the group uses a fake clock, the time.Sleep calls complete as soon as every goroutine in the group is idle, so the test no longer takes four real seconds.
The test is reliable: the synctest.Wait calls ensure that any background work triggered by the clock (such as entry expiry) has finished before the cache is checked, so the test no longer depends on the speed of the machine running it.
A limitation of the synctest.Wait function is that it does not recognize goroutines blocked on network or other I/O operations as idle. While the scheduler can identify a goroutine blocked on I/O, it cannot distinguish between a goroutine that is genuinely blocked and one which is about to receive data from a kernel network buffer. For example, if a test creates a loopback TCP connection, starts a goroutine reading from one side of the connection, and then writes to the other, the read goroutine may remain in I/O wait for a brief time before the kernel indicates that the connection has become readable. If synctest.Wait considered a goroutine in I/O wait to be idle, this would cause nondeterminism in cases such as this, so goroutines blocked on I/O are never treated as idle.
Tests which use synctest with network connections or other external data sources should use a fake implementation with deterministic behavior. For net.Conn, net.Pipe can create a suitable in-memory connection.
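As an illustration of that advice, here is a sketch, assuming the proposed API and that net.Pipe deadlines are driven by time.AfterFunc and therefore by the group's fake clock, of a read-deadline test that completes without real waiting:

```go
package example

import (
	"net"
	"testing"
	"testing/synctest" // proposed package path
	"time"
)

func TestReadDeadline(t *testing.T) {
	synctest.Run(func() {
		client, server := net.Pipe()
		defer client.Close()
		defer server.Close()

		client.SetReadDeadline(time.Now().Add(5 * time.Second))

		errc := make(chan error, 1)
		go func() {
			_, err := client.Read(make([]byte, 1))
			errc <- err
		}()

		// Every goroutine in the group is blocked on channels, so the fake
		// clock advances past the deadline and the Read fails.
		if err := <-errc; err == nil {
			t.Error("Read succeeded, want deadline error")
		}
	})
}
```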
This proposal is based in part on experience with tests in the golang.org/x/net/http2 package. Tests of an HTTP client or server often involve multiple interacting goroutines and timers. For example, a client request may involve goroutines writing to the server, reading from the server, and reading from the request body; as well as timers covering various stages of the request process. The combination of fake clocks and an operation which waits for all goroutines in the test to stabilize has proven effective.