Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: deterministic multithreading #21916

Closed
glycerine opened this issue Sep 18, 2017 · 3 comments
Closed

proposal: deterministic multithreading #21916

glycerine opened this issue Sep 18, 2017 · 3 comments

Comments

@glycerine
Copy link

Various dissertations have proposed scheduling algorithms that provide deterministic multithreading. For example, this one claims it gives up only 16% on a four core benchmark. In exchange it provides deterministic execution. This could provide drastically faster time to program delivery in many cases. Further, it would dramatically help the reproduction of concurrency bugs if a "trace mode" could then be built that would trace a non-deterministic execution for later reproduction under deterministic scheduling.

http://groups.csail.mit.edu/commit/papers/09/asplos073-olszewski.pdf

The proposal would be to add a deterministic interleaving mode to the goroutine scheduler.

@gopherbot gopherbot added this to the Proposal milestone Sep 18, 2017
@rsc
Copy link
Contributor

rsc commented Sep 25, 2017

This is a fascinating idea but ultimately not something that's ready yet for the proposal process. There is likely many person-years of work remaining to integrate something like this into Go.

One problem that applies to most of these systems is that deterministic execution, or following-a-trace deterministic execution, is only useful if you have exactly the same program. As soon as you attempt a bug fix, the program has changed and may (deterministically) go down a completely different path. Worse, it will do so repeatedly, so you can't even run it multiple times to check whether the bug is really fixed. Most programs run different code paths on different inputs, so even needing to try a different input might trigger this loss of repeatability, despite the "determinism".

(This observation is not mine; it was first pointed out to me by @aclements, although maybe I mangled it in the retelling above.)

In addition to that problem, which I think is the most significant one, these systems also usually fail to cope with non-determinism from timing of network inputs or I/O, or from explicit randomization of address space operations in the kernel (which in turn affects code paths in address-value-sensitive code like the garbage collector).

None of this is to say that it's not an interesting idea that we'd love to see someone run with, but it would be a major, speculative undertaking. It's not something we on the Go team are likely to attempt ourselves without a clearer demonstration of applicability to Go, but if someone else wants to explore further how to productionize something like this into Go, the problems you encounter and their solutions would probably make a great paper or maybe the start of a great PhD thesis.

@rsc rsc closed this as completed Sep 25, 2017
@LarryRuane
Copy link

This is more difficult than just implementing deterministic scheduling -- all sources of nondeterminism must be eliminated. I worked on a complex distributed block-level storage product (written in C) that supported this kind of pseudo-random testing. We had to simulate (mock) all interactions with the outside world, including networking, file access and, most importantly, time itself. The network and files were simulated using memory buffers. The entire simulated cluster of nodes ran within a single executable (one address space, one scheduler), which was single-threaded (we didn't use threads since we had no deterministic thread scheduler, among other reasons). Each fake network message and file IO access would appear to take a (pseudo) random variable amount of time. The scheduler (which was specific to this test mode) would use a priority queue to determine at what time the next event (network message arrival, whatever) occurs, and then simply advance the simulated clock to that time. So simulation time usually moved faster than real time!

Every failure was guaranteed to be reproducible. It was amazing. We ran this test (with different random number generator starting seeds) on racks of systems 24 hours a day. It found many unbelievably subtle timing-window bugs.

One enormous benefit of doing all this is that, when a failure is found, you can enable (or add) arbitrary amounts of tracing then re-run that seed. You can also add massive amounts of checking / audit code and possibly catch the bug earlier. These changes have no effect on simulated time. (I'm sure you've all seen the Heisenberg Uncertainty Principle at work when you try to reproduce bugs on a real system with tracing enabled.)

Your comment, rsc, about not knowing if you've fixed the bug is very astute. If necessary to get around this problem, I would actually "time-bomb" the fix, such that it would only kick in just at the (simulated) moment the wrong decision was made (I knew when that was from tracing output), so the run would reproduce to that exact state, and then I could see it continue from that point okay (or not okay if the fix didn't work or had some bad side-effect).

Since becoming interested in Go, I've thought about this very idea, but I agree that it's a huge project, plus it takes significant forethought when designing the code. It would be great, though.

@glycerine
Copy link
Author

@LarryRuane I've implemented an interactive Go REPL (https://github.com/gijit/gi) based on a single-threaded just-in-time trace compiler, LuaJIT. gijit currently supports everything in Go except goroutines/select/channels. I'll be adding those next, and may well opt for a deterministic scheduler. You can contribute (your input is most welcome) or simply follow progress on issue gijit/gi#2

@golang golang locked and limited conversation to collaborators Feb 8, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants