Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: express lock rank graph as a DAG like go/build/deps_test.go #53789

Closed
aclements opened this issue Jul 11, 2022 · 11 comments
Closed

runtime: express lock rank graph as a DAG like go/build/deps_test.go #53789

aclements opened this issue Jul 11, 2022 · 11 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@aclements
Copy link
Member

aclements commented Jul 11, 2022

The runtime's lock rank graph currently requires a full manual transitive closure of the lock rank graph. This makes it annoying to maintain and some of the edge lists are very long.

We should replace this with something like the DAG language in go/build/deps_test.go. In particular, the runtime's lock ranking falls into a few natural strata, which are very natural to express in the deps DAG, and extremely verbose in the current approach. This would also significantly improve the readability of diffs to the graph (see, for example, 482669d). Finally, it would let us auto-generate the total order, rather than trying to keep them consistent by hand.

We could parse this at runtime startup (only if lock rank checking is enabled anyway), though the current totally static structure has the advantage of being available for lock checking from the moment the process starts.

It may be better to go:generate the static ranking structure from the DAG language. This would keep the structure available immediately at startup, and would also make it easy to share the parser with deps_test.go without trying to fit it into the constraints of early runtime initialization.

@aclements aclements added this to the Backlog milestone Jul 11, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 11, 2022
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 11, 2022
@aclements aclements self-assigned this Jul 12, 2022
@aclements
Copy link
Member Author

aclements commented Jul 19, 2022

I need to clean this up by hand, but here's the current lock graph automatically converted to deps language and reduced:

NONE < dummy, sysmon, sweepWaiters, assistQueue, cpuprof, sweep, pollDesc, deadlock, itab, notifyList, root, rwmutexW, defer, newmHandoff, debugPtrmask, faketimeState, ticks, raceFini, pollCache, debug;
sysmon < scavenge, forcegc;
assistQueue, cpuprof, forcegc, pollDesc, scavenge, sweep, sweepWaiters < sched;
sched < allg, allp, gFree;
allp < timers;
itab < reflectOffs;
scavenge, sweep < hchan;
scavenge < traceBuf;
allg, hchan, reflectOffs, timers, traceBuf < fin;
traceBuf < traceStrings;
allg, hchan, notifyList, reflectOffs, timers, traceStrings < mspanSpecial, profInsert, profBlock, profMemActive, gcBitsArenas, spanSetSpine, mheapSpecial;
profMemActive < profMemFuture;
hchan, root, sched, traceStrings < trace;
fin, notifyList, trace < traceStackTab;
timers < netpollInit;
rwmutexW, sysmon < rwmutexR;
gcBitsArenas, netpollInit, profBlock, profInsert, profMemFuture, spanSetSpine, traceStackTab < gscan;
gscan, rwmutexR < stackpool;
gscan < stackLarge, hchanLeaf;
hchan, notifyList < sudog;
defer, gscan, mspanSpecial, sudog < wbufSpans;
stackLarge, stackpool, wbufSpans < mheap;
mheap, mheapSpecial < globalAlloc;
mheap < pageAllocScav;
deadlock < panic;

To the surprise of no-one, the current lock rank graph is not transitively closed, meaning there are locks for which it's okay to acquire a then b or b then c, but not a then c. The new deps-based approach automatically constructs the full reachability matrix, which will fix this, too.

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418716 mentions this issue: runtime: add missing trace lock edges

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418717 mentions this issue: runtime: add mayAcquire annotation for finlock

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418714 mentions this issue: runtime: delete unused lock ranks

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418715 mentions this issue: runtime: generate the lock ranking from a DAG description

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418593 mentions this issue: internal/dag: add a Graph type and make node order deterministic

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418718 mentions this issue: runtime: make the lock rank DAG make more sense

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418719 mentions this issue: runtime: clean up panic and deadlock lock ranks

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418720 mentions this issue: runtime: add mayAcquire annotation for trace.lock

@gopherbot
Copy link

gopherbot commented Jul 21, 2022

Change https://go.dev/cl/418592 mentions this issue: go/build, internal/dag: lift DAG parser into an internal package

@aclements
Copy link
Member Author

aclements commented Jul 21, 2022

Now that it's in the DAG language, I was able to generally rationalize the lock graph, in part because it was now easy to generate a visual graph, and group it by subsystem. This rationalization led me to find a few genuinely missing edges, including one that detected a previously unknown rank violation and true lock cycle (#53979).

The graph is still pretty complicated. Now that it's easier to think about, we could consider breaking some of these long paths. In particular, I think we should break the TRACE -> STACKGROW path so that tracing is at the leaf of the lock graph, and possibly the WB -> mheap path, which has caused us issues in the past.

x

Here's what it looked like prior to cleaning up:

x

gopherbot pushed a commit that referenced this issue Aug 4, 2022
This lifts the DAG parser from the go/build dependencies test into its
own package that can be reused elsewhere.

I tried to keep the code as close as possible. I changed some names to
reflect the more general purpose of internal/dag. Most of the changes
are related to error handling, since internal/dag doesn't take a
testing.T on which to report errors. Notably, parseRules now returns a
slice of parsed rules rather than calling a callback because this made
it easier to separate fatal parsing errors from non-fatal graph
checking errors.

For #53789.

Change-Id: I170b84fd85f971cfc1a50972156d48e78b45fce3
Reviewed-on: https://go-review.googlesource.com/c/go/+/418592
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
The go/types package doesn't care about node ordering because it's
just querying paths in the graph, but we're about to use this for the
runtime lock graph, and there we want determinism.

For #53789.

Change-Id: Ic41329bf2eb9a3a202f97c21c761ea588ca551c8
Reviewed-on: https://go-review.googlesource.com/c/go/+/418593
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
For #53789.

Change-Id: Ic7379afcfdcc47b541bac9b44b5bc6b43604fc0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/418714
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
We're missing lock edges to trace.lock that happen only rarely. Any
trace event can potentially fill up a trace buffer and acquire
trace.lock in order to flush the buffer, but this happens relatively
rarely, so we simply haven't seen some of these lock edges that could
happen.

With this change, we promote "fin, notifyList < traceStackTab" to
"fin, notifyList < trace" and now everything that emits trace events
with a P enters the tracer lock ranks via "trace", rather than some
things entering at "trace" and others at "traceStackTab".

This was found by inspecting the rank graph for things that didn't
make sense.

Ideally we would add a mayAcquire annotation that any trace event can
potentially acquire trace.lock, but there are actually cases that
violate this ranking right now. This is #53979. The chance of a lock
cycle is extremely low given the number of conditions that have to
happen simultaneously.

For #53789.

Change-Id: Ic65947d27dee88d2daf639b21b2c9d37552f0ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418716
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
We're missing lock edges to finlock that happen only rarely. Anything
that calls mallocgc can potentially trigger sweeping, which can
potentially queue a finalizer, which acquires finlock. While this can
happen on any malloc, it happens relatively rarely, so we simply
haven't seen some of the lock edges that could happen.

Add a mayAcquire annotation to mallocgc to capture the possibility of
acquiring finlock.

With this change, we add "fin" to the set of "malloc" locks. Several
of these edges were already there, but not quite all of them.

This was found by inspecting the rank graph for things that didn't
make sense.

For #53789.

Change-Id: Idc10ce6f250596b0c07ba07ac93f2198fb38c22b
Reviewed-on: https://go-review.googlesource.com/c/go/+/418717
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
This groups, comments, and generally reorganizes the lock rank graph
description by subsystem. It also introduces several pseudo-nodes that
more cleanly describe the inherent layering of lock ranks by
subsystem.

I believe this doesn't actually change the graph, but haven't verified
this.

For #53789.

Change-Id: I72f332f5a23b8217c7dc1b21411631ad48cee4b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418718
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Aug 4, 2022
I'm not entirely sure why these locks are currently ranked "deadlock <
panic" since we drop panic before acquiring deadlock, and we actually
want deadlock to be below panic because panic is implicitly below
everything else and we want deadlock to be, too. My best guess is that
we had this edge because we intentionally acquire deadlock twice to
deadlock, and that causes the lock rank checking to panic on the
second acquire.

Fix this in a more sensible way by capturing that deadlock can be
acquired in a self-cycle and flipping the rank to "panic < deadlock"
to express that deadlock needs to be under all other locks, just like
panic.

For #53789.

Change-Id: I8809e5d102ce473bd3ace0ba07bf2200ef60263f
Reviewed-on: https://go-review.googlesource.com/c/go/+/418719
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
This lifts the DAG parser from the go/build dependencies test into its
own package that can be reused elsewhere.

I tried to keep the code as close as possible. I changed some names to
reflect the more general purpose of internal/dag. Most of the changes
are related to error handling, since internal/dag doesn't take a
testing.T on which to report errors. Notably, parseRules now returns a
slice of parsed rules rather than calling a callback because this made
it easier to separate fatal parsing errors from non-fatal graph
checking errors.

For golang#53789.

Change-Id: I170b84fd85f971cfc1a50972156d48e78b45fce3
Reviewed-on: https://go-review.googlesource.com/c/go/+/418592
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
The go/types package doesn't care about node ordering because it's
just querying paths in the graph, but we're about to use this for the
runtime lock graph, and there we want determinism.

For golang#53789.

Change-Id: Ic41329bf2eb9a3a202f97c21c761ea588ca551c8
Reviewed-on: https://go-review.googlesource.com/c/go/+/418593
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
For golang#53789.

Change-Id: Ic7379afcfdcc47b541bac9b44b5bc6b43604fc0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/418714
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
Currently, the runtime lock rank graph is maintained manually in a
large set of arrays that give the partial order and a manual
topological sort of this partial order. Any changes to the rank graph
are difficult to reason about and hard to review, as well as likely to
cause merge conflicts. Furthermore, because the partial order is
manually maintained, it's not actually transitively closed (though
it's close), meaning there are many cases where rank a can be acquired
before b and b before c, but a cannot be acquired before c. While this
isn't technically wrong, it's very strange in the context of lock
ordering.

Replace all of this with a much more compact, readable, and
maintainable description of the rank graph written in the internal/dag
graph language. We statically generate the runtime structures from
this description, which has the advantage that the parser doesn't have
to run during runtime initialization and the structures can live in
static data where they can be accessed from any point during runtime
init.

The current description was automatically generated from the existing
partial order, combined with a transitive reduction. This ensures it's
correct, but it could use some manual messaging to call out the
logical layers and add some structure.

We do lose the ad hoc string names of the lock ranks in this
translation, which could mostly be derived from the rank constant
names, but not always. I may bring those back but in a more uniform
way.

We no longer need the tests in lockrank_test.go because they were
checking that we manually maintained the structures correctly.

Fixes golang#53789.

Change-Id: I54451d561b22e61150aff7e9b8602ba9737e1b9b
Reviewed-on: https://go-review.googlesource.com/c/go/+/418715
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
We're missing lock edges to trace.lock that happen only rarely. Any
trace event can potentially fill up a trace buffer and acquire
trace.lock in order to flush the buffer, but this happens relatively
rarely, so we simply haven't seen some of these lock edges that could
happen.

With this change, we promote "fin, notifyList < traceStackTab" to
"fin, notifyList < trace" and now everything that emits trace events
with a P enters the tracer lock ranks via "trace", rather than some
things entering at "trace" and others at "traceStackTab".

This was found by inspecting the rank graph for things that didn't
make sense.

Ideally we would add a mayAcquire annotation that any trace event can
potentially acquire trace.lock, but there are actually cases that
violate this ranking right now. This is golang#53979. The chance of a lock
cycle is extremely low given the number of conditions that have to
happen simultaneously.

For golang#53789.

Change-Id: Ic65947d27dee88d2daf639b21b2c9d37552f0ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418716
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
We're missing lock edges to finlock that happen only rarely. Anything
that calls mallocgc can potentially trigger sweeping, which can
potentially queue a finalizer, which acquires finlock. While this can
happen on any malloc, it happens relatively rarely, so we simply
haven't seen some of the lock edges that could happen.

Add a mayAcquire annotation to mallocgc to capture the possibility of
acquiring finlock.

With this change, we add "fin" to the set of "malloc" locks. Several
of these edges were already there, but not quite all of them.

This was found by inspecting the rank graph for things that didn't
make sense.

For golang#53789.

Change-Id: Idc10ce6f250596b0c07ba07ac93f2198fb38c22b
Reviewed-on: https://go-review.googlesource.com/c/go/+/418717
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
This groups, comments, and generally reorganizes the lock rank graph
description by subsystem. It also introduces several pseudo-nodes that
more cleanly describe the inherent layering of lock ranks by
subsystem.

I believe this doesn't actually change the graph, but haven't verified
this.

For golang#53789.

Change-Id: I72f332f5a23b8217c7dc1b21411631ad48cee4b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/418718
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
I'm not entirely sure why these locks are currently ranked "deadlock <
panic" since we drop panic before acquiring deadlock, and we actually
want deadlock to be below panic because panic is implicitly below
everything else and we want deadlock to be, too. My best guess is that
we had this edge because we intentionally acquire deadlock twice to
deadlock, and that causes the lock rank checking to panic on the
second acquire.

Fix this in a more sensible way by capturing that deadlock can be
acquired in a self-cycle and flipping the rank to "panic < deadlock"
to express that deadlock needs to be under all other locks, just like
panic.

For golang#53789.

Change-Id: I8809e5d102ce473bd3ace0ba07bf2200ef60263f
Reviewed-on: https://go-review.googlesource.com/c/go/+/418719
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Aug 11, 2022
Now that we've moved the trace locks to the leaf of the lock graph, we
can safely annotate that any trace event may acquire trace.lock even
if dynamically it turns out a particular event doesn't need to flush
and acquire this lock.

This reveals a new edge where we can trace while holding the mheap
lock, so we add this to the lock graph.

For #53789.
Updates #53979.

Change-Id: I13e2f6cd1b621cca4bed0cc13ef12e64d05c89a7
Reviewed-on: https://go-review.googlesource.com/c/go/+/418720
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Done
Development

No branches or pull requests

3 participants