Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark workunits blocked, and skip rendering completed workunits #12369

Merged
merged 11 commits into from Jul 19, 2021

Conversation

stuhood
Copy link
Sponsor Member

@stuhood stuhood commented Jul 16, 2021

Our "graph" of execution is a DAG, while the workunits (used to visualize and record traces) form a tree. Because they form a tree, workunits do not always report that they are blocked on work that they didn't start, but which some other node started (common when lots of @rules are waiting on another single @rule which only one of them started).

To fix this, we make the blocked property of a workunit an atomic mutable, and skip rendering the parents of blocked leaves. We use the blocked flag both for Tasks (which wait directly for memoized Nodes, and so frequently block in this way), and in BoundedCommandRunner, which temporarily blocks the workunit while we're acquiring the semaphore. Additionally, we skip rendering or walking through Completed workunits, which can happen in the case of speculation if a parent workunit completes before a child.

In order to toggle the blocked property on workunits, we expose the current RunningWorkunit in two new places: the CommandRunner and WrappedNode. In both cases, this is to allow the generic code to consume the workunit created by their callers and mark it blocked (for Task and BoundedCommandRunner).

Fixes #12349.

…ate workunit for eager fetching.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
…rkunits in case of spawned work (like cache writes) where a child is still running below a completed parent.

[ci skip-build-wheels]
…andRunner can mark its parent workunit blocked.

[ci skip-build-wheels]
…e BoundedCommandRunner's.

[ci skip-build-wheels]
@stuhood stuhood marked this pull request as ready for review July 17, 2021 23:29
Comment on lines +71 to +82
///
/// Workunits form a tree of running, blocked, and completed work, with parent ids propagated via
/// thread-local state.
///
/// While running (the Started state), a copy of a Workunit is generally kept on the stack by the
/// `in_workunit!` macro, while another copy of the same Workunit is recorded in the WorkunitStore.
/// Most of the fields of the Workunit are immutable, but an atomic "blocked" flag can be set to
/// temporarily mark the running Workunit as being in a blocked state.
///
/// When the `in_workunit!` macro exits, the Workunit on the stack is completed by storing any
/// local mutated values as the final value of the Workunit.
///
Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: ^

src/rust/engine/process_execution/src/cache.rs Outdated Show resolved Hide resolved
src/rust/engine/process_execution/src/remote.rs Outdated Show resolved Hide resolved
src/rust/engine/process_execution/src/remote.rs Outdated Show resolved Hide resolved
src/rust/engine/process_execution/src/remote.rs Outdated Show resolved Hide resolved
src/rust/engine/workunit_store/src/lib.rs Show resolved Hide resolved
src/rust/engine/process_execution/src/cache.rs Outdated Show resolved Hide resolved
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Stu! Should this be picked to 2.6?

@@ -508,13 +525,19 @@ pub struct HeavyHittersInnerStore {
fn first_matched_parent(
workunit_records: &HashMap<SpanId, Workunit>,
mut span_id: Option<SpanId>,
is_terminal: impl Fn(&Workunit) -> bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does terminal mean here? Imo it's ambiguous if it's something like "final/complete" vs. "the CLI/terminal"

@stuhood stuhood added this to the 2.5.x milestone Jul 19, 2021
… fixes `Starting` messages.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the logging of cache hits!

@stuhood stuhood merged commit debdf4f into pantsbuild:main Jul 19, 2021
@stuhood stuhood deleted the stuhood/blocked-workunits branch July 19, 2021 23:10
stuhood added a commit to stuhood/pants that referenced this pull request Jul 19, 2021
…build#12369)

Our "graph" of execution is a DAG, while the workunits (used to visualize and record traces) form a tree. Because they form a tree, workunits do not always report that they are blocked on work that they didn't start, but which some other node started (common when lots of @rules are waiting on another single @rule which only one of them started).

To fix this, we make the `blocked` property of a workunit an atomic mutable, and skip rendering the parents of blocked leaves. We use the `blocked` flag both for `Tasks` (which wait directly for memoized Nodes, and so frequently block in this way), and in `BoundedCommandRunner`, which temporarily blocks the workunit while we're acquiring the semaphore. Additionally, we skip rendering or walking through `Completed` workunits, which can happen in the case of speculation if a parent workunit completes before a child.

In order to toggle the `blocked` property on workunits, we expose the current `RunningWorkunit` in two new places: the `CommandRunner` and `WrappedNode`. In both cases, this is to allow the generic code to consume the workunit created by their callers and mark it blocked (for `Task` and `BoundedCommandRunner`).

Fixes pantsbuild#12349.

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Jul 20, 2021
… (#12376)

Our "graph" of execution is a DAG, while the workunits (used to visualize and record traces) form a tree. Because they form a tree, workunits do not always report that they are blocked on work that they didn't start, but which some other node started (common when lots of @rules are waiting on another single @rule which only one of them started).

To fix this, we make the `blocked` property of a workunit an atomic mutable, and skip rendering the parents of blocked leaves. We use the `blocked` flag both for `Tasks` (which wait directly for memoized Nodes, and so frequently block in this way), and in `BoundedCommandRunner`, which temporarily blocks the workunit while we're acquiring the semaphore. Additionally, we skip rendering or walking through `Completed` workunits, which can happen in the case of speculation if a parent workunit completes before a child.

In order to toggle the `blocked` property on workunits, we expose the current `RunningWorkunit` in two new places: the `CommandRunner` and `WrappedNode`. In both cases, this is to allow the generic code to consume the workunit created by their callers and mark it blocked (for `Task` and `BoundedCommandRunner`).

Fixes #12349.

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Sep 21, 2021
#12369 adjusted the workunit graph to have the `BoundedCommandRunner` mark (what it thought was) its parent workunit as blocking while waiting to acquire a slot on the semaphore. But when #12748 fixed rendering of parent workunits, we experienced a regression in rendering with remote caching enabled: "Scheduling: ..." workunits were rendered when a process was blocked.

#12369 contained a bug: the workunit being marked blocked by the `BoundedCommandRunner` was not always it's direct parent: in particular, under remote caching the workunit being marked blocking was in fact its grandparent. Marking that workunit blocked had no effect, because its child (the parent of the semaphore acquisition) would still cause it to render.

To fix that, we move back to directly creating a workunit for `BoundedCommandRunner` semaphore acquisition, rather than marking the inbound workunit blocked. This also has the benefit of recording how long processes waited to acquire slots.

This bug is to some degree an indictment of explicitly passing workunits to improve clarity... but on the other hand, it also seems to more strongly encourage operating on workunits that you have created, and which are living on your stack.
stuhood added a commit to stuhood/pants that referenced this pull request Sep 21, 2021
…ild#12973)

pantsbuild#12369 adjusted the workunit graph to have the `BoundedCommandRunner` mark (what it thought was) its parent workunit as blocking while waiting to acquire a slot on the semaphore. But when pantsbuild#12748 fixed rendering of parent workunits, we experienced a regression in rendering with remote caching enabled: "Scheduling: ..." workunits were rendered when a process was blocked.

pantsbuild#12369 contained a bug: the workunit being marked blocked by the `BoundedCommandRunner` was not always it's direct parent: in particular, under remote caching the workunit being marked blocking was in fact its grandparent. Marking that workunit blocked had no effect, because its child (the parent of the semaphore acquisition) would still cause it to render.

To fix that, we move back to directly creating a workunit for `BoundedCommandRunner` semaphore acquisition, rather than marking the inbound workunit blocked. This also has the benefit of recording how long processes waited to acquire slots.

This bug is to some degree an indictment of explicitly passing workunits to improve clarity... but on the other hand, it also seems to more strongly encourage operating on workunits that you have created, and which are living on your stack.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Sep 21, 2021
…ick of #12973) (#12975)

#12369 adjusted the workunit graph to have the `BoundedCommandRunner` mark (what it thought was) its parent workunit as blocking while waiting to acquire a slot on the semaphore. But when #12748 fixed rendering of parent workunits, we experienced a regression in rendering with remote caching enabled: "Scheduling: ..." workunits were rendered when a process was blocked.

#12369 contained a bug: the workunit being marked blocked by the `BoundedCommandRunner` was not always it's direct parent: in particular, under remote caching the workunit being marked blocking was in fact its grandparent. Marking that workunit blocked had no effect, because its child (the parent of the semaphore acquisition) would still cause it to render.

To fix that, we move back to directly creating a workunit for `BoundedCommandRunner` semaphore acquisition, rather than marking the inbound workunit blocked. This also has the benefit of recording how long processes waited to acquire slots.

This bug is to some degree an indictment of explicitly passing workunits to improve clarity... but on the other hand, it also seems to more strongly encourage operating on workunits that you have created, and which are living on your stack.

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

With --no-dynamic-ui and caching enabled, ghost tasks rendered in long-running-tasks
4 participants