Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dropped output race during UI teardown #14093

Merged
merged 3 commits into from Jan 6, 2022

Conversation

stuhood
Copy link
Sponsor Member

@stuhood stuhood commented Jan 6, 2022

As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid.

To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous Mutex directly around the stderr destination.

Fixes #13276 and #11626.

[ci skip-build-wheels]

Copy link
Sponsor Member Author

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commits are useful to review independently.

Comment on lines +80 to +82
/// NB: This method must be very careful to avoid logging. Between the point where we have taken
/// exclusive access to the Console and when the UI has actually been initialized, attempts to
/// log from this method would deadlock (by causing the method to wait for _itself_ to finish).
Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: This was already true before this change: because we were using a sync_channel, any attempt to log in this method before the second task was spawned (which couldn't be spawned until we were actually initialized) would have caused this method to deadlock.

fn setup_bars(
&self,
executor: Executor,
) -> Result<(Vec<ProgressBar>, MultiProgressTask), String> {
// Stderr is propagated across a channel to remove lock interleavings between stdio and the UI.
Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving this some thought, I believe that the lock interleavings referred to here are no longer possible as the code exists today. The indicatif ProgressBar has an internal RwLock, but none of our code executes underneath it (because it accesses stdio directly while it has exclusive access). So it will never end up "above" one of our own locks in the callstack.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read this a few times and it looks reasonble.

src/rust/engine/ui/src/console_ui.rs Show resolved Hide resolved
@Eric-Arellano
Copy link
Contributor

Should we cherry-pick back to 2.8? It's a pretty bad broken window.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Jan 6, 2022

Should we cherry-pick back to 2.8? It's a pretty bad broken window.

Yes, but I'd like to wait until it's been on main for a week or two.

@stuhood stuhood added this to the 2.8.x milestone Jan 6, 2022
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@stuhood stuhood enabled auto-merge (squash) January 6, 2022 21:15
@stuhood stuhood merged commit aa0932a into pantsbuild:main Jan 6, 2022
@stuhood stuhood deleted the stuhood/ui-teardown-drop branch January 6, 2022 22:19
// And another to propagate stderr, which will exit automatically when the channel closes.
let _stderr_task = executor.spawn_blocking(move || {
while let Ok(stderr) = stderr_receiver.recv() {
match first_bar.upgrade() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the bug was here. We've definitely got stderr to flush, but the WeakProgressBar may have evaporated already in which case we drop the stderr.

let dest_bar = {
let stderr_dest_bar = stderr_dest_bar.lock();
// We can safely unwrap here because the Mutex is held until the bar is initialized.
stderr_dest_bar.as_ref().unwrap().upgrade().ok_or(())?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change appears to work, but I also never see a failure to upgrade the progress bar from weak; i.e.: I neve see a log::warn / retry.

Copy link
Sponsor Member Author

@stuhood stuhood Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the stderr callback fails, the stdio crate will clear exclusive access and fall back to directly accessing stdio: the empty error is the signal to fallback. So that's expected!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My read of the stdio crate is that it should fall back to looging and that should include a warn level log lione I do not see:

res.map_err(|()| "Failed to write stderr to Exclusive".to_owned())
}
InnerDestination::Logging => {
// Release the lock on the Destination before logging.
std::mem::drop(destination);
log::info!("stderr: {:?}", String::from_utf8_lossy(content));
return;
}
};
// Release the lock, clear the Console, log the error and retry.
let error_str = format!(
"Failed to write stderr to {:?}, falling back to Logging: {:?}",
destination, error_res
);
std::mem::drop(destination);
self.console_clear();
log::warn!("{}", error_str);

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Actually, you're right. When we fail to write to the handler and tear down stderr, we also clear the Console, which would cause things to be logged to .pants.d/pants.log, rather than to stderr.

So... things should probably work the way I was claiming above (an error in the stderr callback should clear exclusive access and re-enable the Console), but they don't currently. You can check the log to see if you're getting any output there, but I suspect that removing an extra thread and channel here might also have made the error harder to reproduce.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 1 hit and that was back on 12/22 - i.e.: not from this fix:

$ grep -r "Failed to write" .pants.d/
.pants.d/run-tracker/pants_run_2021_12_22_11_37_03_636_20047a7c6a314cd6876147aeee73d2b7/logs:11:47:27.31 [WARN] Failed to write stderr to Console { console: Console { stdin_handle: Some(File { fd: 163, path: "/dev/pts/5 (deleted)", read: true, write: false }), stdout_handle: Some(File { fd: 164, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_handle: Some(File { fd: 313, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_use_color: true } }, falling back to Logging: Err("Input/output error (os error 5)")

stuhood added a commit to stuhood/pants that referenced this pull request Mar 3, 2022
As described in pantsbuild#13276 and pantsbuild#11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In pantsbuild#13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid.

To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination.

Fixes pantsbuild#13276 and pantsbuild#11626.

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Mar 3, 2022
…4690)

As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid.

To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination.

Fixes #13276 and #11626.

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent goal output
3 participants