New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dropped output race during UI teardown #14093
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commits are useful to review independently.
/// NB: This method must be very careful to avoid logging. Between the point where we have taken | ||
/// exclusive access to the Console and when the UI has actually been initialized, attempts to | ||
/// log from this method would deadlock (by causing the method to wait for _itself_ to finish). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: This was already true before this change: because we were using a sync_channel
, any attempt to log in this method before the second task was spawned (which couldn't be spawned until we were actually initialized) would have caused this method to deadlock.
fn setup_bars( | ||
&self, | ||
executor: Executor, | ||
) -> Result<(Vec<ProgressBar>, MultiProgressTask), String> { | ||
// Stderr is propagated across a channel to remove lock interleavings between stdio and the UI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After giving this some thought, I believe that the lock interleavings referred to here are no longer possible as the code exists today. The indicatif
ProgressBar
has an internal RwLock
, but none of our code executes underneath it (because it accesses stdio
directly while it has exclusive access). So it will never end up "above" one of our own locks in the callstack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read this a few times and it looks reasonble.
Should we cherry-pick back to 2.8? It's a pretty bad broken window. |
Yes, but I'd like to wait until it's been on |
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
// And another to propagate stderr, which will exit automatically when the channel closes. | ||
let _stderr_task = executor.spawn_blocking(move || { | ||
while let Ok(stderr) = stderr_receiver.recv() { | ||
match first_bar.upgrade() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the bug was here. We've definitely got stderr to flush, but the WeakProgressBar may have evaporated already in which case we drop the stderr.
let dest_bar = { | ||
let stderr_dest_bar = stderr_dest_bar.lock(); | ||
// We can safely unwrap here because the Mutex is held until the bar is initialized. | ||
stderr_dest_bar.as_ref().unwrap().upgrade().ok_or(())? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change appears to work, but I also never see a failure to upgrade the progress bar from weak; i.e.: I neve see a log::warn / retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the stderr
callback fails, the stdio
crate will clear exclusive access and fall back to directly accessing stdio: the empty error is the signal to fallback. So that's expected!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My read of the stdio crate is that it should fall back to looging and that should include a warn level log lione I do not see:
pants/src/rust/engine/stdio/src/lib.rs
Lines 340 to 357 in 2b909b2
res.map_err(|()| "Failed to write stderr to Exclusive".to_owned()) | |
} | |
InnerDestination::Logging => { | |
// Release the lock on the Destination before logging. | |
std::mem::drop(destination); | |
log::info!("stderr: {:?}", String::from_utf8_lossy(content)); | |
return; | |
} | |
}; | |
// Release the lock, clear the Console, log the error and retry. | |
let error_str = format!( | |
"Failed to write stderr to {:?}, falling back to Logging: {:?}", | |
destination, error_res | |
); | |
std::mem::drop(destination); | |
self.console_clear(); | |
log::warn!("{}", error_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Actually, you're right. When we fail to write to the handler and tear down stderr
, we also clear the Console
, which would cause things to be logged to .pants.d/pants.log
, rather than to stderr.
So... things should probably work the way I was claiming above (an error in the stderr callback should clear exclusive access and re-enable the Console
), but they don't currently. You can check the log to see if you're getting any output there, but I suspect that removing an extra thread and channel here might also have made the error harder to reproduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only 1 hit and that was back on 12/22 - i.e.: not from this fix:
$ grep -r "Failed to write" .pants.d/
.pants.d/run-tracker/pants_run_2021_12_22_11_37_03_636_20047a7c6a314cd6876147aeee73d2b7/logs:11:47:27.31 [WARN] Failed to write stderr to Console { console: Console { stdin_handle: Some(File { fd: 163, path: "/dev/pts/5 (deleted)", read: true, write: false }), stdout_handle: Some(File { fd: 164, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_handle: Some(File { fd: 313, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_use_color: true } }, falling back to Logging: Err("Input/output error (os error 5)")
As described in pantsbuild#13276 and pantsbuild#11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In pantsbuild#13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid. To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination. Fixes pantsbuild#13276 and pantsbuild#11626. [ci skip-build-wheels]
…4690) As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid. To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination. Fixes #13276 and #11626. [ci skip-build-wheels]
As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid.
To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous
Mutex
directly around the stderr destination.Fixes #13276 and #11626.
[ci skip-build-wheels]