Fix dropped output race during UI teardown #14093

stuhood · 2022-01-06T20:29:30Z

As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid.

To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous Mutex directly around the stderr destination.

Fixes #13276 and #11626.

[ci skip-build-wheels]

stuhood

Commits are useful to review independently.

stuhood · 2022-01-06T20:31:04Z

src/rust/engine/ui/src/console_ui.rs

+  /// NB: This method must be very careful to avoid logging. Between the point where we have taken
+  /// exclusive access to the Console and when the UI has actually been initialized, attempts to
+  /// log from this method would deadlock (by causing the method to wait for _itself_ to finish).


FWIW: This was already true before this change: because we were using a sync_channel, any attempt to log in this method before the second task was spawned (which couldn't be spawned until we were actually initialized) would have caused this method to deadlock.

stuhood · 2022-01-06T20:34:43Z

src/rust/engine/ui/src/console_ui.rs

  fn setup_bars(
    &self,
    executor: Executor,
  ) -> Result<(Vec<ProgressBar>, MultiProgressTask), String> {
-    // Stderr is propagated across a channel to remove lock interleavings between stdio and the UI.


After giving this some thought, I believe that the lock interleavings referred to here are no longer possible as the code exists today. The indicatif ProgressBar has an internal RwLock, but none of our code executes underneath it (because it accesses stdio directly while it has exclusive access). So it will never end up "above" one of our own locks in the callstack.

Eric-Arellano

I've read this a few times and it looks reasonble.

src/rust/engine/ui/src/console_ui.rs

Eric-Arellano · 2022-01-06T21:02:54Z

Should we cherry-pick back to 2.8? It's a pretty bad broken window.

stuhood · 2022-01-06T21:04:55Z

Should we cherry-pick back to 2.8? It's a pretty bad broken window.

Yes, but I'd like to wait until it's been on main for a week or two.

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

jsirois · 2022-01-07T04:57:22Z

src/rust/engine/ui/src/console_ui.rs

-    // And another to propagate stderr, which will exit automatically when the channel closes.
-    let _stderr_task = executor.spawn_blocking(move || {
-      while let Ok(stderr) = stderr_receiver.recv() {
-        match first_bar.upgrade() {


So the bug was here. We've definitely got stderr to flush, but the WeakProgressBar may have evaporated already in which case we drop the stderr.

jsirois · 2022-01-07T04:57:24Z

src/rust/engine/ui/src/console_ui.rs

+        let dest_bar = {
+          let stderr_dest_bar = stderr_dest_bar.lock();
+          // We can safely unwrap here because the Mutex is held until the bar is initialized.
+          stderr_dest_bar.as_ref().unwrap().upgrade().ok_or(())?


This change appears to work, but I also never see a failure to upgrade the progress bar from weak; i.e.: I neve see a log::warn / retry.

When the stderr callback fails, the stdio crate will clear exclusive access and fall back to directly accessing stdio: the empty error is the signal to fallback. So that's expected!

My read of the stdio crate is that it should fall back to looging and that should include a warn level log lione I do not see:

pants/src/rust/engine/stdio/src/lib.rs

Lines 340 to 357 in 2b909b2

res.map_err(|()| "Failed to write stderr to Exclusive".to_owned())

}

InnerDestination::Logging => {

// Release the lock on the Destination before logging.

std::mem::drop(destination);

log::info!("stderr: {:?}", String::from_utf8_lossy(content));

return;

}

};

// Release the lock, clear the Console, log the error and retry.

let error_str = format!(

"Failed to write stderr to {:?}, falling back to Logging: {:?}",

destination, error_res

);

std::mem::drop(destination);

self.console_clear();

log::warn!("{}", error_str);

Hm. Actually, you're right. When we fail to write to the handler and tear down stderr, we also clear the Console, which would cause things to be logged to .pants.d/pants.log, rather than to stderr.

So... things should probably work the way I was claiming above (an error in the stderr callback should clear exclusive access and re-enable the Console), but they don't currently. You can check the log to see if you're getting any output there, but I suspect that removing an extra thread and channel here might also have made the error harder to reproduce.

Only 1 hit and that was back on 12/22 - i.e.: not from this fix:

$ grep -r "Failed to write" .pants.d/ .pants.d/run-tracker/pants_run_2021_12_22_11_37_03_636_20047a7c6a314cd6876147aeee73d2b7/logs:11:47:27.31 [WARN] Failed to write stderr to Console { console: Console { stdin_handle: Some(File { fd: 163, path: "/dev/pts/5 (deleted)", read: true, write: false }), stdout_handle: Some(File { fd: 164, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_handle: Some(File { fd: 313, path: "/dev/pts/5 (deleted)", read: false, write: true }), stderr_use_color: true } }, falling back to Logging: Err("Input/output error (os error 5)")

As described in pantsbuild#13276 and pantsbuild#11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In pantsbuild#13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid. To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination. Fixes pantsbuild#13276 and pantsbuild#11626. [ci skip-build-wheels]

…4690) As described in #13276 and #11626, a race condition can occur between releasing exclusive access to the Console and the dynamic UI shutting down. In #13276, the issue was triaged to a task accepting output without confirming that the UI was still valid, and then dropping it if the UI was no longer valid. To fix this, replace the extra task and channel (which could not be used "transactionally" to both accept stderr and validate that the UI was still valid), with an anonymous `Mutex` directly around the stderr destination. Fixes #13276 and #11626. [ci skip-build-wheels]

stuhood added 2 commits January 5, 2022 16:16

Remove unused teardown_mpsc.

e17d904

Swap to synchronously writing to the bar under an anonymous Mutex.

c2f242b

[ci skip-build-wheels]

stuhood requested review from jsirois, tdyas and Eric-Arellano January 6, 2022 20:29

stuhood commented Jan 6, 2022

View reviewed changes

stuhood mentioned this pull request Jan 6, 2022

Inconsistent goal output #13276

Closed

Eric-Arellano approved these changes Jan 6, 2022

View reviewed changes

src/rust/engine/ui/src/console_ui.rs Show resolved Hide resolved

stuhood added the needs-cherrypick label Jan 6, 2022

stuhood added this to the 2.8.x milestone Jan 6, 2022

Review feedback.

68c94eb

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

stuhood enabled auto-merge (squash) January 6, 2022 21:15

Eric-Arellano approved these changes Jan 6, 2022

View reviewed changes

stuhood merged commit aa0932a into pantsbuild:main Jan 6, 2022

stuhood deleted the stuhood/ui-teardown-drop branch January 6, 2022 22:19

jsirois reviewed Jan 7, 2022

View reviewed changes

This was referenced Jan 21, 2022

Optionally render the --dynamic-ui with Prodash #14221

Merged

PoC: Adding goal name to dynamic UI output #14249

Closed

stuhood mentioned this pull request Mar 3, 2022

Fix dropped output race during UI teardown (cherrypick of #14093) #14690

Merged

stuhood removed the needs-cherrypick label Mar 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix dropped output race during UI teardown #14093

Fix dropped output race during UI teardown #14093

stuhood commented Jan 6, 2022 •

edited

stuhood left a comment

stuhood Jan 6, 2022

stuhood Jan 6, 2022

Eric-Arellano left a comment

Eric-Arellano commented Jan 6, 2022

stuhood commented Jan 6, 2022 •

edited

jsirois Jan 7, 2022

jsirois Jan 7, 2022

stuhood Jan 7, 2022 •

edited

jsirois Jan 7, 2022

stuhood Jan 7, 2022

jsirois Jan 7, 2022

	res.map_err(\|()\| "Failed to write stderr to Exclusive".to_owned())
	}
	InnerDestination::Logging => {
	// Release the lock on the Destination before logging.
	std::mem::drop(destination);
	log::info!("stderr: {:?}", String::from_utf8_lossy(content));
	return;
	}
	};

	// Release the lock, clear the Console, log the error and retry.
	let error_str = format!(
	"Failed to write stderr to {:?}, falling back to Logging: {:?}",
	destination, error_res
	);
	std::mem::drop(destination);
	self.console_clear();
	log::warn!("{}", error_str);

Fix dropped output race during UI teardown #14093

Fix dropped output race during UI teardown #14093

Conversation

stuhood commented Jan 6, 2022 • edited

stuhood left a comment

Choose a reason for hiding this comment

stuhood Jan 6, 2022

Choose a reason for hiding this comment

stuhood Jan 6, 2022

Choose a reason for hiding this comment

Eric-Arellano left a comment

Choose a reason for hiding this comment

Eric-Arellano commented Jan 6, 2022

stuhood commented Jan 6, 2022 • edited

jsirois Jan 7, 2022

Choose a reason for hiding this comment

jsirois Jan 7, 2022

Choose a reason for hiding this comment

stuhood Jan 7, 2022 • edited

Choose a reason for hiding this comment

jsirois Jan 7, 2022

Choose a reason for hiding this comment

stuhood Jan 7, 2022

Choose a reason for hiding this comment

jsirois Jan 7, 2022

Choose a reason for hiding this comment

stuhood commented Jan 6, 2022 •

edited

stuhood commented Jan 6, 2022 •

edited

stuhood Jan 7, 2022 •

edited