
Improve liveliness checker #6714

Merged
merged 8 commits into master from improve-liveliness-checker on Jul 11, 2019

Conversation

@baumanj
Member

commented Jul 9, 2019

Add mechanisms for indicating a thread has ended, either with an error (will be logged for a period of time, alongside potentially deadlocked threads) or expectedly. Also, add special must_use types to help ensure correct usage. See individual commits for details.

Also, add liveliness checking to the launcher.
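As background, here is a minimal, self-contained sketch of the heartbeat pattern the checker builds on. This is not the PR's actual implementation: it uses `std::sync::OnceLock` where the 2019 code used `lazy_static!`, and the signatures are simplified.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};
use std::thread;
use std::time::{Duration, Instant};

// Global map of thread name -> time of last heartbeat.
fn heartbeats() -> &'static Mutex<HashMap<String, Instant>> {
    static MAP: OnceLock<Mutex<HashMap<String, Instant>>> = OnceLock::new();
    MAP.get_or_init(|| Mutex::new(HashMap::new()))
}

// Worker threads call this at the top of every loop iteration.
fn mark_thread_alive() {
    let name = thread::current().name().unwrap_or("<unnamed>").to_string();
    heartbeats().lock().unwrap().insert(name, Instant::now());
}

// The checker thread periodically flags threads with stale heartbeats.
fn check_thread_heartbeats(threshold: Duration) {
    for (name, last) in heartbeats().lock().unwrap().iter() {
        if last.elapsed() > threshold {
            eprintln!("Thread '{}' hasn't checked in for {:?}; deadlock likely",
                      name, last.elapsed());
        }
    }
}
```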

@baumanj baumanj added the X-fix label Jul 9, 2019

@baumanj baumanj self-assigned this Jul 9, 2019

@christophermaier
Contributor

left a comment

Looks good so far... I think 1 week might be a tad excessive, but it's configurable, so 🤷‍♂

@baumanj

Member Author

commented Jul 9, 2019

> I think 1 week might be a tad excessive

I'm open to other suggestions. I only added the culling to prevent runaway memory consumption, but if that's happening, it implies the same threads are ending with errors over and over, so maybe a shorter time would be appropriate.

However, if there was a rare occurrence of an error, it would be a shame to miss it because it got culled before anyone looked. We could make the culling logic only trigger if there are a large number of items in the ThreadStatusMap, but I'm hesitant to make the logic more complex before we see a problem in practice.
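For concreteness, a sketch of the time-based culling under discussion; the ThreadStatus variants and field names here are illustrative, not the PR's actual definitions:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

enum ThreadStatus {
    Alive { last_heartbeat: Instant },
    DeadWithError { died_at: Instant, error: String },
}

type ThreadStatusMap = HashMap<String, ThreadStatus>;

// Drop error records that have been retained longer than the configured
// window, so the map can't grow without bound.
fn cull_dead_threads(statuses: &mut ThreadStatusMap, max_time_since_death: Duration) {
    statuses.retain(|_, status| match status {
        ThreadStatus::DeadWithError { died_at, .. } => {
            died_at.elapsed() <= max_time_since_death
        }
        ThreadStatus::Alive { .. } => true,
    });
}
```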

@christophermaier

Contributor

commented Jul 9, 2019

I like the idea of culling, and I definitely wouldn't want to make the logic any more complex. I was thinking maybe a few days would be a good threshold, but it's really no big deal. It is configurable, after all.

@@ -855,7 +855,7 @@ impl Manager {
         // TODO (CM): Investigate the appropriateness of capturing any
         // errors or panics generated in this loop and performing some
         // kind of controlled shutdown.
-        let shutdown_mode = loop {
+        let shutdown_mode: habitat_common::sync::ThreadReturn<ShutdownMode> = loop {

@christophermaier
Contributor

commented Jul 10, 2019

I think we might not want to type the loop like this, given that we still do some work after the loop finishes. That's actually where we were hanging with the regression fixed by #6717.

If this code was in place earlier, it would look like the main loop hadn't deadlocked at all.

@christophermaier
Contributor

commented Jul 10, 2019

In general though, the explicit approach of typing the thread exits looks really slick! 👍

@baumanj
Author Member

commented Jul 10, 2019

> I think we might not want to type the loop like this, given that we still do some work after the loop finishes.

Sure, that's no problem. Though if we are calling mark_thread_alive only in the loop, we want to make sure all the paths that exit from the loop opt out of liveliness checking, right?

> If this code was in place earlier, it would look like the main loop hadn't deadlocked at all.

I'm not sure I understand. Can you elaborate? I'm guessing the place where we were stuck prior to #6717 is https://github.com/habitat-sh/habitat/blob/master/components/sup/src/manager.rs#L1086, yes? If that's the case, getting a deadlock warning from this thread was more like a happy accident. For correctly identifying future-based deadlocks, I think we should be replacing the wait there with a poll that calls mark_thread_alive, no? Or maybe I'm mistaken; my Futures knowledge is pretty minimal as yet.
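To make the "every exit path must opt out" concern concrete, a hypothetical sketch; the stub functions stand in for the real liveliness-checker and supervisor logic:

```rust
enum ShutdownMode { Normal, Restarting }

// Stubs standing in for the real API.
fn mark_thread_alive() {}
fn unregister_thread() {}
fn restart_requested() -> bool { false }
fn shutdown_requested() -> bool { true }

fn main_loop() -> ShutdownMode {
    let shutdown_mode = loop {
        mark_thread_alive();
        // ... supervisor work elided ...
        if restart_requested() {
            break ShutdownMode::Restarting;
        }
        if shutdown_requested() {
            break ShutdownMode::Normal;
        }
    };
    // Every `break` above funnels through this one call, so no exit path
    // can leave the thread registered and produce a false positive.
    unregister_thread();
    shutdown_mode
}
```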

@christophermaier
Contributor

commented Jul 11, 2019

I think I'm OK with not having anything in here do any unregistering of threads from the deadlock detector. Once we get out of the loop, we should be shutting down in a timely fashion anyway. I don't think we want any kind of polling and periodic marking of the main thread as still being alive once we get down to the point where we're draining the reactor of futures.

#6717 fixes things in the near term, but I'll think a bit more about ways that we can more completely ensure that our "long-running" futures are terminated before we get to the "drain the reactor" part.

@baumanj
Author Member

commented Jul 11, 2019

I'm torn on this one. In the current state of the world, you're absolutely right. What I worry about is down the line, when changes to the code are likely to cause either false positives or false negatives.

We can remove the unregister calls here, but only because it's the main thread for the entire supervisor and only because, once it leaves this particular loop, it should shut down in a timely fashion. That seems like a lot of fairly brittle assumptions. I could certainly imagine this loop getting moved to a separate thread in the future, or an additional loop being added afterward that represents a different state of operation but isn't expected to end in a short period of time.

As we use futures more and more, I think the value of having the liveliness checker work with them becomes more important. Right now, removing the unregister calls upon exiting the main loop here would give us "deadlock likely" messages for the main loop, but it wouldn't tell us where it was stuck. My first thought would be false positive, and that would be reinforced by the lack of unregister calls. Understanding this isn't the case requires realizing that main is a special case. That may be clear in retrospect, but if we can avoid making special cases, I think that'll reduce the cognitive load at a time when we may already be tired and/or stressed (i.e., troubleshooting erroneous behavior in production).

The two improvements I think would be of significant value are making the liveliness checker work for tracking futures progress and adding an indicator of where in the code mark_thread_alive is called (more than just the thread name). If we had these capabilities, it would've allowed us to zero in on where we were stuck directly. I anticipate this won't be the last time futures code is associated with getting unexpectedly stuck.

For the time being, let's go with removing the unregister calls. If you agree about where we want to go with this, I can file issues for the other enhancements, and we can prioritize getting to a place that would truly expedite the discovery of issues like this in the future.

@christophermaier
Contributor

commented Jul 11, 2019

I agree with all this. In order to really work with futures, we're going to have to make some additional changes to the tracker anyway, since they get passed around on a threadpool. (i.e., just the thread name won't be useful; otherwise we'll see stuff like "'tokio-threadpool-3' hasn't checked in in X seconds; deadlock likely").

It would be interesting to get some kind of introspection on the threadpool, though. If a future ends up blocking because it's somehow deadlocked on one of these read-write locks, that would block up a thread. Even if it wasn't deadlocked, the fact that a future would jam up a threadpool thread for an appreciable amount of time is itself a sign of a poorly structured future, and that would be a good thing to know anyway. Perhaps that's a good avenue of exploration, if there aren't already off-the-shelf crates that address it.

While it's not going to be 100% applicable to this exact use case, I'm very keen on figuring out how to incorporate tokio-trace into our codebase. I suspect that will be useful in helping us get a clearer picture of everything that's going on internally.

@baumanj
Author Member

commented Jul 12, 2019

Filed #6723

pub struct ThreadUnregistered<T = (), E = std::convert::Infallible>(Result<T, E>);

impl<T, E> ThreadUnregistered<T, E> {
    pub fn into_result(self) -> Result<T, E> { self.0 }

@baumanj
Author Member

commented Jul 11, 2019

In case you're wondering why we can't use Into/From traits: https://users.rust-lang.org/t/generic-conversion-from-newtypes/16247
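For reference, my reconstruction of the conversion that the orphan rules reject; this is an illustration, not code from the PR:

```rust
pub struct ThreadUnregistered<T = (), E = std::convert::Infallible>(Result<T, E>);

// The generic conversion we'd like is rejected: `From` and `Result` are
// both foreign, and the uncovered type parameters `T` and `E` appear
// before the first local type, which the orphan rules forbid (E0210).
//
//     impl<T, E> From<ThreadUnregistered<T, E>> for Result<T, E> {
//         fn from(x: ThreadUnregistered<T, E>) -> Self { x.0 }
//     }
//
// Hence the inherent method instead:
impl<T, E> ThreadUnregistered<T, E> {
    pub fn into_result(self) -> Result<T, E> { self.0 }
}
```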

@baumanj baumanj marked this pull request as ready for review Jul 11, 2019

@baumanj baumanj requested a review from raskchanky as a code owner Jul 11, 2019

@christophermaier
Contributor

left a comment

There's a lot of really cool stuff in here; I love it!

/// from the liveliness checker or else be subject to false positives. In general, the idiom is
/// ```
/// use habitat_common::liveliness_checker;
/// # let error = false;

@christophermaier
Contributor

commented Jul 11, 2019

Did you mean to leave this line in?

@baumanj
Author Member

commented Jul 11, 2019

Yes, without that line, cargo test --doc -p habitat_common would fail.

The # is to hide it from the documentation display. See, for example: https://github.com/rust-lang/rust/blob/master/src/liballoc/string.rs#L953 vs http://doc.rust-lang.org/1.36.0/alloc/string/struct.String.html#method.try_reserve.
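A standalone illustration of that convention (standard rustdoc behavior, not specific to this PR): a line beginning with `#` inside a doc test is compiled and run by `cargo test --doc`, but omitted from the rendered documentation.

```rust
/// ```
/// # // Hidden setup: executed by `cargo test --doc`, invisible in rendered docs.
/// # let error = false;
/// if error {
///     eprintln!("thread exiting with an error");
/// }
/// ```
pub fn example_item() {}
```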

@christophermaier
Contributor

commented Jul 11, 2019

Ah, gotcha... 👍

@@ -38,7 +37,7 @@ lazy_static! {

 pub fn spawn_thread(name: String, server: Server, timing: Timing) -> std::io::Result<()> {
     thread::Builder::new().name(name)
-                          .spawn(move || run_loop(&server, &timing))
+                          .spawn(move || -> ! { run_loop(&server, &timing) })

@christophermaier
Contributor

commented Jul 11, 2019

I notice that removing the -> ! doesn't seem to have an effect, as long as run_loop is defined to return either nothing or !.

As soon as I modified run_loop to return something, though, this refuses to compile. Neat!
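A small self-contained sketch of that behavior: the `-> !` annotation on the closure is only enforced once the body could actually return.

```rust
fn run_loop() -> ! {
    loop {
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
}

fn spawn_example() {
    // Compiles: the closure body diverges, matching the annotated `!`.
    std::thread::spawn(move || -> ! { run_loop() });

    // If run_loop were changed to return `()`, the annotated closure would
    // fail to compile with error[E0308]: mismatched types (expected `!`,
    // found `()`).
}
```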

/// }
/// }
/// ```
pub fn and_divergent(self) {}

@christophermaier
Contributor

commented Jul 11, 2019

This is a nice touch.
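A hedged sketch of how the must_use sentinel and and_divergent fit together; the PR's actual CheckedThread definition may differ:

```rust
#[must_use = "threads must unregister before exiting, or be marked divergent"]
pub struct CheckedThread(());

impl CheckedThread {
    /// Explicitly consume the sentinel on a code path that never exits
    /// normally (e.g. a `loop` with no `break`), silencing the `must_use`
    /// warning while documenting the intent.
    pub fn and_divergent(self) {}
}

pub fn mark_thread_alive() -> CheckedThread {
    // ... heartbeat recording elided ...
    CheckedThread(())
}

fn worker() -> ! {
    loop {
        // Ignoring the return value would trigger the must_use warning;
        // `.and_divergent()` records that this loop is intentionally infinite.
        mark_thread_alive().and_divergent();
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
}
```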

check_thread_heartbeats(statuses, threshold);
log_dead_threads(statuses);
cull_dead_threads(statuses, max_time_since_death);
thread::sleep(delay);

@christophermaier
Contributor

commented Jul 11, 2019

Wait, how will we know this thread is still alive?

:trollface: Quis custodiet ipsos custodes? :trollface:

@baumanj
Author Member

commented Jul 11, 2019

I nearly added it, before considering how pointless that would be.

baumanj added a commit that referenced this pull request Jul 11, 2019

Don't unregister the supervisor main loop with liveliness checker
See #6714 (comment)

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
@davidMcneil
Contributor

left a comment

This is really slick! 👍

baumanj added some commits Jul 9, 2019

Allow threads to self remove from liveliness checking
Threads which call habitat_common::sync::mark_thread_alive and then
exit (intentionally or due to error) will no longer be reported as
likely deadlocked if they call ::mark_thread_dead before exiting.

Threads can indicate a normal or erroneous exit to mark_thread_dead.
Threads which exited with an error will be logged along with an
explanation of the error.

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Eventually stop logging threads that died with an error
To avoid boundless growth of the thread heartbeat collection, once a
thread has been dead for HAB_THREAD_DEAD_IGNORE_DELAY_SECS, stop
logging it. Default delay: 1 week.

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Rename heartbeats collection
Now that we're storing dead threads as well, it's more accurately a
collection of statuses. This is less error-prone than having two
separate collections, since there's no possibility of one thread ending
up in both: a thread can logically only have one status.

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Enforce liveliness checking deregistration
Add a sentinel type (ThreadReturn) to ensure all code paths call
mark_thread_dead when they leave the loop which calls mark_thread_alive.

These names aren't ideal, but that will be addressed separately.

Also, add some diverging annotations to thread spawning to make the
intended behavior clearer.

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Rename and reorganize liveliness checker
Move from habitat_common's sync module to new liveliness_checker module
mark_thread_dead -> unregister_thread
ThreadReturn -> ThreadUnregistered

Also, add more documentation, improve error message for calling
unregister_thread on a previously untracked thread and add must_use
to unregister_thread.

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Add liveliness_checker module
Also increase type safety with `must_use` and `CheckedThread`

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Fix up comments
Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>
Don't unregister the supervisor main loop with liveliness checker
See #6714 (comment)

Signed-off-by: Jon Bauman <5906042+baumanj@users.noreply.github.com>

@baumanj baumanj force-pushed the improve-liveliness-checker branch from 7442bc5 to a2101c0 Jul 11, 2019

@christophermaier christophermaier merged commit dcbbbff into master Jul 11, 2019

5 checks passed

DCO: This commit has a DCO Signed-off-by
buildkite/habitat-sh-habitat-master-verify: Build #2632 passed (19 minutes, 28 seconds)
buildkite/habitat-sh-habitat-master-website: Build #3235 passed (38 seconds)
continuous-integration/travis-ci/pr: The Travis CI build passed
expeditor/config-validation: Validated your Expeditor config file

@christophermaier christophermaier deleted the improve-liveliness-checker branch Jul 11, 2019

chef-ci added a commit that referenced this pull request Jul 11, 2019

Update CHANGELOG.md with details from pull request #6714
Obvious fix; these changes are the result of automation not creative thinking.