Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing Graceful Shutdown #339

Merged
merged 38 commits into from Dec 12, 2017
Merged

Introducing Graceful Shutdown #339

merged 38 commits into from Dec 12, 2017

Conversation

pulltab
Copy link
Contributor

@pulltab pulltab commented Nov 10, 2017

Sources, sinks, and filters now shutdown gracefully and in a manner which fosters at least once delivery.

In response to a SIGINT or SIGTERM, the cernan main thread now:

  1. Gracefully terminates and joins all running sources. For sources with protocols enabled with ACKs, this will ensure downstream clients are aware when their intended source is no longer available. Graceful terminate is achieved by use of mio and its polling features to select between client traffic and so called SYSTEM events. At present there is only one supported SYSTEM event (SHUTDOWN).

  2. Gracefully terminates running filters and sinks, blocking until all have terminated. This is accomplished by injecting the new metric::event::Shutdown at the end of Hopper queue associated with the given consumer. Upon receipt of a metric::event::Shutdown, filters and sinks flush and terminate.

@pulltab
Copy link
Contributor Author

pulltab commented Nov 10, 2017

Resolves #34

@blakebarnett
Copy link
Contributor

blakebarnett commented Nov 10, 2017

This is awesome, not proficient enough with rust to review properly but I can't wait to see it in action :)

src/util.rs Outdated
@@ -1,5 +1,7 @@
//! Utility module, a grab-bag of functionality

extern crate mio;
Copy link
Contributor Author

@pulltab pulltab Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be cleaned up.


fn broadcast_shutdown(workers : &HashMap<String, SinkWorker>){
let mut source_channels = Vec::new();
for (id, worker) in workers.iter() {
Copy link
Contributor Author

@pulltab pulltab Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blt This broadcast approach thwarts at least once delivery.

Consider a case where we have a topology where a sink is at the end of some chain of sinks/filters. What we really want is to inject the SHUTDOWN at the top most sink/filter for each chain then let that SHUTDOWN event be forwarded along.

Do you happen to know a way to just iterate over the root nodes of each chain?

Copy link
Collaborator

@blt blt Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it’s goofy because there’s no proper graph structure but we build
up the “top” of the paths for time pulses. I imagine you could reuse that
list.

Somewhat longer term we should probably invest in a proper graph. Cycle
detection would be nice.

blt
blt previously requested changes Nov 13, 2017
Copy link
Collaborator

@blt blt left a comment

I'm still learning about Mio so the comments I have here so far are minimal. I'm going to start a test-bench for this branch.

fn broadcast_shutdown(workers : &HashMap<String, SinkWorker>){
let mut source_channels = Vec::new();
for (id, worker) in workers.iter() {
println!("Signaling shutdown to {:?}", id);
Copy link
Collaborator

@blt blt Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a call to the logger. An INFO maybe?

);
while let Some(mut telem) = Q.pop() {
let mut events = mio::Events::with_capacity(1024);
match poll.poll(& mut events, Some(slp)) {
Copy link
Collaborator

@blt blt Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poll.poll will block, will it not? If it does, that's a problem for the steady beat of telemetry.

If we're not catching a shutdown here should we be polling?

Copy link
Contributor Author

@pulltab pulltab Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does block, with the same slp value as before.

Same behavior as https://github.com/postmates/cernan/pull/339/files#diff-fa755baccca5ebdb28f49f30a86be0f7L112.

My expectation is that one day we will be catching system events here, hence it was plumbed in.

@blt blt force-pushed the ipc-shutdown branch 2 times, most recently from a561d1e to ea495cf Compare Nov 25, 2017
@blt
Copy link
Collaborator

blt commented Nov 25, 2017

Hhhmmmm so this works fine in debug mode, but as soon as you compile in release-mode sending a SIGKILL will freeze things up.

@blt
Copy link
Collaborator

blt commented Nov 25, 2017

No, no, I can trigger it in debug mode too. Never mind my last.

fn shutdown(&mut self) -> () {
self.flush();
// This won't work:
// https://docs.rs/hyper/0.10.13/hyper/server/struct.Listening.html#method.close
Copy link
Collaborator

@blt blt Nov 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pulltab here's the problem.

Copy link
Contributor Author

@pulltab pulltab Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the following line you introduced resolve the issue?

Copy link
Collaborator

@blt blt Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. :(

@blt blt force-pushed the ipc-shutdown branch 2 times, most recently from 4ab608d to 4559aeb Compare Dec 1, 2017
@blt blt dismissed their stale review Dec 2, 2017

Boy it's been a while since that last review.

fn handle_tcp(
chans: util::Channel,
tags: sync::Arc<metric::TagMap>,
socket_map: HashMap<mio::Token, mio::net::TcpListener>,
Copy link
Collaborator

@blt blt Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, HashMap in the wild. Hoping to convert this to slab.

Copy link
Contributor Author

@pulltab pulltab Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

thread::spawn(move || {
handle_stream(chans, tags, stream);
tags: sync::Arc<metric::TagMap>,
listener_map: HashMap<mio::Token, mio::net::TcpListener>,
Copy link
Collaborator

@blt blt Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker for HashMap -> slab conversion.

Copy link
Contributor Author

@pulltab pulltab Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

use protobuf::repeated::RepeatedField;
use protocols::prometheus::*;
// use protobuf::Message;
// use protobuf::repeated::RepeatedField;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker for commented-out code.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

break;
}
}
// let default_header =
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker for commented-out code.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

Err(e) => {
panic!(format!("Failed to send prometheus response! {:?}", e));
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't panic here. Previously we incremented PROMETHEUS_REPORT_ERROR on failure.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to a warn!


match request.respond(response) {
Ok(_) => {
PROMETHEUS_WRITE_TEXT.fetch_add(1, Ordering::Relaxed);
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous behaviour was to increment PROMETHEUS_WRITE_TEXT when we attempted to write, uh, text. We can probably remove both WRITE_BINARY and WRITE_TEXT telemetry now that we're all-in on text.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed both and added PROMETHEUS_REPORT_SUCCESS to compliment PROMETHEUS_REPORT_ERROR.

Copy link
Collaborator

@blt blt Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this in uncommitted code maybe?


let host_port =
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tuple here should be able to coerce into the IP needed. Yeah, looks like (&str, u16) implements ToSocketAddrs.

}
res.end()
}
// fn write_binary(aggrs: Iter) -> io::Result<()> {
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker for commented out code.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -1,20 +1,17 @@
extern crate mio;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker to remove redundant extern crate mio in favor of the one in lib.rs.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with a use statement.

@@ -1,3 +1,5 @@
extern crate mio;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker to remove redundant extern crate mio in favor of the one in lib.rs.

@@ -3,6 +3,8 @@
//! In cernan a `Source` is a place where all `metric::Event` come from, feeding
//! down into the source's forwards for further processing. Statsd is a source
//! that creates `Telemetry`, `FileServer` is a source that creates `LogLine`s.
extern crate mio;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker to remove redundant extern create.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -1,3 +1,5 @@
extern crate mio;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker to remove redundant extern crate.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -1,14 +1,19 @@
extern crate mio;
Copy link
Collaborator

@blt blt Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker to remove redundant extern create.

Copy link
Contributor Author

@pulltab pulltab Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Brian L. Troutwine and others added 18 commits Dec 12, 2017
I've added a 'shutdown' to the sink to allow for the controlled
shutdown of internal resources. The shutdown will hang for prometheus
sink because of the issue linked in the comments.

We're probably going to have to update prometheus sink HTTP server.
This also implies that, going forward, we'll have to be very careful
about testing shutdown functions for fully running sinks.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
I misunderstood what John was doing and broke his work. This commit
fixes the thing I broke: statsd source shutdown. What I'd misunderstood
was that the constants::SYSTEM token is _only_ readable when it's time
for a source to go the way of tears in the rain.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This branch was very chatty during build. It is no longer.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Previously shutdown took `&mut self` which meant that the Sink
storage could still be valid after shutdown. This stank in terms
of actually performing shutdown. Now the Sinks can be sure they
own everything on the way out.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This commit re-enables write_text in the prometheus sink. Had
to fiddle with the imports a touch but we're back on the road.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
write_text no longer creates an http::Response but does require
a buffer to be passed into it, which it then fills up.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
This commit moves some of the clones we were doing into reference
passes, saving on a handful of copies / increments of atomic counters.

Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Brian L. Troutwine and others added 6 commits Dec 12, 2017
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
Replaces usage of HashMap to store Token -> Evented mappings.
Needed in latest master.
Signed-off-by: Brian L. Troutwine <blt@postmates.com>
blt
blt approved these changes Dec 12, 2017
Copy link
Collaborator

@blt blt left a comment

The day of glory has come.

@pulltab pulltab merged commit 6313149 into master Dec 12, 2017
2 checks passed
@blt blt deleted the ipc-shutdown branch Dec 12, 2017
@gliush
Copy link
Contributor

gliush commented Dec 13, 2017

Cool!

@blt blt mentioned this pull request Jan 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants