
Adding graceful shutdown to server #1977

Merged: 1 commit merged on Aug 1, 2023

Conversation

@nmittler (Contributor) commented Jun 25, 2023

Fixes #1976.

@nmittler (Contributor Author):

@djc @bluejekyll this is a WIP, but I wanted to send it out to get early feedback.

@nmittler (Contributor Author):

@djc regarding your #1976 (comment), looks like we don't currently depend on hyper. Not sure if it would be acceptable to bring that in as a general dependency. Thoughts?

@djc (Collaborator) commented Jun 27, 2023

I did not mean bringing in hyper as a dependency, merely copying the way it implements graceful shutdown (which seems like it might be lighter-weight than the linkerd stuff that you copied)?
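For reference, a minimal sketch of the hyper-style approach being suggested (assuming hyper 0.14's Server builder API with its server/http1/tcp features and a tokio runtime; this is not code from this PR): shutdown is driven by an arbitrary future passed to with_graceful_shutdown, which stops accepting new connections and lets in-flight requests finish.

```rust
use std::convert::Infallible;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn hello(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("hello")))
}

#[tokio::main]
async fn main() {
    // Any future can act as the shutdown signal; a oneshot channel is common.
    let (tx, rx) = tokio::sync::oneshot::channel::<()>();

    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(hello)) });

    let server = Server::bind(&([127, 0, 0, 1], 0).into())
        .serve(make_svc)
        // Stop accepting new connections once `rx` resolves, then drain
        // in-flight requests before the returned future completes.
        .with_graceful_shutdown(async {
            rx.await.ok();
        });

    let _ = tx.send(()); // trigger shutdown immediately, just for the sketch
    if let Err(e) = server.await {
        eprintln!("server error: {e}");
    }
}
```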

@nmittler nmittler changed the title [WIP] Adding graceful shutdown to ServerFuture. [WIP] Adding graceful shutdown to Server. Jun 27, 2023
@nmittler nmittler changed the title [WIP] Adding graceful shutdown to Server. [WIP] Adding graceful shutdown to server Jun 27, 2023
@nmittler (Contributor Author) commented Jun 27, 2023

@djc I reworked this based on your comments regarding hyper. So far, I've only implemented UDP so we can see what it looks like. PTAL.

@djc (Collaborator) commented Jun 28, 2023

This seems to add a whole bunch of abstraction on top of what's there now just to add graceful shutdown. Instead, I think we should plug graceful shutdown functionality into the existing abstractions.

@nmittler (Contributor Author) commented Jun 28, 2023

@djc

This seems to add a whole bunch of abstraction on top of what's there now just to add graceful shutdown. Instead, I think we should plug graceful shutdown functionality into the existing abstractions.

My first pass with the linkerd drain crate did just that. However, after looking into a more generic approach (as is done with hyper's Server API), I ran into problems with the current structure of ServerFuture...

Ideally, when we start each server loop we would pass in a graceful shutdown future that can be used in stream.take_until to stop accepting incoming connections. However, if we pre-create the server loops when we register the socket, there's no easy way to insert a graceful shutdown future later on.
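As a rough illustration of that idea (not this PR's code; it assumes the futures crate's StreamExt::take_until, tokio, and made-up names), an accept loop built on take_until stops pulling new work as soon as the shutdown future resolves:

```rust
use futures::stream::{self, StreamExt};
use tokio::sync::oneshot;

// Hypothetical accept loop: `shutdown` is the graceful-shutdown future.
async fn accept_loop(shutdown: oneshot::Receiver<()>) {
    // Stand-in for the stream of incoming requests on a socket.
    let incoming = stream::iter(0u32..);

    // `take_until` ends the stream once `shutdown` resolves, so the loop
    // stops accepting new work while in-flight work can still complete.
    let mut incoming = incoming.take_until(shutdown);
    while let Some(request) = incoming.next().await {
        let _ = request; // handle the request here
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let server = tokio::spawn(accept_loop(rx));
    let _ = tx.send(()); // request graceful shutdown
    let _ = server.await;
}
```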

A few paths forward that come to mind:

  1. Provide generic registerXXX methods. This might be the least invasive approach. The idea here is that for every register method we also have a generic version that accepts a user-defined future for graceful shutdown. For example, we would have both fn register_socket(UdpSocket) and:

pub fn register_socket_graceful<Fut, Out>(&mut self, socket: net::UdpSocket, drain: Fut)
where
    Fut: Future<Output = Out> + Send + Unpin + 'static,
    Out: Send + Unpin + 'static,
{
    // ...
}

Note that because the future is internally passed into tokio::spawn, both the Future and its Output have to be 'static to satisfy the lifetime requirements. While this would solve the problem with the fewest changes to the existing code, it seems a bit smelly.

  2. Add the drain future as a member of ServerFuture. This actually gets tricky, since all of the generic parameters now have to be added to ServerFuture itself. That makes it difficult to have a simple new method that takes only the handler, because there is no way to infer all of the other parameters.
  3. Break apart the structure so that we don't create the server loops until later, when the decision of whether to use graceful shutdown has been made. In this case, the registerXXX methods would just add futures, and then we would have a serve or run method that takes an optional graceful shutdown future (see the sketch after this list). This ends up requiring the most restructuring of the server code, since the storage of the futures requires a known size. I solved this in the new code by using a Server enum, and then having a CompositeServer that maintains a list of them.
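A very rough sketch of the shape option 3 could take (purely illustrative; the names and types here are made up and this is not the code in this PR):

```rust
use std::future::Future;

use tokio::net::UdpSocket;

// Hypothetical: registration only records sockets; the per-socket loops are
// built in `serve`, where an optional shutdown future can be threaded in.
struct Server {
    sockets: Vec<UdpSocket>,
}

impl Server {
    fn register_socket(&mut self, socket: UdpSocket) {
        self.sockets.push(socket);
    }

    async fn serve<F>(self, shutdown: Option<F>)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        // Build one accept loop per socket here, wiring the shutdown future
        // into each (e.g. via `take_until` or `tokio::select!`).
        let _ = (self.sockets, shutdown);
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut server = Server { sockets: Vec::new() };
    server.register_socket(UdpSocket::bind("127.0.0.1:0").await?);
    // Pass `None` to serve without graceful shutdown.
    server.serve(None::<std::future::Ready<()>>).await;
    Ok(())
}
```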

I'm definitely not a Rust expert, so there may be better ways around some of the issues I was encountering. At the end of the day, though, the Server structure that I came up with seems pretty clean, albeit a bit of a departure from the current code.

I'm definitely open to suggestions.

@djc @bluejekyll thoughts?

@bluejekyll (Member) commented Jun 28, 2023

I'm coming to this conversation a little late. Looking briefly, I'm not sure why the new server is necessary?

Thinking about how I solve similar problems, I would probably just share an AtomicBool with all the listener loops, and then have the loops exit cleanly after all current requests complete. I haven't put a ton of thought into this right now, but is there a reason something as simple as that wouldn't work?

I haven't looked at the code in a lot of detail right now, but based on my memory of it I think this would be possible. Is this something you considered? (I haven't looked at the hyper implementation mentioned.) The type would be a handle returned by the server, something roughly like:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Clone)]
struct ServerHandle {
    is_shutdown: Arc<AtomicBool>, // if false, then handlers should continue accepting work
}

impl ServerHandle {
    /// Tell the server handlers to shut down
    pub fn shutdown(&self) {
        self.is_shutdown.store(true, Ordering::Relaxed);
    }

    /// Handler loops would read this every loop iteration and stop if true
    pub fn is_shutdown(&self) -> bool {
        self.is_shutdown.load(Ordering::Relaxed)
    }
}
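A hypothetical listener loop using such a handle might look like the sketch below (illustrative only; the UDP echo body is just a stand-in). Note that it only observes the flag after the next packet arrives, which is essentially the wake-up problem discussed later in this thread.

```rust
// Hypothetical listener loop checking the shared flag each iteration.
async fn listener_loop(handle: ServerHandle, socket: tokio::net::UdpSocket) {
    let mut buf = [0u8; 512];
    while !handle.is_shutdown() {
        // NOTE: this await only returns when a packet arrives, so the loop
        // won't notice shutdown until the next request comes in.
        if let Ok((len, src)) = socket.recv_from(&mut buf).await {
            // Handle the request; here we just echo the bytes back.
            let _ = socket.send_to(&buf[..len], src).await;
        }
    }
}
```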

ServerFuture already has this method to await tasks finishing, so it just becomes a matter of figuring out how we wire the above through all the listeners: https://docs.rs/trust-dns-server/0.22.1/trust_dns_server/server/struct.ServerFuture.html#method.block_until_done

Ok, looking a little more at the ServerFuture type, I think we could add something like the ServerHandle above, created in ServerFuture's constructor. It would then be passed into each listener's loop. A new method could then be added to ServerFuture to get it for controlling shutdown: ServerFuture::get_server_handle. Would that be useful in the context you were thinking of?

@nmittler (Contributor Author) commented Jun 29, 2023

@bluejekyll That's fair. The current complexity comes from allowing the application to use its own shutdown hooks directly, similarly to what hyper does. That does introduce a lot of complexity, and I think you're right that we don't need it. Will update the PR shortly.

@bluejekyll (Member):

@djc do you agree with this approach? It would be good to get that feedback before @nmittler puts in a ton of work.

@nmittler (Contributor Author):

@bluejekyll @djc I refactored things, PTAL when you get a moment.

@nmittler nmittler force-pushed the graceful branch 2 times, most recently from e8635d8 to 499e2c0 on June 29, 2023 at 20:18
@nmittler (Contributor Author):

Also, not sure why the build is failing to find tokio::select! ... everything builds fine for me locally.

@djc (Collaborator) commented Jun 30, 2023

@djc do you agree with this approach? It would be good to get that feedback before @nmittler puts in a ton of work.

Yup, sounds like the right direction to me!

Review comments (now resolved) on:
crates/server/src/server/https_handler.rs
crates/server/src/server/quic_handler.rs
crates/server/src/server/server_future.rs
@djc (Collaborator) commented Jun 30, 2023

Also, not sure why the build is failing to find tokio::select! ... everything builds fine for me locally.

Probably because tokio's Cargo features aren't depended upon correctly.

@nmittler nmittler force-pushed the graceful branch 3 times, most recently from c621369 to 08293c5 on June 30, 2023 at 13:37
@nmittler (Contributor Author):

@djc I resolved the issue with finding tokio::select!, but now there is a different issue that doesn't seem to be directly related to this PR: https://github.com/bluejekyll/trust-dns/actions/runs/5423528718/jobs/9861952221?pr=1977. Any thoughts?

@djc (Collaborator) commented Jul 1, 2023

No, you'll need to dig in and figure out what's going on.

@nmittler nmittler force-pushed the graceful branch 3 times, most recently from ce21981 to e7a7604 on July 10, 2023 at 16:28
@nmittler (Contributor Author) commented Jul 10, 2023

@bluejekyll @djc I ended up going back to using linkerd's drain-rs for controlling the graceful shutdown. The problem I was seeing with using AtomicBool as a Future is that it never triggers the Waker, which can lead to the shutdown hanging indefinitely. The drain-rs lib seems pretty light to me ... I don't think it should bloat the server much.
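To illustrate the difference, here is a minimal sketch (not the PR's code) with tokio::sync::watch standing in for the drain machinery: awaiting a future-aware channel registers a Waker, so listener tasks are woken as soon as shutdown is signalled, rather than hanging until they happen to be polled again.

```rust
use tokio::sync::watch;

// Hypothetical listener loop driven by a future-aware shutdown signal.
async fn listener_loop(mut shutdown: watch::Receiver<bool>) {
    loop {
        tokio::select! {
            // `changed()` registers a Waker, so this task is woken as soon
            // as the sender publishes a new value.
            _ = shutdown.changed() => break,
            // Stand-in for accepting and handling the next request.
            _ = tokio::task::yield_now() => { /* handle work */ }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);
    let task = tokio::spawn(listener_loop(rx));
    let _ = tx.send(true); // signal shutdown; the loop is woken and exits
    let _ = task.await;
}
```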

Additionally, this PR seems to be having issues with code coverage (currently at ~61%). I've updated all uses of ServerFuture that I found in our tests with the new graceful shutdown logic. Looking at the output, it appears that most of the missed coverage is for existing code. I suspect the original code landed before the code coverage analysis was added? Can we just force merge this PR as-is? Adding coverage to ServerFuture seems to be beyond the scope here.

@bluejekyll (Member):

Ah, yes, I see the issue: needing to wake all the tasks on shutdown. I hadn't really been thinking about that, so thanks for pointing out the complexity here.

I guess the question I have is: is drain-rs the simplest type for this? I don't know much about that dependency, but in theory, any future aware channel would work just as well and avoid a new dependency, correct?

@nmittler (Contributor Author):

@bluejekyll

in theory, any future aware channel would work just as well and avoid a new dependency, correct?

Are you thinking there is something we can leverage from existing dependencies here? Or are you suggesting that we roll our own? AFAICT the drain library only depends on tokio, and it's just using channels under the covers. I think it would be preferable to just use that if the only alternative is to roll our own. WDYT?

@djc (Collaborator) commented Jul 28, 2023

@bluejekyll

in theory, any future aware channel would work just as well and avoid a new dependency, correct?

Are you thinking there is something we can leverage from existing dependencies here? Or are you suggesting that we roll our own? AFAICT the drain library only depends on tokio, and it's just using channels under the covers. I think it would be preferable to just use that if the only alternative is to roll our own. WDYT?

That sounds plausible to me.

@nmittler (Contributor Author):

@bluejekyll @djc I've rebased and also added a small wrapper around the linkerd drain signal, so that we don't expose their API from ServerFuture. PTAL
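Purely as an illustration of that wrapper pattern (not the PR's actual type; tokio::sync::watch stands in for the linkerd drain signal here, and the names are hypothetical), the idea is to keep the third-party type private so ServerFuture's public API only exposes a crate-owned handle:

```rust
use tokio::sync::watch;

/// Hypothetical crate-owned handle; the underlying signalling mechanism
/// stays a private implementation detail.
pub struct ShutdownHandle {
    tx: watch::Sender<bool>,
}

impl ShutdownHandle {
    /// Ask all server tasks to shut down gracefully.
    pub fn shutdown(self) {
        // Tasks awaiting `Receiver::changed()` are woken by this send.
        let _ = self.tx.send(true);
    }
}
```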

Also, any thoughts on this?

Additionally, this PR seems to be having issues with code coverage (currently at ~61%). I've updated all uses of ServerFuture that I found in our tests with the new graceful shutdown logic. Looking at the output, it appears that most of the missed coverage is for existing code. I suspect the original code landed before the code coverage analysis was added? Can we just force merge this PR as-is? Adding coverage to ServerFuture seems to be beyond the scope here.

@djc (Collaborator) commented Jul 30, 2023

Sure, there's no need to fix up all of the lack of coverage here.

@bluejekyll (Member):

Thanks for all the explanations, @nmittler, I agree with this usage. I also just took a look at the code; it seems reasonable enough to me. On code coverage, I use that to understand the risk we might have with incoming changes, not as something we're necessarily required to hit.

@bluejekyll (Member) left a review:

Ok, I've reviewed, and I think this overall looks good. I don't want to get pedantic on naming, but I'm not sure I would name the local variables drain; something like shutdown, or something closer to their usage, would be clearer.

@djc, can you close out some of your conversations if you don't have any open concerns?

Thanks for putting this all together, @nmittler. I know this has taken a while, and I appreciate you working through all the discussion points and taking the time to see this through.

@nmittler nmittler force-pushed the graceful branch 2 times, most recently from 44522b5 to bf1f954 on July 30, 2023 at 19:22
@nmittler (Contributor Author):

Thanks @bluejekyll. FYI, I just pushed a change to use shutdown instead of drain.

@nmittler (Contributor Author) commented Aug 1, 2023

@djc any other comments/concerns before we land this?

@bluejekyll (Member):

Merging, thanks for getting this done, @nmittler!

@bluejekyll bluejekyll merged commit 578ce5a into hickory-dns:main Aug 1, 2023
17 of 18 checks passed
@nmittler (Contributor Author) commented Aug 1, 2023

nmittler commented Aug 1, 2023

@bluejekyll @djc thank you!

@nmittler (Contributor Author) commented Aug 1, 2023

@bluejekyll what is the cadence for pushing new artifacts? I'd like to switch Istio over to using the graceful shutdown logic when it's available.

@bluejekyll (Member):

Since we're on alphas right now, I'd be happy to publish another release today.

@nmittler (Contributor Author) commented Aug 1, 2023

@bluejekyll excellent! Thanks for all your help!

@theduke commented Aug 9, 2023

@nmittler thanks for implementing this!

I unfortunately couldn't dedicate more time than creating the issue.

@caobug (Contributor) commented Oct 1, 2023

We shouldn't move ownership on shutdown_gracefully(); that complicates things.

@djc (Collaborator) commented Oct 1, 2023

Please file a new issue or PR with suggestions on what we should do instead and what your use case is (why does this complicate things).
