Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analyzers, cache: Allow to shut down gracefully #392

Merged
merged 12 commits into from May 1, 2023

Conversation

mitjat
Copy link
Collaborator

@mitjat mitjat commented Apr 27, 2023

This PR was motivated by #360 which introduces an on-disk cache (KV store), but does not close it cleanly when exiting. This PR adds a cleanup mechanism to the cache, and (partly by necessity, partly for code hygiene) extends the cleanup mechanism to all the analyzers.

At a high level, all analyzers are now provided with an external context.Context in their Start() method, and they are on the lookout for the ctx.Done() channel closing. The top-level code closes the channel on regular program termination and on SIGINT (Ctrl+C).

There is an additional change piggybacking on this PR:

  • The recently-added KVStore (cache) now uses cbor to serialize its keys. It previously used gob, which in my testing complained for certain keys. (I think null pointers? I didn't investigate further, as I think it's preferable to not introduce new serialization formats (here, gob) into the project anyway.)

Testing: Ran in three different scenarios, verified that KV store (which is the furthest down the stack of cleanup methods) closes correctly:

  • analyzers with a limited range in config; allow them to finish "on their own"
  • kill analyzers with a single ctrl+C
  • kill analyzers with a double ctrl+C (by introducing an artificial delay into cleanup); verify that the program is killable at the cost of incomplete cleanup

cmd/analyzer/analyzer.go Outdated Show resolved Hide resolved
select {
case <-signalChan:
// We received the first Ctrl+C; try to shut down gracefully.
a.Shutdown()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the caller of Start, either runAnalyzer or func in rootMain calls .Shutdown. should only have to return, rather than putting these a.Shutdown calls in various places. are we able to return?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh are we not able to return because we don't cancel the context until something calls Shutdown thus calling cancelAnalyzers? seems like it would be more natural to receive the signal and only call cancelAnalyzers. no a.Shutdown(). and let us return normally

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that might also let us extract this into a helper that only creates a ctx that cancels on ctrl+c

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this comment. Is your main concern that the caller of Start() now needs to be careful to call Shutdown(), therefore a clean shutdown is not guaranteed? Not sure what kind of returning you had in mind; where and why would the callers want to return?

This might be relevant: The analysis Service (which owns/controls all the analyzers; the other service we have is the API service) was originally designed with explicit cleanup, i.e. with Start+Shutdown. So the Service's Shutdown() is always called when appropriate, e.g. here (and in another spot; there are only two calls to Start() anyway).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, referring to that. i.e. it's (or was previously at least) up to the thing using the analyzer Service (1) to call its Start and (2) to call its Shutdown.

my understanding of this change is that Shutdown is now something that sometimes gets called by the service itself inside Start, sometimes from outside. in fact, if I'm following this right, both?

I feel it's a smaller change and less complicated to cancel the context and let the subordinate analyzers finish, let the Start return, and let the thing using the analyzer Service call Shutdown as before

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown is now something that sometimes gets called by the service itself inside Start, sometimes from outside

That was indeed funky. Thank you for noticing.

Your observation caused me to think about the code more, and I noticed more smells that I then refactored away:

  • The Shutdown() method of the Service had to be called by the caller of Service::Start(). Then it makes more sense to just call defer Shutdown() from Start() and not burden the caller. We do so now.
  • The code introduced by this PR repeated that sin (Start+Shutdown) at the Analyzer level. I changed it now to also use defer.
  • Several Shutdown() methods throughout the codebase were IMO misleadingly named. They did not actively cause anything to shut down; they performed resource cleanup and assumed that any active processing had completed. I renamed them appropriately, IIRC to either cleanup() or Close().

I also refactored the code around this comment thread so it no longer waits for ctrl+C in a separate goroutine; I find the new approach much easier to grok.

I added all these changes as a separate commit, should you prefer to view them that way, but I think it makes more sense to diff against main.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oo looking good

little asymmetry, but don't let this distract you: initializing those resources is in a public function NewMain etc, while cleaning them up is in a private method cleanup. once you make a service, you have no way out other than to Start it and let Start clean things up when it's done

@mitjat mitjat requested a review from pro-wh April 27, 2023 21:02
cmd/analyzer/analyzer.go Outdated Show resolved Hide resolved
@mitjat mitjat force-pushed the mitjat/graceful-shutdown branch 3 times, most recently from 62cb459 to a3150be Compare April 29, 2023 04:04
@mitjat mitjat requested a review from pro-wh April 29, 2023 04:08
func (m *Main) Start() {
ctx := context.Background()
func (m *Main) Start(ctx context.Context) {
defer m.cleanup()

if err := m.prework(); err != nil {
Copy link
Member

@ptrus ptrus Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also pass the ctx into prework so that this cancelable context is also used in SendBatch there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, are there any other context.Background's around?

@mitjat mitjat force-pushed the mitjat/graceful-shutdown branch from 8467cd6 to 14f8632 Compare May 1, 2023 18:16
@mitjat mitjat requested a review from ptrus May 1, 2023 18:17
@@ -143,6 +149,14 @@ func (m *Main) Start() {
m.cfg.Range.From, m.cfg.Range.To))
}

func (m *Main) cleanup() {
if err := m.cfg.Source.Close(); err != nil {
m.logger.Error("failed to cleanly close consensus data source",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consensus -> runtime (I think?)

a.logger.Info("all analyzers have exited cleanly")
return
case <-signalChan:
a.logger.Info("received second interrupt, forcing exit")
Copy link
Collaborator

@pro-wh pro-wh May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analysis of this logic:
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is a little sophisticated. do we specifically want to program a custom second-ctrl-c behavior? if not, we could unregister the handler and let the second ctrl+c do whatever the default action is. exit the program immediately

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Implemented it now. For my taste, the code is about the same complexity to understand though. There's no more explicit handling of a second ctrl+C, which is nicer and simpler (and more agnostic of surrounding code, which might theoretically have its own handlers); that's good. But there is the additional cognitive burden of tracking whether the custom signal interceptor is active or not.

I also feel the image is harder to grok than the code :). Also, not sure it's correct; the far left path takes three ctrl+C to reach "process exited"?

Copy link
Collaborator

@pro-wh pro-wh May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, in the previous code, the second ctrl+c makes this function return, which transfers control to the defers. first, the defer to unregister the signal notifier. second, to call cleanup. cleanup I'm pretty sure is fast, it's only closing the target storage postgresql db connection. but what happens after that? the analyzer goroutines may still be running. I think go keeps the process alive until those goroutines finish. but with no signal handler installed, a third ctrl+c does make the program exit

Copy link
Collaborator

@pro-wh pro-wh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the changes

Copy link
Collaborator

@Andrew7234 Andrew7234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@mitjat mitjat force-pushed the mitjat/graceful-shutdown branch from fa44ffa to c4695c5 Compare May 1, 2023 23:04
@mitjat mitjat enabled auto-merge May 1, 2023 23:04
@mitjat mitjat merged commit 377880d into main May 1, 2023
5 checks passed
@mitjat mitjat deleted the mitjat/graceful-shutdown branch May 1, 2023 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants