Improved graceful shutdown - Agent #2031

Merged
merged 6 commits into from
Feb 24, 2020

Conversation

jpkrohling
Contributor

@jpkrohling jpkrohling commented Jan 24, 2020

Which problem is this PR solving?

  • Resolves #295 (Graceful shutdown by properly handling SIGTERM);
  • The current code seems to go in the direction of graceful shutdown but fails to correctly implement this behavior for all components. For instance, before this PR, the gRPC server would be closed without taking in-flight requests into consideration;
  • More thoughtful shutdown order, like in the all-in-one command: the agent is now closed before the collector disappears, preventing messages like the following after SIGTERM:
WARNING: 2020/02/18 13:27:44 grpc: addrConn.createTransport failed to connect to {127.0.0.1:14250  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:14250: connect: connection refused". Reconnecting...

Short description of the changes

  • The first commit shows what would be necessary to get a graceful shutdown for the collector
  • The second commit moves most of the startup/shutdown logic for the collector into a "component". I'm not happy with its current state, especially its location, but it shows that a good amount of code could be removed by adopting this idea. The same could be applied to the other components, but I want to get feedback on the idea here first, before spending much time on this refactoring.

Member

@yurishkuro yurishkuro left a comment

Long-overdue refactoring!

all-in-one/main should indeed be much smaller, and just reuse the components' start sequence. It also makes the start method testable.

@jpkrohling
Contributor Author

Status: this is taking quite some time, as it requires moving other things around as well, to avoid cyclic dependencies between packages. I think I'll split this into two tasks: the graceful shutdown for the collector, and the refactoring for the main.go files.

@jpkrohling jpkrohling requested a review from a team as a code owner February 12, 2020 10:29
Member

@yurishkuro yurishkuro left a comment

general top-level comment: please think of the ownership of closables

@jpkrohling jpkrohling changed the title WIP - Improved graceful shutdown Improved graceful shutdown Feb 13, 2020
@codecov

codecov bot commented Feb 13, 2020

Codecov Report

Merging #2031 into master will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2031      +/-   ##
==========================================
+ Coverage   96.35%   96.37%   +0.01%     
==========================================
  Files         214      214              
  Lines       10532    10532              
==========================================
+ Hits        10148    10150       +2     
+ Misses        326      325       -1     
+ Partials       58       57       -1
Impacted Files | Coverage Δ
cmd/query/app/server.go | 94.52% <0%> (+2.73%) ⬆️

Last update cb780f2...54066e5.

@@ -131,6 +137,10 @@ func (h *zipkinSpanHandler) SubmitZipkinBatch(spans []*zipkincore.Span, options
	return responses, nil
}

func (h *zipkinSpanHandler) Close() error {
	return h.modelProcessor.Close()
Member

This looks like we might be closing this twice? It's the ownership issue again - handlers don't create processors, so they should not be closing them.

Contributor Author

You are absolutely right. This wasn't a problem because the processor's #Close() was checking whether it had been closed before. I kept this safety check, but extracted the building of the span processor into its own function. The collector now builds it explicitly and passes it to the handler builder.

While I don't quite like that the collector calls builder.BuildProcessor() and then builder.BuildHandlers(processor), I think it's the best option that makes sense without building another abstraction with its own #Close().

Member

if Collector builds the processor explicitly, shouldn't it be in charge of closing it as well, rather than having the handlers do it?

Contributor Author

...

What if I told you that, after all I wrote before, I actually forgot to remove it from the handlers' #Close() methods...

Hopefully, it's now fixed in this last commit.
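
The ownership rule this thread converges on can be sketched as follows: whoever builds the processor closes it, and handlers that only borrow it have no Close at all. This is an illustration under assumptions, not the PR's code; `spanProcessor`, `spanHandler`, `buildHandlers`, and `collector` are hypothetical stand-ins, and using `sync.Once` for the "already closed" safety check is one possible way to do it.

```go
package main

import (
	"fmt"
	"sync"
)

// spanProcessor is a hypothetical stand-in for the collector's span processor.
type spanProcessor struct {
	once   sync.Once
	closes int // counts how many times the real cleanup ran
}

// Close is idempotent: the cleanup runs at most once, however often it is called.
func (p *spanProcessor) Close() error {
	p.once.Do(func() { p.closes++ })
	return nil
}

// spanHandler borrows the processor. It did not create it, so it has no Close.
type spanHandler struct {
	processor *spanProcessor
}

// buildHandlers receives the processor instead of creating one.
func buildHandlers(p *spanProcessor) []*spanHandler {
	return []*spanHandler{{processor: p}, {processor: p}}
}

// collector creates the processor, so it is the one that closes it.
type collector struct {
	processor *spanProcessor
	handlers  []*spanHandler
}

func newCollector() *collector {
	p := &spanProcessor{}
	return &collector{processor: p, handlers: buildHandlers(p)}
}

// Close releases only what the collector itself created.
func (c *collector) Close() error {
	return c.processor.Close()
}

func main() {
	c := newCollector()
	_ = c.Close()
	_ = c.Close() // a second call is harmless thanks to the guard
	fmt.Println("cleanup ran", c.processor.closes, "time(s)")
}
```

With a single owner, the guard in `Close` becomes a safety net rather than something the design relies on.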

Member

@yurishkuro yurishkuro left a comment

Can we split agent and collector changes into different PRs? E.g. git checkout master cmd/collector to get the agent-only changes. There's too much going on here.

}

// NewAgent creates the new Agent.
func NewAgent(
	proxy CollectorProxy,
Member

Why do we need Agent to be responsible for closing CollectorProxy if it didn't create it? There may be other services using that connection beyond the main agent's services.

Contributor Author

@jpkrohling jpkrohling Feb 18, 2020

True. It makes the all-in-one a bit more complicated for now, but it is indeed the right thing to do.

The question is whether we'd want the Agent to start the proxy. Right now, it does not, but I believe it's the only consumer of this proxy.
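
A rough sketch of what this means for an all-in-one style main: the agent is stopped first and leaves the proxy open, and the code that created the proxy closes it afterwards. All names here (`collectorProxy`, `agent`, `newAgent`, `shutdown`) are hypothetical and do not mirror the real Jaeger types.

```go
package main

import "fmt"

// collectorProxy is a hypothetical stand-in for the connection to the collector.
type collectorProxy struct{ closed bool }

func (p *collectorProxy) Close() error {
	p.closed = true
	return nil
}

// agent borrows the proxy: it was handed in, so Stop leaves it open.
type agent struct {
	proxy   *collectorProxy
	stopped bool
}

func newAgent(p *collectorProxy) *agent { return &agent{proxy: p} }

// Stop shuts down the agent's own services only.
func (a *agent) Stop() { a.stopped = true }

// shutdown stops the consumer first, then lets the creator close what it
// created. It returns the order in which things were torn down.
func shutdown(a *agent, p *collectorProxy) []string {
	var order []string
	a.Stop()
	order = append(order, "agent")
	_ = p.Close()
	order = append(order, "proxy")
	return order
}

func main() {
	proxy := &collectorProxy{} // created here, so closed from here
	ag := newAgent(proxy)
	fmt.Println(shutdown(ag, proxy))
}
```

Because `Stop` never touches the proxy, another consumer could keep using the same connection until its creator decides to close it.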

@@ -69,6 +69,7 @@ var (
type CollectorProxy interface {
	GetReporter() reporter.Reporter
	GetManager() configmanager.ClientConfigManager
	Close() error
Member

Pending decision on whether agent should be in charge of closing the proxy. If not, then this Close() method may be unnecessary since whoever created the proxy would have access to Close() on the concrete type.

Contributor Author

What's the cost of leaving this here? Having it as part of the interface forces new implementations to at least think about the #Close() operation.

Member

I think having Close() in an interface is pretty often an anti-pattern that indicates a problem elsewhere. But in some cases it is appropriate, when the object that holds a reference to this type does indeed require an interface, because it either deals with different implementations or uses a decorator pattern that requires the decorator to accept an interface. In this concrete case I don't have a strong opinion, aside from my general preference for not introducing things until they are actually necessary.

Member

An example of the "cost" is cmd/agent/app/builder_test.go, which now needs to add a no-op Close().

Contributor Author

Alright, I'll remove it, we can add it back if there's a need in the future.

Contributor Author

I'm adding this back, because both implementations actually create a connection on their own and need to properly close it. The only implementation that does not need a #Close is the test.
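
The trade-off in this thread can be seen in a small sketch. The interface below is a trimmed, hypothetical `collectorProxy` that keeps only Close (the two getters from the diff are left out for brevity), with one implementation that owns a connection and one test double that does not. None of these types are the PR's actual code.

```go
package main

import "fmt"

// collectorProxy is a trimmed, hypothetical version of the interface under
// discussion; only the Close method is kept.
type collectorProxy interface {
	Close() error
}

// connProxy is a hypothetical implementation that opens its own connection
// and therefore has something real to release.
type connProxy struct{ connOpen bool }

func newConnProxy() *connProxy { return &connProxy{connOpen: true} }

func (p *connProxy) Close() error {
	p.connOpen = false
	return nil
}

// fakeProxy is the kind of test double the thread mentions: it holds no
// resources, so its Close exists only to satisfy the interface.
type fakeProxy struct{}

func (fakeProxy) Close() error { return nil }

// release works with any implementation because Close is part of the interface.
func release(p collectorProxy) error { return p.Close() }

func main() {
	real := newConnProxy()
	_ = release(real)
	fmt.Println("connection open after release:", real.connOpen)
	fmt.Println("fake release error:", release(fakeProxy{}))
}
```

The benefit is that every implementation has to think about cleanup; the cost is the empty method on `fakeProxy`.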

@@ -23,6 +23,7 @@ import (
	grpcManager "github.com/jaegertracing/jaeger/cmd/agent/app/configmanager/grpc"
	"github.com/jaegertracing/jaeger/cmd/agent/app/reporter"
	aReporter "github.com/jaegertracing/jaeger/cmd/agent/app/reporter"
	"github.com/jaegertracing/jaeger/pkg/multierror"
)

// ProxyBuilder holds objects communicating with collector
Member

let's rename this to CollectorProxy (but ok to do in another PR)

@@ -169,7 +177,18 @@ func (c *Collector) Close() error {
		defer cancel()
	}

	// by now, we shouldn't have any in-flight requests anymore
	// by now, we shouldn't have any in-flight requests anymore, close the processor and handlers
	for _, closer := range []io.Closer{c.spanProcessor, c.zipkinSpansHandler, c.jaegerBatchesHandler, c.grpcHandler} {
Member

In all three span handlers, Close is a no-op. What's the point of adding it?

Contributor Author

While the case is strong for removing it, as we have three implementations and none needs to close anything, this seems like the type of thing that would require a closer in some implementations.

I can remove it, but the message that it sends is stronger than the fact that it's currently a no-op.

Member

The message it sends is that these handlers might be stateful in some way and require Close to clean up, which is kind of the opposite of what they are now: stateless proxies that simply forward to the collector (while not managing the life cycle of the collector proxy). I am in favor of simplicity, rather than adding code that might possibly be needed in the future.

It also complicates our internal build since now we also have to call close on these handlers, to respect their API.

@yurishkuro
Member

> The current code seems to go in the direction of graceful shutdown but fails to correctly implement this behavior for all components

Could you improve the description by explaining what was wrong with the current state?

@jpkrohling
Contributor Author

> Could you improve the description by explaining what was wrong with the current state?

Done!

@jpkrohling
Contributor Author

> E.g. git checkout master cmd/collector

TIL! This PR has been updated to include only the agent changes. I also squashed the commits, as the original commit set doesn't make much sense anymore.

@jpkrohling jpkrohling changed the title Improved graceful shutdown Improved graceful shutdown - Agent Feb 18, 2020
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling
Contributor Author

@yurishkuro is this ready to be merged, once 1.17 is out?

@yurishkuro yurishkuro merged commit d5036f4 into jaegertracing:master Feb 24, 2020
@jpkrohling jpkrohling deleted the 295-GracefulShutdown branch July 28, 2021 19:22