Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define error aware interface #1275

Merged
merged 37 commits into from Sep 29, 2021
Merged

Define error aware interface #1275

merged 37 commits into from Sep 29, 2021

Conversation

synzhu
Copy link
Contributor

@synzhu synzhu commented Sep 10, 2021

Defines a new Error Aware interface which an owner of a component can use to be notified when a fatal error occurs in the component. The owner can then decide what to do which this error (restart the component, log the error, propagate the error up to its own parent component, etc).

Eventually, every component could implement this interface. After a component is started, the owner of the component would also starts up a thread that reads from the error channel. When an error occurs, that thread can parse the error and decide how to respond to it.

makes progress on https://github.com/dapperlabs/flow-go/issues/5829

TODO

  • Update the code so that irrecoverable errors encountered during startup and shutdown are handled properly. Currently an irrecoverable error thrown during startup will cause start() inside RunComponent to block forever, and error thrown during shutdown will be ignored.
  • Handle race condition where multiple irrecoverables are thrown at the same time from a component
  • Write a test case where error is thrown during startup (while waiting for Ready)
  • Write a test case where error is thrown during shutdown

@codecov-commenter
Copy link

codecov-commenter commented Sep 10, 2021

Codecov Report

Merging #1275 (0bcea39) into master (28a6922) will increase coverage by 0.04%.
The diff coverage is 60.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1275      +/-   ##
==========================================
+ Coverage   56.25%   56.29%   +0.04%     
==========================================
  Files         500      504       +4     
  Lines       31162    31267     +105     
==========================================
+ Hits        17531    17603      +72     
- Misses      11262    11287      +25     
- Partials     2369     2377       +8     
Flag Coverage Δ
unittests 56.29% <60.28%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
consensus/config.go 0.00% <0.00%> (ø)
consensus/participant.go 15.95% <0.00%> (ø)
model/flow/role.go 89.36% <ø> (ø)
network/p2p/middleware.go 0.00% <0.00%> (ø)
network/test/testUtil.go 94.90% <ø> (ø)
network/topology/helper.go 82.35% <ø> (-1.86%) ⬇️
network/p2p/libp2pNode.go 65.45% <30.76%> (-1.11%) ⬇️
network/p2p/subscription_filter.go 47.82% <47.82%> (ø)
module/irrecoverable/irrecoverable.go 50.00% <50.00%> (ø)
network/topology/randomizedTopology.go 66.07% <50.00%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ba7bcb...0bcea39. Read the comment docs.

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a stab in the right direction, because it constrains the implementer to have a place where to send irrecoverable errors (the error channel).

Here's a list of things that seem important to me:

  • make it clear this is for irrecoverable errors from the PoV of the ErrorAware module, i.e. that this isn't meant to encompass all error management therein,
  • force the implementer to have a way to react to those irrecoverable errors, i.e. require a func OnError(err error),
  • and then possibly link the two with default processing, perhaps by doing the following (?) :
    • having an ErrorBase that listens to the error channel and passes the errors read there to OnError
    • and then having implementing components embed the ErrorBase

@synzhu
Copy link
Contributor Author

synzhu commented Sep 10, 2021

It's a stab in the right direction, because it constrains the implementer to have a place where to send irrecoverable errors (the error channel).

Here's a list of things that seem important to me:

  • make it clear this is for irrecoverable errors from the PoV of the ErrorAware module, i.e. that this isn't meant to encompass all error management therein,

  • force the implementer to have a way to react to those irrecoverable errors, i.e. require a func OnError(err error),

  • and then possibly link the two with default processing, perhaps by doing the following (?) :

    • having an ErrorBase that listens to the error channel and passes the errors read there to OnError
    • and then having implementing components embed the ErrorBase

@huitseeker Actually, the way I think about this is slightly different.

The Errors() function of a component allows the owner (aka parent) of the component to handle fatal errors in the component. I don't think the component itself should handle its own fatal errors. In other words, the implementer of the component shouldn't have to react to it's own irrecoverable errors, because said errors are by definition irrecoverable from its own point of view. Only the parent of the component has enough context to handle these errors.

so the way this works is, once the owner of component A starts is, it then spawns a new routine which polls from the error channel of component A. If it receives an error, then it can shutdown component A, or restart it, or propagate the error on to it's own parent, etc.

@gomisha
Copy link
Contributor

gomisha commented Sep 13, 2021

I like the idea of having the caller / parent (not the component itself) handle the child component's error. We could have a mechanism to keep a running log of the errors that could be further analyzed for things like flakiness (is the same component throwing an error at frequent intervals?)

How would we test this? i.e. under what conditions would components throw an error?

What would be the expected error handling behavior and can we standardize this? (i.e. restart component, log the error somewhere permanent for later analysis, notify someone)

@huitseeker
Copy link
Contributor

@smnzhu I think we're on the same page, I wasn't clear about two interfaces:

  • one in the component, which is the error sink,
  • one outside of the component, which has to react to errors, and to which my comment applied.

@synzhu
Copy link
Contributor Author

synzhu commented Sep 13, 2021

How would we test this? i.e. under what conditions would components throw an error?

What would be the expected error handling behavior and can we standardize this? (i.e. restart component, log the error somewhere permanent for later analysis, notify someone)

@gomisha Components will throw an error anytime they encounter a fatal / irrecoverable error.

The expected error handling behavior will be different for each component, and may also depend on the specific error that was thrown. The implementation of the error handling behavior should be left up to the parent component, and I don't think there is a standard behavior that would apply to all cases.

@huitseeker I see what you're saying. For now, I don't see a need to force the parent to have a way to react to errors or implement default processing, because if we implement a single OnError() method on the parent, we lose information about which specific child component the error came from. I think we should wait until this error handling interface is being used more broadly across the codebase, before identifying common patterns in the way it is used and extracting that into some common helper structs or functions.

In the future if components are standardized with a Run() / Start() interface as proposed here, I could imagine having a helper function

type ErrorHandler func(context.Context, <-chan error, func())
type ComponentFactory func() module.RunnableReadyDoneErrorAware

StartComponent(context.Context, ComponentFactory, ErrorHandler)

which could be used as follows:

ctx, cancel := context.WithCancel(context.Background())

componentFactory := func() module.RunnableReadyDoneErrorAware {
    return NewComponentX(arg1, arg2)
}

StartComponent(ctx, componentFactory, func(errCtx context.Context, errs <-chan error, restart func()) {
    select {
	case err := <-errs:
		// handle the error here. In this example, we restart the component
		restart()
	case <-errCtx.Done():
		return
    }
})

and implemented (roughly) as follows:

func StartComponent(ctx context.Context, componentFactory ComponentFactory, errorHandler ErrorHandler) {
	restartChan := make(chan struct{})

	start := func() (context.CancelFunc, <-chan struct{}) {
		// context used to restart the component
		runCtx, cancel := context.WithCancel(ctx)

                component := componentFactory()
		component.Run(runCtx)
		
		select {
		case <-ctx.Done():
			runtime.Goexit()
		case <-component.Ready():
		}

		go errorHandler(runCtx, component.Errors(), func() {
			restartChan <- struct{}{}
			runtime.Goexit()
		})

		return cancel, component.Done()
	}

	go func() {
		for{
			cancel, done := start()
			
			select {
			case <-ctx.Done():
				return
			case <-restartChan:
				cancel()
			}
			
			select {
			case <-ctx.Done():
				return
			case <-done:
			}
		}
	}()
}

However, I think it's too early for stuff like this rn.

For now, I added a new helper struct ErrorManager. Currently it is not being used in any useful way in consensus follower, but in general here is how you would use it when implementing a new component:

  1. You create an instance of ErrorManager as a field in your component during instantiation.
  2. When implementing the component, anywhere you would have previously logged fatal or panicked, instead call ErrorManager.ThrowError(). This will put the error into the error channel and aborts the current goroutine. Effectively, this mimicks the behavior of a panic or log.Fatal from the perspective of the goroutine that the error is thrown from.
  3. You can then use ErrorManager.Errors() to implement the component's own Errors() interface.

One potential issue I currently see with this approach is that it may not behave in a desirable way if a fatal error is caused as a result of some external party calling a public method on the component (e.g componentX.DoStuff()). If the processing happens in the caller's thread, then ErrorManager.ThrowError() would abort the caller's goroutine, which is probably bad. However, I think this is an issue that would apply to any alternative approach to error handling as well. Ultimately, the solution will probably be to enforce that all communication between components happens via channels / queues, or at least ensure that any public methods a component exposes cannot cause a fatal / irrecoverable error in the component itself.

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, thanks a lot for getting this started, there's a ton of great ideas in there, such as:

  • the separation between in and out-of component,
  • the use of a closure for start,
  • the IrrecoverableError signaling through a channel

I've left a few alternate suggestions in #1308, here are the main points:

  • the word Error is overly broad for situations encountered inside the component: I've tried to suggest Irrecoverable for errors that relinquish the control flow,
  • the signaling of the irrecoverable error does not have a lot of variability, it pretty much needs to be what's in your ThrowError, hence I'm not sure this should be an interface. Maybe I missed something: do you have suggestion for vastly different behaviors for the emitter-side of an irrecoverable error?
  • the receiving side of the irrecoverable error seems to me that it has a lot of variability. In particular, there may be a significant task of getting the system in a "known good state" before restart,
  • I would like RunComponent to be hierarchical. For instance, if the Start of a Startable component returns an error, there's no use doing a restart. Hence this is an irrecoverable error from the PoV of the RunComponent, hence I'd love to be able to send that error to an irrecoverable handler in an enclosing context.
  • To represent this hierarchy, I've tried to use the component we use to pass things around hierarchically: the Context
  • the most complex function in both our approaches is RunComponent, and it seems valuable to me to make it generic (so nobody else has to rewrite one), and, hopefully, small

Possibly superfluous:
- type assertions to be able to use a function on Context instances that are actually IrrecoverableSignalerContext,

Possibly missing:
- do we need to demo more in the component. E.g. RunComponent may demo cleanup, etc?
@huitseeker huitseeker mentioned this pull request Sep 21, 2021
1 task
Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following failure, has this merged master?

GO111MODULE=on go test -coverprofile=coverage.txt -covermode=atomic  --tags relic ./...
?   	github.com/onflow/flow-go/access	[no test files]
?   	github.com/onflow/flow-go/access/legacy	[no test files]
?   	github.com/onflow/flow-go/access/legacy/convert	[no test files]
<nil> INF admin server starting up admin=command_runner
<nil> INF process loop shutting down admin=command_runner
<nil> INF admin server shutting down admin=command_runner
panic: send on closed channel

goroutine 11 [running]:
github.com/onflow/flow-go/admin.(*adminServer).RunCommand(0xc00040ce58, {0x183af20, 0xc0000c85a0}, 0xc0001aeec0)
	/Users/huitseeker/tmp/flow-go/admin/server.go:32 +0x1d2
github.com/onflow/flow-go/admin/admin._Admin_RunCommand_Handler({0x169efa0, 0xc00040ce58}, {0x183af20, 0xc0000c85a0}, 0xc00049a8a0, 0x0)
	/Users/huitseeker/tmp/flow-go/admin/admin/admin_grpc.pb.go:77 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000436700, {0x1846e48, 0xc00017e180}, 0xc0007ac480, 0xc00049ee40, 0x1cd0bd0, 0x0)
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1297 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc000436700, {0x1846e48, 0xc00017e180}, 0xc0007ac480, 0x0)
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1626 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:941 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:939 +0x294
FAIL	github.com/onflow/flow-go/admin	0.557s

module/run_component_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for the tests!

module/run_component_test.go Show resolved Hide resolved
Copy link
Contributor

@peterargue peterargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@synzhu
Copy link
Contributor Author

synzhu commented Sep 29, 2021

bors merge

@bors
Copy link
Contributor

bors bot commented Sep 29, 2021

@bors bors bot merged commit c40cfbc into master Sep 29, 2021
@bors bors bot deleted the smnzhu/error-handling branch September 29, 2021 04:55
@huitseeker huitseeker mentioned this pull request Oct 4, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants