Define error aware interface #1275

synzhu · 2021-09-10T02:39:37Z

Defines a new Error Aware interface which an owner of a component can use to be notified when a fatal error occurs in the component. The owner can then decide what to do which this error (restart the component, log the error, propagate the error up to its own parent component, etc).

Eventually, every component could implement this interface. After a component is started, the owner of the component would also starts up a thread that reads from the error channel. When an error occurs, that thread can parse the error and decide how to respond to it.

makes progress on https://github.com/dapperlabs/flow-go/issues/5829

TODO

Update the code so that irrecoverable errors encountered during startup and shutdown are handled properly. Currently an irrecoverable error thrown during startup will cause start() inside RunComponent to block forever, and error thrown during shutdown will be ignored.
Handle race condition where multiple irrecoverables are thrown at the same time from a component
Write a test case where error is thrown during startup (while waiting for Ready)
Write a test case where error is thrown during shutdown

codecov-commenter · 2021-09-10T02:55:43Z

Codecov Report

Merging #1275 (0bcea39) into master (28a6922) will increase coverage by 0.04%.
The diff coverage is 60.28%.

@@            Coverage Diff             @@
##           master    #1275      +/-   ##
==========================================
+ Coverage   56.25%   56.29%   +0.04%     
==========================================
  Files         500      504       +4     
  Lines       31162    31267     +105     
==========================================
+ Hits        17531    17603      +72     
- Misses      11262    11287      +25     
- Partials     2369     2377       +8

Flag	Coverage Δ
unittests	`56.29% <60.28%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
consensus/config.go	`0.00% <0.00%> (ø)`
consensus/participant.go	`15.95% <0.00%> (ø)`
model/flow/role.go	`89.36% <ø> (ø)`
network/p2p/middleware.go	`0.00% <0.00%> (ø)`
network/test/testUtil.go	`94.90% <ø> (ø)`
network/topology/helper.go	`82.35% <ø> (-1.86%)`	⬇️
network/p2p/libp2pNode.go	`65.45% <30.76%> (-1.11%)`	⬇️
network/p2p/subscription_filter.go	`47.82% <47.82%> (ø)`
module/irrecoverable/irrecoverable.go	`50.00% <50.00%> (ø)`
network/topology/randomizedTopology.go	`66.07% <50.00%> (ø)`
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ba7bcb...0bcea39. Read the comment docs.

huitseeker

It's a stab in the right direction, because it constrains the implementer to have a place where to send irrecoverable errors (the error channel).

Here's a list of things that seem important to me:

make it clear this is for irrecoverable errors from the PoV of the ErrorAware module, i.e. that this isn't meant to encompass all error management therein,
force the implementer to have a way to react to those irrecoverable errors, i.e. require a func OnError(err error),
and then possibly link the two with default processing, perhaps by doing the following (?) :
- having an ErrorBase that listens to the error channel and passes the errors read there to OnError
- and then having implementing components embed the ErrorBase

synzhu · 2021-09-10T23:15:58Z

It's a stab in the right direction, because it constrains the implementer to have a place where to send irrecoverable errors (the error channel).

Here's a list of things that seem important to me:

make it clear this is for irrecoverable errors from the PoV of the ErrorAware module, i.e. that this isn't meant to encompass all error management therein,

force the implementer to have a way to react to those irrecoverable errors, i.e. require a func OnError(err error),

and then possibly link the two with default processing, perhaps by doing the following (?) :

having an ErrorBase that listens to the error channel and passes the errors read there to OnError

and then having implementing components embed the ErrorBase

@huitseeker Actually, the way I think about this is slightly different.

The Errors() function of a component allows the owner (aka parent) of the component to handle fatal errors in the component. I don't think the component itself should handle its own fatal errors. In other words, the implementer of the component shouldn't have to react to it's own irrecoverable errors, because said errors are by definition irrecoverable from its own point of view. Only the parent of the component has enough context to handle these errors.

so the way this works is, once the owner of component A starts is, it then spawns a new routine which polls from the error channel of component A. If it receives an error, then it can shutdown component A, or restart it, or propagate the error on to it's own parent, etc.

gomisha · 2021-09-13T13:31:49Z

I like the idea of having the caller / parent (not the component itself) handle the child component's error. We could have a mechanism to keep a running log of the errors that could be further analyzed for things like flakiness (is the same component throwing an error at frequent intervals?)

How would we test this? i.e. under what conditions would components throw an error?

What would be the expected error handling behavior and can we standardize this? (i.e. restart component, log the error somewhere permanent for later analysis, notify someone)

huitseeker · 2021-09-13T16:24:21Z

@smnzhu I think we're on the same page, I wasn't clear about two interfaces:

one in the component, which is the error sink,
one outside of the component, which has to react to errors, and to which my comment applied.

synzhu · 2021-09-13T18:12:40Z

How would we test this? i.e. under what conditions would components throw an error?

What would be the expected error handling behavior and can we standardize this? (i.e. restart component, log the error somewhere permanent for later analysis, notify someone)

@gomisha Components will throw an error anytime they encounter a fatal / irrecoverable error.

The expected error handling behavior will be different for each component, and may also depend on the specific error that was thrown. The implementation of the error handling behavior should be left up to the parent component, and I don't think there is a standard behavior that would apply to all cases.

@huitseeker I see what you're saying. For now, I don't see a need to force the parent to have a way to react to errors or implement default processing, because if we implement a single OnError() method on the parent, we lose information about which specific child component the error came from. I think we should wait until this error handling interface is being used more broadly across the codebase, before identifying common patterns in the way it is used and extracting that into some common helper structs or functions.

In the future if components are standardized with a Run() / Start() interface as proposed here, I could imagine having a helper function

type ErrorHandler func(context.Context, <-chan error, func())
type ComponentFactory func() module.RunnableReadyDoneErrorAware

StartComponent(context.Context, ComponentFactory, ErrorHandler)

which could be used as follows:

ctx, cancel := context.WithCancel(context.Background())

componentFactory := func() module.RunnableReadyDoneErrorAware {
    return NewComponentX(arg1, arg2)
}

StartComponent(ctx, componentFactory, func(errCtx context.Context, errs <-chan error, restart func()) {
    select {
	case err := <-errs:
		// handle the error here. In this example, we restart the component
		restart()
	case <-errCtx.Done():
		return
    }
})

and implemented (roughly) as follows:

func StartComponent(ctx context.Context, componentFactory ComponentFactory, errorHandler ErrorHandler) {
	restartChan := make(chan struct{})

	start := func() (context.CancelFunc, <-chan struct{}) {
		// context used to restart the component
		runCtx, cancel := context.WithCancel(ctx)

                component := componentFactory()
		component.Run(runCtx)
		
		select {
		case <-ctx.Done():
			runtime.Goexit()
		case <-component.Ready():
		}

		go errorHandler(runCtx, component.Errors(), func() {
			restartChan <- struct{}{}
			runtime.Goexit()
		})

		return cancel, component.Done()
	}

	go func() {
		for{
			cancel, done := start()
			
			select {
			case <-ctx.Done():
				return
			case <-restartChan:
				cancel()
			}
			
			select {
			case <-ctx.Done():
				return
			case <-done:
			}
		}
	}()
}

However, I think it's too early for stuff like this rn.

For now, I added a new helper struct ErrorManager. Currently it is not being used in any useful way in consensus follower, but in general here is how you would use it when implementing a new component:

You create an instance of ErrorManager as a field in your component during instantiation.
When implementing the component, anywhere you would have previously logged fatal or panicked, instead call ErrorManager.ThrowError(). This will put the error into the error channel and aborts the current goroutine. Effectively, this mimicks the behavior of a panic or log.Fatal from the perspective of the goroutine that the error is thrown from.
You can then use ErrorManager.Errors() to implement the component's own Errors() interface.

One potential issue I currently see with this approach is that it may not behave in a desirable way if a fatal error is caused as a result of some external party calling a public method on the component (e.g componentX.DoStuff()). If the processing happens in the caller's thread, then ErrorManager.ThrowError() would abort the caller's goroutine, which is probably bad. However, I think this is an issue that would apply to any alternative approach to error handling as well. Ultimately, the solution will probably be to enforce that all communication between components happens via channels / queues, or at least ensure that any public methods a component exposes cannot cause a fatal / irrecoverable error in the component itself.

huitseeker

In general, thanks a lot for getting this started, there's a ton of great ideas in there, such as:

the separation between in and out-of component,
the use of a closure for start,
the IrrecoverableError signaling through a channel

I've left a few alternate suggestions in #1308, here are the main points:

the word Error is overly broad for situations encountered inside the component: I've tried to suggest Irrecoverable for errors that relinquish the control flow,
the signaling of the irrecoverable error does not have a lot of variability, it pretty much needs to be what's in your ThrowError, hence I'm not sure this should be an interface. Maybe I missed something: do you have suggestion for vastly different behaviors for the emitter-side of an irrecoverable error?
the receiving side of the irrecoverable error seems to me that it has a lot of variability. In particular, there may be a significant task of getting the system in a "known good state" before restart,
I would like RunComponent to be hierarchical. For instance, if the Start of a Startable component returns an error, there's no use doing a restart. Hence this is an irrecoverable error from the PoV of the RunComponent, hence I'd love to be able to send that error to an irrecoverable handler in an enclosing context.
To represent this hierarchy, I've tried to use the component we use to pass things around hierarchically: the Context
the most complex function in both our approaches is RunComponent, and it seems valuable to me to make it generic (so nobody else has to rewrite one), and, hopefully, small

Possibly superfluous: - type assertions to be able to use a function on Context instances that are actually IrrecoverableSignalerContext, Possibly missing: - do we need to demo more in the component. E.g. RunComponent may demo cleanup, etc?

[module] Suggestions for #1275

module/irrecoverable/irrecoverable.go

huitseeker

I get the following failure, has this merged master?

GO111MODULE=on go test -coverprofile=coverage.txt -covermode=atomic  --tags relic ./...
?   	github.com/onflow/flow-go/access	[no test files]
?   	github.com/onflow/flow-go/access/legacy	[no test files]
?   	github.com/onflow/flow-go/access/legacy/convert	[no test files]
<nil> INF admin server starting up admin=command_runner
<nil> INF process loop shutting down admin=command_runner
<nil> INF admin server shutting down admin=command_runner
panic: send on closed channel

goroutine 11 [running]:
github.com/onflow/flow-go/admin.(*adminServer).RunCommand(0xc00040ce58, {0x183af20, 0xc0000c85a0}, 0xc0001aeec0)
	/Users/huitseeker/tmp/flow-go/admin/server.go:32 +0x1d2
github.com/onflow/flow-go/admin/admin._Admin_RunCommand_Handler({0x169efa0, 0xc00040ce58}, {0x183af20, 0xc0000c85a0}, 0xc00049a8a0, 0x0)
	/Users/huitseeker/tmp/flow-go/admin/admin/admin_grpc.pb.go:77 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000436700, {0x1846e48, 0xc00017e180}, 0xc0007ac480, 0xc00049ee40, 0x1cd0bd0, 0x0)
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1297 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc000436700, {0x1846e48, 0xc00017e180}, 0xc0007ac480, 0x0)
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:1626 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:941 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/Users/huitseeker/golang/pkg/mod/google.golang.org/grpc@v1.40.0/server.go:939 +0x294
FAIL	github.com/onflow/flow-go/admin	0.557s

module/run_component_test.go

huitseeker

Awesome! Thanks for the tests!

module/run_component_test.go

peterargue

looks great!

synzhu · 2021-09-29T04:19:45Z

bors merge

bors · 2021-09-29T04:55:08Z

Build succeeded:

Define error aware interface

3f27b74

synzhu requested review from huitseeker, vishalchangrani, gomisha and peterargue September 10, 2021 02:39

huitseeker reviewed Sep 10, 2021

View reviewed changes

synzhu requested a review from huitseeker September 10, 2021 23:16

define new error base component

a2971d1

rename to error manager

dbeb249

Merge branch 'master' into smnzhu/error-handling

dcee55a

peterargue mentioned this pull request Sep 14, 2021

[Consensus Follower] Make fatal errors recoverable #1291

Closed

synzhu added 4 commits September 14, 2021 17:50

implement error handling helper func

a643f03

Update common.go

ef50fb6

Update common.go

73e8bdd

Merge branch 'master' into smnzhu/error-handling

1bccc1e

huitseeker reviewed Sep 17, 2021

View reviewed changes

huitseeker mentioned this pull request Sep 18, 2021

Admin interface for node operators #1222

Merged

3 tasks

huitseeker and others added 4 commits September 21, 2021 11:03

[module] Apply review comments

4c947a1

tweak handler trigger logic

e653e6f

rename error handler -> orrecoverable handler

7f91ef2

Merge pull request #1308 from huitseeker/smnzhu/error-handling

2e8bca3

[module] Suggestions for #1275

huitseeker mentioned this pull request Sep 21, 2021

[FLIP] Component API #1167

Merged

1 task

synzhu added 2 commits September 21, 2021 18:06

Renamings and refactor RunComponent

46baf6c

undo consensus follower changes

30c652a

synzhu added 4 commits September 24, 2021 11:35

add comments and update semantics of Start

50e92c7

handle edge cases

5930a6e

update example

b51ed23

added empty tests

0e7708f

peterargue reviewed Sep 24, 2021

View reviewed changes

module/irrecoverable/irrecoverable.go Outdated Show resolved Hide resolved

synzhu added 3 commits September 24, 2021 15:52

fix lint error

8ffda55

add run_component_test

6f0b657

Update irrecoverable_example_test.go

2de33dd

synzhu requested review from peterargue and huitseeker September 25, 2021 04:39

Merge branch 'master' into smnzhu/error-handling

d81f072

huitseeker reviewed Sep 27, 2021

View reviewed changes

module/run_component_test.go Outdated Show resolved Hide resolved

synzhu added 4 commits September 27, 2021 13:52

Update run_component_test.go

c0552e2

Merge branch 'master' into smnzhu/error-handling

96dd381

generate mocks

43f9d6a

update logger

d2ad181

synzhu requested a review from huitseeker September 27, 2021 23:14

add comment explaining recover

38dfa8a

peterargue reviewed Sep 28, 2021

View reviewed changes

module/run_component_test.go Show resolved Hide resolved

added new synchronous startup error test

0bcea39

synzhu requested a review from peterargue September 28, 2021 03:33

huitseeker approved these changes Sep 28, 2021

View reviewed changes

module/run_component_test.go Show resolved Hide resolved

peterargue approved these changes Sep 28, 2021

View reviewed changes

bors bot merged commit c40cfbc into master Sep 29, 2021

bors bot deleted the smnzhu/error-handling branch September 29, 2021 04:55

huitseeker mentioned this pull request Oct 4, 2021

[Network] Middleware component #1355

Merged

3 tasks

peterargue mentioned this pull request Nov 2, 2021

Use ComponentManager in NodeBuilder #1559

Merged

synzhu mentioned this pull request May 12, 2022

[Access] Migrate ingestion engine to new Component interface #2411

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Define error aware interface #1275

Define error aware interface #1275

synzhu commented Sep 10, 2021 •

edited

codecov-commenter commented Sep 10, 2021 •

edited

huitseeker left a comment •

edited

synzhu commented Sep 10, 2021 •

edited

gomisha commented Sep 13, 2021

huitseeker commented Sep 13, 2021

synzhu commented Sep 13, 2021 •

edited

huitseeker left a comment

huitseeker left a comment

huitseeker left a comment

peterargue left a comment

synzhu commented Sep 29, 2021

bors bot commented Sep 29, 2021

Define error aware interface #1275

Define error aware interface #1275

Conversation

synzhu commented Sep 10, 2021 • edited

TODO

codecov-commenter commented Sep 10, 2021 • edited

Codecov Report

huitseeker left a comment • edited

Choose a reason for hiding this comment

synzhu commented Sep 10, 2021 • edited

gomisha commented Sep 13, 2021

huitseeker commented Sep 13, 2021

synzhu commented Sep 13, 2021 • edited

huitseeker left a comment

Choose a reason for hiding this comment

huitseeker left a comment

Choose a reason for hiding this comment

huitseeker left a comment

Choose a reason for hiding this comment

peterargue left a comment

Choose a reason for hiding this comment

synzhu commented Sep 29, 2021

bors bot commented Sep 29, 2021

synzhu commented Sep 10, 2021 •

edited

codecov-commenter commented Sep 10, 2021 •

edited

huitseeker left a comment •

edited

synzhu commented Sep 10, 2021 •

edited

synzhu commented Sep 13, 2021 •

edited