[Consensus] Hotfix: Fix synchornization handling response #1001
I'm OK with merging this hotfix, but in general I think we should keep the error return, capture these expected error cases explicitly, and wrap the error with type information so that the higher-level layer can handle it appropriately rather than crash.
Codecov Report
@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
- Coverage   53.38%   53.28%   -0.10%
==========================================
  Files         318      318
  Lines       21487    21510      +23
==========================================
- Hits        11471    11462       -9
- Misses       8450     8483      +33
+ Partials     1566     1565       -1
final := e.finalSnapshot().head
e.core.HandleHeight(final, res.Height)
return nil
What's the logic for not inspecting errors and at least logging them here?
I removed this line because `HandleHeight` doesn't return an error, so there is no case that would produce a non-nil error. I just removed the handling for this non-existent case.
@@ -610,7 +603,6 @@ func (e *Engine) onBlockResponse(originID flow.Identifier, res *messages.BlockRe
}
e.comp.SubmitLocal(synced)
}
return nil
What's the logic for not inspecting errors and at least logging them here?
Same here. No error is ever returned.
Echoing Jordan's comment, I think just logging errors and continuing on a best-effort basis is fundamentally incompatible with working towards a BFT implementation:
flow-go/engine/common/synchronization/engine.go
Lines 431 to 433 in 214746d
if err != nil {
	engine.LogError(e.log, err)
}
(please see TPM from February 11th for reasoning)
I am OK with this as a hotfix, but I think this is not a viable approach for master. I would like to propose the revisions in #1008 (PR is targeting this branch for now).
…error-handling-in-synchronization # Conflicts: # engine/common/synchronization/engine.go
Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>
…ronization proposal for handling networking errors in `synchronization.Engine`
Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
	lg.Error().Msg("received message with incompatible type")
@AlexHentschel Do we actually want to return after this line? If I understand correctly, this `if` statement captures the case where we get a message with an incompatible input type from the network, right? In that case, we probably shouldn't crash? Without the `return` after this line we will proceed to the `lg.Fatal` which comes after.
Good question.
I think throwing fatal errors here is intended, as explained here and here
added new sentinel error engine.IncompatibleInputTypeError, which engines can raise, if they receive an input that does not have any of the expected types (previously, engines just errored with an unspecific error)
this allows us to reject inputs from the networking layer with incompatible types
In contrast, an engine should generally not receive an incompatible input from a trusted internal component within the node. This would likely be an implementation bug and should crash the node
an engine should generally not receive an incompatible input from a trusted internal component within the node. This would likely be an implementation bug and should crash the node
I believe this second case is reflected in the code here. However, my concern about the code above is that it seems to represent the first case:
this allows us to reject inputs from the networking layer with incompatible types
The code above is in `Submit` and not `SubmitLocal`, which means it is processing a message which came from the network, right? So if we receive an invalid input from the network layer, shouldn't we just log the error, reject the input, and continue? We only want to throw a fatal error if the invalid input is from an internal component.
Thanks, good catch. Yes, this input comes from an external source, which we should expect to feed us invalid inputs. I intended to return here (hence logging the error), but I messed this up.
Cool. Thanks for the catch @smnzhu
Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
	lg.Error().Msg("received message with incompatible type")
Same here?
Yes. Thanks
Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
	lg.Error().Msg("received message with incompatible type")
Same here.
👍
@smnzhu thanks for noticing the missing error returns for external inputs of incompatible type. You are right, these should not crash the node. @zhangchiqing I took the liberty to add the three missing returns and commit (74f6395) directly to your branch, because I screwed this up originally in my PR. Hope that was ok 😅 (?)
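The distinction the thread converged on (reject bad input from the network, crash on bad input from internal components) can be sketched as below. This is an assumption-laden illustration, not the committed code: the `engine` type, its `logged` and `fatal` slices (standing in for a logger's error and fatal levels), and `errIncompatibleInputType` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errIncompatibleInputType is a local stand-in (assumption) for the sentinel.
var errIncompatibleInputType = errors.New("incompatible input type")

// engine records log lines instead of writing them, so the sketch can run
// without exiting. A real node would terminate on a fatal log entry.
type engine struct {
	logged []string // error-level entries
	fatal  []string // fatal-level entries
}

// process accepts only string events in this toy example.
func (e *engine) process(event interface{}) error {
	if _, ok := event.(string); !ok {
		return fmt.Errorf("unexpected %T: %w", event, errIncompatibleInputType)
	}
	return nil
}

// submit handles input from the network, which is untrusted.
func (e *engine) submit(event interface{}) {
	err := e.process(event)
	if err == nil {
		return
	}
	if errors.Is(err, errIncompatibleInputType) {
		e.logged = append(e.logged, err.Error())
		return // the return that was missing: reject the input, do not crash
	}
	e.fatal = append(e.fatal, err.Error())
}

// submitLocal handles input from trusted internal components, where an
// incompatible type most likely indicates an implementation bug.
func (e *engine) submitLocal(event interface{}) {
	if err := e.process(event); err != nil {
		e.fatal = append(e.fatal, err.Error())
	}
}

func main() {
	e := &engine{}
	e.submit(42)      // logged, node keeps running
	e.submitLocal(42) // escalated as fatal
	fmt.Println(len(e.logged), len(e.fatal))
}
```

Without the early `return` in `submit`, execution would fall through to the fatal branch, which is the behavior the review comments flagged.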
🎸
This PR fixes a bug where the node crashes if there is a network connectivity issue while sending the sync request. The fix simply logs the error.