[Consensus] Hotfix: Fix synchornization handling response #1001

Merged
merged 14 commits into master from leo/fix-synchornization-handle-response
Jul 27, 2021

Conversation

zhangchiqing
Member

@zhangchiqing zhangchiqing commented Jul 20, 2021

This PR fixes a bug where the node crashes if a network connectivity issue occurs while sending a sync request. The fix simply logs the error instead.

Member

@jordanschalm jordanschalm left a comment

I'm OK to merge this hotfix, but in general I think we should keep the error return, capture these expected error cases explicitly, and wrap the error with type information so that the higher-level layer can handle it appropriately rather than crash.

@zhangchiqing zhangchiqing force-pushed the leo/fix-synchornization-handle-response branch from 113bf1b to 214746d on July 20, 2021 18:51

codecov-commenter commented Jul 20, 2021

Codecov Report

Merging #1001 (586f0f1) into master (208bb5e) will decrease coverage by 0.09%.
The diff coverage is 8.10%.

@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
- Coverage   53.38%   53.28%   -0.10%     
==========================================
  Files         318      318              
  Lines       21487    21510      +23     
==========================================
- Hits        11471    11462       -9     
- Misses       8450     8483      +33     
+ Partials     1566     1565       -1     
Flag       Coverage Δ
unittests  53.28% <8.10%> (-0.10%) ⬇️

Impacted Files                            Coverage Δ
engine/consensus/ingestion/engine.go      47.09% <0.00%> (-6.19%) ⬇️
engine/consensus/sealing/engine.go        49.00% <0.00%> (-1.78%) ⬇️
engine/errors.go                          15.78% <ø> (ø)
engine/common/synchronization/engine.go   64.21% <12.50%> (-1.07%) ⬇️
engine/enqueue.go                         66.66% <25.00%> (-9.34%) ⬇️
module/mempool/epochs/transactions.go     94.73% <0.00%> (-5.27%) ⬇️
cmd/util/ledger/migrations/storage_v4.go  41.56% <0.00%> (-0.61%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

final := e.finalSnapshot().head
e.core.HandleHeight(final, res.Height)
return nil
Contributor

What's the logic for not inspecting errors and at least logging them here?

Member Author

I removed this line because HandleHeight doesn't return an error, so there is no case that would produce a non-nil error. I just removed this non-existent case.

@@ -610,7 +603,6 @@ func (e *Engine) onBlockResponse(originID flow.Identifier, res *messages.BlockRe
}
e.comp.SubmitLocal(synced)
}
return nil
Contributor

What's the logic for not inspecting errors and at least logging them here?

Member Author

Same here. No error is ever returned.

Member

@AlexHentschel AlexHentschel left a comment

Echoing Jordan's comment, I think just logging errors and continuing on a best-effort basis is fundamentally incompatible with working towards a BFT implementation:

if err != nil {
	engine.LogError(e.log, err)
}

(please see TPM from February 11th for reasoning)

I am ok with this as a hot fix, but I think this is not a viable approach for master. I would like to propose the revisions in #1008 (PR is targeting this branch for now).

…error-handling-in-synchronization

# Conflicts:
#	engine/common/synchronization/engine.go
Alexander Hentschel and others added 6 commits July 26, 2021 10:47
Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
lg.Error().Msg("received message with incompatible type")
Contributor

@synzhu synzhu Jul 26, 2021

@AlexHentschel Do we actually want to return after this line? If I understand correctly, this if statement captures the case where we get a message with incompatible input type from the network, right? In that case, we probably shouldn't crash? Without the return after this line we will proceed to the lg.Fatal which comes after.

Member Author

Good question.

I think throwing fatal errors here is intended, as explained here and here

added new sentinel error engine.IncompatibleInputTypeError, which engines can raise, if they receive an input that does not have any of the expected types (previously, engines just errored with an unspecific error)
this allows us to reject inputs from the networking layer with incompatible types
In contrast, an engine should generally not receive an incompatible input from a trusted internal component within the node. This would likely be an implementation bug and should crash the node

Contributor

@synzhu synzhu Jul 27, 2021

an engine should generally not receive an incompatible input from a trusted internal component within the node. This would likely be an implementation bug and should crash the node

I believe this second case is reflected in the code here. However, my concern about the code above is that it seems to represent the first case:

this allows us to reject inputs from the networking layer with incompatible types

The code above is in Submit and not SubmitLocal, which means it is processing a message which came from the network, right? So if we receive an invalid input from the network layer, shouldn't we just log the error, reject the input, and continue? We only want to throw fatal error if the invalid input is from an internal component.

Member

Thanks, good catch. Yes, this input would come from an external source, which we should expect to feed us invalid inputs. I intended to return here (hence logging the error), but I messed this up.
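
The distinction discussed in this thread can be sketched as follows. This is a simplified, self-contained illustration and not the actual engine code: errIncompatibleInputType stands in for engine.IncompatibleInputTypeError, process, submit, and submitLocal are hypothetical helpers, and the "fatal" outcome is returned as a string rather than calling a fatal logger so that the example can run to completion.

```go
package main

import (
	"errors"
	"fmt"
)

// errIncompatibleInputType is a stand-in (assumption) for the sentinel
// engine.IncompatibleInputTypeError mentioned in the review.
var errIncompatibleInputType = errors.New("incompatible input type")

// Possible outcomes of handling an input in this sketch.
const (
	accepted = "accepted"
	rejected = "rejected"
	fatal    = "fatal"
)

// process accepts only string events here; any other type yields the sentinel.
func process(event interface{}) error {
	if _, ok := event.(string); !ok {
		return fmt.Errorf("unexpected input %T: %w", event, errIncompatibleInputType)
	}
	return nil
}

// submit models handling an input that arrived from the network.
func submit(event interface{}) string {
	err := process(event)
	if errors.Is(err, errIncompatibleInputType) {
		fmt.Println("received message with incompatible type")
		// Returning here is the step the review found missing: without it,
		// execution falls through to the fatal branch below.
		return rejected
	}
	if err != nil {
		return fatal
	}
	return accepted
}

// submitLocal models handling an input from a trusted internal component,
// where an incompatible type likely indicates an implementation bug.
func submitLocal(event interface{}) string {
	if err := process(event); err != nil {
		return fatal
	}
	return accepted
}

func main() {
	fmt.Println(submit("block response"))
	fmt.Println(submit(42))
	fmt.Println(submitLocal(42))
}
```

The same malformed input is rejected when it comes from the network but treated as fatal when it comes from inside the node, which matches the intent described in the comments above.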

Member Author

Cool. Thanks for the catch @smnzhu

Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
lg.Error().Msg("received message with incompatible type")
Contributor

Same here?

Member

Yes. Thanks

Str("origin", originID.String()).
Logger()
if errors.Is(err, engine.IncompatibleInputTypeError) {
lg.Error().Msg("received message with incompatible type")
Contributor

Same here.

Member

👍

Member

AlexHentschel commented Jul 27, 2021

@smnzhu thanks for noticing the missing error returns for external inputs of incompatible type. You are right, these should not crash the node.

@zhangchiqing I took the liberty to add the three missing returns and commit (74f6395) directly to your branch, because I screwed this up originally in my PR. Hope that was ok 😅 (?)

Member

@AlexHentschel AlexHentschel left a comment

🎸

@zhangchiqing zhangchiqing merged commit c3d239e into master Jul 27, 2021
@zhangchiqing zhangchiqing deleted the leo/fix-synchornization-handle-response branch July 27, 2021 02:38