Fix bugs in trie hash computation #7337
Conversation
Very good PR.
Thank you, yes, I was just investigating the failures. Looks like my use of a 'real' mdbx may be the problem; will try to rebase with the memdb variant. Edit: I see, it also looks like something's wrong in the rpcdaemon test -- I'd missed that one running locally, thanks, will fix.
Only 1 rwtx can exist at a time, and we spend it on new block execution.
I've fixed the memdb vs. mdbx issues for the failing tests. But it looks like I may have introduced a new 'real' bug with these fixes. Marking as draft for the moment, as I think it's going to take at least a few more hours of investigation and extending the tests to cover it (so I will tackle that in the morning).
There are currently a number of bugs in the flat DB trie hash computation. These bugs are improbable in 'real' scenarios (hence the ability to sync the assorted large chains), and the repercussions of encountering them are low (generally re-computing the hash at a later block will succeed). Still, they can cause the process to crash or deadlock, and a clever adversary could feasibly construct a block or storage update designed to trigger these bugs. (Or an unlucky block may trigger them inadvertently.) Based on the tracing in the code, it seems that some of these bugs may have been witnessed in the wild but not reliably reproduced (for instance when `maxlen >= len(curr)`).

1. There is an infinite loop that can occur in `_nextSiblingOfParentInMem`. This occurs in the account path when the `c.k[1]` entry is nil and the code continuously retries resolving this entry indefinitely.
2. When the next trie node is deeper in the trie than the previous, but is not a descendant of the previous trie node, any higher trie nodes which are not overwritten are inadvertently left in memory (when they should be nil-ed), resulting in an incorrect hash or an index-out-of-bounds panic.
3. When the last trie node being processed is comprised entirely of 'f' nibbles, `FirstNotCoveredPrefix` returns an empty byte array because there is no next nibble sub-tree. This causes keys to be inappropriately reprocessed, triggering either an index-out-of-bounds panic or incorrect hash results.
4. When the `_nextSiblingInDB` path is triggered, if the next nibble subtree contains no trie table entries, then any keys with a prefix in that subtree will be skipped, resulting in incorrect hash results.

The fuzzing seeds included cover all four of these cases (for both accounts and storage where appropriate). As fuzzing is baked into the native Go 1.18 toolchain, running `go test` (as the Makefile currently does) is adequate to cover these failures. To run additional fuzzing, the `-fuzz` flag may be provided as documented in the `go test` documentation.
Force-pushed from a4c1f83 to 943e35a
I've fixed up the failures. They stemmed from two issues:
- I was creating a new mdbx instance and cleaning it up in each iteration of the fuzzing. This was problematic in CI and the memdb is a better choice in this case anyway.
- When I was cleaning up this commit for a PR, I inadvertently copied some of the account trie cursor code into the storage cursor. This change did not break the existing tests, so I've extended the new tests to cover this case as well.
I'm also including comments on the changes to hopefully make it easier to review / understand these changes.
Once this patch is (hopefully) accepted, I'll plan to submit some additional tests around the proof generation. In trying to construct those tests I kept running into this set of bugs, so I have deliberately detached these fixes from the proof related tests to allow for simpler review.
if _, err := c._seek(c.next, []byte{}); err != nil {
	return err
}
if c.k[c.lvl] == nil || !bytes.HasPrefix(c.next, c.k[c.lvl]) {
Please pay extra attention to this particular check in review. In general, setting `SkipState` to `false` too aggressively will result in lower performance, but not incorrect results; I'd like to make sure that this check is correct.
if err1 != nil {
	return EmptyRoot, err1
}
if keyIsBefore(ihK, kHex) || !bytes.HasPrefix(kHex, nil) { // read all accounts until next AccTrie
The check for `!bytes.HasPrefix(kHex, nil)` will always return false, so it was removed.
if accTrie.SkipState {
	goto SkipAccounts
}

for k, kHex, v, err1 := accs.Seek(accTrie.FirstNotCoveredPrefix()); k != nil; k, kHex, v, err1 = accs.Next() {
In the event that `FirstNotCoveredPrefix` returns `0x0` because the last trie entry is of the form `0xf..f0..0`, this loop inappropriately re-processes account keys.
@@ -667,7 +679,7 @@ func (r *RootHashAggregator) saveValueAccount(isIH, hasTree bool, v *accounts.Ac
// has 2 basic operations: _preOrderTraversalStep and _preOrderTraversalStepNoInDepth
type AccTrieCursor struct {
	SkipState bool
	is, lvl int
This `is` field is incremented, but never used/read, so it was removed.
	return true
}

c.lvl = nonNilLvl + 1
In the event that `c.k[2]` is nil, and the seek goes outside of the prefix derived from `c.k[1]` (in `kBuf`), then `c.lvl` is set to `2` and the loop starts again. Because `c.k[2]` is still nil, the whole process repeats infinitely.
@@ -889,6 +902,18 @@ func (c *AccTrieCursor) _unmarshal(k, v []byte) {
if c.lvl >= len(k) {
	from, to = len(k)+1, c.lvl+2
}

// Consider a trie DB with keys like: [0xa, 0xbb], then unmarshaling 0xbb
Hopefully the comment explains it pretty well, but: when trie node `0xa` is processed, it is assigned to `c.k[0]`; then, when the next trie node `0xbb` is processed, because its length `2` is longer than `0xa`'s length `1`, `from` is set to `2`, and the entry for `0xa` in `c.k[1]` is not cleared. Thus, when looking for the sibling of a parent, the code incorrectly traverses back to this already-processed node.
FYI: I did run
Tested by `integration state_stages` on mainnet. Will do one more round of review and merge after release.
This PR extends the fuzzing merged in #7337 to cover the proof generation paths as well.

The first commit simply migrates the internal proof checking code from the rpc `eth_getProof` test to be externally exported under `turbo/trie`. In the process, it translates the in-place testing assertions to return more normally as errors, and adapts to use the pre-existing trie node types, which are defined slightly differently (although this code is still intended only for testing purposes).

The second commit refactors the existing trie fuzzing tests from the previous PR and adds new fuzzing tests for the proof generation. In order to validate the proofs, these fuzzing tests fundamentally do two things. Firstly, after bootstrapping the test (by seeding and modifying the db), for each key we compute the 'naive' proof utilizing the existing code in `proof.go` and the in-memory `trie.Trie` structure. The `trie.Trie` code actually had a couple of small bugs which are fixed in this PR (not handling value nodes, and not honoring the `NewTestRLPTrie` contract of pre-RLP-encoded nodes in proof generation). Secondly, we re-compute the same proof for the flat-DB production variant and verify that it is exactly the same proof as computed by the naive implementation.

This fuzzing has been run for ~72 hours locally with no errors. Although this code hasn't identified any new bugs in the proof generation path, it improves coverage and should help to prevent regressions. Additional extensions will be provided in a subsequent PR.

Co-authored-by: Jason Yellick <jason@enya.ai>
* fix build issue
* simplify the unwind
* Add additional trie proof testing (ledgerwatch#7382): extends the trie fuzzing merged in ledgerwatch#7337 to cover the proof generation paths, exporting the proof-checking code under `turbo/trie` and differentially comparing the naive `trie.Trie` proofs against the flat-DB variant (full description above)
* e3: drain when no new tasks, causing deadlock (ledgerwatch#7395)
* Enable negative Merkle proofs for eth_getProof (ledgerwatch#7393): addresses the last known deficiency of the eth_getProof implementation. The previous code returned an error when the element was not found in the trie; EIP-1186 allows 'negative' proofs demonstrating that an element cannot be in the trie, so the logic now supports that case
* e3: don't fetch code hash in unwind (ledgerwatch#7416)
* Caplin: fixed sentinel deadlock (ledgerwatch#7419)
* fix block rlp logging bug (ledgerwatch#7441): a logging statement was logging the outer rpc proto instead of the inner rlp of the block as intended
* fix: erigon_getLatestLogs (ledgerwatch#7450): fixes a simple indexing bug in `erigon_receipts.go` that crashed the method handler ("method handler crashed") when calling erigon_getLatestLogs
* LeakDetector: use it to find which resource was created but not closed (leaked) (ledgerwatch#7457)
* reduce default --db.size.limit from 7 to 3 Tb, to fit the defaults of some systems (ledgerwatch#7478)
* Enode logging broke when NAT parameter set in 2.43.0 (ledgerwatch#7480), for ledgerwatch#7472
* erigon init: created db with wrong pageSize (ledgerwatch#7482)
* p2p: fix peer ID serialization (ledgerwatch#7495): the peer ID in sentry.proto is an H512 / 64-byte value, and MarshalPubkey already drops the leading byte of the public key; cutting it again produced a 63-byte value incompatible with silkworm sentry
* Allow ephemeral ports for p2p (ledgerwatch#7503): when the configured p2p port is '0', the probing check is disabled so Erigon can bind to an ephemeral port, which is useful for integration tests where the particular port does not matter
* remove "db" log line (ledgerwatch#7509)
* e2: self-heal after accidental blocks delete: download blocks eventually, then work as usual (ledgerwatch#7516)
* e2: ReadAhead of blocks, senders, accounts, code (ledgerwatch#7501): improves initial-sync (stage exec) performance by 5-20% when block snapshots or chaindata are on a high-latency drive, and improves cold-start performance. The current implementation uses 2 goroutines for ReadAhead; it also produces more garbage, which can be improved later
* Fix trace_filter regression due to gasBailout argument (ledgerwatch#7539) (ledgerwatch#7540)
* Update t8n_test.go (ledgerwatch#7550)
* e2: reduce StageSyncStep interface size (ledgerwatch#7561)
* e2: avoid doing RestoreCodeHash twice, do it only once in HistoryStateReader (ledgerwatch#7706)
* StagedSync: fix canRunCycleInOneTransaction logic (ledgerwatch#7712)
* Better version of libp2p with the losing-peers bug fixed (ledgerwatch#7726)
* fix merge error

Co-authored-by: Jason Yellick <7431583+jyellick@users.noreply.github.com>
Co-authored-by: Jason Yellick <jason@enya.ai>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Co-authored-by: Giulio rebuffo <giulio.rebuffo@gmail.com>
Co-authored-by: Roberto Bayardo <bayardo@alum.mit.edu>
Co-authored-by: Joe Netti <joe@netti.dev>
Co-authored-by: battlmonstr <battlmonstr@users.noreply.github.com>
Co-authored-by: ledgerwatch <akhounov@gmail.com>
Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro-2.local>