
Fix bugs in trie hash computation #7337

Merged

merged 1 commit into ledgerwatch:devel from fix-trie-root-bugs on Apr 25, 2023

Conversation

@jyellick (Contributor)

There are currently a number of bugs in the flat DB trie hash computation. These bugs are improbable in 'real' scenarios (hence the ability to sync the assorted large chains), and the repercussions of encountering them are low (generally, re-computing the hash at a later block will succeed). Still, they can cause the process to crash or deadlock, and a clever adversary could feasibly construct a block or storage update designed to trigger these bugs. (Or an unlucky block may trigger them inadvertently.) Based on the tracing in the code, it seems that some of these bugs may have been witnessed in the wild but not reliably reproduced (for instance when `maxlen >= len(curr)`).

  1. There is an infinite loop that can occur in _nextSiblingOfParentInMem. This occurs in the account path when the c.k[1] entry is nil; the code retries resolving this entry indefinitely.

  2. When the next trie node is deeper in the trie than the previous one but is not a descendant of it, any higher trie nodes which are not overwritten are inadvertently left in memory (when they should be nil-ed), resulting in an incorrect hash or an index-out-of-bounds panic.

  3. When the last trie node being processed is composed entirely of 'f' nibbles, FirstNotCoveredPrefix returns an empty byte array because there is no next nibble sub-tree. This causes keys to be inappropriately reprocessed, triggering either an index-out-of-bounds panic or incorrect hash results.

  4. When the _nextSiblingInDB path is triggered and the next nibble sub-tree contains no trie table entries, any keys with a prefix in that sub-tree are skipped, producing incorrect hash results.

The fuzzing seeds included cover all four of these cases (for both accounts and storage, where appropriate). As fuzzing is baked into the native Go 1.18 toolchain, running `go test` (as the Makefile currently does) is adequate to cover these failures. To run additional fuzzing, the `-fuzz` flag may be provided as documented in the `go test` documentation.
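For reference, a typical invocation looks like the following; the fuzz target name here is illustrative, the real names live alongside the trie tests:

```sh
# run only the checked-in seed corpus (what `go test` / the Makefile already do)
go test ./turbo/trie/...

# keep fuzzing a single target beyond the seed corpus (target name is illustrative)
go test ./turbo/trie/ -run=NONE -fuzz=FuzzTrieRoot -fuzztime=1m
```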

@AskAlexSharov (Collaborator)

Very good PR.
some tests failed: https://github.com/ledgerwatch/erigon/actions/runs/4738824219/jobs/8413072512?pr=7337#step:8:91
FYI: you can run the tests locally with `make test-integration`

@jyellick (Contributor, Author) commented Apr 19, 2023

Thank you, yes, I was just investigating the failures. It looks like my use of a 'real' mdbx instance may be the problem; I will try to rebase with the memdb variant.

Edit: I see, it also looks like something's wrong in the rpcdaemon test. I'd missed that one when running locally; thanks, will fix.

@AskAlexSharov (Collaborator)

Only one rwtx can exist at a time, and we spend it on new block execution.
So forkValidator/miner need to execute txs and calculate the state root on memdb+rotx.
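A rough sketch of that shape, with all types illustrative rather than erigon-lib's actual API: speculative writes land in an in-memory layer stacked on a read-only snapshot, leaving the single rwtx free.

```go
// Illustrative only (not erigon-lib's actual API). forkValidator/miner
// write into an in-memory layer stacked on a read-only view of the db,
// so the single rwtx stays available for canonical block execution.
type kvGetter interface {
	Get(key []byte) ([]byte, bool)
}

type memOverlay struct {
	mem  map[string][]byte // uncommitted writes (the memdb role)
	base kvGetter          // read-only db snapshot (the rotx role)
}

func (o *memOverlay) Get(key []byte) ([]byte, bool) {
	if v, ok := o.mem[string(key)]; ok {
		return v, true
	}
	return o.base.Get(key)
}
```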

@jyellick marked this pull request as draft April 19, 2023 04:23
@jyellick (Contributor, Author)

I've fixed the memdb vs. mdbx issues for the failing tests. But it looks like I may have introduced a new 'real' bug with these fixes. Marking as draft for the moment, as I think it's going to take at least a few more hours of investigation, and extending the tests to cover it (so I will tackle that in the morning).

@jyellick (Contributor, Author) left a comment:

I've fixed up the failures. They stemmed from two issues:

  1. I was creating a new mdbx instance and cleaning it up in each iteration of the fuzzing. This was problematic in CI and the memdb is a better choice in this case anyway.
  2. When I was cleaning up this commit for a PR, I inadvertently copied some of the account trie cursor code into the storage cursor. This change did not break the existing tests, so I've extended the new tests to cover this case as well.

I'm also including inline comments to hopefully make these changes easier to review and understand.

Once this patch is (hopefully) accepted, I'll plan to submit some additional tests around proof generation. In trying to construct those tests I kept running into this set of bugs, so I have deliberately detached these fixes from the proof-related tests to allow for simpler review.

```go
if _, err := c._seek(c.next, []byte{}); err != nil {
	return err
}
if c.k[c.lvl] == nil || !bytes.HasPrefix(c.next, c.k[c.lvl]) {
```
@jyellick (Contributor, Author):
Please pay extra attention to this particular check in review. In general, setting SkipState to false too aggressively results in lower performance, but not incorrect results; I'd like to make sure that this check is correct.

```go
if err1 != nil {
	return EmptyRoot, err1
}
if keyIsBefore(ihK, kHex) || !bytes.HasPrefix(kHex, nil) { // read all accounts until next AccTrie
```
@jyellick (Contributor, Author) commented Apr 19, 2023:
The check `!bytes.HasPrefix(kHex, nil)` will always return false, so it has been removed.
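A quick demonstration of the standard-library semantics (a nil prefix has length zero, and every slice starts with the empty prefix):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// The empty (nil) prefix matches every slice, including nil itself...
	fmt.Println(bytes.HasPrefix([]byte{0x0b}, nil)) // true
	fmt.Println(bytes.HasPrefix(nil, nil))          // true
	// ...so !bytes.HasPrefix(kHex, nil) is always false, and
	// keyIsBefore(ihK, kHex) alone decides the branch.
}
```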

```go
if accTrie.SkipState {
	goto SkipAccounts
}

for k, kHex, v, err1 := accs.Seek(accTrie.FirstNotCoveredPrefix()); k != nil; k, kHex, v, err1 = accs.Next() {
```
@jyellick (Contributor, Author):
In the event that FirstNotCoveredPrefix returns 0x0 because the last trie entry is of the form 0xf..f0..0, this loop inappropriately re-processes account keys.
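A sketch of the underlying overflow case; this is illustrative, not the actual FirstNotCoveredPrefix implementation:

```go
// nextSubtree is an illustrative sketch: the first prefix *after* a
// covered nibble range is its upper bound incremented with carry.
// When every nibble is 0xf there is no next sub-tree, and the caller
// must treat the iteration as exhausted rather than receiving an
// empty prefix that restarts the scan from the first keys.
func nextSubtree(nibbles []byte) ([]byte, bool) {
	out := append([]byte{}, nibbles...)
	for i := len(out) - 1; i >= 0; i-- {
		if out[i] < 0xf {
			out[i]++
			return out[:i+1], true
		}
	}
	return nil, false // all 0xf: no successor sub-tree exists
}
```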

```
@@ -667,7 +679,7 @@ func (r *RootHashAggregator) saveValueAccount(isIH, hasTree bool, v *accounts.Ac
// has 2 basic operations: _preOrderTraversalStep and _preOrderTraversalStepNoInDepth
type AccTrieCursor struct {
	SkipState bool
	is, lvl int
```
@jyellick (Contributor, Author):
This field is incremented but never used/read, so it has been removed.

```go
	return true
}

c.lvl = nonNilLvl + 1
```
@jyellick (Contributor, Author):
In the event that c.k[2] is nil and the seek goes outside of the prefix derived from c.k[1] (in kBuf), c.lvl is set to 2 and the loop starts again. Because c.k[2] is nil, the whole process repeats infinitely.
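A bounded toy model of that livelock shape (hypothetical and heavily simplified): once the walk lands back on the nil level, every pass re-derives exactly the same state.

```go
package main

import "fmt"

func main() {
	// Hypothetical model: c.k indexed by level, k[1] holds a key,
	// k[2] is nil (the situation described above).
	k := [][]byte{nil, {0xa}, nil, nil}
	lvl := 3
	for iter := 0; iter < 3; iter++ { // bounded here; unbounded in the bug
		nonNilLvl := lvl - 1
		for nonNilLvl > 0 && k[nonNilLvl] == nil {
			nonNilLvl-- // walk up past nil levels, stopping at k[1]
		}
		// ...seek with the prefix derived from k[nonNilLvl] (kBuf),
		// land outside that prefix, and retry:
		lvl = nonNilLvl + 1 // lvl == 2 again, and k[2] is still nil
		fmt.Println("retrying with lvl =", lvl)
	}
}
```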

```
@@ -889,6 +902,18 @@ func (c *AccTrieCursor) _unmarshal(k, v []byte) {
if c.lvl >= len(k) {
	from, to = len(k)+1, c.lvl+2
}

// Consider a trie DB with keys like: [0xa, 0xbb], then unmarshaling 0xbb
```
@jyellick (Contributor, Author):
Hopefully this comment explains it pretty well, but: when trie node 0xa is processed, it is assigned to c.k[1]; then, when the next trie node 0xbb is processed, because its length of 2 is greater than 0xa's length of 1, from is set to 2 and the entry for 0xa in c.k[1] is not cleared. Thus, when looking for the sibling of a parent, the code incorrectly traverses back to this already-processed node.
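A toy illustration of the stale entry, treating c.k as indexed by key length in nibbles (hypothetical and simplified):

```go
package main

import "fmt"

func main() {
	k := make([][]byte, 8)
	k[1] = []byte{0xa} // first trie node: one nibble, stored at level 1

	// The next node 0xbb is deeper but NOT a descendant of 0xa. With
	// from set to 2, only levels >= 2 are rewritten or cleared, so the
	// already-consumed 0xa stays visible at level 1:
	k[2] = []byte{0xb, 0xb}
	for lvl := 3; lvl < len(k); lvl++ {
		k[lvl] = nil
	}

	fmt.Printf("k[1] = %#x (stale: a parent-sibling walk revisits it)\n", k[1])
}
```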

@jyellick marked this pull request as ready for review April 19, 2023 16:47
@AskAlexSharov (Collaborator)

FYI: I did run `make test-integration` on CI: https://github.com/ledgerwatch/erigon/actions/runs/4761261203

@AskAlexSharov (Collaborator)

Tested by integration state_stages on mainnet. Will do one more round of review and merge after the release.

@AskAlexSharov merged commit 23d5c7c into ledgerwatch:devel Apr 25, 2023
12 checks passed
@jyellick deleted the fix-trie-root-bugs branch April 25, 2023 15:56
AskAlexSharov pushed a commit that referenced this pull request Apr 26, 2023
This PR extends the function merged in #7337 to cover the proof
generation paths as well.

The first commit simply migrates the internal proof-checking code from the rpc `eth_getProof` test so that it is exported under `turbo/trie`. In the process, it translates the in-place test assertions into normally returned errors and adapts the code to use the pre-existing trie node types, which are defined slightly differently (although this code is still intended only for testing purposes).

The second commit refactors the existing trie fuzzing tests from the previous PR and adds new fuzzing tests for the proof generation. In order to validate the proofs, these fuzzing tests fundamentally do two things. Firstly, after bootstrapping the test (by seeding and modifying the db), for each key we compute the 'naive' proof utilizing the existing code in `proof.go` and the in-memory `trie.Trie` structure. The `trie.Trie` code actually had a couple of small bugs which are fixed in this PR (not handling value nodes, and not honoring the `NewTestRLPTrie` contract of pre-RLP-encoded nodes in proof generation). Secondly, we re-compute the same proof for the flatDB production variant, and verify that it is exactly the same proof as computed by the naive implementation.

This fuzzing has been run for ~72 hours locally with no errors. Although
this code hasn't identified any new bugs in the proof generation path,
it improves coverage and should help to prevent regressions. Additional
extensions will be provided in a subsequent PR.

---------

Co-authored-by: Jason Yellick <jason@enya.ai>
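The final comparison boils down to byte-for-byte equality of the two proof node lists; a minimal sketch with illustrative names (the real checks live in the fuzz tests under `turbo/trie`):

```go
package trietest // illustrative package name

import "bytes"

// proofsEqual sketches the cross-check described above: the naive
// trie.Trie proof and the flatDB proof must match node-for-node,
// byte-for-byte.
func proofsEqual(naive, flatDB [][]byte) bool {
	if len(naive) != len(flatDB) {
		return false
	}
	for i := range naive {
		if !bytes.Equal(naive[i], flatDB[i]) {
			return false
		}
	}
	return true
}
```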
calmbeing pushed a commit to calmbeing/bsc-erigon that referenced this pull request Jul 12, 2023
calmbeing added a commit to node-real/bsc-erigon that referenced this pull request Jul 18, 2023