
Resolve harmony-one/bounties#90: Add revert mechanism for UpdateValidatorWrapper #3939

Merged
rlan35 merged 3 commits into harmony-one:main on Jan 7, 2022

Conversation

@MaxMustermann2 (Contributor) commented Nov 20, 2021

  1. Merge ValidatorWrapper and ValidatorWrapperCopy to let callers ask for either a copy or a pointer to the cached object. Additionally, give callers the option to not deep copy delegations (which is a heavy process). Copies need to be explicitly committed (and thus can be reverted), while the pointers are (auto-)committed when Finalise is called.
  2. Add an UpdateValidatorWrapperWithRevert function, which is used by the staking txs Delegate, Undelegate, and CollectRewards. The other two types of staking txs and db.Finalise continue to use UpdateValidatorWrapper without revert, again to save memory. Both paths are illustrated in the sketch after this list.
  3. Add unit tests which check
    1. Revert goes through
    2. Wrapper is as expected after revert
    3. State is as expected after revert
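
The two access modes described above can be sketched as follows. This is an illustrative, hedged sketch rather than the PR's exact code: the boolean parameters of ValidatorWrapper (read here as "send the original?" and "copy delegations?") and field accesses such as Counters.NumBlocksSigned are assumptions inferred from the diff snippets and the discussion later in this conversation, not verified against the merged code.

package example

import (
    "math/big"

    "github.com/ethereum/go-ethereum/common"
    "github.com/harmony-one/harmony/core/state"
    stk "github.com/harmony-one/harmony/staking/types"
)

// addDelegation sketches the copy-modify-commit path used by Delegate, Undelegate
// and CollectRewards: ask for a deep copy (including delegations), mutate it, then
// write it back through UpdateValidatorWrapperWithRevert so the change is journaled
// and can be undone by a later revert.
func addDelegation(db *state.DB, validator common.Address, d stk.Delegation) error {
    wrapper, err := db.ValidatorWrapper(validator, false, true) // copy, with delegations
    if err != nil {
        return err
    }
    wrapper.Delegations = append(wrapper.Delegations, d)
    return db.UpdateValidatorWrapperWithRevert(validator, wrapper)
}

// countSignedBlock sketches the pointer path used by the non-revertible callers:
// take the cached object itself (no deep copy) and rely on Finalise to commit it.
func countSignedBlock(db *state.DB, validator common.Address) error {
    wrapper, err := db.ValidatorWrapper(validator, true, false) // original, no copy
    if err != nil {
        return err
    }
    wrapper.Counters.NumBlocksSigned.Add(wrapper.Counters.NumBlocksSigned, big.NewInt(1))
    return nil
}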

Issue

harmony-one/bounties#90

Test

Unit Test Coverage

Before:

core/state: 70.0%

After:

core/state: 71.9%

Test/Run Logs

$ go test -run TestValidatorRevert
=== Testing validator wrapper revert ===
=== Creating new validator and wrapper ===
=== Writing validator and wrapper to state ===
=== Validator successfully written to state ===
=== Reverting validator to None ===
=== Revert successful, according to ValidatorWrapper ===
=== Revert successful, according to checkEqual ===
=== Adding delegation to validator in stateDB ===
=== Writing validator with delegation to state ===
=== Validator successfully written to state ===
=== Reverting delegation from validator ===
=== Revert successful, according to ValidatorWrapper ===
=== Revert successful, according to checkEqual ===
PASS
ok  	github.com/harmony-one/harmony/core/state	0.024s

Operational Checklist

  1. Does this PR introduce backward-incompatible changes to the on-disk data structure and/or the over-the-wire protocol? (If no, skip to question 8.)
    No.

  2. Describe the migration plan. For each flag epoch, describe what changes take place at the flag epoch, the anticipated interactions between upgraded and non-upgraded nodes, and any special operational considerations for the migration.

  3. Describe how the plan was tested.

  4. How much minimum baking period after the last flag epoch should we allow on Pangaea before promotion onto mainnet?

  5. What are the planned flag epoch numbers and their ETAs on Pangaea?

  6. What are the planned flag epoch numbers and their ETAs on mainnet?

    Note that this must be enough to cover baking period on Pangaea.

  7. What should node operators know about this planned change?

  8. Does this PR introduce backward-incompatible changes NOT related to on-disk data structure and/or over-the-wire protocol? (If no, continue to question 11.)
    No.

  9. Does the existing node.sh continue to work with this change?

  10. What should node operators know about this change?

  11. Does this PR introduce significant changes to the operational requirements of the node software, such as >20% increase in CPU, memory, and/or disk usage?
    No. See comment.

@JackyWYX (Contributor):

The problem is that, if you simply disable the cache, RLP-decoding the validatorWrapper from the code field will take a lot of CPU and memory resources. That's why a benchmark comparison is expected. The revert shall revert the cache as well.
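
For context, the expensive path being described — re-decoding a wrapper from the account's code bytes on every access — looks roughly like the self-contained sketch below. It uses go-ethereum's rlp package with a small stand-in struct; the real ValidatorWrapper carries far more data, which is what makes repeated decoding costly.

package main

import (
    "fmt"

    "github.com/ethereum/go-ethereum/rlp"
)

// wrapper is a stand-in for the (much larger) ValidatorWrapper; the point is that
// without a cache, every read pays the RLP decode cost again.
type wrapper struct {
    Name        string
    Delegations []uint64
}

func main() {
    enc, err := rlp.EncodeToBytes(wrapper{Name: "validator", Delegations: []uint64{1, 2, 3}})
    if err != nil {
        panic(err)
    }

    // This decode is what would run on every access if the cache were disabled.
    var w wrapper
    if err := rlp.DecodeBytes(enc, &w); err != nil {
        panic(err)
    }
    fmt.Printf("decoded: %+v\n", w)
}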

@MaxMustermann2 (Author):

The problem is that, if you simply disable the cache, RLP-decoding the validatorWrapper from the code field will take a lot of CPU and memory resources. That's why a benchmark comparison is expected. The revert shall revert the cache as well.

Your assessment is correct; the memory usage had more than doubled. I will modify the code to revert the cache, and get back to you.

Closes harmony-one/bounties#90
(1) Use LRU for ValidatorWrapper objects in stateDB to plug a potential
    memory leak
(2) Merge ValidatorWrapper and ValidatorWrapperCopy to let callers ask
    for either a copy, or a pointer to the cached object. Additionally,
    give callers the option to not deep copy delegations (which is a
    heavy process). Copies need to be explicitly committed (and thus
    can be reverted), while the pointers are committed when Finalise
    is called.
(3) Add an UpdateValidatorWrapperWithRevert function, which is used by
    staking txs `Delegate`, `Undelegate`, and `CollectRewards`. The other
    two types of staking txs and `db.Finalise` continue to use
    UpdateValidatorWrapper without revert, again to save memory
(4) Add unit tests which check
    a) Revert goes through
    b) Wrapper is as expected after revert
    c) State is as expected after revert
@MaxMustermann2 (Author):

Using the shell script in #3773, I obtained the CPU and memory usage for this PR compared to the existing Harmony code at d5a8969 (the base for my branch) for ~50 minutes. The charts are plotted below:

Memory Usage

[memory usage chart]

CPU Usage

[CPU usage chart]

@MaxMustermann2 marked this pull request as ready for review on December 3, 2021 22:51
@JackyWYX (Contributor) left a comment:

@LeoHChen @rlan35 The PR looks good to me. Please review the PR code in detail. Thanks

@JackyWYX (Contributor) commented Jan 3, 2022

@LeoHChen @rlan35 Please check out this PR to unblock the staking precompiles issue.

@@ -78,7 +83,7 @@ type DB struct {
 	stateObjects        map[common.Address]*Object
 	stateObjectsPending map[common.Address]struct{} // State objects finalized but not yet written to the trie
 	stateObjectsDirty   map[common.Address]struct{}
-	stateValidators     map[common.Address]*stk.ValidatorWrapper
+	stateValidators     *lru.Cache
Contributor:

This stateValidators is like a staged representation of the validatorWrappers in memory, allowing easy modification without having to serialize/deserialize from the account.code field every time one is modified. This map keeps track of all the modifications, which need to be committed to the stateDB eventually. Changing it to a cache with a size limit could potentially lose modification data for the validatorWrappers.

Contributor Author:

Good point. I propose setting validatorWrapperCacheLimit to 4,000 (currently the PR has it set to 1,000) to work around this issue. The rationale is that the block gas limit is 80,000,000 and the minimum cost of a staking transaction is ~20,000 (conservatively; it is often higher), so at most 80,000,000 / 20,000 = 4,000 validator wrapper modifications can occur in a block. Alternatively, I am happy to change it back to a dictionary format. Let me know what you think.

Contributor:

Could you please change stateValidators to a data structure with something like

type validatorCache struct {
    dirty map[common.Address]*ValidatorWrapper
    cache *lru.Cache
}

This will make it more readable and will not add an unnecessary number of caches to the data structure.
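
A hedged sketch of how such a struct could behave — the method names and eviction policy here are illustrative assumptions, and this dirty/cache design was later dropped in favor of the original map (see the following review comments):

package example

import (
    "github.com/ethereum/go-ethereum/common"
    lru "github.com/hashicorp/golang-lru"
    stk "github.com/harmony-one/harmony/staking/types"
)

type validatorCache struct {
    dirty map[common.Address]*stk.ValidatorWrapper // pending modifications, never evicted
    cache *lru.Cache                               // read-only decoded wrappers, evictable
}

func (c *validatorCache) get(addr common.Address) (*stk.ValidatorWrapper, bool) {
    // Pending modifications always take precedence over cached reads.
    if w, ok := c.dirty[addr]; ok {
        return w, true
    }
    if v, ok := c.cache.Get(addr); ok {
        return v.(*stk.ValidatorWrapper), true
    }
    return nil, false
}

func (c *validatorCache) set(addr common.Address, w *stk.ValidatorWrapper) {
    // Writes land in dirty so they cannot be lost before being committed.
    c.dirty[addr] = w
}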

Contributor:

Based on the new profiling, it seems there isn't much benefit in using an LRU. A blockchain system requires strong determinism guarantees, so please revert to the old way of using maps. Since each block's state is cleared after the block is processed, there shouldn't be a memory-leak issue with the old approach.

Contributor:

I saw you added the dirty/cache struct. Since it's not improving memory performance, let's not complicate the existing code (it may introduce new bugs and is risky without extensive testing). Sorry for making you change the code back and forth. (Also, it's too complicated to have three booleans in the ValidatorWrapper() method.)

@rlan35 (Contributor) commented Jan 4, 2022

Using the shell script in #3773, I obtained the CPU and memory usage for this PR compared to the existing Harmony code at d5a8969 (the base for my branch) for ~50 minutes. The charts are plotted below:

Memory Usage

[memory usage chart]

CPU Usage

[CPU usage chart]

Nice!

@MaxMustermann2 (Author):

I attach the graphs for the updated PR below. Since I was using a slightly different system configuration, I re-ran the graph for the existing code base too.

Memory

[memory usage chart]

CPU

[CPU usage chart]

Since the memory / CPU usage saved is not significantly different when
using an LRU + map structure, go back to the original dictionary
structure to keep the code easy to read and limit the modifications.
@MaxMustermann2 (Author):

Based on the new profiling, it seems there isn't much benefit in using an LRU. A blockchain system requires strong determinism guarantees, so please revert to the old way of using maps. Since each block's state is cleared after the block is processed, there shouldn't be a memory-leak issue with the old approach.

Please see the profiling below.

Memory

[memory usage chart]

CPU

[CPU usage chart]

@@ -984,3 +988,72 @@ func makeBLSPubSigPair() blsPubSigPair {

return blsPubSigPair{shardPub, shardSig}
}

func TestValidatorRevert(t *testing.T) {
Contributor:

Can you add more test cases to cover more situations, like modify, modify, revert and modify, modify, revert, revert? Or update other fields rather than just the delegations. The test coverage right now is not high.

Contributor Author:

I have added more tests; please see TestValidatorMultipleReverts in particular. The coverage for new code in statedb.go has now increased.

@rlan35 (Contributor) commented Jan 6, 2022

Please also run a mainnet node with this new code; you can use the rcloned blockchain and let the node run to make sure it can synchronize all the new blocks without problems.

As requested by @rlan35, add tests beyond just adding and reverting a
delegation. The tests are successive in the sense that we do multiple
modifications to the wrapper, save a snapshot before each modification
and revert to each of them to confirm everything works well. This change
improves test coverage of statedb.go to 66.7% from 64.8% and that of
core/state to 71.9% from 70.8%, and covers all the code that has been
modified by this PR in statedb.go.

For clarity, the modifications to the wrapper include (1) creation of
wrapper in state, (2) adding a delegation to the wrapper, (3)
increasing the blocks signed, and (4) a change in the validator Name and
the BlockReward. Two additional tests have been added to cover the
`panic` and the `GetCode` cases.
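
The snapshot/revert pattern these tests exercise can be shown with a small self-contained toy model. It deliberately avoids the Harmony types and uses a plain map plus a journal of undo closures, so the names below (toyDB, UpdateWrapperWithRevert, and so on) are illustrative only; the real tests live in core/state and use the actual stateDB.

package example

import "testing"

// toyWrapper and toyDB model, in miniature, how a journal of undo actions lets a
// stateDB revert wrapper modifications; the real implementation journals the
// previous *ValidatorWrapper rather than a closure.
type toyWrapper struct {
    Name        string
    Delegations []int
}

type toyDB struct {
    wrappers map[string]*toyWrapper
    journal  []func() // undo actions, applied in reverse on revert
}

func newToyDB() *toyDB {
    return &toyDB{wrappers: make(map[string]*toyWrapper)}
}

// Snapshot returns the current journal length; reverting to it undoes everything
// recorded after that point.
func (db *toyDB) Snapshot() int { return len(db.journal) }

func (db *toyDB) RevertToSnapshot(snap int) {
    for i := len(db.journal) - 1; i >= snap; i-- {
        db.journal[i]()
    }
    db.journal = db.journal[:snap]
}

// UpdateWrapperWithRevert installs a new wrapper and journals the previous one.
func (db *toyDB) UpdateWrapperWithRevert(addr string, w *toyWrapper) {
    prev, existed := db.wrappers[addr]
    db.journal = append(db.journal, func() {
        if existed {
            db.wrappers[addr] = prev
        } else {
            delete(db.wrappers, addr)
        }
    })
    db.wrappers[addr] = w
}

// TestToyMultipleReverts mirrors the modify, modify, revert, revert pattern:
// snapshot before each modification, then unwind the changes one by one.
func TestToyMultipleReverts(t *testing.T) {
    db := newToyDB()

    s0 := db.Snapshot()
    db.UpdateWrapperWithRevert("v1", &toyWrapper{Name: "created"})

    s1 := db.Snapshot()
    db.UpdateWrapperWithRevert("v1", &toyWrapper{Name: "created", Delegations: []int{100}})

    db.RevertToSnapshot(s1)
    if got := db.wrappers["v1"]; got == nil || len(got.Delegations) != 0 {
        t.Fatalf("expected delegation to be reverted, got %+v", got)
    }

    db.RevertToSnapshot(s0)
    if _, ok := db.wrappers["v1"]; ok {
        t.Fatal("expected wrapper creation to be reverted")
    }
}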
@MaxMustermann2 (Author):

Please also run a mainnet node with this new code; you can use the rcloned blockchain and let the node run to make sure it can synchronize all the new blocks without problems.

The results with the memory and CPU usage are from a mainnet node, as specified in the bounty requirements. Do you need anything else from these runs?

@rlan35 (Contributor) commented Jan 6, 2022

OK, that's good. Are the nodes able to sync to the latest block and stay in sync all the time?

@MaxMustermann2 (Author) commented Jan 7, 2022

OK, that's good. Are the nodes able to sync to the latest block and stay in sync all the time?

Yes, although I had to merge the main branch and #3976 into my build to get the sync to catch up from the rclone base. The "catching up" lasted ~5.5 hours to cover a difference of ~18,250 blocks between the rcloned database and mainnet. The node stayed in sync (according to the block number as well as hmyv2_inSync) for the next ~4.5 hours, the duration of my testing.

For the record, I used a storage optimized Digital Ocean droplet (not dedicated) with 8 cores, 64 GB RAM and 1.17 TB SSD. This decision was made because Harmony requirements recommend using an 8-core server if it's shared, and I needed at least 750 GB for the rclone.

}
// a copy of the existing store can be used for revert
// since we are replacing the existing with the new anyway
prev, err := db.ValidatorWrapper(addr, true, false)
Contributor:

Should this be "db.ValidatorWrapper(addr, false, true)", since you want a copy to be stored in the change journal?

Contributor Author:

This is because the caller is sending a copy, which is being added to the db, while the original is in the journal.

Contributor:

I see, makes sense. The original is replaced with the new one, so it's safe to use it directly without copying. Thanks
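
Summarizing the thread, a hedged sketch of the mechanism under discussion: the caller's copy replaces the cached original, so the original pointer itself can be journaled and put back on revert. The types below (fakeDB, validatorWrapperChange) are stand-ins for the unexported internals in core/state, not the actual code.

package example

import (
    "github.com/ethereum/go-ethereum/common"
    stk "github.com/harmony-one/harmony/staking/types"
)

// journalEntry and fakeDB are hypothetical stand-ins; the real journal follows the
// same geth-style pattern inside core/state.
type journalEntry interface{ revert(db *fakeDB) }

type fakeDB struct {
    stateValidators map[common.Address]*stk.ValidatorWrapper
    journal         []journalEntry
}

// validatorWrapperChange remembers the wrapper that was in place before an update;
// reverting simply puts the original pointer back.
type validatorWrapperChange struct {
    address common.Address
    prev    *stk.ValidatorWrapper
}

func (c validatorWrapperChange) revert(db *fakeDB) {
    db.stateValidators[c.address] = c.prev
}

// updateWithRevert mirrors the logic discussed above: the incoming val is a
// caller-owned copy that replaces the cached original, so the original pointer
// can be journaled without a further deep copy.
func (db *fakeDB) updateWithRevert(addr common.Address, val *stk.ValidatorWrapper) {
    prev := db.stateValidators[addr] // original, no copy needed
    db.journal = append(db.journal, validatorWrapperChange{address: addr, prev: prev})
    db.stateValidators[addr] = val
}

// revertLast undoes the most recent change, as RevertToSnapshot would, entry by entry.
func (db *fakeDB) revertLast() {
    if n := len(db.journal); n > 0 {
        db.journal[n-1].revert(db)
        db.journal = db.journal[:n-1]
    }
}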

@rlan35 merged commit 5abe070 into harmony-one:main on Jan 7, 2022