
Resolve harmony-one/bounties#90: Add revert mechanism for UpdateValidatorWrapper #3939

Merged
rlan35 merged 3 commits into harmony-one:main on Jan 7, 2022

Conversation

@MaxMustermann2 (Contributor) commented Nov 20, 2021

  1. Merge ValidatorWrapper and ValidatorWrapperCopy to let callers ask for either a copy or a pointer to the cached object. Additionally, give callers the option to not deep copy delegations (which is a heavy process). Copies need to be explicitly committed (and thus can be reverted), while the pointers are (auto-)committed when Finalise is called.
  2. Add an UpdateValidatorWrapperWithRevert function, which is used by the staking txs Delegate, Undelegate, and CollectRewards. The other two types of staking txs and db.Finalise continue to use UpdateValidatorWrapper without revert, again to save memory. Both paths are illustrated in the sketch after this list.
  3. Add unit tests which check
    1. Revert goes through
    2. Wrapper is as expected after revert
    3. State is as expected after revert
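
The two access modes described above can be sketched as follows. This is an illustrative, hedged sketch rather than the PR's exact code: the boolean parameters of ValidatorWrapper (read here as "send the original?" and "copy delegations?") and field accesses such as Counters.NumBlocksSigned are assumptions inferred from the diff snippets and the discussion later in this conversation, not verified against the merged code.

package example

import (
    "math/big"

    "github.com/ethereum/go-ethereum/common"
    "github.com/harmony-one/harmony/core/state"
    stk "github.com/harmony-one/harmony/staking/types"
)

// addDelegation sketches the copy-modify-commit path used by Delegate, Undelegate
// and CollectRewards: ask for a deep copy (including delegations), mutate it, then
// write it back through UpdateValidatorWrapperWithRevert so the change is journaled
// and can be undone by a later revert.
func addDelegation(db *state.DB, validator common.Address, d stk.Delegation) error {
    wrapper, err := db.ValidatorWrapper(validator, false, true) // copy, with delegations
    if err != nil {
        return err
    }
    wrapper.Delegations = append(wrapper.Delegations, d)
    return db.UpdateValidatorWrapperWithRevert(validator, wrapper)
}

// countSignedBlock sketches the pointer path used by the non-revertible callers:
// take the cached object itself (no deep copy) and rely on Finalise to commit it.
func countSignedBlock(db *state.DB, validator common.Address) error {
    wrapper, err := db.ValidatorWrapper(validator, true, false) // original, no copy
    if err != nil {
        return err
    }
    wrapper.Counters.NumBlocksSigned.Add(wrapper.Counters.NumBlocksSigned, big.NewInt(1))
    return nil
}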

Issue

harmony-one/bounties#90

Test

Unit Test Coverage

Before:

core/state: 70.0%

After:

core/state: 71.9%

Test/Run Logs

$ go test -run TestValidatorRevert
=== Testing validator wrapper revert ===
=== Creating new validator and wrapper ===
=== Writing validator and wrapper to state ===
=== Validator successfully written to state ===
=== Reverting validator to None ===
=== Revert successful, according to ValidatorWrapper ===
=== Revert successful, according to checkEqual ===
=== Adding delegation to validator in stateDB ===
=== Writing validator with delegation to state ===
=== Validator successfully written to state ===
=== Reverting delegation from validator ===
=== Revert successful, according to ValidatorWrapper ===
=== Revert successful, according to checkEqual ===
PASS
ok  	github.com/harmony-one/harmony/core/state	0.024s

Operational Checklist

  1. Does this PR introduce backward-incompatible changes to the on-disk data structure and/or the over-the-wire protocol? (If no, skip to question 8.)
    No.

  2. Describe the migration plan. For each flag epoch, describe what changes take place at the flag epoch, the anticipated interactions between upgraded and non-upgraded nodes, and any special operational considerations for the migration.

  3. Describe how the plan was tested.

  4. How much minimum baking period after the last flag epoch should we allow on Pangaea before promotion onto mainnet?

  5. What are the planned flag epoch numbers and their ETAs on Pangaea?

  6. What are the planned flag epoch numbers and their ETAs on mainnet?

    Note that this must be enough to cover baking period on Pangaea.

  7. What should node operators know about this planned change?

  8. Does this PR introduce backward-incompatible changes NOT related to on-disk data structure and/or over-the-wire protocol? (If no, continue to question 11.)
    No.

  9. Does the existing node.sh continue to work with this change?

  10. What should node operators know about this change?

  11. Does this PR introduce significant changes to the operational requirements of the node software, such as >20% increase in CPU, memory, and/or disk usage?
    No. See comment.

@JackyWYX (Contributor):

The problem is that, if you simply disable the cache, RLP-decoding the validatorWrapper from the code field will take a lot of CPU and memory resources. That's why a benchmark comparison is expected. The revert shall revert the cache as well.
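
For context, the expensive path being described — re-decoding a wrapper from the account's code bytes on every access — looks roughly like the self-contained sketch below. It uses go-ethereum's rlp package with a small stand-in struct; the real ValidatorWrapper carries far more data, which is what makes repeated decoding costly.

package main

import (
    "fmt"

    "github.com/ethereum/go-ethereum/rlp"
)

// wrapper is a stand-in for the (much larger) ValidatorWrapper; the point is that
// without a cache, every read pays the RLP decode cost again.
type wrapper struct {
    Name        string
    Delegations []uint64
}

func main() {
    enc, err := rlp.EncodeToBytes(wrapper{Name: "validator", Delegations: []uint64{1, 2, 3}})
    if err != nil {
        panic(err)
    }

    // This decode is what would run on every access if the cache were disabled.
    var w wrapper
    if err := rlp.DecodeBytes(enc, &w); err != nil {
        panic(err)
    }
    fmt.Printf("decoded: %+v\n", w)
}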

@MaxMustermann2 (Author):

The problem is that, if you simply disable the cache, RLP-decoding the validatorWrapper from the code field will take a lot of CPU and memory resources. That's why a benchmark comparison is expected. The revert shall revert the cache as well.

Your assessment is correct; the memory usage had more than doubled. I will modify the code to revert the cache, and get back to you.

Closes harmony-one/bounties#90
(1) Use LRU for ValidatorWrapper objects in stateDB to plug a potential
    memory leak
(2) Merge ValidatorWrapper and ValidatorWrapperCopy to let callers ask
    for either a copy, or a pointer to the cached object. Additionally,
    give callers the option to not deep copy delegations (which is a
    heavy process). Copies need to be explicitly committed (and thus
    can be reverted), while the pointers are committed when Finalise
    is called.
(3) Add an UpdateValidatorWrapperWithRevert function, which is used by
    staking txs `Delegate`, `Undelegate`, and `CollectRewards`. The other
    two types of staking txs and `db.Finalise` continue to use
    UpdateValidatorWrapper without revert, again to save memory
(4) Add unit tests which check
    a) Revert goes through
    b) Wrapper is as expected after revert
    c) State is as expected after revert
@MaxMustermann2 (Author):

Using the shell script in #3773, I obtained the CPU and memory usage for this PR compared to the existing Harmony code at d5a8969 (the base for my branch) for ~50 minutes. The charts are plotted below:

Memory Usage

[memory usage chart]

CPU Usage

[CPU usage chart]

@MaxMustermann2 marked this pull request as ready for review on December 3, 2021 22:51
@JackyWYX (Contributor) left a comment:

@LeoHChen @rlan35 The PR looks good to me. Please review the PR code in detail. Thanks

@JackyWYX (Contributor) commented Jan 3, 2022

@LeoHChen @rlan35 Please check out this PR to unblock the staking precompiles issue.

@@ -78,7 +83,7 @@ type DB struct {
 	stateObjects        map[common.Address]*Object
 	stateObjectsPending map[common.Address]struct{} // State objects finalized but not yet written to the trie
 	stateObjectsDirty   map[common.Address]struct{}
-	stateValidators     map[common.Address]*stk.ValidatorWrapper
+	stateValidators     *lru.Cache
Contributor:

This stateValidators is like a staged representation of the validatorWrappers in memory, allowing easy modification without having to serialize/deserialize from the account.code field every time one is modified. This map keeps track of all the modifications, which need to be committed to the stateDB eventually. Changing it to a cache with a size limit could potentially lose modification data for the validatorWrappers.

Contributor Author:

Good point. I propose setting validatorWrapperCacheLimit to 4,000 (currently the PR has it set to 1,000) to work around this issue. The rationale is that the block gas limit is 80,000,000 and the minimum cost of a staking transaction is ~20,000 (conservatively; it is often higher), so at most 80,000,000 / 20,000 = 4,000 validator wrapper modifications can occur in a block. Alternatively, I am happy to change it back to a dictionary format. Let me know what you think.

Contributor:

Could you please change stateValidators to a data structure with something like

type validatorCache struct {
    dirty map[common.Address]*ValidatorWrapper
    cache *lru.Cache
}

This will make it more readable and will not add an unnecessary number of caches to the data structure.
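
A hedged sketch of how such a struct could behave — the method names and eviction policy here are illustrative assumptions, and this dirty/cache design was later dropped in favor of the original map (see the following review comments):

package example

import (
    "github.com/ethereum/go-ethereum/common"
    lru "github.com/hashicorp/golang-lru"
    stk "github.com/harmony-one/harmony/staking/types"
)

type validatorCache struct {
    dirty map[common.Address]*stk.ValidatorWrapper // pending modifications, never evicted
    cache *lru.Cache                               // read-only decoded wrappers, evictable
}

func (c *validatorCache) get(addr common.Address) (*stk.ValidatorWrapper, bool) {
    // Pending modifications always take precedence over cached reads.
    if w, ok := c.dirty[addr]; ok {
        return w, true
    }
    if v, ok := c.cache.Get(addr); ok {
        return v.(*stk.ValidatorWrapper), true
    }
    return nil, false
}

func (c *validatorCache) set(addr common.Address, w *stk.ValidatorWrapper) {
    // Writes land in dirty so they cannot be lost before being committed.
    c.dirty[addr] = w
}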

Contributor:

Based on the new profiling, it seems there isn't much benefit in using an LRU. A blockchain system requires strong determinism guarantees, so please revert to the old way of using maps. Since each block's state is cleared after the block is processed, there shouldn't be a memory-leak issue with the old approach.

Contributor:

I saw you added the dirty/cache struct. Since it's not improving memory performance, let's not complicate the existing code (it may introduce new bugs and is risky without extensive testing). Sorry for making you change the code back and forth. (Also, it's too complicated to have three booleans in the ValidatorWrapper() method.)

@rlan35 (Contributor) commented Jan 4, 2022

Using the shell script in #3773, I obtained the CPU and memory usage for this PR compared to the existing Harmony code at d5a8969 (the base for my branch) for ~50 minutes. The charts are plotted below:

Memory Usage

[memory usage chart]

CPU Usage

[CPU usage chart]

Nice!

@MaxMustermann2 (Author):

I attach the graphs for the updated PR below. Since I was using a slightly different system configuration, I re-ran the graph for the existing code base too.

Memory

[memory usage chart]

CPU

[CPU usage chart]

Since the memory / CPU usage saved is not significantly different when
using an LRU + map structure, go back to the original dictionary
structure to keep the code easy to read and limit the modifications.
@MaxMustermann2 (Author):

Based on the new profiling, it seems there isn't much benefit in using an LRU. A blockchain system requires strong determinism guarantees, so please revert to the old way of using maps. Since each block's state is cleared after the block is processed, there shouldn't be a memory-leak issue with the old approach.

Please see the profiling below.

Memory

[memory usage chart]

CPU

[CPU usage chart]

@@ -984,3 +988,72 @@ func makeBLSPubSigPair() blsPubSigPair {

return blsPubSigPair{shardPub, shardSig}
}

func TestValidatorRevert(t *testing.T) {
Contributor:

Can you add more test cases to cover more situations, like modify, modify, revert and modify, modify, revert, revert? Or update other fields rather than just the delegations. The test coverage right now is not high.

Contributor Author:

I have added more tests; please see TestValidatorMultipleReverts in particular. The coverage for new code in statedb.go has now increased.

@rlan35 (Contributor) commented Jan 6, 2022

Please also run a mainnet node with this new code; you can use the rcloned blockchain and let the node run to make sure it can synchronize all the new blocks without problems.

As requested by @rlan35, add tests beyond just adding and reverting a
delegation. The tests are successive in the sense that we do multiple
modifications to the wrapper, save a snapshot before each modification
and revert to each of them to confirm everything works well. This change
improves test coverage of statedb.go to 66.7% from 64.8% and that of
core/state to 71.9% from 70.8%, and covers all the code that has been
modified by this PR in statedb.go.

For clarity, the modifications to the wrapper include (1) creation of
wrapper in state, (2) adding a delegation to the wrapper, (3)
increasing the blocks signed, and (4) a change in the validator Name and
the BlockReward. Two additional tests have been added to cover the
`panic` and the `GetCode` cases.
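
The snapshot/revert pattern these tests exercise can be shown with a small self-contained toy model. It deliberately avoids the Harmony types and uses a plain map plus a journal of undo closures, so the names below (toyDB, UpdateWrapperWithRevert, and so on) are illustrative only; the real tests live in core/state and use the actual stateDB.

package example

import "testing"

// toyWrapper and toyDB model, in miniature, how a journal of undo actions lets a
// stateDB revert wrapper modifications; the real implementation journals the
// previous *ValidatorWrapper rather than a closure.
type toyWrapper struct {
    Name        string
    Delegations []int
}

type toyDB struct {
    wrappers map[string]*toyWrapper
    journal  []func() // undo actions, applied in reverse on revert
}

func newToyDB() *toyDB {
    return &toyDB{wrappers: make(map[string]*toyWrapper)}
}

// Snapshot returns the current journal length; reverting to it undoes everything
// recorded after that point.
func (db *toyDB) Snapshot() int { return len(db.journal) }

func (db *toyDB) RevertToSnapshot(snap int) {
    for i := len(db.journal) - 1; i >= snap; i-- {
        db.journal[i]()
    }
    db.journal = db.journal[:snap]
}

// UpdateWrapperWithRevert installs a new wrapper and journals the previous one.
func (db *toyDB) UpdateWrapperWithRevert(addr string, w *toyWrapper) {
    prev, existed := db.wrappers[addr]
    db.journal = append(db.journal, func() {
        if existed {
            db.wrappers[addr] = prev
        } else {
            delete(db.wrappers, addr)
        }
    })
    db.wrappers[addr] = w
}

// TestToyMultipleReverts mirrors the modify, modify, revert, revert pattern:
// snapshot before each modification, then unwind the changes one by one.
func TestToyMultipleReverts(t *testing.T) {
    db := newToyDB()

    s0 := db.Snapshot()
    db.UpdateWrapperWithRevert("v1", &toyWrapper{Name: "created"})

    s1 := db.Snapshot()
    db.UpdateWrapperWithRevert("v1", &toyWrapper{Name: "created", Delegations: []int{100}})

    db.RevertToSnapshot(s1)
    if got := db.wrappers["v1"]; got == nil || len(got.Delegations) != 0 {
        t.Fatalf("expected delegation to be reverted, got %+v", got)
    }

    db.RevertToSnapshot(s0)
    if _, ok := db.wrappers["v1"]; ok {
        t.Fatal("expected wrapper creation to be reverted")
    }
}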
@MaxMustermann2 (Author):

Please also run a mainnet node with this new code; you can use the rcloned blockchain and let the node run to make sure it can synchronize all the new blocks without problems.

The results with the memory and CPU usage are from a mainnet node, as specified in the bounty requirements. Do you need anything else from these runs?

@rlan35 (Contributor) commented Jan 6, 2022

OK, that's good. Are the nodes able to sync to the latest block and stay in sync all the time?

@MaxMustermann2 (Author) commented Jan 7, 2022

OK, that's good. Are the nodes able to sync to the latest block and stay in sync all the time?

Yes, although I had to merge the main branch and #3976 into my build to get the sync to catch up from the rclone base. The "catching up" lasted ~5.5 hours to cover a difference of ~18,250 blocks between the rcloned database and mainnet. The node stayed in sync (according to the block number as well as hmyv2_inSync) for the next ~4.5 hours, the duration of my testing.

For the record, I used a storage optimized Digital Ocean droplet (not dedicated) with 8 cores, 64 GB RAM and 1.17 TB SSD. This decision was made because Harmony requirements recommend using an 8-core server if it's shared, and I needed at least 750 GB for the rclone.

}
// a copy of the existing store can be used for revert
// since we are replacing the existing with the new anyway
prev, err := db.ValidatorWrapper(addr, true, false)
Contributor:

Should this be "db.ValidatorWrapper(addr, false, true)", since you want a copy to be stored in the change journal?

Contributor Author:

This is because the caller is sending a copy, which is being added to the db, while the original is in the journal.

Contributor:

I see, makes sense. The original is replaced with the new one, so it's safe to use it directly without copying. Thanks
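
Summarizing the thread, a hedged sketch of the mechanism under discussion: the caller's copy replaces the cached original, so the original pointer itself can be journaled and put back on revert. The types below (fakeDB, validatorWrapperChange) are stand-ins for the unexported internals in core/state, not the actual code.

package example

import (
    "github.com/ethereum/go-ethereum/common"
    stk "github.com/harmony-one/harmony/staking/types"
)

// journalEntry and fakeDB are hypothetical stand-ins; the real journal follows the
// same geth-style pattern inside core/state.
type journalEntry interface{ revert(db *fakeDB) }

type fakeDB struct {
    stateValidators map[common.Address]*stk.ValidatorWrapper
    journal         []journalEntry
}

// validatorWrapperChange remembers the wrapper that was in place before an update;
// reverting simply puts the original pointer back.
type validatorWrapperChange struct {
    address common.Address
    prev    *stk.ValidatorWrapper
}

func (c validatorWrapperChange) revert(db *fakeDB) {
    db.stateValidators[c.address] = c.prev
}

// updateWithRevert mirrors the logic discussed above: the incoming val is a
// caller-owned copy that replaces the cached original, so the original pointer
// can be journaled without a further deep copy.
func (db *fakeDB) updateWithRevert(addr common.Address, val *stk.ValidatorWrapper) {
    prev := db.stateValidators[addr] // original, no copy needed
    db.journal = append(db.journal, validatorWrapperChange{address: addr, prev: prev})
    db.stateValidators[addr] = val
}

// revertLast undoes the most recent change, as RevertToSnapshot would, entry by entry.
func (db *fakeDB) revertLast() {
    if n := len(db.journal); n > 0 {
        db.journal[n-1].revert(db)
        db.journal = db.journal[:n-1]
    }
}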

@rlan35 merged commit 5abe070 into harmony-one:main on Jan 7, 2022