
performance: cache the ID for Header entities #1279

Merged
merged 22 commits into master from jwinkler2083233/5841-cache-header-id on Sep 16, 2021

Conversation

jwinkler2083233
Contributor

No description provided.

@codecov-commenter

codecov-commenter commented Sep 11, 2021

Codecov Report

Merging #1279 (8c21989) into master (698c774) will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1279      +/-   ##
==========================================
+ Coverage   54.96%   55.01%   +0.04%     
==========================================
  Files         501      501              
  Lines       31713    31739      +26     
==========================================
+ Hits        17432    17461      +29     
+ Misses      11925    11921       -4     
- Partials     2356     2357       +1     
Flag Coverage Δ
unittests 55.01% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
model/flow/header.go 86.58% <100.00%> (+6.22%) ⬆️
...ngine/common/synchronization/finalized_snapshot.go 68.75% <0.00%> (-4.17%) ⬇️
...sus/approvals/assignment_collector_statemachine.go 47.11% <0.00%> (+4.80%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 698c774...8c21989. Read the comment docs.

@jwinkler2083233 jwinkler2083233 removed the request for review from AlexHentschel September 13, 2021 13:36
Jeff Winkler added 2 commits September 13, 2021 19:26
This has fixes for unit tests that were failing due to race conditions
that became apparent when the cache sped up the MakeID() call.
Member

@AlexHentschel AlexHentschel left a comment

Thanks for the suggestion. Overall, I have strong concerns regarding the thread safety of this approach (see comments below). For merging this into master, I would need a rigorous argument for why this is thread-safe on all hardware platforms (based on the Go memory model and potentially additional statements in the Go developer forums).
Maybe I am missing something.

Furthermore, I feel we might be able to do much better if we cache the IDs for all headers that are stored in memory, not just the one we most recently computed the ID for. There are many cases where we traverse the chain of recent blocks each time we receive a new one. Caching only the last header ID will not help much, because we don't hit the same header repeatedly.

@jwinkler2083233
Contributor Author

jwinkler2083233 commented Sep 14, 2021

Thanks for the suggestion. Overall, I have strong concerns regarding the thread safety of this approach (see comments below). For merging this into master, I would need a rigorous argument for why this is thread-safe on all hardware platforms (based on the Go memory model and potentially additional statements in the Go developer forums).
Maybe I am missing something.

Furthermore, I feel we might be able to do much better if we cache the IDs for all headers that are stored in memory, not just the one we most recently computed the ID for. There are many cases where we traverse the chain of recent blocks each time we receive a new one. Caching only the last header ID will not help much, because we don't hit the same header repeatedly.

According to the testing, we were accessing the same header repeatedly, so I'm only checking the previous one. I suspect that looking further back in history would cause a performance issue, because it's a linear search, and each compare is about a 1000-byte binary comparison.

How about this idea: what if, with each MakeID call, we also create a Murmur3 hash of the result from Rlp.encodeBytes(), and store that in a deque-based cache of limited length for ID searching?
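For illustration, a minimal sketch of what such a deque-style cache could look like. It assumes a Murmur3 implementation such as github.com/spaolacci/murmur3, a flow.Identifier-like 32-byte ID type, and that the caller passes in the header's RLP encoding; none of this is code from the PR, and the names are illustrative.

```go
package flow

import (
	"sync"

	"github.com/spaolacci/murmur3"
)

// Identifier stands in for flow.Identifier here.
type Identifier [32]byte

type idCacheEntry struct {
	key uint64     // Murmur3 hash of the header's RLP encoding
	id  Identifier // previously computed full ID
}

// idCache is a bounded FIFO ("deque-like") cache of recently computed IDs.
type idCache struct {
	mu      sync.Mutex
	entries []idCacheEntry
	limit   int
}

// lookupOrCompute returns a cached ID for the given RLP bytes, or computes,
// stores, and returns it. Note: keying purely on a 64-bit Murmur3 hash risks
// collisions; a production version would need to guard against that.
func (c *idCache) lookupOrCompute(rlpBytes []byte, compute func([]byte) Identifier) Identifier {
	key := murmur3.Sum64(rlpBytes)

	c.mu.Lock()
	for _, e := range c.entries {
		if e.key == key {
			c.mu.Unlock()
			return e.id
		}
	}
	c.mu.Unlock()

	// cache miss: compute the full ID outside the lock
	id := compute(rlpBytes)

	c.mu.Lock()
	c.entries = append(c.entries, idCacheEntry{key: key, id: id})
	if len(c.entries) > c.limit {
		c.entries = c.entries[1:] // evict the oldest entry
	}
	c.mu.Unlock()
	return id
}
```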

@jwinkler2083233 jwinkler2083233 force-pushed the jwinkler2083233/5841-cache-header-id branch from 99186f0 to 82358e7 Compare September 14, 2021 14:53
@bluesign
Contributor

I'm a Go noob, but I think this doesn't solve the race conditions, and computing MakeID inside the mutex is a bad idea.

I think what you need is more like this:

// snapshot the cached header and its ID under the lock
mutex_hdr.Lock()
prev_hdr_copy := prev_hdr
previd_hdr_copy := previd_hdr
h_copy := h
mutex_hdr.Unlock()

// fast path: the header matches the cached one
if equal(prev_hdr_copy, h_copy) {
	return previd_hdr_copy
}

// slow path: compute the ID outside the lock
newid := MakeID(h_copy)

mutex_hdr.Lock()
prev_hdr = h_copy
previd_hdr = newid
mutex_hdr.Unlock()

return newid

@jwinkler2083233
Contributor Author

jwinkler2083233 commented Sep 14, 2021

I'm a Go noob, but I think this doesn't solve the race conditions, and computing MakeID inside the mutex is a bad idea.

I think what you need is more like this:

// snapshot the cached header and its ID under the lock
mutex_hdr.Lock()
prev_hdr_copy := prev_hdr
previd_hdr_copy := previd_hdr
h_copy := h
mutex_hdr.Unlock()

// fast path: the header matches the cached one
if equal(prev_hdr_copy, h_copy) {
	return previd_hdr_copy
}

// slow path: compute the ID outside the lock
newid := MakeID(h_copy)

mutex_hdr.Lock()
prev_hdr = h_copy
previd_hdr = newid
mutex_hdr.Unlock()

return newid

Yes and no.

The only place I'm seeing reentrancy is during the unit tests. There's no reason to make a deep copy of h, because it is on the stack. For Header entity types, it appears from the logs that the MakeID calls are not made across threads and that they are sequential. What you're suggesting would make a deep copy of the ID and the Header object. I can avoid the deep copies by holding the mutex around the entire function.

For other entity types, like Transaction, I would agree with you. If the functions involved with MakeID for Header were more intensely concurrent, I would also agree with you.
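A minimal sketch of the whole-function-mutex approach described above, assuming the flow model's Header, Identifier, and MakeID, plus a hypothetical headersEqual helper for the full field-by-field comparison; the names are illustrative, not necessarily the exact code merged in this PR.

```go
// Illustrative only: the whole ID() call holds one package-level mutex,
// so the cached header never needs to be deep-copied.
var (
	mutexHeader sync.Mutex
	prevHeader  Header     // last header an ID was computed for
	prevID      Identifier // its cached ID
)

func (h Header) ID() Identifier {
	mutexHeader.Lock()
	defer mutexHeader.Unlock()

	// fast path: same header as the previous call
	if headersEqual(h, prevHeader) { // hypothetical deep comparison (~1000 bytes)
		return prevID
	}

	// slow path: compute and remember for the next call
	id := MakeID(h)
	prevHeader, prevID = h, id
	return id
}
```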

Contributor Author

@jwinkler2083233 jwinkler2083233 left a comment

These requested changes are addressed with the most recent commit.

@bluesign
Contributor

Thanks for the answer. One more question: why not make the ID part of the Header struct, calculate it on the first ID() call if necessary, and return the same value once it has been calculated?

The documentation states that the fingerprint is calculated over the immutable part of the header, so technically this would allow caching the IDs of all headers.
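One possible shape of that idea, sketched with a sync.Once guard; this is hypothetical, not what the PR implements, and it assumes the relevant header fields really are immutable after construction.

```go
type Header struct {
	// ... existing immutable header fields ...

	idOnce sync.Once  // guards the one-time fingerprint computation
	id     Identifier // cached result of the first ID() call
}

func (h *Header) ID() Identifier {
	h.idOnce.Do(func() {
		h.id = MakeID(h) // fingerprint over the immutable fields only
	})
	return h.id
}

// Caveats: a sync.Once must not be copied after first use, so the Header could
// no longer be copied by value, and any mutation after the first ID() call
// would leave the cached ID stale.
```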

Member

@AlexHentschel AlexHentschel left a comment

Thanks for the revisions. Some minor suggestions regarding the coding style.

Regarding the question whether to cache the ID for one or multiple headers:

  • In my mind, the approach taken in this PR is a very light-weight way to address the most severe performance bottleneck. I think we were viewing it as a temporary fix anyway, right?

How about this idea: what if, with each MakeID call, we also create a Murmur3 hash of the result from Rlp.encodeBytes(), and store that in a deque-based cache of limited length for ID searching?

Overall, I think your existing fix is totally fine as a temporary solution. Let's stick with that. Utilizing a cache seems to me like a long-term solution.

@AlexHentschel
Member

@bluesign

why not make the ID part of the Header struct, calculate it on the first ID() call if necessary, and return the same value once it has been calculated?

Yes; from my perspective, this is the long-term goal. But it is a non-trivial task, because nodes operate in a malicious environment. We can't just send the ID of a block (or header) over the wire and have the recipient trust that this value is correct. Instead, the recipient has to compute the ID when receiving the block. Essentially, we would need a transformation step from an "untrusted data structure" to a "validated data structure". Only the validated data structure should contain a pre-computed ID. A node can locally cache the validated data structure (including the pre-computed ID).

To make the situation even more complex, it is established best practice (for blockchain systems) to only store the "untrusted data structure" in the node's database. When retrieving the block from the DB, the node re-computes the ID, which guarantees that the data has not been altered.
The motivation behind this practice is that data on disk could be corrupted, so we should validate it on retrieval.
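A rough sketch of that untrusted-to-validated transformation; the types and the checkStructuralValidity helper are illustrative, not existing flow-go API.

```go
// UntrustedBlock is what arrives over the wire or is read back from the database.
type UntrustedBlock struct {
	Header  Header
	Payload Payload
}

// ValidatedBlock carries a pre-computed ID and is only ever constructed locally.
type ValidatedBlock struct {
	UntrustedBlock
	id Identifier
}

// Validate checks the block and derives its ID locally; the ID is never taken
// from the sender or from disk.
func Validate(ub UntrustedBlock) (ValidatedBlock, error) {
	if err := checkStructuralValidity(ub); err != nil { // hypothetical helper
		return ValidatedBlock{}, err
	}
	return ValidatedBlock{
		UntrustedBlock: ub,
		id:             MakeID(ub.Header), // recomputed locally
	}, nil
}

func (b ValidatedBlock) ID() Identifier { return b.id }
```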

Contributor

@huitseeker huitseeker left a comment

Concept ACK.

Idle musings:
How often do we call this in contexts requiring deep equality, I wonder? Is it possible that we ferry a header around by reference far enough that we forget how to get its ID, rather than actually comparing blocks from different sources? In other words, could we try pointer equality optimistically and degrade to the full comparison when the pointers don't match?
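A rough sketch of that optimistic pointer-equality check; sameHeader and headersEqual are hypothetical helpers, not part of this PR.

```go
func sameHeader(a, b *Header) bool {
	if a == b {
		return true // same object: covers the "ferried around by reference" case cheaply
	}
	if a == nil || b == nil {
		return false
	}
	return headersEqual(*a, *b) // fall back to the full ~1000-byte comparison
}
```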

@AlexHentschel
Member

@huitseeker @ramtinms I view this as a temporary solution. Hence, I am not concerned if we are not achieving maximum performance with this patch. I created an issue describing what I think a full solution should look like: https://github.com/dapperlabs/flow-go/issues/5853. It is going to be a bit work-intensive, but hopefully not too bad 😅

@jwinkler2083233 jwinkler2083233 merged commit 6f4757a into master Sep 16, 2021
@jwinkler2083233 jwinkler2083233 deleted the jwinkler2083233/5841-cache-header-id branch September 16, 2021 19:28
@@ -95,6 +95,8 @@ func Test_AsyncUploader(t *testing.T) {
require.Equal(t, 3, callCount)
})

time.Sleep(1 * time.Second)
Member

Is this added by accident?

mutexHeader.Lock()

// unlock at the return
defer mutexHeader.Unlock()
Member

I don't quite understand this change. Wouldn't this lock block two different threads computing IDs of two headers?

If engine A is calling header1.ID() and engine B is calling header2.ID(), before this change they could run concurrently, but with this change one engine has to block and wait until the other finishes its call, no?
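To illustrate the concern: with a single package-level mutex, the two calls below cannot overlap even though they operate on unrelated headers (hypothetical example; header1 and header2 are assumed to be distinct Header values).

```go
var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); _ = header1.ID() }() // takes the global lock
go func() { defer wg.Done(); _ = header2.ID() }() // blocks until the first call releases it
wg.Wait()
```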

durkmurder added a commit that referenced this pull request Sep 21, 2021
* wip

* update mutator tests

* update epoch lockup tests

* update committee leader selection tests

* update comments and consolidated code a bit

* update sampling of entities

* minor polishing

* adding the peer manager back to the staked AN which supports the unstaked AN; adding option to disable connection pruning

* consolidated EECC logic in Protocol State mutator

* changing DisableConnectionPruning to  WithConnectionPruning

* Update engine/collection/synchronization/engine.go

Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>

* Replaced return with continue

* added access-tests to local integration tests

* Update state/protocol/badger/mutator.go

* copy epoch status when in EECC

* Replaced log for fatal

* adjust numbers

* Remove temp local cadence version

* define new staked topic validator

* Rename ExtractPeerID

* Added test

* address comments

* Add new methods to identity provider

* modify api

* update fixed_provider

* address some comments

* move messagesigningid

* Fix Middleware test

* fix topic validator test

* Move key translation to its own package, break circular dependency

* Update libp2pUtils.go

* Update contLoadGenerator.go

* Update middleware_test.go

* add error logging for topic validator unregister

* rename GetIdentity -> Identity

* Update middleware_test.go

* fix lint

* adding comments to the topic validator test (#1253)

* adding a comment to middelware.Subscribe describing the new change to add topic validators for stake nodes

* goimports

* making middleware.topologyPeers private again

* Update state/protocol/badger/mutator.go

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>

* fix formatting

* Update engine.go

* update Cadence to commit 283fa4b180bf44abe9b46455ef0028ab5cb04226

* makes time to live exportable

* activates dns resolver on libp2p node

* adds dns ttl to scaffold

* adds dns cache ttl to base config

* adds flag for dns node cache ttl

* refactors scaffold with base config for dns cache

* update Cadence to commit eeb3fa23d8754367f93e5f725acd7e7f50a4579b

* adds dns cache for staked access node

* Sync with latest cadence changes

* adds dns cache for unstaked access node

* refactors tests with resolver

* fixes an error

* Dump removed contract codes to a file

* Fix nested type infering

* Invert the storage cleanup flag

* Refactor code

* add separate fallback-explicit EpochForView method

* refactors builder

* refactors resolver invocations

* Fix storage migration V5 tests

* fix mocked function

* Update to Cadence v0.19.0

* stripping out addresses from the root protocol snapshot for the unstaked AN

* Fix fvm payloads getting skipped

* Use cadence v0.19.0

* Use cadence v0.19.0 for intergration

* Update storage migration V4 to latest cadence changes

* Fix tests

* Remove previous storage migration V4 tests

* Fix linter errors

* skipping all validations which involve flow.Identity for the unstaked AN

* Removing unecessary fmt.Println

* Fix --no-migration flag for state extract

* Extracted core from proposal engine

* Renamed proposal to compliance engine

* Updated metrics. Updated compliaence engine for collection nodes to follow new design

* fix bug with tracer sensitivity

* add isSampled flag

* minor fix

* update mock files

* mock update

* clean up

* clean up

* hot fix

* clean up

* fix tests

* fix tests

* fix test

* test fix

* handle empty collection case

* test fix

* Fixed tests for core

* Updated core implementation

* Updated majority of engine tests

* Fixed compilation, linted

* Misc updated to godoc. Linted. Updated labels

* update Cadence to commit 2759300ebf616214cf8a04287f4379bb666b7c98

* Introduced LifecycleManager into hotstuff event loop, compliance engines

* Add block final state commitment

* adding an extra skipNwAddressRelatedValidations parameter to skip validations which use network addresses for the unstaked access node

* Update request_handler.go

* add random delay to periodic peer update thread

We are observing synchronized goroutine spikes causing consensus
finalization issues on large committees on benchet, with the same period
as this peer update process. This adds a random delay before spawning
the thread which requests updates, so that updates are not synchronised.

* update core-contracts version 0.7.7->0.7.9

This new version includes a critical bug fix for the FlowClusterQC
contract

* exit goroutine launched periodically while in initial delay

* add test case for stopping during delay

* add new contract contructor argument

* update state commitments

* tests

* incorporating review comments

* adding comments

* fixing the integration test framework

* minor update to goDoc

* refactoring and removing the sentinel NetworkAddressError

* Improve error handling in access node

* Remove unnecessary error wrapping

* uncommenting code to remove IDs from custers

* reverting previous change to result ID verification

* handle MarkFlagRequired errors

* add testing.T to machine account fixture

* add log hook test utility

* extend machine account tests to check log warn cases

* Updated error handling for consensus engine

* adding unit test

* Update engine_test.go

* update Cadence to commit 05fdc6cb695dca1ccc17b26eccb1d123a2049556

* update Cadence to commit 3d1926c0c05a7a661e9fb192d0bba661afc0d4c1

* Jaeger version update

* make tidy

* don't process empty spans

* go mod clean up

* update Cadence to commit 16d477f6d7d63ae9048f955e5ec22f52337f915c

* small optimization

* [Verification] Adds telemetry for outstanding chunk data pack requests (#1290)

* adds max chunk data pack attempts metric

* adds set max chunk data pack attempts at requester to interface

* generates mocks

* wires in metrics

* fixes TestCompleteRequestingUnsealedChunkLifeCycle

* refactors a log

* fixes TestRequestPendingChunkSealedBlock

* fixes TestRequestPendingChunkSealedBlock_Hybrid

* fixes TestDispatchingRequests_Hybrid(

* adds new metric to demo examples

* fixes tests

* refactors metric to track next unsealed height chunk attempts

* Update module/metrics/verification.go

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>

* refactors metrics name

* refactors metrics name

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>

* fix timetracking

* clean up exec node traces

* cleanup

* reduce nesting

* performance: cache the ID for Header entities (#1279)

* performance: cache the ID for Header entities

* performance: identify unit tests that have race conditions.

This has fixes for unit tests that were failing due to race conditions
that became apparent when the cache sped up the MakeID() call.

* performance:  adjustments for race condition tests and thread-safety

* performance:  fix lint problem

* Update compactor_test.go

* Update compactor_test.go

* Update cluster_switchover_test.go

* Update compactor_test.go

* Update cluster_switchover_test.go

* Update engine_test.go

* Update header.go

* Update header.go

* Update compactor_test.go

* performance: adjusted this file for linter

* trigger GitHub actions

* trigger GitHub actions

* trigger GitHub actions

* trigger GitHub actions

Co-authored-by: Jeff Winkler <jeff.winkler@dapperlabs.com>

* lower emergency sealing threshold (#1301)

* make tidy

* Update model/flow/identifier.go

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>

* applying PR comments

* refactor sampling metod

* hot fix

* hot fix

* Revert "refactor sampling metod"

This reverts commit edcd95b.

* rename save State

* cleanup unnecesary data

* add chainID as a tag on root span

* Update model/flow/identifier.go

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>

* fix matching engine (#1294)

* hot fix

* make tidy

Co-authored-by: vishal <1117327+vishalchangrani@users.noreply.github.com>
Co-authored-by: Ramtin M. Seraj <ramtinms@users.noreply.github.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
Co-authored-by: ramtinms <ramtin@axiomzen.co>
Co-authored-by: Yurii Oleksyshyn <yuraolex@gmail.com>
Co-authored-by: Misha Rybalov <gomisha>
Co-authored-by: Supun Setunga <supun.setunga@gmail.com>
Co-authored-by: Simon Zhu <simon.zsiyan@gmail.com>
Co-authored-by: François Garillot <francois.garillot@dapperlabs.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: Kan Zhang <kan@axiomzen.co>
Co-authored-by: Kay-Zee <Kay-Zee@users.noreply.github.com>
Co-authored-by: Yahya <yahya@dapperlabs.com>
Co-authored-by: SupunS <SupunS@users.noreply.github.com>
Co-authored-by: Yahya Hassanzadeh <yhassanzadeh13@ku.edu.tr>
Co-authored-by: Bastian Müller <bastian@axiomzen.co>
Co-authored-by: Maks Pawlak <120831+m4ksio@users.noreply.github.com>
Co-authored-by: Peter Argue <89119817+peterargue@users.noreply.github.com>
Co-authored-by: Janez Podhostnik <janez.podhostnik@gmail.com>
Co-authored-by: janezpodhostnik <janezpodhostnik@users.noreply.github.com>
Co-authored-by: turbolent <turbolent@users.noreply.github.com>
Co-authored-by: Jeff Winkler <jwinkler2083233@users.noreply.github.com>
Co-authored-by: Jeff Winkler <jeff.winkler@dapperlabs.com>
Co-authored-by: Janez Podhostnik <67895329+janezpodhostnik@users.noreply.github.com>
bors bot added a commit that referenced this pull request Jul 6, 2022
2704: Remove header caching r=SaveTheRbtz a=SaveTheRbtz

This one-item cache introduces a global lock. The code is also quite brittle and can cause bugs if the `Header` struct changes.

This diff is a no-op on Skylake:
```
name                              old us/tx    new us/tx    delta
ComputeBlock/1/cols/16/txes-16     1.44k ± 0%   1.46k ± 3%    ~     (p=0.099 n=9+10)
ComputeBlock/1/cols/32/txes-16     1.37k ± 1%   1.38k ± 1%  +0.60%  (p=0.007 n=9+10)
ComputeBlock/1/cols/64/txes-16     1.36k ± 1%   1.35k ± 0%  -0.51%  (p=0.014 n=10+8)
ComputeBlock/1/cols/128/txes-16    1.34k ± 1%   1.34k ± 0%    ~     (p=0.673 n=10+9)
ComputeBlock/4/cols/16/txes-16     1.36k ± 2%   1.36k ± 1%    ~     (p=0.237 n=10+10)
ComputeBlock/4/cols/32/txes-16     1.34k ± 1%   1.34k ± 1%    ~     (p=0.329 n=10+10)
ComputeBlock/4/cols/64/txes-16     1.35k ± 1%   1.34k ± 2%    ~     (p=0.117 n=9+10)
ComputeBlock/4/cols/128/txes-16    1.36k ± 2%   1.35k ± 2%    ~     (p=0.287 n=9+10)
ComputeBlock/16/cols/16/txes-16    1.33k ± 1%   1.32k ± 1%    ~     (p=0.643 n=10+10)
ComputeBlock/16/cols/32/txes-16    1.32k ± 1%   1.32k ± 1%    ~     (p=0.202 n=9+10)
ComputeBlock/16/cols/64/txes-16    1.35k ± 1%   1.35k ± 1%    ~     (p=0.957 n=10+10)
ComputeBlock/16/cols/128/txes-16   1.32k ± 1%   1.32k ± 2%    ~     (p=0.643 n=10+10)
```

And a 3% improvement on M1 Pro:
```
$ go test -tags relic -bench='BenchmarkComputeBlock' -count 10 -run='^$' ./engine/execution/computation/.
...
$ benchstat old new
name                             old us/tx    new us/tx     delta
ComputeBlock/1/cols/16/txes-8       590 ± 3%      803 ±62%    ~     (p=0.404 n=10+10)
ComputeBlock/1/cols/32/txes-8       554 ± 2%      559 ± 2%  +0.93%  (p=0.043 n=9+8)
ComputeBlock/1/cols/64/txes-8       557 ± 4%      538 ± 2%  -3.46%  (p=0.000 n=10+10)
ComputeBlock/1/cols/128/txes-8      560 ± 1%      538 ± 2%  -3.92%  (p=0.000 n=7+9)
ComputeBlock/4/cols/16/txes-8       558 ± 3%      539 ± 1%  -3.51%  (p=0.000 n=10+10)
ComputeBlock/4/cols/32/txes-8       555 ± 1%      540 ± 1%  -2.81%  (p=0.000 n=10+10)
ComputeBlock/4/cols/64/txes-8       556 ± 2%      538 ± 2%  -3.13%  (p=0.000 n=10+10)
ComputeBlock/4/cols/128/txes-8      546 ± 1%      527 ± 2%  -3.50%  (p=0.000 n=10+10)
ComputeBlock/16/cols/16/txes-8      552 ± 3%      533 ± 2%  -3.44%  (p=0.000 n=10+10)
ComputeBlock/16/cols/32/txes-8      548 ± 1%      527 ± 2%  -3.68%  (p=0.000 n=9+10)
ComputeBlock/16/cols/64/txes-8      548 ± 1%      530 ± 1%  -3.34%  (p=0.000 n=10+10)
ComputeBlock/16/cols/128/txes-8     522 ± 1%      519 ± 2%    ~     (p=0.327 n=9+9)
```

cc: `@zhangchiqing` since he was also looking at it here #1279 (comment)

Co-authored-by: Alexey Ivanov <rbtz@dapperlabs.com>