Fix race caused by design of native Neo validators/committee cache #3110

Merged
merged 7 commits into from
Oct 10, 2023

Conversation

AnnaShaleva
Member

@AnnaShaleva AnnaShaleva commented Aug 29, 2023

Fix validators/committee cache of native Neo contract. See the bug description in the commit messages.

Close #2989 along the way.

@AnnaShaleva
Member Author

And I'd add this PR to the 0.102.0 milestone because of the last commit. We don't want this bug to be present in our next release.

@AnnaShaleva AnnaShaleva added this to the v0.102.0 milestone Aug 29, 2023
@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Merging #3110 (beba0f0) into master (eeb439f) will decrease coverage by 0.04%.
Report is 7 commits behind head on master.
The diff coverage is 87.23%.

❗ Current head beba0f0 differs from pull request most recent head d964420. Consider uploading reports for the commit d964420 to get more accurate results

@@            Coverage Diff             @@
##           master    #3110      +/-   ##
==========================================
- Coverage   84.91%   84.87%   -0.04%     
==========================================
  Files         330      330              
  Lines       44337    44365      +28     
==========================================
+ Hits        37648    37657       +9     
- Misses       5172     5198      +26     
+ Partials     1517     1510       -7     
| Files | Coverage Δ |
|---|---|
| pkg/compiler/codegen.go | 92.05% <100.00%> (ø) |
| pkg/consensus/consensus.go | 75.96% <100.00%> (+1.07%) ⬆️ |
| pkg/core/blockchain.go | 78.63% <100.00%> (ø) |
| pkg/core/storage/boltdb_store.go | 78.76% <100.00%> (ø) |
| pkg/core/native/native_neo.go | 82.32% <84.61%> (+0.18%) ⬆️ |

... and 6 files with indirect coverage changes


roman-khimov added a commit that referenced this pull request Aug 29, 2023
Inspired by #3110.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
@AnnaShaleva AnnaShaleva force-pushed the fix-tests branch 2 times, most recently from 5df29aa to 7db8a5a Compare August 30, 2023 11:08
@AnnaShaleva
Member Author

@roman-khimov, I've fixed the latter problem, let's wait until the PR tests finish. Unfortunately, I can't reproduce these test failures locally even with -race.

@AnnaShaleva
Member Author

AnnaShaleva commented Aug 30, 2023

OK, now we have a race on config.Version between different test runs, so the first fix doesn't work:

2023-08-30T11:14:47.3084055Z === RUN   TestNEP11_ND_OwnerOf_BalanceOf_Transfer
2023-08-30T11:14:47.3084315Z ==================
2023-08-30T11:14:47.3084541Z WARNING: DATA RACE
2023-08-30T11:14:47.3084827Z Write at 0x0000026ba670 by goroutine 347:
2023-08-30T11:14:47.3085338Z   github.com/nspcc-dev/neo-go/internal/testcli.NewExecutorWithConfig()
2023-08-30T11:14:47.3086038Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:198 +0x4bd
2023-08-30T11:14:47.3086557Z   github.com/nspcc-dev/neo-go/internal/testcli.NewExecutor()
2023-08-30T11:14:47.3087195Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:176 +0x54
2023-08-30T11:14:47.3087796Z   github.com/nspcc-dev/neo-go/cli/nep_test_test.TestNEP11_ND_OwnerOf_BalanceOf_Transfer()
2023-08-30T11:14:47.3088465Z       /home/runner/work/neo-go/neo-go/cli/nep_test/nep11_test.go:113 +0x3e
2023-08-30T11:14:47.3088801Z   testing.tRunner()
2023-08-30T11:14:47.3089276Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
2023-08-30T11:14:47.3089609Z   testing.(*T).Run.func1()
2023-08-30T11:14:47.3090085Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1493 +0x47
2023-08-30T11:14:47.3090305Z 
2023-08-30T11:14:47.3090458Z Previous read at 0x0000026ba670 by goroutine 104:
2023-08-30T11:14:47.3090928Z   github.com/nspcc-dev/neo-go/pkg/network.(*Server).Start()
2023-08-30T11:14:47.3091538Z       /home/runner/work/neo-go/neo-go/pkg/network/server.go:288 +0x670
2023-08-30T11:14:47.3092074Z   github.com/nspcc-dev/neo-go/internal/testcli.NewTestChain.func2()
2023-08-30T11:14:47.3092887Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:167 +0x39
2023-08-30T11:14:47.3093113Z 
2023-08-30T11:14:47.3093247Z Goroutine 347 (running) created at:
2023-08-30T11:14:47.3093526Z   testing.(*T).Run()
2023-08-30T11:14:47.3093995Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1493 +0x75d
2023-08-30T11:14:47.3094631Z   testing.runTests.func1()
2023-08-30T11:14:47.3095129Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1846 +0x99
2023-08-30T11:14:47.3095442Z   testing.tRunner()
2023-08-30T11:14:47.3096051Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
2023-08-30T11:14:47.3096382Z   testing.runTests()
2023-08-30T11:14:47.3096857Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1844 +0x7ec
2023-08-30T11:14:47.3097162Z   testing.(*M).Run()
2023-08-30T11:14:47.3097636Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1726 +0xa84
2023-08-30T11:14:47.3097952Z   main.main()
2023-08-30T11:14:47.3098250Z       _testmain.go:59 +0x2e9
2023-08-30T11:14:47.3098409Z 
2023-08-30T11:14:47.3098541Z Goroutine 104 (finished) created at:
2023-08-30T11:14:47.3099030Z   github.com/nspcc-dev/neo-go/internal/testcli.NewTestChain()
2023-08-30T11:14:47.3099683Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:167 +0x1071
2023-08-30T11:14:47.3100247Z   github.com/nspcc-dev/neo-go/internal/testcli.NewExecutorWithConfig()
2023-08-30T11:14:47.3100919Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:201 +0x524
2023-08-30T11:14:47.3101427Z   github.com/nspcc-dev/neo-go/internal/testcli.NewExecutor()
2023-08-30T11:14:47.3102061Z       /home/runner/work/neo-go/neo-go/internal/testcli/executor.go:176 +0x54
2023-08-30T11:14:47.3102588Z   github.com/nspcc-dev/neo-go/cli/nep_test_test.TestNEP11Import()
2023-08-30T11:14:47.3103218Z       /home/runner/work/neo-go/neo-go/cli/nep_test/nep11_test.go:35 +0x3e
2023-08-30T11:14:47.3103524Z   testing.tRunner()
2023-08-30T11:14:47.3104005Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
2023-08-30T11:14:47.3104338Z   testing.(*T).Run.func1()
2023-08-30T11:14:47.3104849Z       /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1493 +0x47
2023-08-30T11:14:47.3105148Z ==================
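
The log above shows a package-level value being written by one test's setup while a goroutine left over from an earlier test reads it. As a rough illustration only (this is not necessarily how the PR fixed it), one common way to avoid this kind of race is to make sure the write happens at most once per process. The names `version`, `setVersionOnce` and `newExecutor` are hypothetical stand-ins, not neo-go identifiers.

```go
package main

import (
	"fmt"
	"sync"
)

// version is a hypothetical package-level variable standing in for a
// global such as config.Version.
var (
	version     string
	versionOnce sync.Once
)

// setVersionOnce performs the write at most once per process, so a later
// test setup cannot write while an older goroutine is still reading.
func setVersionOnce(v string) {
	versionOnce.Do(func() { version = v })
}

// newExecutor is a hypothetical test helper: it initialises the global
// and starts a background goroutine that reads it.
func newExecutor(wg *sync.WaitGroup) {
	setVersionOnce("0.0.0-test")
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = version // read-only access from the background goroutine
	}()
}

func main() {
	var wg sync.WaitGroup
	// Two "tests" in a row: the second setup no longer writes the global.
	newExecutor(&wg)
	newExecutor(&wg)
	wg.Wait()
	fmt.Println("version:", version)
}
```

With the unguarded assignment `version = v` in place of `setVersionOnce`, running two such setups under `go run -race` could report a write/read pair similar to the one in the log.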

@AnnaShaleva AnnaShaleva marked this pull request as draft August 30, 2023 11:37
@AnnaShaleva AnnaShaleva force-pushed the fix-tests branch 4 times, most recently from a908829 to d27cc55 Compare August 30, 2023 11:58
@AnnaShaleva AnnaShaleva force-pushed the fix-tests branch 3 times, most recently from 9b83c52 to 4184e40 Compare August 30, 2023 15:37
pkg/consensus/consensus.go
@@ -725,7 +725,7 @@ func (s *service) newBlockFromContext(ctx *dbft.Context) block.Block {
var err error
cfg := s.Chain.GetConfig().ProtocolConfiguration
if cfg.ShouldUpdateCommitteeAt(ctx.BlockIndex) {
Member

And this one as well.

Member Author

And this one is definitely not an optimisation so far.

Member

And here as well.

@AnnaShaleva AnnaShaleva force-pushed the fix-tests branch 2 times, most recently from 555e3e8 to 8103179 Compare August 30, 2023 17:15
@AnnaShaleva
Member Author

OK, @roman-khimov, something is wrong with the last commit that optimizes cache usage. The rest of the commits work fine, although I haven't tested them in a hybrid network scenario. Could you take a look, please?

@AnnaShaleva AnnaShaleva added the bug Something isn't working label Oct 9, 2023
@AnnaShaleva
Member Author

AnnaShaleva commented Oct 9, 2023

@roman-khimov, the overall solution that is shown in the Files changed section is valid and solves the problem. However, some of the PR's commits are not valid if taken alone, separately from the others. Would it be better to squash these commits? At the same time, each commit represents an isolated change of the Neo cache.

@AnnaShaleva AnnaShaleva marked this pull request as ready for review October 9, 2023 17:02
@AnnaShaleva AnnaShaleva changed the title *: fix failing tests, part 1 Fix race caused by design of native Neo validators/committee cache Oct 9, 2023
@roman-khimov
Member

e3fadde has some commented out code that is removed afterwards. But otherwise it seems to be OK.

@AnnaShaleva
Member Author

OK, I've rechecked this code and the logic seems correct to me. I hope there are no bugs in the updated logic.

Blockchain passes its own pure unwrapped DAO to
(*Blockchain).ComputeNextBlockValidators, which means that the native
RW NEO cache structure stored inside this DAO can be modified by
anyone who uses the exported ComputeNextBlockValidators Blockchain API.
Technically that's valid, and we should allow it, because it's
the only purpose of `validators` caching. However, at the same time
some RPC server is allowed to request a subsequent wrapped DAO for
some test invocation. It means that the descendant wrapped DAO
will eventually request the RW NEO cache and try to `Copy()`
the underlying DAO's cache, which is in direct use by
ComputeNextBlockValidators. Here's the race:
ComputeNextBlockValidators called by the Consensus service tries to
update the cached `validators` value, and the descendant wrapped DAO
created by the RPC server tries to copy the DAO's native cache and
read the cached `validators` value.

So the problem is that the native cache is not designed to handle
concurrent access between the parent DAO layer and a derived (wrapped)
DAO layer. I've carefully reviewed all the usages of the native cache,
and it turns out that the described situation is the only place where
the parent DAO is used directly to modify its cache concurrently with
some descendant DAO that is trying to access the cache. All other
usages of the native cache (not only NEO, but also all other native
contracts) strictly rely on the hierarchical DAO structure and don't
try to perform these concurrent operations between DAO layers.
There's also the persist operation, but it keeps the cache RW lock
taken, so it doesn't have this problem so far. Thus, in this commit we
rework NEO's `validators` cache value so that it always contains the
relevant list for the upper Blockchain's DAO and is updated every
PostPersist (if needed).

Note: we must be very careful when extending our native cache in the
future; every usage of the native cache must be checked against the
described problem.

Close #2989.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
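
A minimal sketch of the idea in the commit message above, under simplifying assumptions: the read path never writes to the cache, and the only writer is a PostPersist-like step. All names here (`neoCache`, `layer`, `wrap`, `nextValidators`, `postPersist`) are hypothetical and do not mirror the real neo-go types; locking around persist is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// neoCache is a hypothetical stand-in for the native NEO cache.
type neoCache struct {
	validators []string
}

// Copy produces an independent cache for a derived (wrapped) layer.
func (c *neoCache) Copy() *neoCache {
	return &neoCache{validators: append([]string(nil), c.validators...)}
}

// layer is a hypothetical DAO layer holding its own cache.
type layer struct {
	cache *neoCache
}

// wrap creates a descendant layer with a private copy of the cache.
func (l *layer) wrap() *layer {
	return &layer{cache: l.cache.Copy()}
}

// nextValidators is read-only: it never updates the cache, so it can run
// concurrently with wrap() on the same layer.
func (l *layer) nextValidators() []string {
	return l.cache.validators
}

// postPersist is the only place where the cached list is replaced.
func (l *layer) postPersist(newValidators []string) {
	l.cache.validators = newValidators
}

func main() {
	top := &layer{cache: &neoCache{}}
	top.postPersist([]string{"v1", "v2"})

	// Consensus-like readers and RPC-like wrappers run concurrently;
	// nobody writes to the parent cache during this phase.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); _ = top.nextValidators() }()
		go func() { defer wg.Done(); _ = top.wrap().nextValidators() }()
	}
	wg.Wait()
	fmt.Println(top.nextValidators())
}
```

The point of the design is that a lazily filled cache turns a logical read into a write on the parent layer, while a cache refreshed only during block persisting keeps the public read path free of writes.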
We have two similar blockchain APIs: GetNextBlockValidators and GetValidators.
It's hard to distinguish them, so the naming is changed to match the meaning.
What we have now is:

GetNextBlockValidators literally just returns the top of the committee that
was elected at the start of the current batch of CommitteeSize blocks. It
doesn't change its value every block.

ComputeNextBlockValidators literally computes the list of validators based on
the freshest committee members information retrieved from the NeoToken's
storage and based on the latest register/unregister/vote events. The list
returned by this method may be updated every block.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
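
To make the difference between the two APIs concrete, here is a small sketch under simplifying assumptions: candidates are ranked by votes only, and the elected committee is a plain slice fixed at the start of the epoch. The `chain` and `candidate` types and the lower-case method names are hypothetical and only mimic the behaviour described above.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical registered candidate with its vote count.
type candidate struct {
	key   string
	votes int
}

// chain is a hypothetical, heavily simplified blockchain view.
type chain struct {
	validatorsCount int
	candidates      []candidate // freshest registrations and votes
	elected         []string    // committee fixed at the start of the epoch
}

// top returns a copy of the first n keys (or fewer if the list is short).
func top(keys []string, n int) []string {
	if n > len(keys) {
		n = len(keys)
	}
	return append([]string(nil), keys[:n]...)
}

// getNextBlockValidators returns the top of the elected committee; the
// result stays the same for every block of the epoch.
func (c *chain) getNextBlockValidators() []string {
	return top(c.elected, c.validatorsCount)
}

// computeNextBlockValidators ranks the current candidates by votes, so the
// result may change after any register/unregister/vote event.
func (c *chain) computeNextBlockValidators() []string {
	sorted := append([]candidate(nil), c.candidates...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].votes > sorted[j].votes
	})
	keys := make([]string, 0, len(sorted))
	for _, cand := range sorted {
		keys = append(keys, cand.key)
	}
	return top(keys, c.validatorsCount)
}

func main() {
	c := &chain{
		validatorsCount: 2,
		candidates:      []candidate{{"a", 1}, {"b", 5}, {"c", 3}},
		elected:         []string{"a", "c", "b"},
	}
	fmt.Println("elected for this epoch:", c.getNextBlockValidators())
	fmt.Println("computed from votes:   ", c.computeNextBlockValidators())
}
```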
Recalculate them once per epoch. Consensus is aware of it and must
call CalculateNextValidators exactly when needed.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
…ors`

Adjust all comments, make the field name match its meaning.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
No functional changes, just refactoring. It doesn't need the whole cache,
only the set of committee keys with votes.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
Do not recalculate the new committee/validators value at the start of every
subsequent epoch. Use the values that were calculated in the PostPersist method
of the previously processed block at the end of the previous epoch.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
If it's the end of the epoch, then it contains the updated validators list
recalculated during the last block's PostPersist. If it's the middle of the
epoch, then it contains the previously calculated value (the value for the
previous completed epoch), which is equal to the current nextValidators cache
value.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
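
The last two commit messages can be read as a two-slot scheme: one value used during the current epoch and one prepared for the next. The sketch below is an assumption about the shape of that scheme, not the actual neo-go code: `epochCache`, `isEpochStart`, `onPersist` and `postPersist` are hypothetical names, and the epoch boundary is assumed to be every block whose height is a multiple of the committee size.

```go
package main

import "fmt"

// epochCache is a hypothetical two-slot cache: the list in use for the
// current epoch and the list prepared for the next one.
type epochCache struct {
	committeeSize          uint32
	nextValidators         []string
	newEpochNextValidators []string
}

// isEpochStart assumes a new epoch begins at every multiple of
// committeeSize.
func (c *epochCache) isEpochStart(height uint32) bool {
	return height%c.committeeSize == 0
}

// onPersist switches to the prepared list at the first block of an epoch
// without recalculating anything.
func (c *epochCache) onPersist(height uint32) {
	if c.isEpochStart(height) {
		c.nextValidators = c.newEpochNextValidators
	}
}

// postPersist recalculates the list only for the last block of an epoch.
// In the middle of an epoch the prepared slot is left untouched.
func (c *epochCache) postPersist(height uint32, compute func() []string) {
	if c.isEpochStart(height + 1) {
		c.newEpochNextValidators = compute()
	}
}

func main() {
	c := &epochCache{
		committeeSize:          3,
		nextValidators:         []string{"genesis"},
		newEpochNextValidators: []string{"genesis"},
	}
	calls := 0
	compute := func() []string {
		calls++
		return []string{fmt.Sprintf("elected-%d", calls)}
	}
	for h := uint32(1); h <= 6; h++ {
		c.onPersist(h)
		c.postPersist(h, compute)
		fmt.Println(h, c.nextValidators, calls)
	}
}
```

Over six blocks with a committee size of 3, `compute` runs twice (after blocks 2 and 5), and the switch to the new list happens at blocks 3 and 6.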
@AnnaShaleva
Member Author

Tests are checked, TestWalletClaimGas isn't failing any more (at least, over several subsequent runs). Ready for review and merge.

@roman-khimov roman-khimov merged commit 8fcf1ee into master Oct 10, 2023
12 of 16 checks passed
@roman-khimov roman-khimov deleted the fix-tests branch October 10, 2023 11:29
@AnnaShaleva AnnaShaleva mentioned this pull request Oct 20, 2023
AnnaShaleva added a commit that referenced this pull request Nov 2, 2023
The refactored native NeoToken cache scheme introduced in #3110 sometimes requires
validators list recalculation during the native cache initialization process (when
initializing with the existing storage from the block that precedes each N-th block).
To recalculate validators from candidates, native NeoToken needs access to the
cached native Policy blocked accounts. At the moment of native Neo initialization,
the cache of native Policy is not yet initialized, thus we need direct DAO access
for Policy to handle the blocked account check.

Close #3181.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
AnnaShaleva added a commit that referenced this pull request Nov 2, 2023
The refactored native NeoToken cache scheme introduced in #3110 sometimes requires
validators list recalculation during the native cache initialization process (when
initializing with the existing storage from the block that precedes each N-th block).
To recalculate validators from candidates, native NeoToken needs access to the
cached native Policy blocked accounts. At the moment of native Neo initialization,
the cache of native Policy is not yet initialized, thus we need direct DAO access
for Policy to handle the blocked account check.

Close #3181.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race detected during TestWalletClaimGas test execution
2 participants