Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix beacon-chain/sync Test Race #390

Merged

Conversation

fgimenez
Copy link
Contributor

@fgimenez fgimenez commented Aug 8, 2018

Towards #377

Running the tests with --features=race I was getting:

WARNING: DATA RACE
Write at 0x00c42011d5a8 by goroutine 35:
  github.com/prysmaticlabs/prysm/beacon-chain/sync.TestSavingBlocksInSync()
      beacon-chain/sync/service_test.go:832 +0xb45
  testing.tRunner()
      GOROOT/src/testing/testing.go:777 +0x16d

Previous read at 0x00c42011d5a8 by goroutine 36:
  github.com/prysmaticlabs/prysm/beacon-chain/sync.(*Service).initialSync()
      bazel-out/k8-fastbuild/bin/proto/beacon/p2p/v1/linux_amd64_race_stripped/v1_go_proto~/github.com/prysmaticlabs/prysm/proto/beacon/p2p/v1/messages.pb.go:223 +0x946
  github.com/prysmaticlabs/prysm/beacon-chain/sync.TestSavingBlocksInSync.func1()
      beacon-chain/sync/service_test.go:759 +0x74

Goroutine 35 (running) created at:
  testing.(*T).Run()
      GOROOT/src/testing/testing.go:824 +0x564
  testing.runTests.func1()
      GOROOT/src/testing/testing.go:1063 +0xa4
  testing.tRunner()
      GOROOT/src/testing/testing.go:777 +0x16d
  testing.runTests()
      GOROOT/src/testing/testing.go:1061 +0x4e1
  testing.(*M).Run()
      GOROOT/src/testing/testing.go:978 +0x2cd
  main.main()
      bazel-out/k8-fastbuild/bin/beacon-chain/sync/linux_amd64_race_stripped/go_default_test~/testmain.go:97 +0x28e

Goroutine 36 (running) created at:
  github.com/prysmaticlabs/prysm/beacon-chain/sync.TestSavingBlocksInSync()
      beacon-chain/sync/service_test.go:758 +0x212
  testing.tRunner()
      GOROOT/src/testing/testing.go:777 +0x16d

There was a problem with reusing msg1 in the test and changing members while the initialization goroutine was still active. It is fixed by generating different block response messages for each assertion.

@codecov
Copy link

codecov bot commented Aug 8, 2018

Codecov Report

Merging #390 into master will decrease coverage by 0.22%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
- Coverage   52.91%   52.69%   -0.23%     
==========================================
  Files          33       33              
  Lines        2695     2695              
==========================================
- Hits         1426     1420       -6     
- Misses       1098     1105       +7     
+ Partials      171      170       -1
Impacted Files Coverage Δ
client/contracts/sharding_manager.go 24.61% <0%> (-1.93%) ⬇️
beacon-chain/rpc/service.go 100% <0%> (+8.88%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2060d29...95e8de2. Read the comment docs.

@prestonvanloon
Copy link
Member

Thanks for this! Does this solve all of the race condition issues or just one?

If it solves all, can you add --features=race to the travis test?

@fgimenez
Copy link
Contributor Author

fgimenez commented Aug 9, 2018

Does this solve all of the race condition issues or just one?

@prestonvanloon just one, there are currently 4 race conditions on master:

INFO: 104 processes: 104 linux-sandbox.
//beacon-chain/blockchain:go_default_test                       (cached) PASSED in 1.2s
//beacon-chain/powchain:go_default_test                         (cached) PASSED in 1.2s
//beacon-chain/simulator:go_default_test                        (cached) PASSED in 1.2s
//beacon-chain/utils:go_default_test                            (cached) PASSED in 1.1s
//client/attester:go_default_test                               (cached) PASSED in 1.5s
//client/contracts:go_default_test                              (cached) PASSED in 296.9s
//client/mainchain:go_default_test                              (cached) PASSED in 3.2s
//client/observer:go_default_test                               (cached) PASSED in 1.8s
//client/params:go_default_test                                 (cached) PASSED in 1.1s
//client/txpool:go_default_test                                 (cached) PASSED in 1.1s
//client/types:go_default_test                                  (cached) PASSED in 1.2s
//contracts:go_default_test                                     (cached) PASSED in 1.4s
//shared:go_default_test                                        (cached) PASSED in 18.1s
//shared/cmd:go_default_test                                    (cached) PASSED in 1.2s
//shared/database:go_default_test                               (cached) PASSED in 1.4s
//shared/p2p:go_feed_concurrent_write_test                      (cached) PASSED in 1.7s
//beacon-chain/node:go_default_test                                      PASSED in 1.9s
//beacon-chain/rpc:go_default_test                                       PASSED in 1.1s
//client/node:go_default_test                                            PASSED in 1.8s
//client/rpcclient:go_default_test                                       PASSED in 1.1s
//beacon-chain/sync:go_default_test                                      FAILED in 5 out of 5 in 0.2s
  Stats over 5 runs: max = 0.2s, min = 0.1s, avg = 0.2s, dev = 0.0s
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/beacon-chain/sync/go_default_test/test_attempts/attempt_1.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/beacon-chain/sync/go_default_test/test_attempts/attempt_2.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/beacon-chain/sync/go_default_test/test_attempts/attempt_3.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/beacon-chain/sync/go_default_test/test_attempts/attempt_4.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/beacon-chain/sync/go_default_test/test.log
//client/proposer:go_default_test                                        FAILED in 5 out of 5 in 3.8s
  Stats over 5 runs: max = 3.8s, min = 3.7s, avg = 3.8s, dev = 0.1s
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/proposer/go_default_test/test_attempts/attempt_1.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/proposer/go_default_test/test_attempts/attempt_2.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/proposer/go_default_test/test_attempts/attempt_3.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/proposer/go_default_test/test_attempts/attempt_4.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/proposer/go_default_test/test.log
//client/syncer:go_default_test                                          FAILED in 5 out of 5 in 0.9s
  Stats over 5 runs: max = 0.9s, min = 0.9s, avg = 0.9s, dev = 0.0s
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/syncer/go_default_test/test_attempts/attempt_1.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/syncer/go_default_test/test_attempts/attempt_2.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/syncer/go_default_test/test_attempts/attempt_3.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/syncer/go_default_test/test_attempts/attempt_4.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/client/syncer/go_default_test/test.log
//shared/p2p:go_default_test                                             FAILED in 5 out of 5 in 1.7s
  Stats over 5 runs: max = 1.7s, min = 1.3s, avg = 1.5s, dev = 0.2s
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/shared/p2p/go_default_test/test_attempts/attempt_1.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/shared/p2p/go_default_test/test_attempts/attempt_2.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/shared/p2p/go_default_test/test_attempts/attempt_3.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/shared/p2p/go_default_test/test_attempts/attempt_4.log
  /home/fgimenez/.cache/bazel/_bazel_fgimenez/512843e3dd6bfc3af7fbee03058a93ee/execroot/__main__/bazel-out/k8-fastbuild/testlogs/shared/p2p/go_default_test/test.log

This PR solves //beacon-chain/sync:go_default_test. I think we can solve //client/syncer:go_default_test and part of //client/proposer:go_default_test too, will propose PRs for them next.

The rest of the issues come from the libp2p-go thirdparty library. The problem is tracked in this upstream issue libp2p/go-libp2p#257, they are aiming to switch the mdns library but seem to be blocked by grandcat/zeroconf#38

While this is not solved we could add the --features=race to travis command as you suggest and change the bazel config for the tests that are known to be racy, similar to what we did here https://github.com/prysmaticlabs/prysm/pull/321/files#diff-f127b9ba81d65ec476bc4edb33289780 but the other way around (setting race=off for the racy tests). This way the race detector would be enabled by default, WDYT?

@prestonvanloon
Copy link
Member

OK, this is great progress nonetheless.

I recommend adding a race testing exclusion on those targets for now:

    race = "off", # TODO(#377): fix issues with race detection testing.

And adding --features=race to the travis yaml. Maybe this will prevent future issues while we work through the existing ones.
Thanks!

SlotNumber: uint64(20),
CrystallizedStateHash: crystallizedStateHash[:],
}
getBlockResponseMsg := func(slotNumber uint64) p2p.Message {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neat!

@fgimenez
Copy link
Contributor Author

fgimenez commented Aug 9, 2018

@prestonvanloon done, PTAL.

The racy tests are added to specific racy_tests.go files, and those are configured to be run with race=off in the respective BUILD.bazel files.

I've added the race feature to .travis-bazelrc so that it's easier to reproduce CI conditions locally, let me know what do you think.

@fgimenez
Copy link
Contributor Author

fgimenez commented Aug 9, 2018

just pushed changes to disable race detector on client/contracts, was timing out with it.

Copy link
Member

@prestonvanloon prestonvanloon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than lumping all of the package tests into one test, racy_test.go.

What do you think about having multiple so it’s easy to know where they go after the problem has been solved?

service_norace_test.go
discovery_norace_test.go

Etc

@fgimenez
Copy link
Contributor Author

@prestonvanloon great suggestion, done PTAL.

Copy link
Member

@terencechain terencechain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fgimenez
Copy link
Contributor Author

I have ready changes for preventing the client/syncer race, should I push them here or in a separate PR once this one is merged? The changes in this one for enabling the race detector by default would be required to verify that the fix works.

@fgimenez fgimenez force-pushed the fix-beaconchain-sync-test-race branch from cc5559d to ee29d53 Compare August 12, 2018 18:38
@prestonvanloon
Copy link
Member

@fgimenez In another PR please. I'll merge this now

@prestonvanloon prestonvanloon merged commit dd29481 into prysmaticlabs:master Aug 12, 2018
@fgimenez fgimenez deleted the fix-beaconchain-sync-test-race branch August 12, 2018 20:28
@fgimenez
Copy link
Contributor Author

With the removal of client/syncer in #395 the only remaining race is in shared/p2p caused by libp2p-go. I'll keep an eye on libp2p/go-libp2p#257 and grandcat/zeroconf#38.

This was referenced Aug 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants