Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data races and enable -race in CI #634

Merged
merged 8 commits into from
Jun 9, 2021
Merged

Conversation

dnephin
Copy link
Contributor

@dnephin dnephin commented Jun 9, 2021

Branched from #633
Fixes #622

Best viewed by individual commit. Full details in the commit messages.

  • updates memberlist (to pickup some data race fixes from that library)
  • skips some tests where the data race is caused by the test itself
  • fixes some data races (some in tests, and some outside of tests)

I fixed one of the data races in the cmd packages, but there are more. So for now we only run the race detector for the serf package.

dnephin added 5 commits June 9, 2021 13:24
These tests cause data races because they modify data that is passed to a goroutine. These are
not data races that exist in serf itself, so skipping them for now to get tests running with
the race detector.

Also fix one test that introduces a data race.
The memberState stored in failedMembers is a pointer, so the lock must
be held until after the fields are read from the struct.

```
WARNING: DATA RACE
Write at 0x00c0003facd8 by goroutine 15:
  github.com/hashicorp/serf/serf.(*Serf).handleNodeJoin()
      /home/daniel/pers/code/serf/serf/serf.go:962 +0x3d8
  github.com/hashicorp/serf/serf.(*eventDelegate).NotifyJoin()
      /home/daniel/pers/code/serf/serf/event_delegate.go:12 +0x55
  github.com/hashicorp/memberlist.(*Memberlist).aliveNode()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/state.go:1123 +0x957
  github.com/hashicorp/memberlist.(*Memberlist).handleAlive()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/net.go:677 +0x58e
  github.com/hashicorp/memberlist.(*Memberlist).packetHandler()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/net.go:452 +0x236

Previous read at 0x00c0003facd8 by goroutine 37:
  github.com/hashicorp/serf/serf.(*Serf).reconnect()
      /home/daniel/pers/code/serf/serf/serf.go:1645 +0x2c9
  github.com/hashicorp/serf/serf.(*Serf).handleReconnect()
      /home/daniel/pers/code/serf/serf/serf.go:1571 +0x57

Goroutine 15 (running) created at:
  github.com/hashicorp/memberlist.newMemberlist()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/memberlist.go:218 +0xa29
  github.com/hashicorp/memberlist.Create()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/memberlist.go:228 +0x3c
  github.com/hashicorp/serf/serf.Create()
      /home/daniel/pers/code/serf/serf/serf.go:410 +0x1494
  github.com/hashicorp/serf/serf.TestSerf_Coordinates()
      /home/daniel/pers/code/serf/serf/serf_test.go:2664 +0x3c4
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Goroutine 37 (running) created at:
  github.com/hashicorp/serf/serf.Create()
      /home/daniel/pers/code/serf/serf/serf.go:423 +0x1704
  github.com/hashicorp/serf/serf.TestSerf_Coordinates()
      /home/daniel/pers/code/serf/serf/serf_test.go:2664 +0x3c4
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
```
Serf.state was being accessed without acquiring the stateLock, which
caused a data race when any other goroutine changes the state.
`handleNodeLeaveIntent` acquires the memberLock, so we need to be
careful of a deadlock here. From what I can tell there is no place where
we acquire both locks at this time.

Out of caution I have moved the stateLock acquire to before the
memberLock acquire, but maybe a better fix would be to check the state
after the memberLock is acquired?

```
WARNING: DATA RACE
Write at 0x00c0003b4138 by goroutine 60:
  github.com/hashicorp/serf/serf.(*Serf).Leave()
      /home/daniel/pers/code/serf/serf/serf.go:750 +0x3e9
  github.com/hashicorp/serf/serf.TestSerf_Coordinates()
      /home/daniel/pers/code/serf/serf/serf_test.go:2732 +0xc1a
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Previous read at 0x00c0003b4138 by goroutine 98:
  github.com/hashicorp/serf/serf.(*Serf).handleNodeLeaveIntent()
      /home/daniel/pers/code/serf/serf/serf.go:1111 +0xa39
  github.com/hashicorp/serf/serf.(*delegate).MergeRemoteState()
      /home/daniel/pers/code/serf/serf/delegate.go:255 +0x464
  github.com/hashicorp/memberlist.(*Memberlist).mergeRemoteState()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/net.go:1178 +0x779
  github.com/hashicorp/memberlist.(*Memberlist).pushPullNode()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/state.go:653 +0x244
  github.com/hashicorp/memberlist.(*Memberlist).Join()
      /home/daniel/go/pkg/mod/github.com/hashicorp/memberlist@v0.2.4/memberlist.go:264 +0x2ad
  github.com/hashicorp/serf/serf.(*Serf).reconnect()
      /home/daniel/pers/code/serf/serf/serf.go:1654 +0x4c4
  github.com/hashicorp/serf/serf.(*Serf).handleReconnect()
      /home/daniel/pers/code/serf/serf/serf.go:1571 +0x57

Goroutine 60 (running) created at:
  testing.(*T).Run()
      /usr/lib/go/src/testing/testing.go:1238 +0x5d7
  testing.runTests.func1()
      /usr/lib/go/src/testing/testing.go:1511 +0xa6
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
  testing.runTests()
      /usr/lib/go/src/testing/testing.go:1509 +0x612
  testing.(*M).Run()
      /usr/lib/go/src/testing/testing.go:1417 +0x3b3
  main.main()
      _testmain.go:263 +0x236

Goroutine 98 (running) created at:
  github.com/hashicorp/serf/serf.Create()
      /home/daniel/pers/code/serf/serf/serf.go:423 +0x1704
  github.com/hashicorp/serf/serf.TestSerf_Coordinates()
      /home/daniel/pers/code/serf/serf/serf_test.go:2664 +0x3c4
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
```
@dnephin dnephin requested a review from a team June 9, 2021 19:33
@dnephin dnephin changed the title Dnephin/fix data races Fix data races and enable -race in CI Jun 9, 2021
dnephin added 2 commits June 9, 2021 15:45
The i.stop bool was read from different goroutines without a lock.
because we are skipping a few tests when the race detector is enabled.
@dnephin dnephin force-pushed the dnephin/fix-data-races branch from 56b493f to 67ab3fe Compare June 9, 2021 19:46
There was a data race because a pointer to a struct was re-used in another Serf instance

Using t.Log as an io.Writer for logging is problematic, because goroutines can outlive the test, which
causes a panic. Tests should log to stderr, not t.Log
Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Base automatically changed from dnephin/remove-vendor to master June 9, 2021 21:51
@dnephin dnephin merged commit 79deb84 into master Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

race: handleNodeLeaveIntent reads state field without acquiring state lock
2 participants