Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: QueryContext() and Scan() data race #23519

Closed
felixge opened this issue Jan 23, 2018 · 20 comments
Closed

database/sql: QueryContext() and Scan() data race #23519

felixge opened this issue Jan 23, 2018 · 20 comments

Comments

@felixge
Copy link

@felixge felixge commented Jan 23, 2018

What version of Go are you using (go version)?

I reproduced this issue with multiple versions:

go version devel +99e6e482f4 Tue Jan 23 07:52:44 2018 +0000 darwin/amd64
go version go1.9.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/felixge/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/felixge/code/go"
GORACE=""
GOROOT="/Users/felixge/code/golang"
GOTMPDIR=""
GOTOOLDIR="/Users/felixge/code/golang/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/fn/ln1qjq1927q88vpxvm7smkvm0000gn/T/go-build664804739=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I have a test case involving a QueryContext query that is canceled in the middle of a call to rows.Scan(). The best minimal reproduction I was able to come up with was to apply this test patch to the Go core:

diff --git a/src/database/sql/sql_test.go b/src/database/sql/sql_test.go
index 8137eff82b..fd6ed14086 100644
--- a/src/database/sql/sql_test.go
+++ b/src/database/sql/sql_test.go
@@ -3194,6 +3194,9 @@ func TestIssue18429(t *testing.T) {
 			if rows != nil {
 				// Call Next to test Issue 21117 and check for races.
 				for rows.Next() {
+					// Call Scan() to trigger additional race conditions.
+					var name string
+					_ = rows.Scan(&name)
 				}
 				rows.Close()
 			}

The race detector doesn't always pick it up, so I was testing this in a loop:

cd src/database/sql
while go test -v -race -run 'TestIssue18429'; do :; done

What did you expect to see?

No race condition.

What did you see instead?

=== RUN   TestIssue18429
--- PASS: TestIssue18429 (0.07s)
PASS
ok  	database/sql	1.083s
=== RUN   TestIssue18429
==================
WARNING: DATA RACE
Write at 0x00c420158105 by goroutine 62:
  database/sql.(*rowsCursor).Close()
      /Users/felixge/code/golang/src/database/sql/fakedb_test.go:1025 +0xda
  database/sql.(*Rows).close.func1()
      /Users/felixge/code/golang/src/database/sql/sql.go:2920 +0x6e
  database/sql.withLock()
      /Users/felixge/code/golang/src/database/sql/sql.go:3032 +0x74
  database/sql.(*Rows).close()
      /Users/felixge/code/golang/src/database/sql/sql.go:2919 +0x1af
  database/sql.(*Rows).awaitDone()
      /Users/felixge/code/golang/src/database/sql/sql.go:2588 +0x138

Previous read at 0x00c420158105 by goroutine 23:
  runtime.slicebytetostring()
      /Users/felixge/code/golang/src/runtime/string.go:72 +0x0
  database/sql.convertAssign()
      /Users/felixge/code/golang/src/database/sql/convert.go:239 +0x21f0
  database/sql.(*Rows).Scan()
      /Users/felixge/code/golang/src/database/sql/sql.go:2886 +0x280
  database/sql.TestIssue18429.func1()
      /Users/felixge/code/golang/src/database/sql/sql_test.go:3199 +0x27b

Goroutine 62 (running) created at:
  database/sql.(*Rows).initContextClose()
      /Users/felixge/code/golang/src/database/sql/sql.go:2572 +0xda
  database/sql.(*DB).queryDC()
      /Users/felixge/code/golang/src/database/sql/sql.go:1509 +0x51f
  database/sql.(*Tx).QueryContext()
      /Users/felixge/code/golang/src/database/sql/sql.go:2128 +0x19f
  database/sql.TestIssue18429.func1()
      /Users/felixge/code/golang/src/database/sql/sql_test.go:3193 +0x1fe

Goroutine 23 (running) created at:
  database/sql.TestIssue18429()
      /Users/felixge/code/golang/src/database/sql/sql_test.go:3176 +0x19f
  testing.tRunner()
      /Users/felixge/code/golang/src/testing/testing.go:777 +0x16d
==================
--- FAIL: TestIssue18429 (0.07s)
	testing.go:730: race detected during execution of test
FAIL
exit status 1
FAIL	database/sql	0.084s

I strongly suspect that the work done in 1126d14 / #21117 is still missing some locking.

@ALTree
Copy link
Member

@ALTree ALTree commented Jan 23, 2018

@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 23, 2018

@felixge The fix that went into 1.9.2 was indeed a partial fix; it fixed the case that was actively causing problems. At the same time we put a complete fix in for 1.10. When testing locally I've been unable to trigger a data race with go1.10.

@felixge
Copy link
Author

@felixge felixge commented Jan 24, 2018

@kardianos did you test the patch I provided against the go master branch? I'm not seeing a race without the patch, but once I add the Scan() call it appears.

My colleague is also able to reproduce this on his machine and will post his test output here shortly.

@robertdxy
Copy link

@robertdxy robertdxy commented Jan 24, 2018

@kardianos I use the same code from @felixge to test it on master, but it seems the same issue happened again.

go env

➜  sql git:(master) ✗ go env
GOARCH="amd64"
GOBIN="/Users/robertdeng/App/goroot/go/bin"
GOCACHE="/Users/robertdeng/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/robertdeng/App/go"
GORACE=""
GOROOT="/Users/robertdeng/App/goroot/go"
GOTMPDIR=""
GOTOOLDIR="/Users/robertdeng/App/goroot/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/7y/chtf56s51v13dwy77t2hb90w0000gn/T/go-build348294305=/tmp/go-build -gno-record-gcc-switches -fno-common"

go version

➜  sql git:(master) ✗ go version
go version devel +157d8cfbc1 Wed Jan 24 07:18:36 2018 +0000 darwin/amd64

TestIssue18429 output

➜  sql git:(master) ✗ while go test -v -race -run 'TestIssue18429'; do :; done
=== RUN   TestIssue18429
--- PASS: TestIssue18429 (0.07s)
PASS
ok  	database/sql	1.083s
=== RUN   TestIssue18429
--- PASS: TestIssue18429 (0.07s)
PASS
ok  	database/sql	1.089s
=== RUN   TestIssue18429
==================
WARNING: DATA RACE
Write at 0x00c4201560f0 by goroutine 83:
  database/sql.(*rowsCursor).Close()
      /Users/robertdeng/App/goroot/go/src/database/sql/fakedb_test.go:1025 +0xda
  database/sql.(*Rows).close.func1()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2920 +0x6e
  database/sql.withLock()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:3032 +0x74
  database/sql.(*Rows).close()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2919 +0x1af
  database/sql.(*Rows).awaitDone()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2588 +0x138

Previous read at 0x00c4201560f0 by goroutine 25:
  runtime.slicebytetostring()
      /Users/robertdeng/App/goroot/go/src/runtime/string.go:72 +0x0
  database/sql.convertAssign()
      /Users/robertdeng/App/goroot/go/src/database/sql/convert.go:239 +0x21f0
  database/sql.(*Rows).Scan()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2886 +0x280
  database/sql.TestIssue18429.func1()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql_test.go:3199 +0x27b

Goroutine 83 (running) created at:
  database/sql.(*Rows).initContextClose()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2572 +0xda
  database/sql.(*DB).queryDC()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:1509 +0x51f
  database/sql.(*Tx).QueryContext()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql.go:2128 +0x19f
  database/sql.TestIssue18429.func1()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql_test.go:3193 +0x1fe

Goroutine 25 (running) created at:
  database/sql.TestIssue18429()
      /Users/robertdeng/App/goroot/go/src/database/sql/sql_test.go:3176 +0x19f
  testing.tRunner()
      /Users/robertdeng/App/goroot/go/src/testing/testing.go:777 +0x16d
==================
--- FAIL: TestIssue18429 (0.07s)
	testing.go:730: race detected during execution of test
FAIL
exit status 1
FAIL	database/sql	0.082s
@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 24, 2018

@felixge
Copy link
Author

@felixge felixge commented Jan 24, 2018

@kardianos the git sha hashes are included in the version numbers we reported. For your convenience:

Basically we both tested against HEAD of master and were able to see the race once the Scan() patch from above was applied.

Thank you so much for looking into this! Let me know if we can provide more information.

@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 24, 2018

I can see the mechanism for this race, even though I have been unable to reproduce on my Linux amd64 boxes.

It seems to be caused by the fakedb close corrupting the buffer (on purpose for 2622) while a scan is concurrently taking place. The "scan" method that contains a convert call for the buffer doesn't explicitly call the driver, but it does touch the provided buffer (which is corrupted in fakedb, thus the race where Close is called concurrently with Scan in that test.

I can see three ways of resolving this:

  1. Don't corrupt the buffer on close in fakedb. State in the driver docs that drivers shouldn't modify the buffer except in driver.Rows.Next.
  2. In Rows.Scan, do a RLock around the whole method, not just the is closed check.
  3. In the Rows.Scan method, add a dc lock around the convert columns loop.

I think (3) would provide the most correct solution if we say the row buffer could be corrupted any time we call into the driver. (2) is tempting because it wouldn't add any additional locking (it would increase the lock duration), but still resolve this case.

A part of me feels like this is a driver bug that we shouldn't try to protect against.
@bradfitz Do you have any opinions on this? If not, I think I would go with (2) for go1.10, though I'd be interested in perusing (1).

@felixge
Copy link
Author

@felixge felixge commented Jan 24, 2018

@kardianos FWIW we're seeing this race with lib/pq in our application test suite, so it seems real-world drivers are behaving similar to fakedb.

It seems like solution 2/3 would fix the problem without requiring the drivers to change (nice), but solution 1 sounds reasonable as well.

@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 24, 2018

@felixge Can you share the race you are seeing with lib/pq along with the revision of lib/pq?
It may not be a big deal, but I would ideally like to confirm that they are the same issue.

@kardianos kardianos added this to the Go1.10 milestone Jan 24, 2018
@felixge
Copy link
Author

@felixge felixge commented Jan 24, 2018

@kardianos sure. Here is the info:

lib/pq version: 61fe37aa2ee24fabcdbe5c4ac1d4ac566f88f345 (just upgraded to latest master/HEAD before re-running the test)

$ go version
go version devel +99e6e482f4 Tue Jan 23 07:52:44 2018 +0000 darwin/amd64
$ go test -v -race -run 'TestResponses'
=== RUN   TestResponses
==================
WARNING: DATA RACE
Write at 0x00c4201e4020 by goroutine 73:
  runtime.slicecopy()
      /Users/felixge/code/golang/src/runtime/slice.go:192 +0x0
  bufio.(*Reader).Read()
      /Users/felixge/code/golang/src/bufio/bufio.go:227 +0x229
  io.ReadAtLeast()
      /Users/felixge/code/golang/src/io/io.go:309 +0x93
  io.ReadFull()
      /Users/felixge/code/golang/src/io/io.go:327 +0x72
  our/project/vendor/github.com/lib/pq.(*conn).recvMessage()
      /Users/felixge/code/go/src/our/project/vendor/github.com/lib/pq/conn.go:971 +0x325
  our/project/vendor/github.com/lib/pq.(*conn).recv1Buf()
      /Users/felixge/code/go/src/our/project/vendor/github.com/lib/pq/conn.go:1007 +0x46
  our/project/vendor/github.com/lib/pq.(*rows).Next()
      /Users/felixge/code/go/src/our/project/vendor/github.com/lib/pq/conn.go:1399 +0x125
  our/project/vendor/github.com/lib/pq.(*rows).Close()
      /Users/felixge/code/go/src/our/project/vendor/github.com/lib/pq/conn.go:1356 +0x7b
  database/sql.(*Rows).close.func1()
      /Users/felixge/code/golang/src/database/sql/sql.go:2920 +0x6e
  database/sql.withLock()
      /Users/felixge/code/golang/src/database/sql/sql.go:3032 +0x74
  database/sql.(*Rows).close()
      /Users/felixge/code/golang/src/database/sql/sql.go:2919 +0x1af
  database/sql.(*Rows).awaitDone()
      /Users/felixge/code/golang/src/database/sql/sql.go:2588 +0x138

Previous read at 0x00c4201e4027 by goroutine 58:
  runtime.slicecopy()
      /Users/felixge/code/golang/src/runtime/slice.go:192 +0x0
  database/sql.convertAssign()
      /Users/felixge/code/golang/src/database/sql/convert.go:439 +0x2742
  database/sql.(*Rows).Scan()
      /Users/felixge/code/golang/src/database/sql/sql.go:2886 +0x280
  our/project/sqlplan.(*Query).execute()
      /Users/felixge/code/go/src/our/project/sqlplan/query.go:183 +0x72d
  our/project/sqlplan.(*Query).Execute()
      /Users/felixge/code/go/src/our/project/sqlplan/query.go:136 +0x191
  our/project/sqlplan.(*subGatherer).Execute()
      /Users/felixge/code/go/src/our/project/sqlplan/gatherer.go:254 +0xbcc
  our/project/sqlplan.(*FirstResponder).Execute.func1()
      /Users/felixge/code/go/src/our/project/sqlplan/first_responder.go:57 +0xdb
@gopherbot
Copy link

@gopherbot gopherbot commented Jan 24, 2018

Change https://golang.org/cl/89596 mentions this issue: database/sql: guard Rows.lastcols from driver mutations during read

@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 24, 2018

Ah, in lib/pq it will handle a pre-mature close by reading all the remaining rows into the buffer. Interesting choice. If you have a chance, please try out the linked CL; it should resolve your situation (I was unable to make the unit test display the race).

@felixge
Copy link
Author

@felixge felixge commented Jan 24, 2018

@kardianos thanks so much, it's getting late here, but I'll try the CI first thing tomorrow morning.

@felixge
Copy link
Author

@felixge felixge commented Jan 25, 2018

@kardianos I tested your CL (go version devel +f45bde7d91 Wed Jan 24 12:16:56 2018 -0800 darwin/amd64), and it fixes the race in my applications test suite (using lib/pq), as well as for the patched TestIssue18429 test from above 🎉 🙏 .

My only suggestion would be to include the Scan() call from my patch above in your CL. I know it wasn't enough to trigger the race on your machine, but IMO anything that could help to prevent regressions in the future should be beneficial.

Let me know if there is anything else you'd like me to check/verify on my end.

@felixge
Copy link
Author

@felixge felixge commented Jan 25, 2018

Ah, in lib/pq it will handle a pre-mature close by reading all the remaining rows into the buffer. Interesting choice.

If I had to guess: This is the only way to reuse the connection of a canceled query.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 25, 2018

Change https://golang.org/cl/89835 mentions this issue: database/sql: prevent close during row Scan to prevent race

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 25, 2018

Change https://golang.org/cl/89936 mentions this issue: database/sql: buffers provided to Rows.Next should not be modified by drivers

@gopherbot gopherbot closed this in 651ddbd Jan 25, 2018
@felixge
Copy link
Author

@felixge felixge commented Jan 25, 2018

@kardianos looks like you went with solution 1 in 651ddbd ? I.e. lib/pq will have to be changed in order to fix the race?

@kardianos
Copy link
Contributor

@kardianos kardianos commented Jan 25, 2018

@felixge I'll be opening some issues for various drivers.
lib/pq will need to stop writing to the same buffer during close. It should create a new separate buffer or not write to it at all.

felixge added a commit to felixge/pq that referenced this issue Jan 29, 2018
This test case shows a race condition we see in the test suite of our
application. We already created an issue with the [Go
project](golang/go#23519), but the conclusion was
that the [db drivers should fix
it](golang/go#23519 (comment)).

I was able to reproduce this issue with the following go versions:

- `go version go1.9.3 darwin/amd64` (output shown below)
- `go version devel +5af1e7d78e Sun Jan 28 19:54:48 2018 +0000 darwin/amd64` (latest master)

Please let me know if I can provide more information and/or if you'd
like me to attempt to create a patch to fix the issue.

```
$ while go test -v -race -run 'TestContextCancelQueryWhileScan'; do :; done
=== RUN   TestContextCancelQueryWhileScan
--- PASS: TestContextCancelQueryWhileScan (0.13s)
PASS
ok  	github.com/lib/pq	1.160s
=== RUN   TestContextCancelQueryWhileScan
==================
WARNING: DATA RACE
Read at 0x00c4201ae5ae by goroutine 23:
  runtime.slicebytetostring()
      /usr/local/Cellar/go/1.9.3/libexec/src/runtime/string.go:72 +0x0
  database/sql.convertAssign()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/convert.go:244 +0x2333
  database/sql.(*Rows).Scan()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2727 +0x295
  github.com/lib/pq.TestContextCancelQueryWhileScan.func1()
      /Users/felixge/code/go/src/github.com/lib/pq/go18_test.go:188 +0x38a

Previous write at 0x00c4201ae5a8 by goroutine 50:
  runtime.slicecopy()
      /usr/local/Cellar/go/1.9.3/libexec/src/runtime/slice.go:160 +0x0
  bufio.(*Reader).Read()
      /usr/local/Cellar/go/1.9.3/libexec/src/bufio/bufio.go:224 +0x240
  io.ReadAtLeast()
      /usr/local/Cellar/go/1.9.3/libexec/src/io/io.go:309 +0x93
  io.ReadFull()
      /usr/local/Cellar/go/1.9.3/libexec/src/io/io.go:327 +0x72
  github.com/lib/pq.(*conn).recvMessage()
      /Users/felixge/code/go/src/github.com/lib/pq/conn.go:971 +0x38a
  github.com/lib/pq.(*conn).recv1Buf()
      /Users/felixge/code/go/src/github.com/lib/pq/conn.go:1007 +0x46
  github.com/lib/pq.(*rows).Next()
      /Users/felixge/code/go/src/github.com/lib/pq/conn.go:1399 +0x13b
  github.com/lib/pq.(*rows).Close()
      /Users/felixge/code/go/src/github.com/lib/pq/conn.go:1356 +0x8f
  database/sql.(*Rows).close.func1()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2761 +0x6e
  database/sql.withLock()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2873 +0x76
  database/sql.(*Rows).close()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2760 +0x1ca
  database/sql.(*Rows).awaitDone()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2436 +0x14c

Goroutine 23 (running) created at:
  github.com/lib/pq.TestContextCancelQueryWhileScan()
      /Users/felixge/code/go/src/github.com/lib/pq/go18_test.go:174 +0x163
  testing.tRunner()
      /usr/local/Cellar/go/1.9.3/libexec/src/testing/testing.go:746 +0x16c

Goroutine 50 (running) created at:
  database/sql.(*Rows).initContextClose()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:2420 +0xdd
  database/sql.(*DB).queryDC()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:1334 +0xb7f
  database/sql.(*DB).query()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:1304 +0x193
  database/sql.(*DB).QueryContext()
      /usr/local/Cellar/go/1.9.3/libexec/src/database/sql/sql.go:1281 +0xe2
  github.com/lib/pq.TestContextCancelQueryWhileScan.func1()
      /Users/felixge/code/go/src/github.com/lib/pq/go18_test.go:183 +0x22f
==================
--- FAIL: TestContextCancelQueryWhileScan (0.12s)
	testing.go:699: race detected during execution of test
FAIL
exit status 1
FAIL	github.com/lib/pq	0.153s
```
@DoctorQ
Copy link

@DoctorQ DoctorQ commented Jan 3, 2019

I think Rows.close(err) should add the judgment:

if err != nil && err != io.EOF {
     rs.lasterr = err
     return nil
}

or

if context.DeadlineExceeded == err {
     rs.lasterr = err
     return nil
}

then code below:

func (rs *Rows) close(err error) error {
	rs.closemu.Lock()
	defer rs.closemu.Unlock()
        
        if err != nil && err != io.EOF {//throws context.DeadlineExceeded when QueryContext's context timeout
                rs.lasterr = err
                return nil
        }

	if rs.closed {
		return nil
	}
	rs.closed = true

	if rs.lasterr == nil {
		rs.lasterr = err
	}

	withLock(rs.dc, func() {
		err = rs.rowsi.Close()
	})
	if fn := rowsCloseHook(); fn != nil {
		fn(rs, &err)
	}
	if rs.cancel != nil {
		rs.cancel()
	}

	if rs.closeStmt != nil {
		rs.closeStmt.Close()
	}
	rs.releaseConn(err)
	return err
}

because buffers provided to Rows.Next should not be modified by Rows.Close

@golang golang locked and limited conversation to collaborators Jan 3, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.