
Fix query cancellation collateralizing future queries using the same connection #1000

Merged: 1 commit merged into lib:master on Nov 24, 2020

Conversation

@kahuang (Contributor) commented Oct 2, 2020

This change solves the following issue we were seeing in production (a rough reproduction is sketched after the list):

  1. A context would get cancelled due to a timeout.
  2. watchCancel would get triggered and start to send the cancellation.
  3. Because the cancellation happens concurrently with the normal query path, and doesn't block the finish() path until it's done cancelling the query on the Postgres side, the connection could be returned to the pool before the query was actually cancelled.
  4. That connection then gets reused, and the next query gets collateralized by the previous query's cancel request.
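Roughly, this is the usage pattern under which we hit it. A minimal, self-contained sketch (the DSN and the sleep duration are made up, and the race is timing-dependent, so it will not fail on every run):

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSN; any pooled *sql.DB shows the same pattern.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(1) // force both queries onto the same connection

	// Query 1 is cancelled by its context timeout. watchCancel starts sending
	// the cancel request, but the connection can go back to the pool before
	// that request actually reaches the server.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_, _ = db.ExecContext(ctx, "SELECT pg_sleep(10)")

	// Query 2 reuses the same connection and can be killed by the cancel
	// request that was meant for query 1.
	var n int
	if err := db.QueryRowContext(context.Background(), "SELECT 1").Scan(&n); err != nil {
		log.Printf("collateral cancellation: %v", err)
	}
}
```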

The changes I've made to ensure we don't run into this collateral damage:

  1. When the context is cancelled, we immediately try to send the finished signal to the finished channel to prevent a race. Either we successfully send this, or the query we tried to cancel finished first. In the latter case, we just return, and let a future query (if one happens) trigger the cancellation.
  2. In the former case we set the connection state to bad so it doesn't get reused, and cancel the query.
  3. When the finish() function gets called, the querier goroutine closes the connection.

Note: We probably don't need to change bad to an atomic variable, but I did just to be safe. The sketch below illustrates the handshake.
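To make the handshake concrete, here is a simplified, self-contained sketch of the idea. It is not the actual lib/pq code: the conn type and the setBad/sendCancelRequest/close helpers are stand-ins; only the finished-channel race mirrors the steps above.

```go
package sketch

import (
	"context"
	"sync/atomic"
)

// conn is an illustrative stand-in for the driver connection; field and
// method names here are not the actual lib/pq identifiers.
type conn struct {
	bad int32 // accessed atomically, per the note above
}

func (cn *conn) setBad()                  { atomic.StoreInt32(&cn.bad, 1) }
func (cn *conn) sendCancelRequest() error { return nil } // network call in the real driver
func (cn *conn) close() error             { return nil }

// watchCancel mirrors the handshake described above: the cancellation
// goroutine and the query goroutine race on a one-slot channel, and exactly
// one of them wins.
func (cn *conn) watchCancel(ctx context.Context) (finish func()) {
	finished := make(chan struct{}, 1)
	go func() {
		select {
		case <-ctx.Done():
			select {
			case finished <- struct{}{}:
				// We won the race: mark the connection bad so the pool will
				// not reuse it, then cancel the query on the server.
				cn.setBad()
				_ = cn.sendCancelRequest()
			default:
				// The query finished first; just return and let a future
				// query observe the canceled context.
			}
		case <-finished:
			// The query finished before the context was canceled.
		}
	}()
	return func() {
		select {
		case finished <- struct{}{}:
			// Normal completion; the watcher goroutine sees this and exits.
		default:
			// The watcher already claimed the query: the connection is bad,
			// so close it instead of returning it to the pool.
			_ = cn.close()
		}
	}
}
```

The one-slot buffered channel is what makes the race safe: whichever side sends first owns the outcome, and the other side sees a full buffer and takes the fallback path.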

(Review thread on go18_test.go; outdated, resolved.)
@maddyblue (Collaborator) left a comment


Please squash your commits and fix the test flake and I think I can merge this.

@kahuang (Contributor, Author) commented Nov 23, 2020

I've updated the tests and squashed the commits

@maddyblue (Collaborator) commented
Need to remove Go 1.13, which doesn't support some of the sync stuff used here. Doing that in #1014.

@maddyblue merged commit aecc811 into lib:master on Nov 24, 2020
@jwatte commented Dec 4, 2020

This change causes a public API change that breaks our app.
We start a transaction in a WithCancel() context, run a SELECT, cancel the context, and run another SELECT (sketched below).
The error returned when transacting on a canceled context used to be
"sql: transaction has already been committed or rolled back".
It is now "driver: bad connection", which is not just a breaking change but arguably the wrong error.
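A sketch of that sequence (the DSN is hypothetical, and the exact error is timing-dependent because database/sql rolls the transaction back asynchronously when the context is canceled):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical DSN.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}

	// First SELECT succeeds.
	rows, err := tx.QueryContext(ctx, "SELECT 1")
	if err != nil {
		log.Fatal(err)
	}
	rows.Close()

	cancel() // cancel the context mid-transaction

	// Second SELECT on the canceled context used to fail with
	// "sql: transaction has already been committed or rolled back";
	// after this PR it fails with "driver: bad connection".
	if _, err := tx.QueryContext(ctx, "SELECT 1"); err != nil {
		log.Println(err)
	}
}
```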

@maddyblue (Collaborator) commented
If you submit a PR that reverts this change and adds a test to prevent this regression in the future I'll merge it.

@NOMORECOFFEE commented Jan 16, 2021

Hello.
It's possible that connections remain in the pool in the ErrBadConn state and then get dropped on the next request.
Should we implement SessionResetter in this case?
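For reference, database/sql calls driver.SessionResetter.ResetSession before a pooled connection is reused; returning driver.ErrBadConn there makes the pool discard the connection instead of handing it to the next query. A minimal sketch, with a stand-in conn type rather than anything lib/pq actually implements:

```go
package sketch

import (
	"context"
	"database/sql/driver"
	"sync/atomic"
)

// conn is an illustrative stand-in for the driver connection.
type conn struct {
	bad int32
}

// ResetSession is called by database/sql before a pooled connection is
// reused. Returning driver.ErrBadConn tells the pool to drop the connection
// and open a fresh one instead of handing it to the next query.
func (cn *conn) ResetSession(ctx context.Context) error {
	if atomic.LoadInt32(&cn.bad) == 1 {
		return driver.ErrBadConn
	}
	return nil
}

// Compile-time check that conn satisfies the interface.
var _ driver.SessionResetter = (*conn)(nil)
```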

craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Aug 10, 2021
68577: protectedts,kvserver: add UpdateTimestamp method to the storage interface r=ajwerner a=adityamaru

This change introduces an UpdateTimestamp method that can be used to
update the timestamp protected by a protected timestamp record.

A subsequent commit will add some logic to the verification code to ensure
that an update is picked up correctly during the Verify() call.

Release note: None

kvserver: modify pts verification to check Timestamp as well

Previously, when verifying a pts record we would find the record
we are attempting to verify in the cache on the basis of the
record ID. This was correct until the previous commit added a way to
update a record's Timestamp.

If a record is updated with a new ts, and a verification request
is sent for this updated record, then we want to ensure that the
request is not serviced on the basis of the old record. This
is possible if the cache is too stale.

To prevent this we now match both the record ID and the timestamp
to protect after, when finding the cache entry corresponding to
the request. In this way, if we do see the older record then the
verification will fail the first time around, the cache will refresh,
and we will see the new record the second time around.

Release note: None
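For illustration, a minimal sketch of the lookup rule that commit message describes, with hypothetical types standing in for the CockroachDB internals:

```go
package sketch

import (
	"errors"
	"time"
)

// recordID and record are hypothetical stand-ins for the protectedts types.
type recordID string

type record struct {
	ID        recordID
	Timestamp time.Time // the "protect after" timestamp
}

type cache struct {
	records map[recordID]record
}

var errStaleCache = errors.New("record not found; cache may be stale, refresh and retry")

// lookup matches on both the record ID and the timestamp, so a verification
// request for an updated record is not satisfied by an older cached copy.
func (c *cache) lookup(id recordID, ts time.Time) (record, error) {
	r, ok := c.records[id]
	if !ok || !r.Timestamp.Equal(ts) {
		// Verification fails the first time around; after a cache refresh
		// the updated record should be visible on retry.
		return record{}, errStaleCache
	}
	return r, nil
}
```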

68665: workload: ignore ErrBadConn if context has been canceled r=otan a=rafiss

fixes: #68574
fixes: #68585
refs: lib/pq#1000

lib/pq can return this error if the context has been canceled.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
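The 68665 change above amounts to tolerating driver.ErrBadConn when the caller's own context has been canceled. A rough sketch of that check (not the actual cockroach workload code):

```go
package sketch

import (
	"context"
	"database/sql/driver"
	"errors"
)

// ignorableError reports whether err can be ignored because the context was
// canceled: lib/pq may surface driver.ErrBadConn for a query that was cut
// short by its own context.
func ignorableError(ctx context.Context, err error) bool {
	if err == nil {
		return true
	}
	return ctx.Err() != nil && errors.Is(err, driver.ErrBadConn)
}
```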
@knz commented Feb 11, 2022

> This change causes a public API change that breaks our app.

We've just run into the same issue inside CockroachDB. A query cancellation should not drop the connection.

@rafiss (Collaborator) commented Aug 29, 2022

I think #1079 will address the issues with this PR

@evanj (Contributor) commented Jan 29, 2023

I have updated #1079 with the code review comments in case it helps.
