Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug for sending cancel request to terminate a long running query #1952

Conversation

xin-tsla
Copy link
Contributor

@xin-tsla xin-tsla commented Aug 3, 2022

Hi All,

Update:
The issue this PR tries to solve is that we observed some long running queries did not get cancelled.

Changes in the PR:
1. Fix a bug for sending cancel request to terminate a long running query
2. Update a var name RetryStatementTimeout to RetryCancelledQuery,
reference:https://www.postgresql.org/docs/current/errcodes-appendix.html
In the table, 57014 is query_canceled.
During a local experiment, the query which is cancelled by sending cancel request also returns 57014.

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Aug 3, 2022

Hi @vmihailenco @elliotcourant, when you have time, could you take a look the PR?

options.go Outdated
// Whether to retry queries cancelled because of statement_timeout.
RetryStatementTimeout bool
// Whether to retry queries cancelled because of statement_timeout or cancel request.
RetryCancelledQuery bool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming this would be a breaking change. I recommend we leave the name the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I will revert the change to keep as it is.

base.go Outdated
err := db.cancelRequest(cn.ProcessID, cn.SecretKey)
if err != nil {
internal.Logger.Printf(ctx, "cancelRequest failed: %s", err)
for {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this in a for loop? I looks like the previous logic would work. This appears to have this for loop run forever, as there is no way to break out of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point! My bad. Reverted the change.

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Aug 4, 2022

In some times, case <-ctx.Done(): returns its element latter than case <-fnDone:

			case <-fnDone: // fn has finished, skip cancel
			case <-ctx.Done():

So, even ctx.Done() did have some element, but

case fnDone <- struct{}{}: // signal fn finish, skip cancel goroutine

runs faster.

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Aug 4, 2022

Also, found that during some times, some queries returns

read tcp [::1]:63295->[::1]:5432: i/o timeout

which is faster than ctx.Done() returns, so the cancel request does not have a chance to be sent to db.

I think the root cause is that the write/read timeout for each conn, some times, can be faster than ctx.Done() returns a struct{}{}.
In normal case, we expect ctx.Done() is faster than connection's timeout, so in the current logic, a cancel request is send.
But in some case, connection's timeout happens first and faster than ctx.Done(), then in the current logic the connection will be removed but not sending a cancel request.
In this PR, we add code to send a cancel request when there is a connection timeout due to write / read connection timeout settings.
Will add some test to cover this test case.

@elliotcourant
Copy link
Collaborator

I can take a look this week sorry for the slow response, but even so; you might be hard pressed in getting this merged as this library has pretty much stopped. It is mostly stable and still being used in production, but idk if Vladmir will merge anything else to it.

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Aug 8, 2022

@elliotcourant Thanks for taking a look!

@elliotcourant
Copy link
Collaborator

Code wise this looks fine to me; the mock seems to stray a bit from the patterns I've seen elsewhere in this repo though (but no need to change it).

I'm going to look through the code itself locally more tonight.

@elliotcourant
Copy link
Collaborator

elliotcourant commented Aug 10, 2022

Digging into it more;

cancelRequest seems to work just fine, I would maybe note in the log message though that this "cancelRequest" function can fail and that is not necessarily indicative of the problem, but probably just a symptom of it? Like if the Postgres db crashes we might end up in a "bad conn" state due to something like a timeout, which would mean that the cancelRequest function itself will also fail. Maybe the thing to do would be to also log the error that is being passed to releaseConn as the initial cause? Otherwise I could see people being mislead by the log output of cancelRequest.

Hopefully that makes sense. Please let me know if it does not.

@xin-tsla
Copy link
Contributor Author

@elliotcourant , thanks for taking a look! I agreed that the cancel log may be confusing if the db crashes.
The issue we observed was the db was running but the long running queries did not get cancelled.
In go-pg, when a query is executed, there are two deadlines set up for the query, one is from context, and the other is from conn object in the connection pool.

The following time series diagram shows that if the ctx.Done(green box) returns / closes earlier than connection's deadline (the red box) due to deadline exceeds, then the current library will send a cancel request and remove the connection from the connection pool.
investigation-long-running-query drawio (3)

The following time series diagram shows that if the connection's deadline (the red box) returns earlier than the ctx.Done() (green box), then the library remove the connection from the connection pool but does not send a cancel request to db.
investigation-long-running-query drawio (4)

Copy link
Collaborator

@elliotcourant elliotcourant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a perfect visualization of the problem, I think this would be fine to merge personally. Let me see what I can do.

@xin-tsla
Copy link
Contributor Author

@elliotcourant , thanks for reviewing!
For the isBadConn method, I am thinking if I just cancel only for 57014. Currently isBadConn returns true if 25P02 and 57014

@elliotcourant
Copy link
Collaborator

25P02 is the transaction is aborted right? Which means the database is not spinning on anything anyway and thus there is nothing to cancel, the conn should just be thrown out?

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Aug 10, 2022

@elliotcourant , yes, the 25P02 is aborted transaction, I think so, the database will not spinning on the query anyway.
So, it is no need to cancel for the 25P02. From the previous behavior, the connection with aborted transaction will be throw out.

@xin-tsla
Copy link
Contributor Author

@elliotcourant , I just changed the code a little bit to just cancel the case for 75014, I think it could fix the test fail?

@xin-tsla
Copy link
Contributor Author

@elliotcourant , wondering how could I trigger a test ci-pipeline?

@elliotcourant
Copy link
Collaborator

Just triggered one, this looks good to me.

@xin-tsla
Copy link
Contributor Author

@elliotcourant , thanks for reviewing!

@elliotcourant
Copy link
Collaborator

This looks good to me, I want to talk to @vmihailenco about merging though.

@xin-tsla
Copy link
Contributor Author

@elliotcourant , thanks so much!

@elliotcourant elliotcourant merged commit 6f1b5ca into go-pg:v10 Aug 12, 2022
@xin-tsla
Copy link
Contributor Author

Thanks @vmihailenco @elliotcourant @CyborgMaster !

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Oct 4, 2022

Hi @vmihailenco @elliotcourant , wondering will we release a new tag version for the change ?

@elliotcourant
Copy link
Collaborator

I will tag a release as soon as I return from a business trip this weekend, sorry for the delay!

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Oct 4, 2022

Thanks so much! @elliotcourant

@elliotcourant
Copy link
Collaborator

@xin-tsla Release has been tagged!

@xin-tsla
Copy link
Contributor Author

xin-tsla commented Oct 8, 2022

@elliotcourant , thanks so much! Much appreciate it!

mojtabacazi added a commit to jamscloud/pg that referenced this pull request Nov 3, 2023
* Warn on unknown tags

* Change Logger interface

* Add RealWorld example app

* using go.opentelemetry.io/otel @v0.11.0
see https://github.com/open-telemetry/opentelemetry-go/releases/tag/v0.11.0

* ran go mod tidy

* Add msgpack option

* Set span name. Updates go-pg#1699

* Add nopk field option

* Increase read buffer size to 1mb and reuse it via sync.Pool

* feat(otel): Use first word as span name without operation.

* Cleanup code

* Update sponsors

* Update

* Add docs badge

* Fix badge

* Extract pgext to separate Go module

* Add pgext to readme

* Update changelog

* Split channel and ping timeout

* Change chan Notification

* Fix test

* .

* Add ability to scan into map[string]interface{}

* Add json_use_number to the list of known options

* Test on Go 1.15

* Fix map update and insert

* Fix test on older Postgres

* Add support for scanning into map slice

* Cleanup todo

* Update changelog

* Update sponsors to changelog

* Restore column alloc

* Add types.Numeric

* Add Discord

* Update issue templates

* Update issue template

* Disable blank template

* Add stackoverflow linke

* Cleanup

* Better naming

* Cleanup

* Add self-validating rel tag option

* Update changelog

* Rework rel

* Cleanup example

* Update example

* Update example names

* Fix structs sharing in has-many

* Add basic example

* Update changelog

* Fix discord link

* orm: warn on tag name mistake

* Add renovate.json

* chore(deps): update golang.org/x/exp commit hash to ae8ad44

* chore(deps): update module onsi/ginkgo to v1.14.1

* Fix m2m table name containing a schema

* Update module onsi/gomega to v1.10.2

* Fix a typo in changelog

* Change hooks to be LIFO

* Make sure to call after hook on error

* Fix fk handling for has-one rel

* Use primary key as fallback in m2m

* Fix m2m using column without a prefix

* dont-swallow-before-query-errors

* Update golang.org/x/exp commit hash to 20d5ce0

* Update example names

* Fix example name

* Don't use retry backoff in listener

* Add go vet

* Update golang.org/x/exp commit hash to 18d7dbd

* Add discord badge

* Parse field tag only once when check for relation

* Implement soft delete for scanner

* Tweak scanning into a map

* Fix build

* Reset map slice len before scanning data

* Add HexEncoder

* Properly pass data len in scanner

* fix many2many for multi-tenant applications (table names with a param)

* dependency update and workaround for opentelemetry api change

* Cleanup

* Fix default + use_zero insert

* Improve known tag check

* Add treemux

* Update module vmihailenco/tagparser to v0.1.2

* Cleanup

* More resources

* Update module go.opentelemetry.io/otel to v0.13.0

* Update readme

* Update module onsi/ginkgo to v1.14.2

* Update module onsi/gomega to v1.10.3

* orm: try SQL name as fk. Fixes go-pg#1757

* Update dependencies

* Add alias to the list of known options

* Add extra modules

* Remove pgext

* chore(deps): update module go-pg/pg/v10 to v10.5.0

* Improve scanning null values

* Don't reuse column name

* Feat: Remove byte slice in ColumnAlloc

* Remove unused column

* Add an optional EmptyLine in log

* Move changelog to docs

* Tweak otel span and attrs name

* Fix build

* Try funding

* Otel v0.14

* orm: improve WherePK for custom types

* Fix batch update and delete

* Fix WherePK for slices

* Properly trim extra comma

* chore(deps): update module go.opentelemetry.io/otel to v0.15.0

* Add go-pg-monitor to README ecosystem section

* Bump dependencies

* Undone timeout and cancel in Rollback/Commit context

* Fix logger prefix

* (go-pg#1830) Fixed notice during startup error.

This adds handling of Notice messages during startup and will log the messages for the client to see so they may take action.

* Remove renovate

* Bumped otel up to v0.17.0

* Bumped otel up to v0.17.0
Bumped up testify t0 v1.7.0

* Update dependencies

* Add utm

* Add banner

* Otel v0.18

* Fix otel db.system name

According to the open [telemetry specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md#connection-level-attributes):

> db.system MUST be one of the following or, if none of the listed values apply, a custom value:

Down the list the name for postgres reads as "postgresql".

Add the remaining "ql" characters and conform the expectations of the spec.

* Add a link to discussions

* Fix discussions link

* Use unformatted query when over limit

* Update dependencies

* Remove confusing statement

* orm: improve template to omits columns when there are plenty

* Add DBI

* Move changelog back to the repo

* pgotel: add hard query limit

* Make extra modules to follow go-pg versioning

* Update changelog

* Add NewJSONProvider

* extra/pgotel: add readme and example

* Fix title

* OpenTelemetry v0.20.0

* Add link to Bun

* Mark more errors as fatal

* Remove OpenTelemetry instrumentation from the core (but keep pgotel as is)

* Update OpenTelemetry

* Always remove canceled conn from the pool

* Add big.Int link

* int[] type to array conversion

* Update example_model_test.go

* Add link to Bun

* Add possibility to specify custom `join_fk` for "has-one" relations. Will close go-pg#1812.

* Replace travis

* Add support for ScramSha256Plus

* gofmt and go mod tidy

* Release v10.10.4 (release.sh)

* Version v10.10.4 (tag.sh)

* Fix SASL authentication

* Release v10.10.5 (release.sh)

* Replace release script

* Update dependencies

* Update changelog

* Release v10.10.6 (release.sh)

* chore: tweak readme

* fix: Add `-race` flag to GitHub actions test workflow.

Added `-race` to the `test` target in the Makefile, this way tests that
are run in GitHub Actions will help reveal any potential race
conditions.

* chore: update links

* chore: update link

* chore: update readme

* chore: add monetr

* Fix a bug for sending cancel request to terminate a long running query (go-pg#1952)

* Add for loop for checking if the query is finished or is cancelled

* Rename a var from RetryStatementTimeout to RetryCancelledQuery and update a comment

* Send a cancel request when there is a bad connection error

* Rename a var back to RetryStatementTimeout

* Revert a change in withConn

* Revert changes for comments

* Add an unit test for testing sending a cancel request when a connection is timed out

* Change the return values of isBadConn to return error code if it has

* Send a cancel request when it is a bad conn expect 25P02 case

* chore(ci): Bumping CI versions for Go. (go-pg#1955)

* Add shortcut WhereInOr (go-pg#1947)

* Add shortcut WhereInOr

* Add tests for WhereInOr

Co-authored-by: p.vorontsov <p.vorontsov@norsi-trans.ru>

* chore: cleanup

* chore: add release scripts (go-pg#1957)

* fix(notify): Fixed race condition in listener. (go-pg#1963)

This resolves a race condition in listener where `ln.closed` was being
read without a mutex at the same time it could be written. Just move the
mutex unlock for `Unlisten` down, **but not defer it** so that the mutex
still holds the lock properly. But does not cause a deadlock with
release conn.

* Release v10.10.7 (release.sh) (go-pg#1964)

* chore: update

* chore: update

* Bump mellium.im/sasl from 0.2.1 to 0.3.1 (go-pg#1969)

* Bump mellium.im/sasl from 0.2.1 to 0.3.1

Bumps mellium.im/sasl from 0.2.1 to 0.3.1.

---
updated-dependencies:
- dependency-name: mellium.im/sasl
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Removing Go 1.16 support.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Elliot Courant <elliot@treasuryprime.com>

* chore: Make release scripts work on mac OS

* Release v10.11.0 (release.sh) (go-pg#1970)

* Bump golang.org/x/sys from 0.0.0-20210923061019-b8560ed6a9b7 to 0.1.0 (go-pg#1974)

* Bump golang.org/x/crypto from 0.0.0-20210921155107-089bfa567519 to 0.1.0 (go-pg#1975)

* Update err for single conn pool when context.Done() return first (go-pg#1981)

* Update err from ctx if reason is nil

* Add a unit test

* Move pool object to the test itself

* Rename a var from cl to cancel

* Add comment for explaining why ctx.Err() is used

* Release v10.11.1 (release.sh) (go-pg#1982)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Vladimir Mihailenco <vladimir.webdev@gmail.com>
Co-authored-by: burner-account <github1@depke.net>
Co-authored-by: j2gg0s <wangyanjie@rcrai.com>
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: ferhat elmas <elmas.ferhat@gmail.com>
Co-authored-by: Jeremy Mickelson <jmickelson@tesla.com>
Co-authored-by: Shoshin Nikita <shoshin_nikita@fastmail.com>
Co-authored-by: Martin Ottenwaelter <martin.ottenwaelter@gmail.com>
Co-authored-by: burner-account <gitlab1@depke.net>
Co-authored-by: j2gg0s <j2gg0s@gmail.com>
Co-authored-by: frederikhors <41120635+frederikhors@users.noreply.github.com>
Co-authored-by: Igor Zibarev <zibarev.i@gmail.com>
Co-authored-by: Anatoly Rugalev <anatoly.rugalev@gmail.com>
Co-authored-by: Elliot Courant <me@elliotcourant.dev>
Co-authored-by: Marvin Hansen <marvin.hansen@gmail.com>
Co-authored-by: Matthew Nibecker <hello@mattnibecker.com>
Co-authored-by: alexander <alexander@saporo.io>
Co-authored-by: 047801 <61029048+iam047801@users.noreply.github.com>
Co-authored-by: Vladimir Stolyarov <xakep6666@gmail.com>
Co-authored-by: xin-tsla <107882527+xin-tsla@users.noreply.github.com>
Co-authored-by: Elliot Courant <elliot@treasuryprime.com>
Co-authored-by: swad2007 <3888@list.ru>
Co-authored-by: p.vorontsov <p.vorontsov@norsi-trans.ru>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants