Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context timeout/deadline not very effective on Exec, Query #1313

Closed
kbolino opened this issue Sep 23, 2022 · 8 comments
Closed

Context timeout/deadline not very effective on Exec, Query #1313

kbolino opened this issue Sep 23, 2022 · 8 comments
Labels

Comments

@kbolino
Copy link

kbolino commented Sep 23, 2022

Describe the bug

The timeout/deadline specified on the context passed in to various operations does not seem to be very effective.

To Reproduce

  1. Execute a statement, given a context that has a deadline set (via WithDeadline or WithTimeout)
  2. Server doesn't respond for a while (not until well after the deadline; of course, there might be another bug at play here)
  3. Check for error and measure time elapsed

e.g.

startTime := time.Now()
deadline := startTime.Add(5 * time.Minute)
ctx := context.WithTimeout(context.Background(), deadline)
_, err := pgConn.Exec(ctx, `CREATE TABLE test;`) // have this hang for some reason
if err != nil {
    fmt.Println("ERROR: executing create table:", err)
    fmt.Println("time taken:", time.Since(startTime))
}

Expected behavior

Error is deadline exceeded, time elapsed is pretty close to timeout.

Actual behavior

Error is timeout: context already done: context deadline exceeded (as expected) but time elapsed can be much longer than expected, for example I've seen it take 2–4 hours to finally fail on a 5-minute timeout.

Version

  • Go:
    $ go version
    go version go1.19.1 linux/amd64
  • PostgreSQL:
    $ psql --no-psqlrc --tuples-only -c 'select version()'
    PostgreSQL 14.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit
  • pgx:
    $ grep 'github.com/jackc/pgx/v[0-9]' go.mod
    github.com/jackc/pgx/v5 v5.0.0

Additional context

I believe the root cause is that the context is only checked for cancellation in between network calls. If the problem manifests as being stuck inside a read/write operation against the network connection, then the timeout is only effective once (if ever!) the operation fails in the underlying implementation of Go's standard library.

The solution would seem to be to propagate the context's deadline down to the network connection. This may be as simple as doing something like the following in the appropriate place:

if deadline, ok := ctx.Deadline(); ok {
    rawConn.SetDeadline(deadline)
}
@kbolino kbolino added the bug label Sep 23, 2022
@kbolino
Copy link
Author

kbolino commented Sep 24, 2022

Ran into a similar problem in a different context, and now I'm a bit suspicious about how the context deadline is supposed to work. Go's standard net/http.Client doesn't seem to pass down the deadline to its connection, either.

@jackc
Copy link
Owner

jackc commented Sep 24, 2022

I'm unable to duplicate this.

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	t1 := time.Now()
	_, err = conn.Exec(ctx, `select 1, pg_sleep(60)`)
	if err == nil {
		log.Fatal("expected error but did not receive")
	}
	t2 := time.Now()

	fmt.Println(err, t2.Sub(t1))
}

Output:

jack@glados ~/dev/pgx_issues/pgx-1313 ±master⚡ » go run .
timeout: context deadline exceeded 2.001241155s

As far a how the deadline gets to the network connection see package ctxwatch as used by

func newContextWatcher(conn net.Conn) *ctxwatch.ContextWatcher {
.

In short, a separate goroutine watches the context and when it is canceled it calls SetDeadline.

@kbolino
Copy link
Author

kbolino commented Sep 24, 2022

I'm suspicious now that there's a kernel or hardware problem. I didn't see anything in dmesg before I posted, but that doesn't prove anything. Thanks for looking into it.

@kbolino
Copy link
Author

kbolino commented Sep 25, 2022

After updating the kernel, I haven't seen a recurrence of this problem, so I'm chalking it up to a spurious external issue.

@kbolino kbolino closed this as completed Sep 25, 2022
@alejandrodnm
Copy link
Contributor

Hi I have a question regarding this.

I tried your example and it works, but according to pg_stat_activity the query is still running, is this the intended behaviour?

I'm trying to avoid creating a batch per query creating an explicit transaction to be able to do set local statement_timeout=ctx remaining deadline for all the queries I'm doing. When testing context cancellation the request is cancelled immediately but the query (according to pg_stat_activity) continues executing.

@kbolino
Copy link
Author

kbolino commented Oct 3, 2022

@alejandrodnm It can take some time for a query to terminate gracefully. See, for example, this StackOverflow answer. In this case, I don't think pg_sleep can be interrupted, so it will run until completion. More practical queries may be interruptible at various stages before they complete fully, but unless they're very short-lived to begin with, they generally won't stop immediately.

You can forcibly terminate a query by terminating its associated process (which should be done through Postgres itself whenever possible), which will happen immediately under most circumstances, but this isn't recommended for typical cases. As far as I can tell, closing the session/connection that kicked off a query is equivalent to gracefully terminating it.

@jackc
Copy link
Owner

jackc commented Oct 4, 2022

Query cancellation does signal to server to cancel the query. Unfortunately, there is no guarantee the signal will be honored and PostgreSQL doesn't even provide any feedback regarding the state of the cancellation request.

See https://www.postgresql.org/docs/current/protocol-flow.html#id-1.10.5.7.10.

The upshot of all this is that for reasons of both security and efficiency, the frontend has no direct way to tell whether a cancel request has succeeded. It must continue to wait for the backend to respond to the query. Issuing a cancel simply improves the odds that the current query will finish soon, and improves the odds that it will fail with an error message instead of succeeding.

@alejandrodnm
Copy link
Contributor

alejandrodnm commented Oct 4, 2022

Thanks @kbolino and @jackc , my confusion came because I didn't realise that the request to cancel the query is send asynchronously:

pgx/pgconn/pgconn.go

Lines 583 to 598 in f803c79

func (pgConn *PgConn) asyncClose() {
if pgConn.status == connStatusClosed {
return
}
pgConn.status = connStatusClosed
go func() {
defer close(pgConn.cleanupDone)
defer pgConn.conn.Close()
deadline := time.Now().Add(time.Second * 15)
ctx, cancel := context.WithDeadline(context.Background(), deadline)
defer cancel()
pgConn.CancelRequest(ctx)

When running @jackc example the program exited before the request to cancel the query was made. I guess that's what happened when I was testing the query cancelation part.

When given some time to do the asyncClose I got:

2022-10-04 11:50:32.461 CEST,"db","db",56720,"::1:57249",420c01e8.dd90,1,"idle",2022-10-04 11:50:32 CEST,4/68760,0,LOG,00000,"statement: SELECT * FROM pg_sleep(60)",,,,,,,,,"","client backend",,0
2022-10-04 11:50:34.467 CEST,"db","db",56720,"::1:57249",420c01e8.dd90,2,"SELECT",2022-10-04 11:50:32 CEST,4/68760,0,ERROR,57014,"canceling statement due to user request",,,,,,"SELECT * FROM pg_sleep(60)",,,"","client backend",,0

THANK YOU ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants