Context timeout/deadline not very effective on Exec, Query #1313

kbolino · 2022-09-23T15:26:21Z

Describe the bug

The timeout/deadline specified on the context passed in to various operations does not seem to be very effective.

To Reproduce

Execute a statement, given a context that has a deadline set (via WithDeadline or WithTimeout)
Server doesn't respond for a while (not until well after the deadline; of course, there might be another bug at play here)
Check for error and measure time elapsed

e.g.

startTime := time.Now()
deadline := startTime.Add(5 * time.Minute)
ctx := context.WithTimeout(context.Background(), deadline)
_, err := pgConn.Exec(ctx, `CREATE TABLE test;`) // have this hang for some reason
if err != nil {
    fmt.Println("ERROR: executing create table:", err)
    fmt.Println("time taken:", time.Since(startTime))
}

Expected behavior

Error is deadline exceeded, time elapsed is pretty close to timeout.

Actual behavior

Error is timeout: context already done: context deadline exceeded (as expected) but time elapsed can be much longer than expected, for example I've seen it take 2–4 hours to finally fail on a 5-minute timeout.

Version

Go:

$ go version
go version go1.19.1 linux/amd64

PostgreSQL:

$ psql --no-psqlrc --tuples-only -c 'select version()'
PostgreSQL 14.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit

pgx:

$ grep 'github.com/jackc/pgx/v[0-9]' go.mod
github.com/jackc/pgx/v5 v5.0.0

Additional context

I believe the root cause is that the context is only checked for cancellation in between network calls. If the problem manifests as being stuck inside a read/write operation against the network connection, then the timeout is only effective once (if ever!) the operation fails in the underlying implementation of Go's standard library.

The solution would seem to be to propagate the context's deadline down to the network connection. This may be as simple as doing something like the following in the appropriate place:

if deadline, ok := ctx.Deadline(); ok {
    rawConn.SetDeadline(deadline)
}

The text was updated successfully, but these errors were encountered:

kbolino · 2022-09-24T13:26:43Z

Ran into a similar problem in a different context, and now I'm a bit suspicious about how the context deadline is supposed to work. Go's standard net/http.Client doesn't seem to pass down the deadline to its connection, either.

jackc · 2022-09-24T15:48:07Z

I'm unable to duplicate this.

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	t1 := time.Now()
	_, err = conn.Exec(ctx, `select 1, pg_sleep(60)`)
	if err == nil {
		log.Fatal("expected error but did not receive")
	}
	t2 := time.Now()

	fmt.Println(err, t2.Sub(t1))
}

Output:

jack@glados ~/dev/pgx_issues/pgx-1313 ±master⚡ » go run .
timeout: context deadline exceeded 2.001241155s

As far a how the deadline gets to the network connection see package ctxwatch as used by

pgx/pgconn/pgconn.go

Line 379 in 335c862

func newContextWatcher(conn net.Conn) *ctxwatch.ContextWatcher {

.

In short, a separate goroutine watches the context and when it is canceled it calls SetDeadline.

kbolino · 2022-09-24T17:33:08Z

I'm suspicious now that there's a kernel or hardware problem. I didn't see anything in dmesg before I posted, but that doesn't prove anything. Thanks for looking into it.

kbolino · 2022-09-25T12:30:49Z

After updating the kernel, I haven't seen a recurrence of this problem, so I'm chalking it up to a spurious external issue.

alejandrodnm · 2022-10-03T15:08:19Z

Hi I have a question regarding this.

I tried your example and it works, but according to pg_stat_activity the query is still running, is this the intended behaviour?

I'm trying to avoid creating a batch per query creating an explicit transaction to be able to do set local statement_timeout=ctx remaining deadline for all the queries I'm doing. When testing context cancellation the request is cancelled immediately but the query (according to pg_stat_activity) continues executing.

kbolino · 2022-10-03T18:21:59Z

@alejandrodnm It can take some time for a query to terminate gracefully. See, for example, this StackOverflow answer. In this case, I don't think pg_sleep can be interrupted, so it will run until completion. More practical queries may be interruptible at various stages before they complete fully, but unless they're very short-lived to begin with, they generally won't stop immediately.

You can forcibly terminate a query by terminating its associated process (which should be done through Postgres itself whenever possible), which will happen immediately under most circumstances, but this isn't recommended for typical cases. As far as I can tell, closing the session/connection that kicked off a query is equivalent to gracefully terminating it.

jackc · 2022-10-04T01:35:24Z

Query cancellation does signal to server to cancel the query. Unfortunately, there is no guarantee the signal will be honored and PostgreSQL doesn't even provide any feedback regarding the state of the cancellation request.

See https://www.postgresql.org/docs/current/protocol-flow.html#id-1.10.5.7.10.

The upshot of all this is that for reasons of both security and efficiency, the frontend has no direct way to tell whether a cancel request has succeeded. It must continue to wait for the backend to respond to the query. Issuing a cancel simply improves the odds that the current query will finish soon, and improves the odds that it will fail with an error message instead of succeeding.

alejandrodnm · 2022-10-04T10:51:08Z

Thanks @kbolino and @jackc , my confusion came because I didn't realise that the request to cancel the query is send asynchronously:

pgx/pgconn/pgconn.go

Lines 583 to 598 in f803c79

    
           func (pgConn *PgConn) asyncClose() { 
        
           	if pgConn.status == connStatusClosed { 
        
           		return 
        
           	} 
        
           	pgConn.status = connStatusClosed 
        
           	go func() { 
        
           		defer close(pgConn.cleanupDone) 
        
           		defer pgConn.conn.Close() 
        
           		deadline := time.Now().Add(time.Second * 15) 
        
           		ctx, cancel := context.WithDeadline(context.Background(), deadline) 
        
           		defer cancel() 
        
           		pgConn.CancelRequest(ctx)

When running @jackc example the program exited before the request to cancel the query was made. I guess that's what happened when I was testing the query cancelation part.

When given some time to do the asyncClose I got:

2022-10-04 11:50:32.461 CEST,"db","db",56720,"::1:57249",420c01e8.dd90,1,"idle",2022-10-04 11:50:32 CEST,4/68760,0,LOG,00000,"statement: SELECT * FROM pg_sleep(60)",,,,,,,,,"","client backend",,0
2022-10-04 11:50:34.467 CEST,"db","db",56720,"::1:57249",420c01e8.dd90,2,"SELECT",2022-10-04 11:50:32 CEST,4/68760,0,ERROR,57014,"canceling statement due to user request",,,,,,"SELECT * FROM pg_sleep(60)",,,"","client backend",,0

THANK YOU ❤️

kbolino added the bug label Sep 23, 2022

kbolino closed this as completed Sep 25, 2022

alejandrodnm mentioned this issue Oct 4, 2022

Add statement_timeout helpers timescale/promscale#1631

Closed

2 tasks

shawn111 mentioned this issue Nov 11, 2022

Once "context deadline exceeded" error happened, need to restart service. mattermost/mattermost#21641

Closed

miguelespinoza mentioned this issue May 29, 2023

Context cancellation puts transaction state to idle in transaction (aborted) #1623

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Context timeout/deadline not very effective on Exec, Query #1313

Context timeout/deadline not very effective on Exec, Query #1313

kbolino commented Sep 23, 2022

kbolino commented Sep 24, 2022

jackc commented Sep 24, 2022

kbolino commented Sep 24, 2022

kbolino commented Sep 25, 2022

alejandrodnm commented Oct 3, 2022

kbolino commented Oct 3, 2022 •

edited

jackc commented Oct 4, 2022

alejandrodnm commented Oct 4, 2022 •

edited

Context timeout/deadline not very effective on Exec, Query #1313

Context timeout/deadline not very effective on Exec, Query #1313

Comments

kbolino commented Sep 23, 2022

kbolino commented Sep 24, 2022

jackc commented Sep 24, 2022

kbolino commented Sep 24, 2022

kbolino commented Sep 25, 2022

alejandrodnm commented Oct 3, 2022

kbolino commented Oct 3, 2022 • edited

jackc commented Oct 4, 2022

alejandrodnm commented Oct 4, 2022 • edited

kbolino commented Oct 3, 2022 •

edited

alejandrodnm commented Oct 4, 2022 •

edited