[q] custom cancellation func #679
I suppose we could add something similar back... But I find it surprising that PgBouncer works that way. I would expect it to only forward the cancellation request to the PostgreSQL server if the client-side connection that received the cancel request had a query in progress. Doing so otherwise seems guaranteed to produce an error.
Aha! Didn't know that the strategy had changed, thanks. Here, the PIDs mean the true PIDs in Postgres (each belonging to a connection between PgBouncer and the DB). While pgx fires the cancel and kills its connections to PgBouncer, PgBouncer reuses its connections to the DB.
It really surprises me that PgBouncer would forward cancellations in that manner. It seems it would be impossible to use that functionality safely. I would have expected PgBouncer to send its own PID and secret key to the client, and, when PgBouncer receives a cancel request, to map it to the underlying PG connection only if that PgBouncer connection was busy. Anyway, as far as custom cancellation behavior goes, we are always going to close the connection, but I don't mind making the send-cancellation-request-on-close behavior customizable.
@jackc please return support for CustomCancel. Since pgx4 has no support for CustomCancel, it always sends a cancel, and with PgBouncer this fills up the max_client_conn limit and, as a result, blocks new connections.
Yes, but it is mapped by client->cancel_key, and this client->cancel_key is generated for each client; it is not the server cancel_key:
BTW, you have this race condition with plain Postgres too, because, as you said above, the cancel request is linked to a PID, not to some "query id", so you don't know which query will be running in the Postgres backend when it handles the cancellation.
I would expect pgx to only see the client cancel key then -- and PgBouncer to be smart enough not to send the cancel to the PostgreSQL server unless that client was busy.
This isn't a problem with regular pgx because any cancellation closes the connection. Sending the cancel signal to the server is merely a courtesy to prevent long-running queries from continuing to consume server time. Maybe I don't understand the whole situation, but I still find it difficult to believe that this PgBouncer behavior is intentional and not a bug somewhere. I'm not opposed to adding an on-cancel function -- though since connections are always closed on cancel, the only thing it could really do is decide whether or not to send the cancel signal -- but I'd prefer to solve the root problem rather than add an override to work around it.
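For reference, the cancel signal itself is just the PostgreSQL CancelRequest message sent over a fresh connection, carrying the PID and secret key from the BackendKeyData the client received at startup -- which is why it matters whose PID and key PgBouncer hands out. A minimal sketch of that message (this is the documented wire protocol, not pgx's internal code; the address and key values are placeholders):

```go
package cancelreq

import (
	"encoding/binary"
	"net"
)

// sendCancelRequest shows the shape of the PostgreSQL CancelRequest message.
// pid and secretKey are the values from the BackendKeyData message the client
// received at startup; addr is whatever the client originally connected to
// (a placeholder here). Cancel requests always travel over a new connection.
func sendCancelRequest(addr string, pid, secretKey uint32) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], 16)       // message length, including itself
	binary.BigEndian.PutUint32(buf[4:8], 80877102) // CancelRequest code
	binary.BigEndian.PutUint32(buf[8:12], pid)     // backend PID
	binary.BigEndian.PutUint32(buf[12:16], secretKey)
	_, err = conn.Write(buf)
	return err
}
```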
I just spent several hours digging into this and am unable to reproduce it. First, I wrote an example program that explicitly tries to create this problem: https://github.com/jackc/pgx_issues/blob/8eb2ebfc8b41c96c4a642e4d078c293f69af320c/main.go. When run, it creates two connections to a PgBouncer instance and displays the PID reported through the wire protocol and the PID reported by the server itself. Then I start a long-running query on one PgBouncer connection and cancel it on the other. PgBouncer correctly does not send the cancel to PostgreSQL.
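A sketch of that kind of PID comparison, assuming pgx/v4 and a local PgBouncer (the connection string is a placeholder, and pg_backend_pid() is assumed as the server-side check):

```go
package main

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4"
)

func main() {
	ctx := context.Background()

	// Connect through PgBouncer; the connection string is a placeholder.
	conn, err := pgx.Connect(ctx, "postgres://user:pass@127.0.0.1:6432/app")
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	// PID/secret key reported in BackendKeyData during startup
	// (with PgBouncer this may not be a real PostgreSQL backend PID).
	wirePID := conn.PgConn().PID()

	// PID of the PostgreSQL backend that actually runs the queries.
	var backendPID int
	if err := conn.QueryRow(ctx, "select pg_backend_pid()").Scan(&backendPID); err != nil {
		panic(err)
	}

	fmt.Printf("wire protocol PID: %d, pg_backend_pid(): %d\n", wirePID, backendPID)
}
```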
But maybe this only happens under high concurrency, or when using context cancellation rather than explicitly invoking a cancel. So I wrote another test: https://github.com/jackc/pgx_issues/blob/976053da4b9972d6a6e5fed1fc7b7a8562b3e913/main.go. This is a stress test that starts many workers, each of which randomly executes queries, some of which are cancelled and some of which are not. It checks that no queries are unexpectedly cancelled -- and none are.
At this point I do not see a bug in PgBouncer or in pgx. I'm going to need a test case that reproduces this problem before investing any more time in this.
@jackc can you show your PgBouncer config?
Here it is with the comments removed.
Hi! I tried to look a bit deeper into the issue. I guess you're right, it's not possible to cancel someone else's query:
But the problem is still there. Here's what I see in PgBouncer when load testing one of our services that uses pgx/v4/pgxpool (chart: blue lines - cl_active, red dots - cl_waiting, right scale). The pool is limited to 10 both in the service and on the DB. The maximum allowed client connections to PgBouncer is 100. As you can see, there are far too many cl_waiting connections (almost up to 60, expected 0). I've put together a test case to demonstrate the issue: https://github.com/iosadchiy/pgx_pgbouncer_issue. It runs 100 workers sending queries that are always canceled, with pool size 10. If you run it, PgBouncer will show these metrics:
As you can see, there are too few cl_active and too many cl_waiting connections (expected 10 and 0, respectively). As far as I understand, the problem lies in the pgxpool - pgconn interaction:
Thus, when a lot of requests are being canceled by timeout (that is, under heavy load), pgx creates far more connections than the maximum pool size allows and fills PgBouncer's pool with cancellation queries. The service's performance degrades dramatically due to the lack of cl_active connections. One possible solution could be simply not to use query cancellation when working with PgBouncer, that is, to bring CustomCancel back; it works pretty well for us with pgx v3. Another solution could be to allow pgxpool to control the query cancellation process and to know the total number of connections, including cancellation requests. As of now, we're kind of blocked by this problem in the long run: we cannot migrate to pgx/v4 until there is a good workaround. Hope this was useful. Let me know what you think!
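The failure mode described above can be sketched roughly like this (not the linked test case itself; the connection string, timings, and query are placeholders): the client-side pool is capped at 10, but every query is cancelled by timeout, so connection churn plus cancel requests pile up in PgBouncer.

```go
package main

import (
	"context"
	"sync"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@127.0.0.1:6432/app")
	if err != nil {
		panic(err)
	}
	cfg.MaxConns = 10 // client-side pool is capped at 10

	pool, err := pgxpool.ConnectConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // many more workers than pool slots
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				// The timeout is shorter than the query, so every query is cancelled,
				// forcing the pool to discard and replace the connection each time.
				ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
				_, _ = pool.Exec(ctx, "select pg_sleep(1)")
				cancel()
			}
		}()
	}
	wg.Wait()
}
```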
I was able to duplicate the problem using your example. But I don't think the cancel query that occurs when closing the connection is the problem. I experimented with removing it, and the test still exhibited catastrophic behavior very rapidly. I think the root problem is that closing a connection is handled asynchronously. As far as pgxpool is concerned the connection is immediately closed, but the connection actually lives on for a little while. Since there is now room in the pool, the pool establishes another connection. I think removing the query cancellation could even make it worse: the "closing" connections could live much longer if they didn't actually close until their last query finished. It's a bit of a tricky situation. When a query is cancelled, the function must return immediately; it can't wait for the connection to close cleanly. That is why it is done in the background. But that causes an unfortunate side effect for the pool. Okay, I think I have a solution. There were a few changes necessary.
With these changes, cl_active + cl_waiting was never greater than 10.
Works like a charm, at least for performance tests. Thank you!
It turns out that while the issue is fixed for pgx/v4 + pgxpool, it can still be reproduced for pgx/v4 + database/sql. Check out the test case: https://github.com/iosadchiy/pgx_pgbouncer_issue/tree/database-sql. If you let it run for 15-30 seconds, you will get something like this:
The pool size is limited to 10, but there are 100 connections. Simply closing the connection synchronously resolves the issue:
It works with both pgxpool and database/sql. I'm not sure what the reasons are behind closing asynchronously. Could you please clarify or point to an explanation? Do we really need to close the connection asynchronously? It seems that in the case of pgxpool the connection is already being closed synchronously, due to the waiting for CleanupDone(), and it works fine so far.
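For a single pgx/v4 connection, the "close synchronously" idea amounts to waiting on CleanupDone() after Close; a sketch (not the actual pgxpool internals):

```go
package dbutil

import (
	"context"

	"github.com/jackc/pgx/v4"
)

// closeSynchronously closes a connection and then blocks until the background
// cleanup (including any cancel request) has actually finished, which is what
// waiting on CleanupDone() provides.
func closeSynchronously(ctx context.Context, conn *pgx.Conn) error {
	err := conn.Close(ctx)
	<-conn.PgConn().CleanupDone()
	return err
}
```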
Well, I guess it depends on how we define "need". IMO, there are two requirements for context cancellation.
These objectives are opposed to each other, in that the second objective requires network activity that could take an arbitrary amount of time. That is why the query cancellation and connection closing are asynchronous.
The underlying resource pool implementation, puddle, calls the destructor, which closes the connection. The only obvious way to not close asynchronously would be to violate one of the two requirements above. There might be some workaround possible in the database/sql integration.
A notable factor in this type of problem, which we've encountered at reddit with both Golang and Python clients: as demonstrated in your test cases, aggressive query cancellation can exhaust PgBouncer's client connections. A mitigation for this is to ensure that your clients never cancel queries faster than PgBouncer's corresponding timeout allows.
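One way to apply that mitigation is to put a floor under every per-query timeout; a sketch, where the floor value is an assumption that has to be matched to the PgBouncer configuration by hand:

```go
package dbutil

import (
	"context"
	"time"
)

// minCancelAfter is an application-level floor for query timeouts; the intent
// is that it be no smaller than the relevant PgBouncer timeout. The value
// here is a placeholder.
const minCancelAfter = 5 * time.Second

// queryContext derives a per-query context whose timeout is never shorter
// than minCancelAfter. Note that cancellation of the parent context (e.g. an
// http request being aborted) still propagates immediately.
func queryContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout < minCancelAfter {
		timeout = minCancelAfter
	}
	return context.WithTimeout(parent, timeout)
}
```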
Hello @jackc! We migrated several services from pgx/v3 to pgx/v4 and started to see the same issue with PgBouncer. When we roll back to pgx/v3, the issue disappears.
What can be done about that?
I tested different versions of pgx and found out that this problem returned in 4.14.0. Version 4.13.0 works as expected.
Here is the test output for 4.13.0:
EDIT:
I'm currently using the pq driver with the PgBouncer transaction pool, and we had an issue passing a ctx into QueryContext, where the ctx was passed from the http Request. The issue described above made us use context.Background() instead. Now we are migrating to pgx/v5 and I'm curious how to pass a ctx safely, considering the problem @alienth described. Is it still possible to encounter this issue with pgx/v5? Can we pass the ctx from the Request, which can be cancelled when the client goes away, etc.? Thanks for the help.
I am not aware of any changes regarding this issue between v4 and v5. But I haven't looked into it either.
Hi! I have another use case where being able to disable the cancel query would be better. Also, some managed PostgreSQL offerings from some cloud providers have really low connection-rate capacity; I wasn't able to exceed 25 conn/s on a 16-core replica. In those environments, I would rather keep the connection capacity for queries and applications than spend it on sending cancel requests. We usually rely on statement_timeout and idle_in_transaction_session_timeout to clean up leftover queries and transactions running on the servers, so cancel requests are not necessary for us.
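For that approach, the server-side timeouts can be set as session parameters when the pool is configured; a sketch assuming pgx/v4 and pgxpool (the values and connection string are placeholders, and whether these settings survive PgBouncer's pooling mode depends on its configuration):

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@127.0.0.1:6432/app")
	if err != nil {
		panic(err)
	}

	// Lean on server-side timeouts instead of client cancel requests.
	// These are sent as session settings when each connection is established.
	cfg.ConnConfig.RuntimeParams["statement_timeout"] = "30s"
	cfg.ConnConfig.RuntimeParams["idle_in_transaction_session_timeout"] = "60s"

	pool, err := pgxpool.ConnectConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}
```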
@jackc Hi! We have the same problem: under high load the pool falls to zero (using a pool size greater than pgx.max_pool_size).
By versions: |
@roman-wb You probably want to use BuildContextWatcherHandler.
@jackc ok, thx! Please tell me: what happens if we assign an empty struct to BuildContextWatcherHandler?
You would get a compile error... If you want to simply reduce the number of cancellation requests, at the expense of context cancellations not actually doing anything right away, then you want to have your context watcher handler delay before acting on the cancellation.
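Assuming a recent pgx v5 in which pgconn exposes BuildContextWatcherHandler together with a DeadlineContextWatcherHandler, such a configuration might look roughly like this sketch (the exact type and field names should be checked against the pgconn version in use; the connection string and delay are placeholders):

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgconn/ctxwatch"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@127.0.0.1:6432/app")
	if err != nil {
		panic(err)
	}

	// Instead of sending a cancel request the moment a context is cancelled,
	// fall back to a connection deadline: the query is abandoned and the
	// connection closed after DeadlineDelay, and no extra connection is
	// opened toward PgBouncer just to deliver a cancel request.
	cfg.ConnConfig.BuildContextWatcherHandler = func(conn *pgconn.PgConn) ctxwatch.Handler {
		return &pgconn.DeadlineContextWatcherHandler{Conn: conn, DeadlineDelay: time.Second}
	}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}
```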
@jackc I mean an empty struct:
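For concreteness, an "empty struct" handler of the kind being described would be something like the following hypothetical sketch (the type and function names here are made up; the interface is pgx v5's ctxwatch.Handler):

```go
package dbutil

import (
	"context"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgconn/ctxwatch"
)

// noopContextWatcherHandler satisfies ctxwatch.Handler but does nothing when
// the context is cancelled. The name is invented for this example.
type noopContextWatcherHandler struct{}

func (noopContextWatcherHandler) HandleCancel(context.Context) {}
func (noopContextWatcherHandler) HandleUnwatchAfterCancel()    {}

// disableCancellation wires the no-op handler into a pgconn config.
func disableCancellation(cfg *pgconn.Config) {
	cfg.BuildContextWatcherHandler = func(*pgconn.PgConn) ctxwatch.Handler {
		return noopContextWatcherHandler{}
	}
}
```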
What will happen with the code above?
That would make context cancellation a no-op. It would never send a cancel request and it would never set a deadline. A query could block forever.
Hi! I was migrating to pgx/v4 when I noticed that the pgx.ConnConfig.CustomCancel hook is no more. How do I set up a custom cancel func with v4? I'd like to disable cancellation signaling completely, since it may kill the wrong query when using PgBouncer.