FATAL @src/client.c:1092 in function client_proto(): bad client state: 6 #904
Comments
Hmm, that's not good. That seems like the fourth type of such crash caused by the introduction of these states in #717. These states are for query cancelation requests, and in theory you don't need to have sent a query to be able to send a cancelation request. I took a quick look at the file+line number mentioned in the error, but I'm honestly not sure how one of these states would get there:
Recently a new DB was added on all environments, and a Golang app that connects to this new DB was also deployed on all envs.
Thank you
We have run into a similar problem on PgBouncer 1.19.1 a couple of days ago:
It looks like #822, which was marked as a duplicate of another (already fixed) issue.
Thanks for the report @tnt-dev. This is a different issue from the one reported by @raymasson. I created a PR to fix it in #927. For the issue reported by @raymasson I still don't know the exact cause or the right fix.
Actually, I think I figured out what's happening and also how to correctly fix it. I'll try to reproduce it tomorrow with a small Go program and confirm that the fix I have in mind works.
The cancel request protocol is very simple:

1. The client sends a cancel key.
2. The server cancels the query running on the matching backend.
3. The server closes the socket to indicate the cancel request was handled.

At no point after sending the cancel key is the client required to send any data, and if it does it won't get a response anyway, because the server would close the connection. So most clients don't; specifically, all libpq-based clients don't. The Go client, however, sends a Terminate ('X') packet before closing the cancel socket from its own side (usually due to a 10 second timeout). This would cause a crash in PgBouncer: PgBouncer would still listen on the client socket, but on receiving any data it would report a fatal error. This PR fixes that fatal error by no longer listening on the client socket once we've received the cancel key (i.e. when we change the client state to CL_WAITING_CANCEL). I was able to reproduce both of the errors reported by @raymasson and confirmed that after this change the errors no longer occur. Reproducing this with libpq-based clients is not possible, except by manually attaching to the client with `gdb` and manually running `send` on the cancel request's socket. So no test is added, since our test suite uses psycopg, which uses libpq under the hood. Fixes pgbouncer#904 Related to pgbouncer#717
Fixes #822 Related to #904 and #717

In #904 @tnt-dev reported still getting the #822 issue on 1.19.1 (on which it was expected to be fixed):

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing the SV_BEING_CANCELED case. I added it to the same case list that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the same state as SV_IDLE (except we don't allow reuse of the server yet). I did the same in a second switch case; for that one only warnings would have been shown instead of a fatal error, but it still seems good to avoid unnecessary warnings.

One example where this fatal error might occur: PgBouncer is still waiting for a response to the cancel request it sent to the server, but the query being canceled completes before that happens. This puts the server in SV_BEING_CANCELED. Then, for whatever reason, the server sends a NOTICE or ERROR message (still before the cancel request receives a response from Postgres). I was able to reproduce this (and confirm that the patch resolves it) by adding some fake latency between PgBouncer and Postgres and stopping Postgres while the server was in the SV_BEING_CANCELED state.
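The shape of this bug (a state machine whose switch treats every unlisted state as fatal, and a new state that was never added to the right case list) can be sketched independently of PgBouncer. This is a Go illustration under assumed names, not the real C code in src/server.c:

```go
package main

import "fmt"

// serverState mirrors the states discussed above; the real enum
// lives in PgBouncer's C sources, these names are stand-ins.
type serverState int

const (
	svIdle serverState = iota
	svActive
	svBeingCanceled // query done, but cancel request still in flight
)

// handlePacket mimics the structure of server_proto()'s switch:
// known states are handled, anything else is treated as fatal.
func handlePacket(st serverState) (string, error) {
	switch st {
	case svIdle, svBeingCanceled:
		// SV_BEING_CANCELED is effectively SV_IDLE (the server just
		// cannot be reused yet), so it belongs in the same case list.
		// Before the fix only svIdle was listed here, so a NOTICE or
		// ERROR arriving while a cancel was in flight fell through to
		// the default branch and took the whole process down.
		return "handled", nil
	case svActive:
		return "forwarded", nil
	default:
		return "", fmt.Errorf("server in bad state: %d", st)
	}
}

func main() {
	out, err := handlePacket(svBeingCanceled)
	fmt.Println(out, err)
}
```

The fix is exactly the one-token change shown in the first case list: adding the "being canceled" state next to the idle state it behaves like, so stray server messages during cancelation are handled instead of hitting the fatal default branch.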
@tnt-dev @raymasson I merged both fixes. If you're able to, please try out the current master branch to confirm that it resolves the issue for you.
Hello,
I have pgbouncer v1.19.1 in front of 10 DBs.
When I add a client that connects to one of the DBs, I start getting one of the following errors and pgbouncer.service exits:
FATAL @src/client.c:1092 in function client_proto(): bad client state: 6
FATAL @src/client.c:1092 in function client_proto(): bad client state: 7
The new client (a Golang service) just connects to one of the DBs through pgbouncer but does not run any queries.
The client states 6 and 7 are CL_WAITING_CANCEL and CL_ACTIVE_CANCEL.
How can a client be in one of these states? Why would it make the pgbouncer systemd service crash?
Is it just a matter of pgbouncer config (increase max_client_conn and file descriptor ulimit)?
Do you have an idea of what could cause this issue?
Thank you very much