Skip to content

Fix: FATAL in function disconnect_client(): bad client state: 0 #846

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 23, 2023

Conversation

JelteF
Copy link
Member

@JelteF JelteF commented May 22, 2023

During normal operation, cancel clients get closed because their linked
server finished sending the cancel request. But this is not always the
case. It's possible for the client to disconnect itself, which would
cause the client object to be freed, before the linked server is. In
those cases the client was not unlinked from the server, thus when the
server was then closed it would try to close the already freed client
again. This would then result in a fatal error.

In passing this adds a test for cancel_wait_timeout. With the help of
that test and some hacky code changes (including ones in postgres) I was
able to reliably reproduce this issue. An actual test for this issue cannot
be implemented currently, because libpq (and thus also psycopg) requires
waiting for a cancel request to complete.

Fixes #801

Related to #717 and #815

@eulerto
Copy link
Member

eulerto commented May 22, 2023

Did you test with enable-cassert? (CI covers it but since you mentioned you have to modify Postgres to reliably reproduce it, I'm curious if it passes your tests.)

Shouldn't you disconnect_server() with send_term = true when state is CL_ACTIVE_CANCEL?

Add a comment explaining why you're cleaning the link (extract the words from the commit message). It is faster to read the code than to find the exact change reference in the commit logs.

During normal operation, cancel clients get closed because their linked
server finished sending the cancel request. But this is not always the
case. It's possible for the client to disconnect itself, which would
cause the client object to be freed, before the linked server is. In
those cases the client was not unlinked from the server, thus when the
server was then closed it would try to close the already freed client
again. This would then result in a fatal error.

In passing this adds a test for cancel_wait_timeout. With the help of
that test and some hacky code changes (including ones in postgres) I was
able to reliably reproduce this issue.
@JelteF JelteF force-pushed the fix-801-attempt-2 branch from 532dcc3 to 74c50bf Compare May 23, 2023 08:36
@JelteF
Copy link
Member Author

JelteF commented May 23, 2023

Did you test with enable-cassert? (CI covers it but since you mentioned you have to modify Postgres to reliably reproduce it, I'm curious if it passes your tests.)

Good call. I don't think I had cassert enabled. I enabled it now and reran my manual tests. All is good.

Shouldn't you disconnect_server() with send_term = true when state is CL_ACTIVE_CANCEL?

No, send_term is only part of the query protocol. The cancelation protocol would not understand it. It only expects 32 bits for the PID and 32bits for the secret, and then a connection close.

Add a comment explaining why you're cleaning the link (extract the words from the commit message). It is faster to read the code than to find the exact change reference in the commit logs.

Done

@eulerto
Copy link
Member

eulerto commented May 23, 2023

Good call. I don't think I had cassert enabled. I enabled it now and reran my manual tests. All is good.

Ok.

No, send_term is only part of the query protocol. The cancelation protocol would not understand it. It only expects 32 bits for the PID and 32bits for the secret, and then a connection close.

Hmm. Ok.

Copy link
Member

@eulerto eulerto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JelteF JelteF merged commit 3869a85 into pgbouncer:master May 23, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request May 30, 2023
…uncer#846)

During normal operation, cancel clients get closed because their linked
server finished sending the cancel request. But this is not always the
case. It's possible for the client to disconnect itself, which would
cause the client object to be freed, before the linked server is. In
those cases the client was not unlinked from the server, thus when the
server was then closed it would try to close the already freed client
again. This would then result in a fatal error.

In passing this adds a test for cancel_wait_timeout. With the help of
that test and some hacky code changes (including ones in postgres) I was
able to reliably reproduce this issue.

(cherry picked from commit 3869a85)
JelteF added a commit that referenced this pull request May 31, 2023
During normal operation, cancel clients get closed because their linked
server finished sending the cancel request. But this is not always the
case. It's possible for the client to disconnect itself, which would
cause the client object to be freed, before the linked server is. In
those cases the client was not unlinked from the server, thus when the
server was then closed it would try to close the already freed client
again. This would then result in a fatal error.

In passing this adds a test for cancel_wait_timeout. With the help of
that test and some hacky code changes (including ones in postgres) I was
able to reliably reproduce this issue.

(cherry picked from commit 3869a85)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FATAL @src/objects.c:1052 in function disconnect_client(): bad client state: 0
2 participants