Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set last_connect_time for pool only when it really failed to connect #127

Closed

Conversation

CyberDem0n
Copy link

and rename last_connect_time to last_connect_failed_time to make it
clearer its purpose.

Last couple of months we were observing very strange issues with pgbouncer.
From time to time it was starting to fail and producing following log lines:
'launch_new_connection: last failed, wait'
And clients were disconnected with the message: "pgbouncer cannot connect to server"

We run pgbouncer with server_login_retry = 15 seconds (default setting)
That means after 15 seconds it should try to establish connection to backend and after success reset last_connect_failed flag. But it was never happened (flag was never reset).
We had enabled debug logs (verbose = 2) and started investigating.
Every time problems were starting with:
WARNING sbuf_connect failed: Resource temporarily unavailable (EAGAIN, full backlog on unix domain socket)

laster there were a lot of messages "launch_new_connection: last failed, wait" and after 15 seconds:
2016-04-06 13:43:29.221 75936 DEBUG S-0x1734db0: db_name/user_name@unix:5432 launching new connection to server
2016-04-06 13:43:29.221 75936 DEBUG S-0x1734db0: db_name/user_name@unix:5432 S: connect ok
2016-04-06 13:43:29.221 75936 DEBUG S-0x1734db0: db_name/user_name@unix:5432 use it for pending cancel req

That means, connection was opened successfully, but this connection was used for forwarding "cancel request" and closed right after. As a consequence, LOGIN phase was not complete and
last_connect_failed flag was not reset. BUT! last_connect_time value was updated much earlier, before it even tried to establish connection with the postgres.

And again we have "launch_new_connection: last failed, wait" during 15 seconds.

Obviously we have some issues with our setup:

  1. pool_size is too small and pgbouncer is opening new connections too often.
  2. net.core.somaxconn = 128 is too small and definitely must be increased (this should help to mitigate backlog problems)
  3. Somehow we have too much "cancel requests". But I expect this to be a consequence of first failure to open a new connection. Because Our pgbouncer is used only by plproxy and for every query executed through plproxy we have 8 connections to pgbouncer (to the same database).

Sure, we already fixed 1 and 2, but in my opinion pgbouncer should handle such case (when there are pending cancel requests) correctly and this pull request intended to fix it.

and rename last_connect_time to last_connect_failed_time to make it
clearer its purpose.
@CyberDem0n
Copy link
Author

@PJMODOS, please have a look on it.

@Firefishy
Copy link

Any progress?

@CyberDem0n
Copy link
Author

@Firefishy from my side I can only tell that pgbouncer with this fix is working successfully in production since April 2016.

@PJMODOS
Copy link
Contributor

PJMODOS commented Nov 4, 2017

If you chose to refactor this way, why keep the last_connect_failed? We could get same info from the fact that last_connect_failed_time is non-zero.

@CyberDem0n
Copy link
Author

Hi @PJMODOS, thank you for your feedback!

This is not a refactoring, but attempt to fix a real problem.
Regarding getting rid from last_connect_failed: I believe your suggestion make sense. When I was doing that fix my goal was to touch a small lines of code as possible.

@CyberDem0n CyberDem0n force-pushed the bugfix/last_connect_failed_time branch from 18255fe to ee11bdb Compare November 4, 2017 17:26
@CyberDem0n
Copy link
Author

After 2.5 years this problem was finally solved by #329

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants