Not able to retry connection after reconnect_failed callback #438

Closed
dshick-nydig opened this issue Nov 17, 2022 · 9 comments
Labels: pool (the issue is related to the psycopg_pool package)

Comments

@dshick-nydig
Contributor

There are circumstances where connection failures can be resolved by updating conninfo, e.g. timed-out temporary credentials (AWS RDS IAM Auth, etc.) which can be refetched.

if attempt.time_to_give_up(now):
    logger.warning(
        "reconnection attempt in pool %r failed after %s sec",
        self.name,
        self.reconnect_timeout,
    )
    with self._lock:
        self._nconns -= 1
        # If we have given up with a growing attempt, allow a new one.
        if growing and self._growing:
            self._growing = False
    self.reconnect_failed()
else:
    attempt.update_delay(now)
    self.schedule_task(
        AddConnection(self, attempt, growing=growing),
        attempt.delay,
    )
return

The subsequent call to AddConnection happens in the else branch, so it doesn't fire regardless of any corrective action taken in the reconnect_failed callback.

Looking through the methods that invoke AddConnection for a public candidate on self that could be used within the reconnect_failed callback:

  • self.check() - won't work because an empty pool is not addressed here
  • self.resize() - could work if the min size changes, but would mean altering the desired pool params

I've found that calling _maybe_grow_pool in the reconnect_failed callback is currently the best way to retry AddConnection after modifying self.conninfo, but it is presumably non-public for a reason.
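
For reference, the workaround looks roughly like this (a minimal sketch; fetch_fresh_conninfo() is a placeholder for whatever refetches the credentials, e.g. a new RDS IAM auth token):

import psycopg_pool

def fetch_fresh_conninfo() -> str:
    # Placeholder: rebuild the conninfo with refreshed credentials.
    return "host=db.example.com user=app password=<new-token>"

def on_reconnect_failed(pool: psycopg_pool.ConnectionPool) -> None:
    # Point the pool at a conninfo with fresh credentials...
    pool.conninfo = fetch_fresh_conninfo()
    # ...then force new AddConnection attempts via the private API.
    pool._maybe_grow_pool()

pool = psycopg_pool.ConnectionPool(
    fetch_fresh_conninfo(),
    reconnect_failed=on_reconnect_failed,
)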

One suggestion would be to allow the callback to return a value indicating whether AddConnection should be retried, possibly together with a maximum number of times the callback can be invoked between successful connections, to avoid an endless retry loop.

Another option would be to add a public method for modifying conninfo on the fly, which also verifies that the new configuration allows successful connections.

If there's a more appropriate way to deal with this situation, please let me know.

@dvarrazzo
Member

dvarrazzo commented Nov 17, 2022

I think I see what you mean. The pool only reconnects if a connection can be re-established, and only when a new connection is requested, because that's the moment in which we call _maybe_grow_pool().

In [1]: import logging
In [2]: logging.basicConfig(level=logging.INFO)
In [3]: logging.getLogger("psycopg.pool").setLevel(logging.INFO)

In [4]: from psycopg_pool import ConnectionPool
In [11]: p = ConnectionPool("port=54314", reconnect_timeout=5, open=True)
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool

# Stop the server

In [12]: with p.connection() as conn:
    ...:     conn.execute("select 1")

INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:connection given by 'pool-3'
INFO:psycopg.pool:returning connection to 'pool-3'
WARNING:psycopg.pool:discarding closed connection: <psycopg.Connection [BAD] at 0x7fcd55416a00>
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
OperationalError: consuming input failed: terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

# Check will pretty much empty the pool

In [13]: p.check()

WARNING:psycopg.pool:discarding broken connection: <psycopg.Connection [BAD] at 0x7fcd67f8ef10>
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
WARNING:psycopg.pool:reconnection attempt in pool 'pool-3' failed after 5 sec

# Restart the server

In [14]: with p.connection() as conn:
    ...:     print(conn.execute("select 1").fetchone()[0])
    ...: 
INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:growing pool 'pool-3' to 1
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?

# This will block for 30s before failing.
# Restarting the server in the meantime will not save this connection attempt.

...
PoolTimeout: couldn't get a connection after 30.0 sec

# Following connection attempt will succeed immediately, after reconnection.
# Furthermore, the grow attempt triggered by `connection()` will replenish the pool.

In [15]: with p.connection() as conn:
    ...:     print(conn.execute("select 1").fetchone()[0])
    ...: 
INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:growing pool 'pool-3' to 1
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-3' to 2
INFO:psycopg.pool:connection given by 'pool-3'
1
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-3' to 3
INFO:psycopg.pool:returning connection to 'pool-3'

INFO:psycopg.pool:adding new connection to the pool
In [16]: INFO:psycopg.pool:growing pool 'pool-3' to 4
INFO:psycopg.pool:adding new connection to the pool

We could definitely call _maybe_grow_pool() in check(). It seems a useful part of maintaining the pool state.
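
Until then, a user-level approximation of that behaviour would be (a sketch, not the actual pool code):

from psycopg_pool import ConnectionPool

class TopUpPool(ConnectionPool):
    # A check() that also tops the pool up, approximating the proposed change.
    def check(self):
        super().check()
        self._maybe_grow_pool()  # private API; the real fix would live inside check()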


The latter connection failure, by the way, only happens because in this test I have set a reconnect timeout (5s) shorter than the getconn timeout (30s). In a different experiment, with a reconnect timeout of 1 minute:

In [6]: p = ConnectionPool("port=54314", reconnect_timeout=60, open=True)

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool

 # Stop the server

In [8]: p.check()
WARNING:psycopg.pool:discarding broken connection: <psycopg.Connection [BAD] at 0x7fd16c4a7310>
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
WARNING:psycopg.pool:reconnection attempt in pool 'pool-1' failed after 60 sec

In [9]: with p.connection() as conn:
   ...:     print(conn.execute("select 1").fetchone()[0])
   ...: 
INFO:psycopg.pool:connection requested from 'pool-1'
INFO:psycopg.pool:growing pool 'pool-1' to 1
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?

# This will wait for 30 seconds. In the meantime, restart the server:

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 2
INFO:psycopg.pool:connection given by 'pool-1'
1
INFO:psycopg.pool:returning connection to 'pool-1'

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 3
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 4
INFO:psycopg.pool:adding new connection to the pool

# The getconn worked, and the pool is replenished.

I believe this is a sane behaviour, which is provided by the current default (5m for reconnect timeout).

Do you think the pool should do anything different, apart from triggering a maybe-grow on check?

@dshick-nydig
Contributor Author

I think that's like 90% of the problem fixed! This feels like the more correct behavior of check in the case of an empty pool.


The only other part of this is more related to avoiding a loop when there's no hope of recovery...

import logging

import psycopg_pool

logger = logging.getLogger(__name__)


# The snippet below is the body of a factory function (implied by the
# `nonlocal` and the trailing `return`); the wrapper is reconstructed here.
# get_new_creds() is the application's credential fetcher, defined elsewhere.
def make_pool(connection_config: str) -> psycopg_pool.ConnectionPool:
    should_retry = True

    def _update_connection_config(self: psycopg_pool.ConnectionPool):
        # we've failed to reconnect

        # and we might have been here before for this pool instance
        nonlocal should_retry
        if not should_retry:
            return

        # but if we haven't, try getting new creds
        connection_config = get_new_creds()

        # update conninfo with new creds
        self.conninfo = connection_config

        # don't retry recursively while growing the pool
        # because if the new creds are bad, we'll end up back here
        should_retry = False

        # forcefully try to reconnect after updating creds
        # we won't need this after your change
        # pylint: disable=W0212
        self._maybe_grow_pool()

        # check each connection and throw out bad ones
        self.check()

        # if we have valid connections now, it's okay to retry in the future
        pool_size = self.get_stats()["pool_size"]
        logger.debug("pool size is %s after trying new creds", pool_size)

        ###################################
        # this has always been true
        # I think the AddConnection jobs are race-y
        ###################################
        if pool_size > 0:
            should_retry = True
            logger.debug("will retry in the future")

        return

    return psycopg_pool.ConnectionPool(
        connection_config,
        reconnect_failed=_update_connection_config,
        min_size=1,
        max_size=1,
        reconnect_timeout=1,
    )

If I'm swapping in different conninfo inside of a reconnect_failed and that new conninfo also triggers a failure when running check, I could see that causing a never ending loop of failure. But I suppose that's kind of the author's problem to handle in the callback implementation.

I've been trying to use pool_size after _maybe_grow_pool to see if that resulted in any valid connections, but it seems to always return a non-zero value, even if the AddConnection calls failed. Might be due to the fact that those AddConnection calls are async. I think I probably need to just synchronously test the credentials before calling check, but if you have a better idea, I'm interested to hear it.
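
The synchronous test I have in mind is roughly this (just a sketch; the DSN comes from the application):

import psycopg

def creds_work(conninfo: str) -> bool:
    # Open one throwaway connection with the new credentials before handing
    # them to the pool; if this fails there is no point retrying.
    try:
        with psycopg.connect(conninfo, connect_timeout=5) as conn:
            conn.execute("select 1")
        return True
    except psycopg.OperationalError:
        return False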

@dvarrazzo
Member

If I'm swapping in different conninfo inside of a reconnect_failed and that new conninfo also triggers a failure when running check, I could see that causing a never ending loop of failure. But I suppose that's kind of the author's problem to handle in the callback implementation.

Uhm... this use case is not supported, and not tested, but I see it being somewhat valid... I'll take a look at whether it's possible to support it.

I've been trying to use pool_size after _maybe_grow_pool to see if that resulted in any valid connections, but it seems to always return a non-zero value, even if the AddConnection calls failed. Might be due to the fact that those AddConnection calls are async. I think I probably need to just synchronously test the credentials before calling check, but if you have a better idea, I'm interested to hear it.

As you are testing in this area, maybe you would like to propose an MR to add a maybe-grow in check?

If not, I will work on these points, but it will have to wait a few days/weeks.

@dshick-nydig
Contributor Author

The way I've been testing this, btw, is to start a while loop that queries and then sleeps; while it's sleeping after a successful query, in a psql session I run:

alter role tester with password 'tester2';
select pg_terminate_backend(pid) from pg_stat_activity
    where datname = 'database_name' and application_name <> 'psql';

to toggle the password to something else and drop existing pool connections. This forces the reconnect_failed callback to fire, and get_new_creds returns a new password.
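
The loop side is roughly this (sketch only; the DSN is a placeholder and _update_connection_config is the callback from the snippet above):

import time
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    "dbname=database_name user=tester password=tester",  # placeholder DSN
    reconnect_failed=_update_connection_config,
    min_size=1,
    max_size=1,
    reconnect_timeout=1,
)

# Query, sleep, repeat. While the loop sleeps, run the ALTER ROLE /
# pg_terminate_backend statements above in psql.
while True:
    with pool.connection() as conn:
        print(conn.execute("select 1").fetchone()[0])
    time.sleep(5)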

@dshick-nydig
Contributor Author

I'll look into getting a PR out for the _maybe_grow check. I may have time over the weekend, but I'm just starting to familiarize myself with this codebase, so it could be longer.

Thanks for all of your help and all of your great work in general on this project!!!

@dvarrazzo
Member

The way I've been testing this...

This test would be difficult to implement in a portable way in the test suite. What I would test is using a different application_name in the new connection string, and verifying that it's taken into account using SHOW application_name on a connection returned by the pool.
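
The verification half of such a test could look roughly like this (a sketch; the DSNs are placeholders, and making the initial conninfo fail is the part that is hard to do portably):

from psycopg_pool import ConnectionPool

def swap_conninfo(pool):
    # reconnect_failed callback: the replacement conninfo differs only in
    # application_name, so a successful swap is observable from SQL.
    pool.conninfo = "dbname=test application_name=after_failure"

pool = ConnectionPool(
    "dbname=test application_name=before_failure",  # made to fail somehow
    reconnect_failed=swap_conninfo,
    open=True,
)

# ... after reconnect_failed has fired and the pool has grown again ...
with pool.connection() as conn:
    assert conn.execute("SHOW application_name").fetchone()[0] == "after_failure"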

@dvarrazzo
Member

Also note that a valid way to switch the connection target could be to use a connection env var. For instance, if you want to switch host, you can leave the host unspecified in the connection string and change the PGHOST env var instead.
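
For example (a sketch; the hostnames are made up):

import os
from psycopg_pool import ConnectionPool

# No host in the conninfo: libpq resolves it from PGHOST at connect time.
os.environ["PGHOST"] = "db-primary.example.com"
pool = ConnectionPool("dbname=test user=app", open=True)

# Later, point new connection attempts at a different server.
os.environ["PGHOST"] = "db-standby.example.com"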

@dvarrazzo
Member

    ###################################
    # this has always been true
    # I think the AddConnection jobs are race-y
    ###################################
    if pool_size > 0:

pool._nconns, which is exposed by the pool_size measure, counts connections in the pool, out of the pool, and being attempted too. You will see, for instance, the value being increased in maybe_grow (after passing the "maybe" guard) and decreased only when AddConnection gives up. I don't believe it's subject to race conditions, and changing the value always seems to be correctly guarded by pool._lock, but of course bugs are still possible and I don't mind if you take a look at the code 😄
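
Concretely (the DSN is a placeholder):

from psycopg_pool import ConnectionPool

pool = ConnectionPool("dbname=test", min_size=4, open=True)  # placeholder DSN
pool.wait()  # block until min_size connections are actually established

stats = pool.get_stats()
# pool_size mirrors _nconns: connections sitting in the pool, connections
# currently checked out, and connections still being attempted by scheduled
# AddConnection tasks.
print(stats["pool_size"], stats["pool_available"])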

@dshick-nydig
Contributor Author

Makes sense. I think the code is fine as is. I just meant "race-y" in the sense that AddConnection running in a scheduled task means this value is likely unpredictable from the reconnect_failed callback.
