Not able to retry connection after reconnect_failed callback #438

Closed
dshick-nydig opened this issue Nov 17, 2022 · 9 comments
Labels: pool (the issue is related to the psycopg_pool package)

Comments

@dshick-nydig
Contributor

There are circumstances where connection failures can be resolved by updating conninfo, e.g. timed-out temporary credentials (AWS RDS IAM Auth, etc.) which can be refetched.

if attempt.time_to_give_up(now):
    logger.warning(
        "reconnection attempt in pool %r failed after %s sec",
        self.name,
        self.reconnect_timeout,
    )
    with self._lock:
        self._nconns -= 1
        # If we have given up with a growing attempt, allow a new one.
        if growing and self._growing:
            self._growing = False
    self.reconnect_failed()
else:
    attempt.update_delay(now)
    self.schedule_task(
        AddConnection(self, attempt, growing=growing),
        attempt.delay,
    )
return

The subsequent call to AddConnection happens in the else branch, so it doesn't fire regardless of any corrective action taken in the reconnect_failed callback.

Looking through the methods that invoke AddConnection for a public candidate on self that could be used within the reconnect_failed callback:

  • self.check() - won't work because an empty pool is not addressed here
  • self.resize() - could work if the min size changes, but would mean altering the desired pool params

I've found that calling _maybe_grow_pool in the reconnect_failed callback is currently the best way to retry AddConnection after modifying self.conninfo, but it is presumably non-public for a reason.
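
For reference, the workaround looks roughly like this (a minimal sketch; fetch_fresh_conninfo() is a placeholder for whatever refetches the credentials, e.g. a new RDS IAM auth token):

import psycopg_pool

def fetch_fresh_conninfo() -> str:
    # Placeholder: rebuild the conninfo with refreshed credentials.
    return "host=db.example.com user=app password=<new-token>"

def on_reconnect_failed(pool: psycopg_pool.ConnectionPool) -> None:
    # Point the pool at a conninfo with fresh credentials...
    pool.conninfo = fetch_fresh_conninfo()
    # ...then force new AddConnection attempts via the private API.
    pool._maybe_grow_pool()

pool = psycopg_pool.ConnectionPool(
    fetch_fresh_conninfo(),
    reconnect_failed=on_reconnect_failed,
)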

One suggestion would be to allow the callback to return a value indicating whether AddConnection should be retried, possibly together with a maximum number of times the callback can be invoked between successful connections, to avoid an endless retry loop.

Another option would be to add a public method for modifying conninfo on the fly, which also verifies that the new configuration allows successful connections.

If there's a more appropriate way to deal with this situation, please let me know.

@dvarrazzo
Member

dvarrazzo commented Nov 17, 2022

I think I see what you mean. The pool only reconnects if a connection can be re-established, and only when a new connection is requested, because that's the moment in which we call _maybe_grow_pool().

In [1]: import logging
In [2]: logging.basicConfig(level=logging.INFO)
In [3]: logging.getLogger("psycopg.pool").setLevel(logging.INFO)

In [4]: from psycopg_pool import ConnectionPool
In [11]: p = ConnectionPool("port=54314", reconnect_timeout=5, open=True)
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool

# Stop the server

In [12]: with p.connection() as conn:
    ...:     conn.execute("select 1")

INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:connection given by 'pool-3'
INFO:psycopg.pool:returning connection to 'pool-3'
WARNING:psycopg.pool:discarding closed connection: <psycopg.Connection [BAD] at 0x7fcd55416a00>
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
OperationalError: consuming input failed: terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

# Check will pretty much empty the pool

In [13]: p.check()

WARNING:psycopg.pool:discarding broken connection: <psycopg.Connection [BAD] at 0x7fcd67f8ef10>
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
WARNING:psycopg.pool:reconnection attempt in pool 'pool-3' failed after 5 sec

# Restart the server

In [14]: with p.connection() as conn:
    ...:     print(conn.execute("select 1").fetchone()[0])
    ...: 
INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:growing pool 'pool-3' to 1
WARNING:psycopg.pool:error connecting in 'pool-3': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?

# This will block for 30s before failing.
# Restarting the server in the meantime will not save this connection attempt.

...
PoolTimeout: couldn't get a connection after 30.0 sec

# Following connection attempt will succeed immediately, after reconnection.
# Furthermore, the grow attempt triggered by `connection()` will replenish the pool.

In [15]: with p.connection() as conn:
    ...:     print(conn.execute("select 1").fetchone()[0])
    ...: 
INFO:psycopg.pool:connection requested from 'pool-3'
INFO:psycopg.pool:growing pool 'pool-3' to 1
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-3' to 2
INFO:psycopg.pool:connection given by 'pool-3'
1
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-3' to 3
INFO:psycopg.pool:returning connection to 'pool-3'

INFO:psycopg.pool:adding new connection to the pool
In [16]: INFO:psycopg.pool:growing pool 'pool-3' to 4
INFO:psycopg.pool:adding new connection to the pool

We could definitely call _maybe_grow_pool() in check(). It seems a useful part of maintaining the pool state.
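
Until then, a user-level approximation of that behaviour would be (a sketch, not the actual pool code):

from psycopg_pool import ConnectionPool

class TopUpPool(ConnectionPool):
    # A check() that also tops the pool up, approximating the proposed change.
    def check(self):
        super().check()
        self._maybe_grow_pool()  # private API; the real fix would live inside check()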


The latter connection failure, by the way, only happens because in this test I have set a reconnect timeout (5s) shorter than the getconn timeout (30s). In a different experiment, with a reconnect timeout of 1 minute:

In [6]: p = ConnectionPool("port=54314", reconnect_timeout=60, open=True)

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:adding new connection to the pool

 # Stop the server

In [8]: p.check()
WARNING:psycopg.pool:discarding broken connection: <psycopg.Connection [BAD] at 0x7fd16c4a7310>
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
...
WARNING:psycopg.pool:reconnection attempt in pool 'pool-1' failed after 60 sec

In [9]: with p.connection() as conn:
   ...:     print(conn.execute("select 1").fetchone()[0])
   ...: 
INFO:psycopg.pool:connection requested from 'pool-1'
INFO:psycopg.pool:growing pool 'pool-1' to 1
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?
WARNING:psycopg.pool:error connecting in 'pool-1': connection is bad: No such file or directory
	Is the server running locally and accepting connections on that socket?

# This will wait for 30 seconds. In the meantime, restart the server:

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 2
INFO:psycopg.pool:connection given by 'pool-1'
1
INFO:psycopg.pool:returning connection to 'pool-1'

INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 3
INFO:psycopg.pool:adding new connection to the pool
INFO:psycopg.pool:growing pool 'pool-1' to 4
INFO:psycopg.pool:adding new connection to the pool

# The getconn worked, and the pool is replenished.

I believe this is a sane behaviour, which is provided by the current default (5m for reconnect timeout).

Do you think the pool should do anything different, apart from triggering a maybe-grow on check?

@dshick-nydig
Contributor Author

I think that's like 90% of the problem fixed! This feels like the more correct behavior of check in the case of an empty pool.


The only other part of this is more related to avoiding a loop when there's no hope of recovery...

import logging

import psycopg_pool

logger = logging.getLogger(__name__)


# The snippet below is the body of a factory function (implied by the
# `nonlocal` and the trailing `return`); the wrapper is reconstructed here.
# get_new_creds() is the application's credential fetcher, defined elsewhere.
def make_pool(connection_config: str) -> psycopg_pool.ConnectionPool:
    should_retry = True

    def _update_connection_config(self: psycopg_pool.ConnectionPool):
        # we've failed to reconnect

        # and we might have been here before for this pool instance
        nonlocal should_retry
        if not should_retry:
            return

        # but if we haven't, try getting new creds
        connection_config = get_new_creds()

        # update conninfo with new creds
        self.conninfo = connection_config

        # don't retry recursively while growing the pool
        # because if the new creds are bad, we'll end up back here
        should_retry = False

        # forcefully try to reconnect after updating creds
        # we won't need this after your change
        # pylint: disable=W0212
        self._maybe_grow_pool()

        # check each connection and throw out bad ones
        self.check()

        # if we have valid connections now, it's okay to retry in the future
        pool_size = self.get_stats()["pool_size"]
        logger.debug("pool size is %s after trying new creds", pool_size)

        ###################################
        # this has always been true
        # I think the AddConnection jobs are race-y
        ###################################
        if pool_size > 0:
            should_retry = True
            logger.debug("will retry in the future")

        return

    return psycopg_pool.ConnectionPool(
        connection_config,
        reconnect_failed=_update_connection_config,
        min_size=1,
        max_size=1,
        reconnect_timeout=1,
    )

If I'm swapping in different conninfo inside of a reconnect_failed and that new conninfo also triggers a failure when running check, I could see that causing a never ending loop of failure. But I suppose that's kind of the author's problem to handle in the callback implementation.

I've been trying to use pool_size after _maybe_grow_pool to see if that resulted in any valid connections, but it seems to always return a non-zero value, even if the AddConnection calls failed. Might be due to the fact that those AddConnection calls are async. I think I probably need to just synchronously test the credentials before calling check, but if you have a better idea, I'm interested to hear it.
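
The synchronous test I have in mind is roughly this (just a sketch; the DSN comes from the application):

import psycopg

def creds_work(conninfo: str) -> bool:
    # Open one throwaway connection with the new credentials before handing
    # them to the pool; if this fails there is no point retrying.
    try:
        with psycopg.connect(conninfo, connect_timeout=5) as conn:
            conn.execute("select 1")
        return True
    except psycopg.OperationalError:
        return False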

@dvarrazzo
Member

If I'm swapping in different conninfo inside of a reconnect_failed and that new conninfo also triggers a failure when running check, I could see that causing a never ending loop of failure. But I suppose that's kind of the author's problem to handle in the callback implementation.

Uhm... this use case is not supported, and not tested, but I see it being somewhat valid... I'll take a look at whether it's possible to support it.

I've been trying to use pool_size after _maybe_grow_pool to see if that resulted in any valid connections, but it seems to always return a non-zero value, even if the AddConnection calls failed. Might be due to the fact that those AddConnection calls are async. I think I probably need to just synchronously test the credentials before calling check, but if you have a better idea, I'm interested to hear it.

As you are testing in this area, maybe you would like to propose an MR to add a maybe-grow in check?

If not, I will work on these points, but it will have to wait a few days/weeks.

@dshick-nydig
Contributor Author

The way I've been testing this, btw, is to start a while loop that queries and then sleeps; while it's sleeping after a successful query, in a psql session I run:

alter role tester with password 'tester2';
select pg_terminate_backend(pid) from pg_stat_activity
    where datname = 'database_name' and application_name <> 'psql';

to toggle the password to something else and drop existing pool connections. This forces the reconnect_failed callback to fire, and get_new_creds returns a new password.
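
The loop side is roughly this (sketch only; the DSN is a placeholder and _update_connection_config is the callback from the snippet above):

import time
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    "dbname=database_name user=tester password=tester",  # placeholder DSN
    reconnect_failed=_update_connection_config,
    min_size=1,
    max_size=1,
    reconnect_timeout=1,
)

# Query, sleep, repeat. While the loop sleeps, run the ALTER ROLE /
# pg_terminate_backend statements above in psql.
while True:
    with pool.connection() as conn:
        print(conn.execute("select 1").fetchone()[0])
    time.sleep(5)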

@dshick-nydig
Contributor Author

I'll look into getting a PR out for the _maybe_grow check. I may have time over the weekend, but I'm just starting to familiarize myself with this codebase, so it could be longer.

Thanks for all of your help and all of your great work in general on this project!!!

@dvarrazzo
Member

The way I've been testing this...

This test would be difficult to implement in a portable way in the test suite. What I would test is using a different application_name in the new connection string, and verifying that it's taken into account using SHOW application_name on a connection returned by the pool.
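
The verification half of such a test could look roughly like this (a sketch; the DSNs are placeholders, and making the initial conninfo fail is the part that is hard to do portably):

from psycopg_pool import ConnectionPool

def swap_conninfo(pool):
    # reconnect_failed callback: the replacement conninfo differs only in
    # application_name, so a successful swap is observable from SQL.
    pool.conninfo = "dbname=test application_name=after_failure"

pool = ConnectionPool(
    "dbname=test application_name=before_failure",  # made to fail somehow
    reconnect_failed=swap_conninfo,
    open=True,
)

# ... after reconnect_failed has fired and the pool has grown again ...
with pool.connection() as conn:
    assert conn.execute("SHOW application_name").fetchone()[0] == "after_failure"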

@dvarrazzo
Member

Also note that a valid way to switch the connection target could be to use a connection env var. For instance, if you want to switch host, you can leave the host unspecified in the connection string and change the PGHOST env var instead.
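
For example (a sketch; the hostnames are made up):

import os
from psycopg_pool import ConnectionPool

# No host in the conninfo: libpq resolves it from PGHOST at connect time.
os.environ["PGHOST"] = "db-primary.example.com"
pool = ConnectionPool("dbname=test user=app", open=True)

# Later, point new connection attempts at a different server.
os.environ["PGHOST"] = "db-standby.example.com"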

@dvarrazzo
Member

    ###################################
    # this has always been true
    # I think the AddConnection jobs are race-y
    ###################################
    if pool_size > 0:

pool._nconns, which is exposed by the pool_size measure, counts connections in the pool, out of the pool, and being attempted too. You will see, for instance, the value being increased in maybe_grow (after passing the "maybe" guard) and decreased only when AddConnection gives up. I don't believe it's subject to race conditions, and changing the value always seems to be correctly guarded by pool._lock, but of course bugs are still possible and I don't mind if you take a look at the code 😄
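
Concretely (the DSN is a placeholder):

from psycopg_pool import ConnectionPool

pool = ConnectionPool("dbname=test", min_size=4, open=True)  # placeholder DSN
pool.wait()  # block until min_size connections are actually established

stats = pool.get_stats()
# pool_size mirrors _nconns: connections sitting in the pool, connections
# currently checked out, and connections still being attempted by scheduled
# AddConnection tasks.
print(stats["pool_size"], stats["pool_available"])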

@dshick-nydig
Contributor Author

Makes sense. I think the code is fine as is. I just meant "race-y" in the sense that AddConnection running in a scheduled task means this value is likely unpredictable from the reconnect_failed callback.
