
ConnectionPool#check method failures will cause an infinite loop #709

Closed
dabdine opened this issue Dec 29, 2023 · 12 comments · Fixed by #711

Comments

@dabdine

dabdine commented Dec 29, 2023

In my use case, I'm using the check method to set some parameters for Postgres RLS (specifically, setting the ROLE and the tenant parameters on the connection) before the connection is handed to application code. Similarly, I use reset to clear that state when the connection is returned to the pool.

I noticed that if a connection's check method fails, psycopg3 just loops infinitely. In my case, the underlying cause was a mistake I made with query parameters (the details are irrelevant here). I think either a sensible default or a maximum retry count would be helpful for debugging purposes (otherwise I was left wondering why my application was hanging).

I originally tried to use configure, but I need information that comes from a ContextVar set in the context of a FastAPI request (a tenant identifier), so that wouldn't work: configure runs at moments when that information is not available.
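
To make the use case concrete, here is roughly what the check/reset pair looks like (a simplified sketch: the role name, the app.tenant_id setting, and the ContextVar are illustrative, not my exact code):

from contextvars import ContextVar

current_tenant: ContextVar[str] = ContextVar("current_tenant")

def rls_check(conn):
    # Runs just before the pool hands the connection to application code:
    # switch to the restricted role and expose the tenant id to RLS policies.
    tenant = current_tenant.get()
    with conn.cursor() as cursor:
        cursor.execute("SET ROLE app_user")
        cursor.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant,))
    conn.commit()

def rls_reset(conn):
    # Runs when the connection goes back to the pool: clear the RLS state.
    with conn.cursor() as cursor:
        cursor.execute("RESET ROLE")
        cursor.execute("RESET app.tenant_id")
    conn.commit()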

Minimal example:

import os
from psycopg import Connection
from psycopg_pool import ConnectionPool

def bugged_check(conn):
    with conn.cursor() as cursor:
        try:
            print("performing something that will except")
            cursor.execute("blah")
            conn.commit()
        except Exception as e:
            raise e

connection_pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    max_size=10,
    check=bugged_check,
)

with connection_pool.getconn() as conn:
    print("never gets here")

execution:

poetry run python test.py > foo  0.55s user 0.20s system 27% cpu 2.711 total

output:

$ wc -l foo
3010 foo
$ tail -n10 foo
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
dabdine changed the title from "ConnectionPool#check method failures can cause an infinite loop" to "ConnectionPool#check method failures will cause an infinite loop" on Dec 29, 2023
@dvarrazzo
Member

Hm... getconn() should have a 30s timeout, which I would have expected to kick in here, but your test seems to show that it is not respected when check() fails... Let me try to reproduce it.

@dvarrazzo
Member

There is a bug indeed, in the chain getconn() -> _getconn_unchecked() -> _get_ready_connection(): getconn() calculates a deadline (by default 30s as documented), which is passed down to _get_ready_connection()... which does exactly nothing with it.

def getconn(self, timeout: Optional[float] = None) -> CT:
    """Obtain a connection from the pool.

    You should preferably use `connection()`. Use this function only if
    it is not possible to use the connection as context manager.

    After using this function you *must* call a corresponding `putconn()`:
    failing to do so will deplete the pool. A depleted pool is a sad pool:
    you don't want a depleted pool.
    """
    t0 = monotonic()
    if timeout is None:
        timeout = self.timeout
    deadline = t0 + timeout

    logger.info("connection requested from %r", self.name)
    self._stats[self._REQUESTS_NUM] += 1

    self._check_open_getconn()

    try:
        while True:
            conn = self._getconn_unchecked(deadline - monotonic())
            try:
                self._check_connection(conn)
            except Exception:
                self._putconn(conn, from_getconn=True)
            else:
                logger.info("connection given by %r", self.name)
                return conn
    # Re-raise the timeout exception presenting the user the global
    # timeout, not the per-attempt one.
    except PoolTimeout:
        raise PoolTimeout(
            f"couldn't get a connection after {timeout:.2f} sec"
        ) from None

def _getconn_unchecked(self, timeout: float) -> CT:
    # Critical section: decide here if there's a connection ready
    # or if the client needs to wait.
    with self._lock:
        conn = self._get_ready_connection(timeout)
        if not conn:
            # No connection available: put the client in the waiting queue
            t0 = monotonic()
            pos: WaitingClient[CT] = WaitingClient()
            self._waiting.append(pos)
            self._stats[self._REQUESTS_QUEUED] += 1

            # If there is space for the pool to grow, let's do it
            self._maybe_grow_pool()

    # If we are in the waiting queue, wait to be assigned a connection
    # (outside the critical section, so only the waiting client is locked)
    if not conn:
        try:
            conn = pos.wait(timeout=timeout)
        except Exception:
            self._stats[self._REQUESTS_ERRORS] += 1
            raise
        finally:
            t1 = monotonic()
            self._stats[self._REQUESTS_WAIT_MS] += int(1000.0 * (t1 - t0))

    # Tell the connection it belongs to a pool to avoid closing on __exit__
    # Note that this property shouldn't be set while the connection is in
    # the pool, to avoid to create a reference loop.
    conn._pool = self
    return conn

def _get_ready_connection(self, timeout: Optional[float]) -> Optional[CT]:
    """Return a connection, if the client deserves one."""
    conn: Optional[CT] = None
    if self._pool:
        # Take a connection ready out of the pool
        conn = self._pool.popleft()
        if len(self._pool) < self._nconns_min:
            self._nconns_min = len(self._pool)
    elif self.max_waiting and len(self._waiting) >= self.max_waiting:
        self._stats[self._REQUESTS_ERRORS] += 1
        raise TooManyRequests(
            f"the pool {self.name!r} has already"
            + f" {len(self._waiting)} requests waiting"
        )
    return conn

This change seems enough to solve it:

diff --git a/psycopg_pool/psycopg_pool/pool.py b/psycopg_pool/psycopg_pool/pool.py
index 288473fc..0ab3c5d9 100644
--- a/psycopg_pool/psycopg_pool/pool.py
+++ b/psycopg_pool/psycopg_pool/pool.py
@@ -266,8 +266,11 @@ class ConnectionPool(Generic[CT], BasePool):
     def _getconn_unchecked(self, timeout: float) -> CT:
         # Critical section: decide here if there's a connection ready
         # or if the client needs to wait.
+        if timeout <= 0.0:
+            raise PoolTimeout()
+
         with self._lock:
-            conn = self._get_ready_connection(timeout)
+            conn = self._get_ready_connection()
             if not conn:
                 # No connection available: put the client in the waiting queue
                 t0 = monotonic()
@@ -296,7 +299,7 @@ class ConnectionPool(Generic[CT], BasePool):
         conn._pool = self
         return conn
 
-    def _get_ready_connection(self, timeout: Optional[float]) -> Optional[CT]:
+    def _get_ready_connection(self) -> Optional[CT]:
         """Return a connection, if the client deserves one."""
         conn: Optional[CT] = None
         if self._pool:

With this changeset your test should fail after exactly 30 seconds.

I will add a test to verify that the issue is fixed and make a new release of the pool package. If you can test it on your side too, that would be welcome.

@dvarrazzo
Member

The proper fix is a bit different, in order to take the NullPool into account, but tests now pass for me. I will merge this changeset and release psycopg_pool 3.2.1 in the next few days.

@dabdine
Author

dabdine commented Dec 29, 2023

That's great @dvarrazzo, thank you!

Is it correct to assume, though, that with this test case the loop will eat CPU for 30s until it is finally killed by the timeout? I'm not sure there's a decent way around that, short of defining check to return a bool instead and bailing out completely, without retry, if check ever raises.

The main issue would be an unrecoverable condition within the check method that will always cause it to fail: every attempt to grab a connection would then time out at 30s, while check loops "infinitely" within that timeframe (potentially bombarding the db if it's actually executing a query that never succeeds). In other words, is there a way to communicate to the application code that a connection grab would never be successful?

@dvarrazzo
Member

@dabdine I don't think it is practical to make the protocol between the check function and the caller more complicated, not to mention that backward compatibility has to be taken into account. However, I see your point. I think the best strategy is to introduce an exponential backoff to avoid a busy loop around a failing check function: we already do something similar in the reconnection loop, so really we should have thought of it.
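
To sketch the idea (this is only illustrative, not the actual pool internals: the class and function names below are made up), the loop around the user check would sleep a growing, jittered delay between attempts, still bounded by the overall getconn deadline:

import random
import time

class Backoff:
    # Illustrative helper: exponential backoff with jitter, similar in
    # spirit to what the reconnection loop already does.
    def __init__(self, initial=1.0, multiplier=2.0, maximum=16.0, jitter=0.1):
        self.delay = initial
        self.multiplier = multiplier
        self.maximum = maximum
        self.jitter = jitter

    def next_delay(self):
        # Return the current delay with a random +/- jitter applied,
        # then grow the base delay for the next attempt (capped at maximum).
        jittered = self.delay * (1.0 + self.jitter * (2 * random.random() - 1))
        self.delay = min(self.delay * self.multiplier, self.maximum)
        return jittered

def get_checked_connection(get_conn, check, deadline):
    # Hypothetical retry loop: keep trying until a connection passes the
    # user-provided check or the overall deadline expires.
    backoff = Backoff()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0.0:
            raise TimeoutError("couldn't get a checked connection in time")
        conn = get_conn(remaining)
        try:
            check(conn)
        except Exception:
            time.sleep(min(backoff.next_delay(), remaining))
        else:
            return conn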

@dvarrazzo
Member

This is a behaviour we might like more, isn't it?

$ time python bug-709.py 
2023-12-29 23:38:11,415 INFO performing something that will except
2023-12-29 23:38:11,416 INFO performing something that will except
2023-12-29 23:38:12,328 INFO performing something that will except
2023-12-29 23:38:14,153 INFO performing something that will except
2023-12-29 23:38:17,799 INFO performing something that will except
2023-12-29 23:38:25,095 INFO performing something that will except
2023-12-29 23:38:39,683 INFO performing something that will except
Traceback (most recent call last):
  File "/home/piro/dev/psycopg3/bug-709.py", line 29, in <module>
    with connection_pool.getconn() as conn:
  File "/home/piro/dev/psycopg3/psycopg_pool/psycopg_pool/pool.py", line 276, in getconn
    raise PoolTimeout(
psycopg_pool.PoolTimeout: couldn't get a connection after 30.00 sec

real	0m30.123s
user	0m0.115s
sys	0m0.012s

Once this is cleaned up and tested more thoroughly, we will probably include this improvement too in the next bugfix release.

@dvarrazzo
Member

I have a question for you, since you have put some logic in your check function.

The Pool.check() method internally uses Pool.check_connection(). Given your use case, would it make more sense for it to use the user-provided check, if one is specified?

dvarrazzo added a commit that referenced this issue Dec 29, 2023
This is a behaviour consistent with the exponential backoff used in reconnection attempts, and actually we could reuse the same object implementing the backoff + jitter.

See also #709
@dvarrazzo
Member

dvarrazzo commented Dec 30, 2023

MR #711 should both solve the problem of respecting the getconn timeout and throttle the rate of re-checking with an exponential backoff.

Testing is welcome 🙂
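
If you want to check both behaviours quickly, something along these lines should work (the timings, the DSN env var, and the failing check are just an example):

import os
import time

from psycopg_pool import ConnectionPool, PoolTimeout

attempts = []

def bugged_check(conn):
    # Record when each check runs, so the growing gaps between retries
    # (the backoff) are visible in the output.
    attempts.append(time.monotonic())
    raise RuntimeError("check always fails")

pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    check=bugged_check,
    timeout=10,  # shorter than the 30s default, to keep the test quick
)

try:
    pool.getconn()
except PoolTimeout:
    gaps = [round(b - a, 1) for a, b in zip(attempts, attempts[1:])]
    print(f"{len(attempts)} check attempts, gaps between them: {gaps}")
finally:
    pool.close()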

@dabdine
Author

dabdine commented Dec 31, 2023

I have a question for you, since you have put some logic in your check function.

The Pool.check() method internally uses Pool.check_connection(). Given your use case, would it make more sense for it to use the user-provided check, if one is specified?

If I'm understanding it correctly, the pool's built-in check just verifies that the connection is still alive, running a SELECT 1 to test whether it still has a wire connection to the db. If that's the case, I don't think it fits my use case, and I think it makes sense the way it operates right now.

I'm trying to mimic what my Go-based backend server does right now with pgxpool. Specifically, it uses BeforeAcquire and AfterRelease (https://pkg.go.dev/github.com/jackc/pgx/v4/pgxpool#Config) to set and unset RLS parameters before and after the application code gets a connection, respectively.

It seems like the check method works for that use case (along with reset); see the wiring sketched below. The check_connection on the pool seems like a reasonable internal check for a pool to test the liveness of its child connections.
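
For reference, the wiring on my side amounts to passing both callbacks to the pool, roughly like this (reusing the rls_check / rls_reset sketch from my first comment; the names are illustrative):

import os

from psycopg_pool import ConnectionPool

# rls_check / rls_reset as sketched in the opening comment (hypothetical names)
pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    max_size=10,
    check=rls_check,  # set ROLE / tenant before the connection is handed out
    reset=rls_reset,  # clear that state when the connection returns to the pool
)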

@dabdine
Author

dabdine commented Dec 31, 2023

@dabdine I don't think it is practical to make the protocol between the check function and the caller more complicated, not to mention that backward compatibility has to be taken into account. However, I see your point. I think the best strategy is to introduce an exponential backoff to avoid a busy loop around a failing check function: we already do something similar in the reconnection loop, so really we should have thought of it.

I think this solution -- along with me properly fixing my check function implementation -- is pretty solid!

@dabdine
Author

dabdine commented Dec 31, 2023

MR #711 should both solve the problem of respecting the getconn timeout and throttle the rate of re-checking with an exponential backoff.

Testing is welcome 🙂

Tested locally and works like a charm, great job, and thank you!

@dvarrazzo
Member

Psycopg-pool 3.2.1 has been released with this improvement. Testing is welcome. Cheers!
