
ConnectionPool#check method failures will cause an infinite loop #709

Closed
dabdine opened this issue Dec 29, 2023 · 12 comments · Fixed by #711

Comments

@dabdine

dabdine commented Dec 29, 2023

In my use case, I'm using the check method to set some parameters for Postgres RLS (specifically, setting the ROLE and the tenant parameters on the connection) before the connection is handed to application code. Similarly, I use reset to clear that state when the connection is returned to the pool.

I noticed that if a connection's check method fails, psycopg3 just loops infinitely. In my case, the underlying cause was a mistake I made with query parameters (the details are irrelevant here). I think either a sensible default or a maximum retry count would be helpful for debugging purposes (otherwise I was left wondering why my application was hanging).

I originally tried to use configure, but I need information that comes from a ContextVar set in the context of a FastAPI request (a tenant identifier), so that wouldn't work: configure runs at moments when that information is not available.
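
To make the use case concrete, here is roughly what the check/reset pair looks like (a simplified sketch: the role name, the app.tenant_id setting, and the ContextVar are illustrative, not my exact code):

from contextvars import ContextVar

current_tenant: ContextVar[str] = ContextVar("current_tenant")

def rls_check(conn):
    # Runs just before the pool hands the connection to application code:
    # switch to the restricted role and expose the tenant id to RLS policies.
    tenant = current_tenant.get()
    with conn.cursor() as cursor:
        cursor.execute("SET ROLE app_user")
        cursor.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant,))
    conn.commit()

def rls_reset(conn):
    # Runs when the connection goes back to the pool: clear the RLS state.
    with conn.cursor() as cursor:
        cursor.execute("RESET ROLE")
        cursor.execute("RESET app.tenant_id")
    conn.commit()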

Minimal example:

import os
from psycopg import Connection
from psycopg_pool import ConnectionPool

def bugged_check(conn):
    with conn.cursor() as cursor:
        try:
            print("performing something that will except")
            cursor.execute("blah")
            conn.commit()
        except Exception as e:
            raise e

connection_pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    max_size=10,
    check=bugged_check,
)

with connection_pool.getconn() as conn:
    print("never gets here")

execution:

poetry run python test.py > foo  0.55s user 0.20s system 27% cpu 2.711 total

output:

$ wc -l foo
3010 foo
$ tail -n10 foo
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
performing something that will except
dabdine changed the title from "ConnectionPool#check method failures can cause an infinite loop" to "ConnectionPool#check method failures will cause an infinite loop" on Dec 29, 2023
@dvarrazzo
Member

Hm... getconn() should have a 30s timeout, which I would have expected to kick in here, but your test seems to show that it is not respected when check() fails... Let me try to reproduce it.

@dvarrazzo
Member

There is a bug indeed, in the chain getconn() -> _getconn_unchecked() -> _get_ready_connection(): getconn() calculates a deadline (by default 30s as documented), which is passed down to _get_ready_connection()... which does exactly nothing with it.

def getconn(self, timeout: Optional[float] = None) -> CT:
    """Obtain a connection from the pool.

    You should preferably use `connection()`. Use this function only if
    it is not possible to use the connection as context manager.

    After using this function you *must* call a corresponding `putconn()`:
    failing to do so will deplete the pool. A depleted pool is a sad pool:
    you don't want a depleted pool.
    """
    t0 = monotonic()
    if timeout is None:
        timeout = self.timeout
    deadline = t0 + timeout

    logger.info("connection requested from %r", self.name)
    self._stats[self._REQUESTS_NUM] += 1

    self._check_open_getconn()

    try:
        while True:
            conn = self._getconn_unchecked(deadline - monotonic())
            try:
                self._check_connection(conn)
            except Exception:
                self._putconn(conn, from_getconn=True)
            else:
                logger.info("connection given by %r", self.name)
                return conn
    # Re-raise the timeout exception presenting the user the global
    # timeout, not the per-attempt one.
    except PoolTimeout:
        raise PoolTimeout(
            f"couldn't get a connection after {timeout:.2f} sec"
        ) from None

def _getconn_unchecked(self, timeout: float) -> CT:
    # Critical section: decide here if there's a connection ready
    # or if the client needs to wait.
    with self._lock:
        conn = self._get_ready_connection(timeout)
        if not conn:
            # No connection available: put the client in the waiting queue
            t0 = monotonic()
            pos: WaitingClient[CT] = WaitingClient()
            self._waiting.append(pos)
            self._stats[self._REQUESTS_QUEUED] += 1

            # If there is space for the pool to grow, let's do it
            self._maybe_grow_pool()

    # If we are in the waiting queue, wait to be assigned a connection
    # (outside the critical section, so only the waiting client is locked)
    if not conn:
        try:
            conn = pos.wait(timeout=timeout)
        except Exception:
            self._stats[self._REQUESTS_ERRORS] += 1
            raise
        finally:
            t1 = monotonic()
            self._stats[self._REQUESTS_WAIT_MS] += int(1000.0 * (t1 - t0))

    # Tell the connection it belongs to a pool to avoid closing on __exit__
    # Note that this property shouldn't be set while the connection is in
    # the pool, to avoid to create a reference loop.
    conn._pool = self
    return conn

def _get_ready_connection(self, timeout: Optional[float]) -> Optional[CT]:
    """Return a connection, if the client deserves one."""
    conn: Optional[CT] = None
    if self._pool:
        # Take a connection ready out of the pool
        conn = self._pool.popleft()
        if len(self._pool) < self._nconns_min:
            self._nconns_min = len(self._pool)
    elif self.max_waiting and len(self._waiting) >= self.max_waiting:
        self._stats[self._REQUESTS_ERRORS] += 1
        raise TooManyRequests(
            f"the pool {self.name!r} has already"
            + f" {len(self._waiting)} requests waiting"
        )
    return conn

This change seems enough to solve it:

diff --git a/psycopg_pool/psycopg_pool/pool.py b/psycopg_pool/psycopg_pool/pool.py
index 288473fc..0ab3c5d9 100644
--- a/psycopg_pool/psycopg_pool/pool.py
+++ b/psycopg_pool/psycopg_pool/pool.py
@@ -266,8 +266,11 @@ class ConnectionPool(Generic[CT], BasePool):
     def _getconn_unchecked(self, timeout: float) -> CT:
         # Critical section: decide here if there's a connection ready
         # or if the client needs to wait.
+        if timeout <= 0.0:
+            raise PoolTimeout()
+
         with self._lock:
-            conn = self._get_ready_connection(timeout)
+            conn = self._get_ready_connection()
             if not conn:
                 # No connection available: put the client in the waiting queue
                 t0 = monotonic()
@@ -296,7 +299,7 @@ class ConnectionPool(Generic[CT], BasePool):
         conn._pool = self
         return conn
 
-    def _get_ready_connection(self, timeout: Optional[float]) -> Optional[CT]:
+    def _get_ready_connection(self) -> Optional[CT]:
         """Return a connection, if the client deserves one."""
         conn: Optional[CT] = None
         if self._pool:

With this changeset your test should fail after exactly 30 seconds.

I will add a test to verify that the issue is fixed and make a new release of the pool package. If you can test it on your side too, that would be welcome.

@dvarrazzo
Member

The proper fix is a bit different, in order to take the NullPool into account, but tests now pass for me. I will merge this changeset and release psycopg_pool 3.2.1 in the next few days.

@dabdine
Author

dabdine commented Dec 29, 2023

That's great @dvarrazzo, thank you!

Is it correct to assume, though, that with this test case the loop will eat CPU for 30s until it is finally killed by the timeout? I'm not sure there's a decent way around that, short of defining check to return a bool instead and bailing out completely, without retry, if check ever raises.

The main issue would be an unrecoverable condition within the check method that will always cause it to fail: every attempt to grab a connection would then time out at 30s, while check loops "infinitely" within that timeframe (potentially bombarding the db if it's actually executing a query that never succeeds). In other words, is there a way to communicate to the application code that a connection grab would never be successful?

@dvarrazzo
Member

@dabdine I don't think it is practical to make the protocol between the check function and the caller more complicated, not to mention that backward compatibility has to be taken into account. However, I see your point. I think the best strategy is to introduce an exponential backoff to avoid a busy loop around a failing check function: we already do something similar in the reconnection loop, so really we should have thought of it.
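
To sketch the idea (this is only illustrative, not the actual pool internals: the class and function names below are made up), the loop around the user check would sleep a growing, jittered delay between attempts, still bounded by the overall getconn deadline:

import random
import time

class Backoff:
    # Illustrative helper: exponential backoff with jitter, similar in
    # spirit to what the reconnection loop already does.
    def __init__(self, initial=1.0, multiplier=2.0, maximum=16.0, jitter=0.1):
        self.delay = initial
        self.multiplier = multiplier
        self.maximum = maximum
        self.jitter = jitter

    def next_delay(self):
        # Return the current delay with a random +/- jitter applied,
        # then grow the base delay for the next attempt (capped at maximum).
        jittered = self.delay * (1.0 + self.jitter * (2 * random.random() - 1))
        self.delay = min(self.delay * self.multiplier, self.maximum)
        return jittered

def get_checked_connection(get_conn, check, deadline):
    # Hypothetical retry loop: keep trying until a connection passes the
    # user-provided check or the overall deadline expires.
    backoff = Backoff()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0.0:
            raise TimeoutError("couldn't get a checked connection in time")
        conn = get_conn(remaining)
        try:
            check(conn)
        except Exception:
            time.sleep(min(backoff.next_delay(), remaining))
        else:
            return conn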

@dvarrazzo
Member

This is a behaviour we might like more, isn't it?

$ time python bug-709.py 
2023-12-29 23:38:11,415 INFO performing something that will except
2023-12-29 23:38:11,416 INFO performing something that will except
2023-12-29 23:38:12,328 INFO performing something that will except
2023-12-29 23:38:14,153 INFO performing something that will except
2023-12-29 23:38:17,799 INFO performing something that will except
2023-12-29 23:38:25,095 INFO performing something that will except
2023-12-29 23:38:39,683 INFO performing something that will except
Traceback (most recent call last):
  File "/home/piro/dev/psycopg3/bug-709.py", line 29, in <module>
    with connection_pool.getconn() as conn:
  File "/home/piro/dev/psycopg3/psycopg_pool/psycopg_pool/pool.py", line 276, in getconn
    raise PoolTimeout(
psycopg_pool.PoolTimeout: couldn't get a connection after 30.00 sec

real	0m30.123s
user	0m0.115s
sys	0m0.012s

Once this is cleaned up and tested more thoroughly, we will probably include this improvement too in the next bugfix release.

@dvarrazzo
Member

I have a question for you, since you have put some logic in your check function.

The Pool.check() method internally uses Pool.check_connection(). Given your use case, would it make more sense for it to use the user-provided check, if one is specified?

dvarrazzo added a commit that referenced this issue Dec 29, 2023
This is a behaviour consistent with the exponential backoff used in reconnection attempts, and actually we could reuse the same object implementing the backoff + jitter.

See also #709
@dvarrazzo
Member

dvarrazzo commented Dec 30, 2023

MR #711 should both solve the problem of respecting the getconn timeout and throttle the rate of re-checking with an exponential backoff.

Testing is welcome 🙂
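
If you want to check both behaviours quickly, something along these lines should work (the timings, the DSN env var, and the failing check are just an example):

import os
import time

from psycopg_pool import ConnectionPool, PoolTimeout

attempts = []

def bugged_check(conn):
    # Record when each check runs, so the growing gaps between retries
    # (the backoff) are visible in the output.
    attempts.append(time.monotonic())
    raise RuntimeError("check always fails")

pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    check=bugged_check,
    timeout=10,  # shorter than the 30s default, to keep the test quick
)

try:
    pool.getconn()
except PoolTimeout:
    gaps = [round(b - a, 1) for a, b in zip(attempts, attempts[1:])]
    print(f"{len(attempts)} check attempts, gaps between them: {gaps}")
finally:
    pool.close()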

@dabdine
Author

dabdine commented Dec 31, 2023

I have a question for you, since you have put some logic in your check function.

The Pool.check() method internally uses Pool.check_connection(). Given your use case, would it make more sense for it to use the user-provided check, if one is specified?

If I'm understanding it correctly, the pool's built-in check just verifies that the connection is still alive, running a SELECT 1 to test whether it still has a wire connection to the db. If that's the case, I don't think it fits my use case, and I think it makes sense the way it operates right now.

I'm trying to mimic what my Go-based backend server does right now with pgxpool. Specifically, it uses BeforeAcquire and AfterRelease (https://pkg.go.dev/github.com/jackc/pgx/v4/pgxpool#Config) to set and unset RLS parameters before and after the application code gets a connection, respectively.

It seems like the check method works for that use case (along with reset); see the wiring sketched below. The check_connection on the pool seems like a reasonable internal check for a pool to test the liveness of its child connections.
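
For reference, the wiring on my side amounts to passing both callbacks to the pool, roughly like this (reusing the rls_check / rls_reset sketch from my first comment; the names are illustrative):

import os

from psycopg_pool import ConnectionPool

# rls_check / rls_reset as sketched in the opening comment (hypothetical names)
pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=1,
    max_size=10,
    check=rls_check,  # set ROLE / tenant before the connection is handed out
    reset=rls_reset,  # clear that state when the connection returns to the pool
)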

@dabdine
Author

dabdine commented Dec 31, 2023

@dabdine I don't think it is practical to make the protocol between the check function and the caller more complicated, not to mention that backward compatibility has to be taken into account. However, I see your point. I think the best strategy is to introduce an exponential backoff to avoid a busy loop around a failing check function: we already do something similar in the reconnection loop, so really we should have thought of it.

I think this solution -- along with me properly fixing my check function implementation -- is pretty solid!

@dabdine
Author

dabdine commented Dec 31, 2023

MR #711 should both solve the problem of respecting the getconn timeout and throttle the rate of re-checking with an exponential backoff.

Testing is welcome 🙂

Tested locally and works like a charm, great job, and thank you!

@dvarrazzo
Member

Psycopg-pool 3.2.1 has been released with this improvement. Testing is welcome. Cheers!
