ConnectionPool#check method failures will cause an infinite loop #709
Comments
Hm...
There is a bug indeed, in that chain (psycopg_pool/psycopg_pool/pool.py, lines 229 to 313 at aa4d15b).
This change seems enough to solve it:

```diff
diff --git a/psycopg_pool/psycopg_pool/pool.py b/psycopg_pool/psycopg_pool/pool.py
index 288473fc..0ab3c5d9 100644
--- a/psycopg_pool/psycopg_pool/pool.py
+++ b/psycopg_pool/psycopg_pool/pool.py
@@ -266,8 +266,11 @@ class ConnectionPool(Generic[CT], BasePool):
     def _getconn_unchecked(self, timeout: float) -> CT:
         # Critical section: decide here if there's a connection ready
         # or if the client needs to wait.
+        if timeout <= 0.0:
+            raise PoolTimeout()
+
         with self._lock:
-            conn = self._get_ready_connection(timeout)
+            conn = self._get_ready_connection()
             if not conn:
                 # No connection available: put the client in the waiting queue
                 t0 = monotonic()
@@ -296,7 +299,7 @@ class ConnectionPool(Generic[CT], BasePool):
             conn._pool = self
             return conn

-    def _get_ready_connection(self, timeout: Optional[float]) -> Optional[CT]:
+    def _get_ready_connection(self) -> Optional[CT]:
         """Return a connection, if the client deserves one."""
         conn: Optional[CT] = None
         if self._pool:
```

With this changeset your test should fail after exactly 30 seconds. I will add a test to verify that the issue is fixed and make a new release of the pool package. If you can test it on your side too, that would be welcome.
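A minimal sketch of what the fixed behaviour looks like from the caller's side (the connection string and check function here are invented for illustration, not taken from the thread):

```python
# Illustrative only: a check that always fails should now make the pool
# give up with PoolTimeout once the timeout budget is spent, instead of
# looping forever. "dbname=test" is a placeholder connection string.
from psycopg_pool import ConnectionPool, PoolTimeout

def always_failing_check(conn):
    raise RuntimeError("simulated unrecoverable check failure")

pool = ConnectionPool("dbname=test", check=always_failing_check)

try:
    with pool.connection(timeout=30.0):
        pass
except PoolTimeout:
    print("gave up after ~30s instead of hanging")
```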
The proper fix is a bit different, to take the NullPool into account, but tests now pass for me. I will merge this changeset and release psycopg_pool 3.2.1 in the next few days.
That's great @dvarrazzo, thank you! Is it correct to assume, though, that with this test case the loop will eat CPU for 30s until it is finally killed by the timeout? I'm not sure there's a decent way around that, short of defining check to return a bool instead and bailing out completely, without retry, if check ever returns False.

The main issue would be some unrecoverable condition within the check method that will always cause it to fail: every attempt at a connection grab would then time out at 30s, while check loops "infinitely" within that timeframe (potentially bombarding the db if it's actually executing a query that never succeeds). I.e., is there a way to communicate to the application code that a connection grab would never be successful?
@dabdine I don't think it is practical to make the protocol between the check function and the caller more complicated, not to mention that backward compatibility should be taken into account. However, I see your point. I think the best strategy would be to introduce an exponential backoff to avoid a busy loop around a failing check function: we do something similar with the reconnect loop, so actually we should have thought about it.
This is a behaviour we might like more, isn't it?
Cleaning this up and testing it more thoroughly, we will probably include this improvement too in the next bugfix release.
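A rough standalone sketch of the backoff + jitter idea being described (not the pool's actual internals; the function name and defaults are invented for illustration):

```python
# Retry a failing check with jittered exponential backoff, while still
# honouring the caller's overall deadline. Invented names and defaults;
# this is a sketch of the idea, not psycopg_pool's implementation.
import random
import time

def run_check_with_backoff(check, conn, deadline, base=0.1, factor=2.0, cap=2.0):
    wait = base
    while True:
        try:
            check(conn)
            return True  # connection passed the check
        except Exception:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # budget exhausted: caller can raise PoolTimeout
            # Sleep for a jittered, capped, exponentially growing interval,
            # never past the deadline.
            pause = min(wait * random.uniform(0.75, 1.25), cap, remaining)
            time.sleep(pause)
            wait *= factor
```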
I have a question for you, as I understand you have put some logic in your check function. The …
This is a consistent behaviour with the exponential backoff used in reconnection attempts, and actually we could reuse the same object implementing the backoff + jitter. See also #709
MR #711 should both solve the problem of respecting the getconn timeout and throttle the re-checks with an exponential backoff. Testing is welcome 🙂
If I'm understanding it correctly, the pool is just validating that the connections are still valid by running a `SELECT 1` to test whether it still has a wire connection to the db. If that's the case, I don't think that fits my use case, and I think it makes sense the way it operates right now. I'm trying to mimic what my go-based backend server does right now using …

It seems like the `check` method works for that use case (along with `reset`). The `check_connection` on the pool seems like a reasonable internal check for a pool to test its child connections' liveness with.
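For context, a hedged sketch of the check/reset pattern described in this thread, assuming a hypothetical role name, GUC name, and connection string:

```python
# Hypothetical sketch of the use case above: push per-request tenant state
# into the session via the pool's check hook, and clear it via reset.
# Role name, GUC name and conninfo are invented for illustration.
from contextvars import ContextVar

from psycopg_pool import ConnectionPool

current_tenant: ContextVar[str] = ContextVar("current_tenant")

def set_rls_context(conn):
    # Runs right before the connection is handed to application code,
    # so the request-scoped ContextVar is visible here.
    tenant = current_tenant.get()  # raises LookupError outside a request
    conn.execute("SET ROLE tenant_role")
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant,))

def clear_rls_context(conn):
    # Runs when the connection is returned to the pool.
    conn.execute("RESET ROLE")
    conn.execute("RESET app.tenant_id")

pool = ConnectionPool("dbname=test", check=set_rls_context, reset=clear_rls_context)
```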
I think this solution -- along with me properly fixing my check function implementation -- is pretty solid!
Tested locally and works like a charm, great job, and thank you!
Psycopg-pool 3.2.1 released with this improvement. Testing is welcome. Cheers!
In my use case, I'm using the `check` method to set some parameters for Postgres RLS (specifically setting the ROLE and the tenant parameters on the connection) before the connection is handed to application code (similarly, using `reset` to erase/remove state when it is returned to the pool).

I noticed that if a connection's `check` method fails, psycopg3 will just loop infinitely. In my case, there was an error I made dealing with query parameters (irrelevant, but that was the underlying cause in my case). I think either a sensible default or a maximum retry count would be helpful for debugging purposes (otherwise, I was left wondering why my application was hanging).

I did originally try to use `configure`, but I need information that comes from a `ContextVar` in the context of a FastAPI request thread (a tenant identifier), so that wouldn't work, since `configure` is run at moments when that information is not available.

Minimal example:
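(The original snippet is not preserved here; a reconstruction consistent with the description above might look like the following, with a placeholder connection string and a deliberately buggy check.)

```python
# Not the reporter's original script: a reconstruction consistent with the
# description above. The check makes a parameter mistake and always raises,
# which (before the fix) made pool.connection() spin forever.
from psycopg_pool import ConnectionPool

def broken_check(conn):
    # Bug: placeholder present but no parameters passed -> always raises.
    conn.execute("SELECT set_config('app.tenant_id', %s, false)")

pool = ConnectionPool("dbname=test", check=broken_check)

with pool.connection() as conn:  # hung in a busy loop before psycopg_pool 3.2.1
    conn.execute("SELECT 1")
```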
execution:

```
poetry run python test.py > foo  0.55s user 0.20s system 27% cpu 2.711 total
```

output: