Broken connections in pool become busy indefinitely #221
Normally what happens is:
Solely looking at the source code, I think the error is raised here: python-oracledb/src/oracledb/impl/thin/packet.pyx, lines 204 to 206 (commit e861b2b).
Just before that, the socket is closed, but I don't believe that is enough to close the connection (and return it to the pool / remove it from the busy list).
So, just to be clear: you are saying that you create the pool and populate it with connections successfully, then some network glitch occurs, the ping fails, and the connection is never released back to the pool? What is the full callstack of the exception? Can you replicate this easily? If so, does it fail without the use of asyncio? Without a call timeout? If you can supply the full callstack, I might be able to simulate the error by adjusting the code to force an exception in that location even without the network glitch -- although the glitch itself might be important. But let's start with the full callstack. Thanks!
Can you try the new asyncio support that was just announced in #258? If not, can you provide the callstack as requested earlier? Thanks!
We don't have time to investigate and have been using thick mode. Glad to hear asyncio support is around the corner - I may give it a shot at some point and report any issues I encounter.
Hi @anthony-tuininga and @cjbj, I have started using thin mode + asyncio and the issue persists. This is the relevant part of the call stack:
What appears to happen is that the ping fails during acquire and the connection is never removed from the busy list.
Initially I thought I could just catch the error and close the dead connection, which should remove it from the busy list; however, this is the pattern I use:

```python
pool = oracledb.create_pool_async(...)
conn = pool.acquire()
try:
    conn = await conn
    # do stuff with conn
finally:
    await pool.release(conn)
```

Shouldn't the pool be cleaning up this dead connection if it's unable to ping it? And return another, working connection?
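In the meantime, a defensive acquire wrapper can work around the dead-connection case at the application level. This is a minimal sketch, not the library's prescribed usage: it assumes AsyncConnectionPool.drop() discards a broken connection the way its synchronous counterpart does, and the retry count of 3 is arbitrary:

```python
import oracledb

async def acquire_healthy(pool, attempts=3):
    """Acquire a pooled connection and verify it is alive before returning it."""
    last_exc = None
    for _ in range(attempts):
        conn = await pool.acquire()
        try:
            await conn.ping()          # cheap round trip to detect a dead socket
            return conn
        except oracledb.Error as exc:  # ping failed: connection is unusable
            last_exc = exc
            await pool.drop(conn)      # discard it so the pool creates a fresh one
    raise last_exc
```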
https://python-oracledb.readthedocs.io/en/latest/api_manual/module.html
Further reproduction steps: it has something to do with how connections are recycled. Pool settings: min=2, max=12.
This was over a period of 10 minutes or so. The busy connections just stay there, and once the network issue reoccurs, it happens again, eventually saturating the pool with busy connections that never become free.
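For reference, a rough, runnable sketch of the reproduction setup described above; the DSN, credentials, and query are all placeholders, since the comments only state min=2, max=12, a call every 10 seconds, and a window of roughly 10 minutes:

```python
import asyncio
import oracledb

async def main():
    pool = oracledb.create_pool_async(
        user="app_user", password="app_pw",  # placeholder credentials
        dsn="dbhost/service_name",           # placeholder DSN
        min=2, max=12,
    )
    for _ in range(60):                      # ~10 minutes at one call per 10s
        async with pool.acquire() as conn:
            await conn.execute("select 1 from dual")  # placeholder query
        await asyncio.sleep(10)
    await pool.close()

asyncio.run(main())
```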
Looking at the code and your description, I think this may be the source of the issue. Are you able to build from source and verify this for me?

```diff
--- a/src/oracledb/impl/thin/pool.pyx
+++ b/src/oracledb/impl/thin/pool.pyx
@@ -256,7 +256,6 @@ cdef class BaseThinPoolImpl(BasePoolImpl):
         Called when a new connection is created on acquire with the lock held.
         """
         if orig_conn_impl is not None:
-            self._busy_conn_impls.remove(orig_conn_impl)
             self._busy_conn_impls.append(new_conn_impl)
         else:
             new_conn_impl._is_pool_extra = True
@@ -621,7 +620,6 @@ cdef class ThinPoolImpl(BaseThinPoolImpl):
                     temp_conn_impl = None
                     break
                 temp_conn_impl = <ThinConnImpl> result
-                self._busy_conn_impls.append(temp_conn_impl)
                 if must_reconnect:
                     break
@@ -630,9 +628,11 @@
             if requires_ping:
                 try:
                     temp_conn_impl.ping()
-                except exceptions.DatabaseError:
+                except exceptions.Error:
                     temp_conn_impl._force_close()
             if temp_conn_impl._protocol._transport is not None:
+                with self._condition:
+                    self._busy_conn_impls.append(temp_conn_impl)
                 return temp_conn_impl
         # a new connection needs to be created
@@ -710,7 +710,6 @@ cdef class AsyncThinPoolImpl(BaseThinPoolImpl):
                     temp_conn_impl = None
                     break
                 temp_conn_impl = <AsyncThinConnImpl> result
-                self._busy_conn_impls.append(temp_conn_impl)
                 if must_reconnect:
                     break
@@ -719,9 +718,11 @@
             if requires_ping:
                 try:
                     await temp_conn_impl.ping()
-                except exceptions.DatabaseError:
+                except exceptions.Error:
                     temp_conn_impl._force_close()
             if temp_conn_impl._protocol._transport is not None:
+                async with self._condition:
+                    self._busy_conn_impls.append(temp_conn_impl)
                 return temp_conn_impl
         # a new connection needs to be created
```

Essentially, if the ping fails AND the creation of the new connection fails, then the connection may remain in the busy list permanently. This patch only adds the connection to the busy list after it is known to be good. I don't have a good way to test this directly, but hopefully you are able to use your existing test case to test this for me!
It appears to fix this issue. I could reproduce the issue on main, but not on main + patch. 👍 However, there is a separate issue which, looking back at my earlier comment, was also present then (step 4).
I suspect it's the ping that hangs indefinitely, because the connection should already be open, and POOL_GETMODE_TIMEDWAIT is set, which should prevent a connection attempt from hanging indefinitely. I'll try to get a stack trace for this.
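For context, POOL_GETMODE_TIMEDWAIT is configured roughly as below; the connection details are placeholders, and wait_timeout is in milliseconds. Note that this mode bounds the wait for a free pooled connection, not a ping already in flight on an acquired one, which is consistent with the hang described above:

```python
import oracledb

pool = oracledb.create_pool_async(
    user="app_user", password="app_pw",  # placeholder credentials
    dsn="dbhost/service_name",           # placeholder DSN
    min=2, max=12,
    getmode=oracledb.POOL_GETMODE_TIMEDWAIT,
    wait_timeout=5000,  # give up after 5s if no free connection is available
)
```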
Traceback for the other issue. Codebase = main + patch, Python 3.12.
Do you think you'll get the first bug fix into the upcoming release? |
Yes, this patch will be included in the upcoming 2.2 release. I plan to close this issue after that release has been completed. Thanks for your help in verifying the fix! Regarding the second issue, the "hang" should not be indefinite, as the underlying network will eventually give up and raise an exception. That could, however, take a rather long time! A separate enhancement is needed here -- the addition of …
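Until such an enhancement lands, one application-level workaround is to bound the whole acquire, including the internal ping, with an asyncio deadline. A minimal sketch, assuming Python 3.11+ for asyncio.timeout (the traceback above is from Python 3.12) and an arbitrary 10-second budget:

```python
import asyncio

async def acquire_with_deadline(pool, seconds=10.0):
    # asyncio.timeout cancels the acquire (and any ping running inside it)
    # once the deadline passes, raising TimeoutError to the caller
    async with asyncio.timeout(seconds):
        return await pool.acquire()
```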
Thank you!
This was included in version 2.2.0 which was just released. |
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.16.0.0.0
Is it an error or a hang or a crash?
Hang (acquiring a connection is not possible if all connections are busy)
What error(s) or behavior are you seeing?
Some network interruptions cause pool connection(s) to remain in the busy list indefinitely.
Unfortunately I was unable to simulate the network interruption that reproduces the issue.
No
The following call is made every 10s: