Skip to content

Conversation

TylerWilley
Copy link
Contributor

gevent.Timeout extends BaseException, so we still sometimes lose the lock.

Perhaps we should instead have this as an empty except?

…tch BaseException here too. Perhaps this should be an empty except to guarentee release.
@ShaneHarvey
Copy link
Member

Thanks for reporting this issue. It reminds me of https://jira.mongodb.org/browse/PYTHON-2334 which you also reported.

It's somewhat strange that gevent extends BaseException here. The Python docs say not to do this:

The built-in exception classes can be subclassed to define new exceptions; programmers are encouraged to derive new exceptions from the Exception class or one of its subclasses, and not from BaseException.

https://docs.python.org/3/library/exceptions.html#built-in-exceptions

I'll need to think through this change a bit and research where else (and why) we catch Exception. I suspect it's so that we don't perform unneeded work for KeyboardInterrupt and SystemExit.

@ShaneHarvey ShaneHarvey self-requested a review May 28, 2021 19:44
pymongo/pool.py Outdated
@@ -1415,7 +1415,7 @@ def _get_socket(self, all_credentials):
self._pending -= 1
self._max_connecting_cond.notify()
sock_info.check_auth(all_credentials)
except Exception:
except BaseException:
Copy link
Member

@ShaneHarvey ShaneHarvey May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really make sense to me. Sure we now catch gevent.Timeout, but won't gevent again raise another timeout when we try to acquire the lock below on line 1422? In that case we're in the same boat and we might as well have not caught the exception to begin with. Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, I was on the old 3.11 branch...

The latest master uses contexts instead for acquiring / releasing, which should prevent the lock from remaining acquired... Let me try reproducing on master, and if it's fixed on master, we can just use an RC build from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(then we can just close this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as followup, it still causes problems on 3.11.4 because the except block isn't executed. BaseException, or a bare except: still fix this, because we don't nest timeouts currently.

Considering a worst case scenario where you have infinite timeouts waiting to fire, you would have a timeout fire at each:

  • close_socket
  • with self.size_cond
  • self.size_cond.notify (not 100%, this may not block and therefore not be an eligible point to fire a timeout)
  • event_listeners.publish_connection_check_out_failed, if the listener executes blocking code that would call the hub's switch() method

Given we only have 1 active at a time, this isn't an issue for us in particular, but would be if someone did use stacked timeouts and they all fired in a worst-case scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it might be necessary to subclass Pool as GeventPool, I see that custom Pool classes can be provided, so that's what I'll try next. Thanks for being responsive though :)

@TylerWilley
Copy link
Contributor Author

TylerWilley commented Jun 1, 2021

Ok, hopefully this makes sense....

Instead I simply sub-classed Pool, as GeventPool. When the exception handler is entered, I start a greenlet to run the exception handling. This makes the exception handling code un-interruptible from gevent's perspective so that even multiple successive Timeouts being raised in application code will not prevent releasing the resources properly.

I don't know if you'd want to integrate this somehow... But since I can put this into our application as-is without updating pymongo using the _pool_class=GeventPool argument to the MongoClient, it's not necessary.

If you don't, we can close this. Thanks!

(although I would be interested if you have any concerns over this method... It has fixed losing our connections in load tests that raise Timeouts, however)

@ShaneHarvey
Copy link
Member

As the name suggests, _pool_class is a private and undocumented argument to MongoClient which we may remove at any time. The main reason it still exists is simply tech debt (to remove it we need to refactor some internal tests). So unfortunately, I don't suggest relying on this feature.

I would still like to improve our compatibly with gevent.Timeout's quirks. Would you be able to point me to the docs for how gevent.Timeout works? Also would you be able to write a test case or other reproducer for the original issue? You might get some ideas from the test I added in PYTHON-2334: c99254f

@TylerWilley
Copy link
Contributor Author

TylerWilley commented Jun 1, 2021

https://www.gevent.org/api/gevent.timeout.html

see the Caution in the Use As A Context Manager section

gevent works by patching the system libraries (socket, threading, sleep, locks etc) that execute I/O or blocking with libuv / libev functionality to run as coroutines. Instead of blocking when you call a socket method, gevent.switch() is called and execution, on the same thread, proceeds on a different coroutine (they call them greenlets). When the blocking is resolved, control is then returned to the coroutine.

As such, gevent.Timeout may be raised on any function that blocks, e.g. acquiring a lock, reading a socket, joining a thread, etc.

(This resolves a lot of the GIL contention with python)

@TylerWilley
Copy link
Contributor Author

I could also see _pool_class being useful for implementing an asyncio pool to run queries from as well, which would be the more built-in way of running coroutines in python.

@ShaneHarvey
Copy link
Member

Thanks for the link. While I do think the thread-cleanup idea is clever, I am not a fan of adding a special pool class just for Gevent support. Looking into this more, in PYTHON-1663 (PyMongo 3.8) the pool was changed from catching all exceptions using a bare-except to catching only Exception:

-    except:
+    except Exception:

I am open to changing it back and adding a comment on why we have to use bare except so the next maintain doesn't accidentally make the same mistake. This would at least deal with the single Timeout issue while we consider how (and if) we want to support multiple timeouts.

@pooooodles pooooodles added the tracked-in-jira Ticket filed in Mongo's Jira system label Jun 4, 2021
@pooooodles
Copy link

@ShaneHarvey ShaneHarvey changed the title Switch from Exception to BaseException PYTHON-2743 Fix compatibility with gevent.Timeout Jun 4, 2021
@TylerWilley
Copy link
Contributor Author

TylerWilley commented Jun 4, 2021

I added a unit test that will break if we add Exception back to the except clause.

In order for it to function, the Timeout has to raise at a specific point, I used mock to do this, I don't know if you have a more standard "pymongo" way of doing that... It was previously raising at the condition check prior to entering the try, thus not incrementing / decrementing active_count at all.

This requires reaching down the call tree though to patch at runtime because an actual connection is in use.

I could also test this by instantiating a Pool object without connecting to mongo and executing _get_sockets directly, which would require much less invasive mocking, if you'd prefer. That would be less "real world" but more controlled.

ShaneHarvey pushed a commit that referenced this pull request Jun 15, 2021
gevent.Timeout extends BaseException, not Exception.
ShaneHarvey pushed a commit that referenced this pull request Jun 15, 2021
gevent.Timeout extends BaseException, not Exception.

(cherry picked from commit 9c1ff6a)
@ShaneHarvey
Copy link
Member

Thanks @TylerWilley! I've merged this fix in 9c1ff6a and a2d687b.

@TylerWilley
Copy link
Contributor Author

Thank you @ShaneHarvey ! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tracked-in-jira Ticket filed in Mongo's Jira system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants