New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-42317: Make cloned Butler instances threadsafe #935
Conversation
d22addd
to
f85bdf0
Compare
54bae23
to
ed9dc32
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #935 +/- ##
==========================================
+ Coverage 88.33% 88.37% +0.04%
==========================================
Files 301 301
Lines 38813 38945 +132
Branches 8182 8213 +31
==========================================
+ Hits 34285 34418 +133
+ Misses 3339 3337 -2
- Partials 1189 1190 +1 ☔ View full report in Codecov by Sentry. |
1d4824a
to
a99fc00
Compare
config: DatastoreConfig, | ||
bridgeManager: DatastoreRegistryBridgeManager, | ||
butlerRoot: ResourcePathExpression | None, | ||
) -> Datastore: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this and Datastore.clone
are brand new interfaces, let's make the arguments snake_case
.
@@ -276,7 +276,7 @@ def isWriteable(self) -> bool: | |||
|
|||
def copy(self, defaults: RegistryDefaults | None = None) -> SqlRegistry: | |||
"""Create a new `SqlRegistry` backed by the same data repository | |||
and connection as this one, but independent defaults. | |||
as this one, but independent defaults and database connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're sharing the sqlalchemy.Engine
, I'm not sure that's necessarily true; I think we're sharing a connection pool and hence potentially sharing connections (with the thread-safety details of that left to SQLAlchemy), but we think that at any particular time a particular connection can only be held by one Database
instance.
I'm asking partly to make sure I understand how copying the Database
object actually gives us thread-safety, not just to make this docstring more precise. I think connections not being shared "at any particular time" is sufficient, but I wanted to see if that matches your mental model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what you described matches my understanding. sqlalchemy.Engine
guarantees that only one caller has a given "logical" connection checked out at a time, though you're correct that it doesn't necessarily map to a separate network connection.
To be clear, the part that is now independent is that each Database instance will use a different sqlalchemy.engine.Connection
object, whereas before they would sometimes be accidentally shared via the Database._session_connection
instance variable. This was problematic for a few reasons:
Connection
is specifically documented to not be threadsafe- The transaction state would get mixed together between the threads because this is managed via the
Connection
object - Depending on the timing, one thread could close the
Connection
object before the other thread was done using it, or replace it with a different connection while another thread was in the middle of using it.
Perhaps a better phrasing would be "independent database session"... I'll tweak the comment.
@@ -120,6 +120,11 @@ def finish(self, registry: SqlRegistry) -> None: | |||
Raised if a non-governor dimension was included in ``**kwargs`` | |||
at construction. | |||
""" | |||
# Skip re-initialization if it's already been completed. | |||
# Can't just say 'self._finished' because this class is immutable. | |||
if hasattr(self, "_finished"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it work to set
_finished: bool = False
at class scope instead (i.e. so it starts out as a ClassVar, and then gets shadowed)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried it but it didn't work unfortunately.
919ef6f
to
5c1b302
Compare
Added tests to make sure that DirectButlers created via DirectButler._clone() actually work.
Make Database cloneable so that we can copy it when creating Butler instances for multiple threads. This is not yet used in SqlRegistry because the manager objects also need to be cloned, so that everything using Database is pointing at the same object.
Instead of cloning its internal cache of OpaqueTableStorage instances, we now create them on-demand. A cache of which tables have been created is shared between instances.
To facilitate this, altered the Datastore interface to decouple the constructor from Datastore.fromConfig(). Datastores now have to implement a clone() method and _create_from_config() that can be called by Datastore.fromConfig().
When cloning a DirectButler instance, clone all of its inner objects instead of just copying references to the top-level properties. This makes it safe to use the cloned objects concurrently from different threads. This is necessary for the Butler server and other services using DirectButler as a backend, since they have a separate thread for each incoming request.
StorageClassFactory is a singleton, and there are a few places where storage classes are added at runtime. So add locking to make its accesses to internal data structures threadsafe. Removed a few unused functions instead of making them threadsafe.
Both DimensionUniverse and DimensionGroup had a bug where instances could become visible in the cache before their construction was complete.
Fix a bug where passing 'None' to SqlRegistry.copy() would throw an exception when RegistryDefaults.finish() was called for the second time on the defaults object.
Previously, if a DatasetType was not found in the cache during a call to DatasetRecordStorageManager.find(), it would return None. Instead, we now go to the DB and fetch the data. This change is necessary because the cache is no longer shared when cloning a Butler instance. ctrl_mpexec registers DatasetType in one instance of a cloned Butler, but does work with the DatasetType in another instance. It will also allow the server to use DatasetTypes added by an external process without needing to restart the server.
In DM-42324, the dimension record cache was changed so that it is no longer shared between Butler instances. This cache is somewhat expensive to fetch and is used for almost every Butler operation. LabeledButlerFactory now downloads this cache into its 'template' DirectButler instances when they are first created, so it is available for copying into clones.
5c1b302
to
1fab958
Compare
Checklist
doc/changes