DM-42317: Make cloned Butler instances threadsafe #935

dhirving · 2024-01-03T00:21:36Z

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

codecov · 2024-01-03T22:15:10Z

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (5f0743b) 88.33% compared to head (1fab958) 88.37%.

Files	Patch %	Lines
python/lsst/daf/butler/_storage_class.py	81.39%	5 Missing and 3 partials ⚠️
...ython/lsst/daf/butler/registry/databases/sqlite.py	73.91%	2 Missing and 4 partials ⚠️
python/lsst/daf/butler/datastore/_datastore.py	80.00%	3 Missing ⚠️
...hon/lsst/daf/butler/datastores/chainedDatastore.py	87.50%	1 Missing and 1 partial ⚠️
python/lsst/daf/butler/datastores/fileDatastore.py	89.47%	1 Missing and 1 partial ⚠️
python/lsst/daf/butler/tests/_dummyRegistry.py	50.00%	2 Missing ⚠️
python/lsst/daf/butler/dimensions/_universe.py	87.50%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #935      +/-   ##
==========================================
+ Coverage   88.33%   88.37%   +0.04%     
==========================================
  Files         301      301              
  Lines       38813    38945     +132     
  Branches     8182     8213      +31     
==========================================
+ Hits        34285    34418     +133     
+ Misses       3339     3337       -2     
- Partials     1189     1190       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

TallJimbo · 2024-01-17T12:57:20Z

python/lsst/daf/butler/datastore/_datastore.py

+        config: DatastoreConfig,
+        bridgeManager: DatastoreRegistryBridgeManager,
+        butlerRoot: ResourcePathExpression | None,
+    ) -> Datastore:


Since this and Datastore.clone are brand new interfaces, let's make the arguments snake_case.

TallJimbo · 2024-01-17T13:03:41Z

python/lsst/daf/butler/registry/sql_registry.py

@@ -276,7 +276,7 @@ def isWriteable(self) -> bool:

    def copy(self, defaults: RegistryDefaults | None = None) -> SqlRegistry:
        """Create a new `SqlRegistry` backed by the same data repository
-        and connection as this one, but independent defaults.
+        as this one, but independent defaults and database connection.


Since we're sharing the sqlalchemy.Engine, I'm not sure that's necessarily true; I think we're sharing a connection pool and hence potentially sharing connections (with the thread-safety details of that left to SQLAlchemy), but we think that at any particular time a particular connection can only be held by one Database instance.

I'm asking partly to make sure I understand how copying the Database object actually gives us thread-safety, not just to make this docstring more precise. I think connections not being shared "at any particular time" is sufficient, but I wanted to see if that matches your mental model.

Yeah, what you described matches my understanding. sqlalchemy.Engine guarantees that only one caller has a given "logical" connection checked out at a time, though you're correct that it doesn't necessarily map to a separate network connection.

To be clear, the part that is now independent is that each Database instance will use a different sqlalchemy.engine.Connection object, whereas before they would sometimes be accidentally shared via the Database._session_connection instance variable. This was problematic for a few reasons:

Connection is specifically documented to not be threadsafe

The transaction state would get mixed together between the threads because this is managed via the Connection object

Depending on the timing, one thread could close the Connection object before the other thread was done using it, or replace it with a different connection while another thread was in the middle of using it.

Perhaps a better phrasing would be "independent database session"... I'll tweak the comment.

TallJimbo · 2024-01-17T13:48:13Z

python/lsst/daf/butler/registry/_defaults.py

@@ -120,6 +120,11 @@ def finish(self, registry: SqlRegistry) -> None:
            Raised if a non-governor dimension was included in ``**kwargs``
            at construction.
        """
+        # Skip re-initialization if it's already been completed.
+        # Can't just say 'self._finished' because this class is immutable.
+        if hasattr(self, "_finished"):


Would it work to set

_finished: bool = False

at class scope instead (i.e. so it starts out as a ClassVar, and then gets shadowed)?

Tried it but it didn't work unfortunately.

Added tests to make sure that DirectButlers created via DirectButler._clone() actually work.

Make Database cloneable so that we can copy it when creating Butler instances for multiple threads. This is not yet used in SqlRegistry because the manager objects also need to be cloned, so that everything using Database is pointing at the same object.

Instead of cloning its internal cache of OpaqueTableStorage instances, we now create them on-demand. A cache of which tables have been created is shared between instances.

To facilitate this, altered the Datastore interface to decouple the constructor from Datastore.fromConfig(). Datastores now have to implement a clone() method and _create_from_config() that can be called by Datastore.fromConfig().

When cloning a DirectButler instance, clone all of its inner objects instead of just copying references to the top-level properties. This makes it safe to use the cloned objects concurrently from different threads. This is necessary for the Butler server and other services using DirectButler as a backend, since they have a separate thread for each incoming request.

StorageClassFactory is a singleton, and there are a few places where storage classes are added at runtime. So add locking to make its accesses to internal data structures threadsafe. Removed a few unused functions instead of making them threadsafe.

Both DimensionUniverse and DimensionGroup had a bug where instances could become visible in the cache before their construction was complete.

Fix a bug where passing 'None' to SqlRegistry.copy() would throw an exception when RegistryDefaults.finish() was called for the second time on the defaults object.

Previously, if a DatasetType was not found in the cache during a call to DatasetRecordStorageManager.find(), it would return None. Instead, we now go to the DB and fetch the data. This change is necessary because the cache is no longer shared when cloning a Butler instance. ctrl_mpexec registers DatasetType in one instance of a cloned Butler, but does work with the DatasetType in another instance. It will also allow the server to use DatasetTypes added by an external process without needing to restart the server.

In DM-42324, the dimension record cache was changed so that it is no longer shared between Butler instances. This cache is somewhat expensive to fetch and is used for almost every Butler operation. LabeledButlerFactory now downloads this cache into its 'template' DirectButler instances when they are first created, so it is available for copying into clones.

dhirving force-pushed the tickets/DM-42189 branch 2 times, most recently from d22addd to f85bdf0 Compare January 3, 2024 19:47

Base automatically changed from tickets/DM-42189 to main January 3, 2024 20:03

dhirving force-pushed the tickets/DM-42317 branch from 54bae23 to ed9dc32 Compare January 3, 2024 22:04

dhirving force-pushed the tickets/DM-42317 branch 7 times, most recently from 1d4824a to a99fc00 Compare January 11, 2024 20:25

dhirving marked this pull request as ready for review January 12, 2024 17:29

TallJimbo approved these changes Jan 17, 2024

View reviewed changes

dhirving force-pushed the tickets/DM-42317 branch from 919ef6f to 5c1b302 Compare January 17, 2024 22:04

dhirving added 15 commits January 17, 2024 15:14

Add test for cloned Butler

3bee36b

Added tests to make sure that DirectButlers created via DirectButler._clone() actually work.

Make Database cloneable

9db6d29

Make Database cloneable so that we can copy it when creating Butler instances for multiple threads. This is not yet used in SqlRegistry because the manager objects also need to be cloned, so that everything using Database is pointing at the same object.

Make ButlerAttributeManager cloneable

949516f

Remove unused CollectionManager dimensions param

0f93161

Make CollectionManager cloneable

82fb32c

Make OpaqueTableStorageManager cloneable

09e882c

Instead of cloning its internal cache of OpaqueTableStorage instances, we now create them on-demand. A cache of which tables have been created is shared between instances.

Make DatasetRecordStorageManager cloneable

e52f1b6

Make ObsCoreTableManager cloneable

ccfd85b

Make RegistryBridgeManager cloneable

7955287

Make DimensionRecordStorageManager cloneable

bb5b92a

Make RegistryManagerInstances cloneable

92696f2

Make Datastore cloneable

058fa87

To facilitate this, altered the Datastore interface to decouple the constructor from Datastore.fromConfig(). Datastores now have to implement a clone() method and _create_from_config() that can be called by Datastore.fromConfig().

Make StorageClassFactory threadsafe

d6c50d9

StorageClassFactory is a singleton, and there are a few places where storage classes are added at runtime. So add locking to make its accesses to internal data structures threadsafe. Removed a few unused functions instead of making them threadsafe.

Fix DimensionUniverse threadsafety issue

0c32045

Both DimensionUniverse and DimensionGroup had a bug where instances could become visible in the cache before their construction was complete.

dhirving added 8 commits January 17, 2024 15:14

Document thread-safety guarantees

e0435b9

Fix SqlRegistry.copy(None)

7e49400

Fix a bug where passing 'None' to SqlRegistry.copy() would throw an exception when RegistryDefaults.finish() was called for the second time on the defaults object.

Add test for ObsCoreTableManager cloning

c0d4910

Add test for NameKeyCollectionManager cloning

6d73197

Add towncrier

d9828db

Clarify how DB state is shared between Butlers

1fab958

dhirving force-pushed the tickets/DM-42317 branch from 5c1b302 to 1fab958 Compare January 17, 2024 22:14

dhirving merged commit acb4c37 into main Jan 18, 2024
18 checks passed

dhirving deleted the tickets/DM-42317 branch January 18, 2024 16:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-42317: Make cloned Butler instances threadsafe #935

DM-42317: Make cloned Butler instances threadsafe #935

dhirving commented Jan 3, 2024 •

edited

codecov bot commented Jan 3, 2024 •

edited

TallJimbo Jan 17, 2024

TallJimbo Jan 17, 2024

dhirving Jan 17, 2024

TallJimbo Jan 17, 2024

dhirving Jan 17, 2024

DM-42317: Make cloned Butler instances threadsafe #935

DM-42317: Make cloned Butler instances threadsafe #935

Conversation

dhirving commented Jan 3, 2024 • edited

Checklist

codecov bot commented Jan 3, 2024 • edited

Codecov Report

TallJimbo Jan 17, 2024

Choose a reason for hiding this comment

TallJimbo Jan 17, 2024

Choose a reason for hiding this comment

dhirving Jan 17, 2024

Choose a reason for hiding this comment

TallJimbo Jan 17, 2024

Choose a reason for hiding this comment

dhirving Jan 17, 2024

Choose a reason for hiding this comment

dhirving commented Jan 3, 2024 •

edited

codecov bot commented Jan 3, 2024 •

edited