Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-42317: Make cloned Butler instances threadsafe #935

Merged
merged 23 commits into from Jan 18, 2024
Merged

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Jan 3, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@dhirving dhirving force-pushed the tickets/DM-42189 branch 2 times, most recently from d22addd to f85bdf0 Compare January 3, 2024 19:47
Base automatically changed from tickets/DM-42189 to main January 3, 2024 20:03
Copy link

codecov bot commented Jan 3, 2024

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (5f0743b) 88.33% compared to head (1fab958) 88.37%.

Files Patch % Lines
python/lsst/daf/butler/_storage_class.py 81.39% 5 Missing and 3 partials ⚠️
...ython/lsst/daf/butler/registry/databases/sqlite.py 73.91% 2 Missing and 4 partials ⚠️
python/lsst/daf/butler/datastore/_datastore.py 80.00% 3 Missing ⚠️
...hon/lsst/daf/butler/datastores/chainedDatastore.py 87.50% 1 Missing and 1 partial ⚠️
python/lsst/daf/butler/datastores/fileDatastore.py 89.47% 1 Missing and 1 partial ⚠️
python/lsst/daf/butler/tests/_dummyRegistry.py 50.00% 2 Missing ⚠️
python/lsst/daf/butler/dimensions/_universe.py 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #935      +/-   ##
==========================================
+ Coverage   88.33%   88.37%   +0.04%     
==========================================
  Files         301      301              
  Lines       38813    38945     +132     
  Branches     8182     8213      +31     
==========================================
+ Hits        34285    34418     +133     
+ Misses       3339     3337       -2     
- Partials     1189     1190       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dhirving dhirving force-pushed the tickets/DM-42317 branch 7 times, most recently from 1d4824a to a99fc00 Compare January 11, 2024 20:25
@dhirving dhirving marked this pull request as ready for review January 12, 2024 17:29
config: DatastoreConfig,
bridgeManager: DatastoreRegistryBridgeManager,
butlerRoot: ResourcePathExpression | None,
) -> Datastore:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this and Datastore.clone are brand new interfaces, let's make the arguments snake_case.

@@ -276,7 +276,7 @@ def isWriteable(self) -> bool:

def copy(self, defaults: RegistryDefaults | None = None) -> SqlRegistry:
"""Create a new `SqlRegistry` backed by the same data repository
and connection as this one, but independent defaults.
as this one, but independent defaults and database connection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're sharing the sqlalchemy.Engine, I'm not sure that's necessarily true; I think we're sharing a connection pool and hence potentially sharing connections (with the thread-safety details of that left to SQLAlchemy), but we think that at any particular time a particular connection can only be held by one Database instance.

I'm asking partly to make sure I understand how copying the Database object actually gives us thread-safety, not just to make this docstring more precise. I think connections not being shared "at any particular time" is sufficient, but I wanted to see if that matches your mental model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what you described matches my understanding. sqlalchemy.Engine guarantees that only one caller has a given "logical" connection checked out at a time, though you're correct that it doesn't necessarily map to a separate network connection.

To be clear, the part that is now independent is that each Database instance will use a different sqlalchemy.engine.Connection object, whereas before they would sometimes be accidentally shared via the Database._session_connection instance variable. This was problematic for a few reasons:

  1. Connection is specifically documented to not be threadsafe
  2. The transaction state would get mixed together between the threads because this is managed via the Connection object
  3. Depending on the timing, one thread could close the Connection object before the other thread was done using it, or replace it with a different connection while another thread was in the middle of using it.

Perhaps a better phrasing would be "independent database session"... I'll tweak the comment.

@@ -120,6 +120,11 @@ def finish(self, registry: SqlRegistry) -> None:
Raised if a non-governor dimension was included in ``**kwargs``
at construction.
"""
# Skip re-initialization if it's already been completed.
# Can't just say 'self._finished' because this class is immutable.
if hasattr(self, "_finished"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to set

_finished: bool = False

at class scope instead (i.e. so it starts out as a ClassVar, and then gets shadowed)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried it but it didn't work unfortunately.

Added tests to make sure that DirectButlers created via DirectButler._clone() actually work.
Make Database cloneable so that we can copy it when creating Butler instances for multiple threads.  This is not yet used in SqlRegistry because the manager objects also need to be cloned, so that everything using Database is pointing at the same object.
Instead of cloning its internal cache of OpaqueTableStorage instances, we now create them on-demand.  A cache of which tables have been created is shared between instances.
To facilitate this, altered the Datastore interface to decouple the constructor from Datastore.fromConfig().  Datastores now have to implement a clone() method and _create_from_config() that can be called by Datastore.fromConfig().
When cloning a DirectButler instance, clone all of its inner objects instead of just copying references to the top-level properties.  This makes it safe to use the cloned objects concurrently from different threads.  This is necessary for the Butler server and other services using DirectButler as a backend, since they have a separate thread for each incoming request.
StorageClassFactory is a singleton, and there are a few places where storage classes are added at runtime.  So add locking to make its accesses to internal data structures threadsafe.

Removed a few unused functions instead of making them threadsafe.
Both DimensionUniverse and DimensionGroup had a bug where instances could become visible in the cache before their construction was complete.
Fix a bug where passing 'None' to SqlRegistry.copy() would throw an exception when RegistryDefaults.finish() was called for the second time on the defaults object.
Previously, if a DatasetType was not found in the cache during a call to DatasetRecordStorageManager.find(), it would return None.  Instead, we now go to the DB and fetch the data.

This change is necessary because the cache is no longer shared when cloning a Butler instance.
 ctrl_mpexec registers DatasetType in one instance of a cloned Butler, but does work with the DatasetType in another instance.

It will also allow the server to use DatasetTypes added by an external process without needing to restart the server.
In DM-42324, the dimension record cache was changed so that it is no longer shared between Butler instances.  This cache is somewhat expensive to fetch and is used for almost every Butler operation.

LabeledButlerFactory now downloads this cache into its 'template' DirectButler instances when they are first created, so it is available for copying into clones.
@dhirving dhirving merged commit acb4c37 into main Jan 18, 2024
18 checks passed
@dhirving dhirving deleted the tickets/DM-42317 branch January 18, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants