Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-43315: Implement additional collection chain operations #990

Merged
merged 12 commits into from Apr 9, 2024

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Apr 4, 2024

Added additional collection chain methods to the top-level Butler interface: extend_collection_chain, remove_from_collection_chain, and redefine_collection_chain. These methods are all "atomic" functions that can safely be used concurrently from multiple processes.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 98.20359% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 88.97%. Comparing base (20e04d2) to head (3ffbb0e).

Files Patch % Lines
python/lsst/daf/butler/script/collectionChain.py 91.66% 1 Missing and 1 partial ⚠️
...thon/lsst/daf/butler/registry/collections/_base.py 98.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #990      +/-   ##
==========================================
+ Coverage   88.95%   88.97%   +0.02%     
==========================================
  Files         341      341              
  Lines       44042    44123      +81     
  Branches     9069     9078       +9     
==========================================
+ Hits        39177    39260      +83     
+ Misses       3550     3549       -1     
+ Partials     1315     1314       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

) -> None:
"""Add children to the end of a CHAINED collection.

If any of the children already existed in the chain, they will be moved
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving existing children is less obviously useful/correct than it is for prepend. Not sure what would be better though, and this is at least consistent.

# registry.setCollectionChain(
# parent,
# registry.getCollectionChain(parent)
# )
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether me not doing anything to address this is "efficient" or merely "lazy". The most realistic scenario where this could happen is someone trying to use a collection chain as a queue... after around 32,000 prepends this will happen if you only remove from the end of the chain.

I should probably just write a migration script to make that column 32-bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collection chains are very frequently LIFO stacks, but I can't think of any context in which one would be used as a FIFO queue. So I think that migration script can be very low priority.

@dhirving dhirving marked this pull request as ready for review April 5, 2024 21:40
python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
@@ -1770,3 +1803,69 @@ def prepend_collection_chain(
transactions short.
"""
raise NotImplementedError()

@abstractmethod
def extend_collection_chain(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which I prefer, but I'll throw it out there as food for thought: what about merging the prepend_, extend_, redefine_, and remove_from_ methods into a single modify_ method with a mode: Literal["prepend", "extend", "remove", "set"] argument? That's closer to the CLI and it slightly reduces the number of Butler methods.

Copy link
Contributor Author

@dhirving dhirving Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had that thought... but I'm concerned we're going to eventually add options that only apply to some subset of the modes. Then we'll end up in the situation we're at with a lot of the other Butler methods where it takes multiple pages of documentation/checks to explain which sets of parameters are legal together.

One other idea I had was maybe naming them collection_chain_* instead of *_collection_chain, so they group together a little better in the docs. But the rest of the butler methods are verb-first.

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
# registry.setCollectionChain(
# parent,
# registry.getCollectionChain(parent)
# )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collection chains are very frequently LIFO stacks, but I can't think of any context in which one would be used as a FIFO queue. So I think that migration script can be very low priority.


child_records = self.resolve_wildcard(
CollectionWildcard.from_names(child_collection_names), flatten_chains=False
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little wasteful to query for child rows both here and in the sanity check, though I don't know if that's likely to matter.

But more importantly, it seems like we need to move these queries after the row-level locks. As is, couldn't another process modify the chain in between the queries for its contents and the point at which we acquire the lock? Or is it that the atomic operations are precisely those that don't use the results of those queries? If that's the case, I think some comments would help.

Copy link
Contributor Author

@dhirving dhirving Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little wasteful to query for child rows both here and in the sanity check, though I don't know if that's likely to matter.

It's definitely wasteful but I think not to an extent that is worth worrying about... in any case it's no slower than the previous version because all of this was re-used from the existing implementation.

The sanity check one is restricted to only type==CHAINED and is recursive so it's sort of a different query. I think the way these two queries are done will change up some more when we add the recursive query thing to resolve_wildcard... all we need from the second query is the IDs corresponding to the names and it does more work than necessary right now.

As is, couldn't another process modify the chain in between the queries for its contents and the point at which we acquire the lock?

These aren't queries for the chain contents -- they're queries about the user-provided list of new child collections, and do not involve the parent collection.

The lock is only on the parent collection. So moving these inside the lock wouldn't buy us anything and would make the critical section longer.

For the child name -> ID query, the only concurrent change that affects anything is deleting one of the children (or delete and replace with a new collection of the same name). In those cases the foreign key constraint will prevent us from inserting a child that existed at the time we looked up its ID but no longer exists.

The sanity check thing is hairier, and as I mentioned in the comment does not guarantee that cycles cannot exist. To guarantee no cycles we'd have to recursively lock all child chained collections as well. The locking order is tricky to get right and locking multiple collections makes the availability issues with this approach worse. It's isomorphic to the locking for the denormalized, flattened chain table we discussed the other week. (That table would actually be a good way to enforce this... we could have a check constraint that parent != child.) But in any case just locking the parent doesn't help any -- this query wants to know if the parent is a recursive child of any of the children. Locking the parent only prevents its children from changing and we don't care about the parent's children.

I think some comments would help.

Yeah I'll add a bit more info here.

dhirving and others added 12 commits April 9, 2024 11:21
Reshuffle which parts of the logic are executed in setCollectionChain vs update_chain to make update_chain more similar to prepend_collection_chain.
Factor out the common pieces of update_chain and prepend_collection_chain to a context manager method.
Group all of the top-level collection chain methods together
This makes it somewhat more concurrency safe -- at least now it won't nuke any concurrent changes by doing a complete re-write of the collection.

Also added a nicer error message for out-of-bound indexes.
To complete the new collection chain modification interface for Butler, add a function to replace setCollectionChain.
Co-authored-by: Jim Bosch <jbosch@astro.princeton.edu>
@dhirving dhirving merged commit 39b4492 into main Apr 9, 2024
18 checks passed
@dhirving dhirving deleted the tickets/DM-43315 branch April 9, 2024 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants