
DM-42324: refactor dimension record storage classes #936

Merged
merged 8 commits into main from tickets/DM-42324 on Jan 8, 2024

Conversation

TallJimbo
Member

@TallJimbo TallJimbo commented Jan 3, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes (not user-visible)


codecov bot commented Jan 3, 2024

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (d0dcdfc) 88.36% compared to head (0cb8b8e) 88.37%.

Files Patch % Lines
...thon/lsst/daf/butler/registry/dimensions/static.py 94.58% 7 Missing and 6 partials ⚠️
python/lsst/daf/butler/dimensions/record_cache.py 86.95% 4 Missing and 2 partials ⚠️
...ython/lsst/daf/butler/registry/obscore/_manager.py 33.33% 2 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_elements.py 85.71% 2 Missing ⚠️
python/lsst/daf/butler/tests/_testRepo.py 50.00% 0 Missing and 1 partial ⚠️
python/lsst/daf/butler/transfers/_context.py 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #936      +/-   ##
==========================================
+ Coverage   88.36%   88.37%   +0.01%     
==========================================
  Files         306      301       -5     
  Lines       39167    38850     -317     
  Branches     8233     8185      -48     
==========================================
- Hits        34608    34333     -275     
+ Misses       3371     3343      -28     
+ Partials     1188     1174      -14     


@TallJimbo TallJimbo force-pushed the tickets/DM-42324 branch 2 times, most recently from f29dd40 to 5cf03f5 Compare January 3, 2024 20:05
Notes
-----
The nested `DimensionRecordSet` objects should never be modified in place
except when returned by the `modifying` context manager.
Member Author

I was considering adding a non-mutable ABC for DimensionRecordSet that could be used as the value type of this Mapping, to enforce this requirement with static typing (probably on DM-42344 instead of this ticket). The internal dict would still use the mutable full type, as would the modifying method. I'm open to opinions on whether that's a useful addition or an excessive one.

Contributor

I like that kind of thing in general as well as the pattern like the modifying method here. It's definitely helpful knowing that a thing can only be mutated in a small number of easily-identified places.

Doesn't seem like the DimensionRecordSet that comes out of the cache travels very far though, so I'd be inclined to say not totally necessary. (It also doesn't protect you from the other direction, where the cache is modified intentionally while there is a dangling reference to one of the inner DimensionRecordSet objects hanging out somewhere.)

Better might be to avoid exposing DimensionRecordCache's DimensionRecordSet objects at all. It looks like the only operation that is ever called on them is find. So you could add find to DimensionRecordCache and pass DimensionRecordCache around, instead of pulling out and passing around its inner DimensionRecordSet objects. This would have the advantage that you could right-click -> find references to locate everywhere that is using data from this cache. It also makes visible in record_cache.py all allowed access patterns.
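A minimal sketch of that suggestion, with every class and method name here being a stand-in rather than the real daf_butler interface:

```python
class DimensionRecordCache:
    """Hypothetical cache that exposes `find` itself instead of handing
    out its inner record-set objects."""

    def __init__(self) -> None:
        # element name -> {data ID key -> record}
        self._records: dict = {}

    def insert(self, element: str, key: str, record: object) -> None:
        self._records.setdefault(element, {})[key] = record

    def find(self, element: str, key: str) -> object:
        # The only lookup operation callers appear to need; keeping it
        # here makes all allowed access patterns visible in one class.
        return self._records[element][key]
```

Callers would then pass the `DimensionRecordCache` around and call `cache.find(...)`, so a find-references search on `find` locates every consumer of the cached data.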

Member Author

👍 to just exposing find; I'll do that on DM-42344.

It also occurs to me that being able to modify the cache at all (rather than just reset it) may be unnecessary complexity, making this question a moot point.

Contributor

@dhirving dhirving left a comment

Looks good to me. A couple of nitpicks that you can take or leave. My preference would be for this to merge down with zero or minimal changes, so I can pick it up on my branch.

Most of the comments I left are more notes for myself than requested changes from you -- though I'd appreciate any thoughts or additional context you want to share related to them.

Comment on lines +214 to +217
# Database transaction succeeded; update the cache to keep them
# consistent.
if cache_records is not None:
cache_records.update(records, replace=not skip_existing)
Contributor

(not for this ticket -- matches existing behavior) There's a potential consistency issue here, because we may be in an outer transaction that could still be rolled back. So we can end up with data in the cache that isn't actually in the DB.

Member Author

Good catch. I am inclined to just remove all of the logic about updating the cache on insert or sync in favor of just resetting whenever those are called. I'll do that on DM-42344, after taking a quick look at the higher-level code that inserts these records to make sure it doesn't do some kind of insert-query-insert-query interleaving that would be slowed down a lot by that.

Contributor

Yeah I'm in favor of removing that complexity, would definitely simplify reasoning about this. The refetch is still slightly tricky in theory because of cross-transaction visibility, but that's not a huge issue if the cache isn't shared between Butler instances.

Comment on lines +277 to +278
except LookupError:
return None
Contributor

(not for this ticket, matches existing behavior) If we made this pass instead of return None it would allow us to read through on a cache miss. So it would make this work for the case where someone did an insert into the DB in the background. Because the cache is used directly with this same pattern a few other places it wouldn't totally solve the issue though, and there's nothing that can be done if someone does a sync in the background instead of an insert.
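The read-through idea can be sketched like this, with `fetch_from_db` as a hypothetical stand-in for the real database query and a plain dict standing in for the cache:

```python
def find_record(cache: dict, key: str, fetch_from_db):
    """Look up a record, falling through to the database on a miss
    instead of returning None immediately."""
    try:
        return cache[key]
    except LookupError:
        pass  # cache miss: read through instead of returning None
    record = fetch_from_db(key)
    if record is not None:
        # Update our (non-shared) copy of the cache so the next lookup
        # hits; background inserts are now visible after one miss.
        cache[key] = record
    return record
```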

Comment on lines +626 to +631
) -> tuple[
list[dict[str, Any]],
list[dict[str, Any]],
list[dict[str, Any]],
dict[str, list[dict[str, Any]]],
]:
Contributor

nitpick: declaring a NamedTuple or dataclass to use for this return type might improve clarity and remove some duplicated tuple unpacking elsewhere.
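For example, a NamedTuple keeps the existing tuple behavior (positional unpacking still works) while giving call sites attribute names. The field names below are guesses at what the four collections hold, not the real ones:

```python
from typing import Any, NamedTuple


class SplitRecordRows(NamedTuple):
    """Named replacement for the anonymous four-element return tuple;
    field names are illustrative."""

    main_rows: list[dict[str, Any]]
    tag_rows: list[dict[str, Any]]
    skypix_rows: list[dict[str, Any]]
    summary_rows: dict[str, list[dict[str, Any]]]
```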

Member Author

Yeah, given that I've already forgotten which items in that tuple are which just a few days later, this is worth doing. I'll do it on DM-42344.

Comment on lines -292 to +305
return type(self)(self._db, defaults, self._managers)
result = SqlRegistry(self._db, defaults, self._managers)
result.dimension_record_cache.load_from(self.dimension_record_cache)
return result
Contributor

This introduces a change in behavior. If you do:

butler1 = Butler()
butler2 = Butler(butler=butler1)
butler2.registry.insertDimensionData(aCacheableDimension, ...)
# do something with butler1

Previously it would have updated the cache in both butlers. But after this PR, butler1 has a stale cache, as do any subsequent butlers copied from it. So they may throw errors if they try to do things with the missing dimension data, since cache misses are not handled gracefully.

This may be mostly a theoretical issue given that the cacheable dimensions are "mostly immutable."

Member Author

I hadn't actually realized the CachingDimensionRecordStorage instance was being shared between the Butler instances before; mostly I hadn't thought about what this sort of Butler construction does at all since until very recently it's mostly been for unit testing.

But I do think this is consistent with your plan for having different Butler instances each get their own cache when constructed by the factory, right?

Contributor

Yeah I think whatever happens to copy or share the cache happens right here, so this is fine structurally. As long as nothing is depending on that behavior currently (which it probably isn't) I think it's fine to leave this as-is.

else:
# MyPy says this isn't a real collections.abc.Mapping, but it
# sure behaves like one.
mapping = row._mapping # type: ignore
Contributor

I think it's actually complaining because mapping is declared as a dict in the other branch of the if and this is a different type. If you predeclare mapping: typing.Mapping above the if element.temporal block it seems to satisfy mypy without the ignore. (Or possibly this was fixed in the very latest version of sqlalchemy that updated a bunch of types.)

Member Author

Ah, the issue is the key type; I had tried that with Mapping[str, Any], and that still gets rejected, but just Mapping does work. I guess it's probably something like Mapping[str | sqlalchemy.<something>, Any], but I'm not sure what the <something> is; in any case this will do.
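The predeclaration fix described above looks roughly like this; `extract_mapping`, `temporal`, and the fake row are stand-ins for the surrounding real code:

```python
from collections.abc import Mapping
from typing import Any


def extract_mapping(temporal: bool, row: Any) -> Mapping:
    # Bare ``Mapping`` satisfies mypy here; ``Mapping[str, Any]`` is
    # still rejected because SQLAlchemy's RowMapping key type is wider
    # than ``str``.
    mapping: Mapping
    if temporal:
        mapping = dict(row)  # plain dict in the temporal branch
    else:
        mapping = row._mapping  # SQLAlchemy RowMapping in the other
    return mapping
```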

Comment on lines -356 to -359
# TODO: this clears the caches sometimes when we wouldn't actually
# need to. Can we avoid that?
self._managers.dimensions.clearCaches()
raise
Contributor

(not on this ticket -- mostly theoretical issue) Yeah it looks like this removed code was trying to mitigate the issue I noticed above with cache consistency after a rollback.

Comment on lines +129 to +138
def load_from(self, other: DimensionRecordCache) -> None:
"""Load records from another cache, but do nothing if it doesn't
currently have any records.

Parameters
----------
other : `DimensionRecordCache`
Other cache to potentially copy records from.
"""
self._records = copy.deepcopy(other._records)
Contributor

This cache is a bit problematic for the service backend use case in its current form, but I'm happy to bang on it myself in the ticket I'm working on. This refactor is a huge improvement over the existing code already.

The most immediate (and easy) issue is that services instantiate a "template" Butler object that never does any work, so this copy never has anything to copy. Thus we end up re-fetching the cache on every request. Not necessarily worth worrying about, but I would also prefer that this not do so much copying -- was hoping that cloning a Butler would be on the order of "instantiating a few empty dicts and small objects and copying a few pointers".

Two possible ways to fix that:

  1. Force a 'prefetch' in LabeledButlerFactory so the data is there to copy.
  2. Make this cache threadsafe and shareable between instances.

I might go ahead and remove __getitem__ and add find as I described in the comment above. This would allow us to recover from someone inserting dimension records in the background, without a server restart. With find called from this class instead of directly on the DimensionRecordSet, we can detect the cache miss and force a refresh.

There are some problems with the interaction between multiple Butler instances and this cache. This is "mostly theoretical" because these dimensions are "mostly immutable", but I think worth mentioning because the same issues come up with any Butler-related cache. These issues can be ignored at the moment for read-only Butler server, but I'll need to make sure whatever I do here doesn't cause breakage for usage of DirectButler cloned through the same function.

If the cache is NOT shared, any mutation of the dimension records causes the cache in any other Butler instance to become stale. (Of course this also applies to a mutation done by a separate process entirely.)

If the cache IS shared, the big issue is that we allow long-lived user-initiated "outer" transactions, so it's tough to know when it's safe to update the cache. If we have mutated the data in one Butler instance, it may be rolled back before becoming "real", and we end up out of sync with the server.

Even doing the initial fetch or a refresh on a cache miss is potentially problematic if the cache is shared. Depending how old the transaction is, we may "move backwards in time" from another instance's point of view if we force-refresh the entire cache. Even if there are no transactions, with multiple threads a delayed response from the database could have the same effect.

So I think my approach might be:

  • Keep the cache copied between instances instead of shared.
  • For the server, prefetch everything when the "template" instance is created.
  • Move find onto this class
  • Read through to fetch_one on a cache miss instead of throwing an error, and update our non-shared copy of the cache.

We're still hosed if someone updates a cached dimension record, but at least inserts work with only a small performance penalty.

Member Author

I was indeed copying here precisely because I didn't want to deal with the complexity of having a shareable cache, though I hadn't thought through all of the implications as thoroughly as you have.

And I agree with your preferred approach in the bullets. You're welcome to add find to this class on your ticket branch instead of me doing it on DM-42344. My own plans for building atop this branch involve adding new methods to DimensionRecordStorageManager, so don't think they'll conflict with yours.

A few comments on issues you brought up:

  • I eventually want to get rid of the possibility of user-defined outer transactions in favor of having a way for users to batch up a number of operations to be performed atomically in a more declarative sense. That's not a priority, but I've prototyped it out a bit on a technote and convinced myself it satisfies our use cases. Certainly we'll never try to support user-defined outer transactions in RemoteButler.

  • One thing to look out for if we switch to refreshing the cache on misses is the possibility of someone repeatedly asking for records that do not exist. This cache is primarily used by Registry.expandDataId and when processing query result objects. There's no real chance of this problem in the latter, but the former is what's called in Butler.put and other dataset-writing methods to both expand the data ID and check that it's valid, so invalid data IDs resulting in cache misses is a real possibility - but for now, that's mostly in DirectButler, I think, because RemoteButler doesn't do dataset writes. But expandDataId calls are all over the place, and it might be worthwhile to do a little survey of them to see which ones might appear in loops where invalid data IDs are possible.

  • We really want this cache to be client-side in RemoteButler for the query-result-objects use case (where misses are not a concern); this will significantly reduce the complexity of serializing the query results compactly, especially if we use Arrow. That had made me assume it'd also be used client-side in Registry.expandDataId. But I'm not so sure now; we really want at most one server call in the implementation of that method, but we also want to avoid any server calls when everything is in the cache, and that'll be tricky to implement either way. It might depend on whether we can get user code to start using a vectorized replacement for expandDataId instead of calling it inside their own loops.

Contributor

You're welcome to add find to this class on your ticket branch instead of me doing it on DM-42344.

I think I'll probably hold off on this for now, since this cache is truly immutable for the "fully read-only data release DB" use case I'm trying to get working right now.

* I eventually want to get rid of the possibility of user-defined outer transactions in favor of having a way for users to batch up a number of operations to be performed atomically in a more declarative sense.

Yeah I like that approach, should be a huge win if we can make it work.

* `expandDataId` calls are all over the place

Thanks for calling this out -- I was vaguely aware of the existence of the expanded data ID thing but hadn't done any thinking on the implications yet. I'll leave the change to reading through on a miss for later as well then, since I don't need it for my immediate use case.

* We really want this cache to be client-side in `RemoteButler` for the query-result-objects use case (where misses are not a concern);

I think we can allow this. For cases like this where it's like "~50MB or smaller, almost never changes" we can use a pattern like:

  1. Client downloads the full cache on first use.
  2. Server sends a hash or version number for the cache with every query response, so the client knows when it needs to refresh the cache. (In this case I think we would pre-generate a sorted, JSON-serialized version of the cache data at startup and generate an SHA1 of the JSON string.)
  3. Handle cache invalidation on the server "somehow". (For this particular case we can probably just manually restart the server, since these changes are so infrequent.)

Because the cache is versioned we can safely share it between threads or even write it to local disk for use between runs. This same pattern would work OK for the DimensionUniverse cache as well.
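The steps above can be sketched as follows; all names here are hypothetical, and the hash is computed from a sorted JSON serialization as described:

```python
import hashlib
import json


def cache_version(records: dict) -> str:
    # Deterministic version: SHA1 of the sorted, JSON-serialized cache.
    payload = json.dumps(records, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


class ClientCache:
    """Client-side cache that refetches only when the server-reported
    version hash changes."""

    def __init__(self) -> None:
        self.records: dict = {}
        self.version = None

    def sync(self, server_version: str, fetch_full_cache) -> None:
        # Called with the hash the server attached to a response;
        # download the full cache only on first use or version change.
        if server_version != self.version:
            self.records = fetch_full_cache()
            self.version = server_version
```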

@dhirving
Contributor

dhirving commented Jan 5, 2024

@TallJimbo
Member Author

Oh, any idea why there isn't coverage of lines 280-282 in static.py?

Ah, interesting - it's because the only dimension that uses the "implied union" stuff is band, but that's also configured to be cached in the default dimension universe, so it always gets fetched via the cache. It'd definitely be nice to have some tests that use something other than the default dimension universe to cover cases like this, but I don't think it's worth prioritizing right now.

This is an extremely edgy edge case: when we materialize a query
with no columns, we have to add a dummy column to the temp table, just
as we do to the query.  But for some reason we were giving that column
a default, and those behave strangely in SQLAlchemy (they don't go
through the usual conversion logic), and it didn't like that we were
giving a bool column a bool default value.  Easy fix is to just not
give it a default value.

The test that was supposed to exercise this code was being thwarted by
some aggressive query simplification; we could tell at query-build time
that the test query would have exactly one row, and that triggered a
different code path that skipped the temporary table entirely.

This adds three new properties that replace two old properties and one
old method, conveying the same information in a more convenient,
better-named, and snake_case form.

I plan to deprecate the old ones eventually, but I'm going to aggregate
a bunch of these very minor, "barely escaped into the public API"
deprecations into one RFC, and I think I'll find more shortly.

This isn't used by anything yet, so the cache-fetching code added here
isn't being invoked anywhere yet, either.
@TallJimbo TallJimbo merged commit 307ac87 into main Jan 8, 2024
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-42324 branch January 8, 2024 19:32