
DM-42324: refactor dimension record storage classes #936

Merged
merged 8 commits into main from tickets/DM-42324 on Jan 8, 2024

Conversation

TallJimbo
Member

@TallJimbo TallJimbo commented Jan 3, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes (not user-visible)


codecov bot commented Jan 3, 2024

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (d0dcdfc) 88.36% compared to head (0cb8b8e) 88.37%.

Files Patch % Lines
...thon/lsst/daf/butler/registry/dimensions/static.py 94.58% 7 Missing and 6 partials ⚠️
python/lsst/daf/butler/dimensions/record_cache.py 86.95% 4 Missing and 2 partials ⚠️
...ython/lsst/daf/butler/registry/obscore/_manager.py 33.33% 2 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_elements.py 85.71% 2 Missing ⚠️
python/lsst/daf/butler/tests/_testRepo.py 50.00% 0 Missing and 1 partial ⚠️
python/lsst/daf/butler/transfers/_context.py 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #936      +/-   ##
==========================================
+ Coverage   88.36%   88.37%   +0.01%     
==========================================
  Files         306      301       -5     
  Lines       39167    38850     -317     
  Branches     8233     8185      -48     
==========================================
- Hits        34608    34333     -275     
+ Misses       3371     3343      -28     
+ Partials     1188     1174      -14     


@TallJimbo TallJimbo force-pushed the tickets/DM-42324 branch 2 times, most recently from f29dd40 to 5cf03f5 Compare January 3, 2024 20:05
Notes
-----
The nested `DimensionRecordSet` objects should never be modified in place
except when returned by the `modifying` context manager.
Member Author

I was considering adding a non-mutable ABC for DimensionRecordSet that could be used as the value type of this Mapping, to enforce this requirement with static typing (probably on DM-42344 instead of this ticket). The internal dict would still use the mutable full type, as would the modifying method. I'm open to opinions on whether that's a useful addition or an excessive one.

Contributor

I like that kind of thing in general as well as the pattern like the modifying method here. It's definitely helpful knowing that a thing can only be mutated in a small number of easily-identified places.

Doesn't seem like the DimensionRecordSet that comes out of the cache travels very far though, so I'd be inclined to say not totally necessary. (It also doesn't protect you from the other direction, where the cache is modified intentionally while there is a dangling reference to one of the inner DimensionRecordSet objects hanging out somewhere.)

Better might be to avoid exposing DimensionRecordCache's DimensionRecordSet objects at all. It looks like the only operation that is ever called on them is find. So you could add find to DimensionRecordCache and pass DimensionRecordCache around, instead of pulling out and passing around its inner DimensionRecordSet objects. This would have the advantage that you could right-click -> find references to locate everywhere that is using data from this cache. It also makes visible in record_cache.py all allowed access patterns.
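A minimal sketch of that suggestion, with every class and method name here being a stand-in rather than the real daf_butler interface:

```python
class DimensionRecordCache:
    """Hypothetical cache that exposes `find` itself instead of handing
    out its inner record-set objects."""

    def __init__(self) -> None:
        # element name -> {data ID key -> record}
        self._records: dict = {}

    def insert(self, element: str, key: str, record: object) -> None:
        self._records.setdefault(element, {})[key] = record

    def find(self, element: str, key: str) -> object:
        # The only lookup operation callers appear to need; keeping it
        # here makes all allowed access patterns visible in one class.
        return self._records[element][key]
```

Callers would then pass the `DimensionRecordCache` around and call `cache.find(...)`, so a find-references search on `find` locates every consumer of the cached data.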

Member Author

👍 to just exposing find; I'll do that on DM-42344.

It also occurs to me that being able to modify the cache at all (rather than just reset it) may be unnecessary complexity, making this question a moot point.

Contributor

@dhirving dhirving left a comment

Looks good to me. A couple of nitpicks that you can take or leave. My preference would be for this to merge down with zero or minimal changes, so I can pick it up on my branch.

Most of the comments I left are more notes for myself than requested changes from you -- though I'd appreciate any thoughts or additional context you want to share related to them.

Comment on lines +214 to +217
# Database transaction succeeded; update the cache to keep them
# consistent.
if cache_records is not None:
cache_records.update(records, replace=not skip_existing)
Contributor

(not for this ticket -- matches existing behavior) There's a potential consistency issue here, because we may be in an outer transaction that could still be rolled back. So we can end up with data in the cache that isn't actually in the DB.

Member Author

Good catch. I am inclined to just remove all of the logic about updating the cache on insert or sync in favor of just resetting whenever those are called. I'll do that on DM-42344, after taking a quick look at the higher-level code that inserts these records to make sure it doesn't do some kind of insert-query-insert-query interleaving that would be slowed down a lot by that.

Contributor

Yeah I'm in favor of removing that complexity, would definitely simplify reasoning about this. The refetch is still slightly tricky in theory because of cross-transaction visibility, but that's not a huge issue if the cache isn't shared between Butler instances.

Comment on lines +277 to +278
except LookupError:
return None
Contributor

(not for this ticket, matches existing behavior) If we made this pass instead of return None it would allow us to read through on a cache miss. So it would make this work for the case where someone did an insert into the DB in the background. Because the cache is used directly with this same pattern a few other places it wouldn't totally solve the issue though, and there's nothing that can be done if someone does a sync in the background instead of an insert.
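The read-through idea can be sketched like this, with `fetch_from_db` as a hypothetical stand-in for the real database query and a plain dict standing in for the cache:

```python
def find_record(cache: dict, key: str, fetch_from_db):
    """Look up a record, falling through to the database on a miss
    instead of returning None immediately."""
    try:
        return cache[key]
    except LookupError:
        pass  # cache miss: read through instead of returning None
    record = fetch_from_db(key)
    if record is not None:
        # Update our (non-shared) copy of the cache so the next lookup
        # hits; background inserts are now visible after one miss.
        cache[key] = record
    return record
```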

Comment on lines +626 to +631
) -> tuple[
list[dict[str, Any]],
list[dict[str, Any]],
list[dict[str, Any]],
dict[str, list[dict[str, Any]]],
]:
Contributor

nitpick: declaring a NamedTuple or dataclass to use for this return type might improve clarity and remove some duplicated tuple unpacking elsewhere.
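For example, a NamedTuple keeps the existing tuple behavior (positional unpacking still works) while giving call sites attribute names. The field names below are guesses at what the four collections hold, not the real ones:

```python
from typing import Any, NamedTuple


class SplitRecordRows(NamedTuple):
    """Named replacement for the anonymous four-element return tuple;
    field names are illustrative."""

    main_rows: list[dict[str, Any]]
    tag_rows: list[dict[str, Any]]
    skypix_rows: list[dict[str, Any]]
    summary_rows: dict[str, list[dict[str, Any]]]
```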

Member Author

Yeah, given that I've already forgotten which items in that tuple are which just a few days later, this is worth doing. I'll do it on DM-42344.

Comment on lines -292 to +305
return type(self)(self._db, defaults, self._managers)
result = SqlRegistry(self._db, defaults, self._managers)
result.dimension_record_cache.load_from(self.dimension_record_cache)
return result
Contributor

This introduces a change in behavior. If you do:

butler1 = Butler()
butler2 = Butler(butler=butler1)
butler2.registry.insertDimensionData(aCacheableDimension, ...)
# do something with butler1

Previously it would have updated the cache in both butlers. But after this PR, butler1 has a stale cache, as do any subsequent butlers copied from it. So they may throw errors if they try to do things with the missing dimension data, since cache misses are not handled gracefully.

This may be mostly a theoretical issue given that the cacheable dimensions are "mostly immutable."

Member Author

I hadn't actually realized the CachingDimensionRecordStorage instance was being shared between the Butler instances before; mostly I hadn't thought about what this sort of Butler construction does at all since until very recently it's mostly been for unit testing.

But I do think this is consistent with your plan for having different Butler instances each get their own cache when constructed by the factory, right?

Contributor

Yeah I think whatever happens to copy or share the cache happens right here, so this is fine structurally. As long as nothing is depending on that behavior currently (which it probably isn't) I think it's fine to leave this as-is.

else:
# MyPy says this isn't a real collections.abc.Mapping, but it
# sure behaves like one.
mapping = row._mapping # type: ignore
Contributor

I think it's actually complaining because mapping is declared as a dict in the other branch of the if and this is a different type. If you predeclare mapping: typing.Mapping above the if element.temporal block it seems to satisfy mypy without the ignore. (Or possibly this was fixed in the very latest version of sqlalchemy that updated a bunch of types.)

Member Author

Ah, the issue is the key type; I had tried that with Mapping[str, Any], and that still gets rejected, but just Mapping does work. I guess it's probably something like Mapping[str | sqlalchemy.<something>, Any], but I'm not sure what the <something> is; in any case this will do.
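The predeclaration fix described above looks roughly like this; `extract_mapping`, `temporal`, and the fake row are stand-ins for the surrounding real code:

```python
from collections.abc import Mapping
from typing import Any


def extract_mapping(temporal: bool, row: Any) -> Mapping:
    # Bare ``Mapping`` satisfies mypy here; ``Mapping[str, Any]`` is
    # still rejected because SQLAlchemy's RowMapping key type is wider
    # than ``str``.
    mapping: Mapping
    if temporal:
        mapping = dict(row)  # plain dict in the temporal branch
    else:
        mapping = row._mapping  # SQLAlchemy RowMapping in the other
    return mapping
```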

Comment on lines -356 to -359
# TODO: this clears the caches sometimes when we wouldn't actually
# need to. Can we avoid that?
self._managers.dimensions.clearCaches()
raise
Contributor

(not on this ticket -- mostly theoretical issue) Yeah it looks like this removed code was trying to mitigate the issue I noticed above with cache consistency after a rollback.

Comment on lines +129 to +138
def load_from(self, other: DimensionRecordCache) -> None:
"""Load records from another cache, but do nothing if it doesn't
currently have any records.

Parameters
----------
other : `DimensionRecordCache`
Other cache to potentially copy records from.
"""
self._records = copy.deepcopy(other._records)
Contributor

This cache is a bit problematic for the service backend use case in its current form, but I'm happy to bang on it myself in the ticket I'm working on. This refactor is a huge improvement over the existing code already.

The most immediate (and easy) issue is that services instantiate a "template" Butler object that never does any work, so this copy never has anything to copy. Thus we end up re-fetching the cache on every request. Not necessarily worth worrying about, but I would also prefer that this not do so much copying -- was hoping that cloning a Butler would be on the order of "instantiating a few empty dicts and small objects and copying a few pointers".

Two possible ways to fix that:

  1. Force a 'prefetch' in LabeledButlerFactory so the data is there to copy.
  2. Make this cache threadsafe and shareable between instances.

I might go ahead and remove __getitem__ and add find as I described in the comment above. This would allow us to recover from someone inserting dimension records in the background, without a server restart. With find called from this class instead of directly on the DimensionRecordSet, we can detect the cache miss and force a refresh.

There are some problems with the interaction between multiple Butler instances and this cache. This is "mostly theoretical" because these dimensions are "mostly immutable", but I think worth mentioning because the same issues come up with any Butler-related cache. These issues can be ignored at the moment for read-only Butler server, but I'll need to make sure whatever I do here doesn't cause breakage for usage of DirectButler cloned through the same function.

If the cache is NOT shared, any mutation of the dimension records causes the cache in any other Butler instance to become stale. (Of course this also applies to a mutation done by a separate process entirely.)

If the cache IS shared, the big issue is that we allow long-lived user-initiated "outer" transactions, so it's tough to know when it's safe to update the cache. If we have mutated the data in one Butler instance, it may be rolled back before becoming "real", and we end up out of sync with the server.

Even doing the initial fetch or a refresh on a cache miss is potentially problematic if the cache is shared. Depending how old the transaction is, we may "move backwards in time" from another instance's point of view if we force-refresh the entire cache. Even if there are no transactions, with multiple threads a delayed response from the database could have the same effect.

So I think my approach might be:

  • Keep the cache copied between instances instead of shared.
  • For the server, prefetch everything when the "template" instance is created.
  • Move find onto this class
  • Read through to fetch_one on a cache miss instead of throwing an error, and update our non-shared copy of the cache.

We're still hosed if someone updates a cached dimension record, but at least inserts work with only a small performance penalty.

Member Author

I was indeed copying here precisely because I didn't want to deal with the complexity of having a shareable cache, though I hadn't thought through all of the implications as thoroughly as you have.

And I agree with your preferred approach in the bullets. You're welcome to add find to this class on your ticket branch instead of me doing it on DM-42344. My own plans for building atop this branch involve adding new methods to DimensionRecordStorageManager, so don't think they'll conflict with yours.

A few comments on issues you brought up:

  • I eventually want to get rid of the possibility of user-defined outer transactions in favor of having a way for users to batch up a number of operations to be performed atomically in a more declarative sense. That's not a priority, but I've prototyped it out a bit on a technote and convinced myself it satisfies our use cases. Certainly we'll never try to support user-defined outer transactions in RemoteButler.

  • One thing to look out for if we switch to refreshing the cache on misses is the possibility of someone repeatedly asking for records that do not exist. This cache is primarily used by Registry.expandDataId and when processing query result objects. There's no real chance of this problem in the latter, but the former is what's called in Butler.put and other dataset-writing methods to both expand the data ID and check that it's valid, so invalid data IDs resulting in cache misses is a real possibility - but for now, that's mostly in DirectButler, I think, because RemoteButler doesn't do dataset writes. But expandDataId calls are all over the place, and it might be worthwhile to do a little survey of them to see which ones might appear in loops where invalid data IDs are possible.

  • We really want this cache to be client-side in RemoteButler for the query-result-objects use case (where misses are not a concern); this will significantly reduce the complexity of serializing the query results compactly, especially if we use Arrow. That had made me assume it'd also be used client-side in Registry.expandDataId. But I'm not so sure now; we really want at most one server call in the implementation of that method, but we also want to avoid any server calls when everything is in the cache, and that'll be tricky to implement either way. It might depend on whether we can get user code to start using a vectorized replacement for expandDataId instead of calling it inside their own loops.

Contributor

You're welcome to add find to this class on your ticket branch instead of me doing it on DM-42344.

I think I'll probably hold off on this for now, since this cache is truly immutable for the "fully read-only data release DB" use case I'm trying to get working right now.

* I eventually want to get rid of the possibility of user-defined outer transactions in favor of having a way for users to batch up a number of operations to be performed atomically in a more declarative sense.

Yeah I like that approach, should be a huge win if we can make it work.

* `expandDataId` calls are all over the place

Thanks for calling this out -- I was vaguely aware of the existence of the expanded data ID thing but hadn't done any thinking on the implications yet. I'll leave the change to reading through on a miss for later as well then, since I don't need it for my immediate use case.

* We really want this cache to be client-side in `RemoteButler` for the query-result-objects use case (where misses are not a concern);

I think we can allow this. For cases like this where it's like "~50MB or smaller, almost never changes" we can use a pattern like:

  1. Client downloads the full cache on first use.
  2. Server sends a hash or version number for the cache with every query response, so the client knows when it needs to refresh the cache. (In this case I think we would pre-generate a sorted, JSON-serialized version of the cache data at startup and generate an SHA1 of the JSON string.)
  3. Handle cache invalidation on the server "somehow". (For this particular case we can probably just manually restart the server, since these changes are so infrequent.)

Because the cache is versioned we can safely share it between threads or even write it to local disk for use between runs. This same pattern would work OK for the DimensionUniverse cache as well.
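The steps above can be sketched as follows; all names here are hypothetical, and the hash is computed from a sorted JSON serialization as described:

```python
import hashlib
import json


def cache_version(records: dict) -> str:
    # Deterministic version: SHA1 of the sorted, JSON-serialized cache.
    payload = json.dumps(records, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


class ClientCache:
    """Client-side cache that refetches only when the server-reported
    version hash changes."""

    def __init__(self) -> None:
        self.records: dict = {}
        self.version = None

    def sync(self, server_version: str, fetch_full_cache) -> None:
        # Called with the hash the server attached to a response;
        # download the full cache only on first use or version change.
        if server_version != self.version:
            self.records = fetch_full_cache()
            self.version = server_version
```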

@dhirving
Contributor

dhirving commented Jan 5, 2024

@TallJimbo
Member Author

Oh, any idea why there isn't coverage of lines 280-282 in static.py?

Ah, interesting - it's because the only dimension that uses the "implied union" stuff is band, but that's also configured to be cached in the default dimension universe, so it always gets fetched via the cache. It'd definitely be nice to have some tests that use something other than the default dimension universe to cover cases like this, but I don't think it's worth prioritizing right now.

This is an extremely edgy edge case: when we materialize a query
with no columns, we have to add a dummy column to the temp table, just
as we do to the query.  But for some reason we were giving that column
a default, and those behave strangely in SQLAlchemy (they don't go
through the usual conversion logic), and it didn't like that we were
giving a bool column a bool default value.  Easy fix is to just not
give it a default value.

The test that was supposed to exercise this code was being thwarted by
some aggressive query simplification; we could tell at query-build time
that the test query would have exactly one row, and that triggered a
different code path that skipped the temporary table entirely.

This adds three new properties that replace two old properties and one
old method, conveying the same information in a more convenient,
better-named, and snake_case form.

I plan to deprecate the old ones eventually, but I'm going to aggregate
a bunch of these very minor, "barely escaped into the public API"
deprecations into one RFC, and I think I'll find more shortly.

This isn't used by anything yet, so the cache-fetching code added here
isn't being invoked anywhere yet, either.
@TallJimbo TallJimbo merged commit 307ac87 into main Jan 8, 2024
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-42324 branch January 8, 2024 19:32