Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-41117: Improvements to caching and Butler initialization #904

Merged
merged 11 commits into from Nov 14, 2023

Conversation

andy-slac
Copy link
Contributor

A bunch of improvements for collection caching to reduce the need to cache all collections in memory. In some cases when we need to filter collections based on regexes we still need to fetch everything into memory, but in other cases we only read collections that are explicitly specified.

Similarly, collection summary cache is eliminated completely to avoid upfront loading which takes significant time. Now we are only fetching summaries for collections that are needed.

Table reflection is eliminated for static tables, and is postponed for dynamic (tags/calib) tables until they are used.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Copy link

codecov bot commented Nov 6, 2023

Codecov Report

Attention: 121 lines in your changes are missing coverage. Please review.

Comparison is base (faf02a0) 87.86% compared to head (922359c) 87.72%.

Files Patch % Lines
...st/daf/butler/registry/_collection_record_cache.py 33.33% 30 Missing ⚠️
...thon/lsst/daf/butler/registry/collections/_base.py 71.87% 16 Missing and 11 partials ⚠️
.../butler/registry/datasets/byDimensions/_manager.py 80.35% 11 Missing and 11 partials ⚠️
...t/daf/butler/registry/_collection_summary_cache.py 50.00% 9 Missing ⚠️
...on/lsst/daf/butler/registry/_dataset_type_cache.py 78.04% 7 Missing and 2 partials ⚠️
...butler/registry/datasets/byDimensions/summaries.py 83.67% 5 Missing and 3 partials ⚠️
...ython/lsst/daf/butler/registry/_caching_context.py 80.95% 4 Missing ⚠️
python/lsst/daf/butler/registry/sql_registry.py 62.50% 3 Missing ⚠️
python/lsst/daf/butler/_named.py 0.00% 0 Missing and 2 partials ⚠️
.../butler/registry/datasets/byDimensions/_storage.py 88.23% 1 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #904      +/-   ##
==========================================
- Coverage   87.86%   87.72%   -0.14%     
==========================================
  Files         281      285       +4     
  Lines       36440    36744     +304     
  Branches     7604     7690      +86     
==========================================
+ Hits        32017    32233     +216     
- Misses       3273     3340      +67     
- Partials     1150     1171      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here looks great; definitely a step in the right direction.

Pre-fetching of collection summaries was quite expensive and we did not use
those summaries very often. Removing the cache completely, now we query
summaries each time but only for the collections that are actually used.
`DefaultCollectionManager.refresh` now runs a single query to fetch
full contents of collection_chain table. This removes update logic from
collection record classes which became simple data classes now.
Static tables do not really need schema verification because we rely
on version numbers in butler_attributes. Verification and reflection
may still be usefule for dynamic tables (tags/calibs) but we now delay
it until the tables are actually used.
@andy-slac andy-slac force-pushed the tickets/DM-41117 branch 2 times, most recently from f2e1c09 to 8227a57 Compare November 9, 2023 16:17
@andy-slac andy-slac force-pushed the tickets/DM-41117 branch 3 times, most recently from 0181a32 to 1e59418 Compare November 13, 2023 23:17
@andy-slac
Copy link
Contributor Author

@TallJimbo, sorry for another review request, I have added three more commits to try to improve things:

  • 5833c91 - removes caching from dataset type manager, looking at the debug logs I realized that we now have too many SQL queries.
  • 14721a2 - adds context manager to Registry that enables caching and adds caches for collections and collection summaries
  • 1e59418 - adds cache for dataset types.

I'm not very happy with how it is structured now, cache for dataset types looks particularly ugly, but this is the best that I could do. As the result, number of queries is approximately the same in both QG building and pipetask run in ci_hsc_gen3. (In QG generation the number of queries (but not time, in my setup) is actually dominated by _skyFrameLookup).

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - I think it's time to wrap this one up and start one or more new tickets for further caching work, as I think this is a clear step in the right direction with some immediate benefits and no obvious downsides.

I do still think there's a ton of work ahead here, because the caching context interface implies a per-user cache and that'd be a massive pain to put on the server, and yet some of what we cache now can only go on the server (SQLAlchemy table objects) or needs to be on the server as a check even if it's also on the client (collection summaries). I think I'd advise some "step back and design it conceptually" work before further coding on this problem, though.

Typo: additiona in commit message for 14721a2.

Adds special cache classes for collection and summary records and
an additional structure that holds caches. New registry method is
a context manager that enables caches temporarily for the duration
of that context.
Unlike collection caches, dataset type cache is always on, this
helps to reduce number of queries in `pipetask run` without the need
to explicitly enable caching in multiple places.
Registry shiim now redirects to this method instead of Registry
method. RemoteButler raises NotImplementedError but may do something
non-trivial later when we know how caching is going to work with
client/server.
@andy-slac andy-slac merged commit c817402 into main Nov 14, 2023
15 of 17 checks passed
@andy-slac andy-slac deleted the tickets/DM-41117 branch November 14, 2023 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants