Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-33638: Check datastore cache for existence before checking remote datastore #646

Merged
merged 9 commits into from Feb 23, 2022

Conversation

timj
Copy link
Member

@timj timj commented Feb 18, 2022

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@@ -282,6 +282,35 @@ def should_be_cached(self, entity: Union[DatasetRef, DatasetType, StorageClass])
"""
raise NotImplementedError()

@abstractmethod
def known_to_cache(self, ref: DatasetRef, extension: Optional[str] = None) -> bool:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder whether I should put an underscore in front of this name. It's not a method that we should encourage people using. It's only here as a shortcut for the datastore checking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is AbstractDatastoreCacheManager exposed to "people"? If it is internal to datastores' implementation, then we probably don't care if it's public or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the assumption is that this is something that only datastores care about. Hopfully the "don't use this" caveat in the description is enough.

@codecov
Copy link

codecov bot commented Feb 18, 2022

Codecov Report

Merging #646 (b608633) into main (0bfdc4b) will decrease coverage by 0.00%.
The diff coverage is 81.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #646      +/-   ##
==========================================
- Coverage   83.61%   83.60%   -0.01%     
==========================================
  Files         239      239              
  Lines       30171    30245      +74     
  Branches     5047     5065      +18     
==========================================
+ Hits        25226    25286      +60     
- Misses       3799     3806       +7     
- Partials     1146     1153       +7     
Impacted Files Coverage Δ
...thon/lsst/daf/butler/core/datastoreCacheManager.py 79.72% <70.45%> (-1.20%) ⬇️
python/lsst/daf/butler/datastores/fileDatastore.py 80.23% <80.95%> (-0.10%) ⬇️
python/lsst/daf/butler/core/datasets/type.py 84.33% <100.00%> (+0.92%) ⬆️
tests/test_cliCmdPruneCollection.py 97.77% <100.00%> (+0.18%) ⬆️
tests/test_datasets.py 99.24% <100.00%> (+<0.01%) ⬆️
tests/test_datastore.py 99.62% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bfdc4b...b608633. Read the comment docs.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, few minor comments.

@@ -282,6 +282,35 @@ def should_be_cached(self, entity: Union[DatasetRef, DatasetType, StorageClass])
"""
raise NotImplementedError()

@abstractmethod
def known_to_cache(self, ref: DatasetRef, extension: Optional[str] = None) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is AbstractDatastoreCacheManager exposed to "people"? If it is internal to datastores' implementation, then we probably don't care if it's public or not.

Comment on lines +731 to +784
cached_location = self._construct_cache_name(ref, extension)
path_in_cache = cached_location.relative_to(self.cache_directory)
assert path_in_cache is not None # For mypy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three lines feel like you want a separate method _construct_cache_name_relative(ref, extension) -> str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. When I tried to remember how the cache key was formed I was a bit shocked to see that I repeated the logic in multiple places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although now that I look again I see that these two lines:

cached_location = self._construct_cache_name(ref, extension)
path_in_cache = cached_location.relative_to(self.cache_directory)

are never consecutive in the code other than this new method so it's not going to be a straightforward tweak.

python/lsst/daf/butler/core/datastoreCacheManager.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/core/datastoreCacheManager.py Outdated Show resolved Hide resolved
@timj timj force-pushed the tickets/DM-33638 branch 2 times, most recently from ceb0b4e to 30a325f Compare February 22, 2022 22:18
This tests that you get the warning and removes it from the
default output from pytest.
The cache size might be zero with multiple empty files so safer
to use the number of files in the cache.
Use a marker default object to indicate "can raise"
This makes the default "known_to_cache" implementation much
more efficient at the cost of an additional dict.
@timj timj merged commit 40211b7 into main Feb 23, 2022
@timj timj deleted the tickets/DM-33638 branch February 23, 2022 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants