New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-21849: overhaul collections system #236
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took me so long to make my way through this huge change. It looks good to me although I got a bit derailed in the low level query classes.
collection will be used for input lookups as well; if not, it must have | ||
the same value as ``run``. | ||
Name of the run datasets should be output to. If the run | ||
does not exist, it will be created. If ``collections`` is `None`, it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say explicitly here that if run
is not set it is a readonly butler?
refs = list(refs) | ||
for ref in refs: | ||
# isComponent isn't actually reliable, because the dataset type | ||
# name isn't the whole story, and that's mildly concerning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the dataset type name is A.b
how do you get that if it isn't a component?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of the converse, in the deferred-virtual-composite case, e.g. we write jointcal_wcs
as a standalone dataset and later use it as the wcs
component of some Exposure
dataset.
# not. This is consistent with the fact that we want | ||
# to ignore already-removed-from-datastore datasets | ||
# anyway. | ||
if self.datastore.exists(ref): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still worry that this hides problems with composites since non-existence might be a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed; I was just trying to handle the common case here, and defer a real solution to DM-23671.
python/lsst/daf/butler/_butler.py
Outdated
The names of a `~CollectionType.TAGGED` collections to associate | ||
the dataset with, overriding ``self.tags``. These collections | ||
must have already been added to the `Registry`. | ||
collection : `str` or `False`, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no collection argument.
collectionFieldName = self._collections.getCollectionForeignKeyName() | ||
collectionRecord = self._collections.find(collection) | ||
if collectionRecord.type is CollectionType.RUN: | ||
raise TypeError(f"Collection '{collection}' has type {collectionRecord.type.name}, not TAGGED.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message seems odd since you are checking it IS RUN, not it's not TAGGED. This is going to get really confusing if extra collection types turn up. Can you say is not TAGGED
instead?
from ..wildcards import DatasetTypeRestriction | ||
|
||
|
||
DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think os.path.dirname()
is preferred over os.path.split()[0]
.
|
||
|
||
DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0], | ||
"../../../../../../tests/data/registry")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a little easier to let the tests specify the data dir themselves and pass it in to this class?
if expression is ...: | ||
if not allowAny: | ||
raise TypeError("This expression may not be unconstrained.") | ||
return ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc string says this should return None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in an earlier (squashed away) version of this I was converting ...
to None
when going from public to internal APIs, and then I realized it was a bad idea to have two different sentinel types to represent the same concept. Docs fixed.
|
||
|
||
class DatasetTypeRestriction: | ||
"""An immutable set-like object that represents a restriction on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inherit from collections.abc.Set
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't; it can't always define __len__
(maybe other limitations, but at least that). I'll add a comment to that effect.
@@ -102,7 +97,7 @@ def validateButlerConfiguration(root, datasetTypes=None, ignore=None, quiet=Fals | |||
|
|||
# The collection does not matter for validation but if a run is specified | |||
# in the configuration then it must be consistent with this collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is now completely irrelevant
@andy-slac can you please take a look at the commits you were assigned so we can move towards merging? |
Sorry, I missed that, will look at it ASAP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked three commits that were specifically assigned to me, looks OK, small bunch of comments.
"""Datasets can be associated with and removed from ``TAGGED`` collections | ||
arbitrarily. | ||
|
||
Within a particular run, there may only be one dataset with a particular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run? Should it be a "tagged collection"?
---------- | ||
name : `str` | ||
Name of the collection. | ||
id : `int` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not exist any more?
One of "CASCADE" or "SET NULL", indicating what should happen to | ||
the referencing row if the collection row is deleted. `None` | ||
indicates that this should be an integrity error. | ||
kwds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**kwds
?
Returns | ||
------- | ||
fieldSpec : `ddl.FieldSpec` | ||
Specification for the field being added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it should both add FieldSpec to a TableSpec and return it?
One of "CASCADE" or "SET NULL", indicating what should happen to | ||
the referencing row if the collection row is deleted. `None` | ||
indicates that this should be an integrity error. | ||
kwds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**kwds
Notes | ||
----- | ||
Collections registered by another client of the same layer since the | ||
last call to `initialize` or `refresh` may not be found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method throw an exception for missing collection? My impression that get()
is a "stronger" version of a find()
method, usually if someone has a foreign key it means that object exists/existed. OTOH find()
is a general search by name, it may be more reasonable to re return None
from that method if nothing is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. In keeping with Python conventions, I'll also rename it from get
to __getitem__
.
row = { | ||
"name": self.name, | ||
TIMESPAN_FIELD_SPECS.begin.name: timespan.begin if timespan is not None else None, | ||
TIMESPAN_FIELD_SPECS.end.name: timespan.end if timespan is not None else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timespan is not None
is always True
, there is a protection for that above.
count = self._db.update(self._table, row, keys=["id"], | ||
values=[TIMESPAN_FIELD_SPECS.begin.name, | ||
TIMESPAN_FIELD_SPECS.end.name, | ||
"host"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at Database.update()
method and I'm not sure these parameters match the declaration?
record = NameKeyRunRecord(**kwds) | ||
else: | ||
kwds["type"] = type | ||
record = NameKeyCollectionRecord(**kwds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unconventionally looking, why not just record = NameKeyCollectionRecord(type=type)
? Same for NameKeyRunRecord
above.
wildcard.makeWhereExpression(self._tables.collection.columns.name) | ||
) | ||
for row in self._db.query(sql).fetchall(): | ||
yield self._records[row["name"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If collection was just created in database and is not in self._records
this will raise LookupError
, I'm not sure this is an expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A later commit makes this a moot point.
46ebbf4
to
06f342e
Compare
These were placeholders for future implementations that will not exist anytime in the near future, and that makes them clutter right now.
I think I swung too far in the direction of 'from X import Y' before; a lot of these very short symbols really need namespace context.
This replaces Butler.delete with Butler.prune, which is vectorized and doesn't "unstore" from the Datastore by default. That's both because we don't want to make it too easy to delete things in collections the current butler doesn't know about, and because PosixDatastore.delete isn't exception safe, and that makes it way too easy to get a corrupted repository.
Allowing 'collection' to mean 'run' isn't technically wrong, but it's about to be more confusing than helpful, because in the future butler.run == butler.collection will not be guaranteed.
...and fix a pretty obvious bug in it. Looks like a bad conflict resolution, but that wouldn't explain the bug. In any case, I'm quite surprised flake8 wasn't mad about the unqualified return statement in the middle of a function.
This could (and probably should) been done on DM-21246, since this collection was only ever used to instantiate the Butler, but we need to remove it now because instantiating a read-only Butler with a nonexistent collection will soon be an error.
Eventually we should extend this data and use it in the query tests, too. But for now that's a distraction from more pressing work.
These assume a different model of how runs and collections are related (namely, that runs are a type of collection) than what Registry has now, and hence making Registry use them will involve an interface change (in a later commit).
These just use the name as the primary key, and assume the number of collections/runs is small enough that the overhead of loading them all into memory up front can be ignored. We can of course try other approaches in the future with other implementation classes.
This makes the order of arguments consistent with virtually every other Registry and Butler API we have, and the name more consistent with other Registry method names.
This includes a number of changes for consistency and clarity, including: - replacing Registry's getAll* methods with more flexible query* methods; - making the types accepted in expressions for collections and dataset types more consistent; - switching from SQL LIKE patterns to regular expressions (it looks like we mostly want to apply these in Python rather than in the database); - adding user guide docs for those expressions, integrating it with the existing page on data ID expression strings.
This is a lead-up to supporting virtually-chained collections.
06f342e
to
612da1e
Compare
Numpydoc doesn't actually support Attributes as a heading, and those are already documented via class attributes.
I'm not sure why the apparently-duplicate py:currentmodule directives are needed (note that there's one at the top of index.rst, and that was enough for configuring.rst), but they make the links work, and without them, the links don't.
167b56d
to
474022d
Compare
There are a few mostly-separate projects here (and a lot of miscellaneous prep work), combined in one branch to yield only one set of disruptive changes downstream (rather than four). I've tried hard to keep the commits for those projects separate, but sadly GitHub's chronological rather than topological ordering makes that particularly hard to see. Here are the commits ordered topologically and grouped by project:
Various cleanups to existing code and other prep work (recommended reviewer @timj):
Reimplement collections in Registry ala prototype, and make runs a type of collection (recommended reviewer @andy-slac):
Move multi-collection support into Registry and Butler, add CHAINED collections (no preference for reviewer):