Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-21849: overhaul collections system #236

Merged
merged 23 commits into from Mar 25, 2020
Merged

DM-21849: overhaul collections system #236

merged 23 commits into from Mar 25, 2020

Conversation

TallJimbo
Copy link
Member

@TallJimbo TallJimbo commented Mar 4, 2020

There are a few mostly-separate projects here (and a lot of miscellaneous prep work), combined in one branch to yield only one set of disruptive changes downstream (rather than four). I've tried hard to keep the commits for those projects separate, but sadly GitHub's chronological rather than topological ordering makes that particularly hard to see. Here are the commits ordered topologically and grouped by project:

Various cleanups to existing code and other prep work (recommended reviewer @timj):

Reimplement collections in Registry ala prototype, and make runs a type of collection (recommended reviewer @andy-slac):

Move multi-collection support into Registry and Butler, add CHAINED collections (no preference for reviewer):

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me so long to make my way through this huge change. It looks good to me although I got a bit derailed in the low level query classes.

collection will be used for input lookups as well; if not, it must have
the same value as ``run``.
Name of the run datasets should be output to. If the run
does not exist, it will be created. If ``collections`` is `None`, it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say explicitly here that if run is not set it is a readonly butler?

refs = list(refs)
for ref in refs:
# isComponent isn't actually reliable, because the dataset type
# name isn't the whole story, and that's mildly concerning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the dataset type name is A.b how do you get that if it isn't a component?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of the converse, in the deferred-virtual-composite case, e.g. we write jointcal_wcs as a standalone dataset and later use it as the wcs component of some Exposure dataset.

# not. This is consistent with the fact that we want
# to ignore already-removed-from-datastore datasets
# anyway.
if self.datastore.exists(ref):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still worry that this hides problems with composites since non-existence might be a bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed; I was just trying to handle the common case here, and defer a real solution to DM-23671.

The names of a `~CollectionType.TAGGED` collections to associate
the dataset with, overriding ``self.tags``. These collections
must have already been added to the `Registry`.
collection : `str` or `False`, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no collection argument.

collectionFieldName = self._collections.getCollectionForeignKeyName()
collectionRecord = self._collections.find(collection)
if collectionRecord.type is CollectionType.RUN:
raise TypeError(f"Collection '{collection}' has type {collectionRecord.type.name}, not TAGGED.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message seems odd since you are checking it IS RUN, not it's not TAGGED. This is going to get really confusing if extra collection types turn up. Can you say is not TAGGED instead?

from ..wildcards import DatasetTypeRestriction


DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think os.path.dirname() is preferred over os.path.split()[0].



DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0],
"../../../../../../tests/data/registry"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a little easier to let the tests specify the data dir themselves and pass it in to this class?

if expression is ...:
if not allowAny:
raise TypeError("This expression may not be unconstrained.")
return ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc string says this should return None.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in an earlier (squashed away) version of this I was converting ... to None when going from public to internal APIs, and then I realized it was a bad idea to have two different sentinel types to represent the same concept. Docs fixed.



class DatasetTypeRestriction:
"""An immutable set-like object that represents a restriction on the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inherit from collections.abc.Set ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't; it can't always define __len__ (maybe other limitations, but at least that). I'll add a comment to that effect.

@@ -102,7 +97,7 @@ def validateButlerConfiguration(root, datasetTypes=None, ignore=None, quiet=Fals

# The collection does not matter for validation but if a run is specified
# in the configuration then it must be consistent with this collection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is now completely irrelevant

@timj
Copy link
Member

timj commented Mar 12, 2020

@andy-slac can you please take a look at the commits you were assigned so we can move towards merging?

@andy-slac
Copy link
Contributor

Sorry, I missed that, will look at it ASAP.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked three commits that were specifically assigned to me, looks OK, small bunch of comments.

"""Datasets can be associated with and removed from ``TAGGED`` collections
arbitrarily.

Within a particular run, there may only be one dataset with a particular
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run? Should it be a "tagged collection"?

----------
name : `str`
Name of the collection.
id : `int`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not exist any more?

One of "CASCADE" or "SET NULL", indicating what should happen to
the referencing row if the collection row is deleted. `None`
indicates that this should be an integrity error.
kwds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**kwds?

Returns
-------
fieldSpec : `ddl.FieldSpec`
Specification for the field being added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it should both add FieldSpec to a TableSpec and return it?

One of "CASCADE" or "SET NULL", indicating what should happen to
the referencing row if the collection row is deleted. `None`
indicates that this should be an integrity error.
kwds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**kwds

Notes
-----
Collections registered by another client of the same layer since the
last call to `initialize` or `refresh` may not be found.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method throw an exception for missing collection? My impression that get() is a "stronger" version of a find() method, usually if someone has a foreign key it means that object exists/existed. OTOH find() is a general search by name, it may be more reasonable to re return None from that method if nothing is found.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. In keeping with Python conventions, I'll also rename it from get to __getitem__.

row = {
"name": self.name,
TIMESPAN_FIELD_SPECS.begin.name: timespan.begin if timespan is not None else None,
TIMESPAN_FIELD_SPECS.end.name: timespan.end if timespan is not None else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timespan is not None is always True, there is a protection for that above.

count = self._db.update(self._table, row, keys=["id"],
values=[TIMESPAN_FIELD_SPECS.begin.name,
TIMESPAN_FIELD_SPECS.end.name,
"host"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at Database.update() method and I'm not sure these parameters match the declaration?

record = NameKeyRunRecord(**kwds)
else:
kwds["type"] = type
record = NameKeyCollectionRecord(**kwds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unconventionally looking, why not just record = NameKeyCollectionRecord(type=type)? Same for NameKeyRunRecord above.

wildcard.makeWhereExpression(self._tables.collection.columns.name)
)
for row in self._db.query(sql).fetchall():
yield self._records[row["name"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If collection was just created in database and is not in self._records this will raise LookupError, I'm not sure this is an expected behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A later commit makes this a moot point.

These were placeholders for future implementations that will not exist
anytime in the near future, and that makes them clutter right now.
I think I swung too far in the direction of 'from X import Y' before;
a lot of these very short symbols really need namespace context.
This replaces Butler.delete with Butler.prune, which is vectorized and
doesn't "unstore" from the Datastore by default.  That's both because
we don't want to make it too easy to delete things in collections the
current butler doesn't know about, and because PosixDatastore.delete
isn't exception safe, and that makes it way too easy to get a corrupted
repository.
Allowing 'collection' to mean 'run' isn't technically wrong, but it's
about to be more confusing than helpful, because in the future
butler.run == butler.collection will not be guaranteed.
...and fix a pretty obvious bug in it.

Looks like a bad conflict resolution, but that wouldn't explain the bug.

In any case, I'm quite surprised flake8 wasn't mad about the
unqualified return statement in the middle of a function.
This could (and probably should) been done on DM-21246, since this
collection was only ever used to instantiate the Butler, but we need
to remove it now because instantiating a read-only Butler with a
nonexistent collection will soon be an error.
Eventually we should extend this data and use it in the query tests,
too.  But for now that's a distraction from more pressing work.
These assume a different model of how runs and collections are related
(namely, that runs are a type of collection) than what Registry has
now, and hence making Registry use them will involve an interface
change (in a later commit).
These just use the name as the primary key, and assume the number of
collections/runs is small enough that the overhead of loading them all
into memory up front can be ignored.  We can of course try other
approaches in the future with other implementation classes.
This makes the order of arguments consistent with virtually every other
Registry and Butler API we have, and the name more consistent with
other Registry method names.
This includes a number of changes for consistency and clarity,
including:

 - replacing Registry's getAll* methods with more flexible query*
   methods;

 - making the types accepted in expressions for collections and
   dataset types more consistent;

 - switching from SQL LIKE patterns to regular expressions (it looks
   like we mostly want to apply these in Python rather than in
   the database);

 - adding user guide docs for those expressions, integrating
   it with the existing page on data ID expression strings.
This is a lead-up to supporting virtually-chained collections.
Numpydoc doesn't actually support Attributes as a heading, and those
are already documented via class attributes.
I'm not sure why the apparently-duplicate py:currentmodule directives
are needed (note that there's one at the top of index.rst, and that
was enough for configuring.rst), but they make the links work, and
without them, the links don't.
@TallJimbo TallJimbo merged commit 25e4ed4 into master Mar 25, 2020
@TallJimbo TallJimbo deleted the tickets/DM-21849 branch March 25, 2020 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants