New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-12624: Implement Butler DataUnits #34
Conversation
292b90d
to
fcc8b25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a quick look (I'm traveling) and it looks reasonably okay. I think you can now remove daf_persistence
from the table file.
@@ -21,4 +21,3 @@ | |||
from .storageClass import * | |||
from .storageInfo import * | |||
from .storedFileInfo import * | |||
from .units import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dataUnit
has to be added to this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear to me that we want to pull up all these submodules into the core
scope. But as long as we do that we should be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't they won't be documented (as currently set up)
`DataUnit` instances represent concrete units such as e.g. `Camera`, | ||
`Sensor`, `Visit` and `SkyMap`. | ||
|
||
Attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Init parameters need to be documented separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think the attributes themselves need to be documented as the docstrings of the properties themselves.
ValueError | ||
If a value for a required dependency is missing. | ||
""" | ||
for columnName in self.primaryKey: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe do this using sets so that the error message shows all keys that are bad rather than the first.
config : `SchemaConfig` | ||
`Registry` schema configuration describing `DataUnit` relations. | ||
builder : `SchemaBuilder`, optional | ||
When given, create `Table` entries for every `DataUnit` table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no Table
class in daf_butler
.
primaryKeyNames = [] | ||
for dataUnitName in dataUnitNames: | ||
primaryKeyNames.extend(self[dataUnitName].primaryKey) | ||
return primaryKeyNames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return [self[name].primarykey for name in dataUnitNames]
? Why a list and not a tuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. I'll change it.
@@ -20,43 +20,18 @@ | |||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | |||
|
|||
from .utils import iterable | |||
from .config import Config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this move down? If you are grouping by local versus external imports shouldn't the .utils
import also move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know. I assume it can move back up.
@@ -79,7 +79,7 @@ def testBasicPutGet(self): | |||
butler = Butler(self.configFile) | |||
# Create and register a DatasetType | |||
datasetTypeName = "test_metric" | |||
dataUnits = ("camera", "visit") | |||
dataUnits = ("Camera", "Visit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should they be case sensitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.
That's probably out of scope for this ticket, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was tricky for me to decide. Right now DataUnit
names are CamelCase
and column names / dataId dict keys are snake_case
. But it is probably nicer to allow both.
@@ -189,7 +189,7 @@ def testStorageInfo(self): | |||
def testAssembler(self): | |||
registry = Registry.fromConfig(self.configFile) | |||
storageClass = makeNewStorageClass("testAssembler")() | |||
datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass) | |||
datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Camera
here and camera
a few lines later so maybe they are case insensitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok then I missunderstand what you say above. The only place CamelCase
is used is when referring to either DataUnit
types or their tables. All other cases are snake_case
because they map directly to column names in said tables. So for now I'll leave it as is and will make a separate ticket to decide / fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; nothing looks awry from a structural perspective, though I do have a few design comments and of course a couple of style comments.
The union of `requiredDependencies` and `optionalDependencies`. | ||
table : `sqlalchemy.core.Table`, optional | ||
When not ``None`` the primary table entry corresponding to this | ||
`DataUnit`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this preclude the table actually being implemented as a view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the comment might, but the structure doesn't as far as I think of it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change the comment if that is better. I assume one can duck-type whatever entity represents a view in sqlalchemy
with the few things required (columns
) in our API.
`DataUnit` instances represent concrete units such as e.g. `Camera`, | ||
`Sensor`, `Visit` and `SkyMap`. | ||
|
||
Attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think the attributes themselves need to be documented as the docstrings of the properties themselves.
if hasattr(self, '_table'): | ||
return self._table | ||
else: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getattr(self, '_table', None)
|
||
|
||
class DataUnitRegistry: | ||
"""Instances of this class keep track of `DataUnit` relations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there actually multiple instances of this class (or, if so, a need for multiple instances)?
And as much as it's the obvious name, given that Registry
means something very important and specific in daf_butler, we should probably find a better name for this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't like the name either. But I couldn't think of a better one. Also note that we have a number of similar (but not identical enough) things in the design that basically act (or could act) like a map
of some kind. There may be multiple instances, although I don't know a need for it at this moment. Perhaps when there are multiple Butler instances with different units? I don't know.
self._dataUnits[dataUnitName] = dataUnit | ||
|
||
def __iter__(self): | ||
yield from self._dataUnitNames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this different from / better than iter(self._dataUnitNames)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I suspect in this case it does the same (since it is not part of a nested for loop). I'll change it to iter
.
@@ -19,7 +19,7 @@ | |||
# You should have received a copy of the GNU General Public License | |||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | |||
|
|||
from abc import ABCMeta, abstractmethod | |||
from abc import ABCMeta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A code comment explaining that we'll put the extra methods back into Registry
when we're done prototyping would be welcome.
if isinstance(config, str): | ||
config = SchemaConfig(config) | ||
self.config = config | ||
self.builder = SchemaBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a little strange that the builder
remains attached to the Schema
even after the Schema
has been built. That might be a perfectly reasonable relationship, but I think if it is the Builder
class might be better off with a different name, or we should at least have some docstrings that explain what their relationship is in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only did that such that we could add more DataUnit
tables later. But indeed, it might make more sense to separate it. Building is basically a one-off thing.
dataId = {dataUnitName: result[self._schema.dataUnits[dataUnitName]] | ||
for dataUnitName in datasetType.dataUnits} | ||
dataId = {dataUnitName: result[self._schema.dataUnits.links[dataUnitName]] | ||
for dataUnitName in self._schema.dataUnits.getPrimaryKeyNames(datasetType.dataUnits)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be deferred to another ticket if we do it at all, but this seems like the wrong time and place to merge the primary key names for all of the DataUnits associated with a DatasetType
; could we do that instead when the DatasetType
instance is constructed?
That was, of course, part of the motivation behind having a custom DataUnitTypeSet
class for DatasetType.dataUnits
, though we don't have to do it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I like your suggestion to refactor it. This was a bit quick and dirty. No need to build this over and over again. I'll file a ticket (possibly resurrecting DataUnitTypeSet
).
@@ -79,7 +79,7 @@ def testBasicPutGet(self): | |||
butler = Butler(self.configFile) | |||
# Create and register a DatasetType | |||
datasetTypeName = "test_metric" | |||
dataUnits = ("camera", "visit") | |||
dataUnits = ("Camera", "Visit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.
That's probably out of scope for this ticket, though.
@@ -189,7 +189,7 @@ def testStorageInfo(self): | |||
def testAssembler(self): | |||
registry = Registry.fromConfig(self.configFile) | |||
storageClass = makeNewStorageClass("testAssembler")() | |||
datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass) | |||
datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.
These classes are used to keep all information about DataUnits and their relations. Also implements `Registry._validateDataId`.
0f0ba99
to
b32ce16
Compare
No description provided.