Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-12624: Implement Butler DataUnits #34

Merged
merged 4 commits into from Apr 20, 2018
Merged

DM-12624: Implement Butler DataUnits #34

merged 4 commits into from Apr 20, 2018

Conversation

pschella
Copy link
Collaborator

No description provided.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a quick look (I'm traveling) and it looks reasonably okay. I think you can now remove daf_persistence from the table file.

@@ -21,4 +21,3 @@
from .storageClass import *
from .storageInfo import *
from .storedFileInfo import *
from .units import *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dataUnit has to be added to this file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me that we want to pull up all these submodules into the core scope. But as long as we do that we should be consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't they won't be documented (as currently set up)

`DataUnit` instances represent concrete units such as e.g. `Camera`,
`Sensor`, `Visit` and `SkyMap`.

Attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Init parameters need to be documented separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think the attributes themselves need to be documented as the docstrings of the properties themselves.

ValueError
If a value for a required dependency is missing.
"""
for columnName in self.primaryKey:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe do this using sets so that the error message shows all keys that are bad rather than the first.

config : `SchemaConfig`
`Registry` schema configuration describing `DataUnit` relations.
builder : `SchemaBuilder`, optional
When given, create `Table` entries for every `DataUnit` table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no Table class in daf_butler.

primaryKeyNames = []
for dataUnitName in dataUnitNames:
primaryKeyNames.extend(self[dataUnitName].primaryKey)
return primaryKeyNames
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return [self[name].primarykey for name in dataUnitNames]

? Why a list and not a tuple?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. I'll change it.

@@ -20,43 +20,18 @@
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from .utils import iterable
from .config import Config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this move down? If you are grouping by local versus external imports shouldn't the .utils import also move?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know. I assume it can move back up.

@@ -79,7 +79,7 @@ def testBasicPutGet(self):
butler = Butler(self.configFile)
# Create and register a DatasetType
datasetTypeName = "test_metric"
dataUnits = ("camera", "visit")
dataUnits = ("Camera", "Visit")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should they be case sensitive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.

That's probably out of scope for this ticket, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was tricky for me to decide. Right now DataUnit names are CamelCase and column names / dataId dict keys are snake_case. But it is probably nicer to allow both.

@@ -189,7 +189,7 @@ def testStorageInfo(self):
def testAssembler(self):
registry = Registry.fromConfig(self.configFile)
storageClass = makeNewStorageClass("testAssembler")()
datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass)
datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Camera here and camera a few lines later so maybe they are case insensitive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok then I missunderstand what you say above. The only place CamelCase is used is when referring to either DataUnit types or their tables. All other cases are snake_case because they map directly to column names in said tables. So for now I'll leave it as is and will make a separate ticket to decide / fix this.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; nothing looks awry from a structural perspective, though I do have a few design comments and of course a couple of style comments.

The union of `requiredDependencies` and `optionalDependencies`.
table : `sqlalchemy.core.Table`, optional
When not ``None`` the primary table entry corresponding to this
`DataUnit`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this preclude the table actually being implemented as a view?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the comment might, but the structure doesn't as far as I think of it now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change the comment if that is better. I assume one can duck-type whatever entity represents a view in sqlalchemy with the few things required (columns) in our API.

`DataUnit` instances represent concrete units such as e.g. `Camera`,
`Sensor`, `Visit` and `SkyMap`.

Attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think the attributes themselves need to be documented as the docstrings of the properties themselves.

if hasattr(self, '_table'):
return self._table
else:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getattr(self, '_table', None)



class DataUnitRegistry:
"""Instances of this class keep track of `DataUnit` relations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there actually multiple instances of this class (or, if so, a need for multiple instances)?

And as much as it's the obvious name, given that Registry means something very important and specific in daf_butler, we should probably find a better name for this class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't like the name either. But I couldn't think of a better one. Also note that we have a number of similar (but not identical enough) things in the design that basically act (or could act) like a map of some kind. There may be multiple instances, although I don't know a need for it at this moment. Perhaps when there are multiple Butler instances with different units? I don't know.

self._dataUnits[dataUnitName] = dataUnit

def __iter__(self):
yield from self._dataUnitNames
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this different from / better than iter(self._dataUnitNames)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I suspect in this case it does the same (since it is not part of a nested for loop). I'll change it to iter.

@@ -19,7 +19,7 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from abc import ABCMeta, abstractmethod
from abc import ABCMeta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A code comment explaining that we'll put the extra methods back into Registry when we're done prototyping would be welcome.

if isinstance(config, str):
config = SchemaConfig(config)
self.config = config
self.builder = SchemaBuilder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a little strange that the builder remains attached to the Schema even after the Schema has been built. That might be a perfectly reasonable relationship, but I think if it is the Builder class might be better off with a different name, or we should at least have some docstrings that explain what their relationship is in detail.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only did that such that we could add more DataUnit tables later. But indeed, it might make more sense to separate it. Building is basically a one-off thing.

dataId = {dataUnitName: result[self._schema.dataUnits[dataUnitName]]
for dataUnitName in datasetType.dataUnits}
dataId = {dataUnitName: result[self._schema.dataUnits.links[dataUnitName]]
for dataUnitName in self._schema.dataUnits.getPrimaryKeyNames(datasetType.dataUnits)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be deferred to another ticket if we do it at all, but this seems like the wrong time and place to merge the primary key names for all of the DataUnits associated with a DatasetType; could we do that instead when the DatasetType instance is constructed?

That was, of course, part of the motivation behind having a custom DataUnitTypeSet class for DatasetType.dataUnits, though we don't have to do it that way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like your suggestion to refactor it. This was a bit quick and dirty. No need to build this over and over again. I'll file a ticket (possibly resurrecting DataUnitTypeSet).

@@ -79,7 +79,7 @@ def testBasicPutGet(self):
butler = Butler(self.configFile)
# Create and register a DatasetType
datasetTypeName = "test_metric"
dataUnits = ("camera", "visit")
dataUnits = ("Camera", "Visit")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.

That's probably out of scope for this ticket, though.

@@ -189,7 +189,7 @@ def testStorageInfo(self):
def testAssembler(self):
registry = Registry.fromConfig(self.configFile)
storageClass = makeNewStorageClass("testAssembler")()
datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass)
datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.

Pim Schellart added 2 commits April 20, 2018 17:22
These classes are used to keep all information about DataUnits and their
relations.  Also implements `Registry._validateDataId`.
@pschella pschella merged commit 555a971 into master Apr 20, 2018
@ktlim ktlim deleted the tickets/DM-12624 branch August 25, 2018 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants