DM-12624: Implement Butler DataUnits #34

pschella · 2018-04-19T19:45:30Z

No description provided.

timj

I've had a quick look (I'm traveling) and it looks reasonably okay. I think you can now remove daf_persistence from the table file.

timj · 2018-04-20T11:06:34Z

python/lsst/daf/butler/core/__init__.py

@@ -21,4 +21,3 @@
 from .storageClass import *
 from .storageInfo import *
 from .storedFileInfo import *
-from .units import *


I think dataUnit has to be added to this file

It is not clear to me that we want to pull up all these submodules into the core scope. But as long as we do that we should be consistent.

If you don't they won't be documented (as currently set up)

timj · 2018-04-20T11:08:16Z

python/lsst/daf/butler/core/dataUnit.py

+    `DataUnit` instances represent concrete units such as e.g. `Camera`,
+    `Sensor`, `Visit` and `SkyMap`.
+
+    Attributes


Init parameters need to be documented separately.

And I think the attributes themselves need to be documented as the docstrings of the properties themselves.

timj · 2018-04-20T11:11:16Z

python/lsst/daf/butler/core/dataUnit.py

+        ValueError
+            If a value for a required dependency is missing.
+        """
+        for columnName in self.primaryKey:


maybe do this using sets so that the error message shows all keys that are bad rather than the first.

timj · 2018-04-20T11:15:00Z

python/lsst/daf/butler/core/dataUnit.py

+        config : `SchemaConfig`
+            `Registry` schema configuration describing `DataUnit` relations.
+        builder : `SchemaBuilder`, optional
+            When given, create `Table` entries for every `DataUnit` table.


There is no Table class in daf_butler.

timj · 2018-04-20T12:22:40Z

python/lsst/daf/butler/core/dataUnit.py

+        primaryKeyNames = []
+        for dataUnitName in dataUnitNames:
+            primaryKeyNames.extend(self[dataUnitName].primaryKey)
+        return primaryKeyNames


return [self[name].primarykey for name in dataUnitNames]

? Why a list and not a tuple?

No reason. I'll change it.

timj · 2018-04-20T12:25:16Z

python/lsst/daf/butler/core/schema.py

@@ -20,43 +20,18 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.

 from .utils import iterable
-from .config import Config


Why did this move down? If you are grouping by local versus external imports shouldn't the .utils import also move?

Don't know. I assume it can move back up.

timj · 2018-04-20T12:36:03Z

tests/test_butler.py

@@ -79,7 +79,7 @@ def testBasicPutGet(self):
        butler = Butler(self.configFile)
        # Create and register a DatasetType
        datasetTypeName = "test_metric"
-        dataUnits = ("camera", "visit")
+        dataUnits = ("Camera", "Visit")


should they be case sensitive?

I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.

That's probably out of scope for this ticket, though.

Yes, this was tricky for me to decide. Right now DataUnit names are CamelCase and column names / dataId dict keys are snake_case. But it is probably nicer to allow both.

timj · 2018-04-20T12:37:28Z

tests/test_sqlRegistry.py

@@ -189,7 +189,7 @@ def testStorageInfo(self):
    def testAssembler(self):
        registry = Registry.fromConfig(self.configFile)
        storageClass = makeNewStorageClass("testAssembler")()
-        datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass)
+        datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass)


Camera here and camera a few lines later so maybe they are case insensitive?

Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.

Hmm, ok then I missunderstand what you say above. The only place CamelCase is used is when referring to either DataUnit types or their tables. All other cases are snake_case because they map directly to column names in said tables. So for now I'll leave it as is and will make a separate ticket to decide / fix this.

TallJimbo

Looks good; nothing looks awry from a structural perspective, though I do have a few design comments and of course a couple of style comments.

TallJimbo · 2018-04-20T13:51:02Z

python/lsst/daf/butler/core/dataUnit.py

+        The union of `requiredDependencies` and `optionalDependencies`.
+    table : `sqlalchemy.core.Table`, optional
+        When not ``None`` the primary table entry corresponding to this
+        `DataUnit`.


Would this preclude the table actually being implemented as a view?

Well, the comment might, but the structure doesn't as far as I think of it now.

I can change the comment if that is better. I assume one can duck-type whatever entity represents a view in sqlalchemy with the few things required (columns) in our API.

TallJimbo · 2018-04-20T13:55:14Z

python/lsst/daf/butler/core/dataUnit.py

+    `DataUnit` instances represent concrete units such as e.g. `Camera`,
+    `Sensor`, `Visit` and `SkyMap`.
+
+    Attributes


And I think the attributes themselves need to be documented as the docstrings of the properties themselves.

TallJimbo · 2018-04-20T13:56:00Z

python/lsst/daf/butler/core/dataUnit.py

+        if hasattr(self, '_table'):
+            return self._table
+        else:
+            return None


getattr(self, '_table', None)

TallJimbo · 2018-04-20T14:25:28Z

python/lsst/daf/butler/core/dataUnit.py

+
+
+class DataUnitRegistry:
+    """Instances of this class keep track of `DataUnit` relations.


Are there actually multiple instances of this class (or, if so, a need for multiple instances)?

And as much as it's the obvious name, given that Registry means something very important and specific in daf_butler, we should probably find a better name for this class.

Yes, I don't like the name either. But I couldn't think of a better one. Also note that we have a number of similar (but not identical enough) things in the design that basically act (or could act) like a map of some kind. There may be multiple instances, although I don't know a need for it at this moment. Perhaps when there are multiple Butler instances with different units? I don't know.

TallJimbo · 2018-04-20T14:26:49Z

python/lsst/daf/butler/core/dataUnit.py

+        self._dataUnits[dataUnitName] = dataUnit
+
+    def __iter__(self):
+        yield from self._dataUnitNames


Is this different from / better than iter(self._dataUnitNames)?

Good question. I suspect in this case it does the same (since it is not part of a nested for loop). I'll change it to iter.

TallJimbo · 2018-04-20T15:04:13Z

python/lsst/daf/butler/core/registry.py

@@ -19,7 +19,7 @@
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.

-from abc import ABCMeta, abstractmethod
+from abc import ABCMeta


A code comment explaining that we'll put the extra methods back into Registry when we're done prototyping would be welcome.

TallJimbo · 2018-04-20T15:12:27Z

python/lsst/daf/butler/core/schema.py

+        if isinstance(config, str):
+            config = SchemaConfig(config)
+        self.config = config
+        self.builder = SchemaBuilder()


I find it a little strange that the builder remains attached to the Schema even after the Schema has been built. That might be a perfectly reasonable relationship, but I think if it is the Builder class might be better off with a different name, or we should at least have some docstrings that explain what their relationship is in detail.

I only did that such that we could add more DataUnit tables later. But indeed, it might make more sense to separate it. Building is basically a one-off thing.

TallJimbo · 2018-04-20T15:56:25Z

python/lsst/daf/butler/registries/sqlRegistry.py

-            dataId = {dataUnitName: result[self._schema.dataUnits[dataUnitName]]
-                      for dataUnitName in datasetType.dataUnits}
+            dataId = {dataUnitName: result[self._schema.dataUnits.links[dataUnitName]]
+                      for dataUnitName in self._schema.dataUnits.getPrimaryKeyNames(datasetType.dataUnits)}


This should probably be deferred to another ticket if we do it at all, but this seems like the wrong time and place to merge the primary key names for all of the DataUnits associated with a DatasetType; could we do that instead when the DatasetType instance is constructed?

That was, of course, part of the motivation behind having a custom DataUnitTypeSet class for DatasetType.dataUnits, though we don't have to do it that way.

Yes, I like your suggestion to refactor it. This was a bit quick and dirty. No need to build this over and over again. I'll file a ticket (possibly resurrecting DataUnitTypeSet).

TallJimbo · 2018-04-20T16:02:54Z

tests/test_butler.py

@@ -79,7 +79,7 @@ def testBasicPutGet(self):
        butler = Butler(self.configFile)
        # Create and register a DatasetType
        datasetTypeName = "test_metric"
-        dataUnits = ("camera", "visit")
+        dataUnits = ("Camera", "Visit")


I've been wondering about this, and I think the best approach is probably to be case (and underscore) insensitive, at least for data ID dictionary keys - the same words appear as both DataUnit type names (upper/camel case) and database field names (lower/snake case), and I think it'll be frustrating if we don't accept both in the dictionaries where the keys effectively refer to both.

That's probably out of scope for this ticket, though.

TallJimbo · 2018-04-20T16:06:33Z

tests/test_sqlRegistry.py

@@ -189,7 +189,7 @@ def testStorageInfo(self):
    def testAssembler(self):
        registry = Registry.fromConfig(self.configFile)
        storageClass = makeNewStorageClass("testAssembler")()
-        datasetType = DatasetType(name="test", dataUnits=("camera",), storageClass=storageClass)
+        datasetType = DatasetType(name="test", dataUnits=("Camera",), storageClass=storageClass)


Note that these are slightly different contexts; this one probably should be case-sensitive and require upper/camel case, since it's definitely a DataUnit type, not a field in a DataUnit table. The one a few lines later is where I think we ought to be case insensitive (as discussed in the comment at https://github.com/lsst/daf_butler/pull/34/files#diff-f06c7b82277ef9b866c66ef83c24c846R82), but probably do so by translating user inputs to lower/snake case.

These classes are used to keep all information about DataUnits and their relations. Also implements `Registry._validateDataId`.

Pim Schellart added 2 commits April 19, 2018 16:06

Move building of Schema to new SchemaBuilder.

0cdf911

Add addColumn and addForeignKeyConstraint to SchemaBuilder.

40e17e8

pschella force-pushed the tickets/DM-12624 branch from 292b90d to fcc8b25 Compare April 19, 2018 23:22

timj approved these changes Apr 20, 2018

View reviewed changes

TallJimbo approved these changes Apr 20, 2018

View reviewed changes

Pim Schellart added 2 commits April 20, 2018 17:22

Remove obsolete DataUnit support.

080eed4

Add DataUnit and DataUnitRegistry.

b32ce16

These classes are used to keep all information about DataUnits and their relations. Also implements `Registry._validateDataId`.

pschella force-pushed the tickets/DM-12624 branch from 0f0ba99 to b32ce16 Compare April 20, 2018 21:24

pschella merged commit 555a971 into master Apr 20, 2018

ktlim deleted the tickets/DM-12624 branch August 25, 2018 06:49



		class DataUnitRegistry:
		"""Instances of this class keep track of `DataUnit` relations.

DM-12624: Implement Butler DataUnits #34

DM-12624: Implement Butler DataUnits #34

Conversation

pschella commented Apr 19, 2018

timj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TallJimbo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment