DM-12635: Subpackage for converting Gen2->Gen3. #30

TallJimbo · 2018-04-07T00:47:09Z

The conversion code here has two high-level drivers:

The ConversionWalker class (walker.py) extracts everything we'll need for the conversion from a Gen2 repo. It's basically complete, and I've done some one-off testing of it on ci_hsc (check out ci_hsc, setup and build, and start by calling ConversionWalker.tryRoot on $CI_HSC_DIR/DATA).
The ConversionWriter class (writer.py) adds entries to a Registry and Datastore, or it would if all of the actual calls to do so weren't just "TODO" comment placeholders.

As for those TODO placeholders, there are a ton of them, and I think they cover most of what needs to be done to get things working. Beyond that, we'll also eventually need a high-level driver that can be run easily from the command-line, but that doesn't necessarily need to be done right now.

timj · 2018-04-07T00:48:49Z

Important note: Datastore doesn't yet write the formatter information anywhere permanent. At the moment it uses a shim that passes the tests but forgets everything as soon as the process ends. We need a more permanent fix in there.
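To make the gap concrete, here is a minimal sketch of what a persistent record of formatter information could look like, using only the standard-library sqlite3 module. It is not the daf_butler implementation: the FormatterRecords class, its table layout, and the formatter name used below are all hypothetical and exist only for illustration.

```python
import os
import sqlite3
import tempfile


class FormatterRecords:
    """Hypothetical persistent map from a dataset ID to a formatter name."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS formatter ("
            "dataset_id INTEGER PRIMARY KEY, formatter TEXT NOT NULL)"
        )
        self._conn.commit()

    def set(self, dataset_id, formatter):
        # "with conn" commits on success and rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO formatter VALUES (?, ?)",
                (dataset_id, formatter),
            )

    def get(self, dataset_id):
        row = self._conn.execute(
            "SELECT formatter FROM formatter WHERE dataset_id = ?",
            (dataset_id,),
        ).fetchone()
        return None if row is None else row[0]

    def close(self):
        self._conn.close()


def roundtrip():
    """Write a record, reopen the file, and read the record back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.sqlite3")
        first = FormatterRecords(path)
        first.set(1, "hypothetical.module.ExampleFormatter")
        first.close()
        # A fresh object stands in for a new process reading the same file.
        second = FormatterRecords(path)
        found = second.get(1)
        missing = second.get(2)
        second.close()
        return found, missing
```

The point is only that the record survives closing and reopening the store, which is what the in-memory shim cannot do.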

TallJimbo · 2018-04-07T00:53:47Z

Good to know. I think we'd get a lot of the benefits from just getting this to write to Registry, as that'd let us easily put together realistically complex test datasets for it.

pschella · 2018-04-09T00:05:09Z

python/lsst/daf/butler/registries/sqlRegistry.py

@@ -508,6 +510,7 @@ def addRun(self, run):
                                                        collection=run.collection,
                                                        environment_id=None,  # TODO add environment
                                                        pipeline_id=None))    # TODO add pipeline
+        # TODO: set given Run's 'id' attribute.


No, this is already done in the Execution part.

TallJimbo · 2018-04-26T01:19:53Z

This is now as far along as I think makes sense on this ticket. Next steps will be making Datastore records persistent and adding concrete StorageClasses and Formatters.

pschella

I'm assuming we can't really unit test any of this?

pschella · 2018-04-16T20:55:39Z

python/lsst/daf/butler/core/registry.py

@@ -135,6 +135,7 @@ def addDataset(self, ref, uri, components, run, producer=None):
            If a Dataset with the given `DatasetRef` already exists in the
            given Collection.
        """
+        # TODO: this signature no longer agrees with SqlRegistry.addDataset.


One of many I'm afraid. I'm planning to copy them all over after things have settled a bit.

Yes, this TODO has already been removed.

pschella · 2018-04-16T20:57:05Z

python/lsst/daf/butler/gen2convert/extractor.py

+
+from .structures import Gen2Dataset
+
+TEMPLATE_RE = re.compile(r'\%\((?P<name>\w+)\).*?(?P<type>[idrs])')


Would be nice to add an example comment for what this matches.
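As a possible starting point for such a comment, the snippet below runs the pattern from the diff over a path template. The template string is made up for illustration and is not taken from any real mapper configuration; only TEMPLATE_RE itself comes from the code under review.

```python
import re

# Pattern copied from the diff above: it captures the key name inside
# "%(...)" and the conversion character (i, d, r, or s) that follows any
# width or padding flags.
TEMPLATE_RE = re.compile(r'\%\((?P<name>\w+)\).*?(?P<type>[idrs])')

# Hypothetical Gen2-style path template, for illustration only.
template = "%(pointing)05d/%(filter)s/corr/CORR-%(visit)07d-%(ccd)03d.fits"

fields = [(m.group("name"), m.group("type"))
          for m in TEMPLATE_RE.finditer(template)]
print(fields)
# [('pointing', 'd'), ('filter', 's'), ('visit', 'd'), ('ccd', 'd')]
```

Each match covers one printf-style field such as "%(visit)07d", yielding the key name and its conversion type.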

pschella · 2018-04-17T14:10:54Z

python/lsst/daf/butler/gen2convert/writer.py

+        self.insertSkyMaps(registry)
+        self.insertObservations(registry)
+        self.insertDatasets(registry, datastore)
+        # TODO: associate parent repo datasets with child repo Collections


Is this Registry.merge? Or just Registry.associate?

pschella · 2018-04-26T18:33:02Z

config/gen2convert.yaml

+    # entries; options here say how to get those from a Gen2 repo.
+    VisitInfo:
+      # The Gen2 DatasetType to read when trying to create a VisitInfo.
+      # (we actually add a "_md" suffix, because we just read the metadata).


Something we don't (yet) have in Gen3. Unless you mean findDataUnitEntry type things here.

I was thinking we'd handle the metadata as another Exposure component dataset.

pschella · 2018-04-26T19:04:33Z

config/gen2convert.yaml

+  # its Run; this will be the one corresponding to the Gen2 repository
+  # that originally contained the file, unless that has been overridden.
+  raw/HSC: 1
+  ref/ps1_pv3_3pi_20170110: 2


We may want to allow for a None default that will generate a new Run.

That default is already in place; if nothing here matches, we'll make a new Run, and use the repo root as its Collection name.
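The fallback described here can be sketched roughly as follows. This is an illustration of the rule only, not the code in writer.py: RunChoice and choose_run are hypothetical names, and matching config keys against repo roots by exact string equality is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RunChoice:
    """Hypothetical result type: exactly one of the two fields is set."""
    run_id: Optional[int]       # ID of an existing Run named in the config
    collection: Optional[str]   # Collection name for a Run to be created


def choose_run(repo_root: str, configured_runs: Dict[str, int]) -> RunChoice:
    """Pick the configured Run for a repo, or fall back to a new Run."""
    run_id = configured_runs.get(repo_root)
    if run_id is not None:
        return RunChoice(run_id=run_id, collection=None)
    # Nothing in the config matched: a new Run is needed, with the repo
    # root used as its Collection name.
    return RunChoice(run_id=None, collection=repo_root)


# The two entries shown in the diff above.
configured = {"raw/HSC": 1, "ref/ps1_pv3_3pi_20170110": 2}
print(choose_run("raw/HSC", configured))
print(choose_run("rerun/example", configured))  # hypothetical repo root
```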

pschella · 2018-04-27T01:28:05Z

python/lsst/daf/butler/gen2convert/writer.py

+        self.insertObservations(registry)
+        self.insertDatasetTypes(registry)
+        self.insertDatasets(registry, datastore)
+        # TODO: associate parent repo datasets with child repo Collections


Is this still open? Seems reasonably essential.

Yes, I'm afraid it is, and I remember it being hard but don't remember exactly why. We can work around it with a custom call to query in the ci_hsc drivers for the conversion for now.

pschella · 2018-04-27T01:29:31Z

python/lsst/daf/butler/gen2convert/writer.py

+            skyMapName = self.skyMapNames.get(sha1, None)
+            try:
+                existing, = registry.query("SELECT skymap FROM SkyMap WHERE sha1=:sha1",
+                                           sha1=sha1)


In general we probably don't want to rely on query too much, when other API calls should exist.

I completely agree. I've been meaning to talk to you about how to add alternate unique lookup APIs for some DataUnits (SkyMap sha1, Patch cell_x and cell_y, Sensor name, probably some Camera-specific Visit or Exposure identifiers).
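For discussion, one shape such a lookup API could take is a small named helper that hides the SQL from callers. The sketch below uses plain sqlite3 and a toy SkyMap table; find_skymap_by_sha1, the table definition, and the sample row are hypothetical, and none of this is an existing Registry method.

```python
import sqlite3
from typing import Optional


def find_skymap_by_sha1(conn: sqlite3.Connection, sha1: bytes) -> Optional[str]:
    """Return the name of the SkyMap with the given hash, or None."""
    row = conn.execute(
        "SELECT skymap FROM SkyMap WHERE sha1 = :sha1", {"sha1": sha1}
    ).fetchone()
    return None if row is None else row[0]


# Toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SkyMap (skymap TEXT PRIMARY KEY, sha1 BLOB UNIQUE)")
conn.execute("INSERT INTO SkyMap VALUES (?, ?)", ("example_skymap", b"\x01\x02"))

print(find_skymap_by_sha1(conn, b"\x01\x02"))  # example_skymap
print(find_skymap_by_sha1(conn, b"\xff"))      # None
```

The UNIQUE constraint on sha1 is what makes a single-row lookup well defined; the same pattern would apply to the other alternate keys listed above.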

pschella · 2018-04-27T01:31:25Z

python/lsst/daf/butler/gen2convert/writer.py

+                     "and does not already exist in the Registry.").format(sha1.hex())
+                )
+            log.info("Inserting SkyMap '%s' with sha1=%s", skyMapName, sha1.hex())
+            skyMap.register(skyMapName, registry)


Do we want the same skymap to have different names? Otherwise I don't really get why the call can't be skyMap.register(registry).

Or directly addDataUnitEntry(...).

pschella · 2018-04-27T01:36:38Z

python/lsst/daf/butler/gen2convert/writer.py

+                    "boresight_ra": visitInfo.getBoresightRaDec().getLongitude().asDegrees(),
+                    "boresight_dec": visitInfo.getBoresightRaDec().getLatitude().asDegrees(),
+                    "boresight_parallactic_angle": visitInfo.getBoresightParAngle().asDegrees(),
+                    "local_era": visitInfo.getLocalEra().asDegrees(),


Do we lose information with respect to visitInfo here?

Possibly, but I don't want to think too hard about this until we've gone through the schema review.

pschella · 2018-04-27T01:37:24Z

python/lsst/daf/butler/gen2convert/writer.py

+        log = Log.getLogger("lsst.daf.butler.gen2convert")
+        for datasetType in self.datasetTypes.values():
+            # TODO: should put this "just make sure it exists" logic
+            # into registerDatasetType itself, and make it a bit more careful.


Definitely!
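A rough sketch of what "just make sure it exists" could mean inside the registration call itself: registering an identical definition twice is a no-op, while a conflicting definition is an error. ToyRegistry is a stand-in written for this example, and its registerDatasetType signature and return value are assumptions, not the real Registry interface.

```python
class ToyRegistry:
    """Hypothetical in-memory stand-in for a Registry."""

    def __init__(self):
        self._datasetTypes = {}

    def registerDatasetType(self, name, definition):
        """Register a DatasetType if needed.

        Returns True if the type was newly inserted and False if an
        identical definition was already present. Raises ValueError if
        the name exists with a different definition.
        """
        existing = self._datasetTypes.get(name)
        if existing is None:
            self._datasetTypes[name] = definition
            return True
        if existing != definition:
            raise ValueError(
                f"DatasetType '{name}' already registered with a "
                f"different definition: {existing!r} != {definition!r}"
            )
        return False


registry = ToyRegistry()
# The definition tuple here is made up for illustration.
print(registry.registerDatasetType("calexp", ("Visit", "Sensor")))  # True
print(registry.registerDatasetType("calexp", ("Visit", "Sensor")))  # False
```

With this behavior in the registration call, the loop in the converter would not need its own existence check.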

TallJimbo

I think I've addressed all review comments. I'll squash and merge by Monday.

As for tests, I don't think it's a good use of time to try to cook up the kind of test data we'd need for small scale unit testing of this code in daf_butler. Instead, I'm planning to add running the conversion to ci_hsc's SConstruct, and add checks in that package that we can access the datasets using the Gen3 butler and that this is the same as Gen2 access. We need at least DM-14225 before that's worthwhile, though, and may want to add a few more concrete StorageClasses/Formatters first as well.

afw.image.VisitInfo provides a good initial guess at what metadata we'll want for Visits and Exposures.

We don't really have an entity responsible for creating AbstractFilters, since they're not associated with a Camera or a SkyMap. It's probably best to just let them be "created" when first referenced, especially since we don't actually need to store any metadata associated with them.

Having addDataset call associate leads to "locked database" errors, apparently because the transaction context managers aren't nesting the way SQLAlchemy's documentation claims they should. This should be investigated further, but this workaround lets us move forward for now.
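For anyone investigating, it may help to have a reference for what properly nested transaction scopes look like in SQLite itself. The sketch below uses the standard-library sqlite3 module with SAVEPOINTs. It is neither SQLAlchemy nor the workaround used in this commit, and the Db class is hypothetical.

```python
import sqlite3
from contextlib import contextmanager


class Db:
    """Hypothetical wrapper providing nestable transaction scopes."""

    def __init__(self, path=":memory:"):
        # isolation_level=None disables the module's implicit BEGINs so
        # that the savepoints below control the transaction boundaries.
        self.conn = sqlite3.connect(path, isolation_level=None)
        self._depth = 0

    @contextmanager
    def transaction(self):
        name = f"sp{self._depth}"
        # A SAVEPOINT issued outside a transaction starts one; releasing
        # the outermost savepoint commits it.
        self.conn.execute(f"SAVEPOINT {name}")
        self._depth += 1
        try:
            yield
        except BaseException:
            self.conn.execute(f"ROLLBACK TO SAVEPOINT {name}")
            self.conn.execute(f"RELEASE SAVEPOINT {name}")
            raise
        else:
            self.conn.execute(f"RELEASE SAVEPOINT {name}")
        finally:
            self._depth -= 1


db = Db()
db.conn.execute("CREATE TABLE t (x INTEGER)")
with db.transaction():
    db.conn.execute("INSERT INTO t VALUES (1)")
    try:
        with db.transaction():
            db.conn.execute("INSERT INTO t VALUES (2)")
            raise RuntimeError("inner failure")
    except RuntimeError:
        pass  # only the inner scope is rolled back
rows = [r[0] for r in db.conn.execute("SELECT x FROM t ORDER BY x")]
print(rows)  # [1]
```

Because both scopes share a single connection, the inner scope never has to compete with the outer one for the database lock.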

pschella reviewed Apr 9, 2018

TallJimbo force-pushed the tickets/DM-12635 branch 3 times, most recently from 81e6489 to 9442e67 Compare April 20, 2018 22:59

TallJimbo added 3 commits April 20, 2018 23:07

Add PhysicalFilter to Exposure.

19ba7db

Fix Exposure/Visit timestamp relationships.

62e253c

Minor fixes for schema yaml.

ba9c8dc

TallJimbo force-pushed the tickets/DM-12635 branch 7 times, most recently from 72722ff to b6d9c68 Compare April 26, 2018 01:11

TallJimbo changed the title ~~DM-12635: WIP subpackage for converting Gen2->Gen3.~~ DM-12635: Subpackage for converting Gen2->Gen3. Apr 26, 2018

TallJimbo force-pushed the tickets/DM-12635 branch from 812c5bd to b6d9c68 Compare April 27, 2018 01:16

pschella reviewed Apr 27, 2018

TallJimbo commented Apr 27, 2018

TallJimbo added 9 commits April 30, 2018 10:07

Add schema entries from VisitInfo.

ee499ed

afw.image.VisitInfo provides a good initial guess at what metadata we'll want for Visits and Exposures.

Add seeing to Visit table.

6bd6ea2

Add butler_test_repository to .gitignore

9245406

Add interface for generic SQL queries to SqlRegistry.

0998d7b

Add __repr__ for Run.

a3c45d3

Add API ingesting existing files into PosixDatastore.

9a9c645

Add subpackage for converting Gen2->Gen3.

26e7d6f

TallJimbo force-pushed the tickets/DM-12635 branch from d7bfb5d to 26e7d6f Compare April 30, 2018 14:07

TallJimbo merged commit 13d5374 into master Apr 30, 2018

ktlim deleted the tickets/DM-12635 branch August 25, 2018 06:49
