Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-12635: Subpackage for converting Gen2->Gen3. #30

Merged
merged 12 commits into from Apr 30, 2018
Merged

Conversation

TallJimbo
Copy link
Member

The conversion code here has two high-level drivers:

  • The ConversionWalker class (walker.py) extracts everything we'll need for the conversion from a Gen2 repo. It's basically complete, and I've done some one-off testing it on ci_hsc (check out ci_hsc, setup and build, and start by calling ConversionWalker.tryRoot on $CI_HSC_DIR/DATA).
  • The ConversionWriter class (writer.py) adds entries to a Registry and Datastore - or it would if all of the actual calls to do so weren't just "TODO" comment placeholders

As for those TODO placeholders, there are a ton of them, and I think they cover most of what needs to be done to get things working. Beyond that, we'll also eventually need a high-level driver that can be run easily from the command-line, but that doesn't necessarily need to be done right now.

@timj
Copy link
Member

timj commented Apr 7, 2018

Important note: datastore doesn't yet write the formatter information to somewhere permanent. It uses a shim at the moment that passes tests but forgets immediately the process ends. We need a more permanent fix in there.

@TallJimbo
Copy link
Member Author

Good to know. I think we'd get a lot of the benefits from just getting this to write to Registry, as that'd let us easily put together realistically complex test datasets for it.

@@ -508,6 +510,7 @@ def addRun(self, run):
collection=run.collection,
environment_id=None, # TODO add environment
pipeline_id=None)) # TODO add pipeline
# TODO: set given Run's 'id' attribute.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is already done in the Execution part.

@TallJimbo TallJimbo force-pushed the tickets/DM-12635 branch 3 times, most recently from 81e6489 to 9442e67 Compare April 20, 2018 22:59
@TallJimbo TallJimbo force-pushed the tickets/DM-12635 branch 7 times, most recently from 72722ff to b6d9c68 Compare April 26, 2018 01:11
@TallJimbo
Copy link
Member Author

This is now as far along as I think make sense on this ticket. Next steps will be making Datastore records persistent and adding concrete StorageClasses and Formatters.

@TallJimbo TallJimbo changed the title DM-12635: WIP subpackage for converting Gen2->Gen3. DM-12635: Subpackage for converting Gen2->Gen3. Apr 26, 2018
Copy link
Collaborator

@pschella pschella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming we can't really unittest any of this?

@@ -135,6 +135,7 @@ def addDataset(self, ref, uri, components, run, producer=None):
If a Dataset with the given `DatasetRef` already exists in the
given Collection.
"""
# TODO: this signature no longer agrees with SqlRegistry.addDataset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of many I'm afraid. I'm planning to copy them all over after things have settled a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this TODO has already been removed.


from .structures import Gen2Dataset

TEMPLATE_RE = re.compile(r'\%\((?P<name>\w+)\).*?(?P<type>[idrs])')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to add an example comment for what this matches.

self.insertSkyMaps(registry)
self.insertObservations(registry)
self.insertDatasets(registry, datastore)
# TODO: associate parent repo datasets with child repo Collections
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Registry.merge? Or just Registry.associate?

# entries; options here say how to get those from a Gen2 repo.
VisitInfo:
# The Gen2 DatasetType to read when trying to create a VisitInfo.
# (we actually add a "_md" suffix, because we just read the metadata).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we don't (yet) have in Gen3. Unless you mean findDataUnitEntry type things here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we'd handle the metadata as another Exposure component dataset.

# its Run; this will be the one corresponding to the Gen2 repository
# that originally contained the file, unless that has been overridden.
raw/HSC: 1
ref/ps1_pv3_3pi_20170110: 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to allow for a None default that will generate a new Run.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That default is already in place; if nothing here matches, we'll make a new Run, and use the repo root at its Collection name.

self.insertObservations(registry)
self.insertDatasetTypes(registry)
self.insertDatasets(registry, datastore)
# TODO: associate parent repo datasets with child repo Collections
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still open? Seems reasonably essential.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm afraid it is, and I remember it being hard but don't remember exactly why. We can work around it with a custom call to query in the ci_hsc drivers for the conversion for now.

skyMapName = self.skyMapNames.get(sha1, None)
try:
existing, = registry.query("SELECT skymap FROM SkyMap WHERE sha1=:sha1",
sha1=sha1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we probably don't want to rely on query too much, when other API calls should exist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree. I've been meaning to talk to you about how to add alternate unique lookup APIs for some DataUnits (SkyMap sha1, Patch cell_x and cell_y, Sensor name, probably some Camera-specific Visit or Exposure identifiers).

"and does not already exist in the Registry.").format(sha1.hex())
)
log.info("Inserting SkyMap '%s' with sha1=%s", skyMapName, sha1.hex())
skyMap.register(skyMapName, registry)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the same skymap to have different names? Otherwise I don't really get why the call can't be skyMap.register(registry).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or directly addDataUnitEntry(...).

"boresight_ra": visitInfo.getBoresightRaDec().getLongitude().asDegrees(),
"boresight_dec": visitInfo.getBoresightRaDec().getLatitude().asDegrees(),
"boresight_parallactic_angle": visitInfo.getBoresightParAngle().asDegrees(),
"local_era": visitInfo.getLocalEra().asDegrees(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we loose information with respect to visitInfo here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I don't want to think too hard about this until we've gone through the schema review.

log = Log.getLogger("lsst.daf.butler.gen2convert")
for datasetType in self.datasetTypes.values():
# TODO: should put this "just make sure it exists" logic
# into registerDatasetType itself, and make it a bit more careful.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely!

Copy link
Member Author

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed all review comments. I'll squash and merge by Monday.

As for tests, I don't think it's a good use of time to try to cook up the kind of test data we'd need for small scale unit testing of this code in daf_butler. Instead, I'm planning to add running the conversion to ci_hsc's SConstruct, and add checks in that package that we can access the datasets using the Gen3 butler and that this is the same as Gen2 access. We need at least DM-14225 before that's worthwhile, though, and may want to add a few more concrete StorageClasses/Formatters first as well.

afw.image.VisitInfo provides a good initial guess at
what metadata we'll want for Visits and Exposures.
We don't really have an entity responsible for creating
AbstractFilters, since they're not a associated with a
Camera or a SkyMap.  It's probably best to just let them
be "created" when first referenced, especially since we
don't actually need to store any metadata associated with
them.
Having addDataset call associate leads to "locked database"
errors, apparently because the transaction context managers
aren't nesting the way SQLAlchemy's documentation claims
they should.  This should be investigated further, but this
workaround lets us move forward for now.
@TallJimbo TallJimbo merged commit 13d5374 into master Apr 30, 2018
@ktlim ktlim deleted the tickets/DM-12635 branch August 25, 2018 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants