DM-12635: Subpackage for converting Gen2->Gen3. #30
Conversation
Important note: the datastore doesn't yet write the formatter information anywhere permanent. It uses a shim at the moment that passes the tests but forgets everything as soon as the process ends. We need a more permanent fix there.
Good to know. I think we'd get a lot of the benefit from just getting this to write to
@@ -508,6 +510,7 @@ def addRun(self, run):
collection=run.collection,
environment_id=None,  # TODO add environment
pipeline_id=None))  # TODO add pipeline
# TODO: set given Run's 'id' attribute.
No, this is already done in the `Execution` part.
Force-pushed from 81e6489 to 9442e67
Force-pushed from 72722ff to b6d9c68
This is now as far along as I think makes sense on this ticket. Next steps will be making
Force-pushed from 812c5bd to b6d9c68
I'm assuming we can't really unittest any of this?
@@ -135,6 +135,7 @@ def addDataset(self, ref, uri, components, run, producer=None):
If a Dataset with the given `DatasetRef` already exists in the
given Collection.
"""
# TODO: this signature no longer agrees with SqlRegistry.addDataset.
One of many I'm afraid. I'm planning to copy them all over after things have settled a bit.
Yes, this TODO has already been removed.
from .structures import Gen2Dataset

TEMPLATE_RE = re.compile(r'\%\((?P<name>\w+)\).*?(?P<type>[idrs])')
Would be nice to add an example comment for what this matches.
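Agreed. For the record, here is roughly what that pattern matches, as a standalone sketch (the template string below is made up for illustration, not taken from an actual Gen2 policy file):

```python
import re

# Same pattern as in the diff: pulls the field name and the conversion-type
# character out of Gen2 filename-template substitutions such as "%(visit)07d".
TEMPLATE_RE = re.compile(r'\%\((?P<name>\w+)\).*?(?P<type>[idrs])')

# Hypothetical Gen2-style path template.
template = "%(field)s/%(filter)s/calexp-%(visit)07d-%(ccd)03d.fits"

# Each match captures the substitution name and its type character,
# skipping any width/zero-padding flags in between (e.g. the "07" in "07d").
matches = [(m.group("name"), m.group("type"))
           for m in TEMPLATE_RE.finditer(template)]
print(matches)
# -> [('field', 's'), ('filter', 's'), ('visit', 'd'), ('ccd', 'd')]
```

The lazy `.*?` is what lets the pattern tolerate padding flags between the closing parenthesis and the type character.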
self.insertSkyMaps(registry)
self.insertObservations(registry)
self.insertDatasets(registry, datastore)
# TODO: associate parent repo datasets with child repo Collections
Is this `Registry.merge`? Or just `Registry.associate`?
# entries; options here say how to get those from a Gen2 repo.
VisitInfo:
# The Gen2 DatasetType to read when trying to create a VisitInfo.
# (we actually add a "_md" suffix, because we just read the metadata).
Something we don't (yet) have in Gen3. Unless you mean `findDataUnitEntry`-type things here.
I was thinking we'd handle the metadata as another `Exposure` component dataset.
# its Run; this will be the one corresponding to the Gen2 repository
# that originally contained the file, unless that has been overridden.
raw/HSC: 1
ref/ps1_pv3_3pi_20170110: 2
We may want to allow for a `None` default that will generate a new `Run`.
That default is already in place; if nothing here matches, we'll make a new `Run` and use the repo root as its `Collection` name.
self.insertObservations(registry)
self.insertDatasetTypes(registry)
self.insertDatasets(registry, datastore)
# TODO: associate parent repo datasets with child repo Collections
Is this still open? Seems reasonably essential.
Yes, I'm afraid it is, and I remember it being hard but don't remember exactly why. We can work around it with a custom call to `query` in the ci_hsc drivers for the conversion for now.
skyMapName = self.skyMapNames.get(sha1, None)
try:
    existing, = registry.query("SELECT skymap FROM SkyMap WHERE sha1=:sha1",
                               sha1=sha1)
In general we probably don't want to rely on `query` too much when other API calls should exist.
I completely agree. I've been meaning to talk to you about how to add alternate unique lookup APIs for some DataUnits (SkyMap `sha1`, Patch `cell_x` and `cell_y`, Sensor `name`, probably some Camera-specific `Visit` or `Exposure` identifiers).
"and does not already exist in the Registry.").format(sha1.hex()) | ||
) | ||
log.info("Inserting SkyMap '%s' with sha1=%s", skyMapName, sha1.hex()) | ||
skyMap.register(skyMapName, registry) |
Do we want the same skymap to have different names? Otherwise I don't really get why the call can't be `skyMap.register(registry)`.
Or directly `addDataUnitEntry(...)`.
"boresight_ra": visitInfo.getBoresightRaDec().getLongitude().asDegrees(), | ||
"boresight_dec": visitInfo.getBoresightRaDec().getLatitude().asDegrees(), | ||
"boresight_parallactic_angle": visitInfo.getBoresightParAngle().asDegrees(), | ||
"local_era": visitInfo.getLocalEra().asDegrees(), |
Do we lose information with respect to `visitInfo` here?
Possibly, but I don't want to think too hard about this until we've gone through the schema review.
log = Log.getLogger("lsst.daf.butler.gen2convert")
for datasetType in self.datasetTypes.values():
    # TODO: should put this "just make sure it exists" logic
    # into registerDatasetType itself, and make it a bit more careful.
Definitely!
I think I've addressed all review comments. I'll squash and merge by Monday.
As for tests, I don't think it's a good use of time to try to cook up the kind of test data we'd need for small-scale unit testing of this code in daf_butler. Instead, I'm planning to add running the conversion to ci_hsc's SConstruct, and add checks in that package that we can access the datasets using the Gen3 butler and that the results are the same as with Gen2 access. We need at least DM-14225 before that's worthwhile, though, and may want to add a few more concrete StorageClasses/Formatters first as well.
afw.image.VisitInfo provides a good initial guess at what metadata we'll want for Visits and Exposures.
We don't really have an entity responsible for creating AbstractFilters, since they're not associated with a Camera or a SkyMap. It's probably best to just let them be "created" when first referenced, especially since we don't actually need to store any metadata associated with them.
Having `addDataset` call `associate` leads to "locked database" errors, apparently because the transaction context managers aren't nesting the way SQLAlchemy's documentation claims they should. This should be investigated further, but this workaround lets us move forward for now.
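For reference, the nesting SQLAlchemy is supposed to provide maps onto SQLite SAVEPOINTs. Here is a minimal sketch of the intended behaviour using plain `sqlite3` (not the actual daf_butler code; table and savepoint names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions manually
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("BEGIN")                    # outer transaction (think addDataset)
conn.execute("INSERT INTO dataset (name) VALUES ('calexp')")
conn.execute("SAVEPOINT assoc")          # nested "transaction" (think associate)
conn.execute("INSERT INTO dataset (name) VALUES ('calexp-assoc')")
conn.execute("RELEASE SAVEPOINT assoc")  # finish the nested part
conn.execute("COMMIT")                   # commit the outer transaction

count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
print(count)  # -> 2
```

If the nested level instead opened a second full `BEGIN` on the same database, SQLite would report the database as locked, which is consistent with the symptom described above.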
Force-pushed from d7bfb5d to 26e7d6f
The conversion code here has two high-level drivers:

- The `ConversionWalker` class (`walker.py`) extracts everything we'll need for the conversion from a Gen2 repo. It's basically complete, and I've done some one-off testing of it on `ci_hsc` (check out `ci_hsc`, setup and build, and start by calling `ConversionWalker.tryRoot` on `$CI_HSC_DIR/DATA`).
- The `ConversionWriter` class (`writer.py`) adds entries to a `Registry` and `Datastore`; or it would, if all of the actual calls to do so weren't just "TODO" comment placeholders.

As for those TODO placeholders, there are a ton of them, and I think they cover most of what needs to be done to get things working. Beyond that, we'll also eventually need a high-level driver that can be run easily from the command line, but that doesn't necessarily need to be done right now.