DM-29543: Gen3 refcat converter #255

parejkoj · 2021-09-09T22:39:22Z

No description provided.

Factor out butler.get/butler.put calls, so that we can use the rest of the refcat ingest code for both the gen2/gen3 versions.

Replace `createIndexedCatalog` with `run`.

Name change ingest->convert, as the gen3 code only does conversion, not ingestion.

Make refcat convert manager a config option:
* This eliminates the need to have a separate IngestGaia Task and lets us write configs for gen3.
* Rename ingestIndexManager->convertRefcatManager to better reflect what it is for (it is a purely internal class).

Cleanup/refactor convert manager to output better docs.
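As a rough illustration of the "factor out the butler calls" idea described above, here is a minimal sketch in which the shared conversion logic lives in a base class and only the storage step is overridden. Every name here (`ConvertBaseSketch`, `FileConverterSketch`, `PutConverterSketch`, `_write_shard`, the `"shard"` key) is hypothetical and is not the API added in this PR.

```python
from abc import ABC, abstractmethod


class ConvertBaseSketch(ABC):
    """Hypothetical base class: run() holds the shared logic, and the only
    storage-specific step is the _write_shard() hook."""

    def run(self, rows):
        # Group the input rows by their (precomputed) shard id.
        shards = {}
        for row in rows:
            shards.setdefault(row["shard"], []).append(row)
        # Hand each group to the subclass-specific output step.
        for shard_id, shard_rows in sorted(shards.items()):
            self._write_shard(shard_id, shard_rows)
        return sorted(shards)

    @abstractmethod
    def _write_shard(self, shard_id, shard_rows):
        """Persist one shard; subclasses decide how."""


class FileConverterSketch(ConvertBaseSketch):
    """File-style output; a dict stands in for files on disk."""

    def __init__(self):
        self.written = {}

    def _write_shard(self, shard_id, shard_rows):
        self.written[f"{shard_id}.fits"] = len(shard_rows)


class PutConverterSketch(ConvertBaseSketch):
    """Butler-style output; `put` is any callable standing in for butler.put."""

    def __init__(self, put):
        self._put = put

    def _write_shard(self, shard_id, shard_rows):
        self._put(shard_rows, shard_id)
```

With this shape, the two variants differ only in the one overridden method, which is the property the refactor is after.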

erykoff

Looks fine overall, but see comments below.

My biggest comment is about a separate use case that might be out of scope for this ticket, but I want to make sure this is compatible with it: what to do if I have either (a) a big in-memory catalog, or (b) an even bigger catalog stored as a bunch of non-csv files (regular fits, afw catalogs, parquet, etc.) that I would like to turn into a refcat. My current reading of the docs/code is that I would have to write my own sharding code, schema-writing code, and an ingest table file. In that case, the schema and ingest table code should be made into static or free functions, or there should be a way to get into this code without having csv files. (Some of this might be me misreading the code; it's a bit late, but I would have suggested abandoning the gen2 code rather than refactoring it, even at the expense of code duplication.)

doc/lsst.meas.algorithms/creating-a-reference-catalog.rst

python/lsst/meas/algorithms/convertReferenceCatalog.py

python/lsst/meas/algorithms/ingestIndexReferenceTask.py

erykoff · 2021-09-13T22:30:08Z

python/lsst/meas/algorithms/ingestIndexReferenceTask.py

+
+
+class IngestIndexedReferenceTask(ConvertReferenceCatalogBase, pipeBase.CmdLineTask):
+    """Class for producing and loading indexed reference catalogs (gen2 version).


Did you have to refactor all this gen2 code, rather than just shunting it off to its gen2 penalty box?

parejkoj · 2021-09-15 (edited)

Refactoring it was very helpful for me in finding all the places (actually only a few!) that had gen3 butler interactions, and turning them into simple file operations. Once gen2 really is gone, we can think about refactoring to merge the base class back into here.

Although, thinking about your comment above about converting in-memory catalogs, maybe the baseclass would be useful for building something to do that.

erykoff · 2021-09-16

That would be useful! I want to make sure that this is put together in a way that the key parts are accessible without running the full task.

doc/lsst.meas.algorithms/creating-a-reference-catalog.rst

parejkoj · 2021-09-15T23:02:57Z

Your suggestion about converting/sharding an in-memory catalog is a good one for another ticket: I think we'd want to carefully consider what such an input catalog would look like, and whether the code that produces it (I'm guessing fgcmcal, in this case?) can get the in-memory representation right, so it only has to be sharded, not have fields converted.

If you have a parquet table, you'd have to write a readParquetCatalogTask, which would be pretty trivial, I think.

We have a readFitsCatalogTask (I updated the docs with a link to it), which was made to work here. It would also work for converting an afw catalog; I don't know whether that's a completely trivial thing (requiring only sharding), given differing column naming conventions in SourceCatalogs vs. refcats.
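For the in-memory case discussed above, where the catalog "only has to be sharded", the sharding step on its own could look something like the sketch below. This is an assumption-laden illustration: `shard_catalog` and `toy_index` are hypothetical names, the records are plain dicts, and `toy_index` is a coarse RA/Dec grid used as a placeholder, not the HTM indexing the refcat code uses.

```python
from collections import defaultdict


def toy_index(ra_deg, dec_deg, cell_deg=10.0):
    """Placeholder spatial index: integer id of a cell in a coarse RA/Dec grid.

    This is NOT HTM; it only stands in for "some function from position to
    shard id" so that the example is self-contained.
    """
    n_ra = int(360 / cell_deg)
    i = int((ra_deg % 360.0) // cell_deg)
    j = int((dec_deg + 90.0) // cell_deg)
    return j * n_ra + i


def shard_catalog(records, index_of):
    """Group in-memory records by shard id, preserving input order per shard."""
    shards = defaultdict(list)
    for rec in records:
        shards[index_of(rec["ra"], rec["dec"])].append(rec)
    return dict(shards)


catalog = [
    {"id": 1, "ra": 5.0, "dec": -85.0},
    {"id": 2, "ra": 15.0, "dec": -85.0},
    {"id": 3, "ra": 5.5, "dec": -89.0},
]
shards = shard_catalog(catalog, toy_index)
```

Passing the index function in as an argument keeps the grouping logic independent of the pixelization, so a real indexer could be swapped in without touching `shard_catalog`.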

test_htmIndex is all gen2, nopytest_ingestIndexReferenceCatalog is all gen3
* Move gen2-specific test setup to gen2-only tests.
* Add gen3 butler creation method.
* Change ingest->convert where relevant.
* Write working gen3 parallel convert test.
* Add output to ConvertManager for output filename tracking.

The check script only checked that sources existed; it didn't test any of the relevant conversion values (which is much harder), and it would be difficult to update it for gen3. I'm removing it because it would probably just confuse people and provide a false sense of security were it updated to gen3.

parejkoj added 3 commits September 2, 2021 18:38

Update license file

1fa0d42

Default htm depth should be 7 everywhere for gen3 compatibility

769284e

erykoff requested changes Sep 14, 2021

parejkoj force-pushed the tickets/DM-29543 branch from 3f8b0e1 to 4b350bf Compare September 17, 2021 01:27

parejkoj added 9 commits September 16, 2021 19:06

Mark ref_dataset_name with TODO

879e45e

Because this is now a butler collection name, it should always be explicitly set by the user: however, changing that now breaks refcat loading in gen2, so just mark it with the ticket number for now.

Add gen3 refcat converter with cmdline scripting

63d2afa

Add tests of cmdline arguments and mock files. Add convert script to gitignore.

Update refcat creation docs to reflect new workflow

a57539a

Cleanup unnecessary tests

00dd1db

These tests and the associated file are irrelevant for gen3, and did not provide anything new over existing gen2 tests.

Fix bug in refcat version guesser

4c78c6a

The version cannot be None, and must be a number, as it is compared with an integer. The default if there is no value must be `0`, to match the config default.
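
The constraint in this commit message (never `None`, always a number, default `0`) can be illustrated with a small sketch. The function name `guess_format_version`, the `FORMAT_VERSION` key, and the use of a plain dict for metadata are assumptions made for illustration, not the code changed in this commit.

```python
def guess_format_version(metadata, key="FORMAT_VERSION"):
    """Return an integer format version, falling back to 0.

    `metadata` is a plain dict (or None) in this sketch. The result is always
    an int, so it can safely be compared with an integer.
    """
    if metadata is None:
        return 0
    value = metadata.get(key)
    # A missing or None entry means "no recorded version": use the default 0.
    return 0 if value is None else int(value)


# Comparisons with integers are safe for every kind of input.
needs_new_format = guess_format_version({}) >= 1
```

Returning `None` instead would make a comparison such as `version >= 1` raise a `TypeError` in Python 3, which is the kind of failure the default of `0` avoids.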

remove docs for removed Gaia task

0207999

Add read catalog tasks to __init__.py for documentation

e938d97

parejkoj force-pushed the tickets/DM-29543 branch from 4b350bf to e938d97 Compare September 17, 2021 02:07

erykoff approved these changes Sep 17, 2021

parejkoj merged commit 608e23a into master Sep 17, 2021

parejkoj deleted the tickets/DM-29543 branch September 17, 2021 17:51
