Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-29543: Gen3 refcat converter #255

Merged
merged 12 commits into from Sep 17, 2021
Merged

DM-29543: Gen3 refcat converter #255

merged 12 commits into from Sep 17, 2021

Conversation

parejkoj
Copy link
Contributor

@parejkoj parejkoj commented Sep 9, 2021

No description provided.

Factor out butler.get/butler.put calls, so that we can use the rest of
the refcat ingest code for both the gen2/gen3 versions.

Replace `createIndexedCatalog` with `run`.

Name change ingest->convert, as the gen3 code only does conversion, not
ingestion.

Make refcat convert manager a config option:
* This eliminates the need to have a separate IngestGaia Task and lets us
write configs for gen3.
* Rename ingestIndexManager->convertRefcatManager to better reflect what
it is for (it is a purely internal class).

Cleanup/refactor convert manager to output better docs.
Copy link
Contributor

@erykoff erykoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine overall, but see comments below.

My biggest comment is a separate use case that might be out of scope for this ticket but want to make sure this is compatible... what to do if I have either (a) a big in memory catalog, or (b) an even bigger catalog that is a bunch of non-csv files (regular fits, afw catalogs, parquet, etc) that I would like to turn into a refcat. My current reading of the docs/code is that I have to write my own sharding code, schema-writing code, and an ingest table file. In which case the schema and ingest table code should be made into static or free functions, or there should be a way to get into this code without having csv files. (Some of this might be misreading the code; it's a bit late but I would have suggested abandoning the gen2 code rather than refactoring it even at the expense of code duplication).

doc/lsst.meas.algorithms/creating-a-reference-catalog.rst Outdated Show resolved Hide resolved
doc/lsst.meas.algorithms/creating-a-reference-catalog.rst Outdated Show resolved Hide resolved
doc/lsst.meas.algorithms/creating-a-reference-catalog.rst Outdated Show resolved Hide resolved
python/lsst/meas/algorithms/ingestIndexReferenceTask.py Outdated Show resolved Hide resolved


class IngestIndexedReferenceTask(ConvertReferenceCatalogBase, pipeBase.CmdLineTask):
"""Class for producing and loading indexed reference catalogs (gen2 version).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have to refactor all this gen2 code, rather than just shunting it off to its gen2 penalty box?

Copy link
Contributor Author

@parejkoj parejkoj Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring it was very helpful for me in finding all the places (actually only a few!) that had gen3 butler interactions, and turning them into simple file operations. Once gen2 really is gone, we can think about refactoring to merge the base class back into here.

Although, thinking about your comment above about converting in-memory catalogs, maybe the baseclass would be useful for building something to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be useful! I want to make sure that this is put together in a way that the key parts are accessible without running the full task.

@parejkoj
Copy link
Contributor Author

Your suggestion about converting/sharding an in-memory catalog is a good one for another ticket: I think we'd want to carefully consider what such an input catalog would look like, and whether the code that produces it (I'm guessing fgcmcal, in this case?) can get the in-memory representation right, so it only has to be sharded, not have fields converted.

If you have a parquet table, you'd have to write a readParquetCatalogTask, which would be pretty trivial, I think.

We have a readFitsCatalogTask (I updated the docs with a link to it), which was made to work here. It would also work for converting an afw catalog; I don't know whether that's a completely trivial thing (requiring only sharding), given differing column naming conventions in SourceCatalogs vs. refcats.

Because this is now a butler collection name, it should always be
explicitly set by the user: however, changing that now breaks refcat
loading in gen2, so just mark it with the ticket number for now.
Add tests of cmdline arguments and mock files.
Add convert script to gitignore.
test_htmIndex is all gen2, nopytest_ingestIndexReferenceCatalog is all gen3

* Move gen2-specific test setup to gen2-only tests.
* Add gen3 butler creation method.
* Change ingest->convert where relevant.
* Write working gen3 parallel convert test.
* Add output to ConvertManager for output filename tracking.
These tests and the associated file are irrelevant for gen3, and did not
provide anything new over existing gen2 tests.
The version cannot be None, and must be a number, as it is compared with
an integer. The default if there is no value must be `0`, to match the
config default.
The check script only checked that sources existed, it didn't test any of the
relevant conversion values (which is much harder), and it would be difficult
to update it for gen3. I'm removing it because it would probably just confuse
people and provide a false sense of security were it updated to gen3.
@parejkoj parejkoj merged commit 608e23a into master Sep 17, 2021
@parejkoj parejkoj deleted the tickets/DM-29543 branch September 17, 2021 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants