New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-29543: Gen3 refcat converter #255
Conversation
Factor out butler.get/butler.put calls, so that we can use the rest of the refcat ingest code for both the gen2/gen3 versions. Replace `createIndexedCatalog` with `run`. Name change ingest->convert, as the gen3 code only does conversion, not ingestion. Make refcat convert manager a config option: * This eliminates the need to have a separate IngestGaia Task and lets us write configs for gen3. * Rename ingestIndexManager->convertRefcatManager to better reflect what it is for (it is a purely internal class). Cleanup/refactor convert manager to output better docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine overall, but see comments below.
My biggest comment is a separate use case that might be out of scope for this ticket but want to make sure this is compatible... what to do if I have either (a) a big in memory catalog, or (b) an even bigger catalog that is a bunch of non-csv files (regular fits, afw catalogs, parquet, etc) that I would like to turn into a refcat. My current reading of the docs/code is that I have to write my own sharding code, schema-writing code, and an ingest table file. In which case the schema and ingest table code should be made into static or free functions, or there should be a way to get into this code without having csv files. (Some of this might be misreading the code; it's a bit late but I would have suggested abandoning the gen2 code rather than refactoring it even at the expense of code duplication).
|
||
|
||
class IngestIndexedReferenceTask(ConvertReferenceCatalogBase, pipeBase.CmdLineTask): | ||
"""Class for producing and loading indexed reference catalogs (gen2 version). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you have to refactor all this gen2 code, rather than just shunting it off to its gen2 penalty box?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring it was very helpful for me in finding all the places (actually only a few!) that had gen3 butler interactions, and turning them into simple file operations. Once gen2 really is gone, we can think about refactoring to merge the base class back into here.
Although, thinking about your comment above about converting in-memory catalogs, maybe the baseclass would be useful for building something to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be useful! I want to make sure that this is put together in a way that the key parts are accessible without running the full task.
Your suggestion about converting/sharding an in-memory catalog is a good one for another ticket: I think we'd want to carefully consider what such an input catalog would look like, and whether the code that produces it (I'm guessing fgcmcal, in this case?) can get the in-memory representation right, so it only has to be sharded, not have fields converted. If you have a parquet table, you'd have to write a We have a |
3f8b0e1
to
4b350bf
Compare
Because this is now a butler collection name, it should always be explicitly set by the user: however, changing that now breaks refcat loading in gen2, so just mark it with the ticket number for now.
Add tests of cmdline arguments and mock files. Add convert script to gitignore.
test_htmIndex is all gen2, nopytest_ingestIndexReferenceCatalog is all gen3 * Move gen2-specific test setup to gen2-only tests. * Add gen3 butler creation method. * Change ingest->convert where relevant. * Write working gen3 parallel convert test. * Add output to ConvertManager for output filename tracking.
These tests and the associated file are irrelevant for gen3, and did not provide anything new over existing gen2 tests.
The version cannot be None, and must be a number, as it is compared with an integer. The default if there is no value must be `0`, to match the config default.
The check script only checked that sources existed, it didn't test any of the relevant conversion values (which is much harder), and it would be difficult to update it for gen3. I'm removing it because it would probably just confuse people and provide a false sense of security were it updated to gen3.
4b350bf
to
e938d97
Compare
No description provided.