DM-26082: Persist source-to-external reference matched catalogs in pipe_analysis to parquet #293

Merged
merged 1 commit into from Sep 9, 2020

laurenam
Contributor

The scripts in pipe_analysis perform a matching between sources and
external reference catalogs (those used in the calibration stages) to
create comparison plots. This adds datasets to persist those matched
catalogs at the tract/visit level (although any sub-selection of
patch/ccd can be made for a given run) to parquet tables for use in
further QA/validation pursuits.

Contributor

@erykoff erykoff left a comment

Good, except for the term "denormalized".

@@ -1333,6 +1333,14 @@ analysisVisitTable_commonZp:
storage: ParquetStorage
python: lsst.pipe.tasks.parquetTable.ParquetTable
template: plots/%(filter)s/tract-%(tract)d/visit-%(visit)d%(subdir)s/%(tract)d_%(visit)d_commonZp.parq
analysisMatchRefVisitTable:
description: >
Per-visit table (for specific tract) of matched and denormalized
Contributor

I'm not sure what "denormalized" means.

Contributor Author

I appreciate your confusion as I had the same when I first encountered this term in the stack. I inherited/adopted it since it is fairly widely used, e.g. https://github.com/search?q=org%3Alsst+denormalized&type=Code
(Note, in particular, meas_astrom/python/lsst/meas/astrom/denormalizeMatches.py and https://github.com/lsst/obs_base/blob/master/policy/datasets.yaml#L676-L684, which may actually suggest I should include "Full" in the name 😉.)
The crux of it is that a “normalized” catalog contains only data IDs (and maybe a coord) to minimize space when persisting. I’m explicitly referring to these as “denormalized” to indicate that they include the full set of catalog info for both the src and ref cats. If this adds more confusion than help, I’m happy to leave out that term (or use a more self-explanatory one if you have a suggestion!).

Contributor

Can you replace this with denormalized (contains all columns from XX)?

Contributor Author

Of course. And looking at the precedent set above, I am leaning strongly towards adding "Full" to the dataset names, i.e.

analysisMatchFullRefVisitTable
analysisMatchFullRefCoaddTable_forced
analysisMatchFullRefCoaddTable_unforced

Do you agree?

@@ -1347,6 +1355,22 @@ analysisCoaddTable_unforced:
storage: ParquetStorage
python: lsst.pipe.tasks.parquetTable.ParquetTable
template: plots/%(filter)s/tract-%(tract)d%(subdir)s/%(tract)d_unforced.parq
analysisMatchRefCoaddTable_forced:
description: >
Per-tract table of matched and denormalized source-to-external reference
Contributor

Ditto.

template: plots/%(filter)s/tract-%(tract)d%(subdir)s/%(tract)d_matchRef_forced.parq
analysisMatchRefCoaddTable_unforced:
description: >
Per-tract table of matched and denormalized source-to-external reference
Contributor

Here too.
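
For readers unfamiliar with the `template` entries in the hunks above: they use Python %-style named placeholders. A minimal sketch of how one such template could expand into a file path, assuming plain %-formatting against a dictionary of data ID values (the values below are hypothetical, and the butler may do more than this internally):

```python
# Template copied from the analysisMatchRefCoaddTable_forced entry in the diff.
template = (
    "plots/%(filter)s/tract-%(tract)d%(subdir)s/"
    "%(tract)d_matchRef_forced.parq"
)

# Hypothetical data ID values, chosen only for illustration.
data_id = {"filter": "HSC-I", "tract": 9813, "subdir": ""}

# Plain %-formatting with a mapping fills in each named placeholder.
path = template % data_id
print(path)
```

With an empty `subdir`, the tract directory is followed directly by the file name; a non-empty value (e.g. a rerun-specific subdirectory) would be spliced in between.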

The scripts in pipe_analysis perform a matching between sources and
external reference catalogs (those used in the calibration stages) to
create comparison plots.  This adds datasets to persist those matched
catalogs at the tract/visit level (although any sub-selection of
patch/ccd can be made for a given run) to parquet tables for use in
further QA/validation pursuits.

The persisted tables are "denormalized", i.e. contain all fields from
the original source and external catalogs (but with "src_" and "ref_"
prefixes on the column names).
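
To illustrate the distinction, here is a minimal pure-Python sketch of turning a "normalized" match list (IDs only) into "denormalized" rows carrying every field from both catalogs with "src_" and "ref_" prefixes. The toy catalogs, field names, and the `denormalize` helper are hypothetical and are not the pipe_analysis or meas_astrom implementation:

```python
# Hypothetical toy catalogs keyed by ID; the field names are illustrative only.
src_cat = {
    101: {"id": 101, "ra": 10.0010, "dec": -5.0020, "flux": 1520.0},
    102: {"id": 102, "ra": 10.0500, "dec": -5.0100, "flux": 310.0},
}
ref_cat = {
    9001: {"id": 9001, "ra": 10.0011, "dec": -5.0021, "mag": 18.2},
    9002: {"id": 9002, "ra": 10.0502, "dec": -5.0099, "mag": 20.1},
}

# "Normalized" matches: only the IDs of each pair plus the match distance.
normalized_matches = [
    {"src_id": 101, "ref_id": 9001, "distance": 0.5},
    {"src_id": 102, "ref_id": 9002, "distance": 0.8},
]


def denormalize(matches, src_cat, ref_cat):
    """Expand ID-only matches into rows holding all src and ref fields."""
    rows = []
    for match in matches:
        row = {"distance": match["distance"]}
        # Copy every source field, prefixed with "src_".
        for name, value in src_cat[match["src_id"]].items():
            row["src_" + name] = value
        # Copy every reference field, prefixed with "ref_".
        for name, value in ref_cat[match["ref_id"]].items():
            row["ref_" + name] = value
        rows.append(row)
    return rows


denormalized = denormalize(normalized_matches, src_cat, ref_cat)
print(sorted(denormalized[0]))
```

The denormalized rows are larger on disk, but each one can be used for QA plots without going back to either input catalog.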