DM-26082: Persist source-to-external reference matched catalogs in pipe_analysis to parquet #293
065f786
to
0ba124f
Compare
Good, except for the term "denormalized".
policy/datasets.yaml
Outdated
@@ -1333,6 +1333,14 @@ analysisVisitTable_commonZp:
storage: ParquetStorage
python: lsst.pipe.tasks.parquetTable.ParquetTable
template: plots/%(filter)s/tract-%(tract)d/visit-%(visit)d%(subdir)s/%(tract)d_%(visit)d_commonZp.parq
analysisMatchRefVisitTable:
description: >
Per-visit table (for specific tract) of matched and denormalized
I'm not sure what "denormalized" means.
I appreciate your confusion, as I had the same reaction when I first encountered this term in the stack. I inherited/adopted it since it is fairly widely used, e.g. https://github.com/search?q=org%3Alsst+denormalized&type=Code
(Note, in particular, meas_astrom/python/lsst/meas/astrom/denormalizeMatches.py & https://github.com/lsst/obs_base/blob/master/policy/datasets.yaml#L676-L684, which may actually suggest I should include Full in the name 😉).
The crux of it is that a “normalized” catalog only contains data IDs (and maybe a coord) to minimize space when persisting. I’m explicitly referring to these as “denormalized” to indicate that they include the full set of catalog info for both the src & ref cats. If this adds more confusion than help, I’m happy to leave out that term (or use a more self-explanatory one if you have a suggestion!).
Can you replace this with "denormalized (contains all columns from XX)"?
Of course. And looking at the precedent set above, I am leaning strongly towards adding "Full" to the dataset names, i.e.:
analysisMatchFullRefVisitTable
analysisMatchFullRefCoaddTable_forced
analysisMatchFullRefCoaddTable_unforced
Do you agree?
policy/datasets.yaml
Outdated
@@ -1347,6 +1355,22 @@ analysisCoaddTable_unforced:
storage: ParquetStorage
python: lsst.pipe.tasks.parquetTable.ParquetTable
template: plots/%(filter)s/tract-%(tract)d%(subdir)s/%(tract)d_unforced.parq
analysisMatchRefCoaddTable_forced:
description: >
Per-tract table of matched and denormalized source-to-external reference
Ditto.
policy/datasets.yaml
Outdated
template: plots/%(filter)s/tract-%(tract)d%(subdir)s/%(tract)d_matchRef_forced.parq
analysisMatchRefCoaddTable_unforced:
description: >
Per-tract table of matched and denormalized source-to-external reference
Here too.
0ba124f
to
b678b4e
Compare
The scripts in pipe_analysis perform a matching between sources and external reference catalogs (those used in the calibration stages) to create comparison plots. This adds datasets to persist those matched catalogs at the tract/visit level (although any sub-selection of patch/ccd can be made for a given run) to parquet tables for use in further QA/validation pursuits. The persisted tables are "denormalized", i.e. contain all fields from the original source and external catalogs (but with "src_" and "ref_" prefixes on the column names).
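To make the "denormalized" layout concrete, here is a minimal, dependency-free sketch of expanding an ID-only match list into rows carrying every field from both catalogs, with "src_" and "ref_" prefixes. The toy catalogs, their column names, and the denormalize helper are all hypothetical illustrations; this is not the code in pipe_analysis or meas_astrom.

```python
# Hypothetical toy catalogs keyed by object ID; the column names are
# illustrative only and do not reflect the real source/reference schemas.
src_cat = {
    101: {"id": 101, "ra": 10.001, "dec": -5.002, "psfFlux": 1234.5},
    102: {"id": 102, "ra": 10.020, "dec": -5.030, "psfFlux": 987.6},
}
ref_cat = {
    9001: {"id": 9001, "ra": 10.000, "dec": -5.000, "r_flux": 1200.0},
    9002: {"id": 9002, "ra": 10.021, "dec": -5.031, "r_flux": 1000.0},
}

# A "normalized" match list stores only the IDs of each pair (plus a match
# distance), which keeps the persisted catalog small.
normalized = [
    {"ref_id": 9001, "src_id": 101, "distance": 0.5},
    {"ref_id": 9002, "src_id": 102, "distance": 0.3},
]


def denormalize(matches, ref_cat, src_cat):
    """Expand ID-only matches into rows holding all fields of both catalogs."""
    rows = []
    for match in matches:
        row = {"distance": match["distance"]}
        # Prefix every column so reference and source fields cannot collide.
        row.update({"ref_" + k: v for k, v in ref_cat[match["ref_id"]].items()})
        row.update({"src_" + k: v for k, v in src_cat[match["src_id"]].items()})
        rows.append(row)
    return rows


denormalized = denormalize(normalized, ref_cat, src_cat)
for row in denormalized:
    print(row)
```

Each resulting row is self-contained, so a downstream QA script can compare, say, src_ra against ref_ra without reloading either input catalog.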
b678b4e
to
7ef4920
Compare
The scripts in pipe_analysis perform a matching between sources and
external reference catalogs (those used in the calibration stages) to
create comparison plots. This adds datasets to persist those matched
catalogs at the tract/visit level (although any sub-selection of
patch/ccd can be made for a given run) to parquet tables for use in
further QA/validation pursuits.
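
For readers unfamiliar with the path templates in policy/datasets.yaml, the following sketch shows how one of the new templates would resolve to a file path, assuming the templates are expanded with Python's %-style formatting against a data ID dictionary. The template string is copied from the analysisMatchRefCoaddTable_forced entry in the diff; the data ID values are hypothetical.

```python
# Template copied from the analysisMatchRefCoaddTable_forced entry in the diff.
template = "plots/%(filter)s/tract-%(tract)d%(subdir)s/%(tract)d_matchRef_forced.parq"

# Hypothetical data IDs; the filter, tract, and subdir values are illustrative.
data_id = {"filter": "HSC-I", "tract": 9813, "subdir": ""}
data_id_with_subdir = {"filter": "HSC-I", "tract": 9813, "subdir": "/test"}

# %-formatting with a mapping fills each %(key)s / %(key)d placeholder.
path = template % data_id
path_with_subdir = template % data_id_with_subdir

print(path)              # plots/HSC-I/tract-9813/9813_matchRef_forced.parq
print(path_with_subdir)  # plots/HSC-I/tract-9813/test/9813_matchRef_forced.parq
```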