DM-24638: Convert TransformSourceTableTask and friends to Gen3 #426
Conversation
There are some updates for `physical_filter` replacing `filter` that need to be done, but otherwise this looks good. The default pipeline file should also be updated (this will be clear after a rebase).
```python
    df = parq.get(columns=self.columns)
elif isinstance(parq, pd.DataFrame):
    df = parq
else:
```
What would the `parq` instance be here? I think it's worth making this explicit and raising on `else`.
Meanwhile, does `__call__` need a docstring that explicitly says what `parq` can be? It supports many datatypes, but not every possible one, so it's worth spelling out which are accepted.
The `parq` instance is for the Gen3 case with `deferLoad=False`: the input datatypes are `pd.DataFrame`s after loading. The `parq.toDataFrame` path is the original (Gen2) one. I'll add inline comments.
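A minimal sketch of the explicit dispatch the review is asking for, using a hypothetical `load_dataframe` helper (the real task method and the Gen3 `DeferredDatasetHandle` type differ; this just illustrates raising on the `else` branch):

```python
import pandas as pd


def load_dataframe(parq, columns=None):
    """Hypothetical helper: make the accepted input types explicit."""
    if isinstance(parq, pd.DataFrame):
        # Gen3 with deferLoad=False: the butler has already loaded the table.
        return parq
    elif hasattr(parq, "get"):
        # Deferred-load handle (or the Gen2 parquet wrapper): load on demand.
        return parq.get(columns=columns)
    else:
        # Raise explicitly rather than falling through silently.
        raise TypeError(f"Unsupported input type for parq: {type(parq)!r}")
```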
```python
    storageClass="DataFrame",
    dimensions=("instrument", "visit", "detector")
)
```
So for DM-27164 (and this will need to be rebased, of course), I defined `from lsst.pipe.base import connectionTypes`, so you can use just `connectionTypes` here.
It's not obvious to me that `connectionTypes` is more readable than `pipeBase.connectionTypes` when there are other `pipeBase` objects abounding, but sure, I'll change them all.
I don't want to make work for you! But we don't have any standard convention on this, and that's unpleasant.
```python
    Must be subclassed.
    """
    inputCatalog = pipeBase.connectionTypes.Input(
        name="",
```
Same here for `connectionTypes`.
```python
def runQuantum(self, butlerQC, inputRefs, outputRefs):
    inputs = butlerQC.get(inputRefs)
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
                      dataId=outputRefs.outputCatalog.dataId)
```
I'm not sure if using dataId in this way is really kosher in Gen3. Specifically, `filter` is now `physical_filter`, so in this case I think this won't work correctly. I'm trying to follow how `dataId` gets used downstream, to see if there's anything else that needs to come from it; if it's just returning the dataId in the struct, that seems like a Gen2 necessity, not a Gen3 one (though I could be wrong).
What I would suggest is specifically pulling out `physical_filter` for Gen3 (and `filter` for Gen2) and having that as a separate keyword, which seems special and necessary. Then I think the `dataId` can be set only in the Gen2 code path. Though I may be missing one of the uses here.
Ah right, because the regular dataId depends on what you pass on the command line. OK, I changed it to `outputRefs.outputCatalog.dataId.full` so that at least the same keys are added to the table every time. Technically, all this info is redundant: everything will eventually be looked up with the `ccdVisitId`. I've been adding these columns because we don't yet have a corresponding `ccdVisit` table to look up the full CCD info given a `ccdVisitId`.
Passing an optional dataId for logging was kosher last I checked.
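The "same keys every time" point can be illustrated with a plain dict standing in for a Gen3 `DataCoordinate`; the helper name and key list here are illustrative only, not the real task code:

```python
import pandas as pd


def attach_dataid_columns(df, data_id,
                          keys=("physical_filter", "visit", "detector")):
    """Illustrative only: add selected dataId entries as constant columns."""
    for key in keys:
        if key in data_id:
            df[key] = data_id[key]
    return df


# A dict stands in for outputRefs.outputCatalog.dataId.full here.
full_id = {"instrument": "HSC", "physical_filter": "HSC-G",
           "visit": 1228, "detector": 42}
table = attach_dataid_columns(pd.DataFrame({"flux": [1.0, 2.0]}), full_id)
```

Because the fully expanded dataId always carries the same dimension keys, every output table gets the same set of identifier columns regardless of what was typed on the command line.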
```python
                             dimensions=("instrument", "visit", "detector")):

    inputCatalog = pipeBase.connectionTypes.Input(
        doc="Wide input catalog of sources produced by WriteSourceTableTask",
```
Same here about `connectionTypes`.
```python
class ConsolidateSourceTableConnections(pipeBase.PipelineTaskConnections,
                                        dimensions=("instrument", "visit")):
    inputCatalogs = pipeBase.connectionTypes.Input(
        doc="Input per-detector Source Tables",
```
And here.
Force-pushed 1c850e3 to 11bbb5b, then 11bbb5b to c5b67a9.
Sorry for the drive-by review; this just caught my eye while I was waiting for a batch job and looking at something else in my email. You might have this well handled already, just something to think about.
```python
def runQuantum(self, butlerQC, inputRefs, outputRefs):
    inputs = butlerQC.get(inputRefs)
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
```
If `self.config.functorFile` is `False`, does that mean `self.funcs` is never defined? If so, I think this line will raise an exception.
`self.config.functorFile` is optional in the sense that you should be able to instantiate the task in a notebook and run any functors you want (not necessarily from a yaml), and there's no default set of functors that makes sense for the base class.
`self.config.functorFile` is not optional when running the subclasses as a commandlineTask or pipelineTask. I'll add an appropriate validation error message somewhere.
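A sketch of what that promised validation could look like, with a stand-in class rather than the real `pex_config` machinery (class name and message are illustrative):

```python
class TransformCatalogBaseConfig:
    """Stand-in for the real pex_config-based Config; illustrative only."""

    def __init__(self, functorFile=None):
        self.functorFile = functorFile

    def validate(self):
        # Interactive (notebook) use may skip functorFile, but pipeline or
        # command-line execution needs it, so fail with a clear message.
        if not self.functorFile:
            raise ValueError(
                "config.functorFile is required when running as a "
                "pipelineTask or commandlineTask; point it at a functor "
                "YAML file."
            )
```

Calling `validate()` at task construction time turns the late `AttributeError` on `self.funcs` into an up-front, self-explanatory configuration error.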
```diff
@@ -3,6 +3,8 @@ tasks:
   isr: lsst.ip.isr.IsrTask
   characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
   calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
+  writeSourceTable: lsst.pipe.tasks.postprocess.WriteSourceTableTask
+  transformSourceTable: lsst.pipe.tasks.postprocess.TransformSourceTableTask
```
You might want to define another subset (or put them in Eli's consolidate group). With the current subset, if someone runs `processCcd` it will only run the top 3 tasks and not your two new ones (which might be what we want), but if you want a grouping that encompasses your new tasks, maybe something like:
```yaml
subsets:
  singleFrame:
    subset:
      - isr
      - characterizeImage
      - calibrate
      - writeSourceTable
      - transformSourceTable
    description: Single frame processing that includes table transformations
  processCcd:
    ....
```
These subsets let people quickly run a select group of tasks from the command line.
That makes sense, to mirror what we had before. Can subsets refer to each other? e.g.

```yaml
subsets:
  processCcd:
    - isr
    - characterizeImage
    - calibrate
  singleFrame:
    subset:
      - processCcd
      - writeSourceTable
      - transformSourceTable
    description: Single frame processing that includes table transformations
```
Edit: They can't.
Couldn't get the DECam repo going in time to add this for all cameras, so @natelust take a look at https://github.com/lsst/obs_subaru/pull/341/files
So currently this is the only place where labeled subsets can't be used as a substitute for labels. This is because, at the time, we didn't think it was worth adding the code and complexity to track and handle cyclical definitions just to save a few lines in a (mostly) static file. If you think this is a good feature to have, we can add it.
```diff
 def getAnalysis(self, parq, funcs=None, band=None):
     # Avoids disk access if funcs is passed
     if funcs is None:
-        funcs = self.getFunctors()
+        funcs = self.funcs
```
same
```diff
-    funcs = CompositeFunctor.from_file(self.config.functorFile)
-    funcs.update(dict(PostprocessAnalysis._defaultFuncs))
-    return funcs
+    return self.funcs
```
same
Force-pushed d89413f to 08ab119.
```diff
@@ -8,4 +8,3 @@ subsets:
       - forcedPhotCcd
       - forcedPhotCoadd
     description: A set of tasks to run when doing forced measurements
```
Does this pass the GitHub checks? I thought yaml lint required a newline at the end (but maybe I have that backwards).
lint DOES require a newline at the end. You had 2 newlines before: https://github.com/lsst/pipe_tasks/blob/master/pipelines/_Forced.yaml
```yaml
      - writeSourceTable
      - transformSourceTable
      - consolidateSourceTable
    description: Set of tasks for complete single frame processing. Analogous to SingleFrameDriver.
```
I'm not sure we should mention SingleFrameDriver in the docs; it does not help anyone unfamiliar with pipe_drivers, and it somewhat ties this package to that one, potentially after the latter is retired. It's up to your judgment, though. If you feel it should be there, feel free to leave it.
in preparation for running Gen3 DRP pipelines on cameras beyond obs_subaru
Force-pushed 08ab119 to d14aee8.