
DM-24638: Convert TransformSourceTableTask and friends to Gen3 #426

Merged
merged 3 commits into master from tickets/DM-24638 on Feb 24, 2021

Conversation

yalsayyad (Contributor)

No description provided.

@erykoff (Contributor) left a comment

There are some updates needed where physical_filter replaces filter, but otherwise this looks good. The default pipeline file should also be updated (this will be clear after a rebase).

        df = parq.get(columns=self.columns)
    elif isinstance(parq, pd.DataFrame):
        df = parq
    else:
Contributor

What would the parq instance be here? I think it's worth making this explicit and raising on else.

Contributor

Meanwhile, does __call__ need a docstring that explicitly says what parq can be? A lot of datatypes are listed, but not all the possible datatypes that this supports.

Contributor Author

The parq instance here is for the Gen3 case when deferLoad=False: the input datatypes are pd.DataFrames after loading. The parq.toDataFrame branch is the original (Gen2) path. I'll add inline comments.
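For context, a minimal sketch of how that dispatch could look with inline comments and an explicit raise on unexpected types, as suggested above. The helper name and the self.columns attribute are assumptions taken from the excerpt; note that pd.DataFrame itself has a .get method, so the isinstance check comes first here:

import pandas as pd

def _asDataFrame(self, parq):
    # Hypothetical helper mirroring the three branches discussed above.
    if isinstance(parq, pd.DataFrame):
        # Gen3 with deferLoad=False: the butler already loaded a DataFrame.
        df = parq
    elif hasattr(parq, "toDataFrame"):
        # Original Gen2 path: a ParquetTable.
        df = parq.toDataFrame(columns=self.columns)
    elif hasattr(parq, "get"):
        # Gen3 with deferLoad=True: a deferred handle; read only the
        # columns we need.
        df = parq.get(columns=self.columns)
    else:
        raise TypeError(f"Unsupported type for parq: {type(parq)!r}")
    return df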

storageClass="DataFrame",
dimensions=("instrument", "visit", "detector")
)

Contributor

So for DM-27164 (and this will need to be rebased of course), I defined from lsst.pipe.base import connectionTypes so you can use just connectionTypes here.
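That is, something like the following sketch (the class name is hypothetical; the field values are copied from the nearby excerpts, and the doc text is illustrative):

from lsst.pipe.base import connectionTypes
import lsst.pipe.base as pipeBase

class ExampleConnections(pipeBase.PipelineTaskConnections,
                         dimensions=("instrument", "visit", "detector")):
    # Shorter spelling: connectionTypes.Input rather than
    # pipeBase.connectionTypes.Input.
    inputCatalog = connectionTypes.Input(
        name="",
        doc="Example input catalog (illustrative)",
        storageClass="DataFrame",
        dimensions=("instrument", "visit", "detector"),
    )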

Contributor Author

It's not obvious to me that connectionTypes is more readable than pipeBase.connectionTypes when there are other pipeBase objects abounding, but sure, I'll change them all.

Contributor

I don't want to create makework! But we don't have any standard convention on this, and that's unpleasant.

    Must be subclassed.
    """
    inputCatalog = pipeBase.connectionTypes.Input(
        name="",
Contributor

Same here for connectionTypes

def runQuantum(self, butlerQC, inputRefs, outputRefs):
    inputs = butlerQC.get(inputRefs)
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
                      dataId=outputRefs.outputCatalog.dataId)
Contributor

I'm not sure that using dataId this way is really kosher in Gen3. Specifically, filter is now physical_filter, so in this case I think this won't work correctly. I'm trying to follow how dataId gets used downstream to see if anything else needs to come from it; if it's just returning the dataId in the struct, that seems like a Gen2 necessity rather than a Gen3 one (though I could be wrong).

What I would suggest is specifically pulling out physical_filter for Gen3 (and filter for Gen2) and having that as a separate keyword, since it seems special and necessary. Then I think the dataId can be set only in the Gen2 run code. Though I may be missing one of the uses here.

Contributor Author

Ah right, because the regular dataId depends on what you pass on the command line. OK, I changed it to outputRefs.outputCatalog.dataId.full so that at least the same keys are added to the table every time. Technically, all this info is redundant: everything will eventually be looked up with the ccdVisitId. I've been adding these columns because we don't yet have a corresponding ccdVisit table to look up the full ccd info given a ccdVisitId.

Passing an optional dataId for logging was kosher last I checked.
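A sketch of runQuantum under that change, assuming run returns a Struct matching the output connection:

def runQuantum(self, butlerQC, inputRefs, outputRefs):
    inputs = butlerQC.get(inputRefs)
    # dataId.full is the expanded dataId, so the same keys (including
    # physical_filter) are attached to the table on every run,
    # regardless of what was passed on the command line.
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
                      dataId=outputRefs.outputCatalog.dataId.full)
    butlerQC.put(result, outputRefs)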

                                        dimensions=("instrument", "visit", "detector")):

    inputCatalog = pipeBase.connectionTypes.Input(
        doc="Wide input catalog of sources produced by WriteSourceTableTask",
Contributor

Same here about connectionTypes.

class ConsolidateSourceTableConnections(pipeBase.PipelineTaskConnections,
                                        dimensions=("instrument", "visit")):
    inputCatalogs = pipeBase.connectionTypes.Input(
        doc="Input per-detector Source Tables",
Contributor

And here.

@natelust (Contributor) left a comment

Sorry for the drive-by review; this just caught my eye while I was waiting for a batch job and looking at something else in my email. You might have this well handled already, just something to think about.


def runQuantum(self, butlerQC, inputRefs, outputRefs):
    inputs = butlerQC.get(inputRefs)
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
Contributor

If self.config.functorFile is False, does that mean self.funcs is never defined? If so, I think this line will raise an exception.

Contributor Author

self.config.functorFile is optional in the sense that you should be able to instantiate the task in a notebook and run any functors you want (not necessarily from a yaml), and there's no default set of functors that makes sense for the base class.

self.config.functorFile is not optional when running the subclasses as a command-line task or pipeline task. I'll add an appropriate validation error message somewhere.
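For example, a guard along these lines could give a clear error (a sketch only; the placement and message are assumptions):

def runQuantum(self, butlerQC, inputRefs, outputRefs):
    # Hypothetical guard: self.funcs is only built when functorFile is
    # set, so fail with a clear message rather than an AttributeError.
    if getattr(self, "funcs", None) is None:
        raise ValueError(
            "config.functorFile must be set when running as a pipeline task")
    inputs = butlerQC.get(inputRefs)
    result = self.run(parq=inputs['inputCatalog'], funcs=self.funcs,
                      dataId=outputRefs.outputCatalog.dataId.full)
    butlerQC.put(result, outputRefs)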

@@ -3,6 +3,8 @@ tasks:
   isr: lsst.ip.isr.IsrTask
   characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
   calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
+  writeSourceTable: lsst.pipe.tasks.postprocess.WriteSourceTableTask
+  transformSourceTable: lsst.pipe.tasks.postprocess.TransformSourceTableTask
Contributor

You might want to define another subset (or put them in Eli's consolidate group). With the current subset, someone who runs processCcd will only run the top three tasks and not your two new ones (which might be what we want), but if you want a grouping that encompasses your new tasks, maybe something like:

subsets:
  singleFrame:
    subset:
      - isr
      - characterizeImage
      - calibrate
      - writeSourceTable
      - transformSourceTable
    description: Single frame processing that includes table transformations
  processCcd:
      ....

These subsets let people quickly run a select group of tasks from the command line.

@yalsayyad (Contributor Author) commented on Feb 17, 2021

That makes sense to mirror what we had before. Can subsets refer to each other? e.g.

subsets:
  processCcd:
      - isr
      - characterizeImage
      - calibrate
  singleFrame:
    subset:
      - processCcd
      - writeSourceTable
      - transformSourceTable
    description: Single frame processing that includes table transformations

Edit: They can't.

Contributor Author

Couldn't get the DECam repo going in time to add this for all cameras, so @natelust take a look at https://github.com/lsst/obs_subaru/pull/341/files

Contributor

So currently this is the only place where labeled subsets can't be used as a substitute for labels. This is because, at the time, we didn't think it was worth adding the code and complexity to track and handle cyclical definitions just to save a few lines in a (mostly) static file. If you think this is a good feature to have, we can add it.


     def getAnalysis(self, parq, funcs=None, band=None):
         # Avoids disk access if funcs is passed
         if funcs is None:
-            funcs = self.getFunctors()
+            funcs = self.funcs
Contributor

same

-        funcs = CompositeFunctor.from_file(self.config.functorFile)
-        funcs.update(dict(PostprocessAnalysis._defaultFuncs))
-        return funcs
+        return self.funcs
Contributor

same

@yalsayyad force-pushed the tickets/DM-24638 branch 4 times, most recently from d89413f to 08ab119 on February 18, 2021 04:52
@@ -8,4 +8,3 @@ subsets:
       - forcedPhotCcd
       - forcedPhotCoadd
     description: A set of tasks to run when doing forced measurements
-
Contributor

Does this pass GitHub checks? I thought yamllint required a newline at the end (but maybe I have that backwards).

Contributor Author

lint DOES require a newline at the end. You had 2 newlines before: https://github.com/lsst/pipe_tasks/blob/master/pipelines/_Forced.yaml

      - writeSourceTable
      - transformSourceTable
      - consolidateSourceTable
    description: Set of tasks for complete single frame processing. Analogous to SingleFrameDriver.
Contributor

I'm not sure we should mention SingleFrameDriver in the docs; it doesn't help anyone unfamiliar with pipe_drivers, and it somewhat ties this package to that one, potentially even after the latter is retired. It's up to your judgment, though. If you feel it should be there, then feel free to leave it.

@yalsayyad yalsayyad merged commit 0a6ae79 into master Feb 24, 2021
@yalsayyad yalsayyad deleted the tickets/DM-24638 branch February 24, 2021 02:29