
DM-11118: Build stubbed out verify_ap #1

Merged
kfindeisen merged 1 commit into master from tickets/DM-11118 on Jul 28, 2017

Conversation

kfindeisen (Member):

This is an initial implementation for verify_ap. It should be equivalent to the pseudocode on DM-11118, except that it has two afterburners and adds explicit support for lsst.verify.

The current code does not run because there are no working datasets, so the constructor of Dataset always raises. In addition, the stubs in pipedriver all raise NotImplementedError. However, test_args does run, and the main program's command-line parser is functional:

$ python python/lsst/verify/ap/verify_ap.py -h
usage: verify_ap.py [-h] --dataIdString DATAID [-j PROCESSES] [--silent]
                    --dataset {HiTS} (--output OUTPUT | --rerun OUTPUT)
                    [--version]

Executes the LSST DM AP pipeline and analyzes its performance using metrics.

optional arguments:
  -h, --help            show this help message and exit
  --dataIdString DATAID
                        An identifier for the data to process. Syntax and
                        meaning same as --id in Tasks.
  -j PROCESSES, --processes PROCESSES
                        Number of processes to use
  --silent              Do NOT submit metrics to SQuaSH (not yet implemented).
  --dataset {HiTS}      The source of data to pass through the pipeline.
  --output OUTPUT       The location of the repository to use for program
                        output.
  --rerun OUTPUT        The location of the repository to use for program
                        output, as DATASET/rerun/OUTPUT
  --version             show program's version number and exit

@djreiss (Contributor) left a comment:

Looks mostly good to me.

        -------
        a string giving the location of the top-level directory for telescope output files
        """
        os.path.join(self.dataset_root, 'raw')
Contributor:

Need return.

        -------
        a string giving the location of the top-level directory for master calibration files
        """
        os.path.join(self.dataset_root, 'calib')
Contributor:

Need return.

        -------
        a string giving the location of the top-level directory for precomputed templates
        """
        os.path.join(self.dataset_root, 'templates')
Contributor:

Again, return.
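
A minimal sketch of the fix all three comments are asking for, assuming these are read-only properties of Dataset (the property names data_location, calib_location, and template_location are taken from later in this review):

import os


class Dataset(object):
    # ... __init__ is assumed to set up self.dataset_root ...

    @property
    def data_location(self):
        """The top-level directory for telescope output files."""
        return os.path.join(self.dataset_root, 'raw')

    @property
    def calib_location(self):
        """The top-level directory for master calibration files."""
        return os.path.join(self.dataset_root, 'calib')

    @property
    def template_location(self):
        """The top-level directory for precomputed templates."""
        return os.path.join(self.dataset_root, 'templates')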



def check_squash_ready(parsed_cmd_line):
    """Tests whether the program can has everything it needs for the SQuaSH API.
Contributor:

can

"""
log = lsst.log.Log.getLogger('verify.ap.metrics.AutoJob.__exit__')

out_file = 'verify_ap.verify.json'
Contributor:

Why hardcoded? (there might be a perfectly good reason, I'm just not aware of it)

Member Author:

It's the default name assumed by verify, according to SQR-019, and there doesn't seem to be an obvious reason to let the user create multiple dump files.
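
For context, a rough sketch of the context manager this log line lives in; everything here except the logger name and the hardcoded filename is an assumption about lsst.verify's API:

import lsst.log
import lsst.verify


class AutoJob(object):
    """Context manager that collects measurements into an
    lsst.verify.Job and dumps them to disk on exit.
    """

    def __init__(self):
        self._job = lsst.verify.Job()

    def __enter__(self):
        return self._job

    def __exit__(self, exc_type, exc_value, traceback):
        log = lsst.log.Log.getLogger('verify.ap.metrics.AutoJob.__exit__')

        out_file = 'verify_ap.verify.json'  # default name assumed by verify (SQR-019)
        self._job.write(out_file)
        log.info('Wrote measurements to %s', out_file)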

from lsst.verify.ap.dataset import Dataset


def _ingest_raws(dataset, repo_root):
Contributor:

Is there a reason these methods are not implemented in a class? There are many common variables passed to each of these methods, esp. dataset and repo_root.

I see they're all called from run_ap_pipe below. But question still holds...

Member Author:

I couldn't think of any particular program state that would be guaranteed by having a PipeWrapper object sitting around, like there is with Dataset and AutoJob.

I guess I could create a class as a convenience wrapper for the function arguments, but that seems a bit gratuitous. It would also make it more difficult for other coders to implement the stubs and/or keep straight which information is needed for which pipeline step.

Contributor:

To me it seems a natural case for a class, where future implementations can override the stubs in a subclass. I know you are not a fan of overuse of inheritance, but to me this seems like a situation where it would be useful.

Contributor:

Oh, and I think this module could use a module-level docstring explaining its usage and where/how it gets called (or a class-level docstring, if we go with my previous suggestion).

Member Author:

I don't think there's any plan for "future implementations"; the stubs will be filled in once ap_pipe's API is finalized. Where would the subclasses live, anyway? There shouldn't be any packages that depend on verify_ap because it's a command-line program rather than a library...

Member Author:

After discussing with @djreiss, this module will be redesigned to make it easier to adapt to other pipelines if somebody decides our system is better than ci_hsc or validate_drp.

from lsst.verify.ap.pipedriver import ApPipeParser, run_ap_pipe


class _VerifyApParser(argparse.ArgumentParser):
Contributor:

I'm not yet up to speed on the verify framework or the verify_ap design discussions, but is there any reason that this stuff is not set up as a lsst.pipe.base.Task subclass? That way you'd already have argument and config file parsing built-in.

Member Author:

Simon did not want verify_ap to be a Task, because some of the processing steps are not Tasks and so config info would be lost anyway (or at least, that's what my notes say; is that too garbled, @SimonKrughoff?)

Also, the desired behavior for verify_ap, particularly the idea of a "dataset" that is not necessarily a Butler repository (to support native observatory output, IIRC), leads to command-line arguments that are incompatible with Task requirements. This includes using --dataIdString instead of --id, and the absence of an input-repo or --rerun INPUT:OUTPUT argument. I'll admit I don't know the reason for avoiding --id, but there was some discussion about input reruns on Slack.
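
For reference, here is roughly what the parser behind the help output above would look like; the arguments match the usage message, but the parents= wiring and the version string are assumptions:

import argparse

from lsst.verify.ap.pipedriver import ApPipeParser


class _VerifyApParser(argparse.ArgumentParser):
    """Top-level command-line parser for verify_ap."""

    def __init__(self):
        # ApPipeParser is assumed to contribute --dataIdString and -j;
        # it uses add_help=False, so it can serve as a parent parser
        argparse.ArgumentParser.__init__(
            self,
            description='Executes the LSST DM AP pipeline and analyzes its '
                        'performance using metrics.',
            parents=[ApPipeParser()])
        self.add_argument('--silent', action='store_true',
                          help='Do NOT submit metrics to SQuaSH (not yet implemented).')
        self.add_argument('--dataset', choices=['HiTS'], required=True,
                          help='The source of data to pass through the pipeline.')
        output = self.add_mutually_exclusive_group(required=True)
        output.add_argument('--output',
                            help='The location of the repository to use for program output.')
        output.add_argument('--rerun', metavar='OUTPUT',
                            help='The location of the repository to use for program '
                                 'output, as DATASET/rerun/OUTPUT')
        # version string below is a placeholder
        self.add_argument('--version', action='version', version='%(prog)s 0.1')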


class CommandLineTestSuite(lsst.utils.tests.TestCase):

    def test_minimum(self):
Contributor:

Please add docstrings to this and all test_* methods.

"""Verify that if a Dataset object exists, the corresponding data are available.
"""
try:
import lsst.verify.ap.hits2015 # noqa: F401
Member Author:

@djreiss, I don't think this will work with what you did for DM-11117. Do you have any suggestions for how to make sure a dataset package has been set up correctly?

Contributor:

Apparently I need to add it to repos.yaml; I'll do that shortly. Until then, you can also just clone the dataset repo and run setup -r . inside it. Let me know if that does or doesn't work.
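
One possible way to perform that check, sketched under the assumption that datasets are EUPS packages (the helper name and usage are hypothetical):

import lsst.utils


def _is_dataset_setup(package):
    """Test whether an EUPS dataset package (e.g., the HiTS data) is set up."""
    try:
        # getPackageDir raises if the package has not been set up
        lsst.utils.getPackageDir(package)
        return True
    except Exception:  # the exact exception type depends on the stack version
        return False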

@mrawls (Contributor) left a comment:

This is a really impressive package you've put together! Once you and @djreiss sort out the dataset interaction business I'm very happy to have this merged and use it as the thing ap_pipe must plug into. Please ensure the README is updated with any UI changes you've made before merging.

    def __init__(self):
        # Help and documentation will be handled by main program's parser
        argparse.ArgumentParser.__init__(self, add_help=False)
        self.add_argument('--dataIdString', dest='dataId', required=True,
Contributor:

What is the rationale for calling this dataIdString and not just dataId or id? The README example has it set to "visit=54123" and I'm wondering what other information might typically be passed here aside from a visit number.

Member Author:

The DM-11118 description said:

"--dataIdString" (note: often just "--Id" as a command line string in other places)

which I took to mean "we have a specific reason for calling it --dataIdString and especially not --id". Do you remember where that came from?

Contributor:

Hmm, the distinction is likely because it's not a canonical dataId, and id alone is a vague term. I'm fine with dataIdString, I just mistakenly thought the name came from you and not from the design outline. No problem.

Member Author:

How is it not a canonical dataId? I wrote the documentation assuming it was (i.e., 'Syntax and meaning same as --id in Tasks.').

Contributor:

My understanding is that dataId means a very specific thing to the butler and it's represented as a dictionary in python. In contrast, this is a string with some of the info you might find in a "proper" dataId.

Member Author:

So what should the user-facing documentation say? "May not support all features of a Butler dataId, exact subset TBD"?

Contributor:

Yeah, something like that would be great :) but at least give a list of what it currently does support (visit, obviously; anything else?)

Member Author:

What's supported is up to you. From the design outline:

ap_pipe.process(repo_location, rerun, parallelization_level, dataIdString)
ap_pipe.diff(repo_location, rerun, parallelization_level, dataIdString)

Contributor:

Fair enough! For now it will just be visit, but I see why this is more of an ap_pipe consideration. Let's just leave this be.
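
As a side note, a minimal sketch of turning a dataIdString like "visit=54123" into the dict the Butler expects; this helper is hypothetical, not part of the diff:

def _parse_data_id(data_id_string):
    """Parse a dataId string such as "visit=54123 ccdnum=10" into a dict.

    This supports only simple key=value pairs, not the ranges and lists
    that Task --id arguments accept.
    """
    data_id = {}
    for pair in data_id_string.split():
        key, _, value = pair.partition('=')
        if not value:
            raise ValueError('Invalid dataId component: %r' % pair)
        # store numeric values as ints, everything else as strings
        data_id[key] = int(value) if value.isdigit() else value
    return data_id


# _parse_data_id('visit=54123') == {'visit': 54123}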

from __future__ import absolute_import, division, print_function


class Pipeline(object):
Contributor:

I'm a bit confused about the point of the Pipeline class. It's here for future-proofing but not presently used? Is that the idea?

Member Author:

Sort of. It's for clarifying exactly what an alternative to ApPipe needs to look like (i.e., it must work in terms of a Dataset and an output repo, and it must have a run method), and separating things needed by any pipeline wrapper from things specific to wrapping ap_pipe.

It's a very C++/Java way of approaching the problem, but @parejkoj assures me it's not discouraged in Python (though he did suggest some changes if I keep this class).

Contributor:

The Pipeline class was my suggestion -- and thanks for doing it. I also think all the "Use ..." comments in each method are helpful; I failed to realize they would disappear when you put the methods into a class.
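
For readers following along, the shape being agreed on is roughly this (a sketch; the constructor signature is an assumption, but the Dataset/output-repo contract and the run method come from the author's description above):

class Pipeline(object):
    """Generic wrapper for a pipeline whose performance is to be verified.

    A concrete pipeline (such as ApPipe below) works in terms of a
    Dataset and an output repository, and must provide a run method.
    """

    def __init__(self, dataset, repo_root):
        self.dataset = dataset
        self.repo_root = repo_root

    def run(self, data_id):
        """Run the entire pipeline on the data identified by data_id."""
        raise NotImplementedError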

    parsed_cmd_line: `argparse.Namespace`
        Command-line arguments, including all arguments supported by `ApPipeParser`.
    """

Contributor:

Just to confirm, your plan is to have each of the methods below (_ingest_raws, etc.) call the appropriate piece of ap_pipe and get rid of the NotImplementedErrors, yes? I like how you've broken them out into logical steps and I like the names you chose 👍

Member Author:

Yes, that's exactly the plan.

        raise NotImplementedError

    def _ingest_templates(self):
        """Ingest precomputed templates for use by LSST.
Contributor:

You might want to specify that the template can be either a calexp or a coadd (there may be more official names for these two data products, but ap_pipe will accept either, with a coadd being the default)

Member Author:

(Please disregard my previous comment here.) Am I correct in assuming that raws and calibs should be in a telescope-specific format rather than LSST types like "calexp"?

Contributor:

It should probably be whatever the butler calls them, which is telescope-specific for DECam CP calibration products (e.g., cpBias and cpFlat) and shouldn't be telescope-specific for the other products (raw, calexp, etc.). I think the correct template datatypes are either calexp or deepCoadd_psfMatchedWarp.

Contributor:

My answer was template-specific but your question was about raws and calibs, sorry. I'm actually not sure, this would be a good question for slack.

class ApPipe(Pipeline):
    """Wrapper for `lsst.ap.pipe` that executes all steps through source
    association.

Contributor:

It might be worthwhile to mention that the steps must be executed in the order presented for the pipeline to work. This is pretty clear from the run method but it can't hurt to say upfront.

Member Author:

I'm not sure I understand. The steps are executed in the order determined by run; who (except somebody modifying that method) would need to care about what happens if you change the order? From the caller's point of view they just call run, and the pipeline runs.

Contributor:

That's true; I was thinking of a very weird case where somebody stumbled across ApPipe, ignored the run method, and thought "I'll just run process and ignore all the rest!" It's fine as-is.


    @property
    def dataset_root(self):
        """The parent directory containing everything related to the dataset.
Contributor:

Are we not assuming that the data_location, calib_location, and template_location must all be in some default set of locations that can be known just by providing dataset_root?

Member Author:

Yes, we are...? I don't think I understand the question. 😕

Contributor:

I think the code doesn't make it immediately clear that dataset_root will know where to find "lower-level" things like data_location, calib_location, etc. But I guess there's probably not another obvious way to define them all, so it's fine.

Member Author:

I'm not sure if this answers your question, but I don't expect ApPipe will need to use dataset_root at all. That's just there as a backup for when an operation really needs to know about the dataset as a whole (currently, it's only used to figure out reruns).


    def make_output_repo(self, output_dir):
        """Set up a directory as an output repository compatible with this dataset.

Contributor:

Does this play nice with reruns? (Asking for a friend who really likes reruns)

Member Author:

Reruns are translated to directories in the verify_ap module, so this class doesn't need to care.
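
The translation being described is, in sketch form (the helper name is hypothetical; the DATASET/rerun/OUTPUT layout comes from the usage message above):

import os


def _get_output_dir(dataset_root, parsed_output, parsed_rerun):
    """Translate the mutually exclusive --output/--rerun arguments into
    a single output directory.
    """
    if parsed_rerun is not None:
        return os.path.join(dataset_root, 'rerun', parsed_rerun)
    return parsed_output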

output.add_argument(
    '--rerun', metavar='OUTPUT',
    type=_FormattedType('[^:]+',
                        'Invalid name "%s"; verify_ap does not support reruns as input.'),
Contributor:

This is confusing because it's an output argument but the message is "verify_ap does not support reruns as input". Confused.

Member Author:

The error message is for people who ignore/forget the fact that it's an output argument. Would it be less confusing if the _FormattedType object were factored into a separate variable?

Contributor:

I think I'm still confused as to what the type argument is catching here. Normally you use it to clarify that the argparse input must be an int, str, float, whatever. But you're trying to catch some kind of user error where they mistakenly think --rerun could be used in lieu of --dataset?

Member Author:

The argparse documentation makes it sound like type is where input validation is supposed to go, given that they imply that it would be the appropriate place for range-checking if choices didn't exist. I agree it feels weird, so maybe I misunderstood.

The error I'm trying to catch is a user saying --rerun INPUT:OUTPUT, which is common(?) when using Tasks. I'm guessing the mindset behind the error would be less "in lieu of --dataset" and more "--dataset specifies the survey but not the file location". (This check was requested by Eric, so maybe that last sentence is completely wrong.)

Contributor:

Oh, OK then! I'd never seen anyone try --rerun INPUT:OUTPUT but now that you say that and I pretend to remember how regexes work I can see what you've done. Thanks for clarifying. So you're really insisting that there not be more than one thing shoved into --rerun. A clearer error message might be Invalid name "%s"; verify_ap supports only one output destination when using --rerun. You have entered something that appears to be of the form INPUT:OUTPUT. Try again and specify only OUTPUT.
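
For anyone else puzzling over this, here is a sketch of what a validator like _FormattedType could look like; the real implementation is not shown in this diff:

import argparse
import re


class _FormattedType(object):
    """An argparse type that accepts only strings matching a regex.

    Parameters
    ----------
    fmt: `str`
        A regular expression the entire argument must match.
    msg: `str`
        An error message template; "%s" is replaced with the rejected value.
    """

    def __init__(self, fmt, msg='Invalid format for "%s".'):
        # anchor the pattern so the whole argument must match;
        # '[^:]+' therefore rejects anything of the form INPUT:OUTPUT
        self._format = re.compile(fmt + '$')
        self._message = msg

    def __call__(self, value):
        if self._format.match(value):
            return value
        raise argparse.ArgumentTypeError(self._message % value)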

The framework supports the current dataset system (see DM-11116). It
has been designed to be easy to modify and extend, as the details of
the verification framework are still being worked out.
@kfindeisen merged commit e0328c2 into master on Jul 28, 2017
@kfindeisen deleted the tickets/DM-11118 branch on November 30, 2018