
DM-11118: Build stubbed out verify_ap #1

Merged
kfindeisen merged 1 commit into master from tickets/DM-11118 on Jul 28, 2017

Conversation

kfindeisen (Member):

This is an initial implementation for verify_ap. It should be equivalent to the pseudocode on DM-11118, except that it has two afterburners and adds explicit support for lsst.verify.

The current code does not run because there are no working datasets, so the constructor of Dataset always raises. In addition, the stubs in pipedriver all raise NotImplementedError. However, test_args does run, and the main program's command-line parser is functional:

$ python python/lsst/verify/ap/verify_ap.py -h
usage: verify_ap.py [-h] --dataIdString DATAID [-j PROCESSES] [--silent]
                    --dataset {HiTS} (--output OUTPUT | --rerun OUTPUT)
                    [--version]

Executes the LSST DM AP pipeline and analyzes its performance using metrics.

optional arguments:
  -h, --help            show this help message and exit
  --dataIdString DATAID
                        An identifier for the data to process. Syntax and
                        meaning same as --id in Tasks.
  -j PROCESSES, --processes PROCESSES
                        Number of processes to use
  --silent              Do NOT submit metrics to SQuaSH (not yet implemented).
  --dataset {HiTS}      The source of data to pass through the pipeline.
  --output OUTPUT       The location of the repository to use for program
                        output.
  --rerun OUTPUT        The location of the repository to use for program
                        output, as DATASET/rerun/OUTPUT
  --version             show program's version number and exit

@djreiss (Contributor) left a comment:

Looks mostly good to me.

        -------
        a string giving the location of the top-level directory for telescope output files
        """
        os.path.join(self.dataset_root, 'raw')
Contributor:

Need return.

        -------
        a string giving the location of the top-level directory for master calibration files
        """
        os.path.join(self.dataset_root, 'calib')
Contributor:

Need return.

        -------
        a string giving the location of the top-level directory for precomputed templates
        """
        os.path.join(self.dataset_root, 'templates')
Contributor:

Again, return.
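
A minimal sketch of the fix all three comments are asking for, assuming these are read-only properties of Dataset (the property names data_location, calib_location, and template_location are taken from later in this review):

import os


class Dataset(object):
    # ... __init__ is assumed to set up self.dataset_root ...

    @property
    def data_location(self):
        """The top-level directory for telescope output files."""
        return os.path.join(self.dataset_root, 'raw')

    @property
    def calib_location(self):
        """The top-level directory for master calibration files."""
        return os.path.join(self.dataset_root, 'calib')

    @property
    def template_location(self):
        """The top-level directory for precomputed templates."""
        return os.path.join(self.dataset_root, 'templates')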



def check_squash_ready(parsed_cmd_line):
    """Tests whether the program can has everything it needs for the SQuaSH API.
Contributor:

can

"""
log = lsst.log.Log.getLogger('verify.ap.metrics.AutoJob.__exit__')

out_file = 'verify_ap.verify.json'
Contributor:

Why hardcoded? (there might be a perfectly good reason, I'm just not aware of it)

Member Author:

It's the default name assumed by verify, according to SQR-019, and there doesn't seem to be an obvious reason to let the user create multiple dump files.
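
For context, a rough sketch of the context manager this log line lives in; everything here except the logger name and the hardcoded filename is an assumption about lsst.verify's API:

import lsst.log
import lsst.verify


class AutoJob(object):
    """Context manager that collects measurements into an
    lsst.verify.Job and dumps them to disk on exit.
    """

    def __init__(self):
        self._job = lsst.verify.Job()

    def __enter__(self):
        return self._job

    def __exit__(self, exc_type, exc_value, traceback):
        log = lsst.log.Log.getLogger('verify.ap.metrics.AutoJob.__exit__')

        out_file = 'verify_ap.verify.json'  # default name assumed by verify (SQR-019)
        self._job.write(out_file)
        log.info('Wrote measurements to %s', out_file)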

from lsst.verify.ap.dataset import Dataset


def _ingest_raws(dataset, repo_root):
Contributor:

Is there a reason these methods are not implemented in a class? There are many common variables passed to each of these methods, esp. dataset and repo_root.

I see they're all called from run_ap_pipe below. But question still holds...

Member Author:

I couldn't think of any particular program state that would be guaranteed by having a PipeWrapper object sitting around, like there is with Dataset and AutoJob.

I guess I could create a class as a convenience wrapper for the function arguments, but that seems a bit gratuitous. It would also make it more difficult for other coders to implement the stubs and/or keep straight which information is needed for which pipeline step.

Contributor:

To me it seems a natural case for a class, where future implementations can override the stubs in a subclass. I know you are not a fan of overuse of inheritance, but to me this seems like a situation where it would be useful.

Contributor:

Oh, and I think this module could use a module-level docstring explaining its usage and where/how it gets called (or a class-level docstring, if we go with my previous suggestion).

Member Author:

I don't think there's any plan for "future implementations"; the stubs will be filled in once ap_pipe's API is finalized. Where would the subclasses live, anyway? There shouldn't be any packages that depend on verify_ap because it's a command-line program rather than a library...

Member Author:

After discussing with @djreiss, this module will be redesigned to make it easier to adapt to other pipelines if somebody decides our system is better than ci_hsc or validate_drp.

from lsst.verify.ap.pipedriver import ApPipeParser, run_ap_pipe


class _VerifyApParser(argparse.ArgumentParser):
Contributor:

I'm not yet up to speed on the verify framework or the verify_ap design discussions, but is there any reason that this stuff is not set up as a lsst.pipe.base.Task subclass? That way you'd already have argument and config file parsing built-in.

Member Author:

Simon did not want verify_ap to be a Task, because some of the processing steps are not Tasks and so config info would be lost anyway (or at least, that's what my notes say; is that too garbled, @SimonKrughoff?)

Also, the desired behavior for verify_ap, particularly the idea of a "dataset" that is not necessarily a Butler repository (to support native observatory output, IIRC), leads to command-line arguments that are incompatible with Task requirements. This includes using --dataIdString instead of --id, and the absence of an input-repo or --rerun INPUT:OUTPUT argument. I'll admit I don't know the reason for avoiding --id, but there was some discussion about input reruns on Slack.
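
For reference, here is roughly what the parser behind the help output above would look like; the arguments match the usage message, but the parents= wiring and the version string are assumptions:

import argparse

from lsst.verify.ap.pipedriver import ApPipeParser


class _VerifyApParser(argparse.ArgumentParser):
    """Top-level command-line parser for verify_ap."""

    def __init__(self):
        # ApPipeParser is assumed to contribute --dataIdString and -j;
        # it uses add_help=False, so it can serve as a parent parser
        argparse.ArgumentParser.__init__(
            self,
            description='Executes the LSST DM AP pipeline and analyzes its '
                        'performance using metrics.',
            parents=[ApPipeParser()])
        self.add_argument('--silent', action='store_true',
                          help='Do NOT submit metrics to SQuaSH (not yet implemented).')
        self.add_argument('--dataset', choices=['HiTS'], required=True,
                          help='The source of data to pass through the pipeline.')
        output = self.add_mutually_exclusive_group(required=True)
        output.add_argument('--output',
                            help='The location of the repository to use for program output.')
        output.add_argument('--rerun', metavar='OUTPUT',
                            help='The location of the repository to use for program '
                                 'output, as DATASET/rerun/OUTPUT')
        # version string below is a placeholder
        self.add_argument('--version', action='version', version='%(prog)s 0.1')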


class CommandLineTestSuite(lsst.utils.tests.TestCase):

    def test_minimum(self):
Contributor:

Please add docstrings to this and all test_* methods.

"""Verify that if a Dataset object exists, the corresponding data are available.
"""
try:
import lsst.verify.ap.hits2015 # noqa: F401
Member Author:

@djreiss, I don't think this will work with what you did for DM-11117. Do you have any suggestions for how to make sure a dataset package has been set up correctly?

Contributor:

Apparently I need to add it to repos.yaml; I'll do that shortly. Until then, you can also just clone the dataset repo and run setup -r . inside it. Let me know if that does or doesn't work.
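
One possible way to perform that check, sketched under the assumption that datasets are EUPS packages (the helper name and usage are hypothetical):

import lsst.utils


def _is_dataset_setup(package):
    """Test whether an EUPS dataset package (e.g., the HiTS data) is set up."""
    try:
        # getPackageDir raises if the package has not been set up
        lsst.utils.getPackageDir(package)
        return True
    except Exception:  # the exact exception type depends on the stack version
        return False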

@mrawls (Contributor) left a comment:

This is a really impressive package you've put together! Once you and @djreiss sort out the dataset interaction business I'm very happy to have this merged and use it as the thing ap_pipe must plug into. Please ensure the README is updated with any UI changes you've made before merging.

    def __init__(self):
        # Help and documentation will be handled by main program's parser
        argparse.ArgumentParser.__init__(self, add_help=False)
        self.add_argument('--dataIdString', dest='dataId', required=True,
Contributor:

What is the rationale for calling this dataIdString and not just dataId or id? The README example has it set to "visit=54123" and I'm wondering what other information might typically be passed here aside from a visit number.

Member Author:

The DM-11118 description said:

"--dataIdString" (note: often just "--Id" as a command line string in other places)

which I took to mean "we have a specific reason for calling it --dataIdString and especially not --id". Do you remember where that came from?

Contributor:

Hmm, the distinction is likely because it's not a canonical dataId, and id alone is a vague term. I'm fine with dataIdString, I just mistakenly thought the name came from you and not from the design outline. No problem.

Member Author:

How is it not a canonical dataId? I wrote the documentation assuming it was (i.e., 'Syntax and meaning same as --id in Tasks.').

Contributor:

My understanding is that dataId means a very specific thing to the butler and it's represented as a dictionary in python. In contrast, this is a string with some of the info you might find in a "proper" dataId.

Member Author:

So what should the user-facing documentation say? "May not support all features of a Butler dataId, exact subset TBD"?

Contributor:

Yeah, something like that would be great :) but at least give a list of what it currently does support (visit, obviously; anything else?)

Member Author:

What's supported is up to you. From the design outline:

ap_pipe.process(repo_location, rerun, parallelization_level, dataIdString)
ap_pipe.diff(repo_location, rerun, parallelization_level, dataIdString)

Contributor:

Fair enough! For now it will just be visit, but I see why this is more of an ap_pipe consideration. Let's just leave this be.
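
As a side note, a minimal sketch of turning a dataIdString like "visit=54123" into the dict the Butler expects; this helper is hypothetical, not part of the diff:

def _parse_data_id(data_id_string):
    """Parse a dataId string such as "visit=54123 ccdnum=10" into a dict.

    This supports only simple key=value pairs, not the ranges and lists
    that Task --id arguments accept.
    """
    data_id = {}
    for pair in data_id_string.split():
        key, _, value = pair.partition('=')
        if not value:
            raise ValueError('Invalid dataId component: %r' % pair)
        # store numeric values as ints, everything else as strings
        data_id[key] = int(value) if value.isdigit() else value
    return data_id


# _parse_data_id('visit=54123') == {'visit': 54123}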

from __future__ import absolute_import, division, print_function


class Pipeline(object):
Contributor:

I'm a bit confused about the point of the Pipeline class. It's here for future-proofing but not presently used? Is that the idea?

Member Author:

Sort of. It's for clarifying exactly what an alternative to ApPipe needs to look like (i.e., it must work in terms of a Dataset and an output repo, and it must have a run method), and separating things needed by any pipeline wrapper from things specific to wrapping ap_pipe.

It's a very C++/Java way of approaching the problem, but @parejkoj assures me it's not discouraged in Python (though he did suggest some changes if I keep this class).

Contributor:

The Pipeline class was my suggestion -- and thanks for doing it. I also think all the "Use ..." comments in each method are helpful; I failed to realize they would disappear when you put the methods into a class.
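
For readers following along, the shape being agreed on is roughly this (a sketch; the constructor signature is an assumption, but the Dataset/output-repo contract and the run method come from the author's description above):

class Pipeline(object):
    """Generic wrapper for a pipeline whose performance is to be verified.

    A concrete pipeline (such as ApPipe below) works in terms of a
    Dataset and an output repository, and must provide a run method.
    """

    def __init__(self, dataset, repo_root):
        self.dataset = dataset
        self.repo_root = repo_root

    def run(self, data_id):
        """Run the entire pipeline on the data identified by data_id."""
        raise NotImplementedError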

    parsed_cmd_line: `argparse.Namespace`
        Command-line arguments, including all arguments supported by `ApPipeParser`.
    """

Contributor:

Just to confirm, your plan is to have each of the methods below (_ingest_raws, etc.) call the appropriate piece of ap_pipe and get rid of the NotImplementedErrors, yes? I like how you've broken them out into logical steps and I like the names you chose 👍

Member Author:

Yes, that's exactly the plan.

        raise NotImplementedError

    def _ingest_templates(self):
        """Ingest precomputed templates for use by LSST.
Contributor:

You might want to specify that the template can be either a calexp or a coadd (there may be more official names for these two data products, but ap_pipe will accept either, with a coadd being the default)

Member Author:

(Please disregard my previous comment here.) Am I correct in assuming that raws and calibs should be in a telescope-specific format rather than LSST types like "calexp"?

Contributor:

It should probably be whatever the butler calls them, which is telescope-specific for DECam CP calibration products (e.g., cpBias and cpFlat) and shouldn't be telescope-specific for the other products (raw, calexp, etc.). I think the correct template datatypes are either calexp or deepCoadd_psfMatchedWarp.

Contributor:

My answer was template-specific but your question was about raws and calibs, sorry. I'm actually not sure, this would be a good question for slack.

class ApPipe(Pipeline):
    """Wrapper for `lsst.ap.pipe` that executes all steps through source
    association.

Contributor:

It might be worthwhile to mention that the steps must be executed in the order presented for the pipeline to work. This is pretty clear from the run method but it can't hurt to say upfront.

Member Author:

I'm not sure I understand. The steps are executed in the order determined by run; who (except somebody modifying that method) would need to care about what happens if you change the order? From the caller's point of view they just call run, and the pipeline runs.

Contributor:

That's true; I was thinking of a very weird case where somebody stumbled across ApPipe, ignored the run method, and thought "I'll just run process and ignore all the rest!" It's fine as-is.


    @property
    def dataset_root(self):
        """The parent directory containing everything related to the dataset.
Contributor:

Are we not assuming that the data_location, calib_location, and template_location must all be in some default set of locations that can be known just by providing dataset_root?

Member Author:

Yes, we are...? I don't think I understand the question. 😕

Contributor:

I think the code doesn't make it immediately clear that dataset_root will know where to find "lower-level" things like data_location, calib_location, etc. But I guess there's probably not another obvious way to define them all, so it's fine.

Member Author:

I'm not sure if this answers your question, but I don't expect ApPipe will need to use dataset_root at all. That's just there as a backup for when an operation really needs to know about the dataset as a whole (currently, it's only used to figure out reruns).


    def make_output_repo(self, output_dir):
        """Set up a directory as an output repository compatible with this dataset.

Contributor:

Does this play nice with reruns? (Asking for a friend who really likes reruns)

Member Author:

Reruns are translated to directories in the verify_ap module, so this class doesn't need to care.
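
The translation being described is, in sketch form (the helper name is hypothetical; the DATASET/rerun/OUTPUT layout comes from the usage message above):

import os


def _get_output_dir(dataset_root, parsed_output, parsed_rerun):
    """Translate the mutually exclusive --output/--rerun arguments into
    a single output directory.
    """
    if parsed_rerun is not None:
        return os.path.join(dataset_root, 'rerun', parsed_rerun)
    return parsed_output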

output.add_argument(
    '--rerun', metavar='OUTPUT',
    type=_FormattedType('[^:]+',
                        'Invalid name "%s"; verify_ap does not support reruns as input.'),
Contributor:

This is confusing because it's an output argument but the message is "verify_ap does not support reruns as input". Confused.

Member Author:

The error message is for people who ignore/forget the fact that it's an output argument. Would it be less confusing if the _FormattedType object were factored into a separate variable?

Contributor:

I think I'm still confused as to what the type argument is catching here. Normally you use it to clarify that the argparse input must be an int, str, float, whatever. But you're trying to catch some kind of user error where they mistakenly think --rerun could be used in lieu of --dataset?

Member Author:

The argparse documentation makes it sound like type is where input validation is supposed to go, given that they imply that it would be the appropriate place for range-checking if choices didn't exist. I agree it feels weird, so maybe I misunderstood.

The error I'm trying to catch is a user saying --rerun INPUT:OUTPUT, which is common(?) when using Tasks. I'm guessing the mindset behind the error would be less "in lieu of --dataset" and more "--dataset specifies the survey but not the file location". (This check was requested by Eric, so maybe that last sentence is completely wrong.)

Contributor:

Oh, OK then! I'd never seen anyone try --rerun INPUT:OUTPUT but now that you say that and I pretend to remember how regexes work I can see what you've done. Thanks for clarifying. So you're really insisting that there not be more than one thing shoved into --rerun. A clearer error message might be Invalid name "%s"; verify_ap supports only one output destination when using --rerun. You have entered something that appears to be of the form INPUT:OUTPUT. Try again and specify only OUTPUT.
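
For anyone else puzzling over this, here is a sketch of what a validator like _FormattedType could look like; the real implementation is not shown in this diff:

import argparse
import re


class _FormattedType(object):
    """An argparse type that accepts only strings matching a regex.

    Parameters
    ----------
    fmt: `str`
        A regular expression the entire argument must match.
    msg: `str`
        An error message template; "%s" is replaced with the rejected value.
    """

    def __init__(self, fmt, msg='Invalid format for "%s".'):
        # anchor the pattern so the whole argument must match;
        # '[^:]+' therefore rejects anything of the form INPUT:OUTPUT
        self._format = re.compile(fmt + '$')
        self._message = msg

    def __call__(self, value):
        if self._format.match(value):
            return value
        raise argparse.ArgumentTypeError(self._message % value)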

The framework supports the current dataset system (see DM-11116). It
has been designed to be easy to modify and extend, as the details of
the verification framework are still being worked out.
@kfindeisen merged commit e0328c2 into master on Jul 28, 2017
@kfindeisen deleted the tickets/DM-11118 branch on November 30, 2018