# Analysis Tools with AP

**Description:** Introduction to generating science performance diagnostic plots and metrics with the [analysis_tools](https://github.com/lsst/analysis_tools) package using an AP dataset

**Contact authors:** Eric Bellm

**Last verified to run:** 

**LSST Science Pipelines version:**

**Container size:** medium

**Targeted learning level:** intermediate

**Skills:** 


## Preliminaries

In [1]:
# Basic imports
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

from pprint import pprint

### Getting set up at USDF

See documentation at
* https://developer.lsst.io/usdf/lsst-login.html
* https://developer.lsst.io/usdf/onboarding.html

### Dataset

We'll use an `ap_verify` repo while we await `/repo/main`.

### Setting up the analysis_tools package

Check the version of the stack you are using

If you are doing development on the `analysis_tools` package and want to test in a notebook, follow the guidance [here](https://nb.lsst.io/science-pipelines/development-tutorial.html). Brief version below (for work on the RSP at USDF):

1. In the terminal, clone the [analysis_tools](https://github.com/lsst/analysis_tools) repo and set up the package:

```
source /opt/lsst/software/stack/loadLSST.bash
setup lsst_distrib

# Choose file location for your repo
cd ~/repos/
git clone https://github.com/lsst/analysis_tools.git
cd analysis_tools
setup -k -r .
scons
```

2. Add the following line to `~/notebooks/.user_setups`

```
setup -k -r ~/repos/analysis_tools
```

Your local version of `analysis_tools` should now be accessible in a notebook.

In [2]:
!eups list -s | grep lsst_distrib

lsst_distrib          g0b29ad24fb+bdc8955c18 	current w_2022_37 setup


In [3]:
!eups list -s | grep analysis_tools

analysis_tools        LOCAL:/sdf/group/rubin/u/ebellm/stack/analysis_tools 	setup


The `analysis_tools` package was added to `lsst_distrib` in August 2022, and accordingly, if you have set up the LSST Stack version `w_2022_32` or later, then you should be able to import `analysis_tools` directly in the notebook.
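
Before importing, it can be useful to check which copy of a package Python would pick up. The following is a minimal sketch using only the standard library; `locate_package` is a hypothetical helper written for this illustration and is not part of the LSST stack.

```python
import importlib.util


def locate_package(name):
    """Return the file a package would be imported from, or None if it is unavailable.

    Hypothetical helper for illustration only.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `lsst`) is not installed at all.
        return None
    return None if spec is None else spec.origin


# In an environment with the stack set up, this should point at either the
# lsst_distrib copy or your local clone; elsewhere it prints None.
print(locate_package("lsst.analysis.tools"))
```

If the printed path is not under your local clone, the `setup -k -r` line in `~/notebooks/.user_setups` has probably not taken effect for the running kernel.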

In [4]:
import lsst.analysis.tools
print(lsst.analysis.tools.__file__)

/sdf/group/rubin/u/ebellm/stack/analysis_tools/python/lsst/analysis/tools/__init__.py


## Generating consistent metric values and visualizations

### Load data for testing

In [5]:
from lsst.daf.butler import Butler

repo = "/sdf/group/rubin/u/ebellm/workspace/cosmos/repo"
collection = "ap_verify-output"

butler = Butler(repo, collections=[collection])
registry = butler.registry

In [6]:
# Display the available dataset types
for d in sorted(registry.queryDatasetTypes()):
    print(d.name)

apFakesCompletenessMag20t22_config
apFakesCompletenessMag20t22_log
apFakesCompletenessMag20t22_metadata
apFakesCompletenessMag22t24_config
apFakesCompletenessMag22t24_log
apFakesCompletenessMag22t24_metadata
apFakesCompletenessMag24t26_config
apFakesCompletenessMag24t26_log
apFakesCompletenessMag24t26_metadata
apFakesCountMag20t22_config
apFakesCountMag20t22_log
apFakesCountMag20t22_metadata
apFakesCountMag22t24_config
apFakesCountMag22t24_log
apFakesCountMag22t24_metadata
apFakesCountMag24t26_config
apFakesCountMag24t26_log
apFakesCountMag24t26_metadata
apFakesCount_config
apFakesCount_log
apFakesCount_metadata
apdb_marker
bfKernel
bias
calexp
calexpBackground
calibrate_config
calibrate_log
calibrate_metadata
camera
characterizeImage_config
characterizeImage_log
characterizeImage_metadata
coaddFakes_config
coaddFakes_log
coaddFakes_metadata
createFakes_config
createFakes_log
createFakes_metadata
dark
deepCoadd
deepDiff_diaSrc
deepDiff_diaSrc_schema
deepDiff_differenceExp
deepDiff_diff

In [7]:
sorted(registry.queryDatasets("fakes_deepDiff_assocDiaSrc"))

[DatasetRef(DatasetType('fakes_deepDiff_assocDiaSrc', {band, instrument, detector, physical_filter, visit}, DataFrame), {instrument: 'HSC', detector: 50, visit: 59150, ...}, id=d1ecf491-2fad-4cb6-80a7-2a677c33446c, run='ap_verify-output/20220831T200238Z'),
 DatasetRef(DatasetType('fakes_deepDiff_assocDiaSrc', {band, instrument, detector, physical_filter, visit}, DataFrame), {instrument: 'HSC', detector: 51, visit: 59160, ...}, id=6aa78ba7-fd51-4469-b150-37b87cc42787, run='ap_verify-output/20220831T200238Z')]

In [8]:
sorted(registry.queryDatasets("visitSsObjects"))

[DatasetRef(DatasetType('visitSsObjects', {band, instrument, physical_filter, visit}, DataFrame), {instrument: 'HSC', visit: 59150, ...}, id=3bcbc645-6ae2-4c0d-be68-ad8ad8c3803f, run='sso/cached'),
 DatasetRef(DatasetType('visitSsObjects', {band, instrument, physical_filter, visit}, DataFrame), {instrument: 'HSC', visit: 59160, ...}, id=87297025-294e-459b-9feb-4ef2a60ff873, run='sso/cached')]

In [9]:
dataset_refs = registry.queryDatasets("fakes_deepDiff_assocDiaSrc")

In [10]:
assocDiaSources = []
for ref in dataset_refs:
    assocDiaSource = butler.getDirect(ref)
    assocDiaSources.append(assocDiaSource)

df = pd.concat(assocDiaSources)

In [12]:
df

diaObjectId (index),filterName (index),diaSourceId (index),diaSourceId,ccdVisitId,filterName,diaObjectId,ssObjectId,parentDiaSourceId,midPointTai,bboxSize,flags,ra,...,isDipole,totFlux,totFluxErr,ixx,iyy,ixy,ixxPSF,iyyPSF,ixyPSF,programId
25404838930022778,g,25404838930022778,25404838930022778,11830050,g,25404838930022778,0,0,57454.309943,33,31162412.0,149.763291,...,False,27997.194310,44.926085,,,,0.106419,0.006454,0.006454,0
25404838930022779,g,25404838930022779,25404838930022779,11830050,g,25404838930022779,0,0,57454.309943,31,31064396.0,149.763497,...,False,19.508084,21.007758,,,,0.106480,0.006458,0.006458,0
25404838930022780,g,25404838930022780,25404838930022780,11830050,g,25404838930022780,0,0,57454.309943,19,25166156.0,149.764042,...,False,-111.842737,20.839950,,,,0.106483,0.006458,0.006458,0
25404838930022781,g,25404838930022781,25404838930022781,11830050,g,25404838930022781,0,0,57454.309943,25,25166156.0,149.764071,...,False,-84.233869,20.830093,,,,0.106484,0.006458,0.006458,0
25404838930022782,g,25404838930022782,25404838930022782,11830050,g,25404838930022782,0,0,57454.309943,46,332.0,149.764753,...,False,2106.014527,23.239308,0.061968,0.274033,-0.012331,0.106488,0.006458,0.006458,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25409136044802572,g,25409136044802572,25409136044802572,11832051,g,25409136044802572,0,0,57454.329282,19,31162732.0,149.963679,...,False,40.359542,41.123614,,,,0.106425,0.018253,0.018253,0
25409136044802573,g,25409136044802573,25409136044802573,11832051,g,25409136044802573,0,0,57454.329282,17,31163756.0,149.963675,...,False,66.989946,41.675299,,,,0.106416,0.018251,0.018251,0
25409136044802574,g,25409136044802574,25409136044802574,11832051,g,25409136044802574,0,0,57454.329282,18,31162700.0,149.963780,...,False,71.910864,41.623322,,,,0.106400,0.018248,0.018248,0
25409136044802575,g,25409136044802575,25409136044802575,11832051,g,25409136044802575,0,0,57454.329282,31,31162844.0,149.963873,...,False,92.134672,40.750230,,,,0.106355,0.018241,0.018241,0


In [13]:
np.sum(df['ssObjectId'] > 0)

1

In [15]:
# So we can simply count the rows with ssObjectId > 0 to get the number of associated diaSources

### Terminology

Data Types
* Scalar - Something that is number like (int, float, numpy.float32 etc.)
* Vector - Something that is ndarray like
* KeyedData - Anything that is indexed by a string that can return a Vector, or Scalar

Analysis Structures
* ConfigurableAction - generic interface for function like objects (actions) that have state which can be set during configuration
* AnalysisAction - A ConfigurableAction subclass that is specialized for actions that function in analysis contexts
* AnalysisTool - A top level "container" of multiple AnalysisActions which performs one type of analysis

Below we dive into the latter two in more detail.
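
To make the three data types listed above concrete, here is a rough plain-Python sketch. The type aliases and the `describe` helper are hypothetical stand-ins written for this illustration; they are not the definitions used inside `analysis_tools`, which works with ndarray-like vectors.

```python
from typing import Mapping, Sequence, Union

# Hypothetical stand-ins for the analysis_tools data types (illustration only).
Scalar = Union[int, float]
Vector = Sequence[float]  # the real package expects ndarray-like objects
KeyedData = Mapping[str, Union[Scalar, Vector]]


def describe(data: KeyedData) -> dict:
    """Label each entry of a KeyedData-like mapping as a 'Scalar' or a 'Vector'."""
    kinds = {}
    for key, value in data.items():
        kinds[key] = "Scalar" if isinstance(value, (int, float)) else "Vector"
    return kinds


# A dict of columns behaves like KeyedData: string keys returning vectors or scalars.
example = {
    "ssObjectId": [0, 0, 4362913742843557],
    "nSources": 3,
}
print(describe(example))
```

A pandas DataFrame fits the same description, since indexing it with a column name returns a vector-like Series.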

### Using AnalysisActions

* These are the atomic bits of analysis_tools; they can be combined to make more complex actions, or used as part of an AnalysisTool
* Show some examples of using configurable actions like standalone functions. This is intended to give users more intuition about how configurable actions work.
* Examples with KeyedDataActions, VectorActions (including selectors), and ScalarActions
* Show examples of configuration

In [14]:
from lsst.analysis.tools.actions.vector import DownselectVector, ThresholdSelector
from lsst.analysis.tools.actions.scalar import CountAction
#from lsst.analysis.tools.actions.vector import DownselectVector

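Let's create an action that selects the nonzero `ssObjectId`s. The selector below keeps values greater than 1; because the IDs in this dataset are either 0 or very large integers, this is equivalent to keeping the nonzero values.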

In [15]:
downselector = DownselectVector(
    vectorKey='ssObjectId',
    selector=ThresholdSelector(op='gt', threshold=1, vectorKey='ssObjectId'),
)

In [16]:
# Inspect the configuration of this object.
pprint(downselector.toDict())

{'selector': {'op': 'gt', 'threshold': 1.0, 'vectorKey': 'ssObjectId'},
 'vectorKey': 'ssObjectId'}


In [17]:
# Inspect the required input schema; only the ssObjectId column is needed
# (it appears twice: once for the downselector and once for its selector)
list(downselector.getInputSchema())

[('ssObjectId', numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]]),
 ('ssObjectId', numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]])]

In [18]:
ssdf = downselector(df)
print(ssdf)

diaObjectId  filterName  diaSourceId      
0            g           25409136044802436    4362913742843557
Name: ssObjectId, dtype: int64


In [19]:
ct = CountAction(vectorKey='ssObjectId')

In [20]:
pprint(ct.toDict())

{'vectorKey': 'ssObjectId'}


In [21]:
list(ct.getInputSchema())

[('ssObjectId', numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]])]

In [22]:
# The downselector returned a Series, but the action expects a DataFrame
ct(pd.DataFrame(ssdf))

1
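
The select-then-count pattern used above can be mimicked with a few plain-Python classes, which may help build intuition for what "configurable action" means: state is set at construction (configuration) time, and the object is then called like a function on keyed data. The `Toy*` classes below are hypothetical and only imitate the behaviour seen in this notebook; they are not the `analysis_tools` implementations.

```python
import operator
from dataclasses import dataclass


@dataclass
class ToyThresholdSelector:
    """Toy stand-in: builds a boolean mask by comparing a vector to a threshold."""
    vectorKey: str
    op: str  # name of a function in the `operator` module, e.g. "gt" or "ge"
    threshold: float

    def __call__(self, data):
        compare = getattr(operator, self.op)
        return [compare(value, self.threshold) for value in data[self.vectorKey]]


@dataclass
class ToyDownselectVector:
    """Toy stand-in: keeps the elements of a vector where the selector mask is True."""
    vectorKey: str
    selector: ToyThresholdSelector

    def __call__(self, data):
        mask = self.selector(data)
        return [value for value, keep in zip(data[self.vectorKey], mask) if keep]


@dataclass
class ToyCountAction:
    """Toy stand-in: returns the length of a vector."""
    vectorKey: str

    def __call__(self, data):
        return len(data[self.vectorKey])


toy_data = {"ssObjectId": [0, 0, 4362913742843557, 0]}
selector = ToyThresholdSelector(vectorKey="ssObjectId", op="gt", threshold=0)
downselect = ToyDownselectVector(vectorKey="ssObjectId", selector=selector)

selected = downselect(toy_data)
count = ToyCountAction(vectorKey="ssObjectId")({"ssObjectId": selected})
print(selected, count)
```

As in the cells above, the downselected vector has to be wrapped back into a keyed container before the counting action can look it up by name.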

## Make a custom analysis tool

Let's take a look at how a metric is implemented before creating a new AnalysisTool.

As discussed above, AnalysisTools (AnalysisMetric is a specialized subclass) are container classes for AnalysisActions.

To make deploying metrics and plots as easy as possible, Analysis(Metric/Plot)s contain default AnalysisActions that enable new AnalysisTools to be created by simply setting configuration.

The default prep action allows specifying keys to load from the input data and applying selectors. When an AnalysisTool is called directly, the required keys can be set automatically.

The default process action has three stages:
* buildActions - actions which build derived data
* filterActions - if derived data needs to be filtered, it can be done here
* calculateActions - any final calculations that need to be done can be put here

These stages run sequentially, and any stage may be left with no actions to run.

AnalysisMetrics have a default produce action which allows values produced in process to be mapped to lsst.verify.Measurements.

The produce stage of AnalysisPlots does not have a default, as this is where the plot to produce is set.
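
The sequential build, filter, and calculate stages described above can be sketched in plain Python. This is a toy model under stated assumptions: `run_process` and the way results are merged between stages are hypothetical and chosen for illustration; the real process action in `analysis_tools` may differ in detail.

```python
def run_process(data, build_actions, filter_actions, calculate_actions):
    """Run three stages in order; each stage is a dict mapping a name to a callable.

    Toy model only: every action in a stage sees the results accumulated so far,
    and its output is stored under the action's name.
    """
    results = dict(data)  # copy so the input mapping is not modified
    for stage in (build_actions, filter_actions, calculate_actions):
        outputs = {name: action(results) for name, action in stage.items()}
        results.update(outputs)
    return results


toy_data = {"ssObjectId": [0, 4362913742843557, 0]}

# build: derive a boolean mask; filter: apply it; calculate: reduce to a scalar
build = {"isSso": lambda d: [value >= 1 for value in d["ssObjectId"]]}
filters = {
    "allSSOs": lambda d: [v for v, keep in zip(d["ssObjectId"], d["isSso"]) if keep]
}
calculate = {"NumSsObjectsMetric": lambda d: len(d["allSSOs"])}

results = run_process(toy_data, build, filters, calculate)
print(results["NumSsObjectsMetric"])

# A stage with no actions is simply skipped.
skipped = run_process(toy_data, build, {}, {})
```

The name under which an action is registered doubles as the key of its output, which is why a later stage can refer to `allSSOs` by name.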

In [23]:
from lsst.analysis.tools.interfaces import AnalysisMetric

from lsst.analysis.tools.actions.vector import DownselectVector, ThresholdSelector
from lsst.analysis.tools.actions.scalar import CountAction

import astropy.units as u

class NumSsObjectsMetric(AnalysisMetric):
    def setDefaults(self):
        super().setDefaults()
        
        self.process.buildActions.thresholdSelector = ThresholdSelector()
        
        # select nonzero ssObjectId
        self.process.buildActions.thresholdSelector.vectorKey = "ssObjectId"
        self.process.buildActions.thresholdSelector.op = "ge"
        self.process.buildActions.thresholdSelector.threshold = 1
       
        # the final name in the qualification is used as a key to insert
        # the calculation into KeyedData
        self.process.filterActions.allSSOs = DownselectVector(vectorKey="ssObjectId",
                                                              selector=self.process.buildActions.thresholdSelector)
        
        self.process.calculateActions.NumSsObjectsMetric = CountAction(vectorKey="allSSOs")

        
        # tell the metric what the units are for the quantity (count, an astropy quantity); needs to be a string
        self.produce.units = {"NumSsObjectsMetric": "ct"}
        
        # Rename the quantity prior to producing the Metric
        # (useful for reusable workflows that set a name toward the end of computation)
        #self.produce.newNames = {"medianValueName": "NumSsObjectsMetric"}

# Run the metric directly on the DataFrame loaded above
metric = NumSsObjectsMetric()(df)
print(metric)

{'NumSsObjectsMetric': Measurement('NumSsObjectsMetric', <Quantity 1. ct>)}


## Workflow examples

### Running analysis_tools as part of a pipeline

* **All examples in this notebook should use the simple pipeline executor** (here is how you do it in a notebook)
* We have a PipelineTask for each data product. A task can run multiple AnalysisTools that each produce a set of plots or set of metrics and are subclasses of AnalysisPipelineTask.
* Discuss an example yaml pipeline file (load the yaml)
* Provide the command to run the pipeline
* Show how to configure the pipeline, e.g., turning on or off different metrics and plots or changing other parameters

## Pipeline
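This is a copy of the coaddQualityCore pipeline in analysis_tools, reproduced here as an example of the pipeline YAML format. Note that the cell below runs a different pipeline, `apCcdVisitQualityCore.yaml`.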
```
description: |
  Tier1 plots and metrics to assess coadd quality
tasks:
  analyzeObjectTableCore:
    class: lsst.analysis.tools.tasks.ObjectTableTractAnalysisTask
    config:
      connections.outputName: objectTableCore
      plots.shapeSizeFractionalDiffScatter: ShapeSizeFractionalDiffScatterPlot
      metrics.shapeSizeFractionalMetric: ShapeSizeFractionalMetric
      plots.e1DiffScatter: E1DiffScatterPlot
      metrics.e1DiffScatterMetric: E1DiffMetric
      plots.e2DiffScatter: E2DiffScatterPlot
      metrics.e2DiffScatterMetric: E2DiffMetric
      metrics.skyFluxStatisticMetric: SkyFluxStatisticMetric
      metrics.skyFluxStatisticMetric.applyContext: CoaddContext
      python: |
        from lsst.analysis.tools.analysisPlots import *
        from lsst.analysis.tools.analysisMetrics import *
        from lsst.analysis.tools.contexts import *
  catalogMatchTract:
    class: lsst.analysis.tools.tasks.catalogMatch.CatalogMatchTask
    config:
      bands: ['u', 'g', 'r', 'i', 'z', 'y']
  refCatObjectTract:
    class: lsst.analysis.tools.tasks.refCatObjectAnalysis.RefCatObjectAnalysisTask
    config:
      bands: ['u', 'g', 'r', 'i', 'z', 'y']
```

In [6]:
from lsst.ctrl.mpexec import SimplePipelineExecutor
from lsst.pipe.base import Pipeline

# set up an output collection with your username
outputCollection = "u/ebellm/analysisToolsExample"

# this can be skipped if you already have a read-write butler set up (the one above is read-only)
butlerRW = SimplePipelineExecutor.prep_butler(repo, inputs=[collection], output=outputCollection)

# load in the pipeline to run
pipeline = Pipeline.from_uri("$ANALYSIS_TOOLS_DIR/pipelines/apCcdVisitQualityCore.yaml")

# override a configuration within a certain AnalysisTool
#configKey = "plots.shapeSizeFractionalDiffScatter.prep.selectors.snSelector.threshold"
#pipeline.addConfigOverride("analyzeObjectTableCore", configKey, 400)

#bands = ['g', 'r', 'i', 'z']
#pipeline.addConfigOverride("analyzeObjectTableCore", "bands", bands)
#pipeline.addConfigOverride("catalogMatchTract", "bands", bands)
#pipeline.addConfigOverride("refCatObjectTract", "bands", bands)

# restrict processing to the same dataId used above
#whereString = "tract = 9813 AND skymap = 'hsc_rings_v1'"
whereString = ""

# Prevent the executor from dumping plots into the notebook
backend_ =  mpl.get_backend() 
mpl.use("Agg")

executor = SimplePipelineExecutor.from_pipeline(pipeline, where=whereString, butler=butlerRW)
quanta = executor.run(True)

# Restore the ability for plots to be put into the notebook
mpl.use(backend_)

# If you only want to run one plot in a task in the pipeline do the following prior to execution
# pipeline.addConfigOverride("analyzeObjectTableCore", "plots", None)
# pipeline.addConfigOverride("analyzeObjectTableCore", "plots", ShapeSizeFractionalDiffScatterPlot)
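
The dotted keys passed to `addConfigOverride` in the commented-out lines above address fields nested inside a task's configuration. As a rough mental model only, the sketch below applies a dotted-key override to a nested dictionary. `apply_override`, the `examplePlot` name, and the threshold values are all hypothetical; the real configuration objects are not plain dictionaries.

```python
def apply_override(config, dotted_key, value):
    """Set a nested field addressed by a dotted key, e.g. "a.b.c".

    Hypothetical helper for illustration; raises KeyError if the path does not exist,
    so that a typo in the key is not silently accepted.
    """
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node[name]
    if leaf not in node:
        raise KeyError(f"unknown config field: {dotted_key}")
    node[leaf] = value


# Toy configuration tree with made-up names and values.
toy_config = {
    "plots": {
        "examplePlot": {
            "prep": {"selectors": {"snSelector": {"threshold": 100}}},
        }
    }
}

apply_override(toy_config, "plots.examplePlot.prep.selectors.snSelector.threshold", 400)
print(toy_config["plots"]["examplePlot"]["prep"]["selectors"]["snSelector"])
```

Rejecting unknown paths rather than creating them is a deliberate choice in this sketch, since a misspelled override that is quietly ignored is hard to debug.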

### Inspect the results

In [7]:
# refresh our read-only butler to see the changes made. (it's generally a
# good idea to work on read-only things)
butler.registry.refresh()

# see what datasets exist; there should now be diaSourceTableCore_metrics
#for d in sorted(butler.registry.queryDatasetTypes()): print(d.name)

# get the metric that was written
refs = sorted(butler.registry.queryDatasets("diaSourceTableCore_metrics", collections=outputCollection))
refs[1].dataId
diaSourceTable_metrics = butler.get("diaSourceTableCore_metrics", dataId=refs[1].dataId, collections=outputCollection)
pprint(diaSourceTable_metrics)

{'numSsObjects': [Measurement('NumSsObjectsMetric', <Quantity 1. ct>)]}


### Reconstruct the inputs to an analysis_tool

Analysis(Tools/Actions) allow the exact state of an AnalysisTool to be saved into the butler when a pipeline is run. This allows a user to 'reconstruct' things as they were when the tools were executed, which aids in debugging and deep dives into the data.

Below is an example of reconstructing one of the tasks that was run in the Pipeline above.

In [35]:
from lsst.analysis.tools.tasks.reconstructor import reconstructAnalysisTools

# Read in just one task
label = "analyzeObjectTableCore"
taskState, inputData = reconstructAnalysisTools(butler, collection=outputCollection, label=label, dataId=dataId, callback=None)
pprint(taskState)
pprint(inputData)

LookupError: Dataset analyzeObjectTableCore_config with data ID {} could not be found in collections ('u/bechtol/analysisToolsExample',).

Notice that we have access to the object table used to produce the diagnostics.

In [None]:
inputData["data"]

We can now reproduce diagnostic plots.

In [None]:
# Rerun one of the plots
taskState.plots.shapeSizeFractionalDiffScatter(
    inputData['data'],
    band='i',
    skymap=None,
    plotInfo=_StandinPlotInfo()
)

# change some configuration to see the differences
taskState.plots.shapeSizeFractionalDiffScatter.prep.selectors.snSelector.threshold = 50
taskState.plots.shapeSizeFractionalDiffScatter(
    inputData['data'],
    band='i',
    skymap=None,
    plotInfo=_StandinPlotInfo()
)

# This could be run in stages like the above example to investigate issues