# Pre-bootcamp exercises: accessing data products via butler

**Description:** Demonstrate how to generate science performance diagnostic plots and metrics with the [analysis_tools](https://github.com/lsst/analysis_tools) package using a small test dataset from HSC, [rc2_subset](https://github.com/lsst/rc2_subset).

**Contact authors:** Keith Bechtol, Nate Lust

**Last verified to run:** 2023-05-02

**LSST Science Pipelines version:** w_2023_17

**Container Size:** Medium (or larger)

**Location:** This notebook points to files on the S3DF cluster at the USDF. Update paths accordingly if you are running elsewhere.

**Skills:** 
- Load source and object tables using the Butler.
- Generate a science performance diagnostic plot and corresponding metric values interactively in a notebook and as part of a pipeline (simple pipeline executor). 
- Adjust the configuration used to produce these diagnostics. 
- Retrieve persisted plots and metrics with the Butler.
- Reconstitute input data products that were used to create plots and metrics for further investigation.


For a quicker introduction, use an existing sandbox repo (prepared for this exercise) to bypass data reduction steps and go straight to data access via the Butler. If you want the full experience, run the data reduction steps in `process_rc2_subset.sh` and then point the Butler to your own repo.

## Preliminaries

In [None]:
# Basic imports
import os  # used later to build user-specific paths and collection names

import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt

from pprint import pprint

### Getting set up at USDF

The USDF is hosted on the S3DF cluster at SLAC. This notebook has been verified to run on the S3DF cluster.

See USDF documentation at
* https://developer.lsst.io/usdf/lsst-login.html
* https://developer.lsst.io/usdf/onboarding.html

### Processing rc2_subset

[rc2_subset](https://github.com/lsst-dm/rc2_subset) is a small dataset with just enough Hyper Suprime-Cam (HSC) exposures to compute a set of meaningful science performance metrics.

The LSST Science Pipelines [Getting Started tutorial](https://pipelines.lsst.io/#getting-started) provides a guided tour of data processing using rc2_subset as an example.

For convenience, there is a shell script `process_rc2_subset.sh` in the same directory as this notebook that shows the commands to process rc2_subset on the USDF.

### Setting up the analysis_tools package

Check the version of the stack you are using.

In [None]:
!eups list -s | grep lsst_distrib

The `analysis_tools` package was added to `lsst_distrib` in August 2022, and accordingly, if you have set up the LSST Stack version `w_2022_32` or later, then you should be able to import `analysis_tools` directly in the notebook.

See the header for this notebook for the most recent verified Science Pipelines version for this tutorial.

In [None]:
import lsst.analysis.tools
print(lsst.analysis.tools.__file__)

**Additional background information:** If you are doing development on the `analysis_tools` package and want to test in a notebook, follow the guidance [here](https://nb.lsst.io/science-pipelines/development-tutorial.html). Brief version below (for work on the RSP at USDF):

1. In the terminal, clone the [analysis_tools](https://github.com/lsst/analysis_tools) repo and set up the package.

```
source /opt/lsst/software/stack/loadLSST.bash
setup lsst_distrib

# Choose file location for your repo
cd ~/repos/
git clone https://github.com/lsst/analysis_tools.git
cd analysis_tools
setup -k -r .
scons
```

2. Add the following line to `~/notebooks/.user_setups`

```
setup -k -r ~/repos/analysis_tools
```

Your local version of `analysis_tools` should now be accessible in a notebook.

## Load data for testing

In [None]:
import lsst.daf.butler as dafButler

Point to a shared sandbox instance of the processed rc2_subset or point to your own instance. 

**Note:** For one of the sections later in the notebook that shows how to run analysis_tools as part of a pipeline, you need to point to your own processed instance of rc2_subset. That section can be skipped if you are pointing to the shared sandbox repo.

In [None]:
# Point to existing sandbox repo if you prefer to skip processing steps
collections = ['u/bechtol']
repo = '/sdf/group/rubin/user/bechtol/bootcamp_2023/rc2_subset/SMALL_HSC/'

# User instance of the repo if you have processed rc2_subset yourself
#collections = ['u/%s'%os.environ['USER']]
#repo = '/sdf/group/rubin/user/%s/bootcamp_2023/rc2_subset/SMALL_HSC/'%(os.environ['USER'])

In [None]:
butler = dafButler.Butler(repo, collections=collections)
registry = butler.registry

Check what dataset types are present in the collection. Note that the cell below will show _only the dataset types that are present in the specified collection_, not all of the possible dataset types that are known to the butler registry.

In [None]:
for datasetType in registry.queryDatasetTypes():
    if registry.queryDatasets(datasetType, collections=collections).any(execute=False, exact=False):
        print(datasetType)

### Object tables

The examples below use an object table

In [None]:
refs = sorted(registry.queryDatasets("objectTable_tract"))
print(len(refs))

In [None]:
refs[0].dataId

In [None]:
objectTable = butler.get(refs[0])
objectTable

In [None]:
objectTable.columns.values

### Source tables

In [None]:
refs = sorted(registry.queryDatasets("sourceTable_visit"))

In [None]:
for ref in refs:
    print(ref.dataId.full)

In [None]:
sourceTable = butler.get(refs[-1])
sourceTable

In [None]:
sourceTable.columns.values

## Run `analysis_tools` interactively

Run an `AnalysisTool` interactively in a notebook by passing in-memory data inputs to create metrics and diagnostic plots.

In this example, we compute PSF model size residuals relative to the observed PSF size.

In [None]:
from lsst.analysis.tools.atools import ShapeSizeFractionalDiff
from lsst.analysis.tools.interfaces._task import _StandinPlotInfo
from lsst.analysis.tools.interfaces._actions import NoPlot

In [None]:
atool = ShapeSizeFractionalDiff()
atool.produce.plot.addSummaryPlot = False

# Do not produce plot; only metric values
#atool.produce.plot = NoPlot() 

# This helps simplify some of the configuration
# by ensuring that appropriate keys are set to 
# load columns that are needed in later steps. 
# This happens automatically when an AnalysisTool 
# is used as a single unit.
atool.populatePrepFromProcess() # Needed to run 

Notice that the returned metric values match summary statistics displayed on the plot

In [None]:
results = atool(objectTable, band='i', skymap=None, plotInfo=_StandinPlotInfo())
results
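
To make the plotted quantity concrete, here is a minimal, package-independent sketch of a fractional size residual. It assumes the size is the determinant radius of the second moments and that the residual is `(source - PSF model) / PSF model`; both are assumptions made for illustration, and the definition actually used by `ShapeSizeFractionalDiff` is set by its configuration. The function names and moment values are hypothetical.

```python
def determinant_radius(ixx, iyy, ixy):
    """Size from second moments, assumed here to be (Ixx*Iyy - Ixy^2)^(1/4)."""
    return (ixx * iyy - ixy**2) ** 0.25


def fractional_size_diff(src_moments, psf_moments):
    """Fractional difference between a source size and the PSF model size.

    The sign and normalization are assumptions of this sketch.
    """
    src = determinant_radius(*src_moments)
    psf = determinant_radius(*psf_moments)
    return (src - psf) / psf


# Hypothetical second moments (Ixx, Iyy, Ixy) in pixel^2
psf_model = (4.0, 4.0, 0.0)
matching_star = (4.0, 4.0, 0.0)
larger_star = (4.41, 4.41, 0.0)

residual_match = fractional_size_diff(matching_star, psf_model)
residual_large = fractional_size_diff(larger_star, psf_model)
print(residual_match, residual_large)
```

A star whose moments match the PSF model gives a residual of zero, while a star that is about 5% larger in radius gives a residual of about 0.05.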

## ConfigurableActions are the atomic bits of `analysis_tools`

We introduce core concepts of the `analysis_tools` package, starting with the idea of [ConfigurableAction](https://pipelines.lsst.io/v/weekly/modules/lsst.pex.config/overview.html#specialized-config-subclasses)s, or actions for short.

### Terminology

Data Types (many of the [actions in analysis_tools](https://github.com/lsst/analysis_tools/tree/main/python/lsst/analysis/tools/actions) are grouped according to the resultant data type)
* `Scalar`: Something that is number like (int, float, numpy.float32 etc.)
* `Vector`: Something that is ndarray like
* `KeyedData`: Anything indexed by a string that returns a Vector or a Scalar
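
As a rough mental model of these three data types, the sketch below uses plain Python type aliases. These aliases are simplified stand-ins written for this illustration, not the definitions used inside `analysis_tools`, and the `describe` helper is hypothetical.

```python
from typing import Mapping, Sequence, Union

# Simplified stand-ins for the package's data types (assumption)
Scalar = Union[int, float]
Vector = Sequence[float]
KeyedData = Mapping[str, Union[Vector, Scalar]]


def describe(data: KeyedData) -> dict:
    """Label each entry of a KeyedData-like mapping as 'Scalar' or 'Vector'."""
    kinds = {}
    for key, value in data.items():
        kinds[key] = "Scalar" if isinstance(value, (int, float)) else "Vector"
    return kinds


# A plain dictionary already behaves like KeyedData
catalog = {"psfFlux": [10.0, 20.0, 30.0], "nSources": 3}
kinds = describe(catalog)
print(kinds)
```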

Analysis Structures
* `ConfigurableAction`: generic interface for function-like objects (actions) that have state which can be set during configuration
* `AnalysisAction`: A ConfigurableAction subclass that is specialized for actions that function in analysis contexts
* `AnalysisTool`: A top level "container" of multiple AnalysisActions which performs one type of analysis

We dive into the latter two in more detail below.

### Using AnalysisActions

- Configurable `AnalysisActions` are the atomic bits of `analysis_tools`. They can be combined to make more complex actions, or used as part of an AnalysisTool.
- The examples below use configurable actions like standalone functions to provide intuition for how they work.
- The examples cover KeyedDataActions, VectorActions (including selectors), and ScalarActions.
- The examples also show how actions are configured.

Let's use actions to compute the measured PSF size for a set of stars from an object catalog.

In [None]:
from lsst.analysis.tools.actions.vector import CalcShapeSize, MagColumnNanoJansky

In [None]:
sizeCalculator = CalcShapeSize()

In [None]:
# Inspect the configuration of this object.
pprint(sizeCalculator.toDict())

In [None]:
# Inspect the required input schema, notice that we will need to provide the band information
sizeCalculator.getInputSchema()

In [None]:
size = sizeCalculator(objectTable, band='i')
print(size)

In [None]:
# Another example, this time to convert fluxes to magnitudes
mag = MagColumnNanoJansky(vectorKey='{band}_psfFlux')(objectTable, band='i')

# Notice that the line above is equivalent to the following
mag_alternate = MagColumnNanoJansky(vectorKey='i_psfFlux')(objectTable)

assert np.allclose(mag, mag_alternate, equal_nan=True)
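
The equivalence above comes from treating the key as a format template that is filled in from the keyword arguments. The sketch below imitates that behavior with plain Python, together with the standard conversion from a flux in nanojansky to an AB magnitude (zero point of approximately 31.4). The function names are hypothetical, and returning NaN for non-positive fluxes is an assumption of this sketch rather than a statement about `MagColumnNanoJansky`.

```python
import math


def nanojansky_to_ab_mag(flux_njy):
    """AB magnitude for a flux in nJy; NaN for non-positive flux (assumption)."""
    if flux_njy <= 0:
        return float("nan")
    return -2.5 * math.log10(flux_njy) + 31.4


def mag_column(table, vector_key, **kwargs):
    """Fill the key template from kwargs, then convert that column."""
    key = vector_key.format(**kwargs)
    return [nanojansky_to_ab_mag(flux) for flux in table[key]]


# Hypothetical table with a single flux column
table = {"i_psfFlux": [1.0e3, 1.0e5, -5.0]}

templated = mag_column(table, "{band}_psfFlux", band="i")
explicit = mag_column(table, "i_psfFlux")
print(templated[:2], explicit[:2])
```

Both calls read the same column, so the finite magnitudes agree; the negative flux maps to NaN in either case.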

In [None]:
plt.figure()
plt.scatter(mag, size, s=1)
plt.xlim(17.5, 30.)
plt.ylim(0, 5)
plt.xlabel('mag')
plt.ylabel('size')

Let's remake that simple plot, now selecting only the stars.

In [None]:
from lsst.analysis.tools.actions.vector import StarSelector

In [None]:
star_selection = StarSelector()(objectTable, band='i')

In [None]:
plt.figure()
plt.scatter(mag[star_selection], size.values[star_selection], s=1)
plt.xlim(17.5, 30.)
plt.ylim(0, 5)
plt.xlabel('mag')
plt.ylabel('size')

We can chain together `AnalysisAction`s, as in the following example that produces an equivalent plot. The `analysis_tools` package frequently uses this approach of chaining together `AnalysisAction`s.

In [None]:
from lsst.analysis.tools.actions.vector import DownselectVector
from lsst.analysis.tools.actions.keyedData import AddComputedVector

band = "i"
objectTableDemo = AddComputedVector(action=CalcShapeSize(), keyName="size")(objectTable, band=band)

# Note that objectTableDemo is now a python dictionary instead of a pandas table, but since
# both "quack" like KeyedData they can be used interchangeably

objectTableDemo = AddComputedVector(
    action=MagColumnNanoJansky(vectorKey='{band}_psfFlux'),
    keyName="mag"
)(objectTableDemo, band=band)

size = DownselectVector(vectorKey="size", selector=StarSelector())(objectTableDemo, band=band)
mag = DownselectVector(vectorKey="mag", selector=StarSelector())(objectTableDemo, band=band)

plt.figure()
plt.scatter(mag, size, s=1)
plt.xlim(17.5, 30.)
plt.ylim(0, 5)
plt.xlabel('mag')
plt.ylabel('size')
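
The chaining pattern in the cell above can be pictured with ordinary functions. The sketch below is a simplified analogy, not the package's implementation: `add_computed_vector` and `downselect_vector` are hypothetical helpers that mimic the roles of `AddComputedVector` and `DownselectVector`, and the column name `i_flux` is made up.

```python
def add_computed_vector(data, key_name, action, **kwargs):
    """Return a new dict with the result of action(data) stored under key_name."""
    result = dict(data)
    result[key_name] = action(data, **kwargs)
    return result


def downselect_vector(data, vector_key, selector, **kwargs):
    """Keep only the entries of data[vector_key] where the selector is True."""
    mask = selector(data, **kwargs)
    return [value for value, keep in zip(data[vector_key], mask) if keep]


def double_flux(data, **kwargs):
    """Toy vector action: multiply the band's flux by two."""
    return [2.0 * flux for flux in data["{band}_flux".format(**kwargs)]]


def bright_selector(data, **kwargs):
    """Toy selector: boolean mask for fluxes above 15."""
    return [flux > 15.0 for flux in data["{band}_flux".format(**kwargs)]]


table = {"i_flux": [10.0, 20.0, 30.0]}
table = add_computed_vector(table, "doubled", double_flux, band="i")
selected = downselect_vector(table, "doubled", bright_selector, band="i")
print(selected)
```

Each step takes KeyedData-like input and returns either new KeyedData or a vector, which is what lets the steps be composed in any order that respects the keys they need.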

### Actions as a generic interface for data
Actions are not restricted to tables or products loaded from the butler. `KeyedData` can also be something as simple as a dictionary of numpy arrays.

In [None]:
import lsst.analysis.tools.actions
dir(lsst.analysis.tools.actions)

In [None]:
from lsst.analysis.tools.actions.scalar import StdevAction

# create some KeyedData
data = {"randomData": np.random.normal(0, 3, 10000)}

# initialize an action, setting it to use the key set above
action = StdevAction(vectorKey="randomData")

#plt.figure()
#plt.hist(data["randomData"], bins=100, density=True)

# Run the action and print the results
print(f"The standard deviation is {action(data)}")

### Create a new action

In the example below, we define a new VectorAction to multiply a vector by a scalar.

In [None]:
from lsst.analysis.tools import VectorAction, KeyedData, KeyedDataSchema, Vector
from lsst.pex.config import Field

class MultiplyByScalar(VectorAction):
    """Multiply vector by a scalar value"""

    vectorKey = Field[str](doc="Key of vector which should be loaded")
    factor = Field[float](doc="Multiplicative factor", default=1.)

    def getInputSchema(self) -> KeyedDataSchema:
        return ((self.vectorKey, Vector),)

    def __call__(self, data: KeyedData, **kwargs) -> Vector:
        return np.array(self.factor * data[self.vectorKey.format(**kwargs)])

In [None]:
action = MultiplyByScalar(vectorKey="randomData", factor=2.)
results = action(data)
assert np.allclose(results, 2. * data['randomData'])

### Three conceptual steps in an `AnalysisTool`: prep, process, produce

As mentioned above, AnalysisTools can be thought of as executable containers of AnalysisActions. Each tool holds three AnalysisActions, referred to as stages, named prep, process, and produce.
* Prep: Responsible for any initial selection and filtering of data
* Process: This is where any transformations and/or calculations are made
* Produce: Generates final plot and/or metric objects

The following examples will:
* Walk through the three stages of running an analysis tool in sequential lines of code, passing the output of one step as input to the next step
* Examine intermediate results

In [None]:
prepResults = atool.prep(objectTable, band='i')
processResults = atool.process(prepResults, band='i')
produceResults = atool.produce(processResults, band='i', skymap=None, plotInfo=_StandinPlotInfo())

Inspect the intermediate results

In [None]:
prepResults

In [None]:
processResults
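
The three stages can also be pictured as plain function composition. The sketch below is a simplified analogy under stated assumptions: the `prep`, `process`, and `produce` functions, the `flux` and `fluxErr` keys, and the units are all hypothetical and only imitate the division of labor described above.

```python
import statistics


def prep(data, threshold):
    """Selection: keep rows whose signal-to-noise exceeds the threshold."""
    keep = [f / e > threshold for f, e in zip(data["flux"], data["fluxErr"])]
    return {
        key: [value for value, ok in zip(values, keep) if ok]
        for key, values in data.items()
    }


def process(data):
    """Calculation: reduce the selected rows to summary statistics."""
    return {"median": statistics.median(data["flux"]), "count": len(data["flux"])}


def produce(stats, units):
    """Output: attach units to each statistic to form metric-like records."""
    return {name: (value, units[name]) for name, value in stats.items()}


data = {"flux": [5.0, 50.0, 100.0, 200.0], "fluxErr": [5.0, 5.0, 5.0, 5.0]}

prep_results = prep(data, threshold=5)
process_results = process(prep_results)
produce_results = produce(process_results, {"median": "nJy", "count": "count"})
print(produce_results)
```

Because each stage returns a value that is passed to the next, any intermediate result can be inspected on its own, just as `prepResults` and `processResults` are inspected above.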

## Workflow examples

### Running analysis_tools as part of a pipeline

* The examples in this section use the `SimplePipelineExecutor`, which runs a pipeline from within a notebook.
* There is a PipelineTask for each data product. Each task is a subclass of AnalysisPipelineTask and can run multiple AnalysisTools, each of which produces a set of plots or a set of metrics.
* The pipeline is defined in a yaml file, which is loaded below.
* The pipeline is then run with the executor.
* The pipeline can be configured, e.g., by turning different metrics and plots on or off, or by changing other parameters.

**WARNING:** If you are using your own processed instance of rc2_subset, run the following cells to run analysis_tools as part of a pipeline from the notebook. If you are pointing to a shared sandbox instance of rc2_subset, skip the remaining cells in this section.

In [None]:
import os  # needed for os.environ below

from lsst.ctrl.mpexec import SimplePipelineExecutor
from lsst.pipe.base import Pipeline

# set up an output collection with your username
analysisToolsCollection = "u/%s/analysisToolsExample"%os.environ['USER']

# this can be skipped if you already have a read writable butler setup (above is read only)
butlerRW = SimplePipelineExecutor.prep_butler(repo, inputs=collections, output=analysisToolsCollection)

# load in the pipeline to run
pipeline = Pipeline.from_uri("$ANALYSIS_TOOLS_DIR/pipelines/coaddQualityCore.yaml")

# override a configuration within a certain AnalysisTool
#configKey = "atools.shapeSizeFractionalDiff.prep.selectors.snSelector.threshold"
#pipeline.addConfigOverride("analyzeObjectTableCore", configKey, 400)

# Run only the PSF size residual tool
pipeline.addConfigOverride("analyzeObjectTableCore", "atools", None)
pipeline.addConfigOverride("analyzeObjectTableCore", "atools.shapeSizeFractionalDiff", ShapeSizeFractionalDiff)

#bands = ['g', 'r', 'i', 'z']
bands = ['i']
pipeline.addConfigOverride("analyzeObjectTableCore", "bands", bands)
pipeline.addConfigOverride("catalogMatchTract", "bands", bands)
pipeline.addConfigOverride("refCatObjectTract", "bands", bands)

# restrict processing to the same dataId used above
whereString = "tract = 9813 AND skymap = 'hsc_rings_v1'"
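
The commented-out override in the cell above addresses a nested configuration field through a dotted key. The sketch below illustrates the general idea of resolving such a key with plain Python objects. It is an analogy only: `set_by_dotted_key` is a hypothetical helper, the nested `SimpleNamespace` stands in for a task configuration, and the starting threshold of 100 is made up.

```python
from functools import reduce
from types import SimpleNamespace


def set_by_dotted_key(config, key, value):
    """Walk the attribute path named by a dotted key and set the final field."""
    *parents, leaf = key.split(".")
    target = reduce(getattr, parents, config)
    setattr(target, leaf, value)


# Hypothetical stand-in for a nested task configuration
config = SimpleNamespace(
    atools=SimpleNamespace(
        shapeSizeFractionalDiff=SimpleNamespace(
            prep=SimpleNamespace(
                selectors=SimpleNamespace(
                    snSelector=SimpleNamespace(threshold=100)  # made-up default
                )
            )
        )
    )
)

key = "atools.shapeSizeFractionalDiff.prep.selectors.snSelector.threshold"
set_by_dotted_key(config, key, 400)
new_threshold = config.atools.shapeSizeFractionalDiff.prep.selectors.snSelector.threshold
print(new_threshold)
```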

Display the pipeline that is about to be run

In [None]:
print(pipeline)

In [None]:
# Prevent the executor from dumping plots into the notebook
backend_ = mpl.get_backend()
mpl.use("Agg")

executor = SimplePipelineExecutor.from_pipeline(pipeline, where=whereString, butler=butlerRW)
quanta = executor.run(True)

# Restore the ability for plots to be put into the notebook
mpl.use(backend_)

In [None]:
# Refresh our read-only butler to see the changes made. (It's generally a
# good idea to work on read-only things)
butler.registry.refresh()

## Access persisted metrics

Specify the collection that holds the results of running analysis_tools. If you are pointing to your own instance of rc2_subset, you did this already above prior to running the pipeline and you can comment out the line below.

In [None]:
# Comment out the line below if you are pointing to your own instance of rc2_subset
analysisToolsCollection = "u/bechtol/analysisToolsExample"

Check what dataset types exist in the new collection.

In [None]:
# See what datasets exist; there should now be objectTableCore_metrics
for datasetType in registry.queryDatasetTypes():
    if registry.queryDatasets(datasetType, collections=analysisToolsCollection).any(execute=False, exact=False):
        print(datasetType)

Access the persisted metrics.

In [None]:
# Get the metric that was written
refs = sorted(butler.registry.queryDatasets("objectTableCore_metrics", collections=analysisToolsCollection))
print(refs)
dataId = refs[0].dataId
objectTable_metrics = butler.get("objectTableCore_metrics", dataId=refs[0].dataId, collections=analysisToolsCollection)
pprint(objectTable_metrics)

Also, we can see that a plot has been persisted, but there isn't a way to visualize the plot from the notebook. In the next section, we'll use the `reconstructor` to recreate the plot in the notebook.

In [None]:
refs = sorted(butler.registry.queryDatasets(
    "objectTableCore_i_shapeSizeFractionalDiff_ScatterPlotWithTwoHists", 
    collections=analysisToolsCollection)
)
print(refs[0])

# The following line will throw an error
# plot = butler.get(
#     "objectTableCore_i_shapeSizeFractionalDiff_ScatterPlotWithTwoHists",
#     dataId=refs[0].dataId,
#     collections=analysisToolsCollection
# )

### Reconstruct the inputs to an `AnalysisTool`

When a pipeline is run, the exact state of each `AnalysisTool`, including its `AnalysisAction`s, is saved into the Butler. This allows a user to 'reconstruct' the tools as they were when they were executed, which aids debugging and closer inspection of the data.

Below is an example of reconstructing one of the tasks that was run in the Pipeline above.

In [None]:
from lsst.analysis.tools.tasks.reconstructor import reconstructAnalysisTools

# Read in just one task
label = "analyzeObjectTableCore"
taskState, inputData = reconstructAnalysisTools(butler, 
                                                collection=analysisToolsCollection, 
                                                label=label, 
                                                dataId=dataId, 
                                                callback=None
)

We have access to the exact configuration that was used to run the analysis tools.

In [None]:
pprint(taskState.toDict())

We also have access to the input data that were used to produce the diagnostics, in this case, the object table.

In [None]:
inputData["data"]

Quick check to verify that the object table data are indeed the same.

In [None]:
assert np.allclose(inputData["data"]['coord_ra'], objectTable['coord_ra'])

We can now reproduce diagnostic metrics and plots interactively in the notebook

In [None]:
# The following configuration won't be needed in the future
taskState.atools.shapeSizeFractionalDiff.produce.plot.addSummaryPlot = False

taskState.atools.shapeSizeFractionalDiff(
    inputData['data'],
    band='i',
    skymap=None,
    plotInfo=_StandinPlotInfo()
)

Next, we change one of the configuration parameters to see how the results change. 

In this example, we raise the signal-to-noise threshold to 200. Notice that the metric values and plot change with these updated object selection criteria.

In [None]:
# change some configuration to see the differences
taskState.atools.shapeSizeFractionalDiff.prep.selectors.snSelector.threshold = 200
taskState.atools.shapeSizeFractionalDiff(
    inputData['data'],
    band='i',
    skymap=None,
    plotInfo=_StandinPlotInfo()
)

As before, we can also step through the calculation to check intermediate steps.

In [None]:
prepResults = taskState.atools.shapeSizeFractionalDiff.prep(objectTable, band='i')
processResults = taskState.atools.shapeSizeFractionalDiff.process(prepResults, band='i')
produceResults = taskState.atools.shapeSizeFractionalDiff.produce(processResults, band='i', skymap=None, plotInfo=_StandinPlotInfo())

## Create a new analysis tool

Example `AnalysisTool`s can be found in the [atools](https://github.com/lsst/analysis_tools/tree/main/python/lsst/analysis/tools/atools) directory of the package.

Now let's create our own `AnalysisTool`

In [None]:
from lsst.analysis.tools import AnalysisTool
from lsst.analysis.tools.actions.scalar import MedianAction, CountAction
from lsst.analysis.tools.actions.vector import SnSelector


class DemoTool(AnalysisTool):
    #parameterizedBand: bool = False
    
    def setDefaults(self):
        super().setDefaults()
        
        # select on high signal to noise objects
        # add in a signal to noise selector
        self.prep.selectors.snSelector = SnSelector()
        
        # set what key the selector should use when deciding SNR
        self.prep.selectors.snSelector.fluxType = "psfFlux"
        
        # select what threshold value is desirable for the selector
        self.prep.selectors.snSelector.threshold = 10
        
        # the final name in the qualification is used as a key to insert
        # the calculation into KeyedData
        self.process.calculateActions.median = MedianAction(vectorKey="psfFlux")
        self.process.calculateActions.count = CountAction(vectorKey="psfFlux")
        
        # tell the metric what the units are for the quantity
        self.produce.metric.units = {"median": "Jy",
                                     "count": "count"}
        
        # Rename the quantity prior to producing the Metric
        # (useful for reusable workflows that set a name toward the end of computation)
        #self.produce.metric.newNames = {"medianValueName": "DemoMetric"}

Examine the configuration of our new tool

In [None]:
DemoTool().toDict()

Make some synthetic data

In [None]:
# Make some synthetic data
size = 500
flux = np.logspace(1., 4., size)
fluxErr = np.sqrt(flux)
flux += np.random.normal(0, np.sqrt(flux), size)
data = {"psfFlux": flux, "psfFluxErr": fluxErr}

plt.figure()
plt.xscale('log')
plt.yscale('log')
plt.scatter(data['psfFlux'], data['psfFluxErr'])
plt.xlabel('psfFlux')
plt.ylabel('psfFluxErr')

Run the new analysis tool.

In [None]:
demoTool = DemoTool()
demoTool.populatePrepFromProcess()

# We can configure as needed
demoTool.prep.selectors.snSelector.threshold = 50

demoTool(data)

We can inspect intermediate stages of the analysis.

In [None]:
# Example: how many of the data points pass the SNR threshold?
len(demoTool.prep(data)['psfFlux'])
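
For intuition about what that count should be, the sketch below applies a signal-to-noise cut by hand to a noise-free version of the synthetic data. It assumes the selection is a strict `flux / fluxErr > threshold` comparison, which is an assumption of this sketch rather than a statement about how `SnSelector` is implemented, and `sn_mask` is a hypothetical helper.

```python
import math


def sn_mask(flux, flux_err, threshold):
    """Boolean mask: True where flux / flux_err exceeds the threshold."""
    return [e > 0 and f / e > threshold for f, e in zip(flux, flux_err)]


# Noise-free analogue of the synthetic data: 500 fluxes log-spaced from 1e1 to 1e4
n = 500
flux = [10 ** (1.0 + 3.0 * i / (n - 1)) for i in range(n)]
flux_err = [math.sqrt(f) for f in flux]

mask = sn_mask(flux, flux_err, threshold=50)
n_pass = sum(mask)
print(f"{n_pass} of {n} points pass the cut")
```

With `fluxErr = sqrt(flux)`, the signal-to-noise is `sqrt(flux)`, so a threshold of 50 corresponds to `flux > 2500`. The count returned by the tool on the noisy data should be close to, but not necessarily equal to, the noise-free count.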