# Gen3 Butler Basics

## Environment setup

This tutorial runs against a built copy of the [ci_hsc_gen3](https://github.com/lsst/ci_hsc_gen3) package.  As of this writing, a copy can be found at `/project/jbosch/gen3/bootcamp/ci_hsc_gen3` on `lsst-dev` or the LSP.  The notebook was originally written against stack release `w_2019_45`, but it probably works with many other releases from around the same time, since most of the interfaces demonstrated here are fairly stable.  It can be run with a "small" JupyterLab instance in the LSP.

In [1]:
import os
#REPO_ROOT = "/project/jbosch/gen3/bootcamp/ci_hsc_gen3/DATA"
REPO_ROOT = "/global/cscratch1/sd/desc/DC2/gen3/Run2.2i-gen3"

## Data Repositories and Collections

As in Gen2, you initialize a `Butler` by pointing at a data repository, which is usually represented by a directory with a `butler.yaml` file in it.

Unlike Gen2, that's not enough; you need to pass a `collection` name, too:

In [2]:
from lsst.daf.butler import Butler
butler = Butler(REPO_ROOT, collections="desc/demo")

That's because a Gen3 data repository is actually more like a set of related Gen2 data repositories (_e.g._ `/datasets/hsc/repo` and all of the data repositories that (recursively) consider it a parent).  A Gen3 `collection` is like a specific Gen2 data repository, and while some collections may also be associated with subdirectories, that's not true in general.

The `shared/ci_hsc_output` collection is the one that holds the results of building ci_hsc_gen3 (and hence running a bunch of `PipelineTasks`).  There are lots of other collections here:

In [3]:
butler.registry.queryCollections()

<generator object Registry.queryCollections at 0x2aaad0f825d0>

There's clearly the beginning of a convention here for how to name collections, but it hasn't been worked out in detail.  From the perspective of the code, they're just strings, and the slashes don't mean anything.

## Registry and Datastore

The last call was actually on `butler.registry`, not just `butler`, and that'll be a relatively common occurrence in Gen3, because a `Butler` is really just a convenience wrapper that combines three things:

* a `Registry` instance that manages metadata and relationships between datasets via a SQL database (a `gen3.sqlite3` file in the repo root, in this case);
* a `Datastore` instance that manages the datasets themselves (files in subdirectories of the repo root, in this case);
* the name of the `collection`.

You'll frequently use `butler.registry` to perform operations that don't need anything from the `Datastore`.

A `butler.datastore` attribute exists as well, but it's much less likely that you'll need to use it directly (I can't think of a reason).

Neither the `Registry` nor the `Datastore` knows about the `collection` you passed when constructing the `Butler`, so when using them directly you may need to pass `butler.collection` to them.
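
To make that division of labour concrete, here is a minimal, self-contained sketch of the wrapper pattern.  `ToyRegistry`, `ToyDatastore` and `ToyButler` are hypothetical stand-ins written for this illustration; they are not the `lsst.daf.butler` classes, and their methods do not mirror the real signatures.

```python
class ToyRegistry:
    """Hypothetical stand-in: maps (collection, dataset type, data ID) to a storage key."""

    def __init__(self):
        self._index = {}

    def add(self, collection, dataset_type, data_id, key):
        self._index[(collection, dataset_type, frozenset(data_id.items()))] = key

    def find(self, collection, dataset_type, data_id):
        return self._index[(collection, dataset_type, frozenset(data_id.items()))]


class ToyDatastore:
    """Hypothetical stand-in: maps a storage key to the stored object."""

    def __init__(self):
        self._objects = {}

    def put(self, key, obj):
        self._objects[key] = obj

    def get(self, key):
        return self._objects[key]


class ToyButler:
    """Convenience wrapper: a registry, a datastore, and a collection name."""

    def __init__(self, registry, datastore, collection):
        self.registry = registry
        self.datastore = datastore
        self.collection = collection

    def get(self, dataset_type, data_id):
        # The wrapper supplies the collection; the registry and the datastore
        # never learn about it on their own.
        key = self.registry.find(self.collection, dataset_type, data_id)
        return self.datastore.get(key)


registry = ToyRegistry()
datastore = ToyDatastore()
registry.add("desc/demo", "deepCoadd", {"tract": 3078, "patch": 26}, key="file-0001")
datastore.put("file-0001", "pretend this is an ExposureF")

toy_butler = ToyButler(registry, datastore, collection="desc/demo")
result = toy_butler.get("deepCoadd", {"tract": 3078, "patch": 26})
print(result)
```

Calling `registry.find` directly requires passing `toy_butler.collection` by hand, which is the same situation described above for the real `Registry`.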

## How to spell `get`

The most common thing you'll do with a `Butler` is call `get` (`PipelineTasks` call `put` just about as often, but usually the author of a concrete `PipelineTask` won't actually write any `get` and `put` calls).  In its simplest form, that looks something like this:

In [4]:
#dataId = {"skymap": "discrete/ci_hsc", "tract": 3078, "patch": 69, "abstract_filter": "r"}
#dataId = {"skymap": "skymaps/imsim", "tract": 3078, "patch": 26, "abstract_filter": "z"}
dataId = {"skymap": "imsim_skymap", "tract": 3078, "patch": 26, "abstract_filter": "z"}

coadd = butler.get("deepCoadd", dataId=dataId)

There's a lot to unpack here, but let's start by making it clear that you can write this a few different ways, and they're all equivalent (and most of them are identical to Gen2, aside from what's in the data ID):

In [5]:
# Pass data ID as a positional argument:
coadd = butler.get("deepCoadd", dataId)

In [6]:
# Pass data ID as multiple keyword arguments:
coadd = butler.get("deepCoadd", skymap="imsim_skymap", tract=3078, patch=26, abstract_filter="z")

In [7]:
# Do both.  Keyword arguments override the data ID dict (considered a feature, though it may be surprising).
coadd = butler.get("deepCoadd", dataId, patch=26, abstract_filter="z")
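
The override rule in that last spelling behaves like an ordinary dictionary merge.  The helper below is a hypothetical illustration of the rule, not the code that `Butler.get` actually runs:

```python
def merge_data_id(data_id=None, **kwargs):
    """Combine a data ID dict with keyword arguments; keywords win on conflict."""
    merged = dict(data_id or {})  # copy, so the caller's dict is left untouched
    merged.update(kwargs)
    return merged


base = {"skymap": "imsim_skymap", "tract": 3078, "patch": 26, "abstract_filter": "z"}
merged = merge_data_id(base, patch=27)
print(merged["patch"], base["patch"])
```

The keyword value replaces the one from the dict in the merged result, while the original dict keeps its own value.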

### DatasetTypes

The first argument can also be a `DatasetType` object instead of the string that refers to one.  A `DatasetType` instance knows the data ID keys needed to identify it (we call those "dimensions") and its StorageClass, which you can think of as a mapping to the Python type you'll get back:

In [8]:
deepCoaddType = butler.registry.getDatasetType("deepCoadd")
print(deepCoaddType)

DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF)


In the Gen2 butler, all dataset types had to be pre-declared in an obs_ package.  In Gen3, they're added to the data repository as needed by the `PipelineTasks` that create them, so if you want to know what dataset types exist, you'll need to ask the registry (someday we'll hopefully find some way to put a snapshot of a "typical" registry's dataset types in the online documentation):

In [9]:
# butler.registry.getAllDatasetTypes()  Method rename HMK?

Note that many of those `DatasetTypes` are components of some other `DatasetType`, like `deepCoadd.psf`.  You can `get` components just as you would their parents, and usually that'll be much more efficient if the component is a small piece of the whole:

In [10]:
inputs = butler.get("deepCoadd.coaddInputs", dataId)
print(inputs.visits)

  id   bbox_min_x bbox_min_y bbox_max_x ... goodpix        weight       filter
          pix        pix        pix     ...                                   
------ ---------- ---------- ---------- ... -------- ------------------ ------
 13289      19900      11900      24099 ...  6199167 0.6635962142150572      z
 32683      19900      11900      24099 ... 11099088 0.7962166507011508      z
209058      19900      11900      24099 ... 16016002 0.5307254982981001      z
209066      19900      11900      24099 ... 16078479 0.5365676180201323      z
209091      19900      11900      24099 ... 16066550 0.5376226732711255      z
237907      19900      11900      24099 ...  9232702  1.044202855104819      z
240855      19900      11900      24099 ... 16132106 1.2021152736290244      z
303558      19900      11900      24099 ... 15944329 0.8779514976477992      z
426663      19900      11900      24099 ... 16211996 0.8029725996590692      z
443964      19900      11900      24099 ... 16077353

Anyhow, as promised, you can use that `DatasetType` instance in `get`:

In [11]:
coadd = butler.get(deepCoaddType, skymap="imsim_skymap", tract=3078, patch=26, abstract_filter="z")

### DatasetRefs

The Gen3 Butler makes use of the combination of a `DatasetType` and a data ID frequently enough that there is a special object for that, `DatasetRef`:

In [16]:
from lsst.daf.butler import DatasetRef
ref = DatasetRef(deepCoaddType, {"skymap": "imsim_skymap", "tract": 3078, "patch": 26, "abstract_filter": "z"})
print(ref)

deepCoadd@{abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 26}


You can pass a `DatasetRef` as the _only_ argument to `get`:

In [18]:
coadd = butler.get(ref)  

If you're familiar with Gen2, you might have noticed that a Gen2 **DataRef** is _completely_ different from a Gen3 **DatasetRef**, and if you care about this, please chime in on [DM-21448](https://jira.lsstcorp.org/browse/DM-21448) where (among other things) I'm considering renaming it.  Everybody else: maybe don't get too attached to this name yet.

### Parameters

Component datasets are used to get predefined, differently-typed pieces of a composite dataset.  For some dataset types it's desirable to get same-typed, parameterized subsets, and that's what the `parameters` keyword argument to `get` is for.  The classic case (and the only one supported right now) is a subimage:

In [19]:
from lsst.geom import Box2I, Point2I
# HMK Not sure what a good box size for DESC Run2.2i would be
#bbox = Box2I(Point2I(20000, 16000), Point2I(20200, 18000))
#parameters = {"bbox": bbox}
#subcoadd = butler.get("deepCoadd", dataId, parameters=parameters) 

In [20]:
import numpy as np
# assert np.all(subcoadd.image.array == coadd[bbox].image.array)  HMK commenting out for now

Unlike Gen2,

- The dataset type name is the same (still just `deepCoadd`, not `deepCoadd_sub`, as it was in Gen2).

- You pass all parameters as a single dict as the `parameters` kwarg, rather than as separate kwargs that could get confused with the data ID.

Any of the alternate spellings of `get` shown above can be used with parameters, including the `DatasetRef` one - the parameters go in the call to `get`, not inside the `DatasetRef`:

In [21]:
# subcoadd = butler.get(ref, parameters=parameters) HMK commenting out for now

## Querying for Datasets

One of the most important new features of the Gen3 Butler is much more complete support for querying datasets.  That all goes through the `queryDatasets` method.  A typical query might look something like this:

In [22]:
list(butler.registry.queryDatasets("deepCoadd", collections=["desc/demo"]))

[DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 2}, id=149652, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 8}, id=149653, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 1}, id=149654, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 3}, id=149659, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 10}, id=149664, run='desc/demo/202007

We've wrapped the call in `list` because `queryDatasets` returns a single-pass iterator, not a container.  **It's also not guaranteed to return unique results**, because it might be much more expensive to do some kinds of deduplication (especially for complex queries) in the database.  We'll work on improving this in the future; right now there is a `deduplicate` option but it only deals with some kinds of duplication, and unfortunately `DatasetRef` isn't yet hashable so you can't just drop the results into a `set`.
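
Until that improves, one workaround is to deduplicate on the client side using a hashable key.  The sketch below uses a hypothetical `FakeRef` class in place of `DatasetRef` and assumes each result carries a unique integer `id`, as the `id=...` values in the output above suggest; check that the attribute exists in your stack version before relying on it.

```python
class FakeRef:
    """Hypothetical stand-in for an unhashable DatasetRef."""

    __hash__ = None  # mimic "not hashable": putting instances in a set raises TypeError

    def __init__(self, ref_id, patch):
        self.id = ref_id
        self.patch = patch

    def __eq__(self, other):
        return isinstance(other, FakeRef) and self.id == other.id


def unique_by_id(refs):
    """Return refs with repeated ``id`` values removed, keeping first-seen order."""
    seen = {}
    for ref in refs:
        seen.setdefault(ref.id, ref)
    return list(seen.values())


results = [FakeRef(149652, 2), FakeRef(149653, 8), FakeRef(149652, 2)]
unique = unique_by_id(results)
print([ref.id for ref in unique])
```

Because the function accepts any iterable, it can consume a single-pass iterator directly without wrapping it in `list` first.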

You can also pass a single data ID, either as a single argument or (as with `get`) a number of keyword arguments:

In [23]:
list(butler.registry.queryDatasets("deepCoadd", collections=["desc/demo"],
                                   dataId={"abstract_filter": "z"}, deduplicate=True))

[DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 2}, id=149652, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 8}, id=149653, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 1}, id=149654, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 3}, id=149659, run='desc/demo/20200719T07h45m18s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 10}, id=149664, run='desc/demo/202007

That data ID doesn't even have to be directly related to the dataset; `queryDatasets` will automatically use temporal or spatial overlaps if it needs to.  Here's a query for all of the calexps that overlap a patch:

In [24]:
list(butler.registry.queryDatasets("calexp", collections=["desc/demo"],
                                   skymap="imsim_skymap", tract=3078, patch=26))

[DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {instrument: LSST-ImSim, detector: 76, visit: 443964}, id=142753, run='desc/demo/20200719T06h14m32s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {instrument: LSST-ImSim, detector: 76, visit: 443964}, id=142753, run='desc/demo/20200719T06h14m32s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {instrument: LSST-ImSim, detector: 76, visit: 443964}, id=142753, run='desc/demo/20200719T06h14m32s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {instrument: LSST-ImSim, detector: 76, visit: 443964}, id=142753, run='desc/demo/20200719T06h14m32s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_s

There's a big caveat to these spatial lookups, though: while they're guaranteed to return any dataset that overlaps the given data ID, they may also return some that don't, because the regions for observations are defined during ingest, and we pad those quite a bit to account for possibly-bad WCSs.  The above query actually returns all of the calexps in the (small) ci_hsc dataset, because there's so much padding, and what's worse, it returns some of them several times (remember that there's no guarantee about uniqueness).

But, unlike Gen2, everything it returns actually does exist in the data repository.

So far, we've passed a single dataset type and a single collection.  You can also pass `...` for either argument to look for all dataset types and/or in all collections:

In [25]:
list(butler.registry.queryDatasets(..., collections=["refcats"]))

[DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 143674}, id=12444, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 144278}, id=12445, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 144151}, id=12446, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 147054}, id=12447, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 143718}, id=12448, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 146995}, id=12449, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 144289}, id=12450, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 144293}, id=12451, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 141883}, id=12452, run='refcats'),
 DatasetRef(DatasetType(cal_ref_cat, {htm7}, SimpleCatalog), {htm7: 14182

SQL `LIKE` wildcards are also accepted for both, if you use the special `Like` object:

In [26]:
#from lsst.daf.butler.core.queries import Like  # I just noticed now that this isn't lifted to daf.butler; it should be.  # Like isn't recognized HMK
#set(ref.datasetType for ref in butler.registry.queryDatasets(Like("deepCoadd%image"), collections=...))

In [27]:
# set(ref.datasetType for ref in butler.registry.queryDatasets(..., collections=[Like("%/hsc")]))

Finally, `queryDatasets` allows you to pass a [boolean expression (in mostly SQL-like syntax)](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/exprParser.html) that involves any dimension field (including metadata):

In [28]:
#list(butler.registry.queryDatasets("raw", collections=["raw/hsc"], where="visit < 903338 AND detector IN (15..50)"))
list(butler.registry.queryDatasets("raw", collections=["LSST-ImSim/raw/all"], where="visit < 903338 AND detector IN (15..50)"))

[DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {instrument: LSST-ImSim, detector: 24, exposure: 184884}, id=5957, run='LSST-ImSim/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {instrument: LSST-ImSim, detector: 21, exposure: 184884}, id=5949, run='LSST-ImSim/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {instrument: LSST-ImSim, detector: 43, exposure: 184884}, id=5946, run='LSST-ImSim/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {instrument: LSST-ImSim, detector: 42, exposure: 184884}, id=5945, run='LSST-ImSim/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {instrument: LSST-ImSim, detector: 47, exposure: 184884}, id=5942, run='LSST-ImSim/r

## Data IDs and Dimensions

Perhaps the biggest differences in `get` between Gen2 and Gen3 are in the data ID.  Here's the Gen3 data ID again, along with its Gen2 counterpart:

In [29]:
dataId_gen2 = {"tract": 3078, "patch": "5,4", "filter": "z"}
print(f"Gen3: {dataId}")
print(f"Gen2: {dataId_gen2}")

Gen3: {'skymap': 'imsim_skymap', 'tract': 3078, 'patch': 26, 'abstract_filter': 'z'}
Gen2: {'tract': 3078, 'patch': '5,4', 'filter': 'z'}


Exactly one key-value pair is the same: `tract=3078`.

One key-value pair clearly has the same intent, but a different key and, in general, a different value: `abstract_filter` (in this repository both values happen to be `"z"`).  Gen3 distinguishes between "physical" filters, which are associated with a particular piece of glass on a particular instrument (these still have names like "HSC-R"), and "abstract" filters, which are named groups of similar filters (with names like "r").  The coadd dataset types are all defined in terms of `abstract_filter`.  That's not just so we can coadd data from multiple instruments together (this is just one of several steps we'd need to enable that) - it also helps with cameras like HSC that have two versions of the same filter (i.e. "HSC-R" and "HSC-R2" are both `physical_filters`) that we want to be able to combine.

Right now, each `physical_filter` corresponds to exactly one `abstract_filter` (it's many-to-one).  We know that reality is more complex than that (many-to-many), and we expect to generalize this in the future.  We're thinking about renaming `abstract_filter` to just `filter` along the way.

The `skymap` key is a totally new one.  In Gen3, all data IDs that involve `tract` also need to involve a `skymap` key that indicates which `skymap` defines that tract.  New skymaps can be added to a `Registry` by calling [BaseSkyMap.register](https://pipelines.lsst.io/v/weekly/py-api/lsst.skymap.BaseSkyMap.html#lsst.skymap.BaseSkyMap.register) or running the [makeGen3SkyMap.py](https://github.com/lsst/pipe_tasks/blob/master/bin.src/makeGen3Skymap.py) command-line tool (which also writes a `deepCoadd_skyMap` dataset, which we usually want).  A skymap must be added to the `Registry` before we can run any `PipelineTask` or `put` any dataset that uses it.

The `patch` key is present in both the Gen2 and Gen3 data IDs, but with a different value.  That's because (as per [RFC-365](https://jira.lsstcorp.org/browse/RFC-365)) `patch` identifiers in Gen3 are single integers that encode both the `x` and `y` indices.  We'll show later how to convert between these.
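
As a rough illustration of what "encoding both indices in one integer" can mean, the sketch below assumes a row-major encoding, `index = y * nx + x`, where `nx` is the number of patches per row in the tract.  Both the encoding and the value of `NX` used here are assumptions made for this example; the skymap object itself is the authority for the real conversion.

```python
def patch_to_sequential(x, y, nx):
    """Encode (x, y) patch indices as one integer, assuming row-major order."""
    return y * nx + x


def sequential_to_patch(index, nx):
    """Invert patch_to_sequential under the same row-major assumption."""
    y, x = divmod(index, nx)
    return x, y


NX = 7  # hypothetical number of patches per row; not read from any real skymap

x, y = sequential_to_patch(26, NX)
print(x, y)
print(patch_to_sequential(x, y, NX))
```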

### DataCoordinate and Dimension instances

While you can still pass simple dictionaries as arguments to `Butler` and `Registry` APIs that expect data IDs, the objects we get back from the butler are always instances of `DataCoordinate`.  In fact, the data ID associated with a `DatasetRef` is a `DataCoordinate` as well:

In [30]:
ref.dataId

{abstract_filter: 'z', skymap: 'imsim_skymap', tract: 3078, patch: 26}

`DataCoordinate` instances are also dict-like objects, and they can be passed anywhere a dictionary-like data ID can be passed; many _internal_ `daf_butler` APIs require them.  They even look like vanilla dicts if you print them with `__str__` (`__repr__` is what's invoked above):

In [31]:
print(ref.dataId)

{abstract_filter: z, skymap: imsim_skymap, tract: 3078, patch: 26}


The first thing worth noting about a `DataCoordinate` is that its keys aren't actually strings; they're instances of the `Dimension` class:

In [32]:
ref.dataId.keys()

dict_keys([Dimension(abstract_filter), Dimension(skymap), Dimension(tract), Dimension(patch)])

`Dimension` instances are comparable to the strings that identify them, but they're much more than labels.  Using dimensions to identify datasets is a core concept for the Gen3 butler, and you can find a lot more information on it in the [API documentation](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html).  We'll cover the basics here.
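
To see how a key object can be "comparable to a string" and still work in dictionary lookups, consider the following sketch.  `ToyDimension` is a hypothetical class written for this example; it shows one way to get that behavior and is not necessarily how `Dimension` is implemented.

```python
class ToyDimension:
    """Hypothetical named object that compares equal to its own name."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, ToyDimension):
            return self.name == other.name
        if isinstance(other, str):
            return self.name == other
        return NotImplemented

    def __hash__(self):
        # Hash like the name so that a lookup by string lands on the same entry.
        return hash(self.name)

    def __repr__(self):
        return f"ToyDimension({self.name})"


toy_data_id = {ToyDimension("tract"): 3078, ToyDimension("patch"): 26}
print(toy_data_id["tract"])
print(ToyDimension("patch") == "patch")
```

Equality and hashing must agree for this to work: two objects that compare equal have to produce the same hash, which is why `__hash__` delegates to the name.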

Most - but not all - dimensions are associated with a table in the `Registry` database.  The rows in those tables contain the valid data ID values for that dimension, but they can also contain metadata fields and foreign key fields that are used to model relationships between dimensions.  The rows in the dimension tables are the same across all collections - that's important, because we want the meaning of a data ID to be consistent across (e.g.) different processing runs.  Dimensions form a sort of scaffolding or skeleton for datasets - datasets do not have relationships of their own; instead we rely on the web of dimension relationships to connect them.

The full set of dimensions known to a `Registry` is stored in a class called the `DimensionUniverse`:

In [33]:
butler.registry.dimensions

DimensionUniverse({htm7, htm9, abstract_filter, instrument, skymap, calibration_label, detector, physical_filter, subfilter, tract, visit_system, exposure, patch, visit})

You can get dimension instances via dict-like indexing on this object, and we'll use that to take a closer look at the `visit` dimension:

In [34]:
visit = butler.registry.dimensions["visit"]

Note that this object represents the _concept_ of a visit, not a particular visit.

Let's start by printing the fields of the visit table:

In [35]:
def printFieldSpecs(*specs):
    """Print the names and types from one or more FieldSpec instances."""
    for spec in specs:
        print(f"{spec.name} ({spec.dtype.__name__})")

printFieldSpecs(*visit.makeTableSpec().fields)

Some of these (`exposure_time`, `seeing`) are just metadata fields with no special structure.  We'll have to add more of these to `visit` in the future - this just isn't something we've tried to be comprehensive about yet.

The `region` and `datetime*` fields are present because `visit` is `spatial` and `temporal`, respectively:

In [36]:
print(visit.spatial, visit.temporal)

visit_detector_region exposure


These form implied relationships to any other dimension that is spatial and/or temporal, via the overlap of their regions and/or timespans.

The `id` field is the _primary key_ for `visit`, which means it's what you use as the value in data ID key-value pairs.  `name` is an _alternate key_, which means it also uniquely identifies a visit, and someday we plan to make those usable as data ID values (but haven't yet):

In [37]:
printFieldSpecs(visit.primaryKey)
printFieldSpecs(*visit.alternateKeys)


The _database_ primary key for the `visit` table isn't just `id`, though - `visit` has a _required dependency_ on the `instrument` dimension, which means it has a foreign key to the `instrument` table that is also part of its (compound) primary key:

In [38]:
print(visit.graph.required)

{instrument, visit}


This required dependency means that whenever the `visit` key appears in a data ID, the `instrument` key must as well.  You've already seen another example of this: the `tract` dimension has a required dependency on the `skymap` dimension:

In [39]:
print(butler.registry.dimensions["tract"].graph.required)

{skymap, tract}


Finally, `visit` has an _implied dependency_ on the `physical_filter` dimension:

In [40]:
print(visit.implied)

{physical_filter, visit_system}


On the database side, that means that the `visit` table has a `physical_filter` field that is a foreign key but _not_ a primary key.  In terms of data IDs, this means that you _don't_ need to pass a `physical_filter` key in a data ID that involves `visit`, but the `Registry` can add one for you based on what's in the database.

### ExpandedDataCoordinate

You can get that extra dimension information from the `Registry` by calling `expandDataId`:

In [41]:
#expanded = butler.registry.expandDataId({"instrument": "HSC", "visit": 903338})
expanded = butler.registry.expandDataId({"instrument": "LSST-ImSim", "visit": 443964})

When you expand a data ID like this, you get everything the database knows about those dimensions, and it comes in a special subclass of `DataCoordinate`.  In order to make it [strictly substitutable](https://en.wikipedia.org/wiki/Liskov_substitution_principle) for `DataCoordinate`, its behavior can be a little tricky; while you can ask it for the values of implied dimensions:

In [42]:
expanded["physical_filter"]

'z'

they don't appear in `keys()` or iteration, which are still just the required dimensions:

In [43]:
expanded.keys()

dict_keys([Dimension(instrument), Dimension(visit)])

You can get a dict with the full set of keys with:

In [44]:
print(expanded.full)

{abstract_filter: z, instrument: LSST-ImSim, physical_filter: z, visit_system: 0, visit: 443964}


Note that `abstract_filter` is here, too, not just `physical_filter`.  That's because `physical_filter` has an implied dependency on `abstract_filter`, and those dependencies are expanded recursively:

In [45]:
print(butler.registry.dimensions["physical_filter"].implied)

{abstract_filter}
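
The mapping behavior described in this section (implied values can be looked up by key, but only required keys show up in `keys()` and iteration, with a separate `full` view) can be mimicked with a small class.  `ToyExpandedDataId` is a hypothetical sketch for illustration only, and the values are copied from the output above rather than read from a registry.

```python
from collections.abc import Mapping


class ToyExpandedDataId(Mapping):
    """Hypothetical sketch: iteration covers required keys, lookup covers all keys."""

    def __init__(self, required, implied):
        self._required = dict(required)
        self._implied = dict(implied)

    def __getitem__(self, key):
        if key in self._required:
            return self._required[key]
        return self._implied[key]

    def __iter__(self):
        # Only the required keys take part in iteration and keys().
        return iter(self._required)

    def __len__(self):
        return len(self._required)

    @property
    def full(self):
        """A plain dict with both required and implied key-value pairs."""
        return {**self._implied, **self._required}


toy_expanded = ToyExpandedDataId(
    required={"instrument": "LSST-ImSim", "visit": 443964},
    implied={"physical_filter": "z", "abstract_filter": "z", "visit_system": 0},
)
print(list(toy_expanded.keys()))
print(toy_expanded["physical_filter"])
print(sorted(toy_expanded.full))
```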


`ExpandedDataCoordinate.records` is a dictionary with all of the metadata of all of the dimensions:

In [46]:
# While writing this I discovered that some of these classes don't
# have a nice `__str__` or `__repr__` yet, and ought to; here's a
# snippet to do that for now.
for dimension, record in expanded.records.items():
    print(dimension)
    for field, value in record.toDict().items():
        print(f"  {field}: {value}")

abstract_filter
  name: z
instrument
  name: LSST-ImSim
  visit_max: 9999999999
  exposure_max: 9999999999
  detector_max: 999
  class_name: lsst.obs.lsst.LsstImSim
physical_filter
  instrument: LSST-ImSim
  name: z
  abstract_filter: z
visit_system
  instrument: LSST-ImSim
  id: 0
  name: one-to-one
visit
  instrument: LSST-ImSim
  id: 443964
  physical_filter: z
  visit_system: 0
  name: 443964
  exposure_time: 30.0
  seeing: None
  region: ConvexPolygon([UnitVector3d(0.3768244112345154, 0.6127673194333143, -0.6946362899620717), UnitVector3d(0.3753901489117097, 0.615992541908073, -0.6925571632823354), UnitVector3d(0.38154367879668344, 0.6262289283394811, -0.6798983383426366), UnitVector3d(0.38484217703543727, 0.6264078868818407, -0.6778714170304367), UnitVector3d(0.38832371672248395, 0.6265830269457473, -0.6757206533578362), UnitVector3d(0.4091302992395544, 0.6273995180199263, -0.6625573507497421), UnitVector3d(0.41256812509343005, 0.6274948514087825, -0.6603315482485763), UnitVector

Finally, you can ask an `ExpandedDataCoordinate` for its `region`, which is a [lsst.sphgeom.ConvexPolygon](http://doxygen.lsst.codes/stack/doxygen/x_masterDoxyDoc/classlsst_1_1sphgeom_1_1_convex_polygon.html) if the data ID corresponds to a region on the sky, or `None` if it does not:

In [47]:
expanded.region

ConvexPolygon([UnitVector3d(0.3768244112345154, 0.6127673194333143, -0.6946362899620717), UnitVector3d(0.3753901489117097, 0.615992541908073, -0.6925571632823354), UnitVector3d(0.38154367879668344, 0.6262289283394811, -0.6798983383426366), UnitVector3d(0.38484217703543727, 0.6264078868818407, -0.6778714170304367), UnitVector3d(0.38832371672248395, 0.6265830269457473, -0.6757206533578362), UnitVector3d(0.4091302992395544, 0.6273995180199263, -0.6625573507497421), UnitVector3d(0.41256812509343005, 0.6274948514087825, -0.6603315482485763), UnitVector3d(0.42723339222258555, 0.6180965277048173, -0.6598699197639112), UnitVector3d(0.42862839216167825, 0.6149909964057282, -0.6618638649849259), UnitVector3d(0.43006383361765466, 0.6117550887857098, -0.6639283171840726), UnitVector3d(0.431491158182822, 0.6085109387713126, -0.6659803434078988), UnitVector3d(0.43991019902759315, 0.5887914713062641, -0.6780882096810916), UnitVector3d(0.4337399880012369, 0.5785538987915767, -0.690756837824826), UnitV

As we discussed earlier, this region may be heavily padded to account for inaccurate initial WCSs, but it should be guaranteed to contain the true region.

### DimensionGraph

The sometimes-complex system of relationships between dimensions makes it very useful to have a specialized container for them, and we've actually already seen this class (`DimensionGraph`) in use in a few places:

In [48]:
deepCoaddType.dimensions

DimensionGraph({abstract_filter, skymap, tract, patch})

In [49]:
ref.dataId.graph

DimensionGraph({abstract_filter, skymap, tract, patch})

In [50]:
visit.graph

DimensionGraph({abstract_filter, instrument, physical_filter, visit_system, visit})

`DimensionUniverse` is a special subclass of `DimensionGraph` as well.

To explore `DimensionGraph` further, we'll start by extracting an interesting and common set of dimensions from the universe - these are the ones used to label the `calexp` dataset:

In [51]:
graph = butler.registry.dimensions.extract(["detector", "visit"])
print(graph)

{abstract_filter, instrument, detector, physical_filter, visit_system, visit}


The first thing to note here is that the set of dimensions is automatically expanded to include all (recursive) required and implied dimensions.  The dimensions are also sorted topologically (dependents follow their dependencies), with string (lexicographical) comparisons to break ties.
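
The expansion and ordering just described can be sketched in plain Python.  The dependency table below is written by hand from the outputs shown in this notebook (it merges required and implied dependencies into one set per dimension), and the use of Kahn's algorithm with a min-heap is an assumption about one reasonable way to get "topological order with lexicographical tie-breaking"; it is not the `daf_butler` implementation.

```python
import heapq

# Hand-written dependency table inferred from the outputs in this notebook.
# It is an illustrative stand-in, not data read from a real DimensionUniverse.
DEPENDENCIES = {
    "abstract_filter": set(),
    "instrument": set(),
    "detector": {"instrument"},
    "physical_filter": {"instrument", "abstract_filter"},
    "visit_system": {"instrument"},
    "visit": {"instrument", "physical_filter", "visit_system"},
}


def expand(names):
    """Return names plus all recursive dependencies, dependencies first, ties by name."""
    # Step 1: collect the transitive closure of the dependencies.
    closure = set()
    stack = list(names)
    while stack:
        name = stack.pop()
        if name not in closure:
            closure.add(name)
            stack.extend(DEPENDENCIES[name])

    # Step 2: Kahn's algorithm; the min-heap makes ties come out in string order.
    remaining = {name: set(DEPENDENCIES[name]) for name in closure}
    ready = [name for name, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    ordered = []
    while ready:
        name = heapq.heappop(ready)
        ordered.append(name)
        del remaining[name]
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    heapq.heappush(ready, other)
    return ordered


print(expand(["detector", "visit"]))
```

With this table, the toy function reproduces the ordering printed above for the `calexp` dimensions.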

We can ask a `DimensionGraph` for its `required` and `implied` dimensions:

In [52]:
print(graph.required)
print(graph.implied)

{instrument, detector, visit}
{abstract_filter, physical_filter, visit_system}


Note that which dimensions are `implied` depends on which dimensions are present; `physical_filter` is only `implied` here because `visit` is also in the graph.

The *required* dimensions of a graph are particularly important, because those are the keys of a `DataCoordinate` that identifies the dimensions of that graph.

You can also ask a `DimensionGraph` for its `temporal` and `spatial` dimensions:

In [53]:
print(graph.temporal)

{visit}


In [54]:
print(graph.spatial)

{visit_detector_region}


That last answer is a bit unexpected - `visit_detector_region` isn't even one of the dimensions in the graph!  Instead, it's a table that's part of the dimensions system, without being an actual `Dimension` itself.  It can't be used as a data ID key, but it is used to provide other information about true dimensions.  In this case, that extra information is the `region` associated with the _combination_ of a `visit` and a `detector`.  A `visit` has its own region, but the system knows that the one provided by `visit_detector_region` is more specific and hence a better match for this graph.

The graph's spatial dimension is what defines the `region` attribute of any `ExpandedDataCoordinate` for that graph.  That's why we want to select the "best" spatial dimension rather than reporting all of them; if a graph has more than one spatial dimension and neither is clearly better, we don't currently define a region for it (though perhaps we will someday):

In [57]:
#dataIdWithAmbiguousRegion = butler.registry.expandDataId({"skymap": "discrete/ci_hsc", "tract": 0, "instrument": "HSC", "visit": 903334})  
dataIdWithAmbiguousRegion = butler.registry.expandDataId({"skymap": "imsim_skymap", "tract": 3078, "instrument": "LSST-ImSim", "visit": 443964})

print(dataIdWithAmbiguousRegion.graph.spatial)

{tract, visit}


In [58]:
dataIdWithAmbiguousRegion.region

NotImplemented

We _return_ the special `NotImplemented` object instead of raising an exception because we don't want to completely fail to construct an `ExpandedDataCoordinate` just because its region isn't well-defined, and returning `None` already expresses the slightly different case where there are no spatial dimensions.
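A rough sketch of that three-way convention, and of how calling code might tell the cases apart, is below.  `region_for` and `describe` are hypothetical helpers written for this example, and the dictionary of regions stands in for whatever the graph's spatial dimensions provide.

```python
def region_for(spatial_regions):
    """Pick a region from a dict mapping spatial dimension name -> region.

    Returns None when there are no spatial dimensions, the region when
    there is exactly one, and NotImplemented when the choice is ambiguous.
    """
    if not spatial_regions:
        return None
    if len(spatial_regions) == 1:
        (region,) = spatial_regions.values()
        return region
    return NotImplemented


def describe(region):
    """Classify a region value using identity checks only."""
    if region is None:
        return "no spatial dimensions"
    if region is NotImplemented:
        return "ambiguous region"
    return "well-defined region"


print(describe(region_for({})))
print(describe(region_for({"visit_detector_region": "polygon A"})))
print(describe(region_for({"tract": "polygon B", "visit": "polygon C"})))
```

Comparing with `is` is deliberate: evaluating `NotImplemented` in a boolean context is deprecated in recent Python versions, so `if region:` is not a safe way to test for a usable region.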

### SkyPix Dimensions

The `htm7` and `htm9` dimensions in this universe are special; they're represented in code by the `SkyPixDimension` subclass:

In [59]:
htm9 = butler.registry.dimensions["htm9"]
htm9

SkyPixDimension(htm9)

A skypix dimension represents a particular level of a particular hierarchical pixelization of the sky, which corresponds to an instance of [lsst.sphgeom.Pixelization](http://doxygen.lsst.codes/stack/doxygen/x_masterDoxyDoc/classlsst_1_1sphgeom_1_1_pixelization.html).

In [60]:
htm9.pixelization

HtmPixelization(9)

A `Pixelization` instance knows how to go from integer IDs to regions on the sky and back:

In [61]:
print(htm9.pixelization.envelope(observationDataId.region))

NameError: name 'observationDataId' is not defined

In [62]:
print(htm9.pixelization.pixel(3031552))

ConvexPolygon([UnitVector3d(0.773010453362737, -0.6343932841636455, 0.0), UnitVector3d(0.7710605242618138, -0.6367618612362842, 0.0), UnitVector3d(0.7725653343241956, -0.6349265062178546, -0.003337049974569806)])
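The `envelope` cell above depends on a data ID defined elsewhere in the notebook, so here is a standalone sketch of the same round trip that builds its own region with `lsst.sphgeom`.  The sky position and radius are arbitrary values chosen for illustration, `htm_round_trip` is a hypothetical helper, and the `try`/`except` guard is only there so the cell doesn't fail when the stack isn't set up.

```python
try:
    from lsst.sphgeom import Angle, Circle, HtmPixelization, LonLat, UnitVector3d
except ImportError:  # stack not set up; the sketch below is skipped
    HtmPixelization = None


def htm_round_trip(lon_deg, lat_deg, level=9, radius_deg=0.1):
    """Go from a sky position to a pixel ID and back to a region."""
    pixelization = HtmPixelization(level)
    point = UnitVector3d(LonLat.fromDegrees(lon_deg, lat_deg))
    # Position -> integer ID of the pixel containing it.
    index = pixelization.index(point)
    # Integer ID -> region (a spherical triangle for HTM).
    triangle = pixelization.pixel(index)
    # Region -> ranges of pixel IDs that together cover it.
    circle = Circle(point, Angle.fromDegrees(radius_deg))
    ranges = pixelization.envelope(circle)
    return index, triangle.contains(point), ranges


if HtmPixelization is not None:
    index, inside, ranges = htm_round_trip(320.5, -0.1)
    print(index, inside)
    print(ranges)
```

The `envelope` result is a set of ID ranges rather than a single ID, because a region generally overlaps several pixels.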


Because these mappings are fully defined in code, we don't also put them in the database, so `skypix` dimensions don't have their own entries in the `Registry` database.

However, we also use *one* skypix dimension, called the "common" skypix dimension, as a sort of spatial index that relates all other spatial dimensions in the database.  It still doesn't have its own table, but the database does have join tables that relate the common skypix dimension's IDs to the primary keys of other dimensions.  Usually this should be completely transparent.  You can get the common skypix dimension from the universe:

In [63]:
butler.registry.dimensions.commonSkyPix

SkyPixDimension(htm7)
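The idea of using one skypix dimension as a spatial index can be sketched with two toy join tables.  Everything here is made up for illustration (the names, the pixel IDs, and the `overlapping_visits` helper); it only shows the shape of the lookup, not the actual database schema or query.

```python
# Hypothetical join tables: each maps a dimension value to the set of
# common-skypix pixel IDs that its region overlaps.
VISIT_TO_SKYPIX = {
    "visit_a": {1001, 1002, 1003},
    "visit_b": {2001, 2002},
}
TRACT_TO_SKYPIX = {
    "tract_x": {1003, 1004},
    "tract_y": {3001},
}


def overlapping_visits(tract):
    """Return visits that share at least one common-skypix pixel with `tract`."""
    pixels = TRACT_TO_SKYPIX[tract]
    return sorted(v for v, p in VISIT_TO_SKYPIX.items() if p & pixels)


print(overlapping_visits("tract_x"))  # ['visit_a']
print(overlapping_visits("tract_y"))  # []
```

Because both tables are keyed to the same pixelization, any two spatial dimensions can be related without a dedicated join table for every pair.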

### Querying Dimensions

You can use `Registry.queryDimensions` to run complex queries that return data IDs.  It accepts many of the same arguments as `queryDatasets`, and also returns iterators.  The first argument is the set of dimensions the returned data IDs should include.  This can be any iterable over strings or `Dimension` instances, and will be expanded to a self-consistent `DimensionGraph` automatically:

In [64]:
for dataId in butler.registry.queryDimensions(["visit"], instrument="LSST-ImSim", physical_filter="z"):
    print(dataId)

{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 237907}
{instrument: LSST-ImSim, visit: 209058}
{instrument: LSST-ImSim, visit: 209066}
{instrument: LSST-ImSim, visit: 13289}
{instrument: LSST-ImSim, visit: 209091}
{instrument: LSST-ImSim, visit: 303558}


It's worth noting that you have to include `instrument` here if you want to pass `physical_filter` as a partial data ID (the same would be true in `queryDatasets`) because we convert the given data ID into an `ExpandedDataCoordinate` *before* we run the query.  That can't work unless you specify the `instrument`, because the `physical_filter` dimension has a required dependency on the `instrument` dimension:

In [65]:
for dataId in butler.registry.queryDimensions(["visit"], physical_filter="z"):
    print(dataId)

KeyError: "No value in data ID (None) for required dimension 'instrument'."
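A toy version of that up-front check looks something like the sketch below.  The `REQUIRED` table and `check_data_id` are assumptions invented for this example; the real validation happens inside the `Registry` when it expands the data ID.

```python
# Toy table of required dependencies (illustrative assumption).
REQUIRED = {
    "instrument": (),
    "physical_filter": ("instrument",),
    "visit": ("instrument",),
}


def check_data_id(data_id):
    """Raise KeyError if any key's required dependency is missing."""
    for name in data_id:
        for dependency in REQUIRED[name]:
            if dependency not in data_id:
                raise KeyError(
                    f"No value in data ID for required dimension '{dependency}'."
                )
    return dict(data_id)


# Complete: physical_filter comes with its instrument.
complete = check_data_id({"instrument": "LSST-ImSim", "physical_filter": "z"})

# Incomplete: the check fails before any query could run.
try:
    check_data_id({"physical_filter": "z"})
except KeyError as err:
    message = str(err)
    print(message)
```

The point is the ordering: the data ID is validated on its own, before the query is built, so a `physical_filter` value can never be interpreted without knowing which instrument it belongs to.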

You _can_ query on just `physical_filter` using a string expression...

In [66]:
for dataId in butler.registry.queryDimensions(["visit"], where="physical_filter = 'z'"):
    print(dataId)

{instrument: LSST-ImSim, visit: 303558}
{instrument: LSST-ImSim, visit: 209091}
{instrument: LSST-ImSim, visit: 13289}
{instrument: LSST-ImSim, visit: 209066}
{instrument: LSST-ImSim, visit: 209058}
{instrument: LSST-ImSim, visit: 237907}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 32683}


...but this may not be what you want.  In a larger, more realistic repository, this will actually search over all instruments, and while our _convention_ of putting the instrument name in the physical filter string would still save us from getting results from those other instruments here, it might be less efficient, and variations on this case involving other dimensions might produce undesired results.  From that perspective, the requirement that any data ID passed be complete and self-consistent is a feature, not a bug, and we relax it in the string-based query system because it's intended to be much more flexible (and hence can't be as careful).

Like `queryDatasets`, `queryDimensions` will automatically use spatial or temporal relationships, as well as implied dimension relationships:

In [67]:
for dataId in butler.registry.queryDimensions(["visit"], dataId={"skymap": "imsim_skymap", "tract": 3078, "patch": 26, "abstract_filter": "z"}):
    print(dataId)

{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 32683}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 443964}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 240855}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 426663}
{instrument: LSST-ImSim, visit: 237907}
{instr

And while data IDs are not dependent on any dataset type or collection, you can query for the data IDs for which all of some set of datasets exist in one or more collections.  For example, this query is meant to return all detectors for which a `flat` dataset exists in the `LSST-ImSim/calib` collection:

In [72]:
for dataId in butler.registry.queryDimensions(["detector"], datasets={"flat": ["LSST-ImSim/calib"]}):
    print(dataId)

TypeError: Cannot pass 'datasets' without 'collections'.

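With the stack version this notebook was originally written against, that call worked, but it actually gave us much more than we asked for, which was data IDs with just `detector` (and `instrument`, since that's a required dependency).  That was a bug, and I made a ticket ([DM-22176](https://jira.lsstcorp.org/browse/DM-22176)) for it.  With the version that produced the output above, the call fails earlier: as the error message says, `datasets` now has to be accompanied by a separate `collections` argument.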
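Going by the `TypeError` message above, a version of this query for newer stacks would presumably pass the dataset type and the collection as separate keyword arguments.  The sketch below assumes that `queryDimensions` accepts `datasets` and `collections` keywords that each take a plain string; check that against the documentation for your stack version.  `detectors_with_flat` is a hypothetical helper name.

```python
def detectors_with_flat(registry, collection):
    """Return data IDs of detectors that have a `flat` in `collection`.

    Assumes `registry.queryDimensions` takes `datasets` and `collections`
    as separate keyword arguments (inferred from the error message above).
    """
    return list(
        registry.queryDimensions(
            ["detector"],
            datasets="flat",
            collections=collection,
        )
    )


# Intended usage in this notebook (not executed here):
# for dataId in detectors_with_flat(butler.registry, "LSST-ImSim/calib"):
#     print(dataId)
```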

That's not a bad thing to close on: a pretty representative example of the fact that this is all still under construction, even if it can already do a lot.

## Appendix: what about `put`?

It's just like `get`, but you pass the thing you want to write as the first argument, and you can't use `parameters`.  You can use `DatasetRef`.  There is no return value.  You can find more documentation [here](https://pipelines.lsst.io/py-api/lsst.daf.butler.Butler.html#lsst.daf.butler.Butler.put).

That won't work here (by design; there's a good chance you're using a shared example repo we don't want to break), because the `Butler` we constructed is read-only.  If you want a read-write `Butler`, construct the `Butler` with a `run` argument instead of a `collections` argument.  The value for that is the name of a special kind of collection that we call a `Run`.  The `shared/ci_hsc_output` collection we've been using is actually the name of a `Run`, but passing it as a regular collection tells the `Butler` that we don't plan to write anything to it.