# Gen3 Butler Basics

## Environment setup

This tutorial runs against a built copy of the [ci_hsc_gen3](https://github.com/lsst/ci_hsc_gen3) package, but it should run with only minor changes to collection names and data ID values in any Gen3 repo containing DRP pipeline outputs.  This version of the notebook was written against stack release `w_2020_29`.  Some of the APIs used near the end are expected to change soon, but only slightly, and a new version of the notebook will be released then.

In [1]:
import os
REPO_ROOT = "/home/jbosch/LSST/src/ci_hsc_gen3/DATA"

## Data Repositories and Collections

In Gen2, you initialize a `Butler` by pointing at a data repository, which is usually represented by a directory.

That is _mostly_ true in Gen3 as well:

In [2]:
from lsst.daf.butler import Butler
butler = Butler(REPO_ROOT)

By default, Gen3 repository roots will have a `butler.yaml` and a `gen3.sqlite3` file.  A reasonable first thing to do when looking at a repository is to examine its _collections_, which are groups of datasets that frequently (but not always) correspond to processing runs.

In [3]:
list(butler.registry.queryCollections())

['HSC/calib',
 'skymaps',
 'HSC/raw/all',
 'shared/ci_hsc',
 'HSC/masks',
 'ref_cats',
 'shared/ci_hsc_output/20200720T22h54m33s',
 'shared/ci_hsc_output']

Note that we wrapped the result in `list(...)` - that's because most butler query operations return lazy iterators of some kind.
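
The practical consequence of a lazy result can be seen without a repository. The sketch below uses a plain Python generator as a stand-in for a query result; `fake_query_collections` is a hypothetical helper and not part of `lsst.daf.butler`, and it assumes the result is a single-pass iterator.

```python
def fake_query_collections():
    """Hypothetical stand-in for a lazy query: yields names one at a time."""
    for name in ("HSC/calib", "skymaps", "HSC/raw/all"):
        yield name


results = fake_query_collections()
first_pass = list(results)   # consumes the iterator
second_pass = list(results)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

If you need to look at the results more than once, keep the `list` (or `set`) around rather than the iterator itself.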

In fact, normally we'd construct a `Butler` instance with the name of one or more collections to read from as well as the repo root.  That's because a Gen3 data repository is actually more like a set of related Gen2 data repositories (_e.g._ `/datasets/hsc/repo` and all of the data repositories that (recursively) consider it a parent).  A Gen3 `collection` is like a specific Gen2 data repository, and while some collections may also be associated with subdirectories, that's not true in general.

The `shared/ci_hsc_output` collection is the one that holds the results of building ci_hsc_gen3 (and hence running a bunch of `PipelineTask`s), so we'll reconstruct a butler with that.

In [4]:
butler = Butler(REPO_ROOT, collections="shared/ci_hsc_output")

Note that we can pass either a single string or a sequence of strings as the collections argument.  In most places where a sequence of collections is expected, you can pass several other kinds of expressions as well.  See [Collection Expressions](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html#collection-expressions) for more information.  In fact, [`Registry.queryCollections`](https://pipelines.lsst.io/v/weekly/py-api/lsst.daf.butler.Registry.html#lsst.daf.butler.Registry.queryCollections) is one such place:

In [5]:
import re
# A forward slash needs no escaping in a regular expression.
list(butler.registry.queryCollections(re.compile("HSC/.*")))

['HSC/calib', 'HSC/raw/all', 'HSC/masks']
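
To see why only three names come back, here is a minimal sketch using just the standard `re` module on the collection names listed earlier. It assumes the pattern has to match the whole collection name (hence `fullmatch`); how the registry applies the pattern internally is not shown in this tutorial.

```python
import re

# Collection names copied from the listing earlier in this tutorial.
collections = [
    "HSC/calib",
    "skymaps",
    "HSC/raw/all",
    "shared/ci_hsc",
    "HSC/masks",
    "ref_cats",
    "shared/ci_hsc_output/20200720T22h54m33s",
    "shared/ci_hsc_output",
]

pattern = re.compile("HSC/.*")

# Assumption: the entire name must match, not just a prefix.
matched = [name for name in collections if pattern.fullmatch(name)]
print(matched)
```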

It's worth pointing out that the `collections` argument to the `Butler` constructor is used for direct `Butler` methods, but isn't used by any methods on the `butler.registry` attribute that we've been using so far.  The next section explains why.

## Registry and Datastore

The last call was actually on `butler.registry`, not just `butler`, and that'll be a relatively common occurrence in Gen3, because a `Butler` is really just a convenience wrapper that combines three things:

* a `Registry` instance that manages metadata and relationships between datasets via a SQL database (a `gen3.sqlite3` file in the repo root, in this case);
* a `Datastore` instance that manages the datasets themselves (files in subdirectories of the repo root, in this case);
* the name(s) of one or more collections.

You'll frequently use `butler.registry` to perform operations that don't need anything from the `Datastore`.

A `butler.datastore` attribute exists as well, but it's much less likely that you'll need to use it directly (I can't think of a reason).

Neither the `Registry` nor the `Datastore` knows about the `collection` you passed when constructing the `Butler`, so when using them directly you may need to pass `butler.collections` to them (if they take a `collections` argument).

## How to spell `get`

The most common thing you'll do with a `Butler` is call `get` (`PipelineTasks` call `put` just about as often, but usually the author of a concrete `PipelineTask` won't actually write any `get` and `put` calls).  In its simplest form, that looks something like this:

In [6]:
dataId = {"skymap": "discrete/ci_hsc", "tract": 0, "patch": 69, "abstract_filter": "r"}
coadd = butler.get("deepCoadd", dataId=dataId)

There's a lot to unpack here, but let's start by making it clear that you can write this a few different ways, and they're all equivalent (and most of them are identical to Gen2, aside from what's in the data ID):

In [7]:
# Pass data ID as a positional argument:
coadd = butler.get("deepCoadd", dataId)

In [8]:
# Pass data ID as multiple keyword arguments:
coadd = butler.get("deepCoadd", skymap="discrete/ci_hsc", tract=0, patch=69, abstract_filter="r")

In [9]:
# Do both.  Keyword arguments override the data ID dict (considered a feature, though it may be surprising).
coadd = butler.get("deepCoadd", dataId, patch=69, abstract_filter="r")
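
The override rule in the last spelling can be pictured as an ordinary dictionary merge. This is only a sketch of the observable behavior described above; `merge_data_id` is a hypothetical helper, not how `Butler.get` is implemented.

```python
def merge_data_id(data_id=None, **kwargs):
    """Hypothetical helper: combine a data ID dict with keyword overrides."""
    merged = dict(data_id or {})  # copy, so the caller's dict is untouched
    merged.update(kwargs)         # keyword arguments win on conflicts
    return merged


data_id = {"skymap": "discrete/ci_hsc", "tract": 0, "patch": 69, "abstract_filter": "r"}
merged = merge_data_id(data_id, abstract_filter="i")

print(merged["abstract_filter"])   # the keyword value
print(data_id["abstract_filter"])  # the original dict is unchanged
```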

### DatasetTypes

The first argument can also be a `DatasetType` object instead of the string that refers to one.  A `DatasetType` instance knows the data ID keys needed to identify it (we call those "dimensions") and its StorageClass, which you can think of as a mapping to the Python type you'll get back:

In [10]:
deepCoaddType = butler.registry.getDatasetType("deepCoadd")
print(deepCoaddType)

DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF)


In the Gen2 butler, all dataset types had to be pre-declared in an obs_ package.  In Gen3, they're added to the data repository as needed by the `PipelineTasks` that create them, so if you want to know what dataset types exist, you'll need to ask the registry (someday we'll hopefully find some way to put a snapshot of a "typical" registry's dataset types in the online documentation):

In [12]:
list(butler.registry.queryDatasetTypes())

[DatasetType(camera, {instrument, calibration_label}, Camera),
 DatasetType(defects, {instrument, calibration_label, detector}, Defects),
 DatasetType(bfKernel, {instrument, calibration_label}, NumpyArray),
 DatasetType(transmission_optics, {instrument, calibration_label}, TransmissionCurve),
 DatasetType(transmission_sensor, {instrument, calibration_label, detector}, TransmissionCurve),
 DatasetType(transmission_filter, {abstract_filter, instrument, calibration_label, physical_filter}, TransmissionCurve),
 DatasetType(transmission_atmosphere, {instrument}, TransmissionCurve),
 DatasetType(deepCoadd_skyMap, {skymap}, SkyMap),
 DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure),
 DatasetType(brightObjectMask, {abstract_filter, skymap, tract, patch}, ObjectMaskCatalog),
 DatasetType(ps1_pv3_3pi_20170110, {htm7}, SimpleCatalog),
 DatasetType(jointcal_photoCalib, {abstract_filter, instrument, skymap, detector, physical_filter, tract, visit_system

Many of those `DatasetTypes` have implicit component dataset types, like `deepCoadd.psf`.  You can `get` components just as you would their parents, and usually that'll be much more efficient if the component is a small piece of the whole:

In [13]:
inputs = butler.get("deepCoadd.coaddInputs", dataId)
print(inputs.visits)

  id   bbox_min_x bbox_min_y bbox_max_x ... goodpix        weight       filter
          pix        pix        pix     ...                                   
------ ---------- ---------- ---------- ... -------- ------------------ ------
903334      19900      15900      24099 ... 15363779 14.624842625965654  HSC-R
903336      19900      15900      24099 ... 13501340  18.56591686749947  HSC-R
903338      19900      15900      24099 ... 15847317  20.36656420365272  HSC-R
903342      19900      15900      24099 ...  3865590 12.379389122444906  HSC-R
903344      19900      15900      24099 ... 14143544 15.308676446785611  HSC-R
903346      19900      15900      24099 ... 15345474  17.62140943168528  HSC-R


Anyhow, as promised, you can use that `DatasetType` instance in `get`:

In [14]:
coadd = butler.get(deepCoaddType, skymap="discrete/ci_hsc", tract=0, patch=69, abstract_filter="r")

### DatasetRefs

The Gen3 Butler makes use of the combination of a `DatasetType` and a data ID frequently enough that there is a special object for that, `DatasetRef`:

In [15]:
from lsst.daf.butler import DatasetRef
ref = DatasetRef(deepCoaddType, {"skymap": "discrete/ci_hsc", "tract": 0, "patch": 69, "abstract_filter": "r"})
print(ref)

deepCoadd@{abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}


You can pass a `DatasetRef` as the _only_ argument to `get`:

In [16]:
coadd = butler.get(ref)

If you're familiar with Gen2, you might have noticed that a Gen2 **DataRef** (which has a `Butler`) is _quite_ different from a Gen3 **DatasetRef** (which does not).

### Parameters

Component datasets are used to get predefined, differently-typed pieces of a composite dataset.  For some dataset types it's desirable to get same-typed, parameterized subsets, and that's what the `parameters` keyword argument to `get` is for.  The classic case is a subimage:

In [17]:
from lsst.geom import Box2I, Point2I
bbox = Box2I(Point2I(20000, 16000), Point2I(20200, 18000))
parameters = {"bbox": bbox}
subcoadd = butler.get("deepCoadd", dataId, parameters=parameters)

In [18]:
import numpy as np
assert np.all(subcoadd.image.array == coadd[bbox].image.array)

Unlike Gen2,

- The dataset type name is the same (still just `deepCoadd`, not `deepCoadd_sub`, as it was in Gen2).

- You pass all parameters as a single dict as the `parameters` kwarg, rather than as separate kwargs that could get confused with the data ID.

Any of the alternate spellings of `get` shown above can be used with parameters, including the `DatasetRef` one - the parameters go in the call to `get`, not inside the `DatasetRef`:

In [19]:
subcoadd = butler.get(ref, parameters=parameters)

## Querying for Datasets

One of the most important new features of the Gen3 Butler is much more complete support for querying datasets.  That typically goes through the `queryDatasets` method.  A typical query might look something like this:

In [20]:
list(butler.registry.queryDatasets("deepCoadd", collections=["shared/ci_hsc_output"]))

[DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}, id=1630, run='shared/ci_hsc_output/20200720T22h54m33s'),
 DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: i, skymap: discrete/ci_hsc, tract: 0, patch: 69}, id=1633, run='shared/ci_hsc_output/20200720T22h54m33s')]

Once again, we've wrapped the call in `list` because `queryDatasets` returns a single-pass iterator, not a container.  **It's also not guaranteed to return unique results**, because it might be much more expensive to do some kinds of deduplication (especially for complex queries) in the database.  We'll have a way to explicitly ask for database-side deduplication (basically `SELECT DISTINCT`) very soon, but you can also just put the results in a `set`.
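
Client-side deduplication only needs the results to be hashable. The sketch below uses hypothetical tuples as stand-ins for `DatasetRef` objects and shows two options: a `set`, which discards ordering, and `dict.fromkeys`, which keeps the first-seen order.

```python
# Hypothetical query results, (dataset type, abstract_filter, patch), with one repeat.
results = [
    ("deepCoadd", "r", 69),
    ("deepCoadd", "i", 69),
    ("deepCoadd", "r", 69),
]

unique_unordered = set(results)                # duplicates dropped, order not preserved
unique_ordered = list(dict.fromkeys(results))  # duplicates dropped, first-seen order kept

print(len(unique_unordered))
print(unique_ordered)
```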

You can also pass a single data ID, either as a single argument or (as with `get`) a number of keyword arguments:

In [24]:
list(butler.registry.queryDatasets("deepCoadd", collections=["shared/ci_hsc_output"],
                                   dataId={"abstract_filter": "r"}, deduplicate=True))

[DatasetRef(DatasetType(deepCoadd, {abstract_filter, skymap, tract, patch}, ExposureF), {abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}, id=1630, run='shared/ci_hsc_output/20200720T22h54m33s')]

That data ID doesn't even have to be directly related to the dataset; `queryDatasets` will automatically use temporal or spatial overlaps if it needs to.  Here's a query for all of the calexps that overlap a patch:

In [25]:
set(butler.registry.queryDatasets("calexp", collections=["shared/ci_hsc_output"],
                                   skymap="discrete/ci_hsc", tract=0, patch=70))

{DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {abstract_filter: i, instrument: HSC, detector: 1, physical_filter: HSC-I, visit_system: 0, visit: 904014}, id=1582, run='shared/ci_hsc_output/20200720T22h54m33s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {abstract_filter: i, instrument: HSC, detector: 10, physical_filter: HSC-I, visit_system: 0, visit: 904010}, id=1540, run='shared/ci_hsc_output/20200720T22h54m33s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {abstract_filter: i, instrument: HSC, detector: 100, physical_filter: HSC-I, visit_system: 0, visit: 903986}, id=1497, run='shared/ci_hsc_output/20200720T22h54m33s'),
 DatasetRef(DatasetType(calexp, {abstract_filter, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {abstract_

There's a big caveat to these spatial lookups, though: while they're guaranteed to return any dataset that overlaps the given data ID, they may also return some that don't, because the regions for observations are defined during ingest, and we pad those quite a bit to account for possibly-bad WCSs.  The above query actually returns all of the calexps in the (small) ci_hsc dataset, because there's so much padding, and what's worse, it returns some of them several times (remember that there's no guarantee about uniqueness).

But, unlike Gen2, everything it returns actually does exist in the data repository.

So far, we've passed a single dataset type and a single collection.  You can also pass `...` for either argument to look for all dataset types and/or in all collections:

In [None]:
list(butler.registry.queryDatasets(..., collections=["ref_cats"]))

`re.Pattern` objects (what `re.compile` returns) are also accepted for either `collections` or `datasetType`, or both:

In [26]:
set(ref.datasetType for ref in butler.registry.queryDatasets(re.compile("deepCoadd.+image"), collections=...))

{DatasetType(deepCoadd.image, {abstract_filter, skymap, tract, patch}, ImageF, parentStorageClass=ExposureF),
 DatasetType(deepCoadd_calexp.image, {abstract_filter, skymap, tract, patch}, ImageF, parentStorageClass=ExposureF),
 DatasetType(deepCoadd_directWarp.image, {abstract_filter, instrument, skymap, physical_filter, tract, visit_system, patch, visit}, ImageF, parentStorageClass=ExposureF),
 DatasetType(deepCoadd_psfMatchedWarp.image, {abstract_filter, instrument, skymap, physical_filter, tract, visit_system, patch, visit}, ImageF, parentStorageClass=ExposureF)}

In [27]:
set(ref.datasetType for ref in butler.registry.queryDatasets(..., collections=[re.compile("HSC/.*")]))

{DatasetType(bfKernel, {instrument, calibration_label}, NumpyArray),
 DatasetType(bias, {instrument, calibration_label, detector}, ExposureF),
 DatasetType(brightObjectMask, {abstract_filter, skymap, tract, patch}, ObjectMaskCatalog),
 DatasetType(camera, {instrument, calibration_label}, Camera),
 DatasetType(dark, {instrument, calibration_label, detector}, ExposureF),
 DatasetType(defects, {instrument, calibration_label, detector}, Defects),
 DatasetType(flat, {abstract_filter, instrument, calibration_label, detector, physical_filter}, ExposureF),
 DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure),
 DatasetType(sky, {abstract_filter, instrument, calibration_label, detector, physical_filter}, ExposureF),
 DatasetType(transmission_atmosphere, {instrument}, TransmissionCurve),
 DatasetType(transmission_filter, {abstract_filter, instrument, calibration_label, physical_filter}, TransmissionCurve),
 DatasetType(transmission_optics, {instrument, c

Finally, `queryDatasets` allows you to pass a [boolean expression (in mostly SQL-like syntax)](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/exprParser.html) that involves any dimension field (including metadata):

In [29]:
list(butler.registry.queryDatasets("raw", collections=["HSC/raw/all"], where="visit < 903338 AND detector IN (15..50)"))

[DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {abstract_filter: r, instrument: HSC, detector: 16, physical_filter: HSC-R, exposure: 903334}, id=986, run='HSC/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {abstract_filter: r, instrument: HSC, detector: 22, physical_filter: HSC-R, exposure: 903334}, id=987, run='HSC/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {abstract_filter: r, instrument: HSC, detector: 23, physical_filter: HSC-R, exposure: 903334}, id=988, run='HSC/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physical_filter, exposure}, Exposure), {abstract_filter: r, instrument: HSC, detector: 17, physical_filter: HSC-R, exposure: 903336}, id=989, run='HSC/raw/all'),
 DatasetRef(DatasetType(raw, {abstract_filter, instrument, detector, physica
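
For readers more comfortable with Python than SQL, the `where` string above can be read as the predicate sketched below. It assumes the range literal `15..50` includes both endpoints; the `(visit, detector)` rows are made-up examples, and the function is only a reading aid, not something the registry executes.

```python
def matches(visit, detector):
    """Python reading of: visit < 903338 AND detector IN (15..50)."""
    # Assumption: the range 15..50 is inclusive at both ends.
    return visit < 903338 and 15 <= detector <= 50


# Hypothetical (visit, detector) pairs.
rows = [(903334, 16), (903334, 22), (903336, 17), (903338, 16), (903334, 4), (903334, 50)]
selected = [row for row in rows if matches(*row)]
print(selected)
```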

## Data IDs and Dimensions

Perhaps the biggest differences in `get` between Gen2 and Gen3 are in the data ID.  Here's the Gen3 data ID again, along with its Gen2 counterpart:

In [30]:
dataId_gen2 = {"tract": 0, "patch": "5,4", "filter": "HSC-R"}
print(f"Gen3: {dataId}")
print(f"Gen2: {dataId_gen2}")

Gen3: {'skymap': 'discrete/ci_hsc', 'tract': 0, 'patch': 69, 'abstract_filter': 'r'}
Gen2: {'tract': 0, 'patch': '5,4', 'filter': 'HSC-R'}


Exactly one key-value pair is the same: `tract=0`.

One key-value pair clearly has the same intent, but has both a different key and a different value: `abstract_filter="r"`.  Gen3 distinguishes between "physical" filters, which are associated with a particular piece of glass on a particular instrument (these still have names like "HSC-R"), and "abstract" filters, which are named groups of similar filters (with names like "r").  The coadd dataset types are all defined in terms of `abstract_filter`.  That's not just so we can coadd data from multiple instruments together (this is just one of several steps we'd need to enable that) - it also helps with cameras like HSC that have two versions of the same filter (i.e. "HSC-R" and "HSC-R2" are both `physical_filters`) that we want to be able to combine.

Right now, each `physical_filter` corresponds to exactly one `abstract_filter` (it's many-to-one).  We know that reality is more complex than that (many-to-many), and we expect to generalize this in the future.  We're thinking about renaming `abstract_filter` to just `filter` along the way.
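
The many-to-one relationship is just a mapping from physical filter names to abstract filter names. The sketch below uses the three HSC names that appear in this tutorial; the dict and the `group_by_abstract` helper are illustrative only, since the real mapping lives in the `Registry` database.

```python
# Physical-to-abstract filter names mentioned in this tutorial.
PHYSICAL_TO_ABSTRACT = {"HSC-R": "r", "HSC-R2": "r", "HSC-I": "i"}


def group_by_abstract(mapping):
    """Invert a many-to-one mapping into abstract filter -> list of physical filters."""
    groups = {}
    for physical, abstract in mapping.items():
        groups.setdefault(abstract, []).append(physical)
    return groups


groups = group_by_abstract(PHYSICAL_TO_ABSTRACT)
print(groups)
```

Coadding in terms of `abstract_filter="r"` therefore picks up both "HSC-R" and "HSC-R2" inputs.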

The `skymap` key is a totally new one.  In Gen3, all data IDs that involve `tract` also need to involve a `skymap` key that indicates which `skymap` defines that tract.  New skymaps can be added to a `Registry` by calling [BaseSkyMap.register](https://pipelines.lsst.io/v/weekly/py-api/lsst.skymap.BaseSkyMap.html#lsst.skymap.BaseSkyMap.register) or running the [makeGen3SkyMap.py](https://github.com/lsst/pipe_tasks/blob/master/bin.src/makeGen3Skymap.py) command-line tool (which also writes a `deepCoadd_skyMap` dataset, which we usually want).  A skymap must be added to the `Registry` before we can run any `PipelineTask` or `put` any dataset that uses it.

The `patch` key is present in both the Gen2 and Gen3 data IDs, but with a different value.  That's because (as per [RFC-365](https://jira.lsstcorp.org/browse/RFC-365)) `patch` identifiers in Gen3 are single integers that encode both the `x` and `y` indices.  We'll show later how to convert between these.
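
As a rough illustration of how one integer can encode two indices, the sketch below assumes a row-major encoding, `index = y * nx + x`, where `nx` is the number of patches per row. The value `nx = 16` is an assumption chosen only because it reproduces the `"5,4"` and `69` pair used in this tutorial; in real code the conversion should come from the skymap itself rather than from these hypothetical helpers.

```python
def patch_to_index(x, y, nx):
    """Row-major encoding of a patch's (x, y) grid indices as one integer."""
    return y * nx + x


def index_to_patch(index, nx):
    """Inverse of patch_to_index: recover (x, y) from the integer."""
    y, x = divmod(index, nx)
    return x, y


NX = 16  # assumed number of patches per row; not stated in this tutorial

gen2_patch = "5,4"
x, y = (int(v) for v in gen2_patch.split(","))
gen3_patch = patch_to_index(x, y, NX)

print(gen3_patch)
print(index_to_patch(gen3_patch, NX))
```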

### DataCoordinate and Dimension instances

While you can still pass simple dictionaries as arguments to `Butler` and `Registry` APIs that expect data IDs, the objects we get back from the butler are always instances of `DataCoordinate`.  In fact, the data ID associated with a `DatasetRef` is a `DataCoordinate` as well:

In [31]:
ref.dataId

{abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}

Note that `DataCoordinate.__repr__` still prints something that looks _almost_ like a `dict` (note the lack of quotes around the keys), and unlike many `__repr__` strings, it isn't something you could execute to reconstruct the object.  That's because there's really nothing at all concise we could print to let you reconstruct it, and we decided to emphasize concise readability instead.

`DataCoordinate` instances are dict-like objects, and they can be passed anywhere a dictionary-like data ID can be passed; many _internal_ `daf_butler` APIs require them.

The first thing worth noting about a `DataCoordinate` is that its keys aren't actually strings; they're instances of the `Dimension` class:

In [33]:
ref.dataId.keys()

NamedValueSet({Dimension(abstract_filter), Dimension(skymap), Dimension(tract), Dimension(patch)})

`Dimension` instances are comparable to the strings that identify them, but they're much more than labels.  Using dimensions to identify datasets is a core concept for the Gen3 butler, and you can find a lot more information on it in the [API documentation](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html).  We'll cover the basics here.

Most - but not all - dimensions are associated with a table in the `Registry` database.  The rows in those tables contain the valid data ID values for that dimension, but they can also contain metadata fields and foreign key fields that are used to model relationships between dimensions.  The rows in the dimension tables are the same across all collections - that's important, because we want the meaning of a data ID to be consistent across (e.g.) different processing runs.  Dimensions form a sort of scaffolding or skeleton for datasets - datasets do not have relationships of their own; instead we rely on the web of dimension relationships to connect them.

The full set of dimensions known to a `Registry` is stored in a class called the `DimensionUniverse`:

In [34]:
butler.registry.dimensions

DimensionUniverse({htm7, htm9, abstract_filter, instrument, skymap, calibration_label, detector, physical_filter, subfilter, tract, visit_system, exposure, patch, visit})

You can get dimension instances via dict-like indexing on this object, and we'll use that to take a closer look at the `visit` dimension:

In [35]:
visit = butler.registry.dimensions["visit"]

Note that this object represents the _concept_ of a visit, not a particular visit.  The row in the database that corresponds to a single visit is represented in Python by a dynamically generated type you can access via `visit.RecordClass`.

Let's start by printing the fields of the visit table:

In [38]:
visit.RecordClass.fields

('instrument',
 'id',
 'physical_filter',
 'visit_system',
 'name',
 'exposure_time',
 'seeing',
 'region',
 'datetime_begin',
 'datetime_end')

Some of these (`exposure_time`, `seeing`) are just metadata fields with no special structure.  We'll have to add more of these to `visit` in the future - this just isn't something we've tried to be comprehensive about yet.

The `region` and `datetime*` fields are present because `visit` is `spatial` and `temporal`.  These define implied relationships to any other dimension that is spatial and/or temporal, via the overlap of their regions and/or timespans.

The `id` field is the _primary key_ for `visit`, which means it's what you use as the value in data ID key-value pairs.  `name` is an _alternate key_, which means it also uniquely identifies a visit, and someday we plan to make those usable as data ID values (but haven't yet):

In [44]:
(visit.primaryKey.name, visit.primaryKey.dtype)

('id', sqlalchemy.sql.sqltypes.BigInteger)

The _database_ primary key for the `visit` table isn't just `id`, though - `visit` has a _required dependency_ on the `instrument` dimension, which means it has a foreign key to the `instrument` table that is also part of its (compound) primary key:

In [47]:
print(visit.required)

{instrument, visit}


This required dependency means that whenever the `visit` key appears in a data ID, the `instrument` key must as well.  You've already seen another example of this: the `tract` dimension has a required dependency on the `skymap` dimension:

In [46]:
print(butler.registry.dimensions["tract"].required)

{skymap, tract}
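
The rule that required dependencies must accompany a key can be written as a small check. This sketch hard-codes the two `required` sets printed above and uses a hypothetical `missing_keys` helper; it is not how the `Registry` validates data IDs.

```python
# Required dimensions, copied from the outputs printed above.
REQUIRED = {
    "visit": {"instrument", "visit"},
    "tract": {"skymap", "tract"},
}


def missing_keys(data_id):
    """Return the required keys that a data ID dict lacks (hypothetical check)."""
    needed = set()
    for key in data_id:
        # Dimensions not listed in REQUIRED are treated as requiring only themselves.
        needed |= REQUIRED.get(key, {key})
    return needed - set(data_id)


print(missing_keys({"visit": 903338}))                       # {'instrument'}
print(missing_keys({"instrument": "HSC", "visit": 903338}))  # set()
```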


Finally, `visit` has an _implied dependency_ on the `physical_filter` dimension:

In [48]:
print(visit.implied)

{physical_filter, visit_system}


On the database side, that means that the `visit` table has a `physical_filter` field that is a foreign key but _not_ a primary key.  In terms of data IDs, this means that you _don't_ need to pass a `physical_filter` key in a data ID that involves `visit`, but the `Registry` can add one for you based on what's in the database.

### Expanded DataCoordinates

You can get that extra dimension information from the `Registry` by calling `expandDataId`:

In [49]:
expanded = butler.registry.expandDataId({"instrument": "HSC", "visit": 903338})
expanded

{abstract_filter: r, instrument: HSC, physical_filter: HSC-R, visit_system: 0, visit: 903338}

When you expand a data ID like this, you get everything the database knows about those dimensions.  In order to make `DataCoordinate`s that do have this extra information behave compatibly with those that don't, their behavior can be a little tricky; while you can ask an expanded `DataCoordinate` for the values of implied dimensions:

In [50]:
expanded["physical_filter"]

'HSC-R'

they don't appear in `keys()` or iteration, which are still just the required dimensions:

In [51]:
expanded.keys()

NamedValueSet({Dimension(instrument), Dimension(visit)})

You can get a dict with the full set of keys with:

In [53]:
print(dict(expanded.full))

{Dimension(abstract_filter): 'r', Dimension(instrument): 'HSC', Dimension(physical_filter): 'HSC-R', Dimension(visit_system): 0, Dimension(visit): 903338}


Note that `abstract_filter` is here, too, not just `physical_filter`.  That's because `physical_filter` has an implied dependency on `abstract_filter`, and those dependencies are expanded recursively:

In [None]:
print(butler.registry.dimensions["physical_filter"].implied)
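
Recursive expansion is a transitive closure over the dependency relationships. The sketch below hard-codes a small dependency table based on what this tutorial shows (the `instrument` entries for `physical_filter` and `visit_system` are inferred from the fields in their records); `DEPENDENCIES` and `expand` are hypothetical names, not `daf_butler` APIs.

```python
# Direct dependencies (required and implied together), as described in this tutorial.
DEPENDENCIES = {
    "visit": {"instrument", "physical_filter", "visit_system"},
    "physical_filter": {"instrument", "abstract_filter"},
    "visit_system": {"instrument"},
    "instrument": set(),
    "abstract_filter": set(),
}


def expand(names):
    """Return the given names plus everything they depend on, directly or indirectly."""
    result = set()
    stack = list(names)
    while stack:
        name = stack.pop()
        if name not in result:
            result.add(name)
            stack.extend(DEPENDENCIES[name])
    return result


full = expand(["instrument", "visit"])
print(sorted(full))
```

Starting from just `instrument` and `visit`, the closure picks up `abstract_filter` through `physical_filter`, which is the same set of keys seen in the expanded data ID.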

`DataCoordinate.records` is a dictionary with all of the metadata of all of the dimensions:

In [58]:
for record in expanded.records.values():
    print(record)

abstract_filter:
  name: 'r'
instrument:
  name: 'HSC'
  visit_max: 21474800
  exposure_max: 21474800
  detector_max: 200
  class_name: 'lsst.obs.subaru.HyperSuprimeCam'
physical_filter:
  instrument: 'HSC'
  name: 'HSC-R'
  abstract_filter: 'r'
visit_system:
  instrument: 'HSC'
  id: 0
  name: 'one-to-one'
visit:
  instrument: 'HSC'
  id: 903338
  physical_filter: 'HSC-R'
  visit_system: 0
  name: 'HSCA90333800'
  exposure_time: 30.0
  seeing: None
  region: ConvexPolygon([UnitVector3d(0.7660793582192826, -0.6425764814519453, -0.014760839920874362), UnitVector3d(0.7681582924262328, -0.6400512116141268, -0.016348831398683494), UnitVector3d(0.7739440529881775, -0.6330309859058488, -0.016803979507903348), UnitVector3d(0.7744855585869767, -0.6323682622614648, -0.01680774894606699), UnitVector3d(0.7801714151265831, -0.6253495087381339, -0.01644855435469963), UnitVector3d(0.7822037094777117, -0.6228462911184365, -0.014827492010303119), UnitVector3d(0.7833956402965732, -0.6214112460886911, -

Finally, you can ask a `DataCoordinate` for its `region`, which is a [lsst.sphgeom.ConvexPolygon](http://doxygen.lsst.codes/stack/doxygen/x_masterDoxyDoc/classlsst_1_1sphgeom_1_1_convex_polygon.html) if the data ID corresponds to a region on the sky, or `None` if it does not:

In [59]:
expanded.region

ConvexPolygon([UnitVector3d(0.7660793582192826, -0.6425764814519453, -0.014760839920874362), UnitVector3d(0.7681582924262328, -0.6400512116141268, -0.016348831398683494), UnitVector3d(0.7739440529881775, -0.6330309859058488, -0.016803979507903348), UnitVector3d(0.7744855585869767, -0.6323682622614648, -0.01680774894606699), UnitVector3d(0.7801714151265831, -0.6253495087381339, -0.01644855435469963), UnitVector3d(0.7822037094777117, -0.6228462911184365, -0.014827492010303119), UnitVector3d(0.7833956402965732, -0.6214112460886911, -0.011803982328964315), UnitVector3d(0.7844603097270918, -0.6201190860294672, -0.008622157784062565), UnitVector3d(0.784545983130367, -0.6200304746327203, -0.007057682390807695), UnitVector3d(0.7846146909294821, -0.6199596161776157, -0.0054645298550387205), UnitVector3d(0.784665028345312, -0.6199080082958075, -0.0038541591269394193), UnitVector3d(0.784698237292739, -0.6198739389407883, -0.0022307424100284586), UnitVector3d(0.7847098159244472, -0.619862429933087

As we discussed earlier, this region may be heavily padded to account for inaccurate initial WCSs, but it should be guaranteed to contain the true region.

### DimensionGraph

The sometimes-complex system of relationships between dimensions makes it very useful to have a specialized container for them, and we've actually already seen this class (`DimensionGraph`) in use in a few places:

In [60]:
deepCoaddType.dimensions

DimensionGraph({abstract_filter, skymap, tract, patch})

In [61]:
ref.dataId.graph

DimensionGraph({abstract_filter, skymap, tract, patch})

In [62]:
visit.graph

DimensionGraph({abstract_filter, instrument, physical_filter, visit_system, visit})

`DimensionUniverse` is a special subclass of `DimensionGraph` as well.

To explore `DimensionGraph` further, we'll start by extracting an interesting and common set of dimensions from the universe - these are the ones used to label the `calexp` dataset:

In [63]:
graph = butler.registry.dimensions.extract(["detector", "visit"])
print(graph)

{abstract_filter, instrument, detector, physical_filter, visit_system, visit}


The first thing to note here is that the set of dimensions is automatically expanded to include all (recursive) required and implied dimensions.  The dimensions are also sorted topologically (dependents follow their dependencies), with string (lexicographical) comparisons to break ties.
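
That ordering rule can be reproduced for this particular graph with Kahn's algorithm and a min-heap for the tie-breaks. The dependency table is an assumption pieced together from the relationships shown in this tutorial, and `topological_order` is a hypothetical helper; the real `DimensionGraph` may apply additional rules for other dimensions.

```python
import heapq

# Assumed direct dependencies for the dimensions in this graph.
DEPENDENCIES = {
    "abstract_filter": set(),
    "instrument": set(),
    "detector": {"instrument"},
    "physical_filter": {"instrument", "abstract_filter"},
    "visit_system": {"instrument"},
    "visit": {"instrument", "physical_filter", "visit_system"},
}


def topological_order(dependencies):
    """Kahn's algorithm; a min-heap picks the lexicographically smallest ready name."""
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    ready = [name for name, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        name = heapq.heappop(ready)
        order.append(name)
        # Release any dimension whose last unmet dependency was `name`.
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    heapq.heappush(ready, other)
    return order


order = topological_order(DEPENDENCIES)
print(order)
```

For these six dimensions the result matches the order printed for `graph` above.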

We can ask a `DimensionGraph` for its `required` and `implied` dimensions:

In [64]:
print(graph.required)
print(graph.implied)

{instrument, detector, visit}
{abstract_filter, physical_filter, visit_system}


Note that which dimensions are `implied` depends on which dimensions are present; `physical_filter` is only `implied` here because `visit` is also in the graph.

The *required* dimensions of a graph are particularly important, because those are the keys of a `DataCoordinate` that identifies the dimensions of that graph.

You can also ask a `DimensionGraph` for its `temporal` and `spatial` dimensions:

In [65]:
print(graph.temporal)

{visit}


In [66]:
print(graph.spatial)

{visit_detector_region}


That last answer is a bit unexpected - `visit_detector_region` isn't even one of the dimensions in the graph!  Instead, it's a table that's part of the dimensions system, without being an actual `Dimension` itself.  It can't be used as a data ID key, but it is used to provide other information about true dimensions.  In this case, that extra information is the `region` associated with the _combination_ of a `visit` and a `detector`.  A `visit` has its own region, but the system knows that the one provided by `visit_detector_region` is more specific and hence a better match for this graph.

The graph's spatial dimension is what defines the `region` attribute of any `DataCoordinate` for that graph.  That's why we want to select the "best" spatial dimension rather than reporting all of them; if a graph has more than one spatial dimension and neither is clearly better, accessing a data ID's region does not give you a usable region (in the release used here it evaluates to `NotImplemented`):

In [71]:
dataIdWithAmbiguousRegion = butler.registry.expandDataId({"skymap": "discrete/ci_hsc", "tract": 0, "instrument": "HSC", "visit": 903334})
print(dataIdWithAmbiguousRegion.graph.spatial)

{tract, visit}


In [72]:
dataIdWithAmbiguousRegion.region

NotImplemented
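
Since the cell above evaluates to `NotImplemented` rather than a region, code that consumes `region` may want to guard against that value. A minimal sketch follows, using `SimpleNamespace` objects as stand-ins for data IDs; the `usable_region` helper is hypothetical, and treating `None` as "no spatial dimensions" is an assumption about this version of the stack.

```python
from types import SimpleNamespace


def usable_region(data_id):
    """Return ``data_id.region`` if it is an actual region, else `None`.

    Hypothetical helper: treats both `None` (assumed to mean "no spatial
    dimensions") and `NotImplemented` (ambiguous region) as "no region".
    """
    region = data_id.region
    if region is None or region is NotImplemented:
        return None
    return region


# Stand-ins for real data IDs; a real region would be an lsst.sphgeom object.
ambiguous = SimpleNamespace(region=NotImplemented)
well_defined = SimpleNamespace(region="a region object")

print(usable_region(ambiguous))
print(usable_region(well_defined))
```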

### SkyPix Dimensions

The `htm7` and `htm9` dimensions in this universe are special; they're represented in code by the `SkyPixDimension` subclass:

In [73]:
htm9 = butler.registry.dimensions["htm9"]
htm9

SkyPixDimension(htm9)

A skypix dimension represents a particular level of a particular hierarchical pixelization of the sky, which corresponds to an instance of [lsst.sphgeom.Pixelization](http://doxygen.lsst.codes/stack/doxygen/x_masterDoxyDoc/classlsst_1_1sphgeom_1_1_pixelization.html).

In [74]:
htm9.pixelization

HtmPixelization(9)

A `Pixelization` instance knows how to go from integer IDs to regions on the sky and back:

In [76]:
print(htm9.pixelization.envelope(expanded.region))

[(3031552, 3031568), (3031573, 3031574), (3031576, 3031577), (3031578, 3031581), (3031588, 3031590), (3031591, 3031592), (3031594, 3031595), (3031596, 3031597), (3031601, 3031616), (3033344, 3033360), (3033368, 3033369), (3033370, 3033372), (3033377, 3033378), (3033380, 3033384), (3033386, 3033387), (3033388, 3033389), (3033390, 3033392), (3033393, 3033408), (3034368, 3034384), (3034392, 3034393), (3034394, 3034396), (3034404, 3034405), (3034420, 3034432), (3178752, 3178768), (3178773, 3178774), (3178776, 3178777), (3178778, 3178781), (3178788, 3178790), (3178791, 3178792), (3178794, 3178795), (3178796, 3178797), (3178801, 3178816), (3180032, 3180048), (3180050, 3180051), (3180053, 3180054), (3180056, 3180062), (3180063, 3180064), (3180068, 3180070), (3180071, 3180072), (3180074, 3180075), (3180076, 3180077), (3180081, 3180096), (3182080, 3182096), (3182104, 3182105), (3182116, 3182118), (3182119, 3182120), (3182132, 3182137), (3182138, 3182141), (3182142, 3182144)]
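
The envelope is printed as a list of integer ranges rather than individual pixel IDs. The sketch below shows one way to work with such ranges in plain Python, assuming each pair is a half-open `[begin, end)` interval (an assumption worth checking against the `lsst.sphgeom` documentation). The `iter_pixels` and `contains` helpers are hypothetical, and only the first three ranges from the output above are copied in.

```python
# First three ranges from the envelope printed above, copied as plain tuples.
ranges = [(3031552, 3031568), (3031573, 3031574), (3031576, 3031577)]


def iter_pixels(ranges):
    """Yield every pixel ID covered by half-open (begin, end) ranges."""
    for begin, end in ranges:
        yield from range(begin, end)


def contains(ranges, pixel):
    """Return True if the pixel ID falls inside any half-open range."""
    return any(begin <= pixel < end for begin, end in ranges)


pixels = list(iter_pixels(ranges))
print(len(pixels))
print(contains(ranges, 3031552), contains(ranges, 3031568))
```

Under the half-open assumption, the end value of each range (such as `3031568`) is not itself part of the envelope.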


In [77]:
print(htm9.pixelization.pixel(3031552))

ConvexPolygon([UnitVector3d(0.773010453362737, -0.6343932841636455, 0.0), UnitVector3d(0.7710605242618138, -0.6367618612362842, 0.0), UnitVector3d(0.7725653343241956, -0.6349265062178546, -0.003337049974569806)])


Because these mappings are fully defined in code, we don't also put them in the database, so `skypix` dimensions don't have their own entries in the `Registry` database.

However, we also use *one* skypix dimension, called the "common" skypix dimension, as a sort of spatial index that relates all other spatial dimensions in the database.  It still doesn't have its own table, but the database does have join tables that relate the common skypix dimension's IDs to the primary keys of other dimensions.  Usually this should be completely transparent.  You can get the common skypix dimension from the universe:

In [78]:
butler.registry.dimensions.commonSkyPix

SkyPixDimension(htm7)

### Querying Dimensions

You can use `Registry.queryDimensions` to run complex queries that return data IDs.  It accepts many of the same arguments as `queryDatasets`, and it also returns an iterator.  The first argument is the set of dimensions the returned data IDs should include.  This can be any iterable over strings or `Dimension` instances, and will be expanded to a self-consistent `DimensionGraph` automatically:

In [79]:
for dataId in butler.registry.queryDimensions(["visit"], instrument="HSC", physical_filter="HSC-I"):
    print(dataId)

{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903990}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904010}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904014}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903988}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903986}


It's worth noting that you have to include `instrument` here if you want to pass `physical_filter` as a partial data ID (the same would be true in `queryDatasets`) because we convert the given data ID into a `DataCoordinate` *before* we run the query.  That can't work unless you specify the `instrument`, because the `physical_filter` dimension has a required dependency on the `instrument` dimension:

In [80]:
for dataId in butler.registry.queryDimensions(["visit"], physical_filter="HSC-I"):
    print(dataId)

KeyError: "No value in data ID (None) for required dimension 'instrument'."
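
The failure above happens before any query is run. A toy version of that kind of up-front check is sketched below; it is not the butler's code, and the `REQUIRED_DEPENDENCIES` map and `check_complete` helper are hypothetical.

```python
# Hypothetical required-dependency map for a few dimensions.
REQUIRED_DEPENDENCIES = {
    "instrument": (),
    "physical_filter": ("instrument",),
    "visit": ("instrument",),
}


def check_complete(data_id):
    """Raise KeyError if a key's required dependency is missing from data_id."""
    for name in data_id:
        for dependency in REQUIRED_DEPENDENCIES[name]:
            if dependency not in data_id:
                raise KeyError(
                    f"No value in data ID for required dimension '{dependency}'."
                )
    return dict(data_id)


# A complete data ID passes the check.
print(check_complete({"instrument": "HSC", "physical_filter": "HSC-I"}))

# A data ID without the instrument is rejected before any query could run.
try:
    check_complete({"physical_filter": "HSC-I"})
except KeyError as err:
    print("rejected:", err)
```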

You _can_ query on just `physical_filter` using a string expression...

In [81]:
for dataId in butler.registry.queryDimensions(["visit"], where="physical_filter = 'HSC-I'"):
    print(dataId)

{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903990}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904010}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904014}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903988}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903986}


...but this may not be what you want.  In a larger, more realistic repository, this will actually search over all instruments, and while our _convention_ of putting the instrument name in the physical filter string would still save us from getting results from those other instruments here, it might be less efficient, and variations on this case involving other dimensions might produce undesired results.  From that perspective, the requirement that any data ID passed be complete and self-consistent is a feature, not a bug, and we relax it in the string-based query system because it's intended to be much more flexible (and hence can't be as careful).
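
If you do use the string-based system, one way to keep the search constrained is to name the instrument in the expression as well. The sketch below only builds such an expression string. The `build_where` helper is hypothetical and its quoting is deliberately naive; the commented-out query at the end would need a real repository.

```python
def build_where(**constraints):
    """Join equality constraints into a single expression string.

    Hypothetical helper: string values are wrapped in single quotes, other
    values are inserted as-is, and no escaping is attempted.
    """
    terms = []
    for name, value in constraints.items():
        if isinstance(value, str):
            terms.append(f"{name} = '{value}'")
        else:
            terms.append(f"{name} = {value}")
    return " AND ".join(terms)


where = build_where(instrument="HSC", physical_filter="HSC-I")
print(where)

# With a real repository, the expression could then be passed along as in
# the cell above:
# butler.registry.queryDimensions(["visit"], where=where)
```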

Like `queryDatasets`, `queryDimensions` will automatically use spatial or temporal relationships, as well as implied dimension relationships:

In [82]:
for dataId in butler.registry.queryDimensions(["visit"], dataId={"skymap": "discrete/ci_hsc", "tract": 0, "patch": 69, "abstract_filter": "i"}):
    print(dataId)

{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903990}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903990}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904010}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904010}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904014}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 904014}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903988}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903988}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903986}
{abstract_filter: i, instrument: HSC, physical_filter: HSC-I, visit_system: 0, visit: 903986}
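
Each visit appears twice in the output above. If you only want distinct data IDs, you can de-duplicate the results on the Python side. The sketch below uses plain `(instrument, visit)` tuples as stand-ins for data IDs and keeps the first occurrence of each, preserving order; wrapping the results in `set(...)` would also work if order doesn't matter.

```python
# Stand-ins for the data IDs printed above: (instrument, visit) tuples.
rows = [
    ("HSC", 903990), ("HSC", 903990),
    ("HSC", 904010), ("HSC", 904010),
    ("HSC", 904014), ("HSC", 904014),
    ("HSC", 903988), ("HSC", 903988),
    ("HSC", 903986), ("HSC", 903986),
]

# dict.fromkeys keeps the first occurrence of each hashable item, in order.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```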


And while data IDs are not dependent on any dataset type or collection, you can query for the data IDs for which all of some set of datasets exist in one or more collections.  For example, this query returns all detectors for which a `flat` dataset exists in the `HSC/calib` collection:

In [87]:
set(butler.registry.queryDimensions(["detector"], datasets="flat", collections="HSC/calib"))

{{abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_000_HSC-I, detector: 0, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_001_HSC-I, detector: 1, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_004_HSC-I, detector: 4, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_005_HSC-I, detector: 5, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_006_HSC-I, detector: 6, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_010_HSC-I, detector: 10, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_011_HSC-I, detector: 11, physical_filter: HSC-I},
 {abstract_filter: i, instrument: HSC, calibration_label: gen2/flat_2013-11-03_012_HSC-I, detector: 12

Well.  That worked, but it actually gave us much more than we asked for, which was data IDs with just `detector` (and `instrument`, since that's a required dependency).  This is a bug, and I created a ticket for it ([DM-22176](https://jira.lsstcorp.org/browse/DM-22176)) the first time I gave (an earlier version of) this tutorial a few months ago.  Fixing it just hasn't been a priority.

It's not a bad thing to close on: a pretty representative example of the fact that this is all still under construction, even if it can already do a lot.

## Appendix: what about `put`?

It's just like `get`, but you pass the thing you want to write as the first argument, and you can't use `parameters`.  You can use `DatasetRef`.  There is no return value.  You can find more documentation [here](https://pipelines.lsst.io/py-api/lsst.daf.butler.Butler.html#lsst.daf.butler.Butler.put).

That won't work here (by design; there's a good chance you're using a shared example repo we don't want to break), because the `Butler` we constructed is read-only.  If you want a read-write `Butler`, construct the `Butler` with a `run` argument instead of a `collections` argument, or pass `writeable=True`.  A "run" is a special kind of collection; the [daf_butler docs](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections) have a more complete description of the types of collections.  The `shared/ci_hsc_output` collection we've been using is actually a run-type collection, but passing it as a regular collection tells the `Butler` that we don't plan to write anything to it.