from __future__ import print_function, division, absolute_import

from . import common

__all__ = ("CameraDataSpec", "RepoDatabase")


class CameraDataSpec(object):
    """An object that specializes Units for a particular instrument.

    Design Notes
    ------------
    This class may become abstract, with derived classes for each camera in
    the future.  It should probably be integrated with afw.cameraGeom
    (which also describes the layout of sensors on the focal plane).  Unlike
    afw.cameraGeom, it should contain only static information (i.e. it will
    not be versioned when e.g. electronics details change or positions are
    perturbed slightly).
    """

    def __init__(self, name, filters):
        self.name = name
        self.filters = filters


class RepoDatabase(object):
    """An interface to metadata in a repository, as represented by `Units`
    and `Datasets`.

    Design Notes
    ------------
    `RepoDatabase` is a concrete class that is aware of the concrete set of
    Units that are used when processing optical/NIR astronomical imaging data.
    It probably implicitly assumes that it's backed by a SQL database (at least
    it would probably be hard to implement without one), but it is unaware of
    the details of the schema or the DBMS; all of that is hidden behind the
    `Backend` interface.

    `RepoDatabase` plays a role somewhat similar to the current
    `obs.base.CameraMapper` class, which is the lowest layer in the current
    `Butler`/`Mapper` system that is aware of astronomical concepts.  Unlike
    `CameraMapper`, however, it puts camera-based data units (visits,
    sensors) and SkyMap-based units (tracts, patches) on an even footing, and
    via `Units` it demands the per-camera specializations be mapped to a
    fairly rigid common data model that can be used by camera-generic
    algorithmic code.  A single `RepoDatabase` also explicitly supports
    multiple cameras (via `CameraDataSpec`) and multiple skymaps (via
    `lsst.skymap` objects).

    The ideal relationship between `RepoDatabase` and `Mapper` is as yet
    unclear.  At least at first, `RepoDatabase` can work with the existing
    `Mappers` without requiring them to be changed.  In that mode, a
    `RepoDatabase` would sit "next to" a `Mapper` (or, more generally, the set
    of `Mapper`s in that repo's parent chain), and generate `Dataset`s that can
    be translated to `datasetType` strings and dictionary-style data IDs for
    use by the `Mapper`.  But as `RepoDatabase` becomes more capable, we
    probably want to retire the existing registry databases, and it would
    probably make sense to have `Mapper` use `RepoDatabase`s in that role
    instead: dictionary-style data IDs would be expanded into fully-qualified
    `Datasets` by queries against a `RepoDatabase`, and these would be used
    directly to map to a location and storage for actual retrieval.  This
    *could* ultimately eliminate the need to specialize `Mapper` for each
    camera; all camera-specific content could move to `CameraDataSpec`.  I
    think it's more likely that we'll want to keep camera-specific `Mapper`s
    but trim them down to just YAML configurations that override the locations
    for a few datasets, with all camera-specific *code* moved to
    `CameraDataSpec`.

    At present, `RepoDatabase` just uses `lsst.skymap` objects directly, and
    considers that package's `TractInfo` and `PatchInfo` classes to be a
    parallel description of what's in its own `TractUnit` and `PatchUnit`
    classes.  That redundancy is more confusing than helpful, though, and once
    the rest of the `RepoDatabase` design settles down, it'd make sense to
    unify the tract and patch classes.  A particular `SkyMap` class would then
    just be responsible for generating a set of `TractUnit` and `PatchUnit`
    classes that support the full functionality of `TractInfo` and
    `PatchInfo`, which would then be stored directly in the `RepoDatabase`
    itself (including their WCSs and detailed bounding boxes) instead of as a
    separate `Butler`-accessed dataset.

    Because `RepoDatabase` contains Python objects (types, in particular)
    as well as a SQL database connection, we need a way to persist and
    unpersist it to a special file stored in a repository.  At present we use
    pickle, but it probably should be stored as part of the YAML repository
    configuration used by `Butler`.

    The other major remaining design challenges for `RepoDatabase` are:

    - How do we split storage across multiple chained repositories?  The
      current design represents the entire content of a repository (including
      its parents) via a single database backend.  This *should* be just a
      `Backend`/`Butler` problem: a `RepoDatabase` shouldn't care how it is
      stored, and it should (in the future) be the responsibility of a
      `Butler` to construct a `Backend` for a particular endpoint repository
      and a `RepoDatabase` from that.  But it's quite possible there's
      something in the current interface between `RepoDatabase` and `Backend`
      that will need to change to keep that separation of concerns.

    - How do we let cameras specialize `Unit`s, and in particular add their
      own labels for camera-generic concepts like `Visit`?  The general plan
      is to add per-camera tables for `Unit`s that are identified as
      belonging to a camera; these would be joined with the camera-generic
      tables for those `Unit`s in the queries run by `makeGraph`, allowing
      camera-specific labels to be used in the where clause that represents
      the user's data ID expression.  The details of that need to be worked
      out.

    - How do we support date-based joins between calibration `Unit`s and
      visit/sensor `Unit`s?

    - How do we support spatial joins between visit/sensor `Unit`s, skymap
      `Unit`s, reference-catalog shard `Unit`s, and any other partition of
      the sky?

    - Do we need, and if so, how do we support the addition of new `Unit`
      instances to a `RepoDatabase` by a `SuperTask` supervisory framework?
      Normally, new `Unit`s are added by "ingest-like" steps (this includes
      ingesting raw data, ingesting raw calibrations, and adding a new
      SkyMap), which define new `Unit`s and only add `Dataset`s that
      represent data products that already exist.  That model may not work
      for master calibration data products, which are identified by date-like
      `Unit`s that it doesn't make sense to ask the user to "ingest" in
      advance of actually running the `Pipeline` that produces them.

    - Do we need, and if so, how do we support the addition of new `Unit`
      *types* without modifying the `RepoDatabase` implementation itself?
      In discussions of an early version of this design, some concern about
      losing the flexibility to define new data ID keys was expressed, and
      while a clear use case for user-defined `Unit` classes has not been
      identified, it might be prudent to find a way to support them, even
      if the new `Unit`s are not treated exactly the same way as those
      a `RepoDatabase` is intrinsically aware of.
    """

    UNIT_CLASSES = (common.CameraUnit, common.SkyMapUnit,
                    common.TractUnit, common.PatchUnit,
                    common.FilterUnit,)

    def __init__(self, backend):
        self.backend = backend
        self._cameras = {}
        self._skyMaps = {}

    def create(self):
        """Create all `Unit` tables required by the `RepoDatabase`.

        This should only be called once, when a `RepoDatabase` is first
        constructed (not merely unpersisted).
        """
        for UnitClass in self.UNIT_CLASSES:
            self.backend.createUnitTable(UnitClass)

    def addCamera(self, camera):
        """Add `Units` to the `RepoDatabase` defined by a `CameraDataSpec`.

        This adds all `FilterUnit` instances used by the camera to the
        database.

        Parameters
        ----------
        camera : `CameraDataSpec`
            Object describing camera-specific aspects of the data model.

        Design Notes
        ------------
        In the future, this should also add sensor `Unit`s that are not
        attached to visit `Unit`s.

        In the future, this should add tables for camera-specific labels for
        visit, sensor, and filter `Unit`s (and possibly calibration `Unit`s
        as well).
        """
        cameraUnit = common.CameraUnit(name=camera.name)
        self.backend.insertUnit(cameraUnit)
        self._cameras[camera.name] = (camera, cameraUnit)
        for f in camera.filters:
            filterUnit = common.FilterUnit(name=f, camera=cameraUnit)
            self.backend.insertUnit(filterUnit)
        # TODO: add table for raw Dataset type

    def addSkyMap(self, skyMap, name):
        """Add `Unit`s to the `RepoDatabase` defined by a `SkyMap`.

        This adds a `SkyMapUnit` to the database, enabling the user
        to call `addTracts` to actually add `TractUnit` and `PatchUnit`
        instances to the database.

        Parameters
        ----------
        skyMap : subclass of `lsst.skymap.BaseSkyMap`
            An object that describes a set of tracts and patches that tile
            the sky.
        name : `str`
            A unique name for this skymap.  This needs to uniquely identify
            the skymap *instance* (i.e. including configuration), not just
            its type.
        """
        skyMapUnit = common.SkyMapUnit(name=name)
        self.backend.insertUnit(skyMapUnit)
        self._skyMaps[name] = (skyMap, skyMapUnit)

    def addTracts(self, skyMapName, only=None):
        """Add `TractUnit` and `PatchUnit` instances to the database.

        Parameters
        ----------
        skyMapName : `str`
            Name under which the skymap that generates these tracts was
            registered in the call to `addSkyMap`.
        only : sequence of `int`
            A list of `lsst.skymap` tract IDs (i.e. `TractUnit.number` values)
            to limit which tracts to add.  `None` (default) adds all tracts.
        """
        skyMap, skyMapUnit = self._skyMaps[skyMapName]
        # Patch indices are collected across all tracts so that each (x, y)
        # pair is inserted only once per skymap.
        allPatches = set()
        if only is None:
            iterable = skyMap
        else:
            iterable = (skyMap[t] for t in only)
        for tract in iterable:
            tractUnit = common.TractUnit(number=tract.getId(),
                                         skymap=skyMapUnit)
            self.backend.insertUnit(tractUnit)
            for patch in tract:
                x, y = patch.getIndex()
                allPatches.add((x, y))
        for x, y in allPatches:
            patchUnit = common.PatchUnit(x=x, y=y, skymap=skyMapUnit)
            self.backend.insertUnit(patchUnit)
        # TODO: tract-patch join table

    def registerDatasetType(self, DatasetClass):
        """Add a table for a new `Dataset` type to the database.

        This is a no-op if a table for the dataset type already exists.
        """
        self.backend.createDatasetTable(DatasetClass)

    def addDataset(self, dataset):
        """Add an instance of a `Dataset` to the database.

        This should only be called when the corresponding data product actually
        exists in the repository, i.e. during ingest or after a successful call
        to `Butler.put`.
        """
        self.backend.insertDataset(dataset)

    def makeGraph(self, UnitClasses=(), where=None,
                  NeededDatasets=(), FutureDatasets=()):
        """Create a `RepoGraph` that represents a possibly-restricted view
        into the database.

        Parameters
        ----------
        UnitClasses : sequence of type objects that inherit from `Unit`
            Include at least these unit types in the graph, which naturally
            restricts the graph to the intersection (across the predefined
            relationships between these units) of what is in the database for
            all of these units.  This sequence is expanded to include any unit
            type related to the unit types in the sequence and any units
            related to any `Dataset` types in `NeededDatasets` or
            `FutureDatasets`, and hence can frequently be an empty sequence.
        where : `str`
            An optional SQL where clause operating on the tables for the
            `Unit`s and `NeededDatasets` that restricts the graph.
        NeededDatasets : sequence of type objects that inherit from `Dataset`.
            Include these `Dataset` types in the graph, and restrict the
            graph to the intersection of the instances of these `Datasets`
            that already exist in the database.  Typically this should be the
            set of pure input datasets needed by a `Pipeline`.
        FutureDatasets : sequence of type objects that inherit from `Dataset`.
            Include these `Dataset` types in the graph, but do not restrict
            the graph based on whether they already exist in the database.
            Typically this should be the set of datasets produced by a
            `Pipeline`.

        Design Notes
        ------------
        The `where` argument currently requires the user to know about the
        actual database schema.  We need to abstract this user-provided
        expression somehow and have the `Backend` turn it into SQL.
        """
        return self.backend.makeGraph(
            UnitClasses=UnitClasses, where=where,
            NeededDatasets=NeededDatasets,
            FutureDatasets=FutureDatasets
        )
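
# Illustration only, not part of the module: a minimal sketch of the control
# flow in `RepoDatabase.addTracts`, written against stand-in types so that it
# runs without `lsst.skymap` or a real `Backend`.  Every name below
# (`FakePatch`, `FakeTract`, `RecordingBackend`, `add_tracts`, and the
# namedtuple unit classes) is hypothetical; the real unit classes live in
# `common` and may take different arguments.  The one deliberate difference
# from the method above is that patch indices are sorted before insertion,
# purely to make the output deterministic.

```python
from collections import namedtuple

# Hypothetical stand-ins for the unit classes defined in `common`.
SkyMapUnit = namedtuple("SkyMapUnit", ["name"])
TractUnit = namedtuple("TractUnit", ["number", "skymap"])
PatchUnit = namedtuple("PatchUnit", ["x", "y", "skymap"])


class FakePatch(object):
    """Stand-in for a skymap patch; only getIndex() is provided."""

    def __init__(self, x, y):
        self._index = (x, y)

    def getIndex(self):
        return self._index


class FakeTract(object):
    """Stand-in for a skymap tract: has an ID and iterates over its patches."""

    def __init__(self, tractId, nx, ny):
        self._id = tractId
        self._patches = [FakePatch(x, y) for x in range(nx) for y in range(ny)]

    def getId(self):
        return self._id

    def __iter__(self):
        return iter(self._patches)


class RecordingBackend(object):
    """Hypothetical backend that only remembers what was inserted."""

    def __init__(self):
        self.units = []

    def insertUnit(self, unit):
        self.units.append(unit)


def add_tracts(backend, skyMap, skyMapUnit, only=None):
    """Mirror the loop structure of RepoDatabase.addTracts on the stand-ins."""
    allPatches = set()
    iterable = skyMap if only is None else (skyMap[t] for t in only)
    for tract in iterable:
        backend.insertUnit(TractUnit(number=tract.getId(), skymap=skyMapUnit))
        for patch in tract:
            allPatches.add(patch.getIndex())
    # Each distinct (x, y) index is inserted once, however many tracts use it.
    for x, y in sorted(allPatches):
        backend.insertUnit(PatchUnit(x=x, y=y, skymap=skyMapUnit))


# A plain list stands in for the skymap: it is iterable and indexable by tract.
skyMap = [FakeTract(0, 2, 2), FakeTract(1, 2, 3)]
backend = RecordingBackend()
add_tracts(backend, skyMap, SkyMapUnit(name="demo"), only=[1, 0])

tracts = [u for u in backend.units if isinstance(u, TractUnit)]
patches = [u for u in backend.units if isinstance(u, PatchUnit)]
print(len(tracts), len(patches))
```

# The two stand-in tracts share four patch indices, so the backend receives
# two tract units but only six patch units rather than ten.  This is why the
# method's TODO calls for a tract-patch join table: the patch units alone do
# not record which tracts contain them.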