Skip to content

DM-54083: add VisitImage and butler integration#7

Merged
TallJimbo merged 31 commits intomainfrom
tickets/DM-54083
Feb 25, 2026
Merged

DM-54083: add VisitImage and butler integration#7
TallJimbo merged 31 commits intomainfrom
tickets/DM-54083

Conversation

@TallJimbo
Copy link
Member

@TallJimbo TallJimbo commented Feb 20, 2026

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@TallJimbo TallJimbo force-pushed the tickets/DM-54083 branch 2 times, most recently from e2a290a to bb8b45a Compare February 20, 2026 21:59
Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments for a Friday afternoon. I'll look more on Monday


def check_unhandled_parameters(self) -> None:
if self.file_descriptor.parameters:
raise RuntimeError("Parameters {")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to write something else for the error message?

import astropy.io.fits

try:
from lsst.daf.butler import Butler, DataCoordinate, DatasetRef, DatasetType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no harm in adding lsst-daf-butler to pyproject.toml as a tests optional dependency for the purposes of testing (I see it is in requirements.txt).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add it to pyproject.toml as its own feature:

# Add feature for Butler support.
butler = ["lsst-daf-butler"]

because I figured some pip users might want butler support while others didn't. Is there still a reason to add it there as a tests feature dependency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I only suggested for tests since I saw it in the tests package.

root = self._exit_stack.enter_context(
tempfile.TemporaryDirectory(ignore_cleanup_errors=True, delete=True)
)
Butler.makeRepo(root)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Butler.makeRepo(root)
config = Butler.makeRepo(root)

then below Butler.from_config(config, ...) is in theory more efficient.

empty_data_id = DataCoordinate.make_empty(self.butler.dimensions)
for name, storage_class in self._kwargs.items():
dataset_type = DatasetType(name, self.butler.dimensions.empty, storage_class)
self.butler.registry.registerDatasetType(dataset_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line fails with KeyError Projection if you have the wrong butler set up. Maybe catch and add_note indicating that the butler is too old?

compression
Compression options.
compression_seed
A FITs tile compression seed to use whenever the configured
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A FITs tile compression seed to use whenever the configured
A FITS tile compression seed to use whenever the configured

There are a few more "FITs" in the code.


# Inherited attributes are duplicated because that improves the docs
# (some limitation in the sphinx/pydantic integration), and these are
# import docs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important?

@classmethod
def read_fits(cls, url: ResourcePathExpression, *, bbox: Box | None = None) -> MaskedImage:
"""Read a masked image from a FITS file.
"""Read an image from a FITS file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the doc string was right before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I find myself copying these a lot, and since (informally) "image" applies to all of these types, I figured it's less likely I'll get one dramatically wrong by using that.

@TallJimbo TallJimbo force-pushed the tickets/DM-54083 branch 2 times, most recently from 8e0f2e2 to 42895ff Compare February 22, 2026 15:23
Copy link
Member Author

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed your comments so far, as fixup! comments that I'll squash later.

One issue I've already spotted in starting work downstream of this ticket is that I think the kwargs on the view and copy methods of the various image types are trying to do too much; changing the units or the projection (especially on a view) is going to be a very rare thing and it's better to just ask users to use the constructor for that. Changing the start is an even bigger problem, because it's something the CellCoadd subclass of MaskedImage can't really support, which makes for a Liskov substitution error.

The mask schema is sort of an exception: copying a Mask or something holding a Mask in order to add a new mask plane seems like it might be really common. But for that we probably want a dedicated method that only copies the Mask and even then only when needed.

Since this affects both of the downstream tickets I'm planning to work on next (color images an cell coadds), I'm going to do it as one more commit on this branch, coming soon.


@classmethod
def from_index_row(cls, row: np.ndarray) -> ExtensionKey:
"""Construct from a row of the index binary table appended to the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because I never turned it on. And while I don't like not having a lint for getting the Parameters wrong, I kind of do like not having a lint for cases like this where there's only one parameter and its docs would just be a repeat of the function docstring; I'm not convinced we need Parameters there, just like I don't think we need always Returns when the type is provided by the annotation.. But maybe that's an option I can set.

header.set(key, value)


class ComponentSentinal(enum.Enum):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This word and "writable" are my nemeses. I feel I've got spelling under control except for those two 🤦‍♂️.

import astropy.io.fits

try:
from lsst.daf.butler import Butler, DataCoordinate, DatasetRef, DatasetType
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add it to pyproject.toml as its own feature:

# Add feature for Butler support.
butler = ["lsst-daf-butler"]

because I figured some pip users might want butler support while others didn't. Is there still a reason to add it there as a tests feature dependency?

@classmethod
def read_fits(cls, url: ResourcePathExpression, *, bbox: Box | None = None) -> MaskedImage:
"""Read a masked image from a FITS file.
"""Read an image from a FITS file.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I find myself copying these a lot, and since (informally) "image" applies to all of these types, I figured it's less likely I'll get one dramatically wrong by using that.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I ran out of steam before the tests since I have a load of things piling up this afternoon, but I don't want to block you on me not getting around to the test suite.

of a legacy (`lsst.afw.image`) FITS file.
"""
primary_header = header.copy(strip=True)
# No idea what these spare TAN-SIP headers are doing in the afw
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests in the afw test code looking for them...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see afw tests looking for A_ORDER and B_ORDER when they appear as part of actual TAN-SIP WCSs, presumably in headers for actual image HDUs. What's bizarre here is that:

  • this is an empty primary HDU, so it has no business having a WCS;
  • there are no other WCS keys, just A_ORDER and B_ORDER.

def strip_legacy_exposure_cards(header: astropy.io.fits.Header) -> None:
"""Strip header keywords added by lsst.afw.image.Exposure."""
header.remove("AR_HDU", ignore_missing=True)
for name in (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, these are the VisitInfo headers.

    std::vector<std::string> keyList = {"EXPTIME",     "DARKTIME",    "DATE-AVG", "TIMESYS",
                                        "TIME-MID",    "MJD-AVG-UT1", "AVG-ERA",  "BORE-RA",
                                        "BORE-DEC",    "BORE-AZ",     "BORE-ALT", "BORE-AIRMASS",
                                        "BORE-ROTANG", "ROTTYPE",     "OBS-LONG", "OBS-LAT",
                                        "OBS-ELEV",    "AIRTEMP",     "AIRPRESS", "HUMIDITY",
                                        "INSTRUMENT",  "IDNUM",       "FOCUSZ",   "OBSTYPE",
                                        "PROGRAM",     "REASON",      "OBJECT",   "HAS-SIMULATED-CONTENT"};

and annoyingly they clash with some of the LSSTCam raw headers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they clash with the LSSTCam raw headers, I'm guessing I can't just blindly strip them all? Or has the damage already been done by the time we get to visit_image (i.e. raw header overwritten by VisitInfo), so I should just strip them?

If necessary I can cook up some way to get a few more header keys from the raws when I rewrite the visit_image datasets, especially if this means we have to extract the ObservationInfo from the raw rather than the old visit_image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The damage has been done in that the visit info overwrites. It's a nightmare and we long ago talked about fixing visit info to not do that but never got around to it.

page_size
Minimum number of bytes to read at at once. Making this a multiple
of the FITS block size (2880) is recommended.
partial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first looked at this I assumed partial meant that the file would be left open so that additional reads can happen efficiently to minimize memory pressure. I see now that it means that you can read a subset to save on memory but the file is closed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I very much had implementing Formatter interfaces in mind here.

if name not in self._opaque_metadata.headers:
if key not in self._opaque_metadata.headers:
opaque_header = reader.header.copy(strip=True)
opaque_header.remove("EXTNAME", ignore_missing=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not overly happy with this stripping being repeated in many places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to keep header stripping of a particular keyword in the same class that adds that keyword on write (or converts the legacy type that added that keyword), which is part of the reason it's spread out. But I might be able to put EXTNAME and EXTVER stripping in particular somewhere it'd be more reusable.


INVALID_COMPONENT_MODEL = enum.auto()
"""This formatter recognizes the given component, but the expected
attribute of the top-level `..serialization.ArchiveTree` did not exist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure .. doesn't work for sphinx, or at least that's what I learned when I was trying to fix warnings in daf_butler. I think this has no impact because ComponentSentinel or any of the formatter classes are not in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure .. does work, but it's relative to Sphinx's hierarchy, not Python's, which are not always the same, and of course it doesn't work if you put it in a base class docstring and that gets inherited in another module.

But yeah, we can't use this as a test case until we get the daf_butler docs in good enough shape to allow a downstream subclass like this to inherit those docstrings in another package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I will need to revisit. I only managed to get a single dot working last time I tried but that might have been because all the bonus packages in butler confuse things.

return np.dtype(t.to_numpy()).kind == "u"
def is_integer(t: NumberType) -> TypeGuard[IntegerType]:
"""Test whether a `NumberType` corresponds to an integer type."""
match np.dtype(t.to_numpy()).kind:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is match that much better than:

return np.dtype(t.to_numpy()).kind in ("i", "u")

?

bit = self.bit(plane.name)
if bit.index != 0:
raise TypeError("Only mask schemas with mask_size==1 can be described in FITS.")
header.set(f"MSKN{n + 1:04d}", plane.name, f"Name for mask plane {n + 1}.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is MSKN and lsst.images convention or a broader convention? We should be documenting these somewhere. In some sense it's more explicit if we use HIERARCH LSST IMAGES MASK NAME and then everyone knows what's happening and that it's not a standard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new lsst.images convention I just invented now. I figured if we were going to try to push for it to become a registered convention at some point it'd be better as a non-HIERARCH keyword, but I'm happy to defer to you and @gpdf on that point, and I agree that if we're not going to push to make it a registered convention then a longer name with HIERARCH is better.

raise ValueError(
"Visit ID could not be found in butler data ID metadata and must be provided."
) from None
if unit is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that this can mean you can specify a unit that doesn't match the BUNIT and confusion would reign.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a consistency check here.

md = legacy.getMetadata()
if instrument is None:
try:
instrument = str(md["LSST BUTLER DATAID INSTRUMENT"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also going to be in the INSTRUME header. Same worry below that the file could say LATISS but you pass in LSSTCam.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if that would have the exact same value? I was hoping to consistently use the butler short name, i.e. the one in data IDs.

And if they're different variants of the same name it gets a lot harder to do a consistency check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LSSTCam instrume does not match the butler name so in retrospect it's not great to use it. You'd have to throw the header at a metadata translator and ask it for the standardized name but that's already in the provenance and the visit info serialization.


Parameters
----------
filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this can be a ResourcePathExpression and you can open the file with

fs, fspath = ResourcePath(filename).to_fsspec()
with fs.open(fspath) as f, astropy.io.fits.open(f) as fits_obj:
    ...

we do that in dax_images_cutout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that only works for the Astropy reads here; there's also some ExposureFitsReader usage that requires a local file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping you had written the ability for an astropy user to read in DP1 visit images without needing afw...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for a full Exposure -> VisitImage. But I can go apply this suggestion to MaskedImage.read_legacy, Image.read_legacy, and Mask.read_legacy (since those don't need to use afw readers), and give them an option to attach the FITS header WCSs as their Projection, and that would provide a way for Astropy users to read the most user-relevant bits of a DP1 visit image.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed two more commits to enable this (reading URIs, attaching FITS WCS) for Image/Mask/MaskedImage. I didn't add anything to the legacy formatters in obs_base to use it, but in theory we've now got at a low level what we need for (e.g.) the DP1 cutout service to return one of those without completely dropping the WCS.

@timj
Copy link
Member

timj commented Feb 24, 2026

I don't think we can strip visit info unless we are creating observation info.

Saving multiple HDUs with same now automatically increments EXTVER.
To be extra careful about overflow we drop one bit from each element
in this case.
This is primarily motivated by Firefly's preference for multiple int32
HDUs vs. a data cube, since they also need to support naturally
higher-dimension bitmasks.  But it also ties in nicely with CFITSIO
and Astropy using int32 as the intermediate type for all image
compression.
Avoid writing types in reStructuredText when they'll be added to the
docs via annotations.

Avoid noting the serialize/deserialize signature in specific instances
when it's a generic feature of the library.
This sets Pydantic config to round-trip inf/NaN and bytes properly
through JSON.  The Pydantic docs claim these operate on each model in
the hierarchy independently (i.e. so a model could set the configs only
if it saves values that they would apply to), but this is demonstrably
not true for at least the inf/nan case, where the option needs to be
set on the outermost model to have any effect.  Safest to just set them
everywhere.
Instead of passing this as a constructor argument, we just assign to
objects after creation.  This involves more boilerplate, but it's more
consistent with treating that metadata as private.

This change also fixes a few spots where archive metadata was not being
propagated.
We can now serialize any type with both serialize and deserialize
methods and the new _get_archive_tree_type method.

The Image, Mask, and MaskedImage classes have read_fits/write_fits
wrappers for convenience and type-checking (explicit keyword arguments
instead of **kwargs).
This includes adding support for an external compression seed to
FitsOutputArchive, which we'll populate with a hash of a data ID
in the formatter.
This requires a tiny change in the formatters in obs_base, since we
can't quite make everything work with an in-memory conversion.
The inherited docstrings have way too many now-invalid relative links
to other butler types, and users don't gain much from seeing them in
the docs.
These are not obviously useful in more than a handful of rare cases
(where they're only a tiny bit more convenient than the constructor),
and they're both full of boilerplate and potentially problematic for
subclassing.
We can't extend this to VisitImage because we need to use
lsst.afw.image.ExposureFitsReader there.
@TallJimbo TallJimbo merged commit cea53f4 into main Feb 25, 2026
14 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-54083 branch February 25, 2026 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants