Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-31924: new high-level interface for making IDs for catalogs and their rows #239

Merged
merged 4 commits into from Apr 27, 2023

Conversation

TallJimbo
Copy link
Member

No description provided.

Copy link
Contributor

@parejkoj parejkoj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the test coverage of the new id classes? I'm having trouble figuring out how much of the new code would be executed by the tests you added, and whether the tests are actually independent of the code itself.

You say "integers" throughout this: is that intentionally "number of bits" generic, or do you mean 32-bit integers?

I think this could use some narrative documentation on how to use it, probably in doc/lsst.meas.base/config/.

python/lsst/meas/base/_id_generator.py Show resolved Hide resolved
"""Default release ID to embed in catalog IDs.

This can be changed globally to avoid having to override individual task
configs to set the release ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention then that this number gets bumped up for the pipelines release that is going to be used to generate the next data release? That procedure should get documented in the "how to make a release" docs, so that we're ready for it in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a stretch to say I have a plan for that right now and I don't want to claim to have invented one, though that is more or less I how I envisioned it working (or at least that's better than trying to override this in config in all relevant tasks).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough: do you think it's worth a short RFC to describe your idea for this process, and so we can think about other places we'll have to intentionally bump version numbers for releases? At least to have more than you and me being aware of this setting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and a terrible thought: do we need to worry about point releases for data releases? I hope not!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll add a quick note to RFC-917 when adopting it, so (as you said) what's present now has a bit more visibility. I've got too many other things that need my attention right now to spend any more time on fleshing this scheme out.

class BaseIdGeneratorConfig(Config):
"""Base class for configuration of `IdGenerator` instances.

This class is abstract (it cannot use `abc.ABCMeta` for technical reasons),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the technical reasons? If you can't make it an explicit abc, the class docstring probably needs to not which methods need to be implemented in the children.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metaclass conflict; pex.config also has a custom metaclass.

python/lsst/meas/base/_id_generator.py Outdated Show resolved Hide resolved
python/lsst/meas/base/_id_generator.py Outdated Show resolved Hide resolved
python/lsst/meas/base/_id_generator.py Show resolved Hide resolved
self.n_data_ids = 1 << self.packer.maxBits
upper_bits = (self.n_releases - 1).bit_length() + self.packer.maxBits
self.counter_bits = IdFactory.computeReservedFromMaxBits(upper_bits)
self.n_counters = 1 << self.counter_bits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class docstring should probably note these three attributes, even though this is intended to be private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? I don't think of attribute docs as going in the class docstring at all; the class Parameters is for __init__ arguments, and these are not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that: I see them as init=False fields now, with docstrings, so that's fine.

python/lsst/meas/base/_id_generator.py Outdated Show resolved Hide resolved
python/lsst/meas/base/_id_generator.py Show resolved Hide resolved
tests/test_id_generator.py Show resolved Hide resolved
def __str__(self) -> str:
"""Return a human-readable representation of the data ID (or a note
about its absence) for use in log and error messages.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the dataId the only thing we want to include in these str, or should it say something about the number of bits, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was envisioning str here as being really for logs and exceptions about the catalog, not logs and exceptions about the IdGenerator itself. There are a lot of tasks whose run methods take an idGenerator and that effectively abstracts over whether there is a data ID at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this point on one of the other PRs; using the id generator to access a stringified dataId seems like a bit of an abuse of the system. Maybe just note in the IdGenerator docs that we use it as such?

@property
def catalog_id(self) -> int:
"""The integer identifier for the full catalog with this data ID, not
just one of its rows (`int`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I dig into the uses of this, I am surprised that exposureIdInfo.expId is replaced with idGenerator.catalog_id. Thinking about it, the latter makes sense (you're using it for catalog-y things, not exposure-y things), but it might surprise others when they go to look at it ("why doesn't the new packer have an exposure_id I can get?"). Probably just worth being clear about this change when you announce the merging of this ticket.

Copy link
Member Author

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the test coverage of the new id classes? I'm having trouble figuring out how much of the new code would be executed by the tests you added, and whether the tests are actually independent of the code itself.

Coverage of the new file is 89%; what's missing is mostly just the deprecated things (I've gotten accustomed to using coverage to help me write tests, too).

You say "integers" throughout this: is that intentionally "number of bits" generic, or do you mean 32-bit integers?

I specifically mean 64-bit signed integers for the full row ID, since that's what afw.table uses (FITS says we can't have uint64, and various project-level docs says 64 bits).

"""Default release ID to embed in catalog IDs.

This can be changed globally to avoid having to override individual task
configs to set the release ID.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a stretch to say I have a plan for that right now and I don't want to claim to have invented one, though that is more or less I how I envisioned it working (or at least that's better than trying to override this in config in all relevant tasks).

class BaseIdGeneratorConfig(Config):
"""Base class for configuration of `IdGenerator` instances.

This class is abstract (it cannot use `abc.ABCMeta` for technical reasons),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metaclass conflict; pex.config also has a custom metaclass.

python/lsst/meas/base/_id_generator.py Outdated Show resolved Hide resolved
raise NotImplementedError("Method is abstract.")


class ExposureIdGeneratorConfig(BaseIdGeneratorConfig):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no all-encompassing name here, since InstrumentId wouldn't be specific enough to resolve "exposure" vs "visit". But I'm happy to renaming this to DetectorExposureIdGeneratorConfig (and also rename the visit one to DetectorVisitIdGeneratorConfig) if you think that's better.

return self.packer.apply(data_id)


class IdGenerator:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intended to be a base class but it is not an abstract one; the base class is a concrete trivial implementation that's useful for unit tests or in other contexts where the task is being run with no data ID.

python/lsst/meas/base/_id_generator.py Show resolved Hide resolved
f"{k}={v!r}" for k, v in kwargs.items()
]
raise ValueError(
f"Integer range from numpy.arange({arg_terms}) has values that are not "
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to come with a way to do that that doesn't potentially print a huge array, but couldn't come up with one that didn't involve trying to reimplement np.arange arg-handling.

python/lsst/meas/base/_id_generator.py Show resolved Hide resolved
self.n_data_ids = 1 << self.packer.maxBits
upper_bits = (self.n_releases - 1).bit_length() + self.packer.maxBits
self.counter_bits = IdFactory.computeReservedFromMaxBits(upper_bits)
self.n_counters = 1 << self.counter_bits
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? I don't think of attribute docs as going in the class docstring at all; the class Parameters is for __init__ arguments, and these are not.

tests/test_id_generator.py Show resolved Hide resolved
Copy link
Contributor

@parejkoj parejkoj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the added narrative doc; a few comments on formatting there.


- `SkyMapIdGeneratorConfig` for ``{tract, patch}`` or ``{tract, patch, band}}`` data IDs (most object tables);
- `DetectorVisitIdGeneratorConfig` for ``{visit, detector}`` data IDs (most source tables);
- `DetectorExposureIdGeneratorConfig` for ``{exposure, detector}`` data IDs (rare, at least in the default LSST data model).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these all need the full namespace in order to get built and linked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The

.. py:currentmodule:: lsst.meas.base

at the top of the file takes care of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's cool! I wasn't aware of that. I'll have to remember it for my own docs.

- `DetectorVisitIdGeneratorConfig` for ``{visit, detector}`` data IDs (most source tables);
- `DetectorExposureIdGeneratorConfig` for ``{exposure, detector}`` data IDs (rare, at least in the default LSST data model).

These have a `~BaseIdGeneratorConfig.make_field` method that can be used to define the config field with minimal boilerplate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the full namespace is necessary here, too? And in the paragraph below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above)

This wraps (and slightly reimplements some of)
lsst.afw.table.IdFactory, and replaces lsst.obs.base.ExposureIdInfo.
These tasks do not really need need to generate IDs, as they take an
input catalog of some sort and use the IDs from that (forcedPhotCcd
does also make its own IDs, but I don't think they're useful and I
don't think we have room for them in the final schema).  But fixing
that is something best taken into account when replacing the tasks
rather than something we'd ever benefit from addressing in the current
ones.

I've replaced/renamed some arguments to generateMeasCat without
deprecation as I'm pretty sure those methods are effectively private
even though we didn't have the foresight to give them a leading
underscore.
This better reflects most real-world usage.
@TallJimbo TallJimbo merged commit dc3ce67 into main Apr 27, 2023
1 check passed
@TallJimbo TallJimbo deleted the tickets/DM-31924 branch April 27, 2023 02:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants