New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-31924: new high-level interface for making IDs for catalogs and their rows #239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the test coverage of the new id classes? I'm having trouble figuring out how much of the new code would be executed by the tests you added, and whether the tests are actually independent of the code itself.
You say "integers" throughout this: is that intentionally "number of bits" generic, or do you mean 32-bit integers?
I think this could use some narrative documentation on how to use it, probably in doc/lsst.meas.base/config/
.
"""Default release ID to embed in catalog IDs. | ||
|
||
This can be changed globally to avoid having to override individual task | ||
configs to set the release ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intention then that this number gets bumped up for the pipelines release that is going to be used to generate the next data release? That procedure should get documented in the "how to make a release" docs, so that we're ready for it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a stretch to say I have a plan for that right now and I don't want to claim to have invented one, though that is more or less I how I envisioned it working (or at least that's better than trying to override this in config in all relevant tasks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough: do you think it's worth a short RFC to describe your idea for this process, and so we can think about other places we'll have to intentionally bump version numbers for releases? At least to have more than you and me being aware of this setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and a terrible thought: do we need to worry about point releases for data releases? I hope not!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll add a quick note to RFC-917 when adopting it, so (as you said) what's present now has a bit more visibility. I've got too many other things that need my attention right now to spend any more time on fleshing this scheme out.
class BaseIdGeneratorConfig(Config): | ||
"""Base class for configuration of `IdGenerator` instances. | ||
|
||
This class is abstract (it cannot use `abc.ABCMeta` for technical reasons), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the technical reasons? If you can't make it an explicit abc, the class docstring probably needs to not which methods need to be implemented in the children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metaclass conflict; pex.config also has a custom metaclass.
self.n_data_ids = 1 << self.packer.maxBits | ||
upper_bits = (self.n_releases - 1).bit_length() + self.packer.maxBits | ||
self.counter_bits = IdFactory.computeReservedFromMaxBits(upper_bits) | ||
self.n_counters = 1 << self.counter_bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class docstring should probably note these three attributes, even though this is intended to be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? I don't think of attribute docs as going in the class docstring at all; the class Parameters
is for __init__
arguments, and these are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that: I see them as init=False
fields now, with docstrings, so that's fine.
def __str__(self) -> str: | ||
"""Return a human-readable representation of the data ID (or a note | ||
about its absence) for use in log and error messages. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the dataId the only thing we want to include in these str
, or should it say something about the number of bits, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was envisioning str
here as being really for logs and exceptions about the catalog, not logs and exceptions about the IdGenerator itself. There are a lot of tasks whose run
methods take an idGenerator
and that effectively abstracts over whether there is a data ID at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this point on one of the other PRs; using the id generator to access a stringified dataId seems like a bit of an abuse of the system. Maybe just note in the IdGenerator
docs that we use it as such?
@property | ||
def catalog_id(self) -> int: | ||
"""The integer identifier for the full catalog with this data ID, not | ||
just one of its rows (`int`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I dig into the uses of this, I am surprised that exposureIdInfo.expId
is replaced with idGenerator.catalog_id
. Thinking about it, the latter makes sense (you're using it for catalog-y things, not exposure-y things), but it might surprise others when they go to look at it ("why doesn't the new packer have an exposure_id I can get?"). Probably just worth being clear about this change when you announce the merging of this ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the test coverage of the new id classes? I'm having trouble figuring out how much of the new code would be executed by the tests you added, and whether the tests are actually independent of the code itself.
Coverage of the new file is 89%; what's missing is mostly just the deprecated things (I've gotten accustomed to using coverage
to help me write tests, too).
You say "integers" throughout this: is that intentionally "number of bits" generic, or do you mean 32-bit integers?
I specifically mean 64-bit signed integers for the full row ID, since that's what afw.table uses (FITS says we can't have uint64, and various project-level docs says 64 bits).
"""Default release ID to embed in catalog IDs. | ||
|
||
This can be changed globally to avoid having to override individual task | ||
configs to set the release ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a stretch to say I have a plan for that right now and I don't want to claim to have invented one, though that is more or less I how I envisioned it working (or at least that's better than trying to override this in config in all relevant tasks).
class BaseIdGeneratorConfig(Config): | ||
"""Base class for configuration of `IdGenerator` instances. | ||
|
||
This class is abstract (it cannot use `abc.ABCMeta` for technical reasons), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metaclass conflict; pex.config also has a custom metaclass.
raise NotImplementedError("Method is abstract.") | ||
|
||
|
||
class ExposureIdGeneratorConfig(BaseIdGeneratorConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no all-encompassing name here, since InstrumentId
wouldn't be specific enough to resolve "exposure" vs "visit". But I'm happy to renaming this to DetectorExposureIdGeneratorConfig
(and also rename the visit one to DetectorVisitIdGeneratorConfig
) if you think that's better.
return self.packer.apply(data_id) | ||
|
||
|
||
class IdGenerator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is intended to be a base class but it is not an abstract one; the base class is a concrete trivial implementation that's useful for unit tests or in other contexts where the task is being run with no data ID.
f"{k}={v!r}" for k, v in kwargs.items() | ||
] | ||
raise ValueError( | ||
f"Integer range from numpy.arange({arg_terms}) has values that are not " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to come with a way to do that that doesn't potentially print a huge array, but couldn't come up with one that didn't involve trying to reimplement np.arange
arg-handling.
self.n_data_ids = 1 << self.packer.maxBits | ||
upper_bits = (self.n_releases - 1).bit_length() + self.packer.maxBits | ||
self.counter_bits = IdFactory.computeReservedFromMaxBits(upper_bits) | ||
self.n_counters = 1 << self.counter_bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? I don't think of attribute docs as going in the class docstring at all; the class Parameters
is for __init__
arguments, and these are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the added narrative doc; a few comments on formatting there.
|
||
- `SkyMapIdGeneratorConfig` for ``{tract, patch}`` or ``{tract, patch, band}}`` data IDs (most object tables); | ||
- `DetectorVisitIdGeneratorConfig` for ``{visit, detector}`` data IDs (most source tables); | ||
- `DetectorExposureIdGeneratorConfig` for ``{exposure, detector}`` data IDs (rare, at least in the default LSST data model). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these all need the full namespace in order to get built and linked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
.. py:currentmodule:: lsst.meas.base
at the top of the file takes care of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's cool! I wasn't aware of that. I'll have to remember it for my own docs.
- `DetectorVisitIdGeneratorConfig` for ``{visit, detector}`` data IDs (most source tables); | ||
- `DetectorExposureIdGeneratorConfig` for ``{exposure, detector}`` data IDs (rare, at least in the default LSST data model). | ||
|
||
These have a `~BaseIdGeneratorConfig.make_field` method that can be used to define the config field with minimal boilerplate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the full namespace is necessary here, too? And in the paragraph below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(see above)
This wraps (and slightly reimplements some of) lsst.afw.table.IdFactory, and replaces lsst.obs.base.ExposureIdInfo.
These tasks do not really need need to generate IDs, as they take an input catalog of some sort and use the IDs from that (forcedPhotCcd does also make its own IDs, but I don't think they're useful and I don't think we have room for them in the final schema). But fixing that is something best taken into account when replacing the tasks rather than something we'd ever benefit from addressing in the current ones. I've replaced/renamed some arguments to generateMeasCat without deprecation as I'm pretty sure those methods are effectively private even though we didn't have the foresight to give them a leading underscore.
This better reflects most real-world usage.
a75724b
to
bf7d0f9
Compare
No description provided.