New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-24830: Create dataset class for processed bright star stamps #205
Conversation
8872d97
to
525dabe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I looked over it all, but I will look over it again when you have made some changes.
The git history is also messed up. This will need fixed before it is merged. In general you should rebase on master not merge master into your branch. I'm happy to help work through git if you would like.
Parameters | ||
---------- | ||
starStamps : `list` [`dict`] | ||
Sequence of star stamps, each containing keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If schema of the star stamps is fixed, dont use a dictionary. Use a NamedTuple (either type) or a (data)class [your choice of dataclass or regular]. If you choose the class approach, used slots to define your attributes.
We can talk about this if you need more info
and ``"OUTER_RADIUS"`` are not present in ``metadata`` _and_ | ||
``innerRadius`` and ``outerRadius`` are not provided. | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example section does not give any useful info to the user. I would add the bit about the bbox to a Notes section, and drop the examples section. The lines of code more explain how to use a butler.
item : `dict` | ||
Mapping containing keys ``"starStamp"``, ``"gaiaGMag"``, | ||
``"gaiaId"``, and `"annularFlux"`. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you switch to some data record like I mentioned above you can drop the check function and have:
if not isinstance(item, BrightStarStamp):
raise ValueError(f"Can only add instances of BrightStarStamp, got {type(item)}")
if metadata is None: | ||
self._metadata = PropertySet() | ||
else: | ||
self._metadata = metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can write this as:
self._metadata = PropertySet() if metadata is None else metadata
And being this small you may consider putting directly in init, but that is up to you.
|
||
Parameters | ||
---------- | ||
starStamps : `list` [`dict`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to be a list, or would any Sequence do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any MutableSequence would do. Changed.
Note I'm not sure if this should be abc.MutableSequence
in the docstring then, since we do not import collections.abc
. Is it okay to reference an unimported library in a docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sphinx does its own imports and look-ups. MutableSequence is a derived class of Sequence, and as you are not using any mutability aspect of the input, it is better to tell people you just need a sequence (i.e. someone could pass a tuple and it would work just fine)
metadata=metadata) | ||
with self.assertRaises(AttributeError): | ||
_ = brightStarStamps.BrightStarStamps(self.starStamps, outerRadius=self.outerRadius + 1, | ||
metadata=metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about testing supplying the radius if it does not exist in the metadata, which is probably the only time the arguments will be used.
import collections.abc | ||
import numpy as np | ||
|
||
from lsst import afw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we dont import just afw. You will more commonly see
import lsst.afw.image as afwImage
import lsst.afw.fits as afwFits
import numpy as np | ||
|
||
from lsst import afw | ||
from lsst import geom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because you are using very few symbols, its probably worth the saved characters later to do
from lsst.geom import Box2I, Point2I, Extent2I
|
||
Returns | ||
------- | ||
gaiaIds : `list` [`str`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I ask why str here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I figured it made sense for unique ids to be strings, even if (as seems to be the case here with Gaia DR2) they happen to be ints. Now that I actually take a step back and think about it, I cannot really justify it, I suppose.
Would you recommend specifying int, or not providing the type at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be in whatever form it is in the original reference catalog, so that if you get the value here, it can be used to look-up w/o any transformation.
+ f"values ({metadataName}).") | ||
# if not already in metadata, a value must be provided | ||
elif radiusValue is None: | ||
raise AttributeError("No radius value provided for the AnnularFlux measurement " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This msg should say something about no value provided, and none present in the metadata, so the user can track down that there are two different places it could come from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I am not entirely sure why this exists, the user must already supply the meta data field, and the calculations for the stamps must be done prior to creating this class (I guess append could add some if you had an empty list) The radii should be known at the point of creating this class then. Why have the arguments instead of just enforcing they are in the meta data passed in. Is it just a convenience for the user? I think the class would be simpler if the radii just came in with the meta data. Its a bit confusing to have different ways of setting it, the user might not know which one to use, or think they must use both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is confusing as I tried to take into account how these objects will be generated and used. There are a lot of disparate elements that led me to write things this way:
- These
starStamps
objects have already been normalized, using their annularFlux value, for set values of annular radii. It only makes sense to store them together within oneBrightStarStamps
instance if they have all used the same annulus definition. - We will want to write a great many of these to disk; we need to include the radii used in the written fits. It seems the best way to do so is to add these values to the metadata, that will then be set within the
writeFits
method (line 233 in this version of the file). - As discussed in the thread about the metadata getter,
self._metadata
is likely to be the full metadata of theExposure
from which the stamps were extracted, with extra fields added here. The “user” who creates these instances (DM-25304) has this readily available on the one hand, and the radii values they used on the other. I thought it would make sense to have the class itself merge the two internally, so this user only needs to dump the exposure metadata, and the two radii values, at initialization. - Another use case will be the reading, concatenation, and further use (DM-25305) of several BrightStarStamps, saved to disk. We very much want to ensure we are concatenating bright star stamps that used the same annular flux radii; but since we’re reading them from disk, that information is in the metadata now.
I do think the duplication here is good, or at least necessary: we want users to be able to provide their annular radii when they create a new instance, and dump metadata at the same time. If that metadata comes from another BrightStarStamps (likely read from disk), right then is a good time to fail if the radii used do not match those in the metadata. Otherwise, they can safely be added.
Having written all this, however, I realize this would perhaps look a little less out-of-place with an extend
method, so I added one. This is what will be used for concatenation in DM-25304, rather than append
; and it makes it clearer (I think) why we want this _checkRadius
there. This led me to shuffling things around a little; _checkRadius
is called directly in the init now, and the two radii are also kept as private members.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does. Its really just a question of who is responsible for adding these values to the metadata. If you want the class to handle it in the case of arguments as a convenience to the user, that's fine. I do like the addition of the extend method. However, in regards to your first built point, you actually have no mechanism in place to verify that the radii are the same in the append method. You should either require radii as args, put a big warning in the method docstring, or require adding elements be done by calling extend with a new instance.
6699203
to
e9da388
Compare
|
||
Parameters | ||
---------- | ||
item : `dict` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the doc string
from lsst.geom import Box2I, Point2I, Extent2I | ||
from lsst.daf.base import PropertySet | ||
|
||
BrightStarStamp = namedtuple("BrightStarStamp", 'starStamp gaiaGMag gaiaId annularFlux') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the new style of named tuple typing.NamedTuple
. It allows you to declare things with class syntax and put documentation with the type.
@@ -0,0 +1,274 @@ | |||
# This file is part of meas_algorithms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need a __all__
statement for whatever classes, functions, etc you want available when you import. This is to prevent things like being able to from brightStarStamps import PropertySet
because you happened to import it here. This also helps our documentation engine know where things come from. We have been bad about doing this in the past, so older code you look at will not be good examples. It's best to get into the habit early
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The __all__
should be after the license according to the style guide: https://developer.lsst.io/python/style.html?highlight=__all__#standard-code-order-should-be-followed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have sworn I had read that differently not that long ago, but git history tells me otherwise, I feel as if I am loosing my memory or something. Thanks @timj for pointing that out.
|
||
Returns | ||
------- | ||
gaiaIds : `list` [`str`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be in whatever form it is in the original reference catalog, so that if you get the value here, it can be used to look-up w/o any transformation.
modified by the caller and the changes will be written to | ||
external files. | ||
""" | ||
return self._metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the object contain new information or not makes no difference to the access pattern of the instance attribute, you could still have .metadata
even if you mutate it. An attribute is just a flatter way to access data contained by the class. However your comment does bring up some things:
- Since you are mutating the metadata passed into the class, you really should create a copy of it in
__init__
, since you don't know if your changes will affect something down stream - If you want the user to not modify the metadata, it is better to return a copy of your internal metadata object in a getter
- but if you really want to make it read only in a pythonic way, you should keep it as a private attribute
_metadata
and have a metadata property like I demoed above. This will prevent someone from reassigning the.metadata
to anything. If it is important that someone not assign to the property set you should also return a copy in the property, so that they can use values but any changes they make do not affect your class.. That will allow your class to have an interface such that someone could writebrightStars.metadata['some_property']
while still keeping the safety you want.
Putting that together would look something like:
class BrightStarStamps:
def __init__(..., metadata, ...):
...
self._metadata = metadata.copy()
...
@property
def metadata(self):
return self._metadata.copy()
Note copy is a shallow copy, so the container is new, and the keys are new, but the values are not duplicated. The amount of copying for protection is up to you, but as I suspect this is not a rapidly accessed member speed is probably not a concern.
# the one given | ||
if self._metadata.exists(metadataName): | ||
if radiusValue is not None: | ||
if self._metadata.get(name=metadataName) != radiusValue: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the __getitem__
interface:
if self._metadata[metadataName] != radiusValue:
Change this throughout the code
+ f"values ({metadataName}).") | ||
# if not already in metadata, a value must be provided | ||
elif radiusValue is None: | ||
raise AttributeError("No radius value provided for the AnnularFlux measurement " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does. Its really just a question of who is responsible for adding these values to the metadata. If you want the class to handle it in the case of arguments as a convenience to the user, that's fine. I do like the addition of the extend method. However, in regards to your first built point, you actually have no mechanism in place to verify that the radii are the same in the append method. You should either require radii as args, put a big warning in the method docstring, or require adding elements be done by calling extend with a new instance.
raise AttributeError("No radius value provided for the AnnularFlux measurement " | ||
+ f"({metadataName}), and none present in metadata.") | ||
else: | ||
self._metadata.add(name=metadataName, value=radiusValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use __setitem__
interface as mentioned above
bss : `BrightStarStamps` | ||
Other instance to concatenate. | ||
""" | ||
self._checkRadius(bss._innerRadius, "INNER_RADIUS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though in the case of this class it is rather limited in use, it is best to not use magic strings if it can be avoided. Magic strings are susceptible to typos by someone at some future point, which are impossible to lint, which leads to an error not being found until some future runtime error.
If you can enumerate all the possibilities (such as this case) it is better to use pythons Enum class like follows
from enum import Enum, auto
class RadiiEnum(Enum):
INNER_RADIUS = auto() # auto because we dont actually care what the enumeration number is
OUTER_RADIUS = auto()
This allows you to use and check like follows
self._checkRadius(bss._outerRadius, RadiiEnum.INNER_RADIUS)
....
def _checkRadius(self, radiusValue, metadataEnum):
if metadataEnum is RadiiEnum.INNER_RADIUS:
....
else:
....
Linters will then be able to verify everything is correct every time it is used. However we can do one better for your case:
class RadiiEnum(Enum):
INNER_RADIUS = auto()
OUTER_RADIUS = auto()
def __str__(self):
return self.name # name is the name of what ever variant the value happened to be
This will turn the enum into a string with the correct name when you use it as such:
def _checkRadius(self, radiusValue, metadataEnum):
metadataName = str(metadataEnum)
# The rest of your function can now remain the same
This way users can just use the enum values which will be checked by the linter, editors can auto complete, and it is self documenting in that all the possible options are enumerated (you know that their isnt some other metadata field you could check). The only failure can then be if you spell your enum name wrong, and that can be checked by a simple unit test.
return None | ||
|
||
def writeFits(self, filename): | ||
"""Write a single FITS file containing all bright star stamps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the get/set item interfaces of property set where appropriate
"""Collection of small images (stamps), each centered on a bright star. | ||
""" | ||
|
||
__all__ = ["BrightStarStamp", "BrightStarStamps"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you will not see this in a lot of code we have, because people often get this wrong (reviewers included) but all needs to be the first thing in the file, before the copyright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. https://developer.lsst.io/python/style.html?highlight=__all__#standard-code-order-should-be-followed
To be specific:
- Shebang line, #! /usr/bin/env python (only for executable scripts)
- Module-level comments (such as the license statement)
- Module-level docstring
__all__ = [...]
statement, if present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting further off topic, but I notice that __future__
imports aren't in that list in the dev guide, and I've been assuming they need to go before __all__
. Anyone know if that's correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jim if you are thinking about the annotation future import, it MUST be the absolute first thing in a file, because it changes how the interpreter works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure python complains if the __future__
is not the first line of code since the future applies to the entire file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other future things may or may not be safe to be put other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and by first, I mean before all or any other executed code, things like docstring and comments are not interpreted and or parsed, so it does not really matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case you care to know why, here is the page on future https://docs.python.org/3/library/__future__.html. basically there can be a CompilerFlag that changes how code is compiled into python byte code.
fff7d36
to
190fa34
Compare
This new dataset class will allow for small image cutouts ("stamps"), each centered around an individual bright star, to be saved to disk. These can then be used to stack a large number of such stars to build an extended PSF model. Each stamp image is normalized using the flux of the stars in annulus around their center (since the central area are saturated and/or contain numerous ghosts). The radii used to define the annulus are stored in the metadata, along with each individual annularFlux.
190fa34
to
dbf4ce3
Compare
One of 3 PRs for DM-24830 (the other two being obs_base and obs_subaru).
This is the main one, and contains both the class itself, and a few tests.
Jenkins run #31959