Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-24830: Create dataset class for processed bright star stamps #205

Merged
merged 1 commit into from Jul 30, 2020

Conversation

MorganSchmitz
Copy link
Contributor

@MorganSchmitz MorganSchmitz commented May 27, 2020

One of 3 PRs for DM-24830 (the other two being obs_base and obs_subaru).

This is the main one, and contains both the class itself, and a few tests.

Jenkins run #31959

@MorganSchmitz MorganSchmitz force-pushed the tickets/DM-24830 branch 3 times, most recently from 8872d97 to 525dabe Compare May 28, 2020 09:43
Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I looked over it all, but I will look over it again when you have made some changes.

The git history is also messed up. This will need fixed before it is merged. In general you should rebase on master not merge master into your branch. I'm happy to help work through git if you would like.

Parameters
----------
starStamps : `list` [`dict`]
Sequence of star stamps, each containing keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If schema of the star stamps is fixed, dont use a dictionary. Use a NamedTuple (either type) or a (data)class [your choice of dataclass or regular]. If you choose the class approach, used slots to define your attributes.

We can talk about this if you need more info

and ``"OUTER_RADIUS"`` are not present in ``metadata`` _and_
``innerRadius`` and ``outerRadius`` are not provided.

Examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example section does not give any useful info to the user. I would add the bit about the bbox to a Notes section, and drop the examples section. The lines of code more explain how to use a butler.

item : `dict`
Mapping containing keys ``"starStamp"``, ``"gaiaGMag"``,
``"gaiaId"``, and `"annularFlux"`.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you switch to some data record like I mentioned above you can drop the check function and have:

if not isinstance(item, BrightStarStamp):
    raise ValueError(f"Can only add instances of BrightStarStamp, got {type(item)}") 

if metadata is None:
self._metadata = PropertySet()
else:
self._metadata = metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can write this as:

self._metadata = PropertySet() if metadata is None else metadata

And being this small you may consider putting directly in init, but that is up to you.


Parameters
----------
starStamps : `list` [`dict`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be a list, or would any Sequence do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any MutableSequence would do. Changed.

Note I'm not sure if this should be abc.MutableSequence in the docstring then, since we do not import collections.abc. Is it okay to reference an unimported library in a docstring?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sphinx does its own imports and look-ups. MutableSequence is a derived class of Sequence, and as you are not using any mutability aspect of the input, it is better to tell people you just need a sequence (i.e. someone could pass a tuple and it would work just fine)

metadata=metadata)
with self.assertRaises(AttributeError):
_ = brightStarStamps.BrightStarStamps(self.starStamps, outerRadius=self.outerRadius + 1,
metadata=metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about testing supplying the radius if it does not exist in the metadata, which is probably the only time the arguments will be used.

import collections.abc
import numpy as np

from lsst import afw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we dont import just afw. You will more commonly see

import lsst.afw.image as afwImage
import lsst.afw.fits as afwFits

import numpy as np

from lsst import afw
from lsst import geom
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because you are using very few symbols, its probably worth the saved characters later to do

from lsst.geom import Box2I, Point2I, Extent2I


Returns
-------
gaiaIds : `list` [`str`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask why str here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I figured it made sense for unique ids to be strings, even if (as seems to be the case here with Gaia DR2) they happen to be ints. Now that I actually take a step back and think about it, I cannot really justify it, I suppose.

Would you recommend specifying int, or not providing the type at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be in whatever form it is in the original reference catalog, so that if you get the value here, it can be used to look-up w/o any transformation.

+ f"values ({metadataName}).")
# if not already in metadata, a value must be provided
elif radiusValue is None:
raise AttributeError("No radius value provided for the AnnularFlux measurement "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This msg should say something about no value provided, and none present in the metadata, so the user can track down that there are two different places it could come from

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I am not entirely sure why this exists, the user must already supply the meta data field, and the calculations for the stamps must be done prior to creating this class (I guess append could add some if you had an empty list) The radii should be known at the point of creating this class then. Why have the arguments instead of just enforcing they are in the meta data passed in. Is it just a convenience for the user? I think the class would be simpler if the radii just came in with the meta data. Its a bit confusing to have different ways of setting it, the user might not know which one to use, or think they must use both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is confusing as I tried to take into account how these objects will be generated and used. There are a lot of disparate elements that led me to write things this way:

  • These starStamps objects have already been normalized, using their annularFlux value, for set values of annular radii. It only makes sense to store them together within one BrightStarStamps instance if they have all used the same annulus definition.
  • We will want to write a great many of these to disk; we need to include the radii used in the written fits. It seems the best way to do so is to add these values to the metadata, that will then be set within the writeFits method (line 233 in this version of the file).
  • As discussed in the thread about the metadata getter, self._metadata is likely to be the full metadata of the Exposure from which the stamps were extracted, with extra fields added here. The “user” who creates these instances (DM-25304) has this readily available on the one hand, and the radii values they used on the other. I thought it would make sense to have the class itself merge the two internally, so this user only needs to dump the exposure metadata, and the two radii values, at initialization.
  • Another use case will be the reading, concatenation, and further use (DM-25305) of several BrightStarStamps, saved to disk. We very much want to ensure we are concatenating bright star stamps that used the same annular flux radii; but since we’re reading them from disk, that information is in the metadata now.

I do think the duplication here is good, or at least necessary: we want users to be able to provide their annular radii when they create a new instance, and dump metadata at the same time. If that metadata comes from another BrightStarStamps (likely read from disk), right then is a good time to fail if the radii used do not match those in the metadata. Otherwise, they can safely be added.

Having written all this, however, I realize this would perhaps look a little less out-of-place with an extend method, so I added one. This is what will be used for concatenation in DM-25304, rather than append; and it makes it clearer (I think) why we want this _checkRadius there. This led me to shuffling things around a little; _checkRadius is called directly in the init now, and the two radii are also kept as private members.

Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. Its really just a question of who is responsible for adding these values to the metadata. If you want the class to handle it in the case of arguments as a convenience to the user, that's fine. I do like the addition of the extend method. However, in regards to your first built point, you actually have no mechanism in place to verify that the radii are the same in the append method. You should either require radii as args, put a big warning in the method docstring, or require adding elements be done by calling extend with a new instance.


Parameters
----------
item : `dict`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the doc string

from lsst.geom import Box2I, Point2I, Extent2I
from lsst.daf.base import PropertySet

BrightStarStamp = namedtuple("BrightStarStamp", 'starStamp gaiaGMag gaiaId annularFlux')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the new style of named tuple typing.NamedTuple. It allows you to declare things with class syntax and put documentation with the type.

@@ -0,0 +1,274 @@
# This file is part of meas_algorithms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need a __all__ statement for whatever classes, functions, etc you want available when you import. This is to prevent things like being able to from brightStarStamps import PropertySet because you happened to import it here. This also helps our documentation engine know where things come from. We have been bad about doing this in the past, so older code you look at will not be good examples. It's best to get into the habit early

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __all__ should be after the license according to the style guide: https://developer.lsst.io/python/style.html?highlight=__all__#standard-code-order-should-be-followed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have sworn I had read that differently not that long ago, but git history tells me otherwise, I feel as if I am loosing my memory or something. Thanks @timj for pointing that out.


Returns
-------
gaiaIds : `list` [`str`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be in whatever form it is in the original reference catalog, so that if you get the value here, it can be used to look-up w/o any transformation.

modified by the caller and the changes will be written to
external files.
"""
return self._metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the object contain new information or not makes no difference to the access pattern of the instance attribute, you could still have .metadata even if you mutate it. An attribute is just a flatter way to access data contained by the class. However your comment does bring up some things:

  • Since you are mutating the metadata passed into the class, you really should create a copy of it in __init__, since you don't know if your changes will affect something down stream
  • If you want the user to not modify the metadata, it is better to return a copy of your internal metadata object in a getter
  • but if you really want to make it read only in a pythonic way, you should keep it as a private attribute _metadata and have a metadata property like I demoed above. This will prevent someone from reassigning the .metadata to anything. If it is important that someone not assign to the property set you should also return a copy in the property, so that they can use values but any changes they make do not affect your class.. That will allow your class to have an interface such that someone could write brightStars.metadata['some_property'] while still keeping the safety you want.

Putting that together would look something like:

class BrightStarStamps:
    def __init__(..., metadata, ...):
        ...
        self._metadata = metadata.copy()
        ...
    @property
    def metadata(self):
        return self._metadata.copy()

Note copy is a shallow copy, so the container is new, and the keys are new, but the values are not duplicated. The amount of copying for protection is up to you, but as I suspect this is not a rapidly accessed member speed is probably not a concern.

# the one given
if self._metadata.exists(metadataName):
if radiusValue is not None:
if self._metadata.get(name=metadataName) != radiusValue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the __getitem__ interface:

if self._metadata[metadataName] != radiusValue:

Change this throughout the code

+ f"values ({metadataName}).")
# if not already in metadata, a value must be provided
elif radiusValue is None:
raise AttributeError("No radius value provided for the AnnularFlux measurement "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. Its really just a question of who is responsible for adding these values to the metadata. If you want the class to handle it in the case of arguments as a convenience to the user, that's fine. I do like the addition of the extend method. However, in regards to your first built point, you actually have no mechanism in place to verify that the radii are the same in the append method. You should either require radii as args, put a big warning in the method docstring, or require adding elements be done by calling extend with a new instance.

raise AttributeError("No radius value provided for the AnnularFlux measurement "
+ f"({metadataName}), and none present in metadata.")
else:
self._metadata.add(name=metadataName, value=radiusValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use __setitem__ interface as mentioned above

bss : `BrightStarStamps`
Other instance to concatenate.
"""
self._checkRadius(bss._innerRadius, "INNER_RADIUS")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though in the case of this class it is rather limited in use, it is best to not use magic strings if it can be avoided. Magic strings are susceptible to typos by someone at some future point, which are impossible to lint, which leads to an error not being found until some future runtime error.

If you can enumerate all the possibilities (such as this case) it is better to use pythons Enum class like follows

from enum import Enum, auto

class RadiiEnum(Enum):
    INNER_RADIUS = auto()  # auto because we dont actually care what the enumeration number is
    OUTER_RADIUS = auto()

This allows you to use and check like follows

self._checkRadius(bss._outerRadius, RadiiEnum.INNER_RADIUS)
....
    def _checkRadius(self, radiusValue, metadataEnum):
        if metadataEnum is RadiiEnum.INNER_RADIUS:
            ....
        else:
            ....

Linters will then be able to verify everything is correct every time it is used. However we can do one better for your case:

class RadiiEnum(Enum):
    INNER_RADIUS = auto()
    OUTER_RADIUS = auto()

    def __str__(self):
        return self.name #  name is the name of what ever variant the value happened to be

This will turn the enum into a string with the correct name when you use it as such:

    def _checkRadius(self, radiusValue, metadataEnum):
        metadataName = str(metadataEnum)
       #  The rest of your function can now remain the same

This way users can just use the enum values which will be checked by the linter, editors can auto complete, and it is self documenting in that all the possible options are enumerated (you know that their isnt some other metadata field you could check). The only failure can then be if you spell your enum name wrong, and that can be checked by a simple unit test.

return None

def writeFits(self, filename):
"""Write a single FITS file containing all bright star stamps.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the get/set item interfaces of property set where appropriate

"""Collection of small images (stamps), each centered on a bright star.
"""

__all__ = ["BrightStarStamp", "BrightStarStamps"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you will not see this in a lot of code we have, because people often get this wrong (reviewers included) but all needs to be the first thing in the file, before the copyright

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. https://developer.lsst.io/python/style.html?highlight=__all__#standard-code-order-should-be-followed

To be specific:

  • Shebang line, #! /usr/bin/env python (only for executable scripts)
  • Module-level comments (such as the license statement)
  • Module-level docstring
  • __all__ = [...] statement, if present

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting further off topic, but I notice that __future__ imports aren't in that list in the dev guide, and I've been assuming they need to go before __all__. Anyone know if that's correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jim if you are thinking about the annotation future import, it MUST be the absolute first thing in a file, because it changes how the interpreter works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure python complains if the __future__ is not the first line of code since the future applies to the entire file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other future things may or may not be safe to be put other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and by first, I mean before all or any other executed code, things like docstring and comments are not interpreted and or parsed, so it does not really matter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you care to know why, here is the page on future https://docs.python.org/3/library/__future__.html. basically there can be a CompilerFlag that changes how code is compiled into python byte code.

This new dataset class will allow for small image cutouts ("stamps"),
each centered around an individual bright star, to be saved to disk.
These can then be used to stack a large number of such stars to build
an extended PSF model. Each stamp image is normalized using the flux
of the stars in annulus around their center (since the central area
are saturated and/or contain numerous ghosts). The radii used to define
the annulus are stored in the metadata, along with each individual
annularFlux.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants