
DM-36797: Add RhoStatistics #47

Merged: 10 commits merged into main on Dec 22, 2022

Conversation

arunkannawadi (Member):

No description provided.

@rmjarvis self-requested a review December 2, 2022 18:01
@arunkannawadi force-pushed the tickets/DM-36797 branch 2 times, most recently from badcca1 to 3b9f385 on December 6, 2022 16:06
@arunkannawadi marked this pull request as ready for review December 6, 2022 16:17
# chi-type ellipticity has a shear response of 2, so we need to
# divide by 2 so that the rho-stats do not depend on the
# ellipticity-type.
responsitivity = 2.0 if self.ellipticityType == "chi" else 1.0
Member Author:

The introduction of responsitivity is new and was not present in analysis_drp. It ensures that the rho statistics always correspond to the additive contamination of the shear-shear correlation function, regardless of the ellipticity type we choose.

Reviewer:

This is only right in the limit that e is small. That's probably ok for this purpose, but the distortion responsivity is actually 2-<|e|^2>.
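For reference, a minimal sketch of the exact distortion responsivity mentioned above (a sketch only; the function name is illustrative, and e1, e2 are assumed to be chi-type ellipticity components):

import numpy as np

def distortion_responsivity(e1, e2):
    # Exact shear responsivity for chi-type (distortion) ellipticities:
    # R = 2 - <|e|^2>. For nearly round PSF stars <|e|^2> ~ 0, so R ~ 2.
    e_sq = np.asarray(e1) ** 2 + np.asarray(e2) ** 2
    return 2.0 - np.mean(e_sq)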

Member Author:

The e there is the intrinsic ellipticity, isn't it? The point source is essentially round, so its intrinsic ellipticity is zero. Anyway, this is also set to match what was used with the reGauss shapes in the HSC weak-lensing analysis code.

@rmjarvis (Dec 12, 2022):

No, it's the PSF ellipticity. Distortion version, to be explicit.

Member Author:

For HSC validation, the responsitivity has been set to exactly 2. Given that the next big validation run will be on HSC, I'd like to keep it this way for consistency. We can make it more accurate later on, but for LSST we'd likely be using a shear estimator rather than distortion, so it shouldn't matter much anyway.

Reviewer:

So, you want to validate that you get the same (slightly) wrong answer... OK, I guess. Wouldn't be my choice though. I'd calculate the ratio you expect to be different because of this as part of the validation, with comments saying why the HSC values are a little off.

@arunkannawadi force-pushed the tickets/DM-36797 branch 2 times, most recently from 3d7816c to d023606 on December 6, 2022 21:49
bands = ListField[str](
doc="The bands to apply the flags in, takes precedence if band supplied in kwargs",
default=["i"],
)
Collaborator:

Is this what you want the default behaviour to be? If you want it to default to the band the plot is being made in then set this to [].


def setDefaults(self):
    self.selectWhenTrue = ["{band}_calib_psf_used"]

Collaborator:

Typically for this we have been using the generic flag selector with this option set; do we need a whole new selector? Are we imagining that there will be more columns that need to be set here? If so, it would be good to update all the other places that plot PSF sources.

Member Author:

I could go with FlagSelector(selectWhenTrue=["{band}_calib_psf_used"])
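A minimal sketch of that suggestion, using the generic FlagSelector from analysis_tools (the import location and variable name below are assumptions, not verified against the package layout):

from lsst.analysis.tools.actions.vector import FlagSelector  # assumed import location

# Generic flag selector configured to keep only sources flagged as used in
# PSF modelling for the band being plotted, replacing a dedicated selector.
psfUsedSelector = FlagSelector()
psfUsedSelector.selectWhenTrue = ["{band}_calib_psf_used"]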

Collaborator:

It would replace the entire psf selector


for dataInfo in dataId:
    plotInfo[dataInfo.name] = dataId[dataInfo.name]

return plotInfo
Collaborator:

Does the plotInfo function in plotUtils not work for this? They look pretty much identical except that the standard one has more information in it.

Member Author:

It won't, because inputs[connectionName] is a list here: this task reads in multiple datasets of the same type (objectTable_tract in this case), whereas every other task reads in only one dataset of a given type. Handling both cases in the base function seemed a bit ugly, given that the rho-statistics task is the only exception.

Also, the parsePlotInfo in plotUtils.py seems unused. There's a duplicate implementation in tasks/base.py, which is the default one. Perhaps we should remove the former?

Collaborator:

I would have thought that this would come up more regularly, it might be good to generalise the plotUtils one.

Member Author:

So, generalize the parsePlotInfo in utils, remove the method with the same name from tasks/base.py, and use the utility function both here and in tasks/base.py? That sounds like a separate ticket to me.

Member Author:

DM-37419

Collaborator:

I don't think avoiding unnecessary code duplication is really another ticket. Having two of these will mean that some plots might end up displaying different things, and the plot-info generation will get out of sync.

Member Author:

There is already code duplication in the main branch; the ticket is to deduplicate that. This plotInfo will also be a bit different for this task because of the multiple tracts in it, and even more so when DM-37330 lands. In the object-oriented paradigm it is okay for simple, short methods like these to be overridden for special cases, without worrying about partial code duplication. I strongly prefer that over the base method handling all possible cases. Any addition to the base method that also applies to this task has to be added here as well; if we miss that, we will catch it in one of the metrics meetings and can quickly add it here.

Member Author:

I've reduced some duplication by refactoring the parsePlotInfo in the base class. The only duplication is the if/else branch about inputs being None. I am not sure why we need that in the first place.

A separate config class is used instead
of constructing a `~lsst.pex.config.DictField` so that mixed types can be
supported and the config can be validated at the beginning of the
execution. The ``toDict`` provides a minimal dictionary that override only
@sr525 (Collaborator, Dec 12, 2022):

override -> overrides

of constructing a `~lsst.pex.config.DictField` so that mixed types can be
supported and the config can be validated at the beginning of the
execution. The ``toDict`` provides a minimal dictionary that override only
the default values and excludes the key-values pairs when the item is None.
Collaborator:

values -> value

"of the other three."
),
optional=True,
)
Collaborator:

What happens if three of them aren't set?

Reviewer:

Assuming it gets through to TreeCorr, it would raise an exception in that case. Not sure if that's what you want here or not.

Member Author:

But it wouldn't get to treecorr because an exception would be raised here:

req_params = (self.nbins, self.bin_size, self.min_sep, self.max_sep)
num_req_params = sum(param is not None for param in req_params)
if num_req_params != 3:
    msg = (
        "You must specify exactly three of ``nbins``, ``bin_size``, ``min_sep`` and ``max_sep``"
        f" in treecorr_config. {num_req_params} parameters were set instead."
    )
    raise FieldValidationError(self.__class__.bin_size, self, msg)

This duplicates what treecorr does, but that is desirable because we want misconfiguration to be captured at config time, not at runtime, which is what would happen if we let treecorr handle it.
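A sketch of how this surfaces at config time rather than at runtime (the config class name TreecorrConfig is illustrative; FieldValidationError is the lsst.pex.config exception raised in the snippet above):

from lsst.pex.config import FieldValidationError

config = TreecorrConfig()  # hypothetical config class holding the fields above
config.nbins = 10
config.min_sep = 0.1
# Only two of nbins/bin_size/min_sep/max_sep are set, so validation fails
# before any catalogs are read or TreeCorr is invoked.
try:
    config.validate()
except FieldValidationError as err:
    print(err)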

Collaborator:

I saw the catch below, makes sense now.

)

split_method = ChoiceField[str](
doc=("How to split the cells in the tree when building the tree " "structure."),
Collaborator:

Odd formatting, tree " "structure."

Collaborator:

Also below

Member Author:

Ah, black seems to have converted two-line docstrings into single-line ones (rightly so) but without removing the extra quotation marks.

Collaborator:

This is present in a couple of places.

msg = (
    "You must specify exactly three of ``nbins``, ``bin_size``, ``min_sep`` and ``max_sep``"
    f" in treecorr_config. {num_req_params} parameters were set instead."
)
Collaborator:

I guess this answered my question; would it be useful, though, to have defaults set for the parameters?

Member Author:

I don't think reasonable defaults can be set that are agnostic to the data and the problem at hand. I believe that is why treecorr does not have defaults for these either. The default values for all other fields correspond to the defaults of the corresponding quantities in treecorr, so for the sake of consistency I'd like to leave these without any defaults.

allowed={scale: scale for scale in ("linear", "log", "symlog")},
)

xLinthresh = Field[float](
Collaborator:

Inconsistent camel case

Member Author:

xLinThresh?

Collaborator:

For the others as well.

),
default=None,
optional=True,
)

Reviewer:

I'm not sure you want num_threads to be settable either. That should probably be set based on something in the DM configuration elsewhere, rather than letting the user set it explicitly here, but I don't know enough to recommend anything specific.

Member Author:

Removed a bunch of fields, including num_threads, because it is easier to add fields (and even backport them) than to remove them.


.. math::

\rho_0(\theta) &= \left\langle \frac{\delta T_{PSF}}{T_{PSF}}(x) \frac{\delta T_{PSF}}{T_{PSF}}(x+\theta) \right\rangle # noqa: E501, W505

Reviewer:

rho0 is traditionally what people call <e*_psf e_psf>. So I think using this name for a different statistic is likely to be confusing.

Member Author:

rho6 then?

Member Author:

Is there any publication where rho0 is defined?

Reviewer:

I don't know if I've seen it in anything published. But the SLAC people who do PSF work use that notation pretty regularly.

Reviewer:

Marco used it here: https://arxiv.org/pdf/2011.03408.pdf Figure 8.

Member Author:

Since it does not involve any PSF residuals, I'm unhappy with using rho0 for plain ellipticity correlations. I could go with rho6, although I suspect that would prove to be a bad choice as well. I could name it something like mu0, but that could lead to this plot becoming undiscoverable :-|

@rmjarvis (Dec 20, 2022):

Maybe just call it dT_auto or something more direct like that, rather than some code letter/number? Also, what's the intended use case for it? The mean <dT> (one-point function) is relevant for WL systematics, but I don't think this combination is directly applicable to WL. Does LSS care about this?

Member Author:

The auto-correlation function of dT/T is what was called rho3 in Melchior et al. (2015), which you denoted the rho'3 statistic in your 2016 paper. The reason it is interesting here is not so much scientific as pipelines-related: for reasons not yet clear to us, we see a significant offset in PSF size on coadds. This serves as a good diagnostic of how our planned changes to PSF coaddition improve the situation, and it complements our other tools involving scatter plots and other one-point calculations.

Perhaps I'll just follow your notation and call it rho3prime or something.

Reviewer:

Ah, yes. Peter and I disagreed at first about how dT errors propagate to xi+. Originally he thought his rho3 * <|e|^2> was the right systematic error on xi+, but I convinced him that the e terms couldn't be brought out of the angle brackets. They needed to stay inside, as what I ended up calling rho3: <e dT/T, e dT/T>.

Member Author:

Right, <e dT/T, e dT/T> makes sense scientifically. For pipeline validation, though, the uncorrelated nature of e on large scales suppresses the problem that rho3prime or rho3alt (what I had called rho0) would show. That's why I want to have this statistic, at least until we are confident we have resolved the issue with PSF coaddition.
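For concreteness, a minimal sketch of the dT/T auto-correlation being discussed, computed with TreeCorr's scalar (KK) correlation (ra, dec, and dT_over_T are illustrative input arrays for the reserved stars; the binning values are only examples):

import treecorr

# Fractional PSF size residual at each star: (T_star - T_PSF_model) / T_PSF_model
cat = treecorr.Catalog(ra=ra, dec=dec, k=dT_over_T, ra_units="deg", dec_units="deg")

kk = treecorr.KKCorrelation(min_sep=0.1, max_sep=100.0, nbins=10, sep_units="arcmin")
kk.process(cat)
# kk.xi now holds <dT/T dT/T>(theta) in each separation bin.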


\rho_5(\theta) &= \left\langle e^*_{PSF}(x) \left(e_{PSF}\frac{\delta T_{PSF}}{T_{PSF}}\right)(x+\theta) \right\rangle # noqa: E501, W505

The definition of ellipticity used in [1]_ corresponds to ``epsilon``-type ellipticity, which is typically
smaller by a factor of 4 than the ``chi``-type ellipticity.

Reviewer:

epsilon and chi are not standard names for what you're referring to here. The more typical names are shear vs distortion. epsilon is sometimes used to refer to shear, and chi is sometimes distortion. But not by any stretch universally. So you should at least give the names in addition to (or instead of) the Greek letters.

doc="The type of ellipticity to calculate",
allowed={
"chi": "Distortion, defined as (Ixx - Iyy)/(Ixx + Iyy)",
"epsilon": ("Shear, defined as (Ixx - Iyy)/" "(Ixx + Iyy + 2*sqrt(Ixx*Iyy - Ixy**2))"),

Reviewer:

That's not actually the "definition" of shear. That's just one way to estimate shear. Shear is defined by the elements in the 2x2 transformation matrix.

Reviewer:

Also, as per comment above, I'd prefer the "type" names be distortion and shear, rather than semi-arbitrary Greek letters.

Member Author:

I'm going to defer this to another ticket (DM-37325) and hence to another PR, since this notation is present elsewhere in the code as well. Adding a TODO comment for now.

Collaborator:

Just adding a vote for understandable names, not random Greek letters. Remember that non-WL people like me will see this, and it would be great if it were understandable.

Member Author:

I have added the expression for each statistic in the y-axis label.
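For readers outside WL, a minimal sketch of the two ellipticity conventions quoted in the config above (a sketch from those definitions; the function names are illustrative, and the e2 components follow the standard 2*Ixy form, which is an assumption beyond the quoted text):

import numpy as np

def chi_ellipticity(ixx, iyy, ixy):
    # Distortion (chi): (Ixx - Iyy + 2i*Ixy) / (Ixx + Iyy)
    denom = ixx + iyy
    return (ixx - iyy) / denom, 2.0 * ixy / denom

def epsilon_ellipticity(ixx, iyy, ixy):
    # Shear-like (epsilon): (Ixx - Iyy + 2i*Ixy) / (Ixx + Iyy + 2*sqrt(Ixx*Iyy - Ixy**2))
    denom = ixx + iyy + 2.0 * np.sqrt(ixx * iyy - ixy**2)
    return (ixx - iyy) / denom, 2.0 * ixy / denom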


# Calculate the cross-correlation
xy.process(catA, catB)

_LOG.debug("Correlated %d pairs out of %d possible pairs", xy.npairs, len(ra) * (len(ra) - 1) / 2)

Reviewer:

I would expect this to give an error, although maybe it's just a logging warning. xy.npairs is an array, not an int. It's the number of pairs for each bin of the output correlation. It's also a bit weird to report this as being how many were correlated out of some total possible. The ones that weren't correlated were at different separations, so were they "possible"?

Member Author:

Ah, good catch. I meant to put in sum(xy.npairs) there.

Member Author:

The total number here is to give some indication of whether we have explored a sufficient range of scales. Since we plan to run this on a regular cadence as we collect data, the angular scales we set might evolve, and this is intended to guide that.

Reviewer:

I don't think the total of npairs, either as an absolute number or as compared to N*(N-1)/2, would say much about how well you're covering scales though. Maybe S/N of the first and last bin? If the S/N of either gets largish, it could mean that it would be useful to extend the range.
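A minimal sketch of that diagnostic (assuming xy is a treecorr.GGCorrelation that has already been processed; xip and varxip are its xi+ estimate and variance, and _LOG is the module logger used in the snippet above):

import numpy as np

snr = np.abs(xy.xip) / np.sqrt(xy.varxip)
_LOG.debug("S/N of xi+ in the first and last separation bins: %.2f, %.2f", snr[0], snr[-1])
# A large S/N in either edge bin suggests the chosen min_sep/max_sep range is
# too narrow and could usefully be extended.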

self.process.calculateActions.stars.treecorr.min_sep = 0.1
self.process.calculateActions.stars.treecorr.max_sep = 100.0
self.process.calculateActions.stars.treecorr.sep_units = "arcmin"
self.process.calculateActions.stars.treecorr.metric = "Arc"

Reviewer:

Don't use Arc. It's just slower and doesn't add anything useful compared to the default Euclidean.
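A sketch of the config with that suggestion applied (same attribute path as the snippet above; omitting the metric leaves TreeCorr on its default Euclidean metric):

self.process.calculateActions.stars.treecorr.min_sep = 0.1
self.process.calculateActions.stars.treecorr.max_sep = 100.0
self.process.calculateActions.stars.treecorr.sep_units = "arcmin"
# No explicit metric: the default "Euclidean" is sufficient at these scales
# and faster than "Arc".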

@arunkannawadi force-pushed the tickets/DM-36797 branch 9 times, most recently from 507775f to 5c84f97 on December 21, 2022 03:30
@arunkannawadi force-pushed the tickets/DM-36797 branch 2 times, most recently from 7735201 to df85018 on December 21, 2022 16:34
@arunkannawadi (Member Author) left a comment:

OK, I believe I have addressed all the comments, or created TODO ticket items to work on. This is certainly a minimum viable version, and I'd like to get it into the weekly that gets tagged tonight. @rmjarvis do I have your approval?


doc="The type of ellipticity to calculate",
allowed={
"chi": "Distortion, defined as (Ixx - Iyy)/(Ixx + Iyy)",
"epsilon": ("Shear, defined as (Ixx - Iyy)/" "(Ixx + Iyy + 2*sqrt(Ixx*Iyy - Ixy**2))"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the expression for each statistic in the y-axis label.

@rmjarvis:

> @rmjarvis do I have your approval?

Sure. Looks fine to me.

`addPlotInfo` does not handle `plotInfo` that is None. Fix type hints to
reflect that, and in `makePlot` methods that call `addPlotInfo` without
checking if `plotInfo` is not None.
@arunkannawadi merged commit 490ba8e into main on Dec 22, 2022
@arunkannawadi deleted the tickets/DM-36797 branch on December 22, 2022 14:46