Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-38500: Define metrics to summarize spuriousness scores for a visit #151

Merged
merged 1 commit into from Jan 9, 2024

Conversation

rai-harshit
Copy link
Contributor

No description provided.

@rai-harshit rai-harshit force-pushed the tickets/DM-38500 branch 2 times, most recently from f09d418 to a634aab Compare October 12, 2023 18:34
Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some points for you to consider.

values = values[np.isnan(values)]
result = cast(
Scalar,
float(len(values)), # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a float if it is a scalar?

mask = self.getMask(**kwargs)
values = data[self.vectorKey.format(**kwargs)]
values = values[mask]
values = values[np.logical_not(np.isnan(values))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably indicate in documentation that this rejects nans, in case someone is relying on nan defined comparison behavior.

"lt": "less than threshold",
"le": "less than or equal to threshold",
"ge": "greater than or equal to threshold",
"gt": "greater than threshold",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems you might want to add eq in here.

return scalarA / scalarB


class CountThreshold(ScalarAction):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really is a shame that the CountAction name is already being squatted on. Perhapse it is worth seeing if you could merge the two with preserving behavior of the old. Perhaps with an op of ne and threshold of nan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on incorporating all your suggestions in the code, but then stumbled upon this issue- threshold only accepts float values, so passing "nan" as a value causes an exception. There are two ways I see if we wanna continue moving forward with this approach-

  1. Have some default float value passed that represents "nan"- doesn't make much sense, but a potential solution nonetheless.
  2. Have another parameter, called "nan" that governs if NaNs should be part of the counting. I have put together some code using this approach and would like your thoughts on it.
    Of course, if you have any other potential solution, I'm happy to include it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you make the pex_config field type optional = True and not give it a default value, then in the code accessing the field will resolve to None. i.e. if self.threshold is None. This could function as the sentinal you were talking about and you can take None to mean threshold on NaN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your suggestion for the following three scenarios-

  1. Count just the NaNs - works if threshold = None and op = "eq"
  2. Count just the non-NaNs - works if threshold = None and op = "ne"
  3. Count NaNs as well as non-NaNs - doesn't work since I couldn't think of a way to tell the code to count NaNs and non-NaNs just with those two parameters. We could leave this on the user to calculate by adding NaN and non-NaN counts, but it seems a basic metric that should already be baked in the code.

Maybe you could go through the code that I've already pushed and see if that makes sense? It already works for all the above 3 scenarios and other edge cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just a bit confused, did you push the solution that allows eq, or are you saying you want me to review what is here first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'm not sure I understand your point 3 re-reading it? In what scenario would a metric could both nans and non nans in the same configured action? Are you saying you want a combination where this tool just outputs the length of the catalog?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking into this a bit more, you don't need the None as a sentinel value, you can take optional away again. The problem you were facing is that you tried to assign nan as a string. You have a few options here.

  1. if you import math at the top of your module, you should be able to in yaml have people write math.nan and math.inf (I believe the module resolution will work in that case)
  2. at the top of your module do a from math import (nan, inf) and then just typing the strings nan and inf in yaml will work.
  3. Don't import anything in your module and leave it up to the pipeline write to add a python block that says from math import nan if they want to use nan.

As far as the question, how do I count both nan and non nan in the same run of the tool (which again I'm not exactly sure why you would have a tool configured to do that, unless again you called it something like total) Then you could add an op called all and write in the doc that it will count everything no matter what the threshold is set to. This would count nan, numbers, inf. Alternative you could inf use le inf or ge nan, and just have logic to handle what to do when the threshold is set to inf or nan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your suggestions (importing math/math.nan in my module and passing math.nan/nan in the yaml) and it doesn't work. The module resolution doesn't work in either cases. It still complains that I passed string (nan/math.nan) instead of float.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying a couple of things with the YAML file and this is what worked-
atools.numDiaSourcesNanReliability.process.calculateActions.countingAction.threshold: !!float nan
Aadding a !!float before nan allows for successful module resolution and then correct calculation of metrics. Should I go ahead with this implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed those changes for you to go through. Turns out adding !!<tag> is a standard method to explicitly specify the data type in the YAML. I also got rid of the NumDiaSourcesAllMetric function since it was using a version of the new NumDiaSourcesSelectionMetric function. Please let me know in case you have any concerns.

return result


class CountNan(ScalarAction):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the comment above, this action would probably be better if you could combine this action, CountThreshold and CountAction all into one, simply related to the op. The logic of the action would get a bit more complicated, but not too much. It should be possible with an if else on threshold checking for NaN, or using a match` statement.

@@ -76,3 +80,74 @@ def setDefaults(self):

# the units for the quantity (count, an astropy quantity)
self.produce.metric.units = {"numDipoles": "ct"}


class NumDiaSourcesHighReliabilityMetric(AnalysisTool):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, and the next two, AnalysisTools are not done in a wrong way per-say, but they kind of break the spirit of what AnalysisTools allow you to do.

Because these tools only differ in configuration (if you accept my recommendations above), then it is possible to have a single tool, perhapse called something like NumDiaSourcesSelection (or whatever better name you can come up with, naming is hard). It could look something like:

class NumDiaSourcesSelection(AnalysisTool):
    metricName = Field[str](doc="Name to use for output metric")

    def setDefaults(self):
        super().setDefaults()
        self.process.calculateActions.countingAction = <name_of_combined_action>
        self.produce.metric.units = {"countingAction": "ct"}

    def finalize(self):
        self.produce.metric.newNames = {"countingAction": self.metricName}

You can then use this in pipelines to setup whatever you want without needing duplicate tools, or any new tools if you want to introduce new metrics in the future. This would look something like:

      atools.numDiaSourcesHighReliability: NumDiaSourcesSelection
      # this could be a different name if you wanted
      atools.numDiaSourcesHighReliability.metricName: numDiaSourcesHighReliability
      atools.numDiaSourcesHighReliability.process.calculateActions.countingAction.op: gt
      atools.numDiaSourcesHighReliability.process.calculateActions.countingAction.threshold: 0.9
      atools.numDiaSourcesHighReliability.process.calculateActions.countingAction.vectorKey: reliability

      atools.numDiaSourcesLowReliability: NumDiaSourcesSelection
      atools.numDiaSourcesLowReliability.metricName: numDiaSourcesLowReliability
      atools.numDiaSourcesLowReliability.process.calculateActions.countingAction.op: lt
      atools.numDiaSourcesLowReliability.process.calculateActions.countingAction.threshold: 0.1
      atools.numDiaSourcesLowReliability.process.calculateActions.countingAction.vectorKey: reliability

      atools.numDiaSourcesNanReliability: NumDiaSourcesSelection
      atools.numDiaSourcesNanReliability.metricName: numDiaSourcesNanReliability
      atools.numDiaSourcesNanReliability.process.calculateActions.countingAction.op: eq
      atools.numDiaSourcesNanReliability.process.calculateActions.countingAction.threshold: NaN
      atools.numDiaSourcesNanReliability.process.calculateActions.countingAction.vectorKey: reliability

This way you can have any number of metrics all defined by the configuration of one tool. There are any number of ways to do this of course, you could also have specified your config in one line in a python block, you could have forwarded more properties to make each line smaller, etc. This gives you some scope of what is possible.

As I say what you have is not wrong, and has the benifit of an importable configuration independent of a pipeline, but it does add more overhead for introducing new tools.

if self.threshold == nan:
if self.op == "eq":
# Count number of NaNs
result = arr[np.isnan(arr)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably do np.isnan(arr).sum() here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this in the code as per suggestions

return cast(Scalar, len(result))
elif self.op == "ne":
# Count number of non-NaNs
result = arr[~np.isnan(arr)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and len(arr) - np.isnan(arr).sum()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as well

@rai-harshit rai-harshit merged commit 5a6f919 into main Jan 9, 2024
8 checks passed
@rai-harshit rai-harshit deleted the tickets/DM-38500 branch January 9, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants