
DM-40633: Ignore many numpy warnings #155

Merged: 6 commits merged into main from tickets/DM-40633 on Dec 11, 2023

Conversation

@taranu (Contributor) commented Oct 12, 2023

These warnings spam the logs with mostly useless messages. You can tell which action they occur in, and the cause is usually obvious (divide by zero, NaN, etc.), but these conditions are typically expected and not necessarily a problem for the task.
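For context, a minimal sketch of the suppression pattern (the exact filters and call sites in this PR may differ, and safe_log10 is an illustrative name, not from the codebase):

import warnings

import numpy as np


def safe_log10(values: np.ndarray) -> np.ndarray:
    """Compute log10, silencing the expected divide-by-zero/invalid warnings."""
    with warnings.catch_warnings():
        # Ignore only the specific, expected numpy RuntimeWarnings;
        # anything else still propagates.
        warnings.filterwarnings("ignore", "divide by zero encountered", RuntimeWarning)
        warnings.filterwarnings("ignore", "invalid value encountered", RuntimeWarning)
        result = np.log10(values)
    return result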

# of a task. When DM-39114 is implemented, this step should not
# be required and may be removed.
weakref.finalize(results, _plotCloser, *weakrefArgs)
return results
Contributor:

Return should be out of the with context.
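That is, the pattern being asked for looks like this (illustrative names, not the PR's code):

import warnings

import numpy as np


def mean_ignoring_warnings(values: np.ndarray) -> float:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)
        result = float(np.nanmean(values))  # work happens inside the guard
    return result  # return after the warning filters have been restored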

@@ -155,11 +155,11 @@ def makePlot(
* Statistics that are shown on the plot or used by the plotting code:
* ``approxMagDepth``
The approximate magnitude corresponding to the SN cut used.
* ``f"{self.plotName}_sigmaMAD"``
* ``f"{self.plotName}_nanmedian"``
The sigma mad of the distances to the line fit.
Collaborator:

Why has this changed from sigmaMad to median?

Contributor (Author):

Whoops, that was a find/replace fail I missed correcting.

The sigma mad of the distances to the line fit.
* ``f"{self.identity or ''}_median"``
The median of the distances to the line fit.
* ``f"{self.identity or ''}_hardwired_sigmaMAD"``
* ``f"{self.identity or ''}_hardwired_nanmedian"``
Collaborator:

Same comment here

@@ -36,7 +36,7 @@
from scipy.stats import binned_statistic_2d, binned_statistic_dd

from ...interfaces import KeyedData, KeyedDataSchema, PlotAction, Scalar, Vector
- from ...statistics import nansigmaMad
+ from ...statistics import nanmedian, nansigmaMad
Collaborator:

I think nanMedian and nanSigmaMad would be easier to read

Contributor (Author):

I agree, but I was following the numpy naming convention. Having said that, it might make more sense to not follow numpy so as to distinguish them instead?

Collaborator:

I think that makes sense. Less confusion about where they come from.

-------
count : `Scalar`
The number of unique rows in a given column.
"""
@sr525 (Collaborator), Nov 27, 2023:

Why did we get rid of the more extensive docs? Same comment for the other places as well.

@taranu (Contributor, Author), Nov 27, 2023:

@timj and/or @jonathansick can correct me if I'm wrong, but I thought the Parameters section in a class docstring is meant to document __init__, even for callable types. I don't think we should add docs for __init__ to Config classes.

If we are to keep these, they should be docs for __call__ on every Action, but IMO this is unnecessary in most cases as here the doc for count is redundant with the class docstring.


def divide(dividend: Scalar | Vector, divisor: Scalar | Vector) -> Scalar | Vector:
"""Return dividend/divisor."""
with warnings.catch_warnings():
Collaborator:

Can we not catch the issue rather than the warnings?

if divisor == 0:
    return NaN

Or something? I guess it does the same thing, but it feels more honest. It also doesn't solve my problem that what I really want is for it to tell me why it has failed. Maybe we just need a metric for each, whatever that is: number of 0s, number of NaNs, etc.

Contributor (Author):

We can, but it would be more complicated than that since divisor can be a vector.

Yes, earlier I was trying to say we should probably have metrics for each algorithm's individual flag columns where they are meaningful (and in the hopefully rare case of bad values without a flag set), but see further comments below.
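For the record, a hedged sketch of what the explicit check could look like once vector divisors are handled (illustrative only; the PR suppresses the warnings instead):

import numpy as np


def divide_checked(dividend, divisor):
    """Return dividend/divisor with division by zero mapped to NaN explicitly.

    Handles both scalar and vector divisors, which is why a bare
    ``if divisor == 0`` is not enough.
    """
    dividend = np.asarray(dividend, dtype=float)
    divisor = np.asarray(divisor, dtype=float)
    # Start from NaN everywhere, then divide only where the divisor is nonzero.
    result = np.full(np.broadcast(dividend, divisor).shape, np.nan)
    np.divide(dividend, divisor, out=result, where=divisor != 0)
    return result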


def nansigmaMad(vector: Vector) -> Scalar:
"""Return the sigma_MAD of a vector."""
return cast(Scalar, sps.median_abs_deviation(vector, scale="normal", nan_policy="omit"))
Collaborator:

Is that the sigma MAD or just the MAD?

Contributor (Author):

scale="normal" makes it sigma_MAD.

@sr525 (Collaborator) commented Nov 27, 2023:

Would we be better off having a data clean up step before it gets fed to any actions? Something that makes a mask that covers all the NaNs etc and logs a message about how many there were, then we do the actions on clean data?

@taranu (Contributor, Author) commented Nov 27, 2023:

> Would we be better off having a data clean up step before it gets fed to any actions? Something that makes a mask that covers all the NaNs etc and logs a message about how many there were, then we do the actions on clean data?

As I said above, we should have metrics for algorithmic flag columns. For the most part, numpy errors are cropping up where we are not adding the appropriate flag column(s) to the standard Visit/CoaddFlagSelector.

Whether we should add selectors for flag columns is an interesting question. For example, for a size vs magnitude plot both axes have their own relevant flags. I think it would be useful to have a labelled summary with the number of sources with an x flag set, a y flag set, neither or both. This can get more complicated with extra columns (e.g. if there is a S/N selector using a column that isn't on either axis).
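A sketch of what such a labelled summary might look like (hypothetical helper and column names; nothing here is in the PR):

import numpy as np


def summarize_axis_flags(x_flag: np.ndarray, y_flag: np.ndarray) -> dict[str, int]:
    """Count sources by which axis flags are set, e.g. for a size vs.
    magnitude plot where each axis has its own failure flag."""
    return {
        "x_flag_only": int(np.sum(x_flag & ~y_flag)),
        "y_flag_only": int(np.sum(~x_flag & y_flag)),
        "both_flags": int(np.sum(x_flag & y_flag)),
        "neither_flag": int(np.sum(~x_flag & ~y_flag)),
    }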

@taranu force-pushed the tickets/DM-40633 branch 2 times, most recently from 8e9945a to 3c1b532 on November 30, 2023 21:25
@taranu force-pushed the tickets/DM-40633 branch 2 times, most recently from d62184d to f8c562a on December 7, 2023 20:41
@taranu merged commit edc5d77 into main on Dec 11, 2023
8 checks passed
@taranu deleted the tickets/DM-40633 branch on December 11, 2023 23:24