Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uncertainties on Spectrum.data #43

Merged
merged 10 commits into from
Apr 13, 2017
Merged

Uncertainties on Spectrum.data #43

merged 10 commits into from
Apr 13, 2017

Conversation

bplimley
Copy link
Contributor

@bplimley bplimley commented Apr 4, 2017

Closes #23.

If data input is an array of ints/floats, the data uncertainties automatically generated using sqrt(data), but with a minimum uncertainty of 1. Data can also be input as an array of uncertainties.ufloats in which case the uncertainties are not touched.

New Spectrum methods: data_vals, data_uncs

Spectra can be multiplied / divided by a factor that has uncertainty.

@@ -50,7 +51,12 @@ def __init__(self, data, bin_edges_kev=None):

if len(data) == 0:
raise SpectrumError('Empty spectrum data')
self.data = np.array(data, dtype=float)
if isinstance(data[0], UFloat):
self._data = np.array(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do this with np.issubdtype(data.dtype, UFloat)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jccurtis I like the issubdtype format but unfortunately:

In [140]: np.issubdtype(float, UFloat)
Out[140]: True

In [141]: np.issubdtype(int, UFloat)
Out[141]: True

In [144]: np.issubdtype(str, UFloat)
Out[144]: True

In [146]: np.issubdtype(np.ndarray, UFloat)
Out[146]: True

In [147]: np.issubdtype(bool, UFloat)
Out[147]: True

What the heck, UFloat? Why are you so inclusive of everyone?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boooo, okay this is unfortunate. We could do:

all([isinstance(d, UFloat) for d in data])

We could also keep the simple error check for speed. Ultimately we should make a Spectra class for large quantities of spectra. This takes ~1ms on my desktop for a 8192 channel spectrum.

self._data = np.array(data)
else:
unc = np.sqrt(data)
unc[unc == 0] = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry dislike mutable variables. use maximum()?

@bplimley
Copy link
Contributor Author

bplimley commented Apr 4, 2017

I should check the docstrings and add info about any assumptions made, e.g. on Spectrum.__init__().

@@ -39,7 +42,11 @@ def __init__(self, data, bin_edges_kev=None, input_file_object=None):
"""Initialize the spectrum.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grrrr, this is for the previous line ...

Can we add another kwarg for uncs (or something else appropriate) which defaults to None but allows the user to pass in uncertainties without explicitly using the uncertainties package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good idea. Done

jccurtis
jccurtis previously approved these changes Apr 6, 2017
Copy link
Collaborator

@jccurtis jccurtis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now! There is some inefficiency with the uarray type checking but it is not critical for general use.

@bplimley
Copy link
Contributor Author

bplimley commented Apr 7, 2017

So I didn't merge this yet because I wanted to think about how uncertainties work for energycal. Ideally, the uncertainty in a fit is represented by an uncertainty in the channel value of the feature, and this is passed into the calibrator for weighting the calibration curve fit. But we should also allow the user to enter a calibration point without uncertainty (in which case points are weighted equally... if there is a mix of uncertainties and non-uncertainty values, then I don't know).

I'm thinking the channels property of the EnergyCal will return an array of UFloats, and we add ch_vals and ch_uncs similar to the properties in Spectrum.

Copy link
Contributor Author

@bplimley bplimley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To review where uncertainties are used:

  • Spectrum.data always has uncertainties on it
  • EnergyCalBase.channels is always UFloat type but may have np.nan uncertainties representing unknown uncertainty. If all uncertainties are nan they are weighted equally in the fit. If some are nan they are weighted equal to the average of the non-nan points. Not sure this is the best approach.
  • The scaling factor in spectrum multiplication/division can optionally have uncertainty. (I'm not sure the use case for this, but it is pretty simple.)

ch_unclist (optional): list/tuple/array of uncertainties of the
channel values
kevlist: list/tuple/array of the corresponding energy values [keV]
pairlist: list/tuple/array of paired values, (ch, kev)

Raises:
BadInput: for bad pairlist, chlist, and/or kevlist.
"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of from_points is getting a little messy due to input handling. I can break the input handling out into another (private) method if that would be preferred. I don't see how to actually simplify the code, unless we reduce the checks that are currently done:

  • Allow calpoints to be specified by a list of pairs, or a pair of lists
  • Allow inputs of np.array in addition to list/tuple
  • see each of the BadInput exception descriptions for other situations that are checked. There are 5 different errors and only 1 is duplicated in two places (for pairlist input and chlist/kevlist input).

The first point might not be strictly necessary. We could eliminate the pairlist input style, which would simplify the logic and eliminate a couple of the exception cases.

"""

return np.array(list(self._calpoints.values()), dtype=float)
return np.array(list(self._calpoints.values()))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modeled these properties on Spectrum.data, Spectrum.data_vals, Spectrum.data_uncs.

self._set_coeff('b', b)
self._set_coeff('c', c)
# normally channel is the independent variable.
# but uncertainty is on channel, not energy. so fit the inverse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, the weights in a fit should correspond to the y-error rather than the x-error, since least squares is based on the y-residuals. Unfortunately, in an energy calibration fit, the x-values (channel) are the ones with uncertainty.

Three ways forward:

  1. Fit the inverse of calibration curve equations, so that the x-error becomes the y-error and can be used to weight the fit. The quadratic case may be doable, I was looking at the quadratic equation but it's late. But more complex equations (like NaI light yield) will not invert nicely.
  2. Ignore the uncertainty in channel value. (At least for non-invertable equations.)
  3. Use the x-error to weight the y-residuals, even though it's wrong. If the calibration curve is mostly linear, I think this is a reasonable approximation.

I've done approach 1 for the linear cal here because it's easy enough.

return unumpy.uarray(x_array, default_unc_func(x_array))


def handle_unc(x_val, x_unc, default_unc):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some overlapping code between handle_uncs and handle_unc (the former handling lists/arrays, and the latter handling scalars). Should I combine the two functions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two should be combined to prevent the need for external type checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, currently the Spectrum and EnergyCal methods I've written require either a single value or a list/array of values, rather than accepting either. So rather than requiring type-checking in Spectrum or EnergyCal, by calling one or the other of these handle_unc[s] methods, it does the type-checking for the external method.

They could still be combined and the same functionality could be obtained by an optional kwarg in the new handle_unc.

return False


def handle_uncs(x_array, x_uncs, default_unc_func):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was moved here because both Spectrum and EnergyCalBase are doing the same thing in terms of handling uncertainty inputs. Is core/utils.py a good path/name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the path name

if nan_weights.all():
sig = np.ones_like(nan_weights)
if nan_weights.any():
sig[nan_weights] = np.nanmean(sig)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if we want to make this behavior standard across calibration curves, this nan weight handling should be in the base class, I just realized.

@bplimley
Copy link
Contributor Author

bplimley commented Apr 8, 2017

@jccurtis you should take another look at this when you have a chance. We can go over it together if you'd like. I have a number of points for discussion, in my line comments.

@jccurtis jccurtis mentioned this pull request Apr 11, 2017
@jccurtis jccurtis dismissed their stale review April 11, 2017 23:50

Too much science. Can't handle the science...

@bplimley bplimley force-pushed the uncertainties branch 2 times, most recently from 5362123 to 6950f74 Compare April 13, 2017 00:51
@bplimley
Copy link
Contributor Author

I'm taking the energycal-related stuff out of this pull request, so that we can merge the basic uncertainties stuff more easily. (The other code is preserved in feature-energycal-uncertainties for later work.)

The "outdated" in-line comments in energycal.py and utils.py are on commits that are no longer a part of this pull request, so disregard them.

@bplimley
Copy link
Contributor Author

bplimley commented Apr 13, 2017

@jccurtis the quantity of science in this PR should be manageable now. It is what we looked at together last week, when you approved, plus just a clean merge of the EnergyCal module from master branch.

Approve again and merge please?

Copy link
Collaborator

@jccurtis jccurtis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call moving the ecal uncs to #45

@jccurtis jccurtis merged commit bc9eb9b into master Apr 13, 2017
@jccurtis
Copy link
Collaborator

@bplimley ready to delete the uncertainties branch?

@bplimley bplimley deleted the uncertainties branch April 13, 2017 18:00
@bplimley
Copy link
Contributor Author

Thanks @jccurtis!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants