
add checks on input data type for Model #723

Closed

Conversation

peendebak
Contributor

@peendebak peendebak commented Apr 11, 2021

Description

lmfit expects input data to be in double-precision format. However, users may be unaware of this, or pre-processing steps could accidentally cast data to single precision. This PR adds a warning when model input data is of type float32.

Notes:

Type of Changes
  • Bug fix
  • New feature
  • Refactoring / maintenance
  • Documentation / examples
Tested on
Python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]

lmfit: 1.0.2+32.gc448ced.dirty, scipy: 1.6.1, numpy: 1.20.1, asteval: 0.9.23, uncertainties: 3.1.5
Verification

Have you

  • included docstrings that follow PEP 257?
  • referenced existing Issue and/or provided relevant link to mailing list?
  • verified that existing tests pass locally?
  • verified that the documentation builds locally? Partial build
  • squashed/minimized your commits and written descriptive commit messages?
  • added or updated existing tests to cover the changes?
  • updated the documentation and/or added an entry to the release notes (doc/whatsnew.rst)?
  • added an example? Not required

@peendebak peendebak force-pushed the feat/warnings_for_float32_data branch from 764d9e1 to 969a5ad on April 11, 2021 at 19:35
@codecov

codecov bot commented Apr 11, 2021

Codecov Report

Merging #723 (969a5ad) into master (4c31115) will increase coverage by 0.07%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #723      +/-   ##
==========================================
+ Coverage   93.64%   93.71%   +0.07%     
==========================================
  Files          10       10              
  Lines        3428     3436       +8     
==========================================
+ Hits         3210     3220      +10     
+ Misses        218      216       -2     
Impacted Files Coverage Δ
lmfit/model.py 91.12% <100.00%> (+0.09%) ⬆️
lmfit/minimizer.py 90.92% <0.00%> (+0.22%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c448ced...969a5ad.

@peendebak peendebak changed the title [WIP] add checks on input data type for Model add checks on input data type for Model Apr 11, 2021
@newville
Member

@peendebak Again, thanks for trying to work on this, but it seems like I'm not getting my main point across - I've tried to make it a couple of times now (well, in #722), but I'll try again. In this PR you say:

 lmfit expects input data to be in double-precision format.

It might seem pedantic or like I'm splitting hairs here, but this is not exactly true. There is no such expectation on input data. Maybe this needs a FAQ entry.

The parameter values in lmfit are double-precision floats. Like, not "expected to be" but a firm "are".

For models, there is the data ("y", say) which must be numeric data. This is expected to be "1-D float64", but really more like expected to be coercible to "1-D float64".

The user typically also has independent data that is used (somehow) in the calculation of the model. Lmfit places no restrictions on what the independent data is - really, none. It does not need to be numeric data - it could be a dict or (odd, but allowed) text or a database connection. It could be numeric data of type uint32 or complex128.

The model function takes the independent data and the parameter values and calculates and returns some array that is "the model". The difference of that array with the "y" data (possibly scaled by a weighting factor) is used in the actual fit -- and that is what needs to be "1-D float64".

Importantly, the data ("y" data) that you used is np.float64. So, even if the model returns a float32 array, the difference between "data" and "model" would be promoted to float64. That is, the fit should have -- and indeed can -- work. The problem was that the steps taken in the Parameter values to create the finite-difference derivatives were too small to reliably change the result of

  (float64_scalar * float32_ndarray)  -> float32_ndarray

If the range or scale of the independent data were larger or there were more data points, the fit may have worked. Anyway, the point is that we really want to not dictate how the user does their calculation, but we do want to make sure that the fit will try to work in most cases.
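The point about finite-difference steps vanishing at float32 precision can be illustrated with a few lines of numpy (an illustration of the numerical behavior described above, not lmfit code):

```python
import numpy as np

# A finite-difference step of ~1.5e-8 in a parameter is invisible once the
# multiplication happens at float32 precision: the perturbed scalar rounds
# back to its unperturbed float32 value.
x32 = np.linspace(0.0, 1.0, 5, dtype=np.float32)
amplitude = 1.0
step = 1.5e-8   # roughly sqrt of float64 machine epsilon

y0 = amplitude * x32            # Python float * float32 array -> float32 array
y1 = (amplitude + step) * x32   # 1.000000015 rounds to 1.0 in float32

print(np.array_equal(y0, y1))   # True: the derivative estimate is exactly zero

# A larger relative step (~1e-5, i.e. epsfcn ~ 1e-10) does survive float32:
y2 = (amplitude + 1e-5) * x32
print(np.array_equal(y0, y2))   # False
```

This is why promoting the residual to float64 downstream cannot rescue the fit by itself: the information was already lost inside the float32 multiplication.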

I'm +1 on setting a more reasonable value for epsfcn and diff_step. I'm also OK with coercing the output of the residual function (in the Minimizer class) to be float64, as we also coerce it to be 1D. Those will both help allow for float32 data to be useable.
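A minimal sketch of what that coercion could look like (a hypothetical helper for illustration, not the actual Minimizer code):

```python
import numpy as np

def coerce_residual(out):
    """Hypothetical helper: flatten the residual and promote it to float64,
    mirroring how lmfit already flattens the objective function's output."""
    arr = np.asarray(out)
    if np.iscomplexobj(arr):
        # view complex residuals as interleaved real/imag float64 values
        return arr.astype(np.complex128).view(np.float64).ravel()
    return arr.astype(np.float64).ravel()
```

With this in place, a model function that happens to return float32 (or float16, or complex64) would still hand the optimizer a 1-D float64 array.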

I'm less enthusiastic about adding a new UserWarning subclass that we use once or twice. That seems somehow like a race to see how many corner cases users can come up with.

@reneeotten
Contributor

I agree with what Matt said in the two PRs.

I'm +1 on setting a more reasonable value for epsfcn and diff_step.

If possible I would prefer sticking to the "default" values set by SciPy, unless the user explicitly changes them (i.e., the defaults seem to work well for them so why would we change that)? And if we do coerce data and/or function results to float64 there shouldn't be a need for changing this anymore, right? Or at least, changing these values in that case does not have anything to do anymore with the float32-issue.

I'm also OK with coercing the output of the residual function (in the Minimizer class) to be float64, as we also coerce it to be 1D. Those will both help allow for float32 data to be useable.

Without looking at the code in detail I would think indeed that only doing this should work... Perhaps the reason it didn't seem to do so for @peendebak is that this should be done as well in the _residual function in the Model class, although I would agree that doing it only in the Minimizer class should be sufficient...?

I'm less enthusiastic about adding a new UserWarning subclass that we use once or twice. That seems somehow like a race to see how many corner cases users can come up with.

I agree and don't really like adding a new subclass for this.

@peendebak
Contributor Author

@newville @reneeotten Let me try to summarize what I understand so far:

  • Coercing data in either Model or Minimizer does not address the issue with the minimal example.
    The coercing already happens automatically in the case of the minimal example: in _residual there is the line diff = model - data. The model is the output of the user supplied function (via model = self.eval(params, **kwargs) and is in float32 (the same type as the independent variable x), and the data is the input data (type float64).

If the independent data is in float32, then the following definition of the sine

def sine(x: Union[float, np.ndarray], amplitude: float, frequency: float,
         phase: float, offset: float) -> Union[float, np.ndarray]:
    y = amplitude * np.sin(2 * np.pi * frequency * x + phase) + offset
    y = np.asarray(y, dtype=np.float64)  # cast so the output is float64
    return y

outputs in float64, but the fitting still gives incorrect results.
The following definition of the sine does give correct results (for x in either float32 or float64)

def sine(x: Union[float, np.ndarray], amplitude: float, frequency: float,
         phase: float, offset: float) -> Union[float, np.ndarray]:
    x = np.asarray(x, dtype=np.float64)  # promote the independent data first
    y = amplitude * np.sin(2 * np.pi * frequency * x + phase) + offset
    return y
  • lmfit should not place any restrictions on the independent data, so coercing the independent data or raising warnings about it is not acceptable.

  • Coercing the output of the user supplied function might be a good addition to lmfit, but there is no support for adding a warning.

  • We all agree that better default values for diff_step or epsfcn would preferably be set in scipy, not lmfit. The default values of scipy do not work for the minimal example, but do not seem to be chosen very poorly.
    The problem is really the mixing of data types: for leastsq the scipy methods operate on the output of __residual, which is float64. The fact that __residual internally uses float32 (from the independent variables) is hidden from scipy. The value for epsfcn determined by scipy is then based on the float64 type, which apparently does not go well with the float32 independent variables.

With all this combined I see no options to improve the situation in lmfit for the minimal example. Unless you have additional suggestions I will close the PR.
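The epsfcn mismatch described in this summary can be seen directly from the machine-epsilon values (a small illustration; MINPACK/scipy derive the relative finite-difference step roughly as the square root of max(epsfcn, machine precision)):

```python
import numpy as np

# Machine epsilon for the two dtypes, and the resulting relative step sizes.
eps64 = np.finfo(np.float64).eps   # ~2.22e-16 -> step ~1.5e-8
eps32 = np.finfo(np.float32).eps   # ~1.19e-7  -> step ~3.4e-4

print(np.sqrt(eps64))  # ~1.49e-08: a step this small does not register in float32 math
print(np.sqrt(eps32))  # ~3.45e-04: a step sized for float32 data
```

Since the residual handed to scipy is float64, scipy picks the ~1.5e-8 step, which the float32 computations inside __residual then swallow.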

@newville
Member

@reneeotten

If possible I would prefer sticking to the "default" values set by SciPy, unless the user explicitly changes them (i.e., the defaults seem to work well for them so why would we change that)? And if we do coerce data and/or function results to float64 there shouldn't be a need for changing this anymore, right? Or at least, changing these values in that case does not have anything to do anymore with the float32-issue.

We do have non-default values for the tolerances (stopping criteria). The scipy.optimize.leastsq docstrings are vague about what values would be used. And, just to correct myself, I would suggest epsfcn=1.e-14 --- so that the relative step would be about 1.e-7. Using actual machine precision for these values is asking for instability and inconsistencies.
And, sure that could be changed in scipy. But if we need to correct problems with scipy's implementations here, I'm fine with that too.

Without looking at the code in detail I would think indeed that only doing this should work... Perhaps the reason it didn't seem to do so for @peendebak is that this should be done as well in the _residual function in the Model class, although I would agree that doing it only in the Minimizer class should be sufficient...?

I think that doing that coercion is probably a wise thing to do. If the y data was also Float32, this would be necessary. I think we've seen such questions in the past, actually.

It isn't sufficient for this example because changes in the parameters at the 1.e-8 level combined with a Float32 x array yield no changes in the Float32 model array -- promoting that to Float64 downstream won't help. But taking a bigger step in the parameter values will.

@peendebak

Coercing data in either Model or Minimizer does not address the issue with the minimal example.

It will help with other cases of reduced precision.

The following definition of the sine does give correct results (for x in either float32 or float64)

x = np.asarray(x, dtype=np.float64)

Yes, what you do in your model function is up to you. Promoting to float64 is a fine idea - definitely the best idea. And if you had been using one of the built-in models -- most of which do assume a 1D array for x - then concluding that these built-in models should coerce x to Float64 would be a fine idea. And might very well be a fine idea anyway.

We all agree that better default values for the diff_step or epsfcn would preferably be done in scipy, not lmfit.

Better? Yes. Likely to happen quickly? Hm, maybe not. And I would be comfortable increasing epsfcn to have a default value in the range of 1e-10 to 1e-14. That is, doing finite-difference derivatives with relative steps of 1e-5 to 1e-7 is probably a better (safer, more reproducible) approach than using 1.5e-8.

The default values of scipy do not work for the minimal example, but do not seem to be chosen very poorly.
The problem is really the mixing of data types: for leastsq the scipy methods are operating on the output of __residual which is float64. The fact that internally in __residual float32 is used (from the independent variables) is hidden from scipy. The value for epsfcn determined by scipy is then based on float64 type which apparently does not go well with the float32 independent variables.

Really, I see the problem as not using high-precision inputs. The fitting algorithms use and assume Float64 internally. Python is liberal in its data types and numpy supports many variations. This mismatch can cause problems.

With all this combined I see no options to improve the situation in lmfit for the minimal example. Unless you have additional suggestions I will close the PR.

Some things that might be considered:

  1. coerce the output of the residual to Float64
  2. increase the default values of epsfcn for leastsq to somewhere between 1.e-10 and 1.e-14 and diff_step for least_squares to 1.e-5 to 1.e-7.
  3. coerce x to Float64 in the built-in models.
  4. add a FAQ entry.

I would not object to any of these.
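Option 3 could be sketched as a small wrapper around the built-in model functions (a hypothetical decorator for illustration, not lmfit's actual implementation): promote real floating-point arrays to float64 and leave everything else (dicts, strings, ints, complex data) untouched, so no restriction is placed on non-numeric independent data.

```python
import numpy as np

def coerce_independent(func):
    """Hypothetical sketch of option 3: promote reduced-precision float
    ndarray arguments to float64 before calling the model function."""
    def wrapper(*args, **kwargs):
        def promote(value):
            if isinstance(value, np.ndarray) and value.dtype in (np.float16,
                                                                 np.float32):
                return value.astype(np.float64)
            return value
        args = tuple(promote(a) for a in args)
        kwargs = {k: promote(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)
    return wrapper

@coerce_independent
def sine(x, amplitude, frequency, phase, offset):
    # the wrapper guarantees x reaches this body as float64
    return amplitude * np.sin(2 * np.pi * frequency * x + phase) + offset
```

A float32 x passed to this sine now produces a float64 model array, so the small finite-difference parameter steps are no longer rounded away.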

@peendebak
Contributor Author

Some more details on how the different options mentioned above work out for the minimal example. For the minimal example there are three cases I considered:

i) Data float64, independent data float64
ii) Data float64, independent data float32
iii) Data float32, independent data float32

With default values the first case and third case work out fine, the second one fails for the default settings.

Option 1. This converts case iii) to case ii), so makes iii) fail. I would be hesitant therefore to do option 1) without some of the other options
Option 2. Different default values improve the situation for case ii), but can make it worse for case iii). The reason is that for case iii) the value for epsfcn is determined from float32 and becomes 1.1920928955078125e-07; if a default value of 1e-11 is then forced, the fitting fails. So perhaps let the default values depend on the input type? I did not see a natural place to determine the value for epsfcn, though. My first idea was somewhere within Minimizer.leastsq, but that would require an evaluation of self.__residual to get access to the type of the data. For epsfcn=1e-10 all cases work fine, although there might be other examples with float32 input that will become worse.
Option 3 & 4. Seems a good idea.

@newville Thanks for the assistance. Since none of the options match the current PR I will close it.

@peendebak peendebak closed this Apr 15, 2021
@newville
Member

@peendebak OK. I read your closure of this PR as meaning that you do not plan to implement any of the things I suggested considering.

FWIW, I think that increasing the default value of epsfcn would help avoid some (perhaps "most") of these problems.
If you're not going to make that change, I will make such a PR.

Anyway, thanks - I think the conversation was helpful.

reneeotten added a commit to reneeotten/lmfit-py that referenced this pull request May 29, 2021
- input 'data' will always become 'float64' or 'complex128'
- coercion is only done for certain data types in case of
    an 'independent_var' to still allow for more complicated
    (user-defined) data types

See: lmfit#723
Closes: lmfit#727
4 participants