polyfit() fails on vectors with NaN #6037

Closed
soupault opened this issue Jul 3, 2015 · 3 comments

@soupault
soupault commented Jul 3, 2015

I have two pandas Series:

0         NaN
1     58.4890
2     54.2880
3         NaN
4         NaN
5         NaN
6     53.7940
7         NaN
8         NaN
9     59.7240
10        NaN
11        NaN
12        NaN
13    25.9955
Name: df_a, dtype: float64

0       NaN
1     3.505
2     3.530
3       NaN
4       NaN
5       NaN
6     3.440
7       NaN
8       NaN
9     3.420
10      NaN
11      NaN
12      NaN
13    3.430
Name: df_b, dtype: float64

Then I sort both series in increasing order of df_a, which gives the index order [0, 2, 1, 3, 4, 5, 13, 6, 7, 8, 9, 10, 11, 12].

Next I apply a polynomial fit (I tried several degrees):

f = np.poly1d(np.polyfit(df_a_sorted, df_b_sorted, 1))
x_fit, y_fit = x, f(x)

And I get an error:

Intel MKL ERROR: Parameter 4 was incorrect on entry to DGELSD.
C:\Python34\lib\site-packages\numpy\lib\polynomial.py:588: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)
>> print(x_fit)
[nan, 54.288000000000004, 58.488999999999997, nan, nan, nan, 25.9955, 53.793999999999997, nan, nan, 59.723999999999997, nan, nan, nan]
>> print(y_fit)
[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan]

The warning is fine, but I still expect the function to fit the polynomial. You can check that all the points lie near an imaginary straight line.
I suppose the problem is the NaN handling.
Also, I'm not sure what the "Intel MKL" error means.

SW versions:
python (3.4.2)
numpy (0.15.1)
cython (0.21.2)
pandas (0.15.2)
All libraries are taken from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
Platform: Windows 7 x64.

@argriffing

This issue is similar to scipy/scipy#4060. In both cases pandas users want NaN to mean 'missing' in numpy/scipy interpolation or fitting, and in both cases a short-term solution would be for the user to use weighted interpolation or fitting with zero weights at the NaN locations. Unlike pandas, numpy and scipy do not generally interpret NaN as missing data. The longer-term solution would be to improve support for missing data across the scipy stack.
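
A minimal sketch of the zero-weight workaround mentioned above, using the values from the report. It rests on one assumption that is not stated in the thread: the NaN entries are first replaced with a finite placeholder (0.0 here), because 0 * NaN is still NaN and a zero weight alone would not remove them from the least-squares system. The names `valid`, `x_filled` and `y_filled` are only illustrative.

```python
import numpy as np

# Values of df_a (x) and df_b (y) from the report, in their original order.
x = np.array([np.nan, 58.4890, 54.2880, np.nan, np.nan, np.nan, 53.7940,
              np.nan, np.nan, 59.7240, np.nan, np.nan, np.nan, 25.9955])
y = np.array([np.nan, 3.505, 3.530, np.nan, np.nan, np.nan, 3.440,
              np.nan, np.nan, 3.420, np.nan, np.nan, np.nan, 3.430])

# True where both coordinates are present.
valid = np.isfinite(x) & np.isfinite(y)

# Weight 1 for real points, weight 0 for missing ones.
w = valid.astype(float)

# Assumption: replace NaN with a finite placeholder so that the
# zero weight really cancels the row (0 * NaN would stay NaN).
x_filled = np.where(valid, x, 0.0)
y_filled = np.where(valid, y, 0.0)

# Degree-1 fit; rows with zero weight do not influence the result.
coeffs_weighted = np.polyfit(x_filled, y_filled, 1, w=w)
print(coeffs_weighted)
```

The placeholder value itself should not matter, since every placeholder row is multiplied by a zero weight before the system is solved.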

@soupault
soupault commented Jul 6, 2015

I see. Thank you for the detailed answer!

@argriffing

Now that I look at this more closely, the NaNs here always occur together in (x, y) pairs, whereas the scipy issue linked above is about missing y values only. In this case, where both x and y are missing, I think it would be better to remove these pairs completely before the analysis rather than playing with weights.
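
Removing the incomplete pairs before fitting could look roughly like the sketch below. It uses the values from the report and plain numpy boolean masking; the variable names (`mask`, `x_clean`, `y_clean`) are illustrative and not taken from the original code.

```python
import numpy as np

# Values of df_a (x) and df_b (y) from the report, in their original order.
x = np.array([np.nan, 58.4890, 54.2880, np.nan, np.nan, np.nan, 53.7940,
              np.nan, np.nan, 59.7240, np.nan, np.nan, np.nan, 25.9955])
y = np.array([np.nan, 3.505, 3.530, np.nan, np.nan, np.nan, 3.440,
              np.nan, np.nan, 3.420, np.nan, np.nan, np.nan, 3.430])

# Keep only the pairs where neither coordinate is NaN.
mask = ~(np.isnan(x) | np.isnan(y))
x_clean, y_clean = x[mask], y[mask]

# Sort by x so the fitted curve can be plotted left to right.
order = np.argsort(x_clean)
x_clean, y_clean = x_clean[order], y_clean[order]

# Degree-1 fit on the remaining points, then evaluate it at those points.
f = np.poly1d(np.polyfit(x_clean, y_clean, 1))
x_fit, y_fit = x_clean, f(x_clean)
print(x_fit)
print(y_fit)
```

If the data is still in pandas objects, the same filtering can presumably be done there (for example with `dropna()`) before handing the arrays to `np.polyfit`.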

By the way, I had the idea that pandas was adding functions around these numpy/scipy functions to work automatically with NaN as missing data, but I've not been following pandas development closely. I'd expect something like pandas.interpolate(something_with_nan_as_missing_data, ...) to work now or eventually.
