polyfit() fails on vectors with NaN #6037

soupault · 2015-07-03T08:29:47Z

I have two pandas Series:

0         NaN
1     58.4890
2     54.2880
3         NaN
4         NaN
5         NaN
6     53.7940
7         NaN
8         NaN
9     59.7240
10        NaN
11        NaN
12        NaN
13    25.9955
Name: df_a, dtype: float64

0       NaN
1     3.505
2     3.530
3       NaN
4       NaN
5       NaN
6     3.440
7       NaN
8       NaN
9     3.420
10      NaN
11      NaN
12      NaN
13    3.430
Name: df_b, dtype: float64

Then I sort both series in increasing order of df_a: [0, 2, 1, 3, 4, 5, 13, 6, 7, 8, 9, 10, 11, 12].

Then I apply a polynomial fit (I tried several degrees):

f = np.poly1d(np.polyfit(df_a_sorted, df_b_sorted, 1))
x_fit, y_fit = x, f(x)

And I get an error:

Intel MKL ERROR: Parameter 4 was incorrect on entry to DGELSD.
C:\Python34\lib\site-packages\numpy\lib\polynomial.py:588: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)

>> print(x_fit)
[nan, 54.288000000000004, 58.488999999999997, nan, nan, nan, 25.9955, 53.793999999999997, nan, nan, 59.723999999999997, nan, nan, nan]
>> print(y_fit)
[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan]

The warning is OK, but I still expect the function to fit the polynomial. You can check that all the points lie close to a straight line.
I suppose the problem is the NaN handling.
Also, I'm not sure what the "Intel MKL" error means.

SW versions:
python (3.4.2)
numpy (0.15.1)
cython (0.21.2)
pandas (0.15.2)
All libraries are taken from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
Platform: Windows 7 x64.

argriffing · 2015-07-06T13:50:59Z

This issue is similar to scipy/scipy#4060 -- in both cases pandas users want NaN to mean 'missing' in numpy/scipy interpolation or fitting, and in both cases a short-term solution is for the user to do a weighted interpolation or fit with zero weights at the NaN locations. Unlike pandas, numpy and scipy do not generally interpret NaN as missing data. The longer-term solution would be to improve support for missing data across the scipy stack.
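A minimal sketch of the zero-weight workaround described above, assuming `np.polyfit` with its `w` argument and made-up data (not the values from the report). Because NaN multiplied by zero is still NaN, the sketch also replaces the missing y values with a finite placeholder before fitting; that extra step is an assumption of this sketch and is not spelled out in the comment.

```python
import numpy as np

# Made-up example data lying on y = 2*x + 1, with two y values missing.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, np.nan, 7.0, np.nan, 11.0])

# True where y holds a usable value.
valid = np.isfinite(y)

# Replace NaN with a finite placeholder; the value does not matter
# because the corresponding weight is zero.
y_filled = np.where(valid, y, 0.0)

# Weight 1 for real observations, 0 for missing ones.
w = valid.astype(float)

# Weighted least-squares fit of degree 1; zero-weight rows do not
# contribute to the solution.
coeffs = np.polyfit(x, y_filled, 1, w=w)
f = np.poly1d(coeffs)
print(coeffs)
```

This only helps when x itself is finite everywhere; if x also contains NaN, the corresponding x entries would need a finite placeholder as well.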

soupault · 2015-07-06T13:57:48Z

I see. Thank you for the detailed answer!

argriffing · 2015-07-06T14:09:03Z

Now that I look at this more closely, the NaNs always occur together in (x, y) pairs, whereas the scipy issue linked above involves only missing y values. In this case, where both x and y are missing, I think it would be better to remove these pairs completely before the analysis rather than playing with weights.
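Removing the incomplete pairs before fitting could look like the sketch below. The arrays `a` and `b` are copied from the two series in the original report, in their original (unsorted) order; the variable names and the boolean-mask approach are illustrative choices, not code from the thread.

```python
import numpy as np

nan = np.nan
# Values of df_a and df_b from the report; NaN marks a missing pair.
a = np.array([nan, 58.4890, 54.2880, nan, nan, nan, 53.7940,
              nan, nan, 59.7240, nan, nan, nan, 25.9955])
b = np.array([nan, 3.505, 3.530, nan, nan, nan, 3.440,
              nan, nan, 3.420, nan, nan, nan, 3.430])

# Keep only the positions where both x and y are finite.
mask = np.isfinite(a) & np.isfinite(b)
x, y = a[mask], b[mask]

# Fit a degree-1 polynomial to the remaining complete pairs.
coeffs = np.polyfit(x, y, 1)
f = np.poly1d(coeffs)
y_fit = f(x)
print(coeffs)
```

Since least-squares fitting does not depend on the order of the points, the sorting step from the report is not needed for the fit itself.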

By the way, I had the idea that pandas was adding functions around these numpy/scipy functions to work automatically with NaN as missing data, but I've not been following pandas development closely. I'd expect something like pandas.interpolate(something_with_nan_as_missing_data, ...) to work now or eventually.

soupault closed this as completed Sep 29, 2015
