
ENH/REF: More options for interpolation and fillna #4915

Closed
wants to merge 1 commit into from

Conversation

@TomAugspurger (Contributor)

closes #4434
closes #1892

I've basically just pulled out the Series interpolate logic and stuffed it under generic behind a big if statement. I'm going to make that much cleaner. I also moved the interpolation- and fillna-specific tests from test_series.py to test_generic.py.

API question for you all. The interpolation procedures in Scipy take an array of x-values and an array of y-values that form the basis for the interpolation object. The interpolation object can then be evaluated wherever, but it maps X -> Y; f(x-values) == y-values. So we have 3 arrays to deal with:

  1. x-values
  2. y-values
  3. new values

Any preferences for names? The other issue is defaults. Right now I'm thinking

  1. x-values: index if numeric.
  2. y-values: the Series.values if Series. Each numeric column of DataFrame if DataFrame?
  3. new values: Fill NaNs by default. Interpolate at new values if an array is given.
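The three arrays map onto a plain numpy sketch (np.interp stands in here for the scipy interpolation objects; the names x, y, and new are just illustrative):

```python
import numpy as np

# x-values: the domain the interpolator is built over
x = np.array([0.0, 1.0, 2.0])
# y-values: the known values at each x, so f(x-values) == y-values
y = np.array([10.0, 20.0, 30.0])
# new values: the points where we evaluate the fitted f
new = np.array([0.5, 1.5])

result = np.interp(new, x, y)  # linear interpolation of new through (x, y)
# result -> [15., 25.]
```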

@@ -1164,6 +1164,42 @@ def backfill_2d(values, limit=None, mask=None):
pass
return values


Inline review comment (Contributor):

I would move these routines to core/algorithms.py (not that we have more than one); in case you need helper functions, it will be easier.

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger

separate the 1-d and 2-d cases:

if you are doing 1-d interp, e.g. a Series (which can also be done by applying interpolate on a frame),

then you are just using 1-d data, with an index (x) and values (y), which produce new values that you return as the new values. The y have NaNs (if they don't it should just pass through, right?)

2-d is a completely different case. What does scipy do here? Like a grid interp? (obviously these only work on frames)

also... not sure you even need to fill at all, isn't that the point of interp? (or is it possible that NaNs are returned from these routines?)

@TomAugspurger (Contributor, PR author)

The real difference between 1d and 2d is that the function object returned by scipy.interpolate.interp2d is now a function of two arguments: f(x, y). This could conceivably be useful on a Series with a MultiIndex (all the more reason to place all this in generic!)

also....not sure you even need to fill at all

The interpolated values shouldn't normally contain NaNs. But the reason I'm thinking about fillna is that the current behavior of Series.interpolate is to fill NaNs, so that needs to stay the default behavior of generic.interpolate. I'm a little worried that the purpose of .interpolate is getting broad, but fundamentally interpolation and filling NaNs are similar enough to live under the same method.

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger can you give an example in action? Maybe I'm confused, but I thought the idea is to:

In [1]: s = Series([1,np.nan,3,np.nan,5])

In [2]: s
Out[2]: 
0     1
1   NaN
2     3
3   NaN
4     5
dtype: float64

after interpolate

In [4]: s.interpolate()
Out[4]: 
0    1
1    2
2    3
3    4
4    5
dtype: float64

@TomAugspurger (Contributor, PR author)

That would be correct. But I think I've seen questions on SO about how to do something like:

>>> s = Series([0,1, 2, 3])
>>> s.interpolate([0.5, 1.5, 2.5])  # interpolate what the Series would be at these points.
0.5    0.5
1.5    1.5
2.5    2.5
dtype: float64

@TomAugspurger (Contributor, PR author)

Do you see a use for that? If so, does it belong under interpolate or elsewhere? I guess the connection to filling NaNs is reindexing to include the original index and the new values, then filling in the NaNs:

In [66]: s = Series([0, 1, 2, 3])

In [67]: s.reindex([0, .5, 1, 1.5, 2, 2.5, 3]).interpolate()
Out[67]: 
0.0    0.0
0.5    0.5
1.0    1.0
1.5    1.5
2.0    2.0
2.5    2.5
3.0    3.0
dtype: float64

Maybe that's a better way to think about it.
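That reindex-and-fill connection can be sketched with numpy alone (interpolate_at is a hypothetical helper, not pandas API):

```python
import numpy as np

# reindex to the union of the old index and the new points,
# then linearly fill the resulting gaps -- the same idea as
# s.reindex(union).interpolate()
def interpolate_at(index, values, new_points):
    union = np.union1d(index, new_points)     # sorted union of both sets
    filled = np.interp(union, index, values)  # linear fill at every point
    return union, filled

union, filled = interpolate_at([0, 1, 2, 3], [0, 10, 20, 30], [0.5, 1.5, 2.5])
# union  -> [0, 0.5, 1, 1.5, 2, 2.5, 3]
# filled -> [0, 5, 10, 15, 20, 25, 30]
```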

@jreback (Contributor) commented Sep 21, 2013

so maybe a method signature like this

def interpolate(self, values=None, method='linear', axis=0, inplace=False, limit=None)

makes sense

my example would be values=None,method='linear'
yours would be values=array_of_values_same_length,method=??? (where ??? could be linear or some scipy method name)

?

@jreback (Contributor) commented Sep 21, 2013

I think your 2nd example (reindex and fill) can be done exactly like that if the user wants
(you could even make this a 'method'), e.g.

values=values, method='linear' maybe should do exactly what you suggest

@jreback (Contributor) commented Sep 21, 2013

actually..... this might be easy: if values is not None, then just reindex by it, then apply the method (either using what Series does now via linear, or use a scipy method for fancier)

@TomAugspurger (Contributor, PR author)

I would add one more argument to the function signature: a way to specify what to use for the x-values. By default xvalues would be the index, like we've been assuming. But you might have a DataFrame where you want the x-values to be column A and the y-values to be column B:

def interpolate(self, xvalues=None, values=None, method='linear', axis=0, inplace=False,
    limit=None)

Gives something like

In [73]: df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [4, 5, 6, np.nan], 'C': [1, 2, 3, 5]})

In [74]: df
Out[74]: 
    A   B  C
0   1   4  1
1   2   5  2
2 NaN   6  3
3   4 NaN  5

In [75]: df.interpolate(xvalues='C', values='B')
Out[75]:

1   4
2   5
3   6
5   8

@TomAugspurger (Contributor, PR author)

I'm liking the idea of reindexing and filling. It helps clarify why my two use cases (filling NaNs in an existing frame and interpolating at new values) are really the same.

@jreback (Contributor) commented Sep 21, 2013

why don't we call it index though? clearer IMHO.

I think values and index are backwards:

def interpolate(self, index=None, values=None, method='linear', axis=0, inplace=False,
    limit=None)

so cases are:

  1. index is None, look at method and use it to fill the nans (e.g. linear)
  2. index is an array (same length as index or an error), reindex, then apply above
  3. index is a column name (only for frames), same as above (values is None here)
  4. index is an array/column name, and values is array/column ref, use that as the xvalues

so most of this is really just argument interpretation....
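The four cases above are mostly argument interpretation, which could be sketched like this (resolve_xy and the columns dict are hypothetical stand-ins, not pandas internals):

```python
import numpy as np

# a sketch of the argument interpretation: decide which arrays serve
# as x and y before handing them to an interpolator
def resolve_xy(obj_index, obj_values, index=None, values=None, columns=None):
    if index is None:                    # case 1: use the object's own index
        x = np.asarray(obj_index, dtype=float)
    elif isinstance(index, str):         # cases 3/4: a column name (frames only)
        x = np.asarray(columns[index], dtype=float)
    else:                                # case 2: an explicit array
        x = np.asarray(index, dtype=float)
    if isinstance(values, str):          # case 4: values as a column ref
        y = np.asarray(columns[values], dtype=float)
    else:
        y = np.asarray(obj_values if values is None else values, dtype=float)
    if len(x) != len(y):                 # mismatched lengths are an error
        raise ValueError("index and values must be the same length")
    return x, y
```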

@TomAugspurger (Contributor, PR author)

Yes on index for x-values, values for y-values. That will be nice for 2d interpolation too: df.interpolate(index=['A', 'B'], values='C', method='linear2d'), or just method='linear', since giving index a 2d array implies a 2d interpolation.

  1. Correct.
  2. Maybe. Not sure about a different length throwing an error. What about
In [86]: s = Series([0, 1, 2, 3])
In [87]: s.interpolate(values=[1.5, 2.5])
1.5   1.5
2.5   2.5
dtype: float64

I'll need to think about how this one is implemented.

  3. Yes
  4. Yes

@jreback (Contributor) commented Sep 21, 2013

your example for 2) doesn't fit with reindexing (well, it's not really reindexing, more like setting an index)

I think you need to require the index to be the same length; how else would you map it? (and if you really did want to map it differently, then the user should fix the series first)

@TomAugspurger (Contributor, PR author)

Actually hold on, there's potentially another case:

  5. values is None or a column of df, index is None or a column of df, and new_values is an array. Example:
In [92]: df2 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [1, 4, 9, 16]})

In [93]: df2
Out[93]: 
   A   B
0  2   1
1  4   4
2  6   9
3  8  16

In [94]: df2.interpolate(index='A', values='B', new_values=[2.5, 4.5, 6.5])
Out [94]: 

2.5   1.75
4.5   5.25
6.5   10.75

This is getting a bit hairy.  Maybe we should limit the scope for now.  Add features as needed.

@TomAugspurger (Contributor, PR author)

This is related to your last comment @jreback, so I guess your answer there applies to my last post. My disagreement with your 2) came from me mixing up the values and new_values arguments (we need a better name for new_values).

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger not sure I buy that last case..... not even sure how that maps..... keep it simple to start

you could always support values as say a dict if you need something like that (but later)

@jreback (Contributor) commented Sep 21, 2013

is new_values a parameter to the interpolation? I mean, can you show how you would be calling the scipy function in case 2)?

@TomAugspurger (Contributor, PR author)

new_values would be an argument if you think there's a use. In the abstract, interpolate would take 3 arguments: interpolate(index, values, new_values). It may help to think of partially applying the first two arguments, which would return a function f of one argument. Again, still in the abstract:

interpolate([1, 2, 3], [10, 20, 30]) returns a function of one argument f(new_values). We can evaluate f at the array new_values to return an array.

f([1.5, 2.5]) = [15, 25]

f([1, 1.5, 2, 2.5, 3]) = [10, 15, 20, 25, 30]

Stepping down from the abstraction, the arguments would be:

index: defaults to the Series' or DataFrame's index. Can be a column of a DataFrame.
values: defaults to the values of a Series, or every numeric column of a DataFrame. Must be the same length as index for the mapping to make sense.
new_values: defaults to None (filling NaNs). Optionally an array of values at which to interpolate; in that case we'd return a Series whose index is new_values and whose values are f(new_values).

plus the method='linear' argument, and the usual inplace=False, axis=0, etc.

Maybe I'll start with the NaN-filling behavior first and not have a new_values argument at all (essentially forcing new_values to be None).
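The partial-application view can be sketched directly (fit_interpolator is a hypothetical name; np.interp stands in for the scipy methods):

```python
import numpy as np

# fitting returns a function f of one argument: a closure over the
# (index, values) data, which we then evaluate at new_values
def fit_interpolator(index, values):
    def f(new_values):
        return np.interp(new_values, index, values)
    return f

f = fit_interpolator([1, 2, 3], [10, 20, 30])
f([1.5, 2.5])            # -> [15., 25.]
f([1, 1.5, 2, 2.5, 3])   # -> [10., 15., 20., 25., 30.]
```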

@TomAugspurger (Contributor, PR author)

Tentative signature and docstring

    def interpolate(self, index=None, values=None, new_values=None,
                    method='linear', inplace=False, limit=None, axis=0):
        """Interpolate values according to different methods.

        Parameters
        ----------
        index : arraylike. The domain of the interpolation.  Uses the
            Series' or DataFrame's index by default.
        values : arraylike. The range of the interpolation. Uses the values
            in a Series or DataFrame by default.  Can also be a column name.
            index and values *must* be of the same length.
        new_values : arraylike or None.
            If new_values is None, will fill NaNs.
            If new_values is an array, will return a Series containing
            the interpolated values and whose index is new_values.
        method : str or int. One of {'linear', 'time', 'values', 'nearest',
            'zero', 'slinear', 'quadratic', 'cubic'}. Or an integer
            specifying the order of the spline interpolator to use. Linear
            by default. Some of the methods require scipy. TODO: specify which ones.
        inplace : bool, default False
        limit : int, default None. Maximum number of NaNs to fill.
        axis : int, default 0

        Returns
        -------
        if new_values is None:
            Series or Frame of same shape with NaNs filled
        else:
            Series with index new_values

        See Also
        --------
        reindex, replace, fillna

        Examples
        --------

        # Filling in NaNs:
        >>> s = pd.Series([0, 1, np.nan, 3])
        # index=s.index, values=s.values; new_values is None so filling NaNs
        >>> s.interpolate()
        0    0
        1    1
        2    2
        3    3
        dtype: float64

        # Linear interpolation on Series at new values
        >>> s = pd.Series([0, 1, 2, 3])
        >>> s.interpolate(new_values=[0.5, 1.5, 2.5])
        0.5    0.5
        1.5    1.5
        2.5    2.5
        dtype: float64

        # Using two columns from a DataFrame
        >>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'Y': [1, 5, 9, np.nan]})
        >>> df.interpolate(index='A', values='Y')  # fill the NaN
           A   Y
        0  1   1
        1  2   5
        2  3   9
        3  4  13
        """

@jreback (Contributor) commented Sep 21, 2013

   values : arraylike. The range of the interpolation. Uses the values
       in a Series or DataFrame by default.  Can also be a column name.
       index and values *must* be of the same length.

what does "uses the values in a DataFrame" mean?

I wouldn't allow method to be anything other than a string.
if you need additional args, kwargs are passed through (or they should be), so use those,

e.g. spline will essentially default spline_order=1 (or order is ok too), or you can have methods spline1, spline2, etc.....

still fuzzy on the new_values...maybe your implementation/docs will clear it up!

good docstring writing!

@jreback (Contributor) commented Sep 21, 2013

how does your last example decide 13 is the value? (I mean, looking at it I get it, but how programmatically is that the case?)

@TomAugspurger (Contributor, PR author)

what does uses the 'values' in a DataFrame mean?

Identical to applying Series.interpolate() to every (numeric) column of the DataFrame. Or a list of column names to apply it to.
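Applying the Series behaviour to each numeric column can be sketched with plain numpy (fill_nans_linear is a hypothetical helper, not the pandas implementation):

```python
import numpy as np

# hypothetical helper: the same linear NaN-fill, applied per column
def fill_nans_linear(col):
    col = col.astype(float).copy()
    mask = np.isnan(col)
    pos = np.arange(len(col))
    # fill each NaN position from the surrounding known points
    col[mask] = np.interp(pos[mask], pos[~mask], col[~mask])
    return col

data = {'A': np.array([1.0, 2.0, np.nan, 4.0]),
        'B': np.array([4.0, 5.0, 6.0, np.nan])}
filled = {name: fill_nans_linear(col) for name, col in data.items()}
# A -> [1, 2, 3, 4]; B -> [4, 5, 6, 6] (trailing NaN clamps)
```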

how does your last example decide 13 is the value? (I mean looking at it I get it, but how programatically is that the case)?

I made the last example up, and after looking at current behavior I think it's wrong. Currently

In [3]: s = Series([0, 1, 2, np.nan])

In [4]: s.interpolate()
Out[4]: 
0    0
1    1
2    2
3    2
dtype: float64

Which is the easy way to do things! The scipy methods would do the same thing by default.
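That "easy way" falls out of clamping: np.interp (like the current Series.interpolate) repeats the boundary value rather than extrapolating. A minimal sketch:

```python
import numpy as np

# a NaN past the last known point gets the last known value, because
# np.interp clamps outside the known domain instead of extrapolating
y = np.array([0.0, 1.0, 2.0, np.nan])
pos = np.arange(len(y), dtype=float)
mask = np.isnan(y)
y[mask] = np.interp(pos[mask], pos[~mask], y[~mask])
# y is now [0., 1., 2., 2.] -- the last value is repeated
```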

@jreback (Contributor) commented Sep 21, 2013

so ffill is essentially the default then?

fyi... if method is None I think you should raise, don't try to 'guess'.

@TomAugspurger (Contributor, PR author)

I think method='linear' will have to be the default, since this is replacing Series.interpolate(), which isn't quite ffill. Current behavior:

In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])

In [9]: s.interpolate()
Out[9]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: float64

@jreback (Contributor) commented Sep 21, 2013

ok..sounds ok then

get the basics working (e.g. linear), then adding scipy functions should be straightforward

@TomAugspurger (Contributor, PR author)

One nasty thing about the way Series.interpolate() is implemented now: something like

In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])

In [9]: s.interpolate()
Out[9]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: float64

will return the same as

In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])

In [12]: s
Out[12]: 
1      0
2      1
4      2
7    NaN
11   NaN
16     5
dtype: float64

In [13]: s.interpolate()
Out[13]: 
1     0
2     1
4     2
7     3
11    4
16    5
dtype: float64

i.e. it's treating each value as "equally spaced".

If we are treating interpolation the way I envision we'd expect

In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])
In [12]: s.interpolate()
1     0
2     1
4     2
7     2.75
11    3.75
16    5
dtype: float64

This may mean we have to tweak the index default. It would be index=arange(len(self)) for backwards compat, and users could choose to use index=s.index.
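The two interpretations can be put side by side with np.interp (a sketch, not the pandas code):

```python
import numpy as np

y = np.array([0.0, 1.0, 2.0, np.nan, np.nan, 5.0])
idx = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
mask = np.isnan(y)

# current behaviour: treat the values as equally spaced (x = 0..n-1)
equal = y.copy()
pos = np.arange(len(y), dtype=float)
equal[mask] = np.interp(pos[mask], pos[~mask], equal[~mask])
# equal   -> [0, 1, 2, 3, 4, 5]

# proposed behaviour: use the actual numeric index as x
byindex = y.copy()
byindex[mask] = np.interp(idx[mask], idx[~mask], byindex[~mask])
# byindex -> [0, 1, 2, 2.75, 3.75, 5]
```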

@jreback (Contributor) commented Sep 21, 2013

why don't you call the existing behavior method='linear' and what most people would do method='equal',
and at some point we can just change the default (to equal)?

@TomAugspurger (Contributor, PR author)

That sounds like a good idea. I'll think about the names. equal sounds like what linear is doing right now.

@TomAugspurger (Contributor, PR author)

Ah, never mind! Wes already solved this for us. One of the possible values for the original Series.interpolate was method='values', which uses the actual numerical index.

"PCHIP interpolation.")

interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic']
if method in interp1d_methods or isinstance(method, int):
Inline review comment (Contributor):

fix this to validate order if method='spline'

@cpcloud (Member) commented Oct 9, 2013

btw @TomAugspurger i'm just being a hardass ... i think this is great stuff ... very useful!

@jreback (Contributor) commented Oct 9, 2013

echo.. @cpcloud these are mostly just nitpicks in any event!

@TomAugspurger (Contributor, PR author)

I'll never turn down free advice on programming. There's plenty to learn.

@TomAugspurger (Contributor, PR author)

@jreback Your prompt to allow kwargs reminded me that I forgot to wrap scipy.interpolate.UnivariateSpline. I guess spline interpolation is a bit different from polynomial interpolation. Anyway, both of those take an order-type argument representing the degree. I've added that now under core.common._interpolate_scipy_wrapper.

Sphinx doesn't seem to like that I added the kwargs to interpolate. It's claiming that order is not a valid keyword argument. I'll clean and rebuild from scratch.

@TomAugspurger (Contributor, PR author)

Let me know when I should rebase and squash again.

new_x = xvalues[invalid]

if method == 'time':
if not xvalues.is_all_dates:
Inline review comment (Contributor):

I would do it like this:

if not getattr(xvalues, 'is_all_dates', None)

because if for some reason xvalues is an ndarray, the plain attribute access would fail
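A minimal illustration of the suggested defensive check (is_all_dates exists on a pandas Index but not on a plain ndarray):

```python
import numpy as np

# a plain ndarray has no is_all_dates attribute; getattr with a
# default returns None instead of raising AttributeError
xvalues = np.array([1.0, 2.0, 3.0])
is_dates = getattr(xvalues, 'is_all_dates', None)
if not is_dates:
    pass  # fall through to the non-datetime handling
```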

@jreback (Contributor) commented Oct 9, 2013

@TomAugspurger you can squash and rebase when you are ready

you said above that sphinx is complaining?

@TomAugspurger (Contributor, PR author)

Yeah.
Here's a screenshot: [screenshot of the Sphinx warning, 2013-10-09 1:38 PM]

Obviously the examples work from the prompt. I'm on this PR's branch when I build the docs.

@jreback (Contributor) commented Oct 9, 2013

it's not picking up the current version when you build the docs.... (it's using your 0.12 version)

in doc/source/conf.py

sys.path.insert(0,'/home/jreback/pandas')

at the top of the file (which I then also have to take out as its not part of the default)

I think @cpcloud has a better way though

@cpcloud (Member) commented Oct 9, 2013

@TomAugspurger how are you building them? with make doc?

@TomAugspurger (Contributor, PR author)

python make.py html

@cpcloud (Member) commented Oct 9, 2013

can u try make doc? that will wipe everything you've created so far ... sometimes you can get these kinds of errors from lingering builds that sphinx doesn't think it needs to rebuild

ENH: the interpolate method argument can take more values
for various types of interpolation

REF: Moves Series.interpolate to core/generic. DataFrame gets
interpolate

CLN: clean up interpolate to use blocks

ENH: Add additional 1-d scipy interpolators.

DOC: examples for df interpolate and a plot

DOC: release notes

DOC: Scipy links and more explanation

API: Don't use fill_value

BUG: Raise on panels.

API: Raise on non-monotonic indices if it matters

BUG: Raise on only mixed types.

ENH/DOC: Add `spline` interpolation.

DOC: naming consistency
@TomAugspurger (Contributor, PR author)

make doc didn't work. Jeff's suggestion did though.

@jreback (Contributor) commented Oct 9, 2013

@TomAugspurger gr8...just make sure to take it out (or I can do it for you when we merge).....

I just put it in to build the docs (then take it out)...

@@ -174,6 +174,8 @@ Improvements to existing features
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
- ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)
- ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include
interpolation methods from scipy. (:issue:`4915`)
Inline review comment (Contributor):

change this to issues 4434, 1892 as that's what this is actually closing

@jreback (Contributor) commented Oct 9, 2013

merged via aff7346

thanks @TomAugspurger awesome job!

@TomAugspurger (Contributor, PR author)

Thanks for all the guidance / patience.

@jreback (Contributor) commented Oct 9, 2013

docs are up and look good!

http://pandas.pydata.org/pandas-docs/dev/missing_data.html#interpolation

  • scipy is prob too old on the build machine (hopefully will be updated soon) @changhiskhan @wesm
    need scipy 0.12
  • the graph that compares linear, cubic, quadratic: it seems there is no differentiation?

@TomAugspurger (Contributor, PR author)

Mistake on my part. I renamed all the s series to ser at the last minute to be more consistent with the rest of the docs. Missed that one.

Should I make a new PR to fix that quick?

@jreback (Contributor) commented Oct 9, 2013

sure


Successfully merging this pull request may close these issues.

ENH: Richer options for interpolate and resample limit keyword for interpolate
6 participants