
ENH/REF: More options for interpolation and fillna #4915

Closed
wants to merge 1 commit into from

Conversation

@TomAugspurger (Contributor)

closes #4434
closes #1892

I've basically just pulled out the Series interpolate logic and stuffed it under generic behind a big if statement. I'm going to make that much cleaner. I also moved the interpolation- and fillna-specific tests from test_series.py to test_generic.py.

API question for you all. The interpolation procedures in Scipy take an array of x-values and an array of y-values that form the basis for the interpolation object. The interpolation object can then be evaluated wherever, but it maps X -> Y; f(x-values) == y-values. So we have 3 arrays to deal with:

  1. x-values
  2. y-values
  3. new values

Any preferences for names? The other issue is defaults. Right now I'm thinking

  1. x-values: index if numeric.
  2. y-values: the Series.values if Series. Each numeric column of DataFrame if DataFrame?
  3. new values: Fill NaNs by default. Interpolate at new values if an array is given.
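The three arrays map onto a plain numpy sketch (np.interp stands in here for the scipy interpolation objects; the names x, y, and new are just illustrative):

```python
import numpy as np

# x-values: the domain the interpolator is built over
x = np.array([0.0, 1.0, 2.0])
# y-values: the known values at each x, so f(x-values) == y-values
y = np.array([10.0, 20.0, 30.0])
# new values: the points where we evaluate the fitted f
new = np.array([0.5, 1.5])

result = np.interp(new, x, y)  # linear interpolation of new through (x, y)
# result -> [15., 25.]
```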

@@ -1164,6 +1164,42 @@ def backfill_2d(values, limit=None, mask=None):
pass
return values


Inline review comment (Contributor):

I would move these routines to core/algorithms.py (not that we have more than one); in case you need helper functions, it will be easier.

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger

separate the 1-d and 2-d cases:

if you are doing 1-d interp, e.g. a Series (which can also be done by applying interpolate on a frame),

then you are just using 1-d data, with an index (x) and values (y), which produce new values that you return as the new values. The y have NaNs (if they don't it should just pass through, right?)

2-d is a completely different case. What does scipy do here? Like a grid interp? (obviously these only work on frames)

also... not sure you even need to fill at all, isn't that the point of interp? (or is it possible that NaNs are returned from these routines?)

@TomAugspurger (Contributor, PR author)

The real difference between 1d and 2d is that the function object returned by scipy.interpolate.interp2d is now a function of two arguments: f(x, y). This could conceivably be useful on a Series with a MultiIndex (all the more reason to place all this in generic!)

also....not sure you even need to fill at all

The interpolated values shouldn't normally contain NaNs. But the reason I'm thinking about fillna is that the current behavior of Series.interpolate is to fill NaNs, so that needs to stay the default behavior of generic.interpolate. I'm a little worried that the purpose of .interpolate is getting broad, but fundamentally interpolation and filling NaNs are similar enough to live under the same method.

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger can you give an example in action? Maybe I'm confused, but I thought the idea is to:

In [1]: s = Series([1,np.nan,3,np.nan,5])

In [2]: s
Out[2]: 
0     1
1   NaN
2     3
3   NaN
4     5
dtype: float64

after interpolate

In [4]: s.interpolate()
Out[4]: 
0    1
1    2
2    3
3    4
4    5
dtype: float64

@TomAugspurger (Contributor, PR author)

That would be correct. But I think I've seen questions on SO about how to do something like:

>>> s = Series([0,1, 2, 3])
>>> s.interpolate([0.5, 1.5, 2.5])  # interpolate what the Series would be at these points.
0.5    0.5
1.5    1.5
2.5    2.5
dtype: float64

@TomAugspurger (Contributor, PR author)

Do you see a use for that? If so, does it belong under interpolate or elsewhere? I guess the connection to filling NaNs is reindexing to include the original index and the new values, then filling in the NaNs:

In [66]: s = Series([0, 1, 2, 3])

In [67]: s.reindex([0, .5, 1, 1.5, 2, 2.5, 3]).interpolate()
Out[67]: 
0.0    0.0
0.5    0.5
1.0    1.0
1.5    1.5
2.0    2.0
2.5    2.5
3.0    3.0
dtype: float64

Maybe that's a better way to think about it.
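That reindex-and-fill connection can be sketched with numpy alone (interpolate_at is a hypothetical helper, not pandas API):

```python
import numpy as np

# reindex to the union of the old index and the new points,
# then linearly fill the resulting gaps -- the same idea as
# s.reindex(union).interpolate()
def interpolate_at(index, values, new_points):
    union = np.union1d(index, new_points)     # sorted union of both sets
    filled = np.interp(union, index, values)  # linear fill at every point
    return union, filled

union, filled = interpolate_at([0, 1, 2, 3], [0, 10, 20, 30], [0.5, 1.5, 2.5])
# union  -> [0, 0.5, 1, 1.5, 2, 2.5, 3]
# filled -> [0, 5, 10, 15, 20, 25, 30]
```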

@jreback (Contributor) commented Sep 21, 2013

so maybe a method signature like this

def interpolate(self, values=None, method='linear', axis=0, inplace=False, limit=None)

makes sense

my example would be values=None,method='linear'
yours would be values=array_of_values_same_length,method=??? (where ??? could be linear or some scipy method name)

?

@jreback (Contributor) commented Sep 21, 2013

I think your 2nd example (reindex and fill) can be done exactly like that if the user wants
(you could even make this a 'method'), e.g.

values=values, method='linear' maybe should do exactly what you suggest

@jreback (Contributor) commented Sep 21, 2013

actually..... this might be easy: if values is not None, then just reindex by it, then apply the method (either using what Series does now via linear, or use a scipy method for fancier)

@TomAugspurger (Contributor, PR author)

I would add one more argument to the function signature: a way to specify what to use for the x-values. By default xvalues would be the index, like we've been assuming. But you might have a DataFrame where you want the x-values to be column A and the y-values to be column B:

def interpolate(self, xvalues=None, values=None, method='linear', axis=0, inplace=False,
    limit=None)

Gives something like

In [73]: df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [4, 5, 6, np.nan], 'C': [1, 2, 3, 5]})

In [74]: df
Out[74]: 
    A   B  C
0   1   4  1
1   2   5  2
2 NaN   6  3
3   4 NaN  5

In [75]: df.interpolate(xvalues='C', values='B')
Out[75]:

1   4
2   5
3   6
5   8

@TomAugspurger (Contributor, PR author)

I'm liking the idea of reindexing and filling. It helps clarify why my two use cases (filling NaNs in an existing frame and interpolating at new values) are really the same.

@jreback (Contributor) commented Sep 21, 2013

why don't we call it index though? clearer IMHO.

I think values and index are backwards:

def interpolate(self, index=None, values=None, method='linear', axis=0, inplace=False,
    limit=None)

so cases are:

  1. index is None, look at method and use it to fill the nans (e.g. linear)
  2. index is an array (same length as index or an error), reindex, then apply above
  3. index is a column name (only for frames), same as above (values is None here)
  4. index is an array/column name, and values is array/column ref, use that as the xvalues

so most of this is really just argument interpretation....
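The four cases above are mostly argument interpretation, which could be sketched like this (resolve_xy and the columns dict are hypothetical stand-ins, not pandas internals):

```python
import numpy as np

# a sketch of the argument interpretation: decide which arrays serve
# as x and y before handing them to an interpolator
def resolve_xy(obj_index, obj_values, index=None, values=None, columns=None):
    if index is None:                    # case 1: use the object's own index
        x = np.asarray(obj_index, dtype=float)
    elif isinstance(index, str):         # cases 3/4: a column name (frames only)
        x = np.asarray(columns[index], dtype=float)
    else:                                # case 2: an explicit array
        x = np.asarray(index, dtype=float)
    if isinstance(values, str):          # case 4: values as a column ref
        y = np.asarray(columns[values], dtype=float)
    else:
        y = np.asarray(obj_values if values is None else values, dtype=float)
    if len(x) != len(y):                 # mismatched lengths are an error
        raise ValueError("index and values must be the same length")
    return x, y
```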

@TomAugspurger (Contributor, PR author)

Yes on index for x-values, values for y-values. That will be nice for 2d interpolation too: df.interpolate(index=['A', 'B'], values='C', method='linear2d'), or just method='linear', since giving index a 2d array implies a 2d interpolation.

  1. Correct.
  2. Maybe. Not sure about a different length throwing an error. What about
In [86]: s = Series([0, 1, 2, 3])
In [87]: s.interpolate(values=[1.5, 2.5])
1.5   1.5
2.5   2.5
dtype: float64

I'll need to think about how this one is implemented.

  3. Yes
  4. Yes

@jreback (Contributor) commented Sep 21, 2013

your example for 2) doesn't fit with reindexing (well, it's not really reindexing, more like setting an index)

I think you need to require the index to be the same length; how else would you map it? (and if you really did want to map it differently, then the user should fix the series first)

@TomAugspurger (Contributor, PR author)

Actually hold on, there's potentially another case:

  5. values is None or a column of df, index is None or a column of df, and new_values is an array. Example:
In [92]: df2 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [1, 4, 9, 16]})

In [93]: df2
Out[93]: 
   A   B
0  2   1
1  4   4
2  6   9
3  8  16

In [94]: df2.interpolate(index='A', values='B', new_values=[2.5, 4.5, 6.5])
Out [94]: 

2.5   1.75
4.5   5.25
6.5   10.75

This is getting a bit hairy.  Maybe we should limit the scope for now.  Add features as needed.

@TomAugspurger (Contributor, PR author)

This is related to your last comment @jreback, so I guess your answer there applies to my last post. My disagreement with your 2) came from me mixing up the values and new_values arguments (we need a better name for new_values).

@jreback (Contributor) commented Sep 21, 2013

@TomAugspurger not sure I buy that last case..... not even sure how that maps..... keep it simple to start

you could always support values as say a dict if you need something like that (but later)

@jreback (Contributor) commented Sep 21, 2013

is new_values a parameter to the interpolation? I mean, can you show how you would be calling the scipy function in case 2)?

@TomAugspurger (Contributor, PR author)

new_values would be an argument if you think there's a use. In the abstract, interpolate would take 3 arguments: interpolate(index, values, new_values). It may help to think of partially applying the first two arguments, which would return a function f of one argument. Again, still in the abstract:

interpolate([1, 2, 3], [10, 20, 30]) returns a function of one argument f(new_values). We can evaluate f at the array new_values to return an array.

f([1.5, 2.5]) = [15, 25]

f([1, 1.5, 2, 2.5, 3]) = [10, 15, 20, 25, 30]

Stepping down from the abstraction, the arguments would be:

index: defaults to the Series' or DataFrame's index. Can be a column of a DataFrame.
values: defaults to the values of a Series, or every numeric column of a DataFrame. Must be the same length as index for the mapping to make sense.
new_values: defaults to None (filling NaNs). Optionally an array of values at which to interpolate; in that case we'd return a Series whose index is new_values and whose values are f(new_values).

plus the method='linear' argument, and the usual inplace=False, axis=0, etc.

Maybe I'll start with the NaN-filling behavior first and not have a new_values argument at all (essentially forcing new_values to be None).
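The partial-application view can be sketched directly (fit_interpolator is a hypothetical name; np.interp stands in for the scipy methods):

```python
import numpy as np

# fitting returns a function f of one argument: a closure over the
# (index, values) data, which we then evaluate at new_values
def fit_interpolator(index, values):
    def f(new_values):
        return np.interp(new_values, index, values)
    return f

f = fit_interpolator([1, 2, 3], [10, 20, 30])
f([1.5, 2.5])            # -> [15., 25.]
f([1, 1.5, 2, 2.5, 3])   # -> [10., 15., 20., 25., 30.]
```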

@TomAugspurger (Contributor, PR author)

Tentative signature and docstring

    def interpolate(self, index=None, values=None, new_values=None,
                    method='linear', inplace=False, limit=None, axis=0):
        """Interpolate values according to different methods.

        Parameters
        ----------
        index : arraylike. The domain of the interpolation.  Uses the
            Series' or DataFrame's index by default.
        values : arraylike. The range of the interpolation. Uses the values
            in a Series or DataFrame by default.  Can also be a column name.
            index and values *must* be of the same length.
        new_values : arraylike or None.
            If new_values is None, will fill NaNs.
            If new_values is an array, will return a Series containing
            the interpolated values and whose index is new_values.
        method : str or int. One of {'linear', 'time', 'values', 'nearest',
            'zero', 'slinear', 'quadratic', 'cubic'}. Or an integer
            specifying the order of the spline interpolator to use. Linear
            by default. Some of the methods require scipy. TODO: specify which ones.
        inplace : bool, default False
        limit : int, default None. Maximum number of NaNs to fill.
        axis : int, default 0

        Returns
        -------
        if new_values is None:
            Series or Frame of same shape with NaNs filled
        else:
            Series with index new_values

        See Also
        --------
        reindex, replace, fillna

        Examples
        --------

        # Filling in NaNs:
        >>> s = pd.Series([0, 1, np.nan, 3])
        # index=s.index, values=s.values; new_values is None so filling NaNs
        >>> s.interpolate()
        0    0
        1    1
        2    2
        3    3
        dtype: float64

        # Linear interpolation on Series at new values
        >>> s = pd.Series([0, 1, 2, 3])
        >>> s.interpolate(new_values=[0.5, 1.5, 2.5])
        0.5    0.5
        1.5    1.5
        2.5    2.5
        dtype: float64

        # Using two columns from a DataFrame
        >>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'Y': [1, 5, 9, np.nan]})
        >>> df.interpolate(index='A', values='Y')  # fill the NaN
           A   Y
        0  1   1
        1  2   5
        2  3   9
        3  4  13
        """

@jreback (Contributor) commented Sep 21, 2013

   values : arraylike. The range of the interpolation. Uses the values
       in a Series or DataFrame by default.  Can also be a column name.
       index and values *must* be of the same length.

what does "uses the values in a DataFrame" mean?

I wouldn't allow method to be anything other than a string.
if you need additional args, kwargs are passed through (or they should be), so use those,

e.g. spline will essentially default spline_order=1 (or order is ok too), or you can have methods spline1, spline2, etc.....

still fuzzy on the new_values...maybe your implementation/docs will clear it up!

good docstring writing!

@jreback (Contributor) commented Sep 21, 2013

how does your last example decide 13 is the value? (I mean, looking at it I get it, but how programmatically is that the case?)

@TomAugspurger (Contributor, PR author)

what does uses the 'values' in a DataFrame mean?

Identical to applying Series.interpolate() to every (numeric) column of the DataFrame. Or a list of column names to apply it to.
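Applying the Series behaviour to each numeric column can be sketched with plain numpy (fill_nans_linear is a hypothetical helper, not the pandas implementation):

```python
import numpy as np

# hypothetical helper: the same linear NaN-fill, applied per column
def fill_nans_linear(col):
    col = col.astype(float).copy()
    mask = np.isnan(col)
    pos = np.arange(len(col))
    # fill each NaN position from the surrounding known points
    col[mask] = np.interp(pos[mask], pos[~mask], col[~mask])
    return col

data = {'A': np.array([1.0, 2.0, np.nan, 4.0]),
        'B': np.array([4.0, 5.0, 6.0, np.nan])}
filled = {name: fill_nans_linear(col) for name, col in data.items()}
# A -> [1, 2, 3, 4]; B -> [4, 5, 6, 6] (trailing NaN clamps)
```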

how does your last example decide 13 is the value? (I mean looking at it I get it, but how programatically is that the case)?

I made the last example up, and after looking at current behavior I think it's wrong. Currently

In [3]: s = Series([0, 1, 2, np.nan])

In [4]: s.interpolate()
Out[4]: 
0    0
1    1
2    2
3    2
dtype: float64

Which is the easy way to do things! The scipy methods would do the same thing by default.
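That "easy way" falls out of clamping: np.interp (like the current Series.interpolate) repeats the boundary value rather than extrapolating. A minimal sketch:

```python
import numpy as np

# a NaN past the last known point gets the last known value, because
# np.interp clamps outside the known domain instead of extrapolating
y = np.array([0.0, 1.0, 2.0, np.nan])
pos = np.arange(len(y), dtype=float)
mask = np.isnan(y)
y[mask] = np.interp(pos[mask], pos[~mask], y[~mask])
# y is now [0., 1., 2., 2.] -- the last value is repeated
```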

@jreback (Contributor) commented Sep 21, 2013

so ffill is essentially the default then?

fyi... if method is None I think you should raise, don't try to 'guess'.

@TomAugspurger (Contributor, PR author)

I think method='linear' will have to be the default, since this is replacing Series.interpolate(), which isn't quite ffill. Current behavior:

In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])

In [9]: s.interpolate()
Out[9]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: float64

@jreback (Contributor) commented Sep 21, 2013

ok..sounds ok then

get the basics working (e.g. linear), then adding scipy functions should be straightforward

@TomAugspurger (Contributor, PR author)

One nasty thing about the way Series.interpolate() is implemented now: something like

In [8]: s = Series([0, 1, 2, np.nan, np.nan, 5])

In [9]: s.interpolate()
Out[9]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: float64

will return the same as

In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])

In [12]: s
Out[12]: 
1      0
2      1
4      2
7    NaN
11   NaN
16     5
dtype: float64

In [13]: s.interpolate()
Out[13]: 
1     0
2     1
4     2
7     3
11    4
16    5
dtype: float64

i.e. it's treating each value as "equally spaced".

If we are treating interpolation the way I envision we'd expect

In [11]: s = Series([0, 1, 2, np.nan, np.nan, 5], index=[1, 2, 4, 7, 11, 16])
In [12]: s.interpolate()
1     0
2     1
4     2
7     2.75
11    3.75
16    5
dtype: float64

This may mean we have to tweak the index default. It would be index=arange(len(self)) for backwards compat, and users could choose to use index=s.index.
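The two interpretations can be put side by side with np.interp (a sketch, not the pandas code):

```python
import numpy as np

y = np.array([0.0, 1.0, 2.0, np.nan, np.nan, 5.0])
idx = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
mask = np.isnan(y)

# current behaviour: treat the values as equally spaced (x = 0..n-1)
equal = y.copy()
pos = np.arange(len(y), dtype=float)
equal[mask] = np.interp(pos[mask], pos[~mask], equal[~mask])
# equal   -> [0, 1, 2, 3, 4, 5]

# proposed behaviour: use the actual numeric index as x
byindex = y.copy()
byindex[mask] = np.interp(idx[mask], idx[~mask], byindex[~mask])
# byindex -> [0, 1, 2, 2.75, 3.75, 5]
```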

@jreback (Contributor) commented Sep 21, 2013

why don't you call the existing behavior method='linear' and what most people would do method='equal',
and at some point we can just change the default (to equal)?

@TomAugspurger (Contributor, PR author)

That sounds like a good idea. I'll think about the names. equal sounds like what linear is doing right now.

@TomAugspurger (Contributor, PR author)

Ah, never mind! Wes already solved this for us. One of the possible values for the original Series.interpolate was method='values', which uses the actual numerical index.

"PCHIP interpolation.")

interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic']
if method in interp1d_methods or isinstance(method, int):
Inline review comment (Contributor):

fix this to validate order if method='spline'

@cpcloud (Member) commented Oct 9, 2013

btw @TomAugspurger i'm just being a hardass ... i think this is great stuff ... very useful!

@jreback (Contributor) commented Oct 9, 2013

echo.. @cpcloud these are mostly just nitpicks in any event!

@TomAugspurger (Contributor, PR author)

I'll never turn down free advice on programming. There's plenty to learn.

@TomAugspurger (Contributor, PR author)

@jreback Your prompt to allow kwargs reminded me that I forgot to wrap scipy.interpolate.UnivariateSpline. I guess spline interpolation is a bit different from polynomial interpolation. Anyway, both of those take an order-type argument representing the degree. I've added that now under core.common._interpolate_scipy_wrapper.

Sphinx doesn't seem to like that I added the kwargs to interpolate. It's claiming that order is not a valid keyword argument. I'll clean and rebuild from scratch.

@TomAugspurger (Contributor, PR author)

Let me know when I should rebase and squash again.

new_x = xvalues[invalid]

if method == 'time':
if not xvalues.is_all_dates:
Inline review comment (Contributor):

I would do it like this:

if not getattr(xvalues, 'is_all_dates', None)

because if for some reason xvalues is an ndarray, the plain attribute access would fail
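A minimal illustration of the suggested defensive check (is_all_dates exists on a pandas Index but not on a plain ndarray):

```python
import numpy as np

# a plain ndarray has no is_all_dates attribute; getattr with a
# default returns None instead of raising AttributeError
xvalues = np.array([1.0, 2.0, 3.0])
is_dates = getattr(xvalues, 'is_all_dates', None)
if not is_dates:
    pass  # fall through to the non-datetime handling
```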

@jreback (Contributor) commented Oct 9, 2013

@TomAugspurger you can squash and rebase when you are ready

you said above that sphinx is complaining?

@TomAugspurger (Contributor, PR author)

Yeah.
Here's a screenshot: [screenshot of the Sphinx warning, 2013-10-09 1:38 PM]

Obviously the examples work from the prompt. I'm on this PR's branch when I build the docs.

@jreback (Contributor) commented Oct 9, 2013

it's not picking up the current version when you build the docs.... (it's using your 0.12 version)

in doc/source/conf.py

sys.path.insert(0,'/home/jreback/pandas')

at the top of the file (which I then also have to take out as its not part of the default)

I think @cpcloud has a better way though

@cpcloud (Member) commented Oct 9, 2013

@TomAugspurger how are you building them? with make doc?

@TomAugspurger (Contributor, PR author)

python make.py html

@cpcloud (Member) commented Oct 9, 2013

can u try make doc? that will wipe everything you've created so far ... sometimes you can get these kinds of errors from lingering builds that sphinx doesn't think it needs to rebuild

ENH: the interpolate method argument can take more values
for various types of interpolation

REF: Moves Series.interpolate to core/generic. DataFrame gets
interpolate

CLN: clean up interpolate to use blocks

ENH: Add additional 1-d scipy interpolators.

DOC: examples for df interpolate and a plot

DOC: release notes

DOC: Scipy links and more explanation

API: Don't use fill_value

BUG: Raise on panels.

API: Raise on non-monotonic indices if it matters

BUG: Raise on only mixed types.

ENH/DOC: Add `spline` interpolation.

DOC: naming consistency
@TomAugspurger (Contributor, PR author)

make doc didn't work. Jeff's suggestion did though.

@jreback (Contributor) commented Oct 9, 2013

@TomAugspurger gr8...just make sure to take it out (or I can do it for you when we merge).....

I just put it in to build the docs (then take it out)...

@@ -174,6 +174,8 @@ Improvements to existing features
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
- ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)
- ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include
interpolation methods from scipy. (:issue:`4915`)
Inline review comment (Contributor):

change this to issues 4434, 1892 as that's what this is actually closing

@jreback (Contributor) commented Oct 9, 2013

merged via aff7346

thanks @TomAugspurger awesome job!

@TomAugspurger (Contributor, PR author)

Thanks for all the guidance / patience.

@jreback (Contributor) commented Oct 9, 2013

docs are up and look good!

http://pandas.pydata.org/pandas-docs/dev/missing_data.html#interpolation

  • scipy is prob too old on the build machine (hopefully will be updated soon) @changhiskhan @wesm
    need scipy 0.12
  • the graph that compares linear, cubic, quadratic: it seems there is no differentiation?

@TomAugspurger (Contributor, PR author)

Mistake on my part. I renamed all the s series to ser at the last minute to be more consistent with the rest of the docs. Missed that one.

Should I make a new PR to fix that quick?

@jreback (Contributor) commented Oct 9, 2013

sure


Successfully merging this pull request may close these issues.

ENH: Richer options for interpolate and resample limit keyword for interpolate
6 participants