ENH: Richer options for `interpolate` and `resample` #4434

Closed
TomAugspurger opened this Issue Aug 1, 2013 · 27 comments

6 participants

@TomAugspurger
Contributor

related #1892, #1479

Is there any interest in giving interpolate and resample (to higher frequency) some additional methods?

For example:

import numpy as np
import pandas as pd
from scipy import interpolate
df = pd.DataFrame({'A': np.arange(10), 'B': np.exp(np.arange(10) + np.random.randn())})
xnew = np.arange(10) + .5

In [46]: df.interpolate(xnew, method='spline')

Could return something like

In [47]: pd.DataFrame(interpolate.spline(df.A, df.B, xnew, order=4), index=xnew)
Out[47]: 
               0
0.5     1.044413
1.5     0.798392
2.5     3.341909
3.5     8.000314
4.5    22.822819
5.5    60.957659
6.5   166.844351
7.5   451.760621
8.5  1235.969910
9.5     0.000000  # falls outside the original range so interpolate.spline sets it to 0.

I have never used the DataFrame's interpolate, but a quick glance says that something like the above wouldn't be backwards compatible with the current calling convention. Maybe a different name? This may be confusing two issues: interpolating over missing values and interpolating / predicting non-existent values. Or are they similar enough that they can be treated the same? I would think so.
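One way to see that the two cases can be treated alike: evaluating at new x-values reduces to filling NaNs once those x-values have been added to the index. A minimal sketch, assuming linear interpolation and a sorted numeric index; the data is made up for illustration, and this is not necessarily how pandas would implement it.

```python
import numpy as np
import pandas as pd

# Illustrative data: y = 10 * x on an evenly spaced float index.
s = pd.Series([0.0, 10.0, 20.0, 30.0], index=[0.0, 1.0, 2.0, 3.0])
xnew = np.array([0.5, 1.5, 2.5])

# Step 1: add the new x-values to the index; they appear as NaN.
expanded = s.reindex(s.index.union(pd.Index(xnew)))

# Step 2: fill the NaNs, using the index values as x-coordinates.
filled = expanded.interpolate(method="index")

# Step 3: keep only the rows at the requested x-values.
result = filled.loc[xnew]
print(result)
```

Step 2 is plain missing-value interpolation, so "predicting" at non-existent points reuses the same machinery.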

These are just some quick thoughts before I forget. I haven't spent much time thinking a design through yet. I'd be happy to work on this in a month or so.

Also does this fall in the realm of statsmodels?

Contributor
cpcloud commented Aug 2, 2013

@jreback thought we deprecated DataFrame.interpolate...? should we bring it back? splines sort of blur the line between pandas and statsmodels (i think leaning more towards statsmodels) but i like the idea.

Contributor

Yes, this is a basic task that really should [edit:] not call for statsmodels, in my opinion.

Contributor

Ugly workaround I offered a few days ago: http://stackoverflow.com/a/18276030/1221924

Contributor
jreback commented Aug 21, 2013

since we do use statsmodels/scipy in other parts of the code why don't u peruse sm 0.5.0 for some available functions here?

Contributor
jreback commented Aug 21, 2013

@jseabold do u have direct support in sm 0.5.0 for interpolation? or do u defer to scipy?

Contributor
jreback commented Aug 21, 2013

http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html

should be straightforward to directly call these

Contributor
jreback commented Aug 21, 2013

via a kind argument (to pandas interpolate) with some kinds passing to scipy functions which are then wrapped on the return
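A rough sketch of what such a kind-based dispatch could look like. The helper name `interpolate_at` and the set of accepted kinds are assumptions made for illustration, not an existing pandas API; the scipy branch uses `scipy.interpolate.interp1d`, imported lazily so the linear case works without scipy installed.

```python
import numpy as np
import pandas as pd

# Kinds handed off to scipy in this sketch (an illustrative subset).
SCIPY_KINDS = {"nearest", "quadratic", "cubic"}


def interpolate_at(series, xnew, kind="linear"):
    """Hypothetical helper: evaluate `series` at `xnew`, using its index as x."""
    x = np.asarray(series.index, dtype=float)  # assumed sorted and numeric
    y = series.to_numpy(dtype=float)
    xnew = np.asarray(xnew, dtype=float)

    if kind == "linear":
        # Built-in path; np.interp holds the end values outside the data range.
        ynew = np.interp(xnew, x, y)
    elif kind in SCIPY_KINDS:
        # Dispatch to scipy only when one of its kinds is requested.
        from scipy.interpolate import interp1d
        ynew = interp1d(x, y, kind=kind)(xnew)
    else:
        raise ValueError(f"unknown interpolation kind: {kind!r}")

    # Wrap the raw array back up in a pandas object on the way out.
    return pd.Series(ynew, index=xnew, name=series.name)


s = pd.Series([0.0, 2.0, 4.0], index=[0, 1, 2], name="A")
print(interpolate_at(s, [0.5, 1.5]))
```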

Contributor

@jreback agreed about the ease of wrapping scipy.interpolate. My example in the first post is just calling

interpolate.spline(df.index, df['A'], xnew)

to get the interpolated values and then wrapping them up in a Series.

I've assumed that the DataFrame's index is the original x-values, which is probably fine for a default but we'd want an argument to say "use this column".
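For illustration, the "index by default, or use this column" idea might look like the sketch below. The function name `interpolate_columns` and the `x` parameter are hypothetical, and linear interpolation via `np.interp` stands in for whatever method would actually be used.

```python
import numpy as np
import pandas as pd


def interpolate_columns(df, xnew, x=None):
    """Hypothetical helper: interpolate every column of `df` at `xnew`.

    x=None uses the index as the x-values; otherwise the column named `x`.
    """
    xold = df.index if x is None else df[x]
    xold = np.asarray(xold, dtype=float)  # assumed sorted and numeric
    # The x column itself is not interpolated.
    ycols = df.columns if x is None else df.columns.drop(x)
    data = {col: np.interp(xnew, xold, df[col].to_numpy(dtype=float))
            for col in ycols}
    return pd.DataFrame(data, index=pd.Index(xnew, name=x))


df = pd.DataFrame({"A": [0.0, 1.0, 2.0], "B": [0.0, 10.0, 20.0]})
by_index = interpolate_columns(df, [0.5])           # x-values from the index
by_column = interpolate_columns(df, [0.5], x="A")   # x-values from column A
```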

I could probably start on this in a few weeks (I have to finish a paper, then I promised the statsmodels guys that I'd setup a vbench for them).

Contributor

Can we incorporate this into resample and reindex? Anywhere that ffill and bfill are accepted, linear and cubic should also be accepted.

And if we do that, can we give the same options that Series.interpolate provides?
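To make the request concrete, here is a small sketch of upsampling with a forward fill next to the same upsampling followed by linear interpolation. It only combines calls that exist separately (`resample`, `asfreq`, `ffill`, `interpolate`); the data is invented for the example.

```python
import pandas as pd

idx = pd.date_range("2013-08-01", periods=3, freq="60min")
s = pd.Series([0.0, 2.0, 4.0], index=idx)

# Upsample to 30 minutes; the newly created timestamps hold NaN.
upsampled = s.resample("30min").asfreq()

# Forward fill repeats the last known value...
ffilled = s.resample("30min").ffill()

# ...while interpolating on the timestamps draws a straight line between points.
linear = upsampled.interpolate(method="time")

print(pd.DataFrame({"ffill": ffilled, "linear": linear}))
```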

Contributor

@danielballan +1 on reusing parts or all of this for resample and reindex (and possibly fillna?). I think that it would be relatively easy to handle.

Not sure how this would fit in with Jeff's refactor of Series.

Contributor
jreback commented Aug 21, 2013

so there exists right now a Series.interpolate and a generic.interpolate; Series.interpolate should be basically scrapped and it will then use the generic one (needs only a very slight mod to do this).

interpolate calls the pandas.core.internals.interpolate (which is the same routine actually called by method ffill/bfill), so this can be handled at a lower level (e.g. other kinds of fillers)

it's a bit to wrap your head around, but is pretty straightforward

Contributor
jreback commented Aug 21, 2013

the key is that Series and DataFrame BOTH now have ._data (which is the BlockManager) and then call the methods on the blocks, so they end up calling the same methods

Contributor
jreback commented Aug 21, 2013

lmk if you want to take a stab (or I could set it up for you with the structure and you can add in the other methods)

Contributor

If there's no urgency, I'm fine with going through the code to refactor generic.interpolate. I may have to bug you with a few questions though!

If Series.interpolate is refactored though, backwards compatibility may be a problem? Or on second thought maybe not... Right now Series.interpolate is just for filling in NaNs. That could remain the default behavior, and we could also accept an array of new values at which you want to interpolate, defaulting to None. I think that should work.
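A sketch of the backwards-compatible signature described here, with a hypothetical `xnew` argument that defaults to None. The function name `interpolate_series` is made up, and linear interpolation is assumed in both branches.

```python
import numpy as np
import pandas as pd


def interpolate_series(s, xnew=None):
    """Hypothetical signature: fill NaNs by default, or evaluate at `xnew`."""
    if xnew is None:
        # Existing behaviour: fill missing values, keep the original index.
        return s.interpolate()
    # New behaviour: evaluate at the requested x-values instead.
    known = s.dropna()
    xnew = np.asarray(xnew, dtype=float)
    ynew = np.interp(xnew,
                     known.index.to_numpy(dtype=float),
                     known.to_numpy(dtype=float))
    return pd.Series(ynew, index=xnew, name=s.name)


s = pd.Series([0.0, np.nan, 2.0, 3.0])
filled = interpolate_series(s)               # old call style still works
shifted = interpolate_series(s, [0.5, 2.5])  # opt in to the new behaviour
```

Callers that never pass `xnew` see no change, which is the point of defaulting it to None.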

Contributor
jreback commented Aug 21, 2013

@TomAugspurger that sounds fine; there should be no back compat issue (well...have to make sure, but in theory we have tests for that.....)

if Series.interpolate is doing something that we want to keep (as the default?), which is prob linear interpolation, that can then be moved to a lower-level part of the code (e.g. core.common.interpolate_2d), where all of the interpolations will eventually happen

prob need some more tests to validate this

jreback referenced this issue Aug 21, 2013
Closed

CLN: Post Series subclass from NDFrame #4324

15 of 18 tasks complete
Contributor
jreback commented Aug 21, 2013

@TomAugspurger see #1892 as well; this is not conceptually much harder as the limit kw is already passed thru to these methods; actually implementing it might be a bit trickier. E.g. you might have to interpolate then throw away all but the number of limited values (using a mask for the prior-to NaN values)...sounds more complicated than actually doing it (but this is an add-on feature)
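The interpolate-then-mask idea can be sketched as follows. The function name `interpolate_with_limit` is hypothetical, linear interpolation is assumed, and this is only one possible way to build the mask, not the approach taken in pandas itself.

```python
import numpy as np
import pandas as pd


def interpolate_with_limit(s, limit):
    """Hypothetical helper: fill at most `limit` consecutive NaNs per gap."""
    filled = s.interpolate()  # first fill every NaN
    isna = s.isna()           # mask of the values that were NaN beforehand
    # Give each run of consecutive equal flags its own id...
    run_id = (isna != isna.shift()).cumsum()
    # ...and count how far into its run each originally-NaN value sits.
    pos_in_run = isna.astype(int).groupby(run_id).cumsum()
    # Throw away filled values that lie beyond the limit within their gap.
    too_far = isna & (pos_in_run > limit)
    return filled.mask(too_far)


s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
limited = interpolate_with_limit(s, limit=1)
print(limited)
```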

Contributor

Statsmodels uses scipy and will likely continue to do so. There is support for "benchmarking" in statsmodels, but this is such a specialized case, I don't think it's worth supporting on your end.

Contributor
jreback commented Aug 22, 2013

@jseabold good to know; I think it makes sense for pandas to have some built-in methods, and a dispatch to scipy and/or sm to use other methods...

Contributor

Starting to take a look at this. Just to get some of the scaffolding straight in my head:

  • Series and DataFrame will both have .interpolate methods which will call generic.interpolate
  • generic.interpolate will call core.internals.interpolate
  • core.internals.interpolate will call core.common.interpolate_2d, which will handle both interpolation of a new array of values given by the user and filling of NaN values. This is where I'll be adding wrappers for the various new methods.

So I'll be adding bits along the way to point things down to core.common.interpolate_2d before handing it off to a scipy or statsmodels method, capturing that result, and either reconstructing a new Series/DataFrame in the case of interpolate or filling in an existing Series/DataFrame in the case of fillna (or resample to a higher frequency?).

A couple questions:

  • Should new unit tests go in test_common.py? Or in test_frame.py and test_series.py?
  • Do you want a new top level function pd.interpolate? Or will Series and Frame methods suffice?
  • What about Panels? I'd need to think more about what that would look like. Maybe hold off on that for now.
Contributor
jtratner commented Sep 7, 2013

If they are calling generic.interpolate, why not just define it once in core/generic and use the axes abstractions there? If you want to opt-out panel, you could just have Panel raise an error...

Contributor

I could be wrong but I think generic.interpolate is an abstraction for Series and DataFrame (and Panel) interpolate methods and core.common contains the abstraction for interpolate and fillna.

Contributor
jreback commented Sep 7, 2013

@TomAugspurger

right now core.generic.NDFrame.interpolate is where the action is. You need to eliminate core.series.Series.interpolate in favor of that. It's treated the same in the BlockManager so should be straightforward.

However, there may exist some behavior in the core.series.Series.interpolate that does not yet exist in the NDFrame one, so need to integrate this.

See core.generic.NDFrame.fillna/replace for some strategies on this.

You can do a new generic tester in test_generic and move the existing testing in test_series/frame to generic. You can put it under the appropriate classes (e.g. TestSeries) and such.

You can easily not support Panel now (and leave it till the end, or do it later) by just checking ndim in core.generic.NDFrame and raising. (Similarly, you don't want to allow an invalid axis; see how fillna does this.)

As far as actually making the useful change (the point of this PR!), I would simply allow method to take different values (this will need some validation, which prob can be done in common.interpolate_2d). And if you need other parameters you can either have them passed in (as all kwargs are already propagated down to Block.interpolate anyhow), or maybe have method be a callable. Not sure exactly what the scipy functions do.

So most of the 'real' changes will occur in common.interpolate_2d. This always deals with 2d things. In fact, I would actually have you separate it out into interpolate_1d, 2d, 3d.... I think (in common), to avoid some of the boilerplate, though that is not strictly necessary, and it could still be done in one function.

This common.interpolate_2d is called from the individual Block (e.g. a single dtype), and returns an array of the same size (which could be the same one or not, but it doesn't actually matter), as the Block then wraps it up in a new Block and returns it. (This is how all of the inplace stuff is handled.)
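The layering described above can be pictured with a toy model. The classes below are simplified stand-ins that borrow the names Block and BlockManager for readability; they are not pandas' real internals, and the 1d/2d split is just one way to arrange the helpers.

```python
import numpy as np


def interpolate_1d(y):
    """Fill NaNs in a 1-D array linearly; edge NaNs take the nearest known value."""
    y = np.asarray(y, dtype=float)
    known = ~np.isnan(y)
    if not known.any():
        return y.copy()  # nothing to interpolate from
    x = np.arange(len(y))
    return np.interp(x, x[known], y[known])


def interpolate_2d(values):
    """Apply the 1-D routine to each row; returns a new array of the same shape."""
    return np.apply_along_axis(interpolate_1d, 1, values)


class Block:
    """Toy block: holds one 2-D array of a single dtype."""

    def __init__(self, values):
        self.values = values

    def interpolate(self):
        # Wrap the result in a new Block rather than modifying this one.
        return Block(interpolate_2d(self.values))


class BlockManager:
    """Toy manager: forwards the call to every block it holds."""

    def __init__(self, blocks):
        self.blocks = blocks

    def interpolate(self):
        return BlockManager([b.interpolate() for b in self.blocks])


original = np.array([[0.0, np.nan, 2.0], [10.0, np.nan, 30.0]])
mgr = BlockManager([Block(original)])
result = mgr.interpolate()
print(result.blocks[0].values)
```

Because each Block returns a new Block, a caller can either keep the result as a new object or swap it in for the old data, which is roughly how an inplace option can sit on top.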

You should be getting your hands dirty here. Shout out if you need help.

Contributor
jreback commented Sep 20, 2013

@TomAugspurger how's this coming?

Contributor

I've been a bit intimidated by the internals. I'll give it some time this weekend and maybe wave the white flag if I fail. Is it on the schedule for the next release?

Contributor
jreback commented Sep 20, 2013

I think it should be.....lmk if you need help

internals are all about breaking stuff!!! lol

Contributor

the test suite's pretty good, so that's a helpful guide.

Contributor
jreback commented Oct 9, 2013

closed via #4915

jreback closed this Oct 9, 2013