API: Interpolate at new values #9340

Open
rubennj opened this issue Jan 22, 2015 · 38 comments
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@rubennj

rubennj commented Jan 22, 2015

The first time I used the .interpolate() method, I thought that it took a new index and interpolated onto it, similar to scipy.interpolate.interp1d.
From the SciPy documentation:

import numpy as np
from scipy import interpolate
x = np.arange(0, 10)
y = np.exp(-x/3.0)
f = interpolate.interp1d(x, y)
xnew = np.arange(0, 9, 0.1)
ynew = f(xnew)   # use interpolation function returned by `interp1d`

Later I saw the .reindex() method, so I understood that this role is filled by .reindex(). However, .reindex() does not really interpolate; it only extends the existing values according to the method keyword.

The current way to achieve this in version 0.15.0 is to join the previous and new index, reindex, interpolate, and then reindex again:

index_joined = df.index.join(new_index, how='outer')
df.reindex(index=index_joined).interpolate().reindex(new_index)
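
A small worked example of the join/reindex/interpolate/reindex pattern above, as a sketch on made-up data (the Series and the new index are illustrative, not from the issue). One caveat seems worth showing: the default method='linear' treats rows as equally spaced and ignores the index values, so method='index' is assumed here to be what a caller wants when the new points are unevenly placed.

```python
import numpy as np
import pandas as pd

# Illustrative data: y = 10 * x sampled at x = 0, 1, 2.
s = pd.Series([0.0, 10.0, 20.0], index=[0.0, 1.0, 2.0])
new_index = pd.Index([0.25, 1.5])

# Same pattern as in the post: outer join, reindex, interpolate, reindex.
index_joined = s.index.join(new_index, how="outer")
expanded = s.reindex(index_joined)

# method="index" uses the index values as x coordinates.
by_index = expanded.interpolate(method="index").reindex(new_index)

# The default (method="linear") interpolates by row position instead.
by_position = expanded.interpolate().reindex(new_index)

print(by_index)     # 0.25 -> 2.5, 1.5 -> 15.0
print(by_position)  # 0.25 -> 5.0, 1.5 -> 15.0
```

On this data the two results differ at 0.25, because that point sits a quarter of the way between the neighbouring samples but exactly halfway by row position.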

A simpler syntax would be to accept the methods of .interpolate() in the 'method' keyword of .reindex():

df.reindex(index=new_index, method='linear')
@TomAugspurger
Contributor

See the (long) discussion at #4915

Basically, we wanted to keep the API of interpolate simple, but it's probably too clever, since you have to be familiar with reindexing first. I'd actually favor changing the API of interpolate to have a new parameter at, which is an array you evaluate the interpolation function at (xnew in your first example).

Also it could take some kind of parameter for whether to return just the interpolated values, or all the values.

@rubennj
Author

rubennj commented Jan 24, 2015

I probably didn't explain this well. Actually, I see that reindexing is what I really meant, since pandas works with objects that already have an index, and the real intention of this proposal is to change the index (and consequently to interpolate the corresponding values).

I would say that .interpolate() is well defined as it is (acting on missing data points), since .reindex() is present.

Anyhow, a simple syntax (on .reindex() or .interpolate()) for this operation would be very helpful.

@rubennj
Author

rubennj commented Feb 4, 2015

Any news? @TomAugspurger

@shoyer
Member

shoyer commented Feb 18, 2015

@rubennj I agree, we should have some sort of interpolation method that works like reindex. See also my recent PR to add a 'nearest' method to reindex: #9258

It is a bit awkward from an internals perspective to put this on reindex, because reindex currently does not do any interpolation, but rather only takes existing (possibly repeated) values. This is sometimes advantageous: you can always reindex, even if the data values are non-numeric.
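
To illustrate the point about reindex only taking existing values, here is a minimal sketch with made-up string data. Because no arithmetic is done on the values, the same call works for non-numeric data; the labels and target positions are purely illustrative.

```python
import pandas as pd

# Non-numeric values on a sorted integer index (illustrative data).
labels = pd.Series(["low", "mid", "high"], index=[0, 10, 20])

# Forward fill: each new label takes the last existing value at or before it.
forward = labels.reindex([5, 12, 20], method="ffill")

# Nearest: each new label takes the value of the closest existing label.
nearest = labels.reindex([4, 12, 20], method="nearest")

print(forward.tolist())  # ['low', 'mid', 'high']
print(nearest.tolist())  # ['low', 'mid', 'high']
```

Every output value is one of the original values; nothing "between" two strings is ever computed, which is why this cannot stand in for numeric interpolation.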

@shoyer
Member

shoyer commented Feb 18, 2015

What about creating two interpolate methods: .interpolate_na() and .interpolate_at()? The former would be an alias for the current interpolate (eventually to be deprecated); the latter would provide this new functionality.

I worry that hiding this functionality in reindex means it's unlikely to be easily found -- "interpolate" is a much more obvious name.

@rubennj
Author

rubennj commented Feb 18, 2015

OK, I see that .reindex() shouldn't be touched.

I think that one method, .interpolate(), modulated by parameters looks more compact, in the way @TomAugspurger suggested, but I wonder how compatible it would be with the current behaviour.

@shoyer
Member

shoyer commented Feb 18, 2015

We could add a new optional parameter at to interpolate, which if not None will trigger this alternate interpolate API.

My hesitation with combining the functionality into one method is that there are at least two steps required to transform from one mode to the other, e.g., s.interpolate_na() <=> s.dropna().interpolate_at(s.index) (the equivalent to s.interpolate_at(target) is even worse, as shown in the first post). The two methods do pretty fundamentally different things, though both involve interpolation.
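
The equivalence described above can be sketched as follows. Note that interpolate_at does not exist in pandas; the helper below is a hypothetical stand-in built on np.interp, restricted to linear interpolation on a numeric index, and the data is made up.

```python
import numpy as np
import pandas as pd


def interpolate_at(s, at):
    """Hypothetical helper: linearly interpolate `s` at the index values `at`.

    Assumes a numeric index and no missing values in `s`.
    """
    at = pd.Index(at, dtype="float64")
    s = s.sort_index()
    values = np.interp(
        at.to_numpy(),
        s.index.to_numpy(dtype="float64"),
        s.to_numpy(dtype="float64"),
    )
    return pd.Series(values, index=at)


# Illustrative data with two interior gaps on an uneven index.
s = pd.Series([1.0, np.nan, 4.0, np.nan, 6.0], index=[0.0, 1.0, 3.0, 4.0, 5.0])

# "interpolate_na" mode: fill the gaps in place, using the index as x.
filled_na = s.interpolate(method="index")

# "interpolate_at" mode: drop the gaps, then evaluate at the original index.
filled_at = interpolate_at(s.dropna(), s.index)

print(filled_na.tolist())  # [1.0, 2.0, 4.0, 5.0, 6.0]
print(filled_at.tolist())  # [1.0, 2.0, 4.0, 5.0, 6.0]
```

The two agree here only because the gaps are interior; leading and trailing NaNs are treated differently by the two approaches, which this sketch does not try to reconcile.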

Another option would be to move the interpolate_na functionality to fillna, and reserve interpolate for the interpolate_at functionality (at least pending deprecation cycles, etc.). That might be a slightly more awkward transition, though.

@jreback
Contributor

jreback commented Feb 18, 2015

I think a nice soln here is to create a cookbook recipe for this pattern and a link from the docs

@shoyer
Member

shoyer commented Feb 18, 2015

> I think a nice soln here is to create a cookbook recipe for this pattern and a link from the docs

At least in my experience and anecdotally from my colleagues, the current API of interpolate (filling missing values) is unexpected and not what we were looking for. For example, it's pretty different from what scipy and numpy's interpolate functions do. So I think some sort of API change to make this more intuitive is warranted :).

Interpolation at new values is also a very common pattern in my experience. The cookbook recipe will need to be even more complex than @rubennj's example if it is to handle propagating NA correctly, e.g., to ensure that the result of pd.Series([1, 2, np.nan, 4]).interpolate_at([2.5]) is NaN, not 2.5. It's also non-trivial to wrap SciPy directly, because SciPy uses a different meaning for NaN (see scipy/scipy#4086 and the linked issues). So I believe pretty strongly that incorporating some sort of interpolate_at functionality in pandas itself would be a good idea (regardless of what it's called).
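
A rough sketch of the NA-propagation concern, using the same values as the example above but with an explicit float index. The "recipe" line follows the reindex-based pattern from the first post; the "propagating" line is only one possible way to get NaN-aware behaviour (plain np.interp over the raw values), not a proposed pandas implementation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0], index=[0.0, 1.0, 2.0, 3.0])
target = [0.5, 2.5]

# Reindex-based recipe: the original NaN at 2.0 is filled along the way,
# so the value at 2.5 comes out finite.
joined = s.index.union(pd.Index(target))
recipe = s.reindex(joined).interpolate(method="index").reindex(target)

# NaN-propagating alternative: interpolate over the raw values, so any
# interval that touches a missing value yields NaN.
propagating = pd.Series(
    np.interp(target, s.index.to_numpy(), s.to_numpy()),
    index=target,
)

print(recipe)       # finite at both 0.5 and 2.5
print(propagating)  # 1.5 at 0.5, NaN at 2.5
```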

@rubennj
Author

rubennj commented Feb 19, 2015

I also thought that .interpolate() was doing that at first.
Interpolating onto a new index looks fundamental when you want to compare datasets with different but closely related indexes, e.g. several spectra or time series measured with different instruments.

I don't understand why overloading interpolate() by adding the new parameter at is a problem. It's not the most elegant solution, but the "two functions" solution doesn't look ideally clear either, and it would be a pity to lose the interpolate() method.

Moving the current function to fillna() and using interpolate() for interpolation onto a new index is ideal and quite clear to understand, in my opinion. However, I don't know how bad the transition would be in this case.

@TomAugspurger TomAugspurger changed the title from "Improve of syntax to interpolate" to "Improve of syntax to interpolate at new values" Aug 14, 2015
@TomAugspurger TomAugspurger changed the title from "Improve of syntax to interpolate at new values" to "API: Interpolate at new values" Aug 14, 2015
@TomAugspurger TomAugspurger added this to the 0.17.0 milestone Aug 14, 2015
@TomAugspurger
Contributor

I'm in favor of adding a new method, since the behavior is different enough from the current interpolate:

df.interpolate_at(new_values, method='linear')

@denfromufa are you interested in submitting a pull request for this? Otherwise I'll add it to my list.

@den-run-ai

@TomAugspurger, sure I can submit, please confirm the logic:

def interpolate_at(df, new_idxs):
    return df.drop_duplicates().dropna(
    ).reindex(
        np.concatenate(
        (df.index, np.unique(new_idxs)))
        ).sort().interpolate().ix[new_idxs]

Also, should axis be an input, or is transposing easy enough?

@shoyer
Member

shoyer commented Aug 15, 2015

My thoughts:

  1. The function signature should exactly match interpolate except for the values argument.
  2. The implementation should probably be at a lower level to ensure that it makes a minimal number of copies.


@den-run-ai

copying the signature from interpolate() looks feasible.

for lower level to avoid copies do you mean something like:

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

@shoyer
Member

shoyer commented Aug 17, 2015

My thought was that it's probably worth looking at the implementation of df.interpolate (which calls scipy.interpolate) and using that logic at a lower level.


@den-run-ai

Keeping interpolate_at() at such a low level (scipy) would force me to duplicate all the other code related to fill_na and to filtering the options in order to preserve the signature of interpolate().

Please correct me if I am misunderstanding something.

The only reason I would go that low level is if we want to implement scattered grid interpolation on a MultiIndex :)


@shoyer
Member

shoyer commented Aug 17, 2015

OK, fair enough -- just thought it would be worth taking a look. I agree that we don't want to duplicate that logic, but it may be possible to refactor it pretty straightforwardly to make it work for both cases.

@den-run-ai

two incompatible options:

limit : int, default None.
Maximum number of consecutive NaNs to fill.
inplace : bool, default False
Update the NDFrame in place if possible.
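
For context, a small sketch of what limit does in the existing interpolate(), on made-up data. One reading of "incompatible" (an interpretation, not something stated in the thread) is that limit counts consecutive NaNs inside the existing data, which has no obvious counterpart when evaluating at a separate set of new points, and that a result with a different index cannot overwrite the original object in place.

```python
import numpy as np
import pandas as pd

# A run of three consecutive NaNs between two known values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one NaN per run, working forward from the last valid value.
limited = s.interpolate(limit=1)

print(limited.tolist())  # [1.0, 2.0, nan, nan, 5.0]
```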

@shoyer
Member

shoyer commented Aug 18, 2015

@denfromufa agreed, you can skip those.

As far as overall API goes, I would still advocate for renaming the existing interpolate to interpolate_na, just to make it clear what these interpolation methods do and that they are on equal footing.

fillna serves a distinct purpose -- it does reindexing style assignment to the locations with NAs.
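
A minimal illustration of the "reindexing style assignment" that fillna performs, on made-up data: when the fill value is a Series, it is aligned on the index and only the NA locations are assigned.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan], index=["a", "b", "c"])

# Replacement values in a different order, plus a label that s does not have.
replacements = pd.Series({"c": 30.0, "b": 20.0, "z": 99.0})

# Values are matched by label; "a" is left alone and "z" is ignored.
filled = s.fillna(replacements)

print(filled.to_dict())  # {'a': 1.0, 'b': 20.0, 'c': 30.0}
```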

@jreback
Contributor

jreback commented Aug 18, 2015

for consistency let's call this interpolatena or interpna (the existing one).

@den-run-ai

IMO, interpolate_na and interpolate_at are better choices, as suggested originally, although they are not consistent with dropna & fillna.


@shoyer
Member

shoyer commented Aug 19, 2015

I don't like interpolatena because the words blend together without a separating character (especially with a vowel followed by a consonant). IMO, it's clearer with the extra _ character (which also makes it PEP8 compliant).

interpna matches an R function of the same name (and function) so there is some precedent there. But it's not immediately obvious like spelling the word out fully. I recall discussing this sort of thing on my tolerance pull request.

@denfromufa one more thought on implementation. I'm pretty sure that interpolate_at is a closer fit to the signature of the scipy functions than interpolate_na. This suggests that it might actually be a better idea (more efficient) to refactor interpolate_na to call interpolate_at rather than the other way around, e.g.,

def interpolate_na(series, inplace=False):
    # interpolate_at is the proposed method; it does not exist in pandas yet
    na_locs = series.isnull()
    target = series.index[na_locs]
    new_values = series[~na_locs].interpolate_at(target)
    if not inplace:
        series = series.copy()
    series[na_locs] = new_values
    return series

@den-run-ai

I'm looking at this code now to add interpolate_at() and have a hard time with this code:

https://github.com/pydata/pandas/blob/ba0704f336c733f89ac8fa23c8700bd22ae620d4/pandas/core/common.py#L1632

            firstIndex = valid.argmax()
            valid = valid[firstIndex:]
            invalid = invalid[firstIndex:]
            result = yvalues.copy()
            if valid.all():
                return yvalues

can anyone explain it?

@den-run-ai

@shoyer @TomAugspurger ok, firstIndex is probably the position of the first non-null value in the Series. All values before it (assuming the Series is sorted) cannot be interpolated. Then why isn't a lastIndex defined similarly?

@shoyer
Member

shoyer commented Aug 30, 2015

@denfromufa It looks like the lack of this behavior at the end is a bug: #8000
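
For readers following the firstIndex question, a short NumPy sketch of what valid.argmax() computes on a boolean mask, using made-up values. The last_index line is a hypothetical mirror image added for illustration; it is not taken from the pandas source.

```python
import numpy as np

yvalues = np.array([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan])
invalid = np.isnan(yvalues)
valid = ~invalid

# argmax on a boolean array returns the position of the first True,
# i.e. the first non-null value; everything before it has no left neighbour.
first_index = valid.argmax()

# Hypothetical counterpart: position of the last non-null value,
# found by running the same trick on the reversed mask.
last_index = len(valid) - 1 - valid[::-1].argmax()

print(first_index, last_index)  # 2 4
```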

@den-run-ai

ok, I started making changes in https://github.com/denfromufa/pandas, I have some TODO items in the code

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 31, 2015
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 31, 2015
@den-run-ai

Today I found one corner-case, hopefully this can be fixed at lower-level.

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

If after reindex() operation some indices are duplicates, then ix[new_idxs] generates some weird things with these duplicates [this part needs explanation]. Hence drop_duplicates() needs to be called after reindex().

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.drop_duplicates(inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

@den-run-ai

It is even deeper :(

Duplicates need to be removed even before .reindex()! This is because the new_idxs and df.index may have some duplicate items.

Hopefully others do not step on the same rake while I'm finishing my interpolation pull request.

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(
        np.concatenate(
        np.unique(
        (df.index, np.unique(new_idxs)))), inplace=True)
    df.drop_duplicates(inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

@den-run-ai

latest version, previous one had bugs and sort() is deprecated:

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df=df.reindex(
        np.unique(
        np.concatenate(
        (df.index, np.unique(new_idxs)))))
    df.drop_duplicates(inplace=True)
    df.sort_index(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

Note that this accepts both Series and DataFrames.

@jreback
Contributor

jreback commented Nov 7, 2016

generally don't use inplace (it doesn't offer any benefit and makes code much harder to read)

use Index operations rather than numpy functions
numpy ops don't generally handle the full dtype set very well

@den-run-ai

Oh, I see! How is unique in pandas much faster than numpy?!

http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.Index.unique.html#pandas-index-unique


@shoyer
Member

shoyer commented Nov 8, 2016

Pandas uses a hash table, whereas numpy just sorts.
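
One visible consequence of the hash-table versus sort difference, shown on made-up values: np.unique returns sorted values, while Index.unique keeps the order of first appearance. This sketch only shows the ordering; it makes no timing claims.

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

# NumPy sorts the input, so the result is in ascending order.
sorted_unique = np.unique(values)

# pandas tracks seen values in a hash table and keeps first-seen order.
ordered_unique = pd.Index(values).unique()

print(sorted_unique.tolist())   # [1, 2, 3]
print(ordered_unique.tolist())  # [3, 1, 2]
```

This is also why the recipes in this thread still need an explicit sort_index() after building the combined index with pandas operations.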


@den-run-ai

den-run-ai commented Nov 8, 2016

Here is cleaned up version using only pandas machinery, also fixed one more bug:

def interpolate_at(df, new_idxs):
    new_idxs = pd.Index(new_idxs)
    df = df.drop_duplicates().dropna()
    df = df.reindex(df.index.append(new_idxs).unique())
    df = df.sort_index()
    df = df.interpolate()
    return df.ix[new_idxs]

@Jostikas

To be clear, is the cookbook solution by denfromufa the current "best_for_many_cases" way to do this?

@proinsias
Contributor

@shoyer or @TomAugspurger - any more feedback for @denfromufa?

@auxym

auxym commented Nov 30, 2021

For one thing, I believe ix is deprecated in favor of loc?

But I would also be happy to see this feature in pandas. I thought it would fit well as an additional method in reindex, but interpolate_at works too.
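
For anyone landing here with a recent pandas, a possible adaptation of the recipe using .loc instead of .ix. This is a sketch, not an official API: the function is still a user-level helper, and two choices are assumptions of this example rather than part of the original recipe: method="index" (so that the index values are used as x coordinates) and removing duplicate index labels instead of calling drop_duplicates().

```python
import pandas as pd


def interpolate_at(obj, new_idxs):
    """User-level helper (hypothetical name): interpolate a Series or
    DataFrame with a numeric index at the index values in `new_idxs`."""
    new_idxs = pd.Index(new_idxs)
    obj = obj.dropna()
    # Assumption: keep the first row for each repeated index label.
    obj = obj[~obj.index.duplicated(keep="first")]
    combined = obj.index.append(new_idxs).unique()
    obj = obj.reindex(combined).sort_index()
    # Assumption: interpolate against the index values, not row positions.
    return obj.interpolate(method="index").loc[new_idxs]


s = pd.Series([0.0, 10.0, 20.0], index=[0.0, 1.0, 2.0])
df = pd.DataFrame({"a": [0.0, 10.0, 20.0], "b": [0.0, 1.0, 2.0]},
                  index=[0.0, 1.0, 2.0])

print(interpolate_at(s, [0.5, 1.25]))  # 0.5 -> 5.0, 1.25 -> 12.5
print(interpolate_at(df, [0.5]))       # a -> 5.0, b -> 0.5
```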

@auxym

auxym commented Nov 30, 2021

drop_duplicates() though seems to be causing problems in my usage. If anything we'd want to remove duplicate indices, but drop_duplicates removes rows with duplicate values.
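
A small sketch of that difference on made-up data: drop_duplicates() compares values, so it can discard a legitimate sample while leaving repeated index labels in place, whereas Index.duplicated() targets the labels themselves.

```python
import pandas as pd

# Index label 2 is repeated; the value 1.0 also occurs twice (at labels 0 and 1).
s = pd.Series([1.0, 1.0, 2.0, 3.0], index=[0, 1, 2, 2])

# Value-based: drops the row at label 1, keeps both rows labelled 2.
by_value = s.drop_duplicates()

# Label-based: keeps the first row for each index label.
by_index = s[~s.index.duplicated(keep="first")]

print(by_value.index.tolist())  # [0, 2, 2]
print(by_index.index.tolist())  # [0, 1, 2]
```

For interpolation, the value-based version is doubly unhelpful: flat stretches of data lose points, and the index can still contain duplicates afterwards.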

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022