Skip to content

Conversation

cchwala
Copy link
Contributor

@cchwala cchwala commented Feb 4, 2019

This PR introduces the new keyword max_gap for interpolate. For all NaN-gaps which are wider than max_gap no interpolation is carried out and all NaNs of the gap are preserved. This is in contrast to using the limit kwarg which does not prevent interpolating values in a longer NaN gap.

I added numpy-based implementation that searches for NaN-gaps wider than max_gap. In line with the current implementations for NaN handling in series.interpolate, a set of NaN-indices that has
to be preserved is generated. This is used in the end, after a full interpolation of all NaN is done, to restore the NaNs gaps that shall not be interpolated.

Test and documentation were also added.

It will need some small PEP8-cleanup and maybe tests using other interpolation methods then linear (edit: Done). But before I continue, I would like to get feedback if my approach is in general okay.

This PR might also be extended to close #16457 which is on interpolation directly after resampling.

Example usage:

In [1]: import numpy as np                                                                                                                                                                                                         

In [2]: import pandas as pd                                                                                                                                                                                                        

In [3]: s = pd.Series([np.nan, 1., np.nan, 2., np.nan, np.nan,  
   ...:                5., np.nan, np.nan, np.nan, -1., np.nan, np.nan])                                                                                                                                                           

# Using the new `max_gap` kwarg
In [4]: s.interpolate(max_gap=2, limit_area='inside')                                                                                                                                                                               
Out[4]: 
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     NaN
8     NaN
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64

# Compare to the result when using the existing `limit` kwarg
In [5]: s.interpolate(limit=2, limit_area='inside')                                                                                                                                                                                
Out[5]: 
0     NaN
1     1.0
2     1.5
3     2.0
4     3.0
5     4.0
6     5.0
7     3.5
8     2.0
9     NaN
10   -1.0
11    NaN
12    NaN
dtype: float64

Timing:

In [6]: %timeit s.interpolate(max_gap=2, limit_area='inside')                                                                                                                                                                       
708 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit s.interpolate(limit=2, limit_area='inside')                                                                                                                                                                        
631 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The relative speed difference is similar for larger Series.

Added numpy-based implementation that searchs for NaN-gaps wider
than `maxgap`. In line with the current implementations for
NaN handling in `series.interpolate`, a set of NaN-indices that has
to be preserved is generated.

Test and documentation were also added.
@pep8speaks
Copy link

pep8speaks commented Feb 4, 2019

Hello @cchwala! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-20 14:00:08 UTC

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

Merging #25141 into master will decrease coverage by 49.52%.
The diff coverage is 3.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25141       +/-   ##
===========================================
- Coverage   92.37%   42.84%   -49.53%     
===========================================
  Files         166      166               
  Lines       52408    52430       +22     
===========================================
- Hits        48412    22464    -25948     
- Misses       3996    29966    +25970
Flag Coverage Δ
#multiple ?
#single 42.84% <3.57%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 39.89% <ø> (-56.74%) ⬇️
pandas/core/missing.py 15.65% <3.57%> (-76.92%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e38d55...b752602. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@3c55e1e). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #25141   +/-   ##
=========================================
  Coverage          ?   41.19%           
=========================================
  Files             ?      178           
  Lines             ?    50799           
  Branches          ?        0           
=========================================
  Hits              ?    20928           
  Misses            ?    29871           
  Partials          ?        0
Flag Coverage Δ
#single 41.19% <0%> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c55e1e...28b442c. Read the comment docs.

@cchwala cchwala changed the title Added maxgap keyword for series.interpolate ENH: Added maxgap keyword for series.interpolate Feb 4, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a couple of examples to the doc-string so its easy to see what this is doing, include using limit and not.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Feb 6, 2019
@cchwala
Copy link
Contributor Author

cchwala commented Feb 20, 2019

@jreback Just a quick comment to let you know that updating this PR is still on my TODO list, but I have not had time to work on it for the last two weeks. Next week looks better, at least judging from now...

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

can you merge master

@cchwala
Copy link
Contributor Author

cchwala commented Mar 26, 2019

Resolved the tiny merge conflict. Still no other progress with the PR. Sorry. I am too overloaded till early April.

@jreback
Copy link
Contributor

jreback commented May 7, 2019

@cchwala if you have time; this is a nice patch. pls update to comments.

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

@pandas-dev/pandas-core if someone wants to take this over the line, pls merge master and update to comments.

@cchwala
Copy link
Contributor Author

cchwala commented Jun 11, 2019

@jreback Sorry for the long silence. I will do it today.

@cchwala
Copy link
Contributor Author

cchwala commented Jun 11, 2019

There is a problem. For s.interpolate(method='pad', max_gap=2) the max_gap keyword does not seem to have an effect. I added a test which fails in this case.

From a first quick search the cause might lie here:

if m is not None:
r = check_int_bool(self, inplace)
if r is not None:
return r
return self._interpolate_with_fill(method=m, axis=axis,
inplace=inplace, limit=limit,
fill_value=fill_value,
coerce=coerce,
downcast=downcast)
# validate the interp method
m = missing.clean_interp_method(method, **kwargs)
r = check_int_bool(self, inplace)
if r is not None:
return r
return self._interpolate(method=m, index=index, values=values,
axis=axis, limit=limit,
limit_direction=limit_direction,
limit_area=limit_area,
fill_value=fill_value, inplace=inplace,
downcast=downcast, **kwargs)

There are two different pathways for interpolate depending on the selected method. For pad, ffill and bfill missing.interpolate_2d is used which does not yet support the max_gap option. It also does not seem to recognize the keywords limit_direction and limit_area, they are silently ignored.

@WillAyd
Copy link
Member

WillAyd commented Aug 28, 2019

@cchwala is this still active? Can you merge master

@cchwala
Copy link
Contributor Author

cchwala commented Aug 31, 2019

I resolved the merge conflict. However, I still have not found the time to work further on this PR, in particular because it somehow is blocked by #26796. It remains somewhere in the middle of my TODO list...

@cchwala
Copy link
Contributor Author

cchwala commented Sep 18, 2019

FYI, I am offline till 7th of October but plan to continue with this PR afterwards

@shoyer shoyer changed the title ENH: Added maxgap keyword for series.interpolate ENH: Added max_gap keyword for series.interpolate Sep 19, 2019
@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@cchwala is this still active?

# Conflicts:
#	pandas/core/internals/blocks.py
#	pandas/core/missing.py
@cchwala
Copy link
Contributor Author

cchwala commented Nov 13, 2019

@WillAyd Not really active... But I resolved the merge conflicts on my machine.

Unfortunately I am struggling with #29330 a conda problem on my machine, even in a fresh miniconda3 install, so I cannot run the tests. I will push, when I have verified that my merge makes sense.

@cchwala
Copy link
Contributor Author

cchwala commented Nov 20, 2019

@TomAugspurger @WillAyd This is now finally ready for another review.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot going on here, sorry only got through part of it at a glance.

Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

if (method == "pad") or (method == "ffill"):
if (limit_direction == "backward") or (limit_direction == "both"):
raise ValueError(
"`limit_direction` must not be `%s` for method `%s`"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use f-strings for this and the one on L 7140

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay


if max_gap is not None:

def bfill_nan(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit to making this a separate closure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real reason. Maybe at some point I thought the function definition would make things clearer.

Should I just put the content of the function in-line starting at L350?

# convert float back to datetime64
values = values.astype(orig_values.dtype)

# if np.issubdtype(values.dtype, np.datetime64):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just forgot to remove it. This stems from when I tried to manually chose the correct fill_value. But getting the correct fill_value is handled here

@cchwala
Copy link
Contributor Author

cchwala commented Nov 20, 2019

Is there any chance that the changes to allow limit_area and limit_direction in Series.interpolate with pad can be split into its own PR? Would that make this one much smaller?

For making max_gap work with pad I had to introduce missing.interpolate_1d_fill because the existingmissing.interpolate_2d, which pad did always use, is not (easily) extendable. The cause for the wrong behavior of limit_area when using pad was the same, so this was solved somehow as a byproduct.

If we would want to split this PR up, probably most changes of this PR would move to the one to solve the issue with pad and limit_area. So yes, this one would be much smaller, but the other one would be equally large.

@WillAyd
Copy link
Member

WillAyd commented Jan 3, 2020

@cchwala can you merge master and fix up the CI error for code checks?

@cchwala
Copy link
Contributor Author

cchwala commented Jan 15, 2020

@WillAyd I separated large parts of the changes into #31048 and will continue here afterwards.

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

Closing to clean queue - I think can revisit after #31048

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: new interpolate kwarg to skip gaps Pandas interpolation enhancement request : specifying the maximum gap to interpolate.
6 participants