ENH/REF: Additional methods for interpolate
ENH: the interpolate method argument can take more values
for various types of interpolation

REF: Moves Series.interpolate to core/generic. DataFrame gets
interpolate

CLN: clean up interpolate to use blocks

ENH: Add additional 1-d scipy interpolators.

DOC: examples for df interpolate and a plot

DOC: release notes

DOC: Scipy links and more explanation

API: Don't use fill_value

BUG: Raise on panels.

API: Raise on non-monotonic indices if it matters

BUG: Raise on only mixed types.

ENH/DOC: Add `spline` interpolation.

DOC: naming consistency
TomAugspurger authored and jreback committed Oct 9, 2013
1 parent 27e4fb1 commit aff7346
Showing 9 changed files with 863 additions and 240 deletions.
89 changes: 87 additions & 2 deletions doc/source/missing_data.rst
@@ -271,8 +271,13 @@ examined :ref:`in the API <api.dataframe.missing>`.
Interpolation
~~~~~~~~~~~~~

A linear **interpolate** method has been implemented on Series. The default
interpolation assumes equally spaced points.
.. versionadded:: 0.13.0

DataFrame now has an ``interpolate`` method.
:meth:`~pandas.Series.interpolate` also accepts several additional values for ``method``.

Both Series and DataFrame objects have an ``interpolate`` method that, by default,
performs linear interpolation at missing data points.

.. ipython:: python
   :suppress:

@@ -328,6 +333,86 @@ For a floating-point index, use ``method='values'``:

   ser.interpolate(method='values')

You can also interpolate with a DataFrame:

.. ipython:: python

   df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   df.interpolate()

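
For intuition, the default linear fill can be sketched in plain Python, without pandas. The helper ``linear_fill`` below is a hypothetical name used only for illustration. It treats the points as equally spaced and leaves leading gaps untouched. As an assumption mirroring the ``np.interp``-based implementation in this commit, it repeats the last valid value across trailing gaps.

```python
import math


def linear_fill(values):
    """Fill NaN gaps by linear interpolation over equally spaced positions."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not known:
        return out  # nothing to anchor the interpolation on
    # Interior gaps: interpolate between the neighbouring valid points.
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    # Trailing gaps: hold the last valid value (assumption, see lead-in).
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    return out


nan = float('nan')
# Same numbers as column 'B' in the DataFrame example.
col_b = linear_fill([.25, nan, nan, 4, 12.2, 14.4])
print(col_b)
```

The two missing entries of column ``B`` are placed on the straight line between ``.25`` and ``4``; the real method additionally handles indexes, dtypes, and the ``limit`` argument.
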
The ``method`` argument gives access to more sophisticated interpolation methods.
If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``.
Consult the full scipy interpolation documentation_ and reference guide_ for details.
The appropriate interpolation method depends on the type of data you are working with.
For example, if you are dealing with a time series that is growing at an increasing rate,
``method='quadratic'`` may be appropriate. If you have values approximating a cumulative
distribution function, then ``method='pchip'`` should work well.

.. warning::

   These methods require ``scipy``.

.. ipython:: python

   df.interpolate(method='barycentric')

   df.interpolate(method='pchip')

When interpolating via a polynomial or spline approximation, you must also specify
the degree or order of the approximation:

.. ipython:: python

   df.interpolate(method='spline', order=2)

   df.interpolate(method='polynomial', order=2)

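
The requirement can be pictured as a small validation step that runs before any interpolation. The sketch below follows the checks in the ``_clean_interp_method`` helper added to ``pandas/core/common.py`` in this commit. The names ``check_method`` and ``VALID_METHODS`` are illustrative and not part of pandas.

```python
# Method names accepted by this commit's validation helper.
VALID_METHODS = ['linear', 'time', 'values', 'nearest', 'zero', 'slinear',
                 'quadratic', 'cubic', 'barycentric', 'polynomial',
                 'krogh', 'piecewise_polynomial', 'pchip', 'spline']


def check_method(method, order=None):
    """Reject unknown methods and spline/polynomial requests without an order."""
    if method in ('spline', 'polynomial') and order is None:
        raise ValueError("You must specify the order of the spline or "
                         "polynomial.")
    if method not in VALID_METHODS:
        raise ValueError("method must be one of {0}. "
                         "Got '{1}' instead.".format(VALID_METHODS, method))
    return method


print(check_method('spline', order=2))
try:
    check_method('spline')
except ValueError as err:
    print(err)
```

Checking ``order`` up front means a missing argument is reported before scipy is called.
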
Compare several methods:

.. ipython:: python

   np.random.seed(2)

   ser = Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))
   bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29, 34, 35, 36])
   ser[bad] = np.nan
   methods = ['linear', 'quadratic', 'cubic']

   df = DataFrame({m: ser.interpolate(method=m) for m in methods})
   @savefig compare_interpolations.png
   df.plot()

Another use case is interpolation at *new* values.
Suppose you have 100 observations from some distribution, and you are
particularly interested in what is happening around the middle.
You can combine pandas' ``reindex`` and ``interpolate`` methods to interpolate
at the new values.

.. ipython:: python

   ser = Series(np.sort(np.random.uniform(size=100)))

   # interpolate at new_index
   new_index = ser.index.union(Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
   interp_s = ser.reindex(new_index).interpolate(method='pchip')
   interp_s[49:51]

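
The idea behind reindexing and then interpolating can be sketched without pandas: evaluate the known points at positions that lie between the existing index labels. The helper ``interp_at`` is a hypothetical name, and piecewise-linear interpolation is swapped in here for ``pchip`` to keep the example dependency-free, so the numbers will differ from the pandas output.

```python
from bisect import bisect_left


def interp_at(xs, ys, new_xs):
    """Piecewise-linear interpolation of (xs, ys) at new_xs; xs must be sorted."""
    out = []
    for x in new_xs:
        j = bisect_left(xs, x)
        if j < len(xs) and xs[j] == x:
            out.append(ys[j])  # x is an existing observation
            continue
        if j == 0 or j == len(xs):
            raise ValueError("x={0} lies outside the observed range".format(x))
        x0, x1 = xs[j - 1], xs[j]
        y0, y1 = ys[j - 1], ys[j]
        out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


# Three made-up observations around the middle of the index.
xs = [49, 50, 51]
ys = [0.40, 0.50, 0.70]
new_vals = interp_at(xs, ys, [49.25, 49.5, 50.5])
print(new_vals)
```

In pandas, ``reindex`` supplies the enlarged set of positions (with ``NaN`` at the new ones) and ``interpolate`` fills them in.
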
.. _scipy: http://www.scipy.org
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html


Like other pandas fill methods, ``interpolate`` accepts a ``limit`` keyword argument.
Use this to limit the number of consecutive interpolations, keeping ``NaN`` values
for interpolations that are too far from the last valid observation:

.. ipython:: python

   ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
   ser.interpolate(limit=2)

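
One way to picture the ``limit`` rule is as a sliding window over the missing positions: a slot stays ``NaN`` when it closes a run of ``limit + 1`` consecutive missing values. The sketch below follows the windowing logic of the ``_interp_limit`` helper in this commit, rewritten in plain Python; ``positions_over_limit`` is a hypothetical name.

```python
import math


def positions_over_limit(values, limit):
    """Return positions left as NaN because they close a run of limit + 1 NaNs."""
    is_nan = [math.isnan(v) for v in values]
    over = []
    for start, missing in enumerate(is_nan):
        if not missing:
            continue
        window = is_nan[start:start + limit + 1]
        # A full window of limit + 1 NaNs means its last slot is too far
        # from the previous valid observation to be filled.
        if len(window) > limit and all(window):
            over.append(start + limit)
    return over


nan = float('nan')
masked = positions_over_limit([1, 3, nan, nan, nan, 11], limit=2)
print(masked)
```

For the series in the example above, only position 4 is reported: the first two gaps after the value ``3`` are filled and the third is left as ``NaN``.
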
.. _missing_data.replace:

Replacing Generic Values
4 changes: 3 additions & 1 deletion doc/source/release.rst
@@ -78,7 +78,7 @@ Experimental Features
- Add msgpack support via ``pd.read_msgpack()`` and ``pd.to_msgpack()`` / ``df.to_msgpack()`` for serialization
of arbitrary pandas (and python objects) in a lightweight portable binary format (:issue:`686`)
- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
- Added :mod:`pandas.io.gbq` for reading from (and writing to) Google BigQuery into a DataFrame. (:issue:`4140`)
- Added :mod:`pandas.io.gbq` for reading from (and writing to) Google BigQuery into a DataFrame. (:issue:`4140`)

Improvements to existing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -174,6 +174,8 @@ Improvements to existing features
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
- ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)
- ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include
interpolation methods from scipy. (:issue:`4434`, :issue:`1892`)

API Changes
~~~~~~~~~~~
28 changes: 28 additions & 0 deletions doc/source/v0.13.0.txt
@@ -614,6 +614,34 @@ Experimental

- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.

- DataFrame has a new ``interpolate`` method, similar to Series (:issue:`4434`, :issue:`1892`)

.. ipython:: python

   df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   df.interpolate()

Additionally, the ``method`` argument to ``interpolate`` has been expanded
to include 'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial', and 'spline'.
The 'polynomial' and 'spline' methods also take an integer ``order`` giving the degree
or order of the approximation. The new methods
require scipy_. Consult the Scipy reference guide_ and documentation_ for more information
about when the various methods are appropriate. See also the :ref:`pandas interpolation docs<missing_data.interpolate>`.

Interpolate now also accepts a ``limit`` keyword argument.
This works similarly to ``fillna``'s limit:

.. ipython:: python

   ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
   ser.interpolate(limit=2)

.. _scipy: http://www.scipy.org
.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html


.. _whatsnew_0130.refactoring:

Internal Refactoring
147 changes: 147 additions & 0 deletions pandas/core/common.py
@@ -1244,6 +1244,153 @@ def backfill_2d(values, limit=None, mask=None):
    return values


def _clean_interp_method(method, order=None, **kwargs):
    valid = ['linear', 'time', 'values', 'nearest', 'zero', 'slinear',
             'quadratic', 'cubic', 'barycentric', 'polynomial',
             'krogh', 'piecewise_polynomial',
             'pchip', 'spline']
    if method in ('spline', 'polynomial') and order is None:
        raise ValueError("You must specify the order of the spline or "
                         "polynomial.")
    if method not in valid:
        raise ValueError("method must be one of {0}. "
                         "Got '{1}' instead.".format(valid, method))
    return method


def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
                   fill_value=None, bounds_error=False, **kwargs):
    """
    Logic for the 1-d interpolation. The result should be 1-d, inputs
    xvalues and yvalues will each be 1-d arrays of the same length.
    Bounds_error is currently hardcoded to False since non-scipy ones don't
    take it as an argument.
    """
    # Treat the original, non-scipy methods first.

    invalid = isnull(yvalues)
    valid = ~invalid

    valid_y = yvalues[valid]
    valid_x = xvalues[valid]
    new_x = xvalues[invalid]

    if method == 'time':
        if not getattr(xvalues, 'is_all_dates', None):
            # if not issubclass(xvalues.dtype.type, np.datetime64):
            raise ValueError('time-weighted interpolation only works '
                             'on Series or DataFrames with a '
                             'DatetimeIndex')
        method = 'values'

    def _interp_limit(invalid, limit):
        """mask off values that won't be filled since they exceed the limit"""
        all_nans = np.where(invalid)[0]
        violate = [invalid[x:x + limit + 1] for x in all_nans]
        violate = np.array([x.all() & (x.size > limit) for x in violate])
        return all_nans[violate] + limit

    xvalues = getattr(xvalues, 'values', xvalues)
    yvalues = getattr(yvalues, 'values', yvalues)

    if limit:
        violate_limit = _interp_limit(invalid, limit)
    if valid.any():
        # skip leading NaNs: there is nothing before them to interpolate from
        firstIndex = valid.argmax()
        valid = valid[firstIndex:]
        invalid = invalid[firstIndex:]
        result = yvalues.copy()
        if valid.all():
            return yvalues
    else:
        # have to call np.array(xvalues) since xvalues could be an Index
        # which can't be mutated
        result = np.empty_like(np.array(xvalues), dtype=np.float64)
        result.fill(np.nan)
        return result

    if method in ['linear', 'time', 'values']:
        if method in ('values', 'index'):
            inds = np.asarray(xvalues)
            # hack for DatetimeIndex, #1646
            if issubclass(inds.dtype.type, np.datetime64):
                inds = inds.view(pa.int64)

            if inds.dtype == np.object_:
                inds = lib.maybe_convert_objects(inds)
        else:
            inds = xvalues

        inds = inds[firstIndex:]

        result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid],
                                                 yvalues[firstIndex:][valid])

        if limit:
            result[violate_limit] = np.nan
        return result

    sp_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
                  'barycentric', 'krogh', 'spline', 'polynomial',
                  'piecewise_polynomial', 'pchip']
    if method in sp_methods:
        new_x = new_x[firstIndex:]
        xvalues = xvalues[firstIndex:]

        result[firstIndex:][invalid] = _interpolate_scipy_wrapper(valid_x,
            valid_y, new_x, method=method, fill_value=fill_value,
            bounds_error=bounds_error, **kwargs)
        if limit:
            result[violate_limit] = np.nan
        return result


def _interpolate_scipy_wrapper(x, y, new_x, method, fill_value=None,
                               bounds_error=False, order=None, **kwargs):
    """
    passed off to scipy.interpolate.interp1d. method is scipy's kind.
    Returns an array interpolated at new_x. Add any new methods to
    the list in _clean_interp_method
    """
    try:
        from scipy import interpolate
    except ImportError:
        raise ImportError('{0} interpolation requires Scipy'.format(method))

    new_x = np.asarray(new_x)

    # ignores some kwargs that could be passed along.
    alt_methods = {
        'barycentric': interpolate.barycentric_interpolate,
        'krogh': interpolate.krogh_interpolate,
        'piecewise_polynomial': interpolate.piecewise_polynomial_interpolate,
    }

    try:
        alt_methods['pchip'] = interpolate.pchip_interpolate
    except AttributeError:
        if method == 'pchip':
            raise ImportError("Your version of scipy does not support "
                              "PCHIP interpolation.")

    interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
                        'polynomial']
    if method in interp1d_methods:
        if method == 'polynomial':
            method = order
        terp = interpolate.interp1d(x, y, kind=method, fill_value=fill_value,
                                    bounds_error=bounds_error)
        new_y = terp(new_x)
    elif method == 'spline':
        terp = interpolate.UnivariateSpline(x, y, k=order)
        new_y = terp(new_x)
    else:
        method = alt_methods[method]
        new_y = method(x, y, new_x)
    return new_y


def interpolate_2d(values, method='pad', axis=0, limit=None, fill_value=None):
    """ perform an actual interpolation of values, values will be made 2-d if needed
    fills inplace, returns the result """