ENH/REF: Additional methods for interpolate

ENH: the interpolate method argument can take more values for various types of interpolation
REF: Moves Series.interpolate to core/generic. DataFrame gets interpolate
CLN: clean up interpolate to use blocks
ENH: Add additional 1-d scipy interpolators.
DOC: examples for df interpolate and a plot
DOC: release notes
DOC: Scipy links and more explanation
API: Don't use fill_value
BUG: Raise on panels.
API: Raise on non-monotonic indices if it matters
BUG: Raise on only mixed types.
ENH/DOC: Add `spline` interpolation.
DOC: naming consistency
pandas-dev · Oct 9, 2013 · aff7346
1 parent 27e4fb1

Showing 9 changed files with 863 additions and 240 deletions.
diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst
@@ -271,8 +271,13 @@ examined :ref:`in the API <api.dataframe.missing>`.
 Interpolation
 ~~~~~~~~~~~~~
 
-A linear **interpolate** method has been implemented on Series. The default
-interpolation assumes equally spaced points.
+.. versionadded:: 0.13.0
+
+  DataFrame now has an ``interpolate`` method.
+  :meth:`~pandas.Series.interpolate` also gained some additional methods.
+
+Both Series and DataFrame objects have an ``interpolate`` method that, by default,
+performs linear interpolation at missing data points.
 
 .. ipython:: python
    :suppress:
@@ -328,6 +333,86 @@ For a floating-point index, use ``method='values'``:
 
    ser.interpolate(method='values')
 
+You can also interpolate with a DataFrame:
+
+.. ipython:: python
+
+   df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
+                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
+   df.interpolate()
+
+The ``method`` argument gives access to fancier interpolation methods.
+If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``.
+You'll want to consult the full scipy interpolation documentation_ and reference guide_ for details.
+The appropriate interpolation method will depend on the type of data you are working with.
+For example, if you are dealing with a time series that is growing at an increasing rate,
+``method='quadratic'`` may be appropriate.  If you have values approximating a cumulative
+distribution function, then ``method='pchip'`` should work well.
+
+.. warning::
+
+    These methods require ``scipy``.
+
+.. ipython:: python
+
+  df.interpolate(method='barycentric')
+
+  df.interpolate(method='pchip')
+
+When interpolating via a polynomial or spline approximation, you must also specify
+the degree or order of the approximation:
+
+.. ipython:: python
+
+  df.interpolate(method='spline', order=2)
+
+  df.interpolate(method='polynomial', order=2)
+
+Compare several methods:
+
+.. ipython:: python
+
+  np.random.seed(2)
+
+  ser = Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))
+  bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29, 34, 35, 36])
+  ser[bad] = np.nan
+  methods = ['linear', 'quadratic', 'cubic']
+
+  df = DataFrame({m: ser.interpolate(method=m) for m in methods})
+  @savefig compare_interpolations.png
+  df.plot()
+
+Another use case is interpolation at *new* values.
+Suppose you have 100 observations from some distribution, and that you're
+particularly interested in what's happening around the middle.
+You can mix pandas' ``reindex`` and ``interpolate`` methods to interpolate
+at the new values.
+
+.. ipython:: python
+
+  ser = Series(np.sort(np.random.uniform(size=100)))
+
+  # interpolate at new_index
+  new_index = ser.index.union(Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
+
+  interp_s = ser.reindex(new_index).interpolate(method='pchip')
+
+  interp_s[49:51]
+
+.. _scipy: http://www.scipy.org
+.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
+.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
+
+
+Like other pandas fill methods, ``interpolate`` accepts a ``limit`` keyword argument.
+Use this to limit the number of consecutive interpolations, keeping ``NaN`` values
+for interpolations that are too far from the last valid observation:
+
+.. ipython:: python
+
+  ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
+  ser.interpolate(limit=2)
+
 .. _missing_data.replace:
 
 Replacing Generic Values
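
Note on the missing_data.rst hunks above: a minimal sketch of the DataFrame example and of the reindex-then-interpolate pattern, written against a current pandas install (an assumption; the commit itself targets 0.13.0). It uses the `pd.` namespace, builds the new index with `Index.union` because `+` on two Index objects is no longer a set union in recent pandas versions, and uses `method='index'` so that scipy is not needed. The three-point series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the documentation example; default method is linear.
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
filled = df.interpolate()

# Interpolating at *new* x values: add them to the index, then fill the
# resulting NaNs. The data here is illustrative, not from the commit.
ser = pd.Series([0.0, 10.0, 20.0], index=[0.0, 1.0, 2.0])
new_index = ser.index.union(pd.Index([0.25, 1.5]))

# method='index' weights by the index values rather than by position.
interp_s = ser.reindex(new_index).interpolate(method='index')
print(filled)
print(interp_s)
```

With `method='linear'` the last call would treat the five rows as equally spaced, which is usually not what is wanted after inserting new index values.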

diff --git a/doc/source/release.rst b/doc/source/release.rst
@@ -78,7 +78,7 @@ Experimental Features
   - Add msgpack support via ``pd.read_msgpack()`` and ``pd.to_msgpack()`` / ``df.to_msgpack()`` for serialization
     of arbitrary pandas (and python objects) in a lightweight portable binary format (:issue:`686`)
   - Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
-  - Added :mod:`pandas.io.gbq` for reading from (and writing to) Google BigQuery into a DataFrame. (:issue:`4140`) 
+  - Added :mod:`pandas.io.gbq` for reading from (and writing to) Google BigQuery into a DataFrame. (:issue:`4140`)
 
 Improvements to existing features
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -174,6 +174,8 @@ Improvements to existing features
   - :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
     from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
   - ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)
+  - ``DataFrame.interpolate()`` and ``Series.interpolate()`` have been expanded to include
+    interpolation methods from scipy. (:issue:`4434`, :issue:`1892`)
 
 API Changes
 ~~~~~~~~~~~

diff --git a/doc/source/v0.13.0.txt b/doc/source/v0.13.0.txt
@@ -614,6 +614,34 @@ Experimental
 
 - Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
 
+- DataFrame has a new ``interpolate`` method, similar to Series (:issue:`4434`, :issue:`1892`)
+
+  .. ipython:: python
+
+      df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
+                      'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
+      df.interpolate()
+
+  Additionally, the ``method`` argument to ``interpolate`` has been expanded
+  to include ``'nearest'``, ``'zero'``, ``'slinear'``, ``'quadratic'``, ``'cubic'``,
+  ``'barycentric'``, ``'krogh'``, ``'piecewise_polynomial'`` and ``'pchip'``, as well as
+  ``'polynomial'`` and ``'spline'``, which also take an integer ``order`` giving the degree
+  or order of the approximation.  The new methods require scipy_. Consult the SciPy
+  reference guide_ and documentation_ for more information about when the various methods
+  are appropriate.  See also the :ref:`pandas interpolation docs<missing_data.interpolate>`.
+
+  ``interpolate`` now also accepts a ``limit`` keyword argument.
+  This works similarly to ``fillna``'s limit:
+
+  .. ipython:: python
+
+    ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
+    ser.interpolate(limit=2)
+
+.. _scipy: http://www.scipy.org
+.. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
+.. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
+
+
 .. _whatsnew_0130.refactoring:
 
 Internal Refactoring
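
Note on the ``limit`` entry in the v0.13.0.txt hunk above: a small sketch comparing ``interpolate(limit=2)`` with a forward fill under the same limit, assuming a current pandas install and the `pd.` namespace. `ffill` stands in here for the "``fillna``'s limit" behaviour the note refers to; that substitution is a choice made for this sketch, not something the commit prescribes.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, np.nan, np.nan, np.nan, 11])

# At most two consecutive NaNs are filled; the third one is left as NaN.
interpolated = ser.interpolate(limit=2)

# Forward fill with the same limit repeats the last valid value instead
# of interpolating, but stops after the same number of steps.
padded = ser.ffill(limit=2)

print(pd.DataFrame({'interpolate': interpolated, 'ffill': padded}))
```

In both cases the gap is filled starting from the last valid observation, so the position just before the value 11 stays missing.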

diff --git a/pandas/core/common.py b/pandas/core/common.py
@@ -1244,6 +1244,153 @@ def backfill_2d(values, limit=None, mask=None):
     return values
 
 
+def _clean_interp_method(method, order=None, **kwargs):
+    valid = ['linear', 'time', 'values', 'nearest', 'zero', 'slinear',
+             'quadratic', 'cubic', 'barycentric', 'polynomial',
+             'krogh', 'piecewise_polynomial',
+             'pchip', 'spline']
+    if method in ('spline', 'polynomial') and order is None:
+        raise ValueError("You must specify the order of the spline or "
+                         "polynomial.")
+    if method not in valid:
+        raise ValueError("method must be one of {0}. "
+                         "Got '{1}' instead.".format(valid, method))
+    return method
+
+
+def interpolate_1d(xvalues, yvalues, method='linear', limit=None,
+                   fill_value=None, bounds_error=False, **kwargs):
+    """
+    Logic for the 1-d interpolation.  The result should be 1-d, inputs
+    xvalues and yvalues will each be 1-d arrays of the same length.
+
+    bounds_error is currently hardcoded to False since the non-scipy methods
+    don't take it as an argument.
+    """
+    # Treat the original, non-scipy methods first.
+
+    invalid = isnull(yvalues)
+    valid = ~invalid
+
+    valid_y = yvalues[valid]
+    valid_x = xvalues[valid]
+    new_x = xvalues[invalid]
+
+    if method == 'time':
+        if not getattr(xvalues, 'is_all_dates', None):
+        # if not issubclass(xvalues.dtype.type, np.datetime64):
+            raise ValueError('time-weighted interpolation only works '
+                             'on Series or DataFrames with a '
+                             'DatetimeIndex')
+        method = 'values'
+
+    def _interp_limit(invalid, limit):
+        """mask off values that won't be filled since they exceed the limit"""
+        all_nans = np.where(invalid)[0]
+        violate = [invalid[x:x + limit + 1] for x in all_nans]
+        violate = np.array([x.all() & (x.size > limit) for x in violate])
+        return all_nans[violate] + limit
+
+    xvalues = getattr(xvalues, 'values', xvalues)
+    yvalues = getattr(yvalues, 'values', yvalues)
+
+    if limit:
+        violate_limit = _interp_limit(invalid, limit)
+    if valid.any():
+        firstIndex = valid.argmax()
+        valid = valid[firstIndex:]
+        invalid = invalid[firstIndex:]
+        result = yvalues.copy()
+        if valid.all():
+            return yvalues
+    else:
+        # have to call np.array(xvalues) since xvalues could be an Index,
+        # which can't be mutated
+        result = np.empty_like(np.array(xvalues), dtype=np.float64)
+        result.fill(np.nan)
+        return result
+
+    if method in ['linear', 'time', 'values']:
+        if method in ('values', 'index'):
+            inds = np.asarray(xvalues)
+            # hack for DatetimeIndex, #1646
+            if issubclass(inds.dtype.type, np.datetime64):
+                inds = inds.view(pa.int64)
+
+            if inds.dtype == np.object_:
+                inds = lib.maybe_convert_objects(inds)
+        else:
+            inds = xvalues
+
+        inds = inds[firstIndex:]
+
+        result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid],
+                                                 yvalues[firstIndex:][valid])
+
+        if limit:
+            result[violate_limit] = np.nan
+        return result
+
+    sp_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
+                  'barycentric', 'krogh', 'spline', 'polynomial',
+                  'piecewise_polynomial', 'pchip']
+    if method in sp_methods:
+        new_x = new_x[firstIndex:]
+        xvalues = xvalues[firstIndex:]
+
+        result[firstIndex:][invalid] = _interpolate_scipy_wrapper(valid_x,
+            valid_y, new_x, method=method, fill_value=fill_value,
+            bounds_error=bounds_error, **kwargs)
+        if limit:
+            result[violate_limit] = np.nan
+        return result
+
+
+def _interpolate_scipy_wrapper(x, y, new_x, method, fill_value=None,
+                               bounds_error=False, order=None, **kwargs):
+    """
+    passed off to scipy.interpolate.interp1d. method is scipy's kind.
+    Returns an array interpolated at new_x.  Add any new methods to
+    the list in _clean_interp_method
+    """
+    try:
+        from scipy import interpolate
+    except ImportError:
+        raise ImportError('{0} interpolation requires Scipy'.format(method))
+
+    new_x = np.asarray(new_x)
+
+    # ignores some kwargs that could be passed along.
+    alt_methods = {
+        'barycentric': interpolate.barycentric_interpolate,
+        'krogh': interpolate.krogh_interpolate,
+        'piecewise_polynomial': interpolate.piecewise_polynomial_interpolate,
+        }
+
+    try:
+        alt_methods['pchip'] = interpolate.pchip_interpolate
+    except AttributeError:
+        if method == 'pchip':
+            raise ImportError("Your version of scipy does not support "
+                              "PCHIP interpolation.")
+
+    interp1d_methods = ['nearest', 'zero', 'slinear', 'quadratic', 'cubic',
+                        'polynomial']
+    if method in interp1d_methods:
+        if method == 'polynomial':
+            method = order
+        terp = interpolate.interp1d(x, y, kind=method, fill_value=fill_value,
+                                    bounds_error=bounds_error)
+        new_y = terp(new_x)
+    elif method == 'spline':
+        terp = interpolate.UnivariateSpline(x, y, k=order)
+        new_y = terp(new_x)
+    else:
+        method = alt_methods[method]
+        new_y = method(x, y, new_x)
+    return new_y
+
+
 def interpolate_2d(values, method='pad', axis=0, limit=None, fill_value=None):
     """ perform an actual interpolation of values, values will be make 2-d if needed
         fills inplace, returns the result """
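
Note on the ``limit`` handling in ``interpolate_1d`` above: the nested ``_interp_limit`` helper can be read as "for every NaN that starts a window of ``limit + 1`` consecutive NaNs, mark the last position of that window". A standalone sketch of that idea follows, using only NumPy. The function name ``limit_violations`` is hypothetical, and the explicit ``dtype=bool`` (so that an input without NaNs yields an empty result) is an addition made for this sketch rather than part of the commit.

```python
import numpy as np


def limit_violations(invalid, limit):
    """Return positions that must stay NaN because they lie more than
    `limit` steps into a run of consecutive missing values."""
    invalid = np.asarray(invalid, dtype=bool)
    nan_positions = np.where(invalid)[0]
    # Window of limit + 1 entries starting at each missing position.
    windows = [invalid[i:i + limit + 1] for i in nan_positions]
    # A full-length window of all-missing entries means its last entry
    # is beyond the limit.
    violate = np.array([w.size > limit and w.all() for w in windows],
                       dtype=bool)
    return nan_positions[violate] + limit


y = np.array([1, 3, np.nan, np.nan, np.nan, 11])
mask = np.isnan(y)
x = np.arange(len(y))

# Fill every gap linearly first, then blank out what exceeds the limit.
filled = y.copy()
filled[mask] = np.interp(x[mask], x[~mask], y[~mask])
filled[limit_violations(mask, 2)] = np.nan
print(filled)
```

Filling everything first and masking afterwards mirrors the order used in the diff, where ``result[violate_limit] = np.nan`` runs after the interpolation step.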