ENH: add pd.asof_merge
closes #1870
xref #2941

http://nbviewer.jupyter.org/gist/jreback/5f089d308750c89b2a7d7446b790c056
is a notebook of example usage and timings

Author: Jeff Reback <jeff@reback.net>

Closes #13358 from jreback/asof and squashes the following commits:

4592fa2 [Jeff Reback] TST: reorg tests/series/test_timeseries -> test_asof
jreback committed Jun 17, 2016
1 parent fca35fb commit 6d8c04c
Showing 30 changed files with 1,975 additions and 278 deletions.
3 changes: 3 additions & 0 deletions doc/source/api.rst
@@ -151,6 +151,8 @@ Data manipulations
cut
qcut
merge
merge_ordered
merge_asof
concat
get_dummies
factorize
@@ -943,6 +945,7 @@ Time series-related
:toctree: generated/

DataFrame.asfreq
DataFrame.asof
DataFrame.shift
DataFrame.first_valid_index
DataFrame.last_valid_index
166 changes: 129 additions & 37 deletions doc/source/merging.rst
Expand Up @@ -104,7 +104,7 @@ some configurable handling of "what to do with the other axes":
- ``ignore_index`` : boolean, default False. If True, do not use the index
values on the concatenation axis. The resulting axis will be labeled 0, ...,
n - 1. This is useful if you are concatenating objects where the
  concatenation axis does not have meaningful indexing information. Note
the index values on the other axes are still respected in the join.
- ``copy`` : boolean, default True. If False, do not copy data unnecessarily.

@@ -544,12 +544,12 @@ Here's a description of what each argument is for:
can be avoided are somewhat pathological but this option is provided
nonetheless.
- ``indicator``: Add a column to the output DataFrame called ``_merge``
  with information on the source of each row. ``_merge`` is Categorical-type
  and takes on a value of ``left_only`` for observations whose merge key
  only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
  merge key only appears in ``'right'`` DataFrame, and ``both`` if the
  observation's merge key is found in both.

.. versionadded:: 0.17.0


@@ -718,7 +718,7 @@ The merge indicator
   df2 = DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
   merge(df1, df2, on='col1', how='outer', indicator=True)
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.

.. ipython:: python
@@ -1055,34 +1055,6 @@ them together on their indexes. The same is true for ``Panel.join``.
labels=['left', 'right', 'right2'], vertical=False);
plt.close('all');
.. _merging.ordered_merge:

Merging Ordered Data
~~~~~~~~~~~~~~~~~~~~

New in v0.8.0 is the ordered_merge function for combining time series and other
ordered data. In particular it has an optional ``fill_method`` keyword to
fill/interpolate missing data:

.. ipython:: python

   left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})
   right = DataFrame({'k': ['K1', 'K2', 'K4'],
                      'rv': [1, 2, 3]})

   result = ordered_merge(left, right, fill_method='ffill', left_by='s')

.. ipython:: python
   :suppress:

   @savefig merging_ordered_merge.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=True);
   plt.close('all');
.. _merging.combine_first.update:

Merging together values within Series or DataFrame columns
@@ -1132,4 +1104,124 @@ values inplace:
@savefig merging_update.png
p.plot([df1_copy, df2], df1,
labels=['df1', 'df2'], vertical=False);
plt.close('all');
.. _merging.time_series:

Timeseries friendly merging
---------------------------

.. _merging.merge_ordered:

Merging Ordered Data
~~~~~~~~~~~~~~~~~~~~

The ``pd.merge_ordered()`` function allows combining time series and other
ordered data. In particular it has an optional ``fill_method`` keyword to
fill/interpolate missing data:

.. ipython:: python

   left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})
   right = DataFrame({'k': ['K1', 'K2', 'K4'],
                      'rv': [1, 2, 3]})

   result = pd.merge_ordered(left, right, fill_method='ffill', left_by='s')

.. ipython:: python
   :suppress:

   @savefig merging_ordered_merge.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=True);
   plt.close('all');
.. _merging.merge_asof:

Merging AsOf
~~~~~~~~~~~~

.. versionadded:: 0.18.2

``pd.merge_asof()`` is similar to an ordered left-join, except that we match
on the nearest key rather than equal keys.

For each row in the ``left`` DataFrame, we select the last row in the ``right``
DataFrame whose ``on`` key is less than or equal to the left's key. Both
DataFrames must be sorted by the key.
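The matching rule above can be sketched in a few lines of pure Python. This is only an illustration of the semantics (the ``asof_indexer`` helper is hypothetical; the actual pandas implementation is vectorized in Cython):

```python
from bisect import bisect_right

def asof_indexer(left_keys, right_keys):
    # For each left key, find the index of the last right key that is
    # less than or equal to it; -1 means "no match".  Both inputs must
    # be sorted ascending, mirroring merge_asof's requirement.
    return [bisect_right(right_keys, k) - 1 for k in left_keys]

asof_indexer([1, 5, 10], [1, 2, 3, 6, 7])  # [0, 2, 4]
```

Each left key thus picks up the most recent right row at or before it.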

Optionally, an asof merge can perform a group-wise merge: it matches the ``by`` key
exactly, in addition to the nearest match on the ``on`` key.
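Conceptually, a group-wise asof merge is equivalent to running an asof merge inside each ``by`` group separately and stitching the results back together. A rough sketch under that interpretation (``asof_by_sketch`` is a hypothetical helper, not pandas API; the real implementation does this in one vectorized pass):

```python
import pandas as pd

def asof_by_sketch(left, right, on, by):
    # Run merge_asof within each ``by`` group; the ``by`` column is
    # dropped from the right slice so only ``on`` drives the match.
    pieces = [
        pd.merge_asof(lgrp, right[right[by] == key].drop(columns=[by]), on=on)
        for key, lgrp in left.groupby(by, sort=False)
    ]
    return pd.concat(pieces, ignore_index=True)
```

Note that this sketch returns rows grouped by key, whereas ``merge_asof`` preserves the original left-row order.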

For example, we might have ``trades`` and ``quotes``, and we want to ``asof`` merge them.

.. ipython:: python

   trades = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.038',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048']),
       'ticker': ['MSFT', 'MSFT',
                  'GOOG', 'GOOG', 'AAPL'],
       'price': [51.95, 51.95,
                 720.77, 720.92, 98.00],
       'quantity': [75, 155,
                    100, 100, 100]},
       columns=['time', 'ticker', 'price', 'quantity'])

   quotes = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.023',
                               '20160525 13:30:00.030',
                               '20160525 13:30:00.041',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.049',
                               '20160525 13:30:00.072',
                               '20160525 13:30:00.075']),
       'ticker': ['GOOG', 'MSFT', 'MSFT',
                  'MSFT', 'GOOG', 'AAPL', 'GOOG',
                  'MSFT'],
       'bid': [720.50, 51.95, 51.97, 51.99,
               720.50, 97.99, 720.50, 52.01],
       'ask': [720.93, 51.96, 51.98, 52.00,
               720.93, 98.01, 720.88, 52.03]},
       columns=['time', 'ticker', 'bid', 'ask'])

.. ipython:: python

   trades
   quotes
By default we take the asof of the quotes.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker')
We only asof within ``2ms`` between the quote time and the trade time.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker',
                 tolerance=pd.Timedelta('2ms'))
We only asof within ``10ms`` between the quote time and the trade time, and we
exclude exact matches on time. Note that though we exclude the exact matches
(of the quotes), prior quotes DO propagate to that point in time.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker',
                 tolerance=pd.Timedelta('10ms'),
                 allow_exact_matches=False)
94 changes: 93 additions & 1 deletion doc/source/whatsnew/v0.18.2.txt
@@ -19,6 +19,97 @@ Highlights include:
New features
~~~~~~~~~~~~

.. _whatsnew_0182.enhancements.asof_merge:

``pd.merge_asof()`` for asof-style time-series joining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A long-requested feature has been added through the :func:`merge_asof` function, to
support asof-style joining of time series (:issue:`1870`). Full documentation is
:ref:`here <merging.merge_asof>`.

The :func:`merge_asof` function performs an asof merge, which is similar to a left-join
except that we match on the nearest key rather than equal keys.

.. ipython:: python

   left = pd.DataFrame({'a': [1, 5, 10],
                        'left_val': ['a', 'b', 'c']})
   right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
                         'right_val': [1, 2, 3, 6, 7]})

   left
   right

We typically want to match exactly when possible, and use the most
recent value otherwise.

.. ipython:: python

   pd.merge_asof(left, right, on='a')

We can also match rows ONLY with prior data, and not an exact match.

.. ipython:: python

   pd.merge_asof(left, right, on='a', allow_exact_matches=False)
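For a single sorted key and no ``by`` grouping, the default match can also be expressed with ``np.searchsorted`` (shown only to illustrate the semantics, not as a recommended usage):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# side='right' implements "last right key <= left key" (exact matches
# allowed); side='left' would give the strict "<" of
# allow_exact_matches=False, with -1 marking rows that have no match.
idx = np.searchsorted(right['a'].values, left['a'].values, side='right') - 1
matched = right['right_val'].values[idx]  # all idx >= 0 for these keys
```

Here ``matched`` reproduces the ``right_val`` column that ``pd.merge_asof(left, right, on='a')`` attaches.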


In a typical time-series example, we have ``trades`` and ``quotes`` and we want to ``asof-join`` them.
This also illustrates using the ``by`` parameter to group data before merging.

.. ipython:: python

   trades = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.038',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048']),
       'ticker': ['MSFT', 'MSFT',
                  'GOOG', 'GOOG', 'AAPL'],
       'price': [51.95, 51.95,
                 720.77, 720.92, 98.00],
       'quantity': [75, 155,
                    100, 100, 100]},
       columns=['time', 'ticker', 'price', 'quantity'])

   quotes = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.023',
                               '20160525 13:30:00.030',
                               '20160525 13:30:00.041',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.049',
                               '20160525 13:30:00.072',
                               '20160525 13:30:00.075']),
       'ticker': ['GOOG', 'MSFT', 'MSFT',
                  'MSFT', 'GOOG', 'AAPL', 'GOOG',
                  'MSFT'],
       'bid': [720.50, 51.95, 51.97, 51.99,
               720.50, 97.99, 720.50, 52.01],
       'ask': [720.93, 51.96, 51.98, 52.00,
               720.93, 98.01, 720.88, 52.03]},
       columns=['time', 'ticker', 'bid', 'ask'])

.. ipython:: python

   trades
   quotes

An asof merge joins on the ``on`` key, typically an ordered datetimelike field; in
this case we are also using a grouper in the ``by`` field. This is like a left-outer
join, except that forward filling happens automatically, taking the most recent
non-NaN value.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker')

This returns a merged DataFrame with the entries in the same order as the passed left
DataFrame (``trades`` in this case), with the fields of ``quotes`` merged in.

.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:

``pd.read_csv`` has improved support for duplicate column names
@@ -124,8 +215,8 @@ Other enhancements
idx.where([True, False, True])

- ``Categorical.astype()`` now accepts an optional boolean argument ``copy``, effective when dtype is categorical (:issue:`13209`)
- ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
- Consistent with the Python API, ``pd.read_csv()`` will now interpret ``+inf`` as positive infinity (:issue:`13274`)

- The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
- ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
- A ``union_categorical`` function has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`)
@@ -335,6 +426,7 @@ Deprecations
- ``compact_ints`` and ``use_unsigned`` have been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13320`)
- ``buffer_lines`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13360`)
- ``as_recarray`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13373`)
- top-level ``pd.ordered_merge()`` has been renamed to ``pd.merge_ordered()`` and the original name will be removed in a future version (:issue:`13358`)
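A rename like this typically keeps the old name working as a thin shim that warns before delegating to the new function. A sketch of that common pattern (not the exact pandas source):

```python
import warnings

import pandas as pd

def ordered_merge(left, right, **kwargs):
    # Deprecation shim: emit a FutureWarning, then delegate to the
    # renamed function so existing code keeps working for now.
    warnings.warn("pd.ordered_merge() is deprecated, "
                  "use pd.merge_ordered() instead",
                  FutureWarning, stacklevel=2)
    return pd.merge_ordered(left, right, **kwargs)
```

Callers see the warning once per call site but get identical results to ``pd.merge_ordered``.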

.. _whatsnew_0182.performance:

3 changes: 2 additions & 1 deletion pandas/__init__.py
@@ -43,7 +43,8 @@
from pandas.io.api import *
from pandas.computation.api import *

from pandas.tools.merge import merge, concat, ordered_merge
from pandas.tools.merge import (merge, concat, ordered_merge,
merge_ordered, merge_asof)
from pandas.tools.pivot import pivot_table, crosstab
from pandas.tools.plotting import scatter_matrix, plot_params
from pandas.tools.tile import cut, qcut