ENH: add pd.asof_merge
closes #1870
xref #2941

http://nbviewer.jupyter.org/gist/jreback/5f089d308750c89b2a7d7446b790c056
is a notebook of example usage and timings

Author: Jeff Reback <jeff@reback.net>

Closes #13358 from jreback/asof and squashes the following commits:

4592fa2 [Jeff Reback] TST: reorg tests/series/test_timeseries -> test_asof
jreback committed Jun 17, 2016
1 parent fca35fb commit 6d8c04c
Showing 30 changed files with 1,975 additions and 278 deletions.
3 changes: 3 additions & 0 deletions doc/source/api.rst
@@ -151,6 +151,8 @@ Data manipulations
cut
qcut
merge
merge_ordered
merge_asof
concat
get_dummies
factorize
@@ -943,6 +945,7 @@ Time series-related
:toctree: generated/

DataFrame.asfreq
DataFrame.asof
DataFrame.shift
DataFrame.first_valid_index
DataFrame.last_valid_index
166 changes: 129 additions & 37 deletions doc/source/merging.rst
Expand Up @@ -104,7 +104,7 @@ some configurable handling of "what to do with the other axes":
- ``ignore_index`` : boolean, default False. If True, do not use the index
values on the concatenation axis. The resulting axis will be labeled 0, ...,
n - 1. This is useful if you are concatenating objects where the
  concatenation axis does not have meaningful indexing information. Note
the index values on the other axes are still respected in the join.
- ``copy`` : boolean, default True. If False, do not copy data unnecessarily.

@@ -544,12 +544,12 @@ Here's a description of what each argument is for:
can be avoided are somewhat pathological but this option is provided
nonetheless.
- ``indicator``: Add a column to the output DataFrame called ``_merge``
  with information on the source of each row. ``_merge`` is Categorical-type
  and takes on a value of ``left_only`` for observations whose merge key
  only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
  merge key only appears in ``'right'`` DataFrame, and ``both`` if the
  observation's merge key is found in both.

.. versionadded:: 0.17.0


@@ -718,7 +718,7 @@ The merge indicator
   df2 = DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
   merge(df1, df2, on='col1', how='outer', indicator=True)
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.

.. ipython:: python
@@ -1055,34 +1055,6 @@ them together on their indexes. The same is true for ``Panel.join``.
labels=['left', 'right', 'right2'], vertical=False);
plt.close('all');
.. _merging.ordered_merge:

Merging Ordered Data
~~~~~~~~~~~~~~~~~~~~

New in v0.8.0 is the ordered_merge function for combining time series and other
ordered data. In particular it has an optional ``fill_method`` keyword to
fill/interpolate missing data:

.. ipython:: python

   left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})
   right = DataFrame({'k': ['K1', 'K2', 'K4'],
                      'rv': [1, 2, 3]})

   result = ordered_merge(left, right, fill_method='ffill', left_by='s')

.. ipython:: python
   :suppress:

   @savefig merging_ordered_merge.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=True);
   plt.close('all');
.. _merging.combine_first.update:

Merging together values within Series or DataFrame columns
@@ -1132,4 +1104,124 @@ values inplace:
@savefig merging_update.png
p.plot([df1_copy, df2], df1,
labels=['df1', 'df2'], vertical=False);
plt.close('all');
.. _merging.time_series:

Timeseries friendly merging
---------------------------

.. _merging.merge_ordered:

Merging Ordered Data
~~~~~~~~~~~~~~~~~~~~

The ``pd.merge_ordered()`` function allows combining time series and other
ordered data. In particular it has an optional ``fill_method`` keyword to
fill/interpolate missing data:

.. ipython:: python

   left = DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                     'lv': [1, 2, 3, 4],
                     's': ['a', 'b', 'c', 'd']})
   right = DataFrame({'k': ['K1', 'K2', 'K4'],
                      'rv': [1, 2, 3]})

   result = pd.merge_ordered(left, right, fill_method='ffill', left_by='s')

.. ipython:: python
   :suppress:

   @savefig merging_ordered_merge.png
   p.plot([left, right], result,
          labels=['left', 'right'], vertical=True);
   plt.close('all');
.. _merging.merge_asof:

Merging AsOf
~~~~~~~~~~~~

.. versionadded:: 0.18.2

``pd.merge_asof()`` is similar to an ordered left-join, except that we match
on the nearest key rather than equal keys.

For each row in the ``left`` DataFrame, we select the last row in the ``right``
DataFrame whose ``on`` key is less than or equal to the left's key. Both
DataFrames must be sorted by the key.
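The matching rule above can be sketched in a few lines of pure Python. This is only an illustration of the semantics (the ``asof_indexer`` helper is hypothetical; the actual pandas implementation is vectorized in Cython):

```python
from bisect import bisect_right

def asof_indexer(left_keys, right_keys):
    # For each left key, find the index of the last right key that is
    # less than or equal to it; -1 means "no match".  Both inputs must
    # be sorted ascending, mirroring merge_asof's requirement.
    return [bisect_right(right_keys, k) - 1 for k in left_keys]

asof_indexer([1, 5, 10], [1, 2, 3, 6, 7])  # [0, 2, 4]
```

Each left key thus picks up the most recent right row at or before it.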

Optionally, an asof merge can perform a group-wise merge: it matches the ``by`` key
exactly, in addition to the nearest match on the ``on`` key.
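Conceptually, a group-wise asof merge is equivalent to running an asof merge inside each ``by`` group separately and stitching the results back together. A rough sketch under that interpretation (``asof_by_sketch`` is a hypothetical helper, not pandas API; the real implementation does this in one vectorized pass):

```python
import pandas as pd

def asof_by_sketch(left, right, on, by):
    # Run merge_asof within each ``by`` group; the ``by`` column is
    # dropped from the right slice so only ``on`` drives the match.
    pieces = [
        pd.merge_asof(lgrp, right[right[by] == key].drop(columns=[by]), on=on)
        for key, lgrp in left.groupby(by, sort=False)
    ]
    return pd.concat(pieces, ignore_index=True)
```

Note that this sketch returns rows grouped by key, whereas ``merge_asof`` preserves the original left-row order.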

For example, we might have ``trades`` and ``quotes``, and we want to ``asof`` merge them.

.. ipython:: python

   trades = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.038',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048']),
       'ticker': ['MSFT', 'MSFT',
                  'GOOG', 'GOOG', 'AAPL'],
       'price': [51.95, 51.95,
                 720.77, 720.92, 98.00],
       'quantity': [75, 155,
                    100, 100, 100]},
       columns=['time', 'ticker', 'price', 'quantity'])

   quotes = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.023',
                               '20160525 13:30:00.030',
                               '20160525 13:30:00.041',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.049',
                               '20160525 13:30:00.072',
                               '20160525 13:30:00.075']),
       'ticker': ['GOOG', 'MSFT', 'MSFT',
                  'MSFT', 'GOOG', 'AAPL', 'GOOG',
                  'MSFT'],
       'bid': [720.50, 51.95, 51.97, 51.99,
               720.50, 97.99, 720.50, 52.01],
       'ask': [720.93, 51.96, 51.98, 52.00,
               720.93, 98.01, 720.88, 52.03]},
       columns=['time', 'ticker', 'bid', 'ask'])

.. ipython:: python

   trades
   quotes
By default we take the asof of the quotes.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker')
We only asof within ``2ms`` between the quote time and the trade time.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker',
                 tolerance=pd.Timedelta('2ms'))
We only asof within ``10ms`` between the quote time and the trade time, and we
exclude exact matches on time. Note that though we exclude the exact matches
(of the quotes), prior quotes DO propagate to that point in time.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker',
                 tolerance=pd.Timedelta('10ms'),
                 allow_exact_matches=False)
94 changes: 93 additions & 1 deletion doc/source/whatsnew/v0.18.2.txt
@@ -19,6 +19,97 @@ Highlights include:
New features
~~~~~~~~~~~~

.. _whatsnew_0182.enhancements.asof_merge:

``pd.merge_asof()`` for asof-style time-series joining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A long-requested feature has been added through the :func:`merge_asof` function, to
support asof-style joining of time series (:issue:`1870`). Full documentation is
:ref:`here <merging.merge_asof>`.

The :func:`merge_asof` function performs an asof merge, which is similar to a left-join
except that we match on the nearest key rather than equal keys.

.. ipython:: python

   left = pd.DataFrame({'a': [1, 5, 10],
                        'left_val': ['a', 'b', 'c']})
   right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
                         'right_val': [1, 2, 3, 6, 7]})

   left
   right

We typically want to match exactly when possible, and use the most
recent value otherwise.

.. ipython:: python

   pd.merge_asof(left, right, on='a')

We can also match rows ONLY with prior data, and not an exact match.

.. ipython:: python

   pd.merge_asof(left, right, on='a', allow_exact_matches=False)
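For a single sorted key and no ``by`` grouping, the default match can also be expressed with ``np.searchsorted`` (shown only to illustrate the semantics, not as a recommended usage):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# side='right' implements "last right key <= left key" (exact matches
# allowed); side='left' would give the strict "<" of
# allow_exact_matches=False, with -1 marking rows that have no match.
idx = np.searchsorted(right['a'].values, left['a'].values, side='right') - 1
matched = right['right_val'].values[idx]  # all idx >= 0 for these keys
```

Here ``matched`` reproduces the ``right_val`` column that ``pd.merge_asof(left, right, on='a')`` attaches.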


In a typical time-series example, we have ``trades`` and ``quotes`` and we want to ``asof-join`` them.
This also illustrates using the ``by`` parameter to group data before merging.

.. ipython:: python

   trades = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.038',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.048']),
       'ticker': ['MSFT', 'MSFT',
                  'GOOG', 'GOOG', 'AAPL'],
       'price': [51.95, 51.95,
                 720.77, 720.92, 98.00],
       'quantity': [75, 155,
                    100, 100, 100]},
       columns=['time', 'ticker', 'price', 'quantity'])

   quotes = pd.DataFrame({
       'time': pd.to_datetime(['20160525 13:30:00.023',
                               '20160525 13:30:00.023',
                               '20160525 13:30:00.030',
                               '20160525 13:30:00.041',
                               '20160525 13:30:00.048',
                               '20160525 13:30:00.049',
                               '20160525 13:30:00.072',
                               '20160525 13:30:00.075']),
       'ticker': ['GOOG', 'MSFT', 'MSFT',
                  'MSFT', 'GOOG', 'AAPL', 'GOOG',
                  'MSFT'],
       'bid': [720.50, 51.95, 51.97, 51.99,
               720.50, 97.99, 720.50, 52.01],
       'ask': [720.93, 51.96, 51.98, 52.00,
               720.93, 98.01, 720.88, 52.03]},
       columns=['time', 'ticker', 'bid', 'ask'])

.. ipython:: python

   trades
   quotes

An asof merge joins on the ``on`` key, typically an ordered datetimelike field; in
this case we are also using a grouper in the ``by`` field. This is like a left-outer
join, except that forward filling happens automatically, taking the most recent
non-NaN value.

.. ipython:: python

   pd.merge_asof(trades, quotes,
                 on='time',
                 by='ticker')

This returns a merged DataFrame with the entries in the same order as the passed left
DataFrame (``trades`` in this case), with the fields of ``quotes`` merged in.

.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:

``pd.read_csv`` has improved support for duplicate column names
@@ -124,8 +215,8 @@ Other enhancements
idx.where([True, False, True])

- ``Categorical.astype()`` now accepts an optional boolean argument ``copy``, effective when dtype is categorical (:issue:`13209`)
- ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
- Consistent with the Python API, ``pd.read_csv()`` will now interpret ``+inf`` as positive infinity (:issue:`13274`)

- The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
- ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
- A ``union_categorical`` function has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`)
@@ -335,6 +426,7 @@ Deprecations
- ``compact_ints`` and ``use_unsigned`` have been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13320`)
- ``buffer_lines`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13360`)
- ``as_recarray`` has been deprecated in ``pd.read_csv()`` and will be removed in a future version (:issue:`13373`)
- top-level ``pd.ordered_merge()`` has been renamed to ``pd.merge_ordered()`` and the original name will be removed in a future version (:issue:`13358`)
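A rename like this typically keeps the old name working as a thin shim that warns before delegating to the new function. A sketch of that common pattern (not the exact pandas source):

```python
import warnings

import pandas as pd

def ordered_merge(left, right, **kwargs):
    # Deprecation shim: emit a FutureWarning, then delegate to the
    # renamed function so existing code keeps working for now.
    warnings.warn("pd.ordered_merge() is deprecated, "
                  "use pd.merge_ordered() instead",
                  FutureWarning, stacklevel=2)
    return pd.merge_ordered(left, right, **kwargs)
```

Callers see the warning once per call site but get identical results to ``pd.merge_ordered``.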

.. _whatsnew_0182.performance:

3 changes: 2 additions & 1 deletion pandas/__init__.py
@@ -43,7 +43,8 @@
from pandas.io.api import *
from pandas.computation.api import *

from pandas.tools.merge import merge, concat, ordered_merge
from pandas.tools.merge import (merge, concat, ordered_merge,
merge_ordered, merge_asof)
from pandas.tools.pivot import pivot_table, crosstab
from pandas.tools.plotting import scatter_matrix, plot_params
from pandas.tools.tile import cut, qcut