Add "forward" and "nearest" direction to merge_asof() #14887

Closed
chrisaycock opened this Issue Dec 15, 2016 · 2 comments

Comments

Projects
None yet
2 participants
Contributor

chrisaycock commented Dec 15, 2016 edited

Currently pd.merge_asof() looks backwards, meaning that it gets the last row from the right table whose timestamp is less the left table's timestamp. It might be nice to look forwards, meaning get the first row in the right table whose timestamps is greater than the left table's timestamp. And we could have a nearest, meaning get the row from the right table whose timestamp is closest to the left table's timestamp regardless of direction.

I propose a new direction parameter whose default value is "backward", which looks for prior timestamps as pd.merge_asof() currently does. A value of "forward" will cause cause the function to look for subsequent timestamps. A value of "nearest" will look on both directions.

Here's a modified example from the docstring:

In [16]: left
Out[16]:
    a left_val
0   1        a
1   5        b
2  10        c

In [17]: right
Out[17]:
   a  right_val
0  1          1
1  2          2
2  3          3
3  6          6
4  7          7

In [18]: pd.merge_asof(left, right, on='a')
Out[18]:
    a left_val  right_val
0   1        a          1
1   5        b          3
2  10        c          7

In [19]: pd.merge_asof(left, right, on='a', direction='forward')
Out[19]:
    a left_val  right_val
0   1        a        1.0
1   5        b        6.0
2  10        c        NaN

In [20]: pd.merge_asof(left, right, on='a', direction='nearest')
Out[20]:
    a left_val  right_val
0   1        a          1
1   5        b          6
2  10        c          7
Contributor

jreback commented Dec 15, 2016 edited

this is fine.

also prob should add an example / expl in docs about how .merge_asof differs from .reindex(...., method='nearest', tolerance=...), which is effectively asof reindexing in both directions (and min of those).

so should have:

direction='backward'|'forward'|'both' (this is how we spell it for example on Series.interpolate for limit_direction), or maybe both -> nearest

jreback added this to the Next Major Release milestone Dec 15, 2016

Contributor

chrisaycock commented Dec 16, 2016

I like the idea of a nearest direction that minimizes the absolute distance between left and right timestamps. Might need a tie-break parameter for first or last entry though, like the side parameter in np.searchsorted().

As for the difference between pd.merge_asof() and .reindex(), the former is a join between two DataFrames. I suppose someone could reindex the right DataFrame using the left's index, and then concatenate the resulting table. But that's a lot of work, doesn't handle the by parameter for exact matches (pd.merge_asof() is a single scan), and requires the timestamp values to be indexes rather than ordinary columns.

chrisaycock changed the title from Add a "forward" direction to merge_asof() to Add "forward" and "nearest" direction to merge_asof() Jan 12, 2017

@jreback jreback modified the milestone: 0.20.0, Next Major Release Jan 13, 2017

jreback closed this in cc36009 Jan 18, 2017

@AnkurDedania AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

@AnkurDedania Christopher C. Aycock + AnkurDedania ENH: Added 'direction' parameter to merge_asof() (#14887)
closes #14887

Author: Christopher C. Aycock <christopher.aycock@twosigma.com>

Closes #15129 from chrisaycock/GH14887 and squashes the following commits:

da38483 [Christopher C. Aycock] Description of direction parameters is now a bullet-point list
879c9f0 [Christopher C. Aycock] Tweak prose to include statement on version and parameters
ce5caaa [Christopher C. Aycock] Split Cython functions according to direction
50431ad [Christopher C. Aycock] ENH: Added 'direction' parameter to merge_asof() (#14887)
66e0e20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment