ENH: Faster merge_asof() through a single-pass algo #13902

Closed

chrisaycock opened this issue Aug 4, 2016 · 4 comments

chrisaycock (Contributor) commented Aug 4, 2016

Out of curiosity, I took a crack at a single-pass merge_asof(). My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use PyObjectHashTable while scanning through the right DataFrame to cache the most recently found row for each "by" object.

I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like ['ticker', 'exchange']. Still investigating.
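(Editorial aside, not from the issue itself: a minimal Python sketch of the single-pass idea described above, with a plain dict standing in for PyObjectHashTable and hypothetical names of my own choosing:)

import numpy as np

def asof_join_by(left_on, right_on, left_by, right_by):
    # Backward asof join: for each left row, find the index of the latest
    # right row whose 'on' value is <= the left 'on' value and whose 'by'
    # key matches; -1 if there is no match. Both 'on' arrays must be
    # sorted ascending.
    latest = {}                                    # 'by' key -> most recent right row
    result = np.full(len(left_on), -1, dtype=np.int64)
    j = 0
    for i in range(len(left_on)):
        # Advance the right cursor, caching the last-seen row per 'by' key
        while j < len(right_on) and right_on[j] <= left_on[i]:
            latest[right_by[j]] = j
            j += 1
        result[i] = latest.get(left_by[i], -1)
    return result

Each side is scanned exactly once, which is what makes the per-key cache cheaper than factorizing up front.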

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash c4302949:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                               41.07ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.90ms
[ 50.00%] · For pandas commit hash 97de42ab:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              608.08ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           81.03ms
   before     after       ratio
  [97de42ab] [c4302949]
-   81.03ms    12.90ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  608.08ms    41.07ms      0.07  join_merge.merge_asof_by.time_merge_asof_by

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
jreback (Contributor) commented Aug 4, 2016

So you don't actually need Tempita here. You factorize things, and so only need to deal with int64s.
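(Editorial aside, not from the thread: pd.factorize is the standard primitive for that reduction; it maps arbitrary "by" values to dense integer codes that a single int64-typed routine can consume. Indicative output:)

In [1]: import pandas as pd

In [2]: codes, uniques = pd.factorize(pd.Series(['ibm', 'aapl', 'ibm', 'msft']))

In [3]: codes
Out[3]: array([0, 1, 0, 2])

In [4]: uniques
Out[4]: Index(['ibm', 'aapl', 'msft'], dtype='object')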

chrisaycock (Contributor) commented Aug 4, 2016

@jreback The single-pass nature of this is that I'm not doing the factorizing anymore. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation.

I've issued a PR with the sample code to show how I did it. As I describe at the top of the message there, do not merge it in its current state...
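(Editorial aside, not from the thread: the "timestamps are stored as integers" point is concrete, since a datetime64[ns] array can be reinterpreted as int64 nanoseconds since the epoch:)

In [1]: times = pd.to_datetime(['2016-08-04 09:30:00', '2016-08-04 09:30:01'])

In [2]: times.values.view('int64')
Out[2]: array([1470303000000000000, 1470303001000000000])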

jreback (Contributor) commented Aug 4, 2016

@chrisaycock you can use the groupby factorization (it's quite cheap to do this):

In [5]: df = pd.DataFrame({'A': pd.date_range('20130101', periods=3), 'B': list('aab'), 'C': range(3)})

In [6]: g = df.groupby(['A', 'B'])

In [7]: g.grouper.group_info
Out[7]: (array([0, 1, 2], dtype=int64), array([0, 2, 5], dtype=int64), 3)
chrisaycock (Contributor) commented Aug 4, 2016

Using the setup from the join_merge.merge_asof_by benchmark:

In [39]: %timeit pd.merge_asof(df1, df2, on='time', by='key')
10 loops, best of 3: 41.4 ms per loop

But the factorization takes far longer than that, and we haven't even gotten to the actual joining logic:

In [40]: %timeit df1.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 28.2 ms per loop

In [41]: %timeit df2.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 177 ms per loop

The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s aj[], then we need to pay attention to performance.)
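(Editorial aside: the actual setup lives in the pandas asv suite's join_merge module. As a rough stand-in only, not the real benchmark definition, two large frames sorted on 'time' with an object-typed 'key' would look like:)

import numpy as np
import pandas as pd

np.random.seed(0)
n = 100000
df1 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10**6, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value1': np.random.randn(n)})
df2 = pd.DataFrame({'time': np.sort(np.random.randint(0, 10**6, n)),
                    'key': np.random.choice(list('abcde'), n),
                    'value2': np.random.randn(n)})

pd.merge_asof(df1, df2, on='time', by='key')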

jreback added a commit that referenced this issue Aug 8, 2016

ENH: Faster merge_asof() performs a single pass when joining tables (#13902)

This version passes existing regression tests but is ultimately wrong
because it requires the "by" column to be a single object. A proper
version would handle int (and possibly float) columns through type
differentiation.

Author: Christopher C. Aycock <christopher.aycock@twosigma.com>

Closes #13903 from chrisaycock/master and squashes the following commits:

f0d0165 [Christopher C. Aycock] ENH: Faster merge_asof() performs a single pass when joining tables (#13902)

chrisaycock closed this Aug 8, 2016
