New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand type specializations and multiple "by" parameters in merge_asof() #13936

Closed
chrisaycock opened this Issue Aug 8, 2016 · 7 comments

Comments

Projects
None yet
2 participants
@chrisaycock
Contributor

chrisaycock commented Aug 8, 2016

pd.merge_asof() can take an integer or floating-point number in the on parameter, and it can take an integer or an object in the by parameter. Specifically, the user's types are promoted to int64_t, double, or object as needed. That means, for example, that an int32_t is permitted, but we'll have to create a copy of the user's column to promote to int64_t.

This brings a question of whether we should add type specializations for every integer and floating-point number for better performance.

A second issue to consider is that only one column is permitted in the by parameter. But the user may wish to match on both ticker symbol and stock exchange, for example. To allow for this, the implementation logic would need to allow arrays of objects/integers for our grouping.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Aug 8, 2016

Contributor

can you show some specific examples which fail and/or you want to work (in the top section). This will provide some examples / tests cases.

Contributor

jreback commented Aug 8, 2016

can you show some specific examples which fail and/or you want to work (in the top section). This will provide some examples / tests cases.

@jreback jreback added this to the Next Major Release milestone Aug 8, 2016

@chrisaycock

This comment has been minimized.

Show comment
Hide comment
@chrisaycock

chrisaycock Aug 8, 2016

Contributor

Performance

Using merge_asof_noby from the join_merge.py benchmark:

k = merge_asof_noby()
k.setup()

Runtime of the 64-bit case:

In [4]: %timeit k.time_merge_asof_noby()
100 loops, best of 3: 12.4 ms per loop

Now cast to 32-bit:

k.df1.time = np.int32(k.df1.time)
k.df2.time = np.int32(k.df2.time)

And re-run it:

In [7]: %timeit k.time_merge_asof_noby()
100 loops, best of 3: 18.6 ms per loop

The added overhead comes from converting the time column back to 64-bit.


Multi-By

Let's say I have these DataFrames:

trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.046',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.050']),
    'ticker': ['MSFT', 'MSFT',
               'GOOG', 'GOOG', 'AAPL'],
    'exch': ['ARCA', 'NSDQ', 'NSDQ', 'BATS', 'NSDQ'],
    'price': [51.95, 51.95,
              720.77, 720.92, 98.00],
    'quantity': [75, 155,
                 100, 100, 100]},
    columns=['time', 'ticker', 'exch',
             'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.030',
                            '20160525 13:30:00.041',
                            '20160525 13:30:00.045',
                            '20160525 13:30:00.049']),
    'ticker': ['GOOG', 'MSFT', 'MSFT',
               'MSFT', 'GOOG', 'AAPL'],
    'exch': ['BATS', 'NSDQ', 'ARCA', 'ARCA',
             'NSDQ', 'ARCA'],
    'bid': [720.51, 51.95, 51.97, 51.99,
            720.50, 97.99],
    'ask': [720.92, 51.96, 51.98, 52.00,
            720.93, 98.01]},
    columns=['time', 'ticker', 'exch', 'bid', 'ask'])

I want to join by both the ticker symbol and the stock exchange:

In [80]: trades
Out[80]: 
                     time ticker  exch   price  quantity
0 2016-05-25 13:30:00.023   MSFT  ARCA   51.95        75
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95       155
2 2016-05-25 13:30:00.046   GOOG  NSDQ  720.77       100
3 2016-05-25 13:30:00.048   GOOG  BATS  720.92       100
4 2016-05-25 13:30:00.050   AAPL  NSDQ   98.00       100

In [81]: quotes
Out[81]: 
                     time ticker  exch     bid     ask
0 2016-05-25 13:30:00.023   GOOG  BATS  720.51  720.92
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT  ARCA   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT  ARCA   51.99   52.00
4 2016-05-25 13:30:00.045   GOOG  NSDQ  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL  ARCA   97.99   98.01

In [82]: pd.merge_asof(trades, quotes, on='time', by=['ticker', 'exch'])
Out[82]: 
                     time ticker  exch   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT  ARCA   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95       155   51.95   51.96
2 2016-05-25 13:30:00.046   GOOG  NSDQ  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  BATS  720.92       100  720.51  720.92
4 2016-05-25 13:30:00.050   AAPL  NSDQ   98.00       100     NaN     NaN

I.e., the expected result is:

expected = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.046',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.050']),
    'ticker': ['MSFT', 'MSFT',
               'GOOG', 'GOOG', 'AAPL'],
    'exch': ['ARCA', 'NSDQ', 'NSDQ', 'BATS', 'NSDQ'],
    'price': [51.95, 51.95,
              720.77, 720.92, 98.00],
    'quantity': [75, 155,
                 100, 100, 100],
    'bid': [np.nan, 51.95, 720.50, 720.51, np.nan],
    'ask': [np.nan, 51.96, 720.93, 720.92, np.nan]},
    columns=['time', 'ticker', 'exch',
             'price', 'quantity', 'bid', 'ask'])

For the by parameter to take an array of column names, the hashing solution must be able to handle more than one object.


These two requests are of low priority right now, so I personally won't be able to get to them right away.

Contributor

chrisaycock commented Aug 8, 2016

Performance

Using merge_asof_noby from the join_merge.py benchmark:

k = merge_asof_noby()
k.setup()

Runtime of the 64-bit case:

In [4]: %timeit k.time_merge_asof_noby()
100 loops, best of 3: 12.4 ms per loop

Now cast to 32-bit:

k.df1.time = np.int32(k.df1.time)
k.df2.time = np.int32(k.df2.time)

And re-run it:

In [7]: %timeit k.time_merge_asof_noby()
100 loops, best of 3: 18.6 ms per loop

The added overhead comes from converting the time column back to 64-bit.


Multi-By

Let's say I have these DataFrames:

trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.046',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.050']),
    'ticker': ['MSFT', 'MSFT',
               'GOOG', 'GOOG', 'AAPL'],
    'exch': ['ARCA', 'NSDQ', 'NSDQ', 'BATS', 'NSDQ'],
    'price': [51.95, 51.95,
              720.77, 720.92, 98.00],
    'quantity': [75, 155,
                 100, 100, 100]},
    columns=['time', 'ticker', 'exch',
             'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.030',
                            '20160525 13:30:00.041',
                            '20160525 13:30:00.045',
                            '20160525 13:30:00.049']),
    'ticker': ['GOOG', 'MSFT', 'MSFT',
               'MSFT', 'GOOG', 'AAPL'],
    'exch': ['BATS', 'NSDQ', 'ARCA', 'ARCA',
             'NSDQ', 'ARCA'],
    'bid': [720.51, 51.95, 51.97, 51.99,
            720.50, 97.99],
    'ask': [720.92, 51.96, 51.98, 52.00,
            720.93, 98.01]},
    columns=['time', 'ticker', 'exch', 'bid', 'ask'])

I want to join by both the ticker symbol and the stock exchange:

In [80]: trades
Out[80]: 
                     time ticker  exch   price  quantity
0 2016-05-25 13:30:00.023   MSFT  ARCA   51.95        75
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95       155
2 2016-05-25 13:30:00.046   GOOG  NSDQ  720.77       100
3 2016-05-25 13:30:00.048   GOOG  BATS  720.92       100
4 2016-05-25 13:30:00.050   AAPL  NSDQ   98.00       100

In [81]: quotes
Out[81]: 
                     time ticker  exch     bid     ask
0 2016-05-25 13:30:00.023   GOOG  BATS  720.51  720.92
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT  ARCA   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT  ARCA   51.99   52.00
4 2016-05-25 13:30:00.045   GOOG  NSDQ  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL  ARCA   97.99   98.01

In [82]: pd.merge_asof(trades, quotes, on='time', by=['ticker', 'exch'])
Out[82]: 
                     time ticker  exch   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT  ARCA   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.023   MSFT  NSDQ   51.95       155   51.95   51.96
2 2016-05-25 13:30:00.046   GOOG  NSDQ  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  BATS  720.92       100  720.51  720.92
4 2016-05-25 13:30:00.050   AAPL  NSDQ   98.00       100     NaN     NaN

I.e., the expected result is:

expected = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                            '20160525 13:30:00.023',
                            '20160525 13:30:00.046',
                            '20160525 13:30:00.048',
                            '20160525 13:30:00.050']),
    'ticker': ['MSFT', 'MSFT',
               'GOOG', 'GOOG', 'AAPL'],
    'exch': ['ARCA', 'NSDQ', 'NSDQ', 'BATS', 'NSDQ'],
    'price': [51.95, 51.95,
              720.77, 720.92, 98.00],
    'quantity': [75, 155,
                 100, 100, 100],
    'bid': [np.nan, 51.95, 720.50, 720.51, np.nan],
    'ask': [np.nan, 51.96, 720.93, 720.92, np.nan]},
    columns=['time', 'ticker', 'exch',
             'price', 'quantity', 'bid', 'ask'])

For the by parameter to take an array of column names, the hashing solution must be able to handle more than one object.


These two requests are of low priority right now, so I personally won't be able to get to them right away.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Aug 8, 2016

Contributor

does the 'by' issue work? (and just is somewhat slower)?

Contributor

jreback commented Aug 8, 2016

does the 'by' issue work? (and just is somewhat slower)?

@chrisaycock

This comment has been minimized.

Show comment
Hide comment
@chrisaycock

chrisaycock Aug 8, 2016

Contributor

@jreback No, the by parameter is limited to a scalar right now. Can the PyObjectHashTable take tuples?

Contributor

chrisaycock commented Aug 8, 2016

@jreback No, the by parameter is limited to a scalar right now. Can the PyObjectHashTable take tuples?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Aug 8, 2016

Contributor

yes it can take tuples. normally that is not a good idea, but might work here.

mainly I would like to raise if there is something that is non-sensical (ATM). If it works but is slow that is fine as well.

Contributor

jreback commented Aug 8, 2016

yes it can take tuples. normally that is not a good idea, but might work here.

mainly I would like to raise if there is something that is non-sensical (ATM). If it works but is slow that is fine as well.

@chrisaycock

This comment has been minimized.

Show comment
Hide comment
@chrisaycock

chrisaycock Aug 8, 2016

Contributor

It raises an error now, yes.

In [2]: pd.merge_asof(trades, quotes, on='time', by=['ticker', 'exch'])
...

MergeError: can only asof by a single key
Contributor

chrisaycock commented Aug 8, 2016

It raises an error now, yes.

In [2]: pd.merge_asof(trades, quotes, on='time', by=['ticker', 'exch'])
...

MergeError: can only asof by a single key
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Aug 8, 2016

Contributor

@chrisaycock perfect. Then good to go for now.

Contributor

jreback commented Aug 8, 2016

@chrisaycock perfect. Then good to go for now.

@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 15, 2016

@jreback jreback closed this in e7df751 Dec 16, 2016

ischurov added a commit to ischurov/pandas that referenced this issue Dec 19, 2016

ENH: merge_asof() has type specializations and can take multiple 'by'…
… parameters (#13936)

closes #13936

Author: Christopher C. Aycock <christopher.aycock@twosigma.com>

Closes #14783 from chrisaycock/GH13936 and squashes the following commits:

ffcf0c2 [Christopher C. Aycock] Added test to reject float16; fixed typos
1f208a8 [Christopher C. Aycock] Use tuple representation instead of strings
77eb47b [Christopher C. Aycock] Merge master branch into GH13936
89256f0 [Christopher C. Aycock] Test 8-bit integers and raise error on 16-bit floats; add comments
0ad1687 [Christopher C. Aycock] Fixed whatsnew
2bce3cc [Christopher C. Aycock] Revert dict back to PyObjectHashTable in response to code review
fafbb02 [Christopher C. Aycock] Updated benchmarks to reflect new ASV setup
5eeb7d9 [Christopher C. Aycock] Merge master into GH13936
c33c4cb [Christopher C. Aycock] Merge branch 'master' into GH13936
46cc309 [Christopher C. Aycock] Update documentation
f01142c [Christopher C. Aycock] Merge master branch
75157fc [Christopher C. Aycock] merge_asof() has type specializations and can take multiple 'by' parameters (#13936)

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Dec 24, 2016

ENH: merge_asof() has type specializations and can take multiple 'by'…
… parameters (#13936)

closes #13936

Author: Christopher C. Aycock <christopher.aycock@twosigma.com>

Closes #14783 from chrisaycock/GH13936 and squashes the following commits:

ffcf0c2 [Christopher C. Aycock] Added test to reject float16; fixed typos
1f208a8 [Christopher C. Aycock] Use tuple representation instead of strings
77eb47b [Christopher C. Aycock] Merge master branch into GH13936
89256f0 [Christopher C. Aycock] Test 8-bit integers and raise error on 16-bit floats; add comments
0ad1687 [Christopher C. Aycock] Fixed whatsnew
2bce3cc [Christopher C. Aycock] Revert dict back to PyObjectHashTable in response to code review
fafbb02 [Christopher C. Aycock] Updated benchmarks to reflect new ASV setup
5eeb7d9 [Christopher C. Aycock] Merge master into GH13936
c33c4cb [Christopher C. Aycock] Merge branch 'master' into GH13936
46cc309 [Christopher C. Aycock] Update documentation
f01142c [Christopher C. Aycock] Merge master branch
75157fc [Christopher C. Aycock] merge_asof() has type specializations and can take multiple 'by' parameters (#13936)

(cherry picked from commit e7df751)

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jan 3, 2017

Merge tag 'v0.19.2' into releases
Version 0.19.2

* tag 'v0.19.2': (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936)
  [Backport #14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (#14886)
  Fixed KDE Plot to drop the missing values (#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (#14253) (#14531)
  TST: correct url for test file on s3 (xref #14587)
  ...

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jan 3, 2017

Merge branch 'releases' into debian
* releases: (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936)
  [Backport #14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (#14886)
  Fixed KDE Plot to drop the missing values (#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (#14253) (#14531)
  TST: correct url for test file on s3 (xref #14587)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment