Serious performance regression in DataFrame construction with monthly DatetimeIndex #6479

Closed
qwhelan opened this issue Feb 25, 2014 · 15 comments · Fixed by #6481
Labels: Frequency (DateOffsets), Performance (Memory or execution speed performance)

@qwhelan
Contributor

qwhelan commented Feb 25, 2014

Hi,

After upgrading from v0.12.0 to v0.13.1, I noticed about a 100% slowdown on a pandas-heavy project. I've just started looking, but I've come up with a test case that shows a time-complexity change from O(1) to O(n) (~240x slowdown for my inputs).

Here's the comparison for v0.12.0 (y-axis is milliseconds):

[Plot: perf_12]

And the comparison for v0.13.1:

[Plot: perf_131]

The test code (I'll convert this to vbench later):

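import numpy as np
from pandas import DataFrame, DatetimeIndex

rows = 1000
columns = 10
# Monthly index, built with the DatetimeIndex(start=..., periods=...) form available in v0.12/v0.13
data = DataFrame(np.random.random((rows, columns)), index=DatetimeIndex(start='1/1/1900', periods=rows, freq='M'))

# Split the frame into a dict of Series that all share the same index
d = {}

for col in data:
    d[col] = data[col]

# IPython magic: time rebuilding the frame from the dict
%timeit DataFrame(d)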

Daily indices don't appear to be affected, though I suspect other frequencies are impacted. I'm seeing similar regressions in v0.13.0.
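
One way to check the O(1) versus O(n) claim on any installed version is to time the same construction at several index lengths. The sketch below is not from the thread: the helper name time_frame_ctor is made up for illustration, and it uses pd.date_range with offset objects rather than the DatetimeIndex(start=...) constructor so that it also runs on recent pandas releases.

```python
import timeit

import numpy as np
import pandas as pd


def time_frame_ctor(rows, freq, columns=10, repeats=5):
    """Best-of-`repeats` time in seconds to build a DataFrame from a dict of Series.

    Hypothetical helper: all Series share one DatetimeIndex with the given freq.
    """
    index = pd.date_range(start="1900-01-01", periods=rows, freq=freq)
    data = pd.DataFrame(np.random.random((rows, columns)), index=index)
    d = {col: data[col] for col in data}
    return min(timeit.repeat(lambda: pd.DataFrame(d), number=1, repeat=repeats))


if __name__ == "__main__":
    # Compare how the timing grows with the index length for each frequency.
    for rows in (250, 500, 1000, 2000):
        daily = time_frame_ctor(rows, pd.offsets.Day())
        monthly = time_frame_ctor(rows, pd.offsets.MonthEnd())
        print(f"rows={rows:5d}  daily={daily * 1e3:8.3f} ms  monthly={monthly * 1e3:8.3f} ms")
```

If the monthly column grows with rows while the daily column stays roughly flat, that matches the behaviour reported above for v0.13.1.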

@qwhelan
Contributor Author

qwhelan commented Feb 25, 2014

I've identified the cause:

Using 'D' frequency:

Function: _fast_union at line 1033
Total time: 0.000723 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1033                                               def _fast_union(self, other):
  1034         9           19      2.1      2.6          if len(other) == 0:
  1035                                                       return self.view(type(self))
  1036                                           
  1037         9           11      1.2      1.5          if len(self) == 0:
  1038                                                       return other.view(type(self))
  1039                                           
  1040                                                   # to make our life easier, "sort" the two ranges
  1041         9          261     29.0     36.1          if self[0] <= other[0]:
  1042         9           11      1.2      1.5              left, right = self, other
  1043                                                   else:
  1044                                                       left, right = other, self
  1045                                           
  1046         9          245     27.2     33.9          left_start, left_end = left[0], left[-1]
  1047         9          121     13.4     16.7          right_end = right[-1]
  1048                                           
  1049         9           34      3.8      4.7          if not self.offset._should_cache():
  1050                                                       # concatenate dates
  1051         9           12      1.3      1.7              if left_end < right_end:
  1052                                                           loc = right.searchsorted(left_end, side='right')
  1053                                                           right_chunk = right.values[loc:]
  1054                                                           dates = com._concat_compat((left.values, right_chunk))
  1055                                                           return self._view_like(dates)
  1056                                                       else:
  1057         9            9      1.0      1.2                  return left
  1058                                                   else:
  1059                                                       return type(self)(start=left_start,
  1060                                                                         end=max(left_end, right_end),
  1061                                                                         freq=left.offset)

And with 'M':

Function: _fast_union at line 1033
Total time: 0.851928 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1033                                               def _fast_union(self, other):
  1034         9           33      3.7      0.0          if len(other) == 0:
  1035                                                       return self.view(type(self))
  1036                                           
  1037         9           11      1.2      0.0          if len(self) == 0:
  1038                                                       return other.view(type(self))
  1039                                           
  1040                                                   # to make our life easier, "sort" the two ranges
  1041         9          323     35.9      0.0          if self[0] <= other[0]:
  1042         9           12      1.3      0.0              left, right = self, other
  1043                                                   else:
  1044                                                       left, right = other, self
  1045                                           
  1046         9          253     28.1      0.0          left_start, left_end = left[0], left[-1]
  1047         9          139     15.4      0.0          right_end = right[-1]
  1048                                           
  1049         9           46      5.1      0.0          if not self.offset._should_cache():
  1050                                                       # concatenate dates
  1051                                                       if left_end < right_end:
  1052                                                           loc = right.searchsorted(left_end, side='right')
  1053                                                           right_chunk = right.values[loc:]
  1054                                                           dates = com._concat_compat((left.values, right_chunk))
  1055                                                           return self._view_like(dates)
  1056                                                       else:
  1057                                                           return left
  1058                                                   else:
  1059         9           13      1.4      0.0              return type(self)(start=left_start,
  1060         9           38      4.2      0.0                                end=max(left_end, right_end),
  1061         9       851060  94562.2     99.9                                freq=left.offset)

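With 'M', self.offset._should_cache() returns True, so _fast_union skips the concatenation branch and rebuilds the entire index through the constructor call at lines 1059-1061, which accounts for 99.9% of the time. This suggests the regression is due to a change in either Offset.isAnchored() or Offset._cacheable.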
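
To make the difference between the two branches concrete, here is a simplified sketch written against the public pandas API. It is not the pandas implementation: union_by_concat and union_by_regenerate are hypothetical names, and both assume left starts no later than right and that the two ranges overlap or touch.

```python
import pandas as pd


def union_by_concat(left, right):
    """Mirror of the uncached branch: reuse `left`, append only the tail of `right`."""
    if left[-1] >= right[-1]:
        return left  # `right` adds nothing new, so no per-element work is done
    loc = right.searchsorted(left[-1], side="right")
    return left.append(right[loc:])


def union_by_regenerate(left, right, freq):
    """Mirror of the cached branch: rebuild the whole range from its endpoints."""
    return pd.date_range(start=left[0], end=max(left[-1], right[-1]), freq=freq)


freq = pd.offsets.MonthEnd()
idx = pd.date_range(start="1900-01-01", periods=1000, freq=freq)

# Identical inputs, as in the DataFrame(d) case where every Series shares one index.
same_a = union_by_concat(idx, idx)           # returns `idx` itself
same_b = union_by_regenerate(idx, idx, freq)  # regenerates all 1000 dates
```

Both strategies produce the same dates, but the second one does work proportional to the length of the index on every call unless the generated range is served from a cache.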

@jreback
Contributor

jreback commented Feb 25, 2014

Yep, it seems the offsets had the cacheable class reversed with the offset.
Easy fix.
I'll put up a PR, but I would appreciate it if you could add some more vbenches so that this can be caught if things change again.

@qwhelan
Contributor Author

qwhelan commented Feb 25, 2014

Sure, I'll take a look tonight.

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Just for reference, git bisect identified 25cfcaf as the offending commit.
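
For anyone repeating the bisect, a timing check can be automated with git bisect run, which treats exit code 0 as good and 1 as bad. The script below is only a sketch: the function names and the 50 ms threshold are assumptions (picked to sit between the roughly 1 ms and 250 ms timings seen in this thread) and would need tuning per machine.

```python
import sys
import timeit

import numpy as np
import pandas as pd

# Assumed cut-off between "good" and "bad" timings; tune for your machine.
THRESHOLD_SECONDS = 0.05


def frame_ctor_seconds(rows=1000, columns=10, repeats=3):
    """Best-of-`repeats` time to rebuild a monthly-indexed frame from a dict of Series."""
    index = pd.date_range(start="1900-01-01", periods=rows, freq=pd.offsets.MonthEnd())
    data = pd.DataFrame(np.random.random((rows, columns)), index=index)
    d = {col: data[col] for col in data}
    return min(timeit.repeat(lambda: pd.DataFrame(d), number=1, repeat=repeats))


def exit_code(seconds, threshold=THRESHOLD_SECONDS):
    """Return 0 (good commit) or 1 (bad commit) for use with `git bisect run`."""
    return 1 if seconds > threshold else 0


def main():
    # Hypothetical entry point: call this at the bottom of the script file.
    sys.exit(exit_code(frame_ctor_seconds()))
```

Saved under a name such as check_perf.py (hypothetical) with a main() call at the end, it could be driven by git bisect run python check_perf.py after marking one good and one bad revision, assuming pandas is rebuilt at each step.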

@jreback
Contributor

jreback commented Feb 26, 2014

yep...thanks

@jreback
Contributor

jreback commented Feb 26, 2014

I added a couple of vbenches...but what I think we really need is from my suggestion at the end of this issue: #6450

Essentially, make a small generate_vbench_frequency.py, a code generator that creates the frequency.py vbench file. Take out the 2 vbenches I added and instead have it generate vbenches like these for all frequencies (or a lot of them). Then you run the generator to create the actual vbench file. It's a bit simpler this way, I think.
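
A generator along these lines could look like the sketch below. It is an illustration, not the script that ended up in the PR: the offset list, the template, and the generated variable names are assumptions, and the emitted file presumes vbench's Benchmark(statement, setup, name=...) call style used by the existing vbench files.

```python
# Hypothetical sketch of generate_vbench_frequency.py; layout and names are assumptions.
OFFSETS = ["Day", "BDay", "Week", "MonthBegin", "MonthEnd", "QuarterBegin", "QuarterEnd"]
MULTIPLES = [1, 2]

HEADER = '''from vbench.benchmark import Benchmark

common_setup = """from pandas import DataFrame, date_range
from pandas.tseries.offsets import *
import numpy as np
"""
'''

TEMPLATE = '''
setup_{ident} = common_setup + """
idx = date_range(start='1/1/1900', periods=1000, freq={offset}({n}))
df = DataFrame(np.random.randn(len(idx), 10), index=idx)
d = dict((col, df[col]) for col in df.columns)
"""
frame_ctor_dtindex_{ident} = Benchmark("DataFrame(d)", setup_{ident},
                                       name="frame_ctor_dtindex_{offset}({n})")
'''


def generate_source(offsets=OFFSETS, multiples=MULTIPLES):
    """Return the text of a benchmark module with one benchmark per (offset, multiple)."""
    parts = [HEADER]
    for offset in offsets:
        for n in multiples:
            parts.append(TEMPLATE.format(ident=f"{offset}_{n}", offset=offset, n=n))
    return "".join(parts)


if __name__ == "__main__":
    # Write the generated benchmarks next to the generator.
    with open("frequency.py", "w") as fh:
        fh.write(generate_source())
```

Keeping the offsets in a single list means a new frequency only needs one extra entry to get benchmark coverage.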

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Yeah, that's what I was thinking. The test cases also aren't covering everything.

@jreback
Contributor

jreback commented Feb 26, 2014

So feel free to push to the PR (I think you might need to push to my branch). I'll see if I can fix the daily issue.

You can also put up a separate PR and I'll merge them.

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Some of the offsets aren't working just yet, but here's the comparison against v0.12.0:

Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_ctor_nested_dict_int64                 | 114.8001 | 116.6967 |   0.9837 |
frame_ctor_list_of_dict                      | 125.6680 | 122.8684 |   1.0228 |
frame_ctor_nested_dict                       |  94.0313 |  91.9131 |   1.0230 |
frame_ctor_dtindex_BMonthEnd                 | 291.1356 | 257.4974 |   1.1306 |
frame_ctor_dtindex_CDay                      |   1.9093 |   1.5590 |   1.2247 |
frame_ctor_dtindex_CustomBusinessDay         |   1.9133 |   1.5593 |   1.2270 |
frame_ctor_dtindex_BDay                      | 106.4280 |  86.5207 |   1.2301 |
frame_ctor_dtindex_BusinessDay               | 106.0473 |  83.0417 |   1.2770 |
frame_ctor_dtindex_Week                      |   1.3300 |   1.0080 |   1.3194 |
frame_ctor_dtindex_Day                       |   1.1443 |   0.8597 |   1.3311 |
frame_ctor_dtindex_Micro                     |   1.1423 |   0.8534 |   1.3386 |
frame_ctor_dtindex_daily                     |   1.1420 |   0.8516 |   1.3410 |
frame_ctor_dtindex_Second                    |   1.1457 |   0.8543 |   1.3410 |
frame_ctor_dtindex_Hour                      |   1.1434 |   0.8433 |   1.3559 |
frame_ctor_dtindex_Minute                    |   1.1466 |   0.8373 |   1.3694 |
frame_ctor_dtindex_Milli                     |   1.1760 |   0.8577 |   1.3712 |
frame_ctor_dtindex_QuarterBegin              | 239.4284 |   1.4216 | 168.4205 |
frame_ctor_dtindex_QuarterEnd                | 247.0164 |   1.4304 | 172.6961 |
frame_ctor_dtindex_BQuarterBegin             | 262.3610 |   1.4893 | 176.1614 |
frame_ctor_dtindex_MonthBegin                | 234.5270 |   1.2890 | 181.9491 |
frame_ctor_dtindex_BMonthBegin               | 257.3493 |   1.3250 | 194.2301 |
frame_ctor_dtindex_BQuarterEnd               | 288.2090 |   1.4470 | 199.1821 |
frame_ctor_dtindex_MonthEnd                  | 253.1923 |   1.2314 | 205.6213 |
frame_ctor_dtindex_monthly                   | 255.0759 |   1.2197 | 209.1352 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
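
In these tables the ratio column is head divided by base, so values well above 1 mean head is slower. As a rough aid for scanning such output, the sketch below parses a few rows copied from the table above and flags the regressed benchmarks. The function names and the 2.0 threshold are assumptions chosen for illustration.

```python
# Three rows copied verbatim from the comparison table above.
SAMPLE = """\
frame_ctor_nested_dict_int64                 | 114.8001 | 116.6967 |   0.9837 |
frame_ctor_dtindex_daily                     |   1.1420 |   0.8516 |   1.3410 |
frame_ctor_dtindex_monthly                   | 255.0759 |   1.2197 | 209.1352 |
"""


def parse_rows(text):
    """Return (name, head_ms, base_ms, ratio) tuples, skipping headers and separators."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        if len(cells) < 4:
            continue  # separator line or blank
        try:
            head, base, ratio = float(cells[1]), float(cells[2]), float(cells[3])
        except ValueError:
            continue  # header row
        rows.append((cells[0], head, base, ratio))
    return rows


def regressions(rows, threshold=2.0):
    """Names of benchmarks whose head/base ratio exceeds the (assumed) threshold."""
    return [name for name, head, base, _ in rows if head / base > threshold]


if __name__ == "__main__":
    print(regressions(parse_rows(SAMPLE)))
```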

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Not sure if this feature would be generally useful, but '2M' wasn't affected:

Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------  
frame_ctor_nested_dict_int64                 | 113.8573 | 116.8803 |   0.9741 |
frame_ctor_list_of_dict                      | 121.4930 | 121.7180 |   0.9982 |
frame_ctor_nested_dict                       |  94.3383 |  91.7673 |   1.0280 |  
frame_ctor_dtindex_BMonthEnd(1)              | 286.2213 | 256.7193 |   1.1149 |
frame_ctor_dtindex_BMonthBegin(2)            |   1.6034 |   1.3483 |   1.1891 |
frame_ctor_dtindex_CustomBusinessDay(1)      |   1.9443 |   1.6097 |   1.2078 |  
frame_ctor_dtindex_CDay(1)                   |   1.9423 |   1.6066 |   1.2089 |
frame_ctor_dtindex_CDay(2)                   |   1.9573 |   1.6180 |   1.2097 |  
frame_ctor_dtindex_MonthBegin(2)             |   1.5747 |   1.2863 |   1.2241 |
frame_ctor_dtindex_CustomBusinessDay(2)      |   1.9637 |   1.5990 |   1.2281 |  
frame_ctor_dtindex_BDay(2)                   |   1.5260 |   1.2047 |   1.2667 |
frame_ctor_dtindex_MonthEnd(2)               |   1.5957 |   1.2543 |   1.2721 |  
frame_ctor_dtindex_BMonthEnd(2)              |   1.5634 |   1.2233 |   1.2780 |
frame_ctor_dtindex_BusinessDay(2)            |   1.5517 |   1.1967 |   1.2967 |
frame_ctor_dtindex_Week(2)                   |   1.3377 |   1.0243 |   1.3059 |  
frame_ctor_dtindex_BDay(1)                   | 103.5473 |  79.2627 |   1.3064 |  
frame_ctor_dtindex_Day(2)                    |   1.1547 |   0.8800 |   1.3121 |
frame_ctor_dtindex_BusinessDay(1)            | 104.4460 |  79.3813 |   1.3158 |  
frame_ctor_dtindex_Milli(1)                  |   1.1473 |   0.8707 |   1.3177 |
frame_ctor_dtindex_Minute(2)                 |   1.1660 |   0.8840 |   1.3191 |  
frame_ctor_dtindex_Week(1)                   |   1.3837 |   1.0450 |   1.3241 |
frame_ctor_dtindex_Micro(1)                  |   1.1676 |   0.8777 |   1.3303 |  
frame_ctor_dtindex_Second(2)                 |   1.1590 |   0.8667 |   1.3372 |
frame_ctor_dtindex_Hour(1)                   |   1.1543 |   0.8570 |   1.3469 |  
frame_ctor_dtindex_Micro(2)                  |   1.1513 |   0.8517 |   1.3518 |
frame_ctor_dtindex_Milli(2)                  |   1.1844 |   0.8727 |   1.3572 |
frame_ctor_dtindex_Day(1)                    |   1.1577 |   0.8514 |   1.3597 |  
frame_ctor_dtindex_Second(1)                 |   1.1520 |   0.8460 |   1.3617 |  
frame_ctor_dtindex_daily                     |   1.1834 |   0.8667 |   1.3654 |
frame_ctor_dtindex_Minute(1)                 |   1.1503 |   0.8407 |   1.3683 |  
frame_ctor_dtindex_Hour(2)                   |   1.1650 |   0.8504 |   1.3700 |
frame_ctor_dtindex_QuarterEnd(1)             | 240.6406 |   1.4707 | 163.6204 |  
frame_ctor_dtindex_QuarterBegin(1)           | 236.0146 |   1.4323 | 164.7756 |
frame_ctor_dtindex_BQuarterBegin(1)          | 252.9564 |   1.5121 | 167.2936 |  
frame_ctor_dtindex_MonthBegin(1)             | 222.8660 |   1.3070 | 170.5158 |
frame_ctor_dtindex_BMonthBegin(1)            | 251.1834 |   1.3367 | 187.9195 |  
frame_ctor_dtindex_BQuarterEnd(1)            | 282.7850 |   1.4796 | 191.1193 |
frame_ctor_dtindex_monthly                   | 244.1387 |   1.2670 | 192.6849 |
frame_ctor_dtindex_MonthEnd(1)               | 244.7317 |   1.2530 | 195.3214 |  
-------------------------------------------------------------------------------  
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@jreback
Contributor

jreback commented Feb 26, 2014

The 2-month offset was being cached.

Can you do a PR of this to my branch? Or I can grab your commit.

What happens when you put in the change that I made?

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Just submitted a pull request to your branch. Working on running with your change.

@qwhelan
Contributor Author

qwhelan commented Feb 26, 2014

Your change vs. 0.12.0:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_ctor_dtindex_BMonthEnd(1)              |   1.5987 | 254.7014 |   0.0063 |
frame_ctor_dtindex_BusinessDay(1)            |   1.4900 |  79.9637 |   0.0186 |
frame_ctor_dtindex_BDay(1)                   |   1.5120 |  79.7106 |   0.0190 |
frame_ctor_nested_dict_int64                 | 115.6197 | 115.4337 |   1.0016 |
frame_ctor_nested_dict                       |  95.0450 |  92.9697 |   1.0223 |
frame_ctor_list_of_dict                      | 123.4467 | 120.6383 |   1.0233 |
frame_ctor_dtindex_BQuarterBegin(1)          |   1.6353 |   1.4870 |   1.0997 |
frame_ctor_dtindex_BQuarterEnd(1)            |   1.6257 |   1.4459 |   1.1243 |
frame_ctor_dtindex_QuarterBegin(1)           |   1.5993 |   1.4211 |   1.1254 |
frame_ctor_dtindex_QuarterEnd(1)             |   1.6220 |   1.4327 |   1.1322 |
frame_ctor_dtindex_BMonthBegin(1)            |   1.6157 |   1.3154 |   1.2283 |
frame_ctor_dtindex_MonthBegin(1)             |   1.5984 |   1.2660 |   1.2625 |
frame_ctor_dtindex_MonthEnd(1)               |   1.6023 |   1.2487 |   1.2832 |
frame_ctor_dtindex_Minute(1)                 |   2.9350 |   0.8349 |   3.5152 |
frame_ctor_dtindex_Hour(1)                   |   3.0277 |   0.8520 |   3.5535 |
frame_ctor_dtindex_Minute(2)                 |   3.1273 |   0.8776 |   3.5634 |
frame_ctor_dtindex_Hour(2)                   |   3.0163 |   0.8403 |   3.5894 |
frame_ctor_dtindex_Day(2)                    |   3.2320 |   0.8829 |   3.6605 |
frame_ctor_dtindex_Micro(1)                  |   3.2264 |   0.8670 |   3.7214 |
frame_ctor_dtindex_Micro(2)                  |   3.1490 |   0.8460 |   3.7223 |
frame_ctor_dtindex_Day(1)                    |   3.1766 |   0.8483 |   3.7447 |
frame_ctor_dtindex_Milli(1)                  |   3.2760 |   0.8527 |   3.8420 |
frame_ctor_dtindex_Second(1)                 |   3.2516 |   0.8387 |   3.8771 |
frame_ctor_dtindex_Second(2)                 |   3.3247 |   0.8537 |   3.8944 |
frame_ctor_dtindex_Milli(2)                  |   3.4111 |   0.8477 |   4.0241 |
frame_ctor_dtindex_Week(2)                   |   6.1696 |   1.0093 |   6.1128 |
frame_ctor_dtindex_Week(1)                   |  23.8633 |   1.0303 |  23.1618 |
frame_ctor_dtindex_BusinessDay(2)            | 164.1870 |   1.1967 | 137.1996 |
frame_ctor_dtindex_BDay(2)                   | 164.1833 |   1.1900 | 137.9661 |
frame_ctor_dtindex_MonthBegin(2)             | 223.9563 |   1.2806 | 174.8804 |
frame_ctor_dtindex_BMonthBegin(2)            | 246.2444 |   1.3394 | 183.8528 |
frame_ctor_dtindex_MonthEnd(2)               | 245.0993 |   1.2410 | 197.4938 |
frame_ctor_dtindex_BMonthEnd(2)              | 280.9331 |   1.2097 | 232.2420 |
frame_ctor_dtindex_CustomBusinessDay(1)      | 468.6323 |   1.6143 | 290.2948 |
frame_ctor_dtindex_CDay(1)                   | 472.2750 |   1.6087 | 293.5775 |
frame_ctor_dtindex_CDay(2)                   | 470.9130 |   1.5996 | 294.3888 |
frame_ctor_dtindex_CustomBusinessDay(2)      | 473.1157 |   1.5993 | 295.8245 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@jreback
Contributor

jreback commented Feb 26, 2014

OK, thanks.
I think I'll maybe put the offsets back to how they were.

@jreback
Contributor

jreback commented Feb 26, 2014

cc @cancan101
