Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Some faster date/time string formatting #46759

Merged
merged 40 commits into from
Jul 1, 2022

Conversation

smarie
Copy link
Contributor

@smarie smarie commented Apr 12, 2022

"Easy" fixes related to #44764, taken from the big PR #46116:

  • Performance improvement in BusinessHour, repr is now 4 times faster
  • Performance improvement in DatetimeArray and DatetimeIndex: string formatting is now up to 80% faster (as fast as default) when one of the default strftime formats "%Y-%m-%d %H:%M:%S" or "%Y-%m-%d %H:%M:%S.%f" is used (see full bench in PR [WIP] perf improvements for strftime #46116).
  • Performance improvement in Series.to_excel and DataFrame.to_excel (see ExcelFormatter): processing dates can be up to 4% faster.
  • Performance improvement in Series.to_sql and DataFrame.to_sql (SQLiteTable): processing time arrays can be up to 65% faster !

I also added a few code comments about datetime formatting sections, to improve maintenance.

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@smarie
Copy link
Contributor Author

smarie commented Apr 12, 2022

BusinessHour repr improvements mini-bench:

get_repr = lambda: repr(self.offset1)
glob = globals()
glob.update(locals())
res = Timer("get_repr()", globals=glob).repeat(3, 10000)

before: [0.15039550000000013, 0.16580459999999952, 0.12403839999999988]
after: [0.035295299999999585, 0.0369325000000007, 0.03622369999999897]

= 4 times faster!

Sylvain MARIE added 10 commits April 12, 2022 23:37
…eIndex`: string formatting is now up to 80% faster (as fast as default) when one of the default strftime formats ``"%Y-%m-%d %H:%M:%S"`` or ``"%Y-%m-%d %H:%M:%S.%f"`` is used. See pandas-dev#44764
…me.to_excel` (:class:`ExcelFormatter`): processing dates can be up to 4% faster. (related to :issue:`44764`)
…QLiteTable`): processing time arrays can be up to 65% faster ! (related to pandas-dev#44764)
@smarie
Copy link
Contributor Author

smarie commented Apr 12, 2022

SQL mini-bench:

import sqlite3
con = sqlite3.connect(':memory:')
dt = pd.date_range('2014-01-01', periods=1000, freq='10h')
df = pd.DataFrame({'time': dt.time})

from timeit import Timer
to_assess = lambda: df.to_sql('test_datetime', con, index=False, if_exists='replace')
glob = globals()
glob.update(locals())
res = Timer("to_assess()", globals=glob).repeat(5, 300)
print(res)

before: [2.667514099999991, 2.711463600000002, 2.7073388000000023, 2.6616616999999962, 2.7392122000000114]
after: [0.9359815999999981, 0.9132108000000017, 0.9482149999999976, 0.9659205999999969, 0.9622573000000045]
65% improvement

@smarie smarie changed the title [WIP] Some faster date/time string formatting [READY] Some faster date/time string formatting Apr 12, 2022
@smarie
Copy link
Contributor Author

smarie commented Apr 12, 2022

Ready for review

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these have specific asv's that test performance? if not pls add. pls show them in the PR

doc/source/whatsnew/v1.5.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.5.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.5.0.rst Outdated Show resolved Hide resolved
pandas/_libs/tslibs/timedeltas.pyx Outdated Show resolved Hide resolved
@jreback jreback added the Performance Memory or execution speed performance label Apr 13, 2022
pandas/io/sql.py Outdated Show resolved Hide resolved
@smarie
Copy link
Contributor Author

smarie commented Apr 13, 2022

Thanks to both of you for the quick review ! I took care of all comments.

do these have specific asv's that test performance? if not pls add. pls show them in the PR

Ok. I never used asv, so I'll have to check how this works. It may take a bit of time until I come back with something

@smarie smarie mentioned this pull request May 2, 2022
5 tasks
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments, pls rebase

asv_bench/benchmarks/tslibs/strftime.py Outdated Show resolved Hide resolved
pandas/core/arrays/period.py Outdated Show resolved Hide resolved
Sylvain MARIE added 3 commits June 7, 2022 22:39
…ure/44764_strftime_perf_datetimes

� Conflicts:
�	doc/source/whatsnew/v1.5.0.rst
�	pandas/_libs/tslib.pyx
@smarie
Copy link
Contributor Author

smarie commented Jun 8, 2022

It seems that everything is ok now. Thanks for this review!

@smarie smarie requested a review from jreback June 8, 2022 19:35
@smarie
Copy link
Contributor Author

smarie commented Jun 14, 2022

@jreback : could we move forward on this ? thanks in advance!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks fine. @jbrockmendel @mroeschke if any comments

Sylvain MARIE added 2 commits June 16, 2022 13:17
@smarie
Copy link
Contributor Author

smarie commented Jun 16, 2022

Note after updating the branch to the latest main: a recent mod in format_array_from_datetime introduced a new parameter reso, without documenting it in the docstring :(
I tried my best to guess what I should do for the "basic format day" branch but maybe it is not the right interpretation.

@jbrockmendel @jreback @mroeschke if one of you originated this mod, could you please suggest a docstring sentence and check that my interpretation (of not taking it into account for basic format day branch) is still relevant ?

EDIT: @jbrockmendel answered so this is OK now.

…ure/44764_strftime_perf_datetimes

� Conflicts:
�	pandas/_libs/tslib.pyx
@smarie
Copy link
Contributor Author

smarie commented Jun 20, 2022

Ping ?
I rebased one more time, and once again it seems that there are some unrelated issues in two runners.
@jbrockmendel , @mroeschke , can you confirm @jreback 's approval and proceed to merge ? Thanks in advance

@jreback jreback changed the title [READY] Some faster date/time string formatting PERF: Some faster date/time string formatting Jun 21, 2022
…ure/44764_strftime_perf_datetimes

� Conflicts:
�	doc/source/whatsnew/v1.5.0.rst
@smarie
Copy link
Contributor Author

smarie commented Jul 1, 2022

I updated the PR to the latest head, it builds ok. @jreback approved.

Could we proceed ? @jreback @mroeschke @jbrockmendel

@jreback jreback merged commit 55d9dcf into pandas-dev:main Jul 1, 2022
@jreback
Copy link
Contributor

jreback commented Jul 1, 2022

thanks @smarie

@smarie
Copy link
Contributor Author

smarie commented Jul 4, 2022

Great ! Thanks all for the reviews

@smarie smarie deleted the feature/44764_strftime_perf_datetimes branch July 4, 2022 10:00
@@ -0,0 +1,64 @@
import numpy as np

import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late review; these benchmarks don't belong in the tslibs directory (unless you can re-write them to not depend on anything outside of tslibs). can you do a follow-up to move them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes sure, do you think that moving them one folder up would be enough ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would do it, yes. thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #47665 . Let's wait for a successful build just in case..

smarie pushed a commit to smarie/pandas that referenced this pull request Jul 11, 2022
mroeschke pushed a commit that referenced this pull request Jul 11, 2022
…libs dir (#47665)

Code review from #46759 : moved benchmark file

Co-authored-by: Sylvain MARIE <sylvain.marie@se.com>
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…side of tslibs dir (pandas-dev#47665)

Code review from pandas-dev#46759 : moved benchmark file

Co-authored-by: Sylvain MARIE <sylvain.marie@se.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants