Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] perf improvements for strftime #46116

Closed
wants to merge 68 commits into from

Conversation

smarie
Copy link
Contributor

@smarie smarie commented Feb 22, 2022

Changelog

@jbrockmendel
Copy link
Member

Can you post %timeit results showing the perf improvement? (edit OP)

@smarie
Copy link
Contributor Author

smarie commented Feb 24, 2022

Sure. I edited the todo list accordingly.

@smarie
Copy link
Contributor Author

smarie commented Feb 24, 2022

Note: there seem to be a related (but distinct) issue in #29461 , solved by #34668 .

Still I will stick to the current road where I modify the format_array_from_datetime function only.

… to be able to replace strftime faster. The result can be demonstrated using `fast_strftime`.
@pep8speaks
Copy link

pep8speaks commented Feb 24, 2022

Hello @smarie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-27 14:05:05 UTC

@smarie
Copy link
Contributor Author

smarie commented Feb 24, 2022

First milestone reached: all utility functions required to create the fast datetime formatting strings are there.

A fast_strftime util function can be used to demonstrate (and test) their performance on plain old datetime objects. Added relevant tests for this.

Next step: modify format_array_from_datetime so that it relies on the same underlying function convert_dtformat in order to format npy_datetimestruct instances.

@jreback jreback changed the title [Work in progress] perf improvements for strftime [WIP] perf improvements for strftime Feb 27, 2022
@jreback jreback added Performance Memory or execution speed performance Datetime Datetime data dtype labels Feb 27, 2022
Sylvain MARIE added 19 commits February 28, 2022 15:05
…and `UnsupportedDatetimeDirective`. Modified `format_array_from_datetime` to use it when `fast_strftime=True` (new arg). Removed `get_datetime_fmt_dct`
…epts `fast_strftime`. Modified `PeriodArray._format_native_types` so that it benefits from `fast_strftime` too.
…ter`, `Datetime64TZFormatter` and `DatetimeIndex`.
…iod_strftime %q format. Also accelerated slightly the _period_strftime.
… code was right, not the docstring. Fixed the docstring
…tting symbols (% or {} depending on formatting style)
…pecial characters. New `Period.fast_strftime` used in `PeriodArray`. Renamed `convert_dtformat` into `convert_strftime_format` and `UnsupportedDatetimeDirective` into `UnsupportedStrFmtDirective`.
@smarie
Copy link
Contributor Author

smarie commented Mar 11, 2022

Noted - I'll at least try to move the two small fixes for Period into separate PRs, before this final big one about fast_strftime is reviewed.

@smarie
Copy link
Contributor Author

smarie commented Mar 12, 2022

Quick status on this PR: today I reached a point where all CI checks pass (finally !!!), and therefore I was able to try reproducing the benchmark in #44764 . Results are pretty good !

image

  • (blue) the "legacy" curve corresponds to using strftime. It seems even worse on my machine (windows 10) than the result reported in PERF: strftime is slow #44764 !
  • (orange) the "manual" curve corresponds to the same manual method as in PERF: strftime is slow #44764. It is as good as reported in PERF: strftime is slow #44764
  • (green) the "new and legacy default" represents the case where date_format=None (default format). It is as good as reported in PERF: strftime is slow #44764
  • (red) the "new custom" is the main use case why this PR was created: a custom date format. It was improved a lot as compared to the blue curve !
  • (violet) the "new basic formats" represents the case reported by @auderson where the custom date format is actually identical to the default format (green). We see that both now have the same perf
```python import time import numpy as np import pandas as pd from joblib import Parallel, delayed

def timer(f):
def inner(*args, **kwargs):
s = time.time()
result = f(*args, **kwargs)
e = time.time()
return e - s
return inner

@Timer
def method_legacy(index):
return index.strftime("%Y-%m-%d_%H:%M:%S", fast_strftime=False)

@Timer
def method_manual(index):
attrs = ('year', 'month', 'day', 'hour', 'minute', 'second')
parts = [getattr(index, at) for at in attrs]
b = []
for year, month, day, hour, minute, second in zip(*parts):
b.append(f'{year}-{month:02}-{day:02}_{hour:02}:{minute:02}:{second:02}')
b = pd.Index(b)
return b

@Timer
def method_new_and_legacy_default(index):
return index.strftime(None)

@Timer
def method_new_custom(index):
return index.strftime("%Y-%m-%d_%H:%M:%S")

@Timer
def method_new_basic_formats(index):
return index.strftime("%Y-%m-%d %H:%M:%S")

index = pd.date_range('2000', '2020', freq='1min')

@delayed
def profile(p):
n = int(10 ** p)
time_legacy = method_legacy(index[:n])
time_manual = method_manual(index[:n])
time_new_and_legacy_default = method_new_and_legacy_default(index[:n])
time_new_custom = method_new_custom(index[:n])
time_new_basic_formats = method_new_basic_formats(index[:n])
return n, time_legacy, time_manual, time_new_and_legacy_default,
time_new_custom, time_new_basic_formats

records = Parallel(10, verbose=10, max_nbytes='500M')(
profile(p) for p in np.arange(1, 7.1, 0.1)
)

pd.DataFrame(records, columns=['n', 'time_legacy', 'time_manual',
'time_new_and_legacy_default', 'time_new_custom',
'time_new_basic_formats'
]).set_index(
'n').plot(figsize=(10, 8))

import matplotlib.pyplot as plt
plt.show()

</details>

@smarie smarie mentioned this pull request Mar 12, 2022
3 tasks
@jreback
Copy link
Contributor

jreback commented Mar 12, 2022

@smarie thins way too large a PR

pls split this up into logical pieces

@smarie
Copy link
Contributor Author

smarie commented Mar 12, 2022

@jreback yes I was going to split it as explained above, but I wanted first to reach a "clean" state where CI is happy and the benchmark shows good results. Now I will try (next week) to split this into pieces, but as already mentioned, there is not much I can do here: strftime was used in many places (Period, Timestamp, csv formatter, excel formatter, sqlite adapter, some repr methods, etc.) and all of them needed replacement. I will do my best do separate between

  • bug fixes
  • quick fixes for all repr and the like (not requiring core changes)
  • the "core" replacement for datetime arrays
  • the same for periods
  • ... maybe other similar

@auderson
Copy link
Contributor

This is impressive😀

@@ -2340,14 +2392,70 @@ cdef class _Period(PeriodMixin):
object_state = None, self.freq, self.ordinal
return (Period, object_state)

def fast_strftime(self, fmt_str: str, loc_s: object) -> str:
"""A faster alternative to `strftime` using string formatting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick newline after the """

@@ -101,7 +105,8 @@ def format_array_from_datetime(
ndarray[int64_t] values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a tslibs.strftime does get implemented, that might be a good home for format_array_from_datetime

…ure/44764_perf_issue

� Conflicts:
�	pandas/_libs/tslibs/period.pyx
�	pandas/core/arrays/period.py
�	pandas/tests/io/formats/test_format.py
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 27, 2022
@smarie
Copy link
Contributor Author

smarie commented May 2, 2022

bot: please leave this open. #46759 and #46405 are the next bits of this PR to consider

@Paul-AUB
Copy link

Hello @smarie, it seems there is a flake8 issue (pandas/tests/io/formats/test_format.py:3253:5: F811 redefinition of unused 'test_period_custom' from line 3201) to fix and a conflict to solve before this can be merged. Could you do that please ? Feature is really interesting !

@smarie
Copy link
Contributor Author

smarie commented Jun 17, 2022

Thanks @Paul-AUB ! Actually this PR is just here as a reference, it is split in several pieces. #46759 and #46405 are the next bits of this PR to consider, the first is actually quite close to be merged. The second has a side effect in the CI concerning some locale-dependent setting, that needs to be investigated. Once #46405 is merged, I'll finally update this PR or create a new one with the same contents, so that it contains the remaining bit (the engine able to improve any strftime format)

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Dec 6, 2022
@smarie
Copy link
Contributor Author

smarie commented Dec 12, 2022

Hi @mroeschke, yes I plan to finish this now that most of its "other aspects" have been merged (through other PRs #46361, #46759, #46405, #47570). I'll try to find some time later this month

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance Stale
Projects
None yet
8 participants