[READY] perf improvements for strftime #51298
base: main
…string formatting
…nvert_strftime_format` (raises `UnsupportedStrFmtDirective`). This function converts a `strftime` date format string into a native python formatting string
… use faster datetime formatting.
…faster datetime formatting. `_format_native_types` modified with this new argument too. Subclasses modified to support it (`DatetimeArray`, `PeriodArray`, `TimedeltaArray`, `DatetimeIndex`)
… argument `fast_strftime` to use faster datetime formatting.
…nit__`: new boolean argument `fast_strftime` to use faster datetime formatting.
…ure/44764_perf_issue_new # Conflicts: # pandas/_libs/tslibs/period.pyx # pandas/io/formats/format.py # pandas/tests/scalar/test_nat.py
…ure/44764_perf_issue_new
This PR is still pretty big. Any reason why you are introducing a new
…ired by `Period.fast_strftime` and `Timestamp.fast_strftime`
…ure/44764_perf_issue_new # Conflicts: # pandas/tests/frame/methods/test_to_csv.py
Indeed. If I'm not mistaken, `pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]')).strftime("%y")` raises:
NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.
Note that the same error is raised by the instance-level one. I'll add a test and make sure we can get the same error raised.
Raising "not yet supported" is fine; we really need to avoid silently returning wrong results.
It would probably slow down the performance of strftime on arrays if we were to raise an error. Indeed to get this error we would have
Since the fix was really simple, I fixed the issue instead (it was just a matter of handling the modulo operation correctly).
However, this now creates a difference between
(Note that there is already a difference in pandas today since
Is this difference a problem? If so, I suggest adding an explicit argument
What do you think?
EDIT: I renamed the parameter in the proposal above to
EDIT2: if we keep this behaviour, we should declare the new feature (support for negative/out-of-python-range dates) in the changelog. I have not yet checked whether there are open issues on this.
EDIT3: I slightly improved the message above. Other possible names for that parameter:
…responding tests.
…to feature/44764_perf_issue_new # Conflicts: # doc/source/whatsnew/v3.0.0.rst
…ure/44764_perf_issue_new
…ure/44764_perf_issue_new
@MarcoGorelli I edited the previous message to add details. I also made the tests pass.
If the user uses a strftime template that cannot be converted to the fast Python template, for example
Therefore a user could experience this situation:
Both of these situations can be solved easily by reintroducing a parameter on all array-acting
Let me know how you want to proceed.
@MarcoGorelli I thought about all of this over the weekend and decided to describe the two related issues, consistency and error management, in separate tickets:
That way, we do not have to solve every issue at once: we can merge the current PR purely as a "performance" PR and discuss the longer-term vision in dedicated PR(s) tackling the two tickets above.
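For readers following the consistency discussion, here is a minimal sketch of a fast path with a legacy fallback. It assumes a tiny, hypothetical directive table and uses only the standard library; `UnsupportedDirective`, `to_python_template` and `format_dates` are illustrative names, not the PR's actual code.

```python
import re
from datetime import datetime

class UnsupportedDirective(ValueError):
    """Hypothetical stand-in for the PR's UnsupportedStrFmtDirective."""

# Illustrative subset only; the real conversion table is not reproduced here.
_FIELDS = {
    "Y": "%(year)d",
    "m": "%(month)02d",
    "d": "%(day)02d",
}

def to_python_template(strftime_fmt):
    """Convert a strftime format to a printf-style template, or raise."""
    def repl(match):
        char = match.group(1)
        if char == "%":
            return "%%"  # a literal percent sign stays escaped
        if char not in _FIELDS:
            raise UnsupportedDirective(match.group(0))
        return _FIELDS[char]
    # "%(.?)" also catches a dangling "%" at the end of the string.
    return re.sub(r"%(.?)", repl, strftime_fmt)

def format_dates(values, strftime_fmt):
    """Use the fast template when possible, else fall back to strftime."""
    try:
        template = to_python_template(strftime_fmt)
    except UnsupportedDirective:
        # Legacy path: delegate to the standard library's strftime.
        return [v.strftime(strftime_fmt) for v in values]
    return [
        template % {"year": v.year, "month": v.month, "day": v.day}
        for v in values
    ]

dates = [datetime(2020, 1, 2)]
print(format_dates(dates, "%Y-%m-%d"))  # fast path
print(format_dates(dates, "%Y %j"))     # "%j" is not in the table: fallback
```

In this sketch the choice of path depends only on the format string, which is one way the two engines could produce different results for the same data; an explicit parameter would make that choice visible to the caller.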
Nice one! I have a question about `esc`, but I think this looks good.
Agree with your suggestion about adding `engine`, and agree that it can be done separately.
Thanks for having stuck with this for so long - the perf improvements are really noticeable so it's worth doing this!
@WillAyd do you have any comments?
esc = "/_+\\"

# Escape the %% before searching for directives, same as strftime
strftime_fmt = strftime_fmt.replace("%%", esc)

esc_l = "+^_\\"
esc_r = "/_^+"
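To make the role of the three sentinels concrete, here is a minimal sketch of the escape-then-convert idea. Only the sentinel strings come from the diff above; the directive table, `UnsupportedDirective` and `sketch_convert` are hypothetical and much smaller than whatever the PR actually implements.

```python
from datetime import datetime

# Sentinels copied from the diff; everything else in this block is hypothetical.
ESC, ESC_L, ESC_R = "/_+\\", "+^_\\", "/_^+"

# Deliberately tiny table: directive -> (field name, printf-style spec).
DIRECTIVES = {
    "%Y": ("year", "d"),
    "%m": ("month", "02d"),
    "%d": ("day", "02d"),
}

class UnsupportedDirective(ValueError):
    """Hypothetical stand-in for the PR's UnsupportedStrFmtDirective."""

def sketch_convert(strftime_fmt):
    # 1. Hide literal "%%" so it cannot be mistaken for a directive start.
    fmt = strftime_fmt.replace("%%", ESC)
    # 2. Swap known directives for sentinel-wrapped field names.
    for directive, (name, spec) in DIRECTIVES.items():
        fmt = fmt.replace(directive, ESC_L + name + ESC_R + spec)
    # 3. Any "%" left over belongs to a directive this sketch cannot handle.
    if "%" in fmt:
        raise UnsupportedDirective(strftime_fmt)
    # 4. Turn the sentinels into printf-style syntax.
    return fmt.replace(ESC, "%%").replace(ESC_L, "%(").replace(ESC_R, ")")

def fields(ts):
    return {"year": ts.year, "month": ts.month, "day": ts.day}

ts = datetime(2020, 1, 2)
template = sketch_convert("%Y-%m-%d 100%%")
print(template)               # printf-style template
print(template % fields(ts))  # formatted without calling strftime
```

Escaping `%%` first matters for inputs such as `"%%Y"`, which should render as a literal `%Y` rather than as a year.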
this looks a bit mysterious to me, mind explaining why you've constructed `esc`, `esc_l` and `esc_r` like this?
haha good catch. Indeed this is ugly. Purely random strings here. Long enough so that users will not hit them by chance.
A more robust way would probably be to check that the strings above do not exist in the input sequence and, if they do, to append as many trailing '+' (or any other char) as needed until they are no longer present in the input string. We could implement it in a `find_escape_patterns` internal function.
If you agree with this strategy I can change it like this.
Sorry if I have missed the decision in the comments, but from an API perspective I really dislike
@WillAyd there is no
The reason we cannot completely hide these is that the legacy engine is OS-dependent and has a larger compatibility scope, while the new faster engine is OS-independent and has a more restricted compatibility scope. This is explained in detail in the past discussions above, and summarized in the "why is it so" section of #58179.
EDIT: Anticipating #58179 we could rename it
This PR is a new clean version of #46116
`doc/source/whatsnew/vX.X.X.rst` file if fixing a bug or adding a new feature.