[READY] perf improvements for strftime #51298

Open
wants to merge 145 commits into
base: main

smarie
Contributor

@smarie smarie commented Feb 10, 2023

This PR is a new clean version of #46116

Sylvain MARIE added 26 commits February 6, 2023 15:44
…nvert_strftime_format` (raises `UnsupportedStrFmtDirective`). This function converts a `strftime` date format string into a native python formatting string
…faster datetime formatting. `_format_native_types` modified with this new argument too. Subclasses modified to support it (`DatetimeArray`, `PeriodArray`, `TimedeltaArray`, `DatetimeIndex`)
… argument `fast_strftime` to use faster datetime formatting.
…nit__`: new boolean argument `fast_strftime` to use faster datetime formatting.
…ure/44764_perf_issue_new

Conflicts:
	pandas/_libs/tslibs/period.pyx
	pandas/io/formats/format.py
	pandas/tests/scalar/test_nat.py
@WillAyd
Member

WillAyd commented Feb 10, 2023

This PR is still pretty big. Any reason why you are introducing a new fast_strftime keyword instead of just trying to improve performance in place? I think that would help reduce the size, though it probably still needs to be broken up into smaller subsets. The bigger a PR is, the harder it is to review, so it ends up in a long review cycle.

Sylvain MARIE added 3 commits February 14, 2023 09:50
…ired by `Period.fast_strftime` and `Timestamp.fast_strftime`
…ure/44764_perf_issue_new

Conflicts:
	pandas/tests/frame/methods/test_to_csv.py
@smarie
Contributor Author

smarie commented Apr 2, 2024

thanks for updating

I think it's not correct with negative dates:

In [21]: pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]')).strftime('%y')
Out[21]: Index(['80', '20'], dtype='object')

Indeed. If I'm not mistaken, pandas does not currently handle negative dates in strftime: both the instance-level and the array-level strftime raise an exception:

pd.DatetimeIndex(np.array(['-0020-01-01', '2020-01-02'], 'datetime64[s]')).strftime("%y")

raises

NotImplementedError: strftime not yet supported on Timestamps which are outside the range of Python's standard library. For now, please call the components you need (such as `.year` and `.month`) and construct your string from there.

Note that the same error is raised by the instance-level call Timestamp("-0020-01-01").strftime("%y").

I'll add a test and make sure the same error is raised.

@MarcoGorelli
Member

Raising "not yet supported" is fine, we really need to avoid silently returning wrong results

@smarie
Contributor Author

smarie commented Apr 2, 2024

Raising "not yet supported" is fine, we really need to avoid silently returning wrong results

Raising an error would probably slow down strftime on arrays. Indeed, to get this error we would have either

  • to create a Python datetime object (this is what raises the error in the current Timestamp.strftime),
  • or to compare the timestamp with Python's datetime.min and datetime.max constants (which might not be entirely correct for timezone-aware timestamps).
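
As a rough illustration of the second option, a bounds check could look like the sketch below. It works on plain integer epoch seconds and ignores time zones entirely, which is exactly the caveat mentioned above. The names `in_python_range`, `MIN_S` and `MAX_S` are hypothetical and not part of pandas.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
ONE_SECOND = timedelta(seconds=1)

# Integer division of timedeltas avoids float rounding near datetime.max.
MIN_S = (datetime.min - EPOCH) // ONE_SECOND
MAX_S = (datetime.max - EPOCH) // ONE_SECOND


def in_python_range(epoch_seconds: int) -> bool:
    """Return True if a naive timestamp (seconds since 1970-01-01)
    can be represented by the standard library's datetime."""
    return MIN_S <= epoch_seconds <= MAX_S


print(in_python_range(0))          # the epoch itself is representable
print(in_python_range(MIN_S - 1))  # one second before year 1 is not
```

Applied to a whole array, this kind of check adds a pass over the data before any formatting happens, which is the cost being weighed here.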

Since the fix was really simple, I fixed the issue instead (it was just a matter of handling the modulo operation correctly).
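
The actual change is not shown in this thread. The sketch below only illustrates one plausible reading of why '%y' produced '80' for year -20: Python's `%` is a floored modulo, so a naive `year % 100` wraps negative years. Both helper names are hypothetical, and which convention the PR adopted is not stated here.

```python
def two_digit_year_floor(year: int) -> str:
    # Python's % takes the sign of the divisor, so -20 % 100 == 80.
    return f"{year % 100:02d}"


def two_digit_year_abs(year: int) -> str:
    # Taking the modulo of the absolute value keeps the last two digits.
    return f"{abs(year) % 100:02d}"


print(two_digit_year_floor(2020))  # 20
print(two_digit_year_floor(-20))   # 80
print(two_digit_year_abs(-20))     # 20
```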

--

However, this now creates a difference between

  • the more permissive array operation DatetimeIndex.strftime (which calls Timestamp._fast_strftime behind the scenes and is therefore OS-independent and supports negative dates),
  • the legacy, less permissive instance-level Timestamp.strftime, which is OS-dependent and can raise errors for negative dates or in other situations, such as using "%y" for dates before 1900 on Windows.

(Note that there is already a difference in pandas today, since DatetimeIndex.strftime directly calls the OS C strftime without going through datetime.strftime first, and hence does not raise the same exception as Timestamp.strftime.)

Is this difference a problem?

If so, I suggest adding an explicit argument use_py_datetime: bool = False to the instance-level method. That way, by default the instance and array ops will continue to have the same behaviour (and will no longer raise this NotImplementedError). Yet, if the developer explicitly chooses so, setting use_py_datetime=True on the instance method would use the legacy method, converting the timestamp to a Python datetime before calling strftime.

What do you think?

EDIT: I renamed the parameter in the proposal above to use_py_datetime rather than fast, which was meaningless.

EDIT2: if we keep this behaviour, we should declare the new feature (support for negative/out-of-Python-range dates) in the changelog. I have not yet checked whether there are open issues on this.

EDIT3: I slightly improved the message above. Other possible names for that parameter: use_py_strftime or use_os_strftime.

@smarie smarie requested a review from MarcoGorelli April 2, 2024 22:53
@smarie
Contributor Author

smarie commented Apr 4, 2024

@MarcoGorelli I edited my previous message to add details. I also made the tests pass.
Finally, I thought of another side effect of the current proposal, hopefully minor, but I'll let you decide.

If the user provides a strftime template that cannot be converted to the fast Python template, for example %Y %a, then DatetimeIndex.strftime will silently fall back to the current behaviour, that is, the OS-dependent C strftime.
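
To make the conversion step concrete, here is a minimal sketch of turning a strftime format into a Python %-style template, raising when a directive is not supported. It is not the PR's implementation: the supported subset, `to_py_template` and `UnsupportedDirective` are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical subset of directives that map cleanly onto %-formatting.
_SUPPORTED = {
    "Y": "%(year)d",
    "m": "%(month)02d",
    "d": "%(day)02d",
    "H": "%(hour)02d",
    "M": "%(min)02d",
    "S": "%(sec)02d",
}
_DIRECTIVE = re.compile(r"%(.?)", re.DOTALL)


class UnsupportedDirective(ValueError):
    """Raised when the format cannot be expressed as a Python template."""


def to_py_template(strftime_fmt: str) -> str:
    def repl(match: re.Match) -> str:
        directive = match.group(1)
        if directive == "%":
            return "%%"  # a literal percent stays escaped in the template
        if directive not in _SUPPORTED:
            raise UnsupportedDirective(f"%{directive}")
        return _SUPPORTED[directive]

    return _DIRECTIVE.sub(repl, strftime_fmt)


template = to_py_template("%Y-%m-%d")
print(template % {"year": 2020, "month": 1, "day": 2})  # 2020-01-02
```

With this sketch, `to_py_template("%Y %a")` raises `UnsupportedDirective`, which is the point where a caller would decide between failing and falling back to the C strftime.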

Therefore a user could experience one of these situations:

  • "It was working fine with negative dates, but when I modified the template from '%Y' to '%Y %a' it stopped supporting negative dates."
  • or "On my Linux machine, %Y was providing a non-zero-padded representation of the year for year 123, and when I changed the template from '%Y' to '%Y %a' it now provides a zero-padded representation."

Both of these situations can be solved easily by reintroducing a parameter on all array-level strftime methods to control which engine is used, for example DatetimeIndex.strftime(format: str, engine: str = "auto"), where engine is one of "auto" (default), "py_str_template" or "c_strftime". The default behaviour would be the current one (silent fallback from "py_str_template" to "c_strftime" if the format cannot be transformed into a fast string template).
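
A minimal sketch of what such an engine switch might look like, using the standard library's datetime as a stand-in for the C strftime path. Only the three engine names come from the proposal above; `strftime_with_engine`, `_fast_format` and the tiny directive table are hypothetical.

```python
from datetime import datetime

# Hypothetical fast-path directives, rendered with str.format.
_FAST = {"%Y": "{0.year}", "%m": "{0.month:02d}", "%d": "{0.day:02d}"}
_ENGINES = ("auto", "py_str_template", "c_strftime")


def _fast_format(dt: datetime, fmt: str):
    """Return the formatted string, or None if fmt is outside the fast subset."""
    if "{" in fmt or "}" in fmt or "%%" in fmt:
        return None  # keep the sketch simple: let the other engine handle these
    template = fmt
    for directive, field in _FAST.items():
        template = template.replace(directive, field)
    if "%" in template:
        return None  # an unsupported directive is left over
    return template.format(dt)


def strftime_with_engine(dt: datetime, fmt: str, engine: str = "auto") -> str:
    if engine not in _ENGINES:
        raise ValueError(f"engine must be one of {_ENGINES}")
    if engine != "c_strftime":
        result = _fast_format(dt, fmt)
        if result is not None:
            return result
        if engine == "py_str_template":
            raise ValueError(f"format {fmt!r} is not supported by the fast engine")
    # "auto" falls back silently; "c_strftime" always lands here.
    return dt.strftime(fmt)


print(strftime_with_engine(datetime(2020, 1, 2), "%Y-%m-%d"))
```

Passing engine="py_str_template" turns the silent fallback into an explicit error, so a change of template can no longer change the behaviour unnoticed.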

Let me know how you want to proceed.

@smarie
Contributor Author

smarie commented Apr 8, 2024

@MarcoGorelli I had a few thoughts about all of this over the weekend and decided to describe the two related issues of consistency and error management in separate tickets:

That way, we do not have to solve all issues at once: we can merge the current PR purely as a "performance" PR, and discuss the longer-term vision in dedicated PR(s) tackling the two above tickets.

Member

@MarcoGorelli MarcoGorelli left a comment

Nice one! I have a question about esc, but I think this looks good

Agree with your suggestion about adding engine, and agree that it can be done separately

Thanks for having stuck with this for so long - the perf improvements are really noticeable so it's worth doing this!

@WillAyd do you have any comments?

Comment on lines +230 to +236
esc = "/_+\\"

# Escape the %% before searching for directives, same as strftime
strftime_fmt = strftime_fmt.replace("%%", esc)

esc_l = "+^_\\"
esc_r = "/_^+"
Member

this looks a bit mysterious to me, mind explaining why you've constructed esc, esc_l and esc_r like this?

Contributor Author

haha good catch. Indeed this is ugly. Purely random strings here. Long enough so that users will not hit them by chance.

A more robust way would probably be to check that the above do not exist in the input sequence, and if they do, to add as many trailing '+' (or any other char) as needed so that it is not present in the input string. We could implement it in a find_escape_patterns internal function.
If you agree with this strategy I can change it like this.
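
The padding strategy described above could be sketched as follows. This is only an illustration of the idea, not code from the PR; `find_escape_pattern` and its parameters are hypothetical names.

```python
def find_escape_pattern(text: str, base: str = "/_+\\", pad: str = "+") -> str:
    """Return an escape string that does not occur in `text`.

    Starts from `base` and appends `pad` until the candidate is absent.
    The loop terminates because the candidate eventually becomes
    longer than `text`.
    """
    esc = base
    while esc in text:
        esc += pad
    return esc


fmt = "100%% of /_+\\ cases"
esc = find_escape_pattern(fmt)

escaped = fmt.replace("%%", esc)     # hide the literal percent signs
restored = escaped.replace(esc, "%")  # put a single '%' back at the end
print(restored)
```

Here the default escape string already appears in the input, so one '+' is appended before it is used as a placeholder.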

@WillAyd
Member

WillAyd commented Apr 12, 2024

Sorry if I have missed the decision in the comments, but from an API perspective I really dislike fast_strftime - why are we not just improving performance in place?

@smarie
Contributor Author

smarie commented Apr 12, 2024

@WillAyd there is no fast_strftime public API in the current proposal. There are new private methods on Timestamp and Period instances, and there is a new parameter in the low-level C array routines of tslib for period and timestamp arrays.

The reason we cannot completely hide these is that the legacy engine is OS-dependent and has a larger compatibility scope, while the new, faster engine is OS-independent and has a more restricted compatibility scope. This is explained in detail in the discussions above, and summarized in the "why is it so" section of #58179.

EDIT: Anticipating #58179, we could rename it _strftime_pystr if you prefer (mentioning the engine name pystr).

Labels
Performance (Memory or execution speed performance), Stale, Timeseries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: strftime is slow
5 participants