PERF: strftime is slow #44764

Open
3 tasks done
auderson opened this issue Dec 5, 2021 · 8 comments · May be fixed by #51298
Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@auderson
Contributor

auderson commented Dec 5, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I found that pd.DatetimeIndex.strftime is quite slow when the data is large.

Below is a simple benchmark. method_b first extracts 'year', 'month', 'day', 'hour', 'minute' and 'second', then converts them to strings with an f-string. Although it is written in pure Python, it takes significantly less time.

import time
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def timer(f):
    # Returns the elapsed time in seconds, not the wrapped function's result.
    def inner(*args, **kwargs):
        s = time.perf_counter()
        f(*args, **kwargs)
        e = time.perf_counter()
        return e - s
    return inner


@timer
def method_a(index):
    return index.strftime("%Y-%m-%d %H:%M:%S")

@timer
def method_b(index):
    attrs = ('year', 'month', 'day', 'hour', 'minute', 'second')
    parts = [getattr(index, at) for at in attrs]
    b = []
    for year, month, day, hour, minute, second in zip(*parts):
        b.append(f'{year}-{month:02}-{day:02} {hour:02}:{minute:02}:{second:02}')
    b = pd.Index(b)
    return b

index = pd.date_range('2000', '2020', freq='1min')

@delayed
def profile(p):
    n = int(10 ** p)
    time_a = method_a(index[:n])
    time_b = method_b(index[:n])
    return n, time_a, time_b

records = Parallel(10, verbose=10)(profile(p) for p in np.arange(1, 7.1, 0.1))
pd.DataFrame(records, columns=['n', 'time_a', 'time_b']).set_index('n').plot(figsize=(10, 8))
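
Before comparing timings it is worth confirming that the two formatting approaches produce the same strings. The following is a minimal sketch of that check using standard-library datetime objects rather than a DatetimeIndex, so it runs without pandas; the function names and the sample size are assumptions made for illustration, and the timings it prints say nothing about pandas itself.

```python
import timeit
from datetime import datetime, timedelta

# One-minute steps starting at 2000-01-01, mirroring the date_range above.
# The sample size is an arbitrary choice, kept small so the check is quick.
start = datetime(2000, 1, 1)
stamps = [start + timedelta(minutes=i) for i in range(10_000)]


def format_with_strftime(values):
    # The format string is interpreted again for every element.
    return [v.strftime("%Y-%m-%d %H:%M:%S") for v in values]


def format_with_fstring(values):
    # The layout is fixed in the f-string; only the fields vary.
    return [
        f"{v.year}-{v.month:02}-{v.day:02} {v.hour:02}:{v.minute:02}:{v.second:02}"
        for v in values
    ]


# Both approaches must agree before their timings are worth comparing.
same_output = format_with_strftime(stamps) == format_with_fstring(stamps)

if __name__ == "__main__":
    print("identical output:", same_output)
    for fn in (format_with_strftime, format_with_fstring):
        print(fn.__name__, timeit.timeit(lambda: fn(stamps), number=5))
```

The equality only holds as written for four-digit years, because the f-string does not pad the year.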

[Figure: time_a and time_b plotted against n]

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-63-generic
Version : #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.20.0
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.54.1

Prior Performance

No response

auderson added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) labels on Dec 5, 2021
@auderson
Contributor Author

auderson commented Dec 12, 2021

After reading the source code, I think I know where the bottleneck comes from. Internally, strftime is called on every single element, which means the format string is evaluated again for each one. If that were done just once before the loop, performance could be much better.
I also noticed that format_array_from_datetime has a shortcut for format=None:

elif basic_format:
    dt64_to_dtstruct(val, &dts)
    res = (f'{dts.year}-{dts.month:02d}-{dts.day:02d} '
           f'{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}')
    if show_ns:
        ns = dts.ps // 1000
        res += f'.{ns + dts.us * 1000:09d}'
    elif show_us:
        res += f'.{dts.us:06d}'
    elif show_ms:
        res += f'.{dts.us // 1000:03d}'
    result[i] = res

Using this feature, method_c became the fastest:

@timer
def method_c(index):
    return index.strftime(None)

[Figure: benchmark with method_c added]

I suggest changing the following line

basic_format = format is None and tz is None

to

basic_format = (format is None or format == "%Y-%m-%d %H:%M:%S") and tz is None
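
One detail worth checking in a condition of this shape: in Python, `and` binds more tightly than `or`, so without parentheses the `tz is None` check applies only to the explicit-format branch. The sketch below illustrates the difference with two hypothetical helper functions; they are not pandas code, and the parameter is named `fmt` to avoid shadowing the `format` builtin.

```python
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"


def basic_format_unparenthesized(fmt, tz):
    # Parsed as: fmt is None or (fmt == DEFAULT_FORMAT and tz is None)
    return fmt is None or fmt == DEFAULT_FORMAT and tz is None


def basic_format_parenthesized(fmt, tz):
    # The tz check applies to both the None case and the explicit-format case.
    return (fmt is None or fmt == DEFAULT_FORMAT) and tz is None


# With a timezone set and fmt=None, the two versions disagree.
print(basic_format_unparenthesized(None, "UTC"))  # True
print(basic_format_parenthesized(None, "UTC"))    # False
```

The two versions agree whenever `tz` is None, so the difference only shows up for timezone-aware data formatted with the default format.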

mroeschke added the Datetime (Datetime data dtype) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Dec 27, 2021
@smarie
Contributor

smarie commented Feb 22, 2022

I confirm that we identified the same performance issue on our side, with custom formats such as "%Y-%m-%dT%H:%M:%SZ".

It would be great to improve this in a future version! Would you like us to propose a PR? If so, some guidance would be appreciated.

@jreback
Contributor

jreback commented Feb 22, 2022

PRs are how things are fixed

core can provide review

@smarie
Contributor

smarie commented Feb 22, 2022

I opened a draft PR. It seems to me that we could run a format-string processor beforehand to transform strftime patterns such as %Y-%m-%d %H:%M:%S into '{dts.year}-{dts.month:02d}-{dts.day:02d} {dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}'.

I'll give it a try in the coming days.
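
As a rough illustration of the format-string processor idea (not the code from the draft PR), here is a minimal sketch in pure Python that converts a strftime pattern into a str.format template once and then reuses it for every element. The names `convert_strftime_format`, `format_all` and `_DIRECTIVES` are hypothetical, only a handful of zero-padded numeric directives are handled, and anything else raises an error so a caller could fall back to strftime.

```python
import re
from datetime import datetime

# Hypothetical mapping from a few strftime directives to str.format fields.
# The year is left unpadded, so results match strftime for four-digit years.
_DIRECTIVES = {
    "%Y": "{year}",
    "%m": "{month:02d}",
    "%d": "{day:02d}",
    "%H": "{hour:02d}",
    "%M": "{minute:02d}",
    "%S": "{second:02d}",
}


def convert_strftime_format(fmt):
    """Translate a strftime pattern into a str.format template."""
    # Escape literal braces so str.format treats them as plain text.
    escaped = fmt.replace("{", "{{").replace("}", "}}")

    def replace(match):
        directive = match.group(0)
        if directive == "%%":
            return "%"
        if directive not in _DIRECTIVES:
            raise ValueError(f"unsupported directive: {directive}")
        return _DIRECTIVES[directive]

    return re.sub(r"%.", replace, escaped)


def format_all(values, fmt):
    # The pattern is processed once, outside the loop.
    template = convert_strftime_format(fmt)
    return [
        template.format(
            year=v.year, month=v.month, day=v.day,
            hour=v.hour, minute=v.minute, second=v.second,
        )
        for v in values
    ]


if __name__ == "__main__":
    print(convert_strftime_format("%Y-%m-%d %H:%M:%S"))
    print(format_all([datetime(2021, 12, 5, 7, 8, 9)], "%Y-%m-%dT%H:%M:%SZ"))
```

Locale-dependent directives such as %A or %b are deliberately left out of this sketch, since their output cannot be reproduced with a fixed numeric template.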

@smarie
Contributor

smarie commented Mar 12, 2022

@auderson, just out of curiosity: did you try running your benchmark on Windows? My first benchmark results suggest that it is even slower there (blue curve): #46116 (comment)

@auderson
Contributor Author

@smarie I ran this on a Linux Jupyter notebook.

@auderson
Contributor Author

auderson commented Mar 13, 2022

This is my result on Windows 10: a bit faster than yours (#46116 (comment)) but still much slower than on Linux.

[Figure: benchmark results on Windows 10]

EDIT

@smarie
I installed WSL on my desktop (CPU 3600X with 32 GB of 3000 MHz dual-channel RAM) and ran the benchmark again:

[Figure: benchmark results under WSL]

It looks like strftime on Windows is slower than on Linux!

@smarie
Contributor

smarie commented Mar 13, 2022

Thanks @auderson for this confirmation!

smarie pushed a commit to smarie/pandas that referenced this issue Apr 12, 2022
smarie pushed a commit to smarie/pandas that referenced this issue Apr 12, 2022
…eIndex`: string formatting is now up to 80% faster (as fast as default) when one of the default strftime formats ``"%Y-%m-%d %H:%M:%S"`` or ``"%Y-%m-%d %H:%M:%S.%f"`` is used. See pandas-dev#44764
smarie pushed a commit to smarie/pandas that referenced this issue Apr 12, 2022
…QLiteTable`): processing time arrays can be up to 65% faster ! (related to pandas-dev#44764)
@jreback jreback added this to the 1.5 milestone May 15, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@smarie smarie linked a pull request Feb 10, 2023 that will close this issue
5 tasks
4 participants