PERF: strftime is slow #44764

auderson · 2021-12-05T04:29:04Z

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I found that pd.DatetimeIndex.strftime is pretty slow when the data is large.

Below is a simple benchmark. method_b first extracts the 'year', 'month', 'day', 'hour', 'minute', and 'second' attributes, then converts them to strings with an f-string. Although it is written in pure Python, it takes significantly less time.

import time
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

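def timer(f):
    """Return a wrapper that reports the elapsed time of f in seconds."""
    def inner(*args, **kwargs):
        s = time.perf_counter()
        f(*args, **kwargs)  # the result is discarded; only the duration is returned
        e = time.perf_counter()
        return e - s
    return inner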


@timer
def method_a(index):
    return index.strftime("%Y-%m-%d %H:%M:%S")

@timer
def method_b(index):
    attrs = ('year', 'month', 'day', 'hour', 'minute', 'second')
    parts = [getattr(index, at) for at in attrs]
    b = []
    for year, month, day, hour, minute, second in zip(*parts):
        b.append(f'{year}-{month:02}-{day:02} {hour:02}:{minute:02}:{second:02}')
    b = pd.Index(b)
    return b

index = pd.date_range('2000', '2020', freq='1min')

@delayed
def profile(p):
    n = int(10 ** p)
    time_a = method_a(index[:n])
    time_b = method_b(index[:n])
    return n, time_a, time_b

records = Parallel(10, verbose=10)(profile(p) for p in np.arange(1, 7.1, 0.1))

pd.DataFrame(records, columns=['n', 'time_a', 'time_b']).set_index('n').plot(figsize=(10, 8))

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-63-generic
Version : #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.20.0
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.54.1

Prior Performance

No response


auderson · 2021-12-12T09:34:04Z

After reading the source code, I think I know where the bottleneck comes from. Internally, strftime is called on every single element, so the format string is re-evaluated for each one. If that were done just once before the loop, performance could be much better.
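To get a feel for the per-element cost outside pandas, here is a minimal standard-library sketch. It does not exercise the pandas code path; it only compares calling `datetime.strftime` on each element with a layout that is fixed in an f-string. The function names and the sample size are assumptions made for illustration, and the timings will vary by platform.

```python
import timeit
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime(2000, 1, 1)
stamps = [start + timedelta(minutes=i) for i in range(10_000)]


def per_element_strftime(values):
    # The format string is handed to strftime, and interpreted, for every element.
    return [v.strftime(FMT) for v in values]


def fixed_layout(values):
    # The layout is fixed in the f-string; only field lookups happen per element.
    return [
        f"{v.year}-{v.month:02d}-{v.day:02d} "
        f"{v.hour:02d}:{v.minute:02d}:{v.second:02d}"
        for v in values
    ]


# Both approaches should yield identical strings for this date range.
same = per_element_strftime(stamps) == fixed_layout(stamps)

t_strftime = timeit.timeit(lambda: per_element_strftime(stamps), number=5)
t_fixed = timeit.timeit(lambda: fixed_layout(stamps), number=5)
print(f"same output: {same}; strftime {t_strftime:.3f}s, fixed layout {t_fixed:.3f}s")
```

Checking that the outputs match before comparing timings keeps the comparison fair.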
I also notice format_array_from_datetime has a shortcut for format=None:

pandas/pandas/_libs/tslib.pyx

Lines 152 to 166 in 193ca73

    
    elif basic_format:
        dt64_to_dtstruct(val, &dts)
        res = (f'{dts.year}-{dts.month:02d}-{dts.day:02d} '
               f'{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}')
        if show_ns:
            ns = dts.ps // 1000
            res += f'.{ns + dts.us * 1000:09d}'
        elif show_us:
            res += f'.{dts.us:06d}'
        elif show_ms:
            res += f'.{dts.us // 1000:03d}'
        result[i] = res

Using this feature, method_c became the fastest:

@timer
def method_c(index):
    return index.strftime(None)

I suggest changing the following line:

pandas/pandas/_libs/tslib.pyx

Line 134 in 193ca73

basic_format = format is None and tz is None

to

basic_format = (format is None or format == "%Y-%m-%d %H:%M:%S") and tz is None
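Because `and` binds more tightly than `or` in Python, the grouping of such a condition decides whether the `tz is None` check also applies when `format is None`. A minimal sketch with plain functions (the names `ungrouped` and `grouped` are hypothetical and not part of pandas) shows the difference:

```python
BASIC = "%Y-%m-%d %H:%M:%S"


def ungrouped(fmt, tz):
    # Parsed as: fmt is None or (fmt == BASIC and tz is None)
    return fmt is None or fmt == BASIC and tz is None


def grouped(fmt, tz):
    # The tz check applies to both accepted format values
    return (fmt is None or fmt == BASIC) and tz is None


cases = [(None, None), (None, "UTC"), (BASIC, None), (BASIC, "UTC"), ("%Y", None)]
for fmt, tz in cases:
    print(f"fmt={fmt!r:22} tz={tz!r:6} "
          f"ungrouped={ungrouped(fmt, tz)} grouped={grouped(fmt, tz)}")
```

The two versions disagree only for `fmt=None` with a timezone set: without parentheses the fast path would be taken even though `tz` is not None.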

smarie · 2022-02-22T16:41:50Z

I confirm that we identified the same performance issue on our side, with custom formats such as "%Y-%m-%dT%H:%M:%SZ".

It would be great to improve this in a future version! Would you like us to propose a PR? If so, some guidance would be appreciated.

jreback · 2022-02-22T16:43:31Z

PRs are how things are fixed

core can provide review

smarie · 2022-02-22T17:14:50Z

I opened a draft PR. It seems to me that we could run some kind of format string processor beforehand, in order to transform strftime patterns such as %Y-%m-%d %H:%M:%S into '{dts.year}-{dts.month:02d}-{dts.day:02d} {dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}'.

I'll give it a try in the coming days.
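As a rough illustration of the format string processor idea, here is a pure-Python sketch using only the standard library. It is not the code from the PR: `strftime_to_template` and `format_all` are hypothetical names, only six numeric directives are handled, and anything else is rejected rather than guessed.

```python
from datetime import datetime

# Hypothetical mapping from a few strftime directives to str.format fields.
_DIRECTIVES = {
    "Y": "{year}",
    "m": "{month:02d}",
    "d": "{day:02d}",
    "H": "{hour:02d}",
    "M": "{minute:02d}",
    "S": "{second:02d}",
}


def strftime_to_template(fmt):
    """Translate a strftime pattern into a str.format template."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%":
            if i + 1 >= len(fmt):
                raise ValueError("dangling '%' at end of format")
            code = fmt[i + 1]
            if code == "%":
                out.append("%")  # literal percent sign
            elif code in _DIRECTIVES:
                out.append(_DIRECTIVES[code])
            else:
                raise ValueError(f"unsupported directive: %{code}")
            i += 2
        else:
            # Escape braces so literal text survives str.format
            out.append(ch.replace("{", "{{").replace("}", "}}"))
            i += 1
    return "".join(out)


def format_all(datetimes, fmt):
    template = strftime_to_template(fmt)  # translated once, outside the loop
    return [
        template.format(year=d.year, month=d.month, day=d.day,
                        hour=d.hour, minute=d.minute, second=d.second)
        for d in datetimes
    ]


sample = [datetime(2000, 1, 1), datetime(2019, 12, 31, 23, 59, 59)]
print(strftime_to_template("%Y-%m-%dT%H:%M:%SZ"))
print(format_all(sample, "%Y-%m-%dT%H:%M:%SZ"))
```

Rejecting unsupported directives lets a caller fall back to the regular strftime path, so locale-dependent directives such as %A or %b keep their existing behaviour.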

smarie · 2022-03-12T14:38:24Z

@auderson, just curious: did you try running your benchmark on Windows? From my first benchmark results, it seems to be even slower there (blue curve): #46116 (comment)

auderson · 2022-03-13T01:27:15Z

@smarie I ran this on a Linux Jupyter notebook.

auderson · 2022-03-13T02:04:08Z

This is my result on Windows 10: a bit faster than yours in #46116 (comment), but still way slower than on Linux.

EDIT

@smarie
I installed WSL on my desktop (CPU: 3600X, with 32 GB of 3000 MHz dual-channel RAM) and ran it again:

Looks like strftime on Windows is slower than on Linux!

smarie · 2022-03-13T14:34:36Z

Thanks @auderson for this confirmation!

…eIndex`: string formatting is now up to 80% faster (as fast as default) when one of the default strftime formats ``"%Y-%m-%d %H:%M:%S"`` or ``"%Y-%m-%d %H:%M:%S.%f"`` is used. See pandas-dev#44764

auderson added the "Needs Triage" (Issue that has not been reviewed by a pandas team member) and "Performance" (Memory or execution speed performance) labels Dec 5, 2021

mroeschke added the "Datetime" (Datetime data dtype) label and removed the "Needs Triage" (Issue that has not been reviewed by a pandas team member) label Dec 27, 2021

smarie mentioned this issue Feb 22, 2022

[WIP] perf improvements for strftime #46116

Closed

5 tasks

smarie pushed a commit to smarie/pandas that referenced this issue Apr 12, 2022

Performance improvement in :class:`BusinessHour`, repr is now 4 times faster! Related to pandas-dev#44764

01b9667

smarie mentioned this issue Apr 12, 2022

PERF: Some faster date/time string formatting #46759

Merged

4 tasks

smarie pushed a commit to smarie/pandas that referenced this issue Apr 12, 2022

Performance improvement in Series.to_sql and DataFrame.to_sql (`SQLiteTable`): processing time arrays can be up to 65% faster! (related to pandas-dev#44764)

54f51f0

jreback added this to the 1.5 milestone May 15, 2022

mroeschke removed this from the 1.5 milestone Aug 15, 2022

smarie linked a pull request Feb 10, 2023 that will close this issue

[READY] perf improvements for strftime #51298

Open

5 tasks

smarie mentioned this issue Feb 17, 2023

[READY] Improved performance of Period's default formatter (period_format) #51459

Merged

3 tasks

smarie mentioned this issue Apr 8, 2024

ENH: consistent strftime behaviour #58179

Open

3 tasks
