PERF: groupby is significantly slower for `DatetimeIndex` with timezone #58956

veenstrajelmer · 2024-06-07T14:07:32Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
for tz in [None, "UTC+01:00"]:
    dtindex = pd.date_range("1900-01-01","2024-01-01", freq="30min", tz=tz)
    df = pd.DataFrame(index=dtindex)
    df["values"] = 1
    print(f'deriving stats for tz={tz}')
    dtstart = pd.Timestamp.now()
    wl_count_peryear = df.groupby(pd.PeriodIndex(df.index, freq="Y"))["values"].count() # .mean() gives comparable timings
    print(f'{(pd.Timestamp.now()-dtstart).total_seconds():.2f} sec')

The code above runs a groupby with an arbitrary reduction, once without and once with a timezone on the DatetimeIndex. The timings are:

deriving stats for tz=None:
0.09 sec
deriving stats for tz=UTC+01:00:
7.72 sec

This is a significant performance difference, even though the results are identical. Is this expected, or is it a bug? I can work around it by calling df.tz_localize(None) on my timezone-aware dataframe, but it still seemed worth reporting.
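For reference, the tz_localize(None) workaround mentioned above could look roughly like the sketch below. It uses a much shorter date range than the report so that it runs quickly; the names df_naive and count_per_year are only illustrative. Note that tz_localize(None) keeps the local wall-clock times and only removes the timezone.

```python
import pandas as pd

# Small tz-aware frame, same fixed-offset zone as in the report
# (shorter range, chosen here only to keep the example fast).
idx = pd.date_range("2020-01-01", "2022-01-01", freq="30min", tz="UTC+01:00")
df = pd.DataFrame({"values": 1}, index=idx)

# Drop the timezone; the wall-clock timestamps are unchanged.
df_naive = df.tz_localize(None)

# Group by calendar year on the now tz-naive index.
count_per_year = df_naive.groupby(
    pd.PeriodIndex(df_naive.index, freq="Y")
)["values"].count()
print(count_per_year)
```

Whether dropping the timezone is acceptable depends on how the result is used afterwards.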

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : Dutch_Netherlands.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.9.0.post0
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.1
gcsfs : None
matplotlib : 3.8.4
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

Prior Performance

Tested with pandas 2.1.4 and 2.2.2; both versions behave similarly.


Liam3851 · 2024-06-07T14:23:43Z

Speed is the same both ways if you do the more idiomatic:

wl_count_peryear = df.resample('A')["values"].count()

The extra time is entirely due to the construction of the PeriodIndex:

In [192]: %time pd.PeriodIndex(df.index, freq='Y')
CPU times: total: 10.4 s
Wall time: 10.4 s

Note that the speed is recovered if you do

df.index.to_period('Y')

but this raises a warning that the timezone information is dropped (since PeriodIndex is not tz-aware).
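A rough sketch of the to_period route, assuming the dropped-timezone warning is acceptable and can be silenced. The frame below is a small stand-in for the one in the report, and the names years, via_to_period and via_ctor are only illustrative.

```python
import warnings

import pandas as pd

# Small tz-aware stand-in for the frame in the report.
idx = pd.date_range("2020-01-01", "2022-01-01", freq="30min", tz="UTC+01:00")
df = pd.DataFrame({"values": 1}, index=idx)

# to_period warns (UserWarning) that timezone information is dropped;
# suppress it only around this call.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    years = df.index.to_period("Y")

via_to_period = df.groupby(years)["values"].count()

# Slow path from the report, kept only to compare the results.
via_ctor = df.groupby(pd.PeriodIndex(df.index, freq="Y"))["values"].count()

print(via_to_period.tolist() == via_ctor.tolist())
```

On this small example both routes should give the same yearly counts; only the cost of building the grouping key differs.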

veenstrajelmer · 2024-06-07T15:01:53Z

@Liam3851 thanks a lot for clarifying this. In that case please feel free to close this issue, I will update my code accordingly.

veenstrajelmer added the Needs Triage and Performance labels Jun 7, 2024

veenstrajelmer mentioned this issue Jun 7, 2024: avoid dropping timezone of measurement dataframe (Deltares-research/kenmerkendewaarden#40, closed)

mroeschke closed this as completed Jun 7, 2024
