pd.to_datetime, unit='s' much slower for float64 than for int64 #20445

Closed
bchu opened this issue Mar 22, 2018 · 4 comments · Fixed by #35027
Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)
Milestone: 1.2
@bchu commented Mar 22, 2018

Calling pd.to_datetime with the unit='s' kwarg appears to be roughly 500x slower for float64 than for int64 (see the timings below). There is no apparent performance difference between the two dtypes if the timestamps are first converted to nanoseconds and no unit is specified.

import numpy as np
import pandas as pd

timestamp_seconds_int = pd.Series(
    np.random.randint(1521685107 - 604800, 1521685107, 1000000, dtype='int64')
)
timestamp_seconds_float = timestamp_seconds_int.astype('float64')

%%timeit -r 3
pd.to_datetime(timestamp_seconds_int, unit='s')
Output: 12.4 ms ± 1.66 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)

%%timeit -r 3
pd.to_datetime(timestamp_seconds_float, unit='s')
Output: 6.88 s ± 138 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
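
The workaround implied by that observation looks like the following (a sketch, not part of the original report; the 1e9 scale factor and the int64 cast are assumptions):

# Pre-convert float seconds to integer nanoseconds so that to_datetime
# takes the fast no-unit path. Assumes the values fit in int64 nanoseconds.
timestamp_ns = (timestamp_seconds_float * 1_000_000_000).astype('int64')
pd.to_datetime(timestamp_ns)  # same result as unit='s', to ns precision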

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+658.g17c1fadb0
pytest: 3.0.6
pip: 9.0.3
setuptools: 38.5.2
Cython: 0.28.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback (Contributor) commented Mar 22, 2018

see the logic here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L400

For integers with unit='ns' we take a fast path, essentially by astyping. However, we don't special-case ints/floats with non-ns units: we iterate over the values (in Cython), which isn't necessary for a uniform dtype (iteration is only needed for strings and mixed dtypes).

So this is conceptually very easy to fix, but it would require slightly refactoring cast_from_unit into two parts: 1) one that returns m and p (the multiplier and the rounding precision), and 2) one that uses those to replace the existing function. The reason is that you need part 1) to do the calculation over the whole array.

This is all very straightforward, but care must be taken to make sure cast_from_unit is still performant.
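
A rough sketch of that vectorized path (illustrative only; the unit table, the log10-based precision, and the rounding below are assumptions, not the eventual implementation):

import numpy as np

# Part 1): compute m (multiplier to nanoseconds) and p (sub-unit rounding
# precision) once per call. Part 2): apply them to the whole float64 array
# at once instead of looping value-by-value.
def cast_floats_from_unit(values, unit='s'):
    multipliers = {'s': 1_000_000_000, 'ms': 1_000_000, 'us': 1_000, 'ns': 1}
    m = multipliers[unit]
    p = int(np.log10(m))  # digits of sub-unit precision to keep
    base = np.floor(values)            # whole units
    frac = np.round(values - base, p)  # fractional part, rounded to p digits
    # cast base to int64 before scaling so large epoch values stay exact
    return base.astype('int64') * m + (frac * m).astype('int64')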

@jreback added the Datetime, Difficulty Intermediate, and Performance labels on Mar 22, 2018
@jreback added this to the Next Major Release milestone on Mar 22, 2018
@jreback (Contributor) commented Mar 22, 2018

We also likely don't have ASV benchmarks for this.
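
An asv benchmark covering both paths might look like this (a sketch in the style of asv_bench/benchmarks; the class and method names are placeholders, not what was eventually merged):

import numpy as np
import pandas as pd

# Parameterized over dtype so int64 and float64 timings appear side by side.
class ToDatetimeFromUnit:
    params = ['int64', 'float64']
    param_names = ['dtype']

    def setup(self, dtype):
        n = 1_000_000
        self.ser = pd.Series(np.arange(n, dtype='int64') + 1_500_000_000).astype(dtype)

    def time_to_datetime_unit_s(self, dtype):
        pd.to_datetime(self.ser, unit='s')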

@jbrockmendel added this to DTI/DTA Constructor Issues in DatetimeArray Refactor on Nov 16, 2018
@arw2019 (Member) commented Jun 20, 2020

I notice that this has been open for a while - I'm keen to work on this if there's interest!

@arw2019 (Member) commented Jun 24, 2020

Checked that this is still an issue (June 2020). I get:

%%timeit -r 3
pd.to_datetime(timestamp_seconds_int, unit='s')
Output: 44.1 ms ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

%%timeit -r 3
pd.to_datetime(timestamp_seconds_float, unit='s')
Output: 15.3 s ± 144 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Output of pd.show_versions():

INSTALLED VERSIONS

commit : 97f7918
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-106-generic
Version : #107-Ubuntu SMP Thu Jun 4 11:27:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1931.g97f791876
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : 5.18.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

I'll work on fixing this following @jreback's comments above, and add the benchmarks.

@jreback modified the milestone: Contributions Welcome → 1.2 on Sep 19, 2020