pd.to_datetime, unit='s' much slower for float64 than for int64 #20445

Closed
bchu opened this issue Mar 22, 2018 · 4 comments · Fixed by #35027
Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)
Milestone: 1.2
@bchu commented Mar 22, 2018

Calling pd.to_datetime with the unit='s' kwarg appears to be roughly 500x slower for float64 than for int64 (see the timings below). There is no apparent performance difference between the two dtypes if the timestamps are first converted to nanoseconds and no unit is specified.

import numpy as np
import pandas as pd

timestamp_seconds_int = pd.Series(
    np.random.randint(1521685107 - 604800, 1521685107, 1000000, dtype='int64')
)
timestamp_seconds_float = timestamp_seconds_int.astype('float64')

%%timeit -r 3
pd.to_datetime(timestamp_seconds_int, unit='s')
Output: 12.4 ms ± 1.66 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)

%%timeit -r 3
pd.to_datetime(timestamp_seconds_float, unit='s')
Output: 6.88 s ± 138 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
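
The workaround implied by that observation looks like the following (a sketch, not part of the original report; the 1e9 scale factor and the int64 cast are assumptions):

# Pre-convert float seconds to integer nanoseconds so that to_datetime
# takes the fast no-unit path. Assumes the values fit in int64 nanoseconds.
timestamp_ns = (timestamp_seconds_float * 1_000_000_000).astype('int64')
pd.to_datetime(timestamp_ns)  # same result as unit='s', to ns precision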

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+658.g17c1fadb0
pytest: 3.0.6
pip: 9.0.3
setuptools: 38.5.2
Cython: 0.28.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback (Contributor) commented Mar 22, 2018

see the logic here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L400

For integers with unit='ns' we take a fast path, essentially by astyping. However, we don't special-case ints/floats with non-ns units: we iterate over the values (in Cython), which isn't necessary for a uniform dtype (iteration is only needed for strings and mixed dtypes).

So this is conceptually very easy to fix, but it would require slightly refactoring cast_from_unit into two parts: 1) one that returns m and p (the multiplier and the rounding precision), and 2) one that uses those to replace the existing function. The reason is that you need part 1) to do the calculation over the whole array.

This is all very straightforward, but care must be taken to make sure cast_from_unit is still performant.
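
A rough sketch of that vectorized path (illustrative only; the unit table, the log10-based precision, and the rounding below are assumptions, not the eventual implementation):

import numpy as np

# Part 1): compute m (multiplier to nanoseconds) and p (sub-unit rounding
# precision) once per call. Part 2): apply them to the whole float64 array
# at once instead of looping value-by-value.
def cast_floats_from_unit(values, unit='s'):
    multipliers = {'s': 1_000_000_000, 'ms': 1_000_000, 'us': 1_000, 'ns': 1}
    m = multipliers[unit]
    p = int(np.log10(m))  # digits of sub-unit precision to keep
    base = np.floor(values)            # whole units
    frac = np.round(values - base, p)  # fractional part, rounded to p digits
    # cast base to int64 before scaling so large epoch values stay exact
    return base.astype('int64') * m + (frac * m).astype('int64')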

@jreback added the Datetime, Difficulty Intermediate, and Performance labels on Mar 22, 2018
@jreback added this to the Next Major Release milestone on Mar 22, 2018
@jreback (Contributor) commented Mar 22, 2018

We also likely don't have ASV benchmarks for this.
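
An asv benchmark covering both paths might look like this (a sketch in the style of asv_bench/benchmarks; the class and method names are placeholders, not what was eventually merged):

import numpy as np
import pandas as pd

# Parameterized over dtype so int64 and float64 timings appear side by side.
class ToDatetimeFromUnit:
    params = ['int64', 'float64']
    param_names = ['dtype']

    def setup(self, dtype):
        n = 1_000_000
        self.ser = pd.Series(np.arange(n, dtype='int64') + 1_500_000_000).astype(dtype)

    def time_to_datetime_unit_s(self, dtype):
        pd.to_datetime(self.ser, unit='s')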

@jbrockmendel added this to DTI/DTA Constructor Issues in DatetimeArray Refactor on Nov 16, 2018
@arw2019 (Member) commented Jun 20, 2020

I notice that this has been open for a while - I'm keen to work on this if there's interest!

@arw2019 (Member) commented Jun 24, 2020

Checked that this is still an issue (June 2020). I get:

%%timeit -r 3
pd.to_datetime(timestamp_seconds_int, unit='s')
Output: 44.1 ms ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

%%timeit -r 3
pd.to_datetime(timestamp_seconds_float, unit='s')
Output: 15.3 s ± 144 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Output of pd.show_versions():

INSTALLED VERSIONS

commit : 97f7918
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-106-generic
Version : #107-Ubuntu SMP Thu Jun 4 11:27:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1931.g97f791876
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : 5.18.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

I'll work on fixing this following @jreback's comments above, and add the benchmarks.

@jreback modified the milestone: Contributions Welcome → 1.2 on Sep 19, 2020