Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Period factorization very slow in 0.19.0 #14338

Closed
bmoscon opened this issue Oct 3, 2016 · 10 comments · Fixed by #14419
Closed

PERF: Period factorization very slow in 0.19.0 #14338

bmoscon opened this issue Oct 3, 2016 · 10 comments · Fixed by #14419
Labels
Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@bmoscon
Copy link
Contributor

bmoscon commented Oct 3, 2016

df = DataFrame(data={'data': np.random.randint(0, 100, size=5500000),
                         'date': [dt(2016, 1, 1)] * 5500000})
for period, g in df.groupby(pd.DatetimeIndex(df.date).to_period('D')):
    print(g)

Expected Output

outputs dataframe

Output of pd.show_versions()

0.19.0

The output is not the issue, the issue is that in any version before 0.19.0, this was incredibly fast, like ~1 second or less. With 0.19.0, after waiting many minutes I just give up.

@shoyer
Copy link
Member

shoyer commented Oct 3, 2016

Can you simplify the example here to the simplest possible setup? e.g., by removing the groupby?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 3, 2016

It's not the to_period that's slow, so it's probably the groupby.

In [24]: %time p = pd.DatetimeIndex(df.date).to_period('D')
CPU times: user 339 ms, sys: 6.71 ms, total: 345 ms
Wall time: 350 ms

Also, post the actual code you're running if you could (imports too).

@TomAugspurger
Copy link
Contributor

Here's a smaller example,

import time
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=100000)
t0 = time.time()
pd.factorize(p)
t1 = time.time()
print('{}: {:.2f}s'.format(pd.__version__, t1 - t0))

Some outputs:

  • 0.18.1: 0.01s
  • 0.19.0rc1: 2.96s

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 3, 2016

This is probably due to np.asarray(PeriodIndex) not returning an array of integers.

# 0.18.1
In [5]: np.asarray(p)
Out[5]: array([ 14610,  14611,  14612, ..., 114607, 114608, 114609])
# 0.19
In [4]: np.asarray(p)
Out[4]:
array([Period('2010-01-01', 'D'), Period('2010-01-02', 'D'),
       Period('2010-01-03', 'D'), ..., Period('2283-10-14', 'D'),
       Period('2283-10-15', 'D'), Period('2283-10-16', 'D')], dtype=object)

cc @sinhrks I think.

@chris-b1
Copy link
Contributor

chris-b1 commented Oct 3, 2016

Probably just need a check similar to datetimetz around here to view as an int64

https://github.com/pydata/pandas/blob/v0.19.0/pandas/core/algorithms.py#L294

@chris-b1 chris-b1 changed the title to_period now very slow in 0.19.0 PERF: Period factorization very slow in 0.19.0 Oct 3, 2016
@chris-b1 chris-b1 added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Oct 3, 2016
@chris-b1 chris-b1 added this to the 0.19.1 milestone Oct 3, 2016
@shoyer
Copy link
Member

shoyer commented Oct 3, 2016

@shoyer
Copy link
Member

shoyer commented Oct 3, 2016

@MattRijk Personally, I use SublimeText, usually just on a laptop. But this is off topic for this issue.

@jreback jreback added the Period Period data type label Oct 3, 2016
@jreback
Copy link
Contributor

jreback commented Oct 3, 2016

yeah this is a pretty easy fix, IIRC this was in @sinhrks PeriodBlock PR, but must have been backed out...something like

if is_period_dtype(values):
    values = values.view('i8')

@sinhrks
Copy link
Member

sinhrks commented Oct 4, 2016

Caused by #13988. I think the logic of period/datetimetz can be merged using needs_i8_conversion.

And the following comment is no longer correct...

@wesm
Copy link
Member

wesm commented Oct 4, 2016

Looks like a 0.19.1 may be close around the corner...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants