New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Period factorization very slow in 0.19.0 #14338

Closed
bmoscon opened this Issue Oct 3, 2016 · 10 comments

Comments

Projects
None yet
7 participants
@bmoscon

bmoscon commented Oct 3, 2016

df = DataFrame(data={'data': np.random.randint(0, 100, size=5500000),
                         'date': [dt(2016, 1, 1)] * 5500000})
for period, g in df.groupby(pd.DatetimeIndex(df.date).to_period('D')):
    print(g)

Expected Output

outputs dataframe

Output of pd.show_versions()

0.19.0

The output is not the issue, the issue is that in any version before 0.19.0, this was incredibly fast, like ~1 second or less. With 0.19.0, after waiting many minutes I just give up.

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Oct 3, 2016

Member

Can you simplify the example here to the simplest possible setup? e.g., by removing the groupby?

Member

shoyer commented Oct 3, 2016

Can you simplify the example here to the simplest possible setup? e.g., by removing the groupby?

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Oct 3, 2016

Contributor

It's not the to_period that's slow, so it's probably the groupby.

In [24]: %time p = pd.DatetimeIndex(df.date).to_period('D')
CPU times: user 339 ms, sys: 6.71 ms, total: 345 ms
Wall time: 350 ms

Also, post the actual code you're running if you could (imports too).

Contributor

TomAugspurger commented Oct 3, 2016

It's not the to_period that's slow, so it's probably the groupby.

In [24]: %time p = pd.DatetimeIndex(df.date).to_period('D')
CPU times: user 339 ms, sys: 6.71 ms, total: 345 ms
Wall time: 350 ms

Also, post the actual code you're running if you could (imports too).

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Oct 3, 2016

Contributor

Here's a smaller example,

import time
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=100000)
t0 = time.time()
pd.factorize(p)
t1 = time.time()
print('{}: {:.2f}s'.format(pd.__version__, t1 - t0))

Some outputs:

  • 0.18.1: 0.01s
  • 0.19.0rc1: 2.96s
Contributor

TomAugspurger commented Oct 3, 2016

Here's a smaller example,

import time
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=100000)
t0 = time.time()
pd.factorize(p)
t1 = time.time()
print('{}: {:.2f}s'.format(pd.__version__, t1 - t0))

Some outputs:

  • 0.18.1: 0.01s
  • 0.19.0rc1: 2.96s
@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Oct 3, 2016

Contributor

This is probably due to np.asarray(PeriodIndex) not returning an array of integers.

# 0.18.1
In [5]: np.asarray(p)
Out[5]: array([ 14610,  14611,  14612, ..., 114607, 114608, 114609])
# 0.19
In [4]: np.asarray(p)
Out[4]:
array([Period('2010-01-01', 'D'), Period('2010-01-02', 'D'),
       Period('2010-01-03', 'D'), ..., Period('2283-10-14', 'D'),
       Period('2283-10-15', 'D'), Period('2283-10-16', 'D')], dtype=object)

cc @sinhrks I think.

Contributor

TomAugspurger commented Oct 3, 2016

This is probably due to np.asarray(PeriodIndex) not returning an array of integers.

# 0.18.1
In [5]: np.asarray(p)
Out[5]: array([ 14610,  14611,  14612, ..., 114607, 114608, 114609])
# 0.19
In [4]: np.asarray(p)
Out[4]:
array([Period('2010-01-01', 'D'), Period('2010-01-02', 'D'),
       Period('2010-01-03', 'D'), ..., Period('2283-10-14', 'D'),
       Period('2283-10-15', 'D'), Period('2283-10-16', 'D')], dtype=object)

cc @sinhrks I think.

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Oct 3, 2016

Contributor

Probably just need a check similar to datetimetz around here to view as an int64

https://github.com/pydata/pandas/blob/v0.19.0/pandas/core/algorithms.py#L294

Contributor

chris-b1 commented Oct 3, 2016

Probably just need a check similar to datetimetz around here to view as an int64

https://github.com/pydata/pandas/blob/v0.19.0/pandas/core/algorithms.py#L294

@chris-b1 chris-b1 changed the title from to_period now very slow in 0.19.0 to PERF: Period factorization very slow in 0.19.0 Oct 3, 2016

@chris-b1 chris-b1 added this to the 0.19.1 milestone Oct 3, 2016

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Oct 3, 2016

Member

@MattRijk Personally, I use SublimeText, usually just on a laptop. But this is off topic for this issue.

Member

shoyer commented Oct 3, 2016

@MattRijk Personally, I use SublimeText, usually just on a laptop. But this is off topic for this issue.

@jreback jreback added the Period label Oct 3, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 3, 2016

Contributor

yeah this is a pretty easy fix, IIRC this was in @sinhrks PeriodBlock PR, but must have been backed out...something like

if is_period_dtype(values):
    values = values.view('i8')
Contributor

jreback commented Oct 3, 2016

yeah this is a pretty easy fix, IIRC this was in @sinhrks PeriodBlock PR, but must have been backed out...something like

if is_period_dtype(values):
    values = values.view('i8')
@sinhrks

This comment has been minimized.

Show comment
Hide comment
@sinhrks

sinhrks Oct 4, 2016

Member

Caused by #13988. I think the logic of period/datetimetz can be merged using needs_i8_conversion.

And the following comment is no longer correct...

Member

sinhrks commented Oct 4, 2016

Caused by #13988. I think the logic of period/datetimetz can be merged using needs_i8_conversion.

And the following comment is no longer correct...

@wesm

This comment has been minimized.

Show comment
Hide comment
@wesm

wesm Oct 4, 2016

Member

Looks like a 0.19.1 may be close around the corner...

Member

wesm commented Oct 4, 2016

Looks like a 0.19.1 may be close around the corner...

@chris-b1 chris-b1 referenced this issue Oct 5, 2016

Closed

PERF: period factorization #14348

4 of 4 tasks complete

@chris-b1 chris-b1 referenced this issue Oct 13, 2016

Merged

Period factorization #14419

4 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment