PERF: Period factorization very slow in 0.19.0 #14338

Closed
bmoscon opened this Issue Oct 3, 2016 · 10 comments

Comments

Projects
None yet
7 participants

bmoscon commented Oct 3, 2016

df = DataFrame(data={'data': np.random.randint(0, 100, size=5500000),
                         'date': [dt(2016, 1, 1)] * 5500000})
for period, g in df.groupby(pd.DatetimeIndex(df.date).to_period('D')):
    print(g)

Expected Output

outputs dataframe

Output of pd.show_versions()

0.19.0

The output is not the issue, the issue is that in any version before 0.19.0, this was incredibly fast, like ~1 second or less. With 0.19.0, after waiting many minutes I just give up.

Member

shoyer commented Oct 3, 2016

Can you simplify the example here to the simplest possible setup? e.g., by removing the groupby?

Contributor

TomAugspurger commented Oct 3, 2016 edited

It's not the to_period that's slow, so it's probably the groupby.

In [24]: %time p = pd.DatetimeIndex(df.date).to_period('D')
CPU times: user 339 ms, sys: 6.71 ms, total: 345 ms
Wall time: 350 ms

Also, post the actual code you're running if you could (imports too).

Contributor

TomAugspurger commented Oct 3, 2016

Here's a smaller example,

import time
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=100000)
t0 = time.time()
pd.factorize(p)
t1 = time.time()
print('{}: {:.2f}s'.format(pd.__version__, t1 - t0))

Some outputs:

  • 0.18.1: 0.01s
  • 0.19.0rc1: 2.96s
Contributor

TomAugspurger commented Oct 3, 2016 edited

This is probably due to np.asarray(PeriodIndex) not returning an array of integers.

# 0.18.1
In [5]: np.asarray(p)
Out[5]: array([ 14610,  14611,  14612, ..., 114607, 114608, 114609])
# 0.19
In [4]: np.asarray(p)
Out[4]:
array([Period('2010-01-01', 'D'), Period('2010-01-02', 'D'),
       Period('2010-01-03', 'D'), ..., Period('2283-10-14', 'D'),
       Period('2283-10-15', 'D'), Period('2283-10-16', 'D')], dtype=object)

cc @sinhrks I think.

Contributor

chris-b1 commented Oct 3, 2016

Probably just need a check similar to datetimetz around here to view as an int64

https://github.com/pydata/pandas/blob/v0.19.0/pandas/core/algorithms.py#L294

chris-b1 changed the title from to_period now very slow in 0.19.0 to PERF: Period factorization very slow in 0.19.0 Oct 3, 2016

chris-b1 added this to the 0.19.1 milestone Oct 3, 2016

Member

shoyer commented Oct 3, 2016

@MattRijk Personally, I use SublimeText, usually just on a laptop. But this is off topic for this issue.

jreback added the Period label Oct 3, 2016

Contributor

jreback commented Oct 3, 2016

yeah this is a pretty easy fix, IIRC this was in @sinhrks PeriodBlock PR, but must have been backed out...something like

if is_period_dtype(values):
    values = values.view('i8')
Member

sinhrks commented Oct 4, 2016

Caused by #13988. I think the logic of period/datetimetz can be merged using needs_i8_conversion.

And the following comment is no longer correct...

Owner

wesm commented Oct 4, 2016

Looks like a 0.19.1 may be close around the corner...

This was referenced Oct 5, 2016

bmoscon referenced this issue in manahl/arctic Nov 4, 2016

Closed

Backout Pandas 0.19.0 workaround #277

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment