# PERF: Period factorization very slow in 0.19.0 #14338

Closed
bmoscon opened this issue on Oct 3, 2016 · 10 comments


### bmoscon commented Oct 3, 2016

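```
from datetime import datetime as dt

import numpy as np
import pandas as pd
from pandas import DataFrame

# 5.5 million rows that all share the same date
df = DataFrame(data={'data': np.random.randint(0, 100, size=5500000),
                     'date': [dt(2016, 1, 1)] * 5500000})
for period, g in df.groupby(pd.DatetimeIndex(df.date).to_period('D')):
    print(g)
```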

#### Expected Output

outputs dataframe

#### Output of `pd.show_versions()`

0.19.0

The output is not the issue. The issue is that in any version before 0.19.0 this was incredibly fast, about 1 second or less. With 0.19.0, I gave up after waiting many minutes.

Member

### shoyer commented Oct 3, 2016

 Can you simplify the example here to the simplest possible setup? e.g., by removing the `groupby`?
Contributor

### TomAugspurger commented Oct 3, 2016 • edited Edited 1 time TomAugspurger edited Oct 3, 2016 (most recent)

It's not the `to_period` that's slow, so it's probably the groupby.

```
In [24]: %time p = pd.DatetimeIndex(df.date).to_period('D')
CPU times: user 339 ms, sys: 6.71 ms, total: 345 ms
Wall time: 350 ms
```

Also, post the actual code you're running if you could (imports too).
Contributor

### TomAugspurger commented Oct 3, 2016

Here's a smaller example:

```
import time
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=100000)

t0 = time.time()
pd.factorize(p)
t1 = time.time()
print('{}: {:.2f}s'.format(pd.__version__, t1 - t0))
```

Some outputs:

- 0.18.1: 0.01s
- 0.19.0rc1: 2.96s
Contributor

### TomAugspurger commented Oct 3, 2016 • edited Edited 1 time TomAugspurger edited Oct 3, 2016 (most recent)

This is probably due to `np.asarray(PeriodIndex)` not returning an array of integers.

```
# 0.18.1
In [5]: np.asarray(p)
Out[5]: array([ 14610,  14611,  14612, ..., 114607, 114608, 114609])
```

```
# 0.19
In [4]: np.asarray(p)
Out[4]:
array([Period('2010-01-01', 'D'), Period('2010-01-02', 'D'),
       Period('2010-01-03', 'D'), ..., Period('2283-10-14', 'D'),
       Period('2283-10-15', 'D'), Period('2283-10-16', 'D')], dtype=object)
```

cc @sinhrks I think.
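To make the int64-versus-object difference concrete, here is a minimal sketch that builds both representations explicitly and factorizes each. It assumes the `asi8` attribute of `PeriodIndex` for the integer ordinals and uses `astype(object)` to force the object array; the sample size of 1000 is arbitrary, and nothing here claims to reproduce what any particular pandas version does internally.

```python
import numpy as np
import pandas as pd

p = pd.period_range('2010-01-01', freq='D', periods=1000)

# int64 ordinals (days since 1970-01-01 for freq='D')
ordinals = p.asi8
# the same data as an object array of Period instances
objects = p.astype(object).values

# Factorizing either representation yields the same integer codes;
# the object path has to hash Python objects one by one, which is
# the kind of work the integer path avoids.
codes_int, uniques_int = pd.factorize(ordinals)
codes_obj, uniques_obj = pd.factorize(objects)

print(ordinals.dtype, objects.dtype)
print(np.array_equal(codes_int, codes_obj))
```

Wrapping each `pd.factorize` call in `time.time()` as in the example above should show how much the two paths differ on a given installation.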
Contributor

### chris-b1 commented Oct 3, 2016

 Probably just need a check similar to datetimetz around here to view as an `int64` https://github.com/pydata/pandas/blob/v0.19.0/pandas/core/algorithms.py#L294

Member

### shoyer commented Oct 3, 2016

 @MattRijk Personally, I use SublimeText, usually just on a laptop. But this is off topic for this issue.

Contributor

### jreback commented Oct 3, 2016

Yeah, this is a pretty easy fix. IIRC this was in @sinhrks PeriodBlock PR, but must have been backed out... something like

```
if is_period_dtype(values):
    values = values.view('i8')
```
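As a user-side illustration of the same idea (not the actual patch), the sketch below factorizes a `PeriodIndex` through its int64 ordinals and then maps the result back to periods. The helper name `factorize_periods` is hypothetical, and it assumes `PeriodIndex.asi8` is available. Note that a `NaT` entry would be treated as an ordinary value on this path rather than getting the `-1` sentinel.

```python
import numpy as np
import pandas as pd


def factorize_periods(index):
    """Hypothetical helper: factorize a PeriodIndex via its int64 ordinals."""
    # Factorize plain integers instead of Period objects.
    codes, _ = pd.factorize(index.asi8)
    # Codes are assigned in order of first appearance, so the first
    # position of each code recovers the unique periods in that order.
    _, first_positions = np.unique(codes, return_index=True)
    uniques = index[first_positions]
    return codes, uniques


idx = pd.PeriodIndex(['2016-01-02', '2016-01-01', '2016-01-02'], freq='D')
codes, uniques = factorize_periods(idx)
print(codes)
print(uniques)
```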


Member

### sinhrks commented Oct 4, 2016

 Caused by #13988. I think the logic of period/datetimetz can be merged using `needs_i8_conversion`. And the following comment is no longer correct...
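A rough sketch of what a single merged check could look like, written against the public API only. The function `needs_i8_view` is a hypothetical stand-in and is not pandas' internal `needs_i8_conversion`; it assumes `pd.PeriodDtype` and `pd.DatetimeTZDtype` are importable from the top-level namespace.

```python
import numpy as np
import pandas as pd


def needs_i8_view(values):
    """Hypothetical check: is the data backed by int64 under the hood?"""
    dtype = getattr(values, 'dtype', None)
    # Period and tz-aware datetime data use pandas extension dtypes.
    if isinstance(dtype, (pd.PeriodDtype, pd.DatetimeTZDtype)):
        return True
    # Naive datetime64 ('M') and timedelta64 ('m') are NumPy dtypes.
    return isinstance(dtype, np.dtype) and dtype.kind in 'mM'


print(needs_i8_view(pd.period_range('2016-01-01', periods=3, freq='D')))
print(needs_i8_view(pd.date_range('2016-01-01', periods=3, tz='UTC')))
print(needs_i8_view(np.arange(3)))
```

With one predicate like this, the period and datetimetz branches can share a single "view as `i8`, factorize, convert back" path instead of two near-identical special cases.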
Member

### wesm commented Oct 4, 2016

 Looks like a 0.19.1 may be close around the corner...