PERF: NaT groups cause wrong path in grouping #11010

Closed
jreback opened this Issue Sep 5, 2015 · 3 comments

Contributor

jreback commented Sep 5, 2015

xref #10625

before this patch:

In [1]: from string import ascii_lowercase
In [2]: np.random.seed(2718281)
In [3]: n = 1 << 21
In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')
In [5]: df = DataFrame({
   ...:         '1st':np.random.choice(list(ascii_lowercase), n),
   ...:         '2nd':np.random.randint(0, 5, n),
   ...:         '3rd':np.random.choice(dr, n)})

In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan
In [7]: gr = df.groupby(['1st', '2nd'])

In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 13.3 ms per loop

In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch:

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop

In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'

jreback added this to the 0.17.0 milestone Sep 5, 2015

Contributor

jreback commented Sep 5, 2015

Contributor

jreback commented Sep 7, 2015

After #11013 this seems ok

Is there an asv bench for this (e.g. count on datetime64 with NaT)?

In [11]: %timeit gr.count()
100 loops, best of 3: 7.16 ms per loop

In [12]: pd.__version__
Out[12]: '0.16.2+599.g33530b3'
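
A benchmark along these lines might look like the sketch below. It reuses the setup from the session above; the class name `GroupByCountNaT`, the `n` attribute, and the `time_count` method name are illustrative assumptions (asv only requires the `time_` prefix), not an existing entry in the pandas benchmark suite. It uses `freq='min'` rather than `'T'` and `pd.NaT` rather than `np.nan` so it also runs on newer pandas versions.

```python
from string import ascii_lowercase

import numpy as np
import pandas as pd


class GroupByCountNaT:
    """Hypothetical asv benchmark: groupby count on datetime64 with NaT."""

    # Same size as in the session above; override for quick local runs.
    n = 1 << 21

    def setup(self):
        n = self.n
        rng = np.random.RandomState(2718281)
        dr = pd.date_range('2015-08-30', periods=n // 10, freq='min')
        self.df = pd.DataFrame({
            '1st': rng.choice(list(ascii_lowercase), n),
            '2nd': rng.randint(0, 5, n),
            '3rd': rng.choice(dr.values, n),
        })
        # Knock out roughly 10% of the datetime column with NaT.
        self.df.loc[rng.choice(n, n // 10), '3rd'] = pd.NaT
        self.gr = self.df.groupby(['1st', '2nd'])

    def time_count(self):
        # asv times any method whose name starts with "time_".
        self.gr.count()
```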

Contributor

behzadnouri commented Sep 7, 2015

It is only OK for count since I removed the Cython wrapper. It still breaks the other cythonized methods.
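
For context, here is a minimal sketch of why NaT needs special handling in int64-based code paths. It only illustrates the general idea that NaT is stored as the minimum int64 value in datetime64 data; it is not the code from #11013 or #11023, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(['2015-09-05', None, '2015-09-07']))

# View the datetime64 values as raw int64; NaT shows up as a sentinel.
as_int = s.values.view('i8')
nat_sentinel = np.iinfo(np.int64).min

# Comparing against the sentinel gives the same mask as isna().
mask = as_int == nat_sentinel
print(mask.tolist())
print(int((~mask).sum()), s.count())
```

A routine that treats the int64 view as ordinary integers would count or aggregate the sentinel as if it were a valid value, so a datetime-aware path has to mask it out first.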

@jreback jreback added a commit to jreback/pandas that referenced this issue Sep 7, 2015

@jreback jreback PERF: use NaT comparisons in int64/datetimelikes #11010 c187ac9

jreback closed this in #11023 Sep 8, 2015

@jreback jreback added a commit that referenced this issue Sep 8, 2015

@jreback jreback Merge pull request #11023 from jreback/nat
PERF: use NaT comparisons in int64/datetimelikes #11010
76a4d99