improves groupby.get_group_index when shape is a long sequence #11180

behzadnouri · 2015-09-23T22:56:20Z

In [5]: df = DataFrame(np.random.randn(5000, 100).astype(str))

In [6]: %timeit df.duplicated()
1 loops, best of 3: 151 ms per loop

In [7]: %timeit df.T.duplicated()
1 loops, best of 3: 1.39 s per loop

part of the slowdown comes from taking the transpose itself (maybe cache locality), i.e. the frame below performs better even though it has the same shape as df.T above:

In [8]: df = DataFrame(np.random.randn(100, 5000).astype(str))

In [9]: %timeit df.duplicated()
1 loops, best of 3: 965 ms per loop
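
For anyone wanting to repeat the comparison outside IPython, here is a minimal sketch using the standard `timeit` module. The frame sizes are scaled down so it finishes quickly, and the names `tall`, `wide`, and `best_of` are illustrative assumptions, not part of this PR. Absolute timings will differ by machine and pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Same layouts as in the session above, scaled down (assumed sizes).
tall = pd.DataFrame(rng.randn(500, 20).astype(str))  # many rows, few columns
wide = pd.DataFrame(rng.randn(20, 500).astype(str))  # few rows, many columns


def best_of(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


timings = {
    "tall.duplicated()": best_of(lambda: tall.duplicated()),
    # Includes the cost of the transpose, as in `df.T.duplicated()` above.
    "tall.T.duplicated()": best_of(lambda: tall.T.duplicated()),
    # Same shape as tall.T, but built directly without a transpose.
    "wide.duplicated()": best_of(lambda: wide.duplicated()),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e3:.1f} ms")
```

Comparing the last two entries separates the cost of the transpose from the cost of checking duplicates on a frame with many columns.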

jreback · 2015-09-24T00:53:46Z

are there asv benches for this?

jreback · 2015-09-24T10:36:55Z

can you add a doc-note in the performance section as well. thxs.

samuelclark · 2015-09-24T11:48:26Z

I tested this fix on the same dataframe and it looks like it solves the problem

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB


In [9]: %timeit -n 3 df.T.duplicated()
3 loops, best of 3: 549 ms per loop

There is still a slight regression from 0.12.0 but it is minimal. Thanks for fixing this.

behzadnouri · 2015-09-24T12:00:56Z

there already is a frame_duplicated asv benchmark.

added the doc note.

jorisvandenbossche · 2015-09-24T12:34:29Z

@behzadnouri maybe add to that benchmark a case with the transposed frame? (to catch this case with many columns)
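
A rough sketch of what such a case could look like, following the asv convention that `setup` runs before each benchmark and methods prefixed with `time_` are timed. The class name, method names, and frame size here are hypothetical and are not taken from the existing `frame_duplicated` benchmark.

```python
import numpy as np
import pandas as pd


class FrameDuplicatedWide:
    # Hypothetical asv-style benchmark class; names and sizes are
    # illustrative assumptions, not the benchmark in the pandas repo.

    def setup(self):
        rng = np.random.RandomState(1234)
        self.df = pd.DataFrame(rng.randn(1000, 100).astype(str))
        # Transposed frame: few rows, many columns.
        self.df_t = self.df.T

    def time_frame_duplicated(self):
        # Baseline: many rows, few columns.
        self.df.duplicated()

    def time_frame_duplicated_wide(self):
        # Many columns, i.e. a long `shape` sequence for get_group_index.
        self.df_t.duplicated()
```

Building the transposed frame in `setup` keeps the transpose out of the timed region, so the benchmark isolates the many-columns path.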

jreback · 2015-09-25T12:19:03Z

merged via 3fb802a

thanks!

jreback added the Performance (Memory or execution speed performance) label Sep 24, 2015

behzadnouri force-pushed the i8-cut-off branch from 0e352b9 to efba516 Compare September 24, 2015 01:34

jreback added this to the 0.17.0 milestone Sep 24, 2015

behzadnouri force-pushed the i8-cut-off branch from efba516 to 810a702 Compare September 24, 2015 11:27

behzadnouri force-pushed the i8-cut-off branch from 810a702 to cdef706 Compare September 25, 2015 11:43

improves groupby.get_group_index when shape is a long sequence

a7e644e

behzadnouri force-pushed the i8-cut-off branch from cdef706 to a7e644e Compare September 25, 2015 12:05

jreback closed this Sep 25, 2015

jreback mentioned this pull request Sep 25, 2015

duplicated() performance and bug on long rows regression from 0.15.2->0.16.0 #10161

Closed

behzadnouri deleted the i8-cut-off branch September 26, 2015 13:15
