improves groupby.get_group_index when shape is a long sequence #11180

Closed
wants to merge 1 commit into from


behzadnouri
Contributor

closes #10161

xref #10161 (comment)

In [5]: df = DataFrame(np.random.randn(5000, 100).astype(str))

In [6]: %timeit df.duplicated()
1 loops, best of 3: 151 ms per loop

In [7]: %timeit df.T.duplicated()
1 loops, best of 3: 1.39 s per loop

Part of this is due to the transpose itself (possibly cache locality): the frame below performs better even though it has the same shape as df.T above:

In [8]: df = DataFrame(np.random.randn(100, 5000).astype(str))

In [9]: %timeit df.duplicated()
1 loops, best of 3: 965 ms per loop
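
The comparison above can be reproduced outside IPython with a small script. This is only a sketch: the helper names `time_duplicated` and `compare_layouts` are made up for illustration, the default sizes are scaled down so it runs quickly, and the absolute timings will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


def time_duplicated(frame, repeat=3):
    """Return the best wall-clock time (in seconds) of frame.duplicated()."""
    # number=1 runs a single call per measurement; min() keeps the best run,
    # which mirrors the "best of 3" figure reported by %timeit.
    return min(timeit.repeat(frame.duplicated, number=1, repeat=repeat))


def compare_layouts(n_rows=500, n_cols=20, seed=0):
    """Time duplicated() on a tall frame, its transpose, and a natively wide frame.

    Hypothetical helper: the sizes are assumptions chosen for a quick run,
    not the 5000 x 100 frame used in the session above.
    """
    rng = np.random.RandomState(seed)
    tall = pd.DataFrame(rng.randn(n_rows, n_cols).astype(str))
    transposed = tall.T  # same shape as `wide`, but produced by a transpose
    wide = pd.DataFrame(rng.randn(n_cols, n_rows).astype(str))
    return {
        "tall": time_duplicated(tall),
        "transposed": time_duplicated(transposed),
        "wide": time_duplicated(wide),
    }


timings = compare_layouts()
for name, seconds in timings.items():
    print("{:<10s} {:.4f} s".format(name, seconds))
```

Passing `n_rows=5000, n_cols=100` should match the shapes used in the session above, at the cost of a longer run.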

@jreback
Contributor

jreback commented Sep 24, 2015

are there asv benches for this?

@jreback added the Performance (memory or execution speed performance) label Sep 24, 2015
@jreback added this to the 0.17.0 milestone Sep 24, 2015
@jreback
Contributor

jreback commented Sep 24, 2015

can you add a doc note in the performance section as well? thanks.

@samuelclark

I tested this fix on the same dataframe and it looks like it solves the problem:

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB


In [9]: %timeit -n 3 df.T.duplicated()
3 loops, best of 3: 549 ms per loop

There is still a slight regression from 0.12.0 but it is minimal. Thanks for fixing this.

@behzadnouri
Contributor Author

there already is a frame_duplicated asv benchmark.

added the doc note.

@jorisvandenbossche
Member

@behzadnouri maybe add a case with the transposed frame to that benchmark? (to catch this case with many columns)
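
For illustration, such a case might look roughly like the following asv-style benchmark. This is a sketch under assumptions, not the benchmark in the pandas repository: the class name `FrameDuplicatedWide`, the method names, and the frame size are all hypothetical. asv discovers plain classes with a `setup` method and `time_*` methods, so no asv import is needed.

```python
import numpy as np
import pandas as pd


class FrameDuplicatedWide:
    """Hypothetical asv benchmark: duplicated() on a tall frame and its transpose."""

    def setup(self):
        # Assumed sizes; the transposed frame has few rows and many columns,
        # which is the case the suggestion above is meant to catch.
        rng = np.random.RandomState(1234)
        self.df = pd.DataFrame(rng.randn(1000, 100).astype(str))
        self.df_wide = self.df.T

    def time_frame_duplicated(self):
        # Baseline: many rows, few columns.
        self.df.duplicated()

    def time_frame_duplicated_wide(self):
        # Transposed case: few rows, many columns.
        self.df_wide.duplicated()
```

Keeping both cases in one class lets the two timings be compared directly in the asv report.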

@jreback
Contributor

jreback commented Sep 25, 2015

merged via 3fb802a

thanks!

Labels
Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

duplicated() performance and bug on long rows regression from 0.15.2->0.16.0
4 participants