PERF: categorical value_counts can be much faster #10804

Closed
jreback opened this issue Aug 12, 2015 · 0 comments · Fixed by #10874
Labels: Categorical (Categorical Data Type), Performance (Memory or execution speed performance)
jreback commented Aug 12, 2015

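The internal implementation of Categorical.value_counts should just do what the result1 expression below does: count the integer codes once and map those counts back onto the categories. I think it currently factorizes multiple times when that is not necessary.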

In [32]: np.random.seed(1234)

In [33]: n = 500000

In [34]: u = int(0.1*n)

In [35]: arr = [ "s%04d" % i for i in np.random.randint(0,u,size=n) ]

In [36]: c = pd.Series(arr).astype('category')                              

In [37]: result1 = Series(np.arange(len(c.cat.categories)),c.cat.categories).map(c.cat.codes.value_counts()).order(ascending=False)

In [38]: result2 = c.value_counts()

In [39]: %timeit Series(np.arange(len(c.cat.categories)),c.cat.categories).map(c.cat.codes.value_counts()).order(ascending=False)
100 loops, best of 3: 17.2 ms per loop

In [40]: %timeit c.value_counts()
10 loops, best of 3: 62.3 ms per loop

In [41]: result1.equals(result2)
Out[41]: True
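
The same idea can be sketched against a current pandas, where Series.order is no longer available and sort_values takes its place. This is only an illustration of counting the integer codes directly, not the change merged in #10874. The helper name value_counts_from_codes is hypothetical, and the use of np.bincount is an assumption of this sketch rather than something taken from the issue.

```python
import numpy as np
import pandas as pd


def value_counts_from_codes(cat_series):
    """Count a categorical Series via its integer codes (hypothetical helper)."""
    codes = cat_series.cat.codes.to_numpy()
    categories = cat_series.cat.categories
    # A code of -1 marks a missing value; exclude those before counting.
    valid = codes[codes >= 0]
    # One pass over the codes; minlength keeps categories that never occur.
    counts = np.bincount(valid, minlength=len(categories))
    # Label the counts with the categories and sort by descending count.
    return pd.Series(counts, index=categories).sort_values(ascending=False)


# Same data as in the session above.
np.random.seed(1234)
n = 500000
u = int(0.1 * n)
arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]
c = pd.Series(arr).astype("category")

fast = value_counts_from_codes(c)
```

The order of categories with equal counts may differ from what c.value_counts() returns, so any comparison between the two should align on the index first.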

jreback added the Categorical label on Aug 12, 2015
jreback added this to the Next Major Release milestone on Aug 12, 2015
jreback added the Performance label on Aug 12, 2015
jreback modified the milestones: 0.17.0, Next Major Release on Aug 21, 2015