Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse get dummies perf #21997

Merged
merged 4 commits into from Jul 20, 2018

Conversation

Projects
None yet
4 participants
@TomAugspurger
Copy link
Contributor

commented Jul 20, 2018

Previously, we did a scalar elem == -1 for every element in the ndarray.

This replaces that check with a vectorized array == -1.

Running the ASV now. In the meantime, here's a simple timeit on the same problem

# HEAD
In [3]: %timeit pd.get_dummies(s, sparse=True)
561 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Master
In [3]: %timeit pd.get_dummies(s, sparse=True)
        2.18 s ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TomAugspurger added some commits Jul 20, 2018

@pep8speaks

This comment has been minimized.

Copy link

commented Jul 20, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 20, 2018 at 15:48 Hours UTC

TomAugspurger added some commits Jul 20, 2018

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jul 20, 2018

Here's the ASV (only a 3x speedup).

[100.00%] ··· Running reshape.GetDummies.time_get_dummies_1d_sparse                                                                                    1.79s       before           after         ratio
     [272bbdc7]       [bc658b03]
-           1.79s       1.05±0.02s     0.59  reshape.GetDummies.time_get_dummies_1d_sparse

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
dtype=pd.api.types.CategoricalDtype(categories))
self.s = s

def time_get_dummies_1d(self):

This comment has been minimized.

Copy link
@mroeschke

mroeschke Jul 20, 2018

Member

Small nit: you can param over sparce=False/True

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 20, 2018

Author Contributor

I have a slight preference for leaving them separate, since they're such distinct code paths and it's a tad easier to run just sparse with this layout. Happy to change if you feel strongly about this.

This comment has been minimized.

Copy link
@mroeschke

mroeschke Jul 20, 2018

Member

Sounds good, no strong preference to use params then.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2018

lgtm.

@jreback jreback merged commit 322dbf4 into pandas-dev:master Jul 20, 2018

2 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2018

thanks!

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this pull request Aug 12, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.