Sparse get dummies perf #21997

TomAugspurger · 2018-07-20T15:47:29Z

Previously, we did a scalar elem == -1 for every element in the ndarray.

This replaces that check with a vectorized array == -1.
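As a rough illustration of the difference (a minimal sketch, not the actual pandas code): the array name `codes` and both helper functions below are hypothetical, and `-1` is assumed to be the sentinel marking a missing category code.

```python
import numpy as np


def mask_scalar(codes):
    # Hypothetical "before": one Python-level comparison per element.
    return np.array([code == -1 for code in codes], dtype=bool)


def mask_vectorized(codes):
    # Hypothetical "after": a single NumPy comparison over the whole array.
    return codes == -1


# Illustrative input only; -1 stands in for a missing value.
codes = np.array([0, 2, -1, 1, -1, 0])

# Both approaches produce the same boolean mask.
assert (mask_scalar(codes) == mask_vectorized(codes)).all()
```

The two functions return the same mask; the vectorized form simply moves the loop out of the Python interpreter and into NumPy.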

Running the ASV now. In the meantime, here's a simple timeit on the same problem:

# HEAD
In [3]: %timeit pd.get_dummies(s, sparse=True)
561 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Master
In [3]: %timeit pd.get_dummies(s, sparse=True)
2.18 s ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pep8speaks · 2018-07-20T15:47:32Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 20, 2018 at 15:48 Hours UTC

TomAugspurger · 2018-07-20T16:47:40Z

Here's the ASV (only about a 1.7x speedup).

[100.00%] ··· Running reshape.GetDummies.time_get_dummies_1d_sparse    1.79s
       before           after         ratio
     [272bbdc7]       [bc658b03]
-           1.79s       1.05±0.02s     0.59  reshape.GetDummies.time_get_dummies_1d_sparse

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

mroeschke · 2018-07-20T17:33:01Z

asv_bench/benchmarks/reshape.py

+                      dtype=pd.api.types.CategoricalDtype(categories))
+        self.s = s
+
+    def time_get_dummies_1d(self):


mroeschke · 2018-07-20

Small nit: you can param over sparse=False/True.

TomAugspurger · 2018-07-20

I have a slight preference for leaving them separate, since they're such distinct code paths and it's a tad easier to run just sparse with this layout. Happy to change if you feel strongly about this.

mroeschke · 2018-07-20

Sounds good, no strong preference to use params then.
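For reference, the parametrized alternative discussed in this thread might look roughly like the sketch below. It relies on ASV's `params` / `param_names` class attributes; the class name `GetDummiesParam`, the category list, and the series size are assumptions for illustration and are not taken from the PR.

```python
import string

import numpy as np
import pandas as pd


class GetDummiesParam:
    # ASV runs setup() and each time_* method once per value in params.
    params = [False, True]
    param_names = ["sparse"]

    def setup(self, sparse):
        # Hypothetical data: 12 categories and 100,000 rows (assumed sizes).
        categories = list(string.ascii_letters[:12])
        rng = np.random.default_rng(0)
        self.s = pd.Series(
            rng.choice(categories, size=100_000),
            dtype=pd.api.types.CategoricalDtype(categories),
        )

    def time_get_dummies_1d(self, sparse):
        # The same call exercises the dense or the sparse code path.
        pd.get_dummies(self.s, sparse=sparse)
```

The trade-off raised above still applies: separate methods make it easy to select only the sparse benchmark by name, while parametrizing keeps a single method for both code paths.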

jreback · 2018-07-20T20:45:47Z

lgtm.

jreback · 2018-07-20T20:46:16Z

thanks!

TomAugspurger added 2 commits July 20, 2018 10:41

PERF

bc658b0

Whatsnew

8fda4fb

TomAugspurger added the Performance, Reshaping, and Sparse labels Jul 20, 2018

TomAugspurger added this to the 0.24.0 milestone Jul 20, 2018

TomAugspurger added 2 commits July 20, 2018 10:47

Issue number

234e9b2

lint

be8b9eb

TomAugspurger mentioned this pull request Jul 20, 2018

Case Study: Criteo dataset dask/dask-ml#295

Open

mroeschke reviewed Jul 20, 2018


jreback merged commit 322dbf4 into pandas-dev:master Jul 20, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sparse get dummies perf (pandas-dev#21997)

7468768
