BUG/PERF: Sparse get_dummies uses concat #24372

Merged
merged 4 commits into from
Dec 21, 2018


TomAugspurger
Contributor

Working around the DataFrame constructor perf issue in #24368

Fixes deprecation warnings in the ASV files so there's something to run.

Closes #24371

(cherry picked from commit f566b46)
(cherry picked from commit eb219ac)
* Preserve sparsity
* Preserve fill value
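
For context, here is a minimal sketch of the code path this PR touches. It assumes a recent pandas release where `pd.SparseDtype` and the `DataFrame.sparse` accessor are available (the sparse API looked different when this PR was written), and the input data is made up for illustration.

```python
import pandas as pd

# Illustrative input; not taken from the PR or its benchmarks.
s = pd.Series(list("abca"))

# sparse=True asks get_dummies to back each indicator column with a SparseArray.
dummies = pd.get_dummies(s, sparse=True)

# Every resulting column should report a sparse dtype.
is_sparse = [isinstance(dt, pd.SparseDtype) for dt in dummies.dtypes]

# Ratio of stored (non-fill) values to total values across the frame.
density = dummies.sparse.density

print(dummies.dtypes)
print(is_sparse, density)
```

The exact indicator dtype (boolean or unsigned integer) depends on the pandas version, so the sketch only checks that the columns are sparse.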
@pep8speaks

Hello @TomAugspurger! Thanks for submitting the PR.

TomAugspurger added this to the 0.24.0 milestone Dec 20, 2018
TomAugspurger added the Performance (Memory or execution speed performance) and Sparse (Sparse Data Type) labels Dec 20, 2018
@@ -1613,6 +1613,7 @@ Sparse
- Bug in :meth:`SparseArray.unique` not returning the unique values (:issue:`19595`)
- Bug in :meth:`SparseArray.nonzero` and :meth:`SparseDataFrame.dropna` returning shifted/incorrect results (:issue:`21172`)
- Bug in :meth:`DataFrame.apply` where dtypes would lose sparseness (:issue:`23744`)
- Bug in :func:`concat` when concatenating a list of :class:`Series` with all-sparse values changing the ``fill_value`` and converting to a dense Series (:issue:`24371`)
Contributor Author

When the input to concat is a List[Series[Sparse]], we now return a DataFrame with sparse values. Previously this was a dense DataFrame (probably a bug), so it isn't API breaking.
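
The behavior described in this comment can be checked with a small sketch. It assumes a recent pandas release (where `pd.arrays.SparseArray` and `pd.SparseDtype` exist) rather than the 0.24 development branch, and the Series contents are invented for illustration.

```python
import pandas as pd

# Two Series backed by SparseArray, each with an explicit fill_value of 0.
a = pd.Series(pd.arrays.SparseArray([0, 1, 0, 0], fill_value=0), name="a")
b = pd.Series(pd.arrays.SparseArray([0, 0, 2, 0], fill_value=0), name="b")

# Column-wise concat of all-sparse Series, the pattern sparse get_dummies relies on.
result = pd.concat([a, b], axis=1)

# Sparsity and fill value should both survive the concat.
all_sparse = all(isinstance(dt, pd.SparseDtype) for dt in result.dtypes)
fill_values = [dt.fill_value for dt in result.dtypes]

print(result.dtypes)
print(all_sparse, fill_values)
```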

@codecov

codecov bot commented Dec 20, 2018

Codecov Report

Merging #24372 into master will decrease coverage by 49.31%.
The diff coverage is 9.09%.

@@             Coverage Diff             @@
##           master   #24372       +/-   ##
===========================================
- Coverage   92.29%   42.97%   -49.32%     
===========================================
  Files         162      162               
  Lines       51832    51836        +4     
===========================================
- Hits        47839    22279    -25560     
- Misses       3993    29557    +25564
Flag Coverage Δ
#multiple ?
#single 42.97% <9.09%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/reshape.py 13.31% <0%> (-86.24%) ⬇️
pandas/core/dtypes/concat.py 57.35% <50%> (-39.25%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 122 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f6cf7d9...6a65cbc.

@codecov

codecov bot commented Dec 20, 2018

Codecov Report

Merging #24372 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24372      +/-   ##
==========================================
+ Coverage   92.29%   92.29%   +<.01%     
==========================================
  Files         162      162              
  Lines       51832    51836       +4     
==========================================
+ Hits        47839    47843       +4     
  Misses       3993     3993
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.98% <9.09%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 97.05% <100%> (+0.45%) ⬆️
pandas/core/reshape/reshape.py 99.56% <100%> (ø) ⬆️
pandas/util/testing.py 87.57% <0%> (-0.1%) ⬇️


@TomAugspurger
Contributor Author

N.B. 6a65cbc has an API-breaking change for SparseSeries.unstack: with this PR it returns a DataFrame of sparse values instead of a SparseDataFrame.

@@ -909,7 +910,15 @@ def _make_col_name(prefix, prefix_sep, level):
index = None

if sparse:
sparse_series = {}

if is_integer_dtype(dtype):
Contributor

we have a routine in pandas.core.dtypes.missing for this already

Contributor Author

na_value_for_dtype, or something else? We need something a little different, since we want the 0 value for each dtype.

Contributor

We have that too. Let's try not to reinvent the wheel.

Contributor Author

I didn't find a function like this in any of the dtypes modules.
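
To make the request in this thread concrete, here is one way "the 0 value for each dtype" could be computed. This is only a sketch: `zero_for_dtype` is a hypothetical name, not an existing pandas function and not the code merged in this PR, and it assumes plain NumPy dtypes and a recent pandas release.

```python
import numpy as np
import pandas as pd


def zero_for_dtype(dtype):
    """Hypothetical helper: return the scalar zero of a NumPy dtype."""
    # np.dtype(...).type is the scalar constructor, e.g. np.uint8 or np.bool_.
    return np.dtype(dtype).type(0)


for dtype in (np.uint8, bool, np.float64):
    fill = zero_for_dtype(dtype)
    dense = np.array([0, 1, 0], dtype=dtype)
    # Use the dtype's own zero as fill_value so only the non-zero entry is stored.
    arr = pd.arrays.SparseArray(dense, fill_value=fill)
    print(arr.dtype, arr.npoints)
```

This differs from `na_value_for_dtype`, mentioned above, which returns a missing-value marker rather than a zero.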

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 20, 2018 via email

@TomAugspurger
Contributor Author

I think this should be merged soon if possible. The CI failures are blocking #20796 and the explode PR.

I haven't made too much progress on fixing #24368 properly. Too many edge cases in our constructors.

jreback merged commit 0bb3772 into pandas-dev:master Dec 21, 2018
TomAugspurger deleted the sparse-perf branch January 2, 2019 20:17
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Labels
Performance Memory or execution speed performance Sparse Sparse Data Type
3 participants