Deprecate SparseDataFrame and SparseSeries #26137

TomAugspurger · 2019-04-18T18:32:25Z

Closes #19239

This currently includes the changes from #25682, which I think is mergeable.

I think this would be good to have for 0.25.0. I think it's close, but I may not have time to push this across the finish line. Anyone interested in finishing it off?

commit 8b136bf
Merge: 3005aed 01d3dc2
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Mar 15 16:03:23 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 3005aed
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Mar 14 06:26:32 2019 -0500

    isort?

commit 318c06f
Merge: 0922296 79205ea
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Mar 14 06:25:45 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 0922296
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 21:35:51 2019 -0500

    updates

commit f433be8
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 20:54:07 2019 -0500

    lint

commit 6696f28
Merge: 534a379 1017382
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 20:53:13 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 534a379
Merge: 94a7baf 5c341dc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 14:37:27 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 94a7baf
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 14:22:48 2019 -0500

    fixups

commit 6f619b5
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 13:38:48 2019 -0500

    32-bit compat

commit 24f48c3
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Mar 11 22:05:46 2019 -0500

    API: DataFrame.sparse accessor

    Closes pandas-dev#25681

pandas/util/testing.py

pandas/core/sparse/series.py

codecov · 2019-05-14T14:49:59Z

Codecov Report

Merging #26137 into master will decrease coverage by 0.57%.
The diff coverage is 30.43%.

@@            Coverage Diff             @@
##           master   #26137      +/-   ##
==========================================
- Coverage   41.31%   40.73%   -0.58%     
==========================================
  Files         174      175       +1     
  Lines       50749    52432    +1683     
==========================================
+ Hits        20968    21360     +392     
- Misses      29781    31072    +1291

Flag	Coverage Δ
#single	`40.73% <30.43%> (-0.58%)`	⬇️

Impacted Files	Coverage Δ
pandas/util/testing.py	`49.31% <100%> (+0.1%)`	⬆️
pandas/core/sparse/series.py	`44.24% <100%> (+0.49%)`	⬆️
pandas/core/frame.py	`34.57% <100%> (-0.12%)`	⬇️
pandas/core/arrays/sparse.py	`38.9% <21.05%> (+0.01%)`	⬆️
pandas/core/sparse/frame.py	`28.76% <60%> (+0.49%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-48.69%)`	⬇️
pandas/core/arrays/array_.py	`15.55% <0%> (-22.23%)`	⬇️
pandas/core/sorting.py	`22.22% <0%> (-4.12%)`	⬇️
pandas/io/formats/format.py	`30.27% <0%> (-4.03%)`	⬇️
pandas/core/indexes/timedeltas.py	`44.38% <0%> (-3.07%)`	⬇️
... and 81 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 612c244...836d19b. Read the comment docs.

codecov · 2019-05-14T14:50:03Z

Codecov Report

Merging #26137 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26137      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50639    50646       +7     
==========================================
+ Hits        46473    46476       +3     
- Misses       4166     4170       +4

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`41.69% <45.45%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`93.61% <ø> (ø)`	⬆️
pandas/core/frame.py	`97% <ø> (-0.12%)`	⬇️
pandas/core/generic.py	`93.57% <ø> (ø)`	⬆️
pandas/core/sparse/series.py	`93.24% <100%> (+0.06%)`	⬆️
pandas/core/arrays/sparse.py	`93.07% <100%> (ø)`	⬆️
pandas/core/sparse/frame.py	`95.65% <100%> (+0.01%)`	⬆️
pandas/core/sparse/scipy_sparse.py	`100% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f0a743...12d8d83. Read the comment docs.

TomAugspurger · 2019-05-15T03:07:16Z

doc/source/user_guide/sparse.rst

@@ -6,27 +6,28 @@
 Sparse data structures
 **********************

-We have implemented "sparse" versions of ``Series`` and ``DataFrame``. These are not sparse


I don't think the diff here is all that informative. I'd recommend just viewing the new file. The basic flow is

- short intro
- SparseArray / SparseDtype
- Sparse Accessors
- SparseIndex / computation
- Migration Guide
- SparseSeries / SparseDataFrame.

TomAugspurger · 2019-05-15T14:06:10Z

Note: we still have some warnings leaking through on some of the CI jobs (just not numpydev). Trying to track those down.

TomAugspurger · 2019-05-15T18:53:09Z

I think I got all the warnings... I added a global filterwarnings to our setup.cfg https://github.com/pandas-dev/pandas/pull/26137/files#diff-380c6a8ebbbce17d55d50ef17d3cf906. This proved helpful in tracking them down. Are people OK with keeping that there? As an aside, I couldn't get the syntax for "raise on all warnings from pandas" to work. In theory error:::pandas[.*] should do it, but that was still elevating warnings from other packages.
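For reference, the intent of a module-scoped filter can be sketched with the standard-library `warnings` module alone. This is only an illustration, not the setup.cfg entry from this PR: the pattern `pandas(\..*)?$` and the helper `is_elevated` are assumptions made for the example, and the sketch does not try to explain why the pytest filter above also elevated warnings from other packages. The filter matches the module a warning is attributed to, which is passed explicitly here.

```python
import warnings

# Hypothetical pattern: the top-level package or any of its submodules.
PANDAS_MODULES = r"pandas(\..*)?$"


def is_elevated(module_name):
    """Return True if a FutureWarning attributed to module_name becomes an error."""
    with warnings.catch_warnings():
        # Baseline: silence everything, then elevate only the matching modules.
        warnings.simplefilter("ignore")
        warnings.filterwarnings("error", module=PANDAS_MODULES)
        try:
            warnings.warn_explicit(
                "demo warning", FutureWarning, "demo.py", 1, module=module_name
            )
        except FutureWarning:
            return True
        return False
```

With these filters, a warning attributed to `pandas.core.frame` is raised as an error, while one attributed to `numpy.core` or `pandas_gbq` is ignored.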

jreback · 2019-05-15T21:38:17Z

will have a look soon

doc/source/user_guide/sparse.rst

jreback · 2019-05-16T00:08:20Z

doc/source/user_guide/sparse.rst


+.. code-block:: python
+
+   # Old way


use *Previous* and *New*

Will change old to previous. I think I'll keep them as comments, rather than **-style headings, since we're using ** for the subtopic (e.g. construction).

doc/source/user_guide/sparse.rst

doc/source/whatsnew/v0.25.0.rst

pandas/core/generic.py

jorisvandenbossche

@TomAugspurger thanks a lot for this!

Did a first pass, and some high level comments:

- In some older whatsnew files, we will need to add some :okwarnings: for now (see the doc build on travis).
- In the migration section, I think we also need to state some differences between the old SparseDataFrame/Series and the new way. E.g.:
  - It is no longer guaranteed that all columns are sparse. You can have a mixture.
  - Practical consequence of the above: assigning values to a new column of a "sparse" dataframe no longer automatically sparsifies it; you need to do that yourself.
  - Also related: there is no longer a default_fill_value (but if you can't assign values with automatic sparsification, this default fill value also has no use, I think, so this is not really a problem given the above).
- Might be for a different issue, but I noted this while reviewing: when having mixed sparse and non-sparse columns in a dataframe, the sparse accessor should either give a better error message (indicating that not all columns are sparse) or just work (e.g. density could in principle work for a mixture).
  - Related to that: how to convert to dense if you have a mixture?
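The points about mixed frames can be illustrated with a small sketch. It assumes a pandas version that provides `pd.SparseDtype` and the `Series.sparse` accessor; the helper `to_dense_mixed` is hypothetical and not part of pandas, it only shows one possible way to densify a frame with a mixture of sparse and dense columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, np.nan, np.nan, 1.0]})
sdf = df.astype(pd.SparseDtype("float", np.nan))

# Assigning a plain NumPy array adds a dense column; nothing sparsifies it.
sdf["B"] = np.array([1.0, np.nan, np.nan, np.nan])
is_sparse = {
    name: isinstance(dtype, pd.SparseDtype) for name, dtype in sdf.dtypes.items()
}


def to_dense_mixed(frame):
    """Hypothetical helper: densify only the sparse columns of a mixed frame."""
    out = frame.copy()
    for name, dtype in frame.dtypes.items():
        if isinstance(dtype, pd.SparseDtype):
            out[name] = frame[name].sparse.to_dense()
    return out


dense = to_dense_mixed(sdf)
```

To make the new column sparse as well, it has to be converted explicitly, for example with `sdf["B"].astype(pd.SparseDtype("float", np.nan))`.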

jorisvandenbossche · 2019-05-16T07:31:04Z

doc/source/user_guide/sparse.rst

@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``:

   df = pd.DataFrame(np.random.randn(10000, 4))
   df.iloc[:9998] = np.nan
-   sdf = df.to_sparse()
+   sdf = df.astype(pd.SparseDtype("float", np.nan))


For such a purpose, I was thinking we could also provide df.sparse.to_sparse() to convert a full DataFrame to sparse?

Makes sense to me, though perhaps as a followup? I don't plan to put more time into sparse personally.

this would have to be a DataFrame.astype('sparse') ? though I think, IOW, or is .sparse allowed on any DataFrame? the semantics are a bit odd on this

@TomAugspurger can you respond to this

I think, IOW, or is .sparse allowed on any DataFrame? the semantics are a bit odd on this

Kinda. If you just do df.sparse on a dataframe without all-sparse values, we raise

In [6]: pd.DataFrame({"A": [1, 2]}).sparse
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-ab0fb67ed650> in <module>
----> 1 pd.DataFrame({"A": [1, 2]}).sparse
...
~/sandbox/pandas/pandas/core/arrays/sparse.py in _validate(self, data)
   2119         dtypes = data.dtypes
   2120         if not all(isinstance(t, SparseDtype) for t in dtypes):
-> 2121             raise AttributeError(self._validation_msg)
   2122
   2123     @classmethod

AttributeError: Can only use the '.sparse' accessor with Sparse data.

But we also allow for pd.DataFrame.sparse.from_spmatrix.

this would have to be a DataFrame.astype('sparse')

It would be .astype('Sparse') which is shorthand for .astype(SparseDtype(float64, nan))

ok this is all fine; is there a test for using .sparse on non-any-sparse df?
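A check along those lines could look roughly like the following. This is a sketch, not the test that exists in the pandas test suite; the helper name `sparse_accessor_error` is made up for the example, and it relies only on the behaviour shown in the traceback above (an `AttributeError` when the frame is not all-sparse).

```python
import numpy as np
import pandas as pd


def sparse_accessor_error(frame):
    """Return the AttributeError message from frame.sparse, or None if it works."""
    try:
        frame.sparse
    except AttributeError as err:
        return str(err)
    return None


dense = pd.DataFrame({"A": [1, 2]})
sparse = pd.DataFrame({"A": [1.0, np.nan]}).astype(pd.SparseDtype("float", np.nan))

dense_error = sparse_accessor_error(dense)    # an error message
sparse_error = sparse_accessor_error(sparse)  # None
```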

doc/source/user_guide/sparse.rst

jorisvandenbossche · 2019-05-16T07:36:30Z

doc/source/user_guide/sparse.rst

+   arr[2:5] = np.nan
+   arr[7:8] = np.nan
+   sparr = pd.SparseArray(arr)
+   sparr


Not important for this PR, but we should actually improve the repr of SparseArray. Currently the example gives

[-2.329703982704994, -0.7776235464173905, nan, nan, nan, -0.07270483900887693, 0.4093257484722553, nan, -0.33749585746785415, 1.884146289689117]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

(so way too wide, and showing too much detail of the random numbers)

doc/source/user_guide/sparse.rst

jorisvandenbossche · 2019-05-16T08:35:52Z

doc/source/user_guide/sparse.rst


+A ``.sparse`` accessor has been added for :class:`DataFrame` as well.
+See :ref:`api.dataframe.sparse` for more.


Suggested change

-See :ref:`api.dataframe.sparse` for more.
+See :ref:`api.frame.sparse` for more.

pandas/core/generic.py

pandas/core/sparse/frame.py

pandas/core/sparse/series.py

TomAugspurger · 2019-05-16T21:17:24Z

I've updated the doc examples to all use Series[sparse], rather than SparseSeries. I've just left a note that the sparse subclasses are deprecated.

TomAugspurger · 2019-05-16T21:18:28Z

c5fa3fb also has a change to Series.sparse.from_coo. Previously that was using SparseSeries internally, and so a warning was being raised. I (lazily) applied the warnings filter to the class so it was being ignored in the test.

jreback

looks good. just a couple of points.

doc/source/user_guide/sparse.rst

pandas/core/generic.py

jreback · 2019-05-19T17:59:22Z

pandas/core/sparse/scipy_sparse.py

@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ),
    return sparse_matrix, rows, columns


-def _coo_to_sparse_series(A, dense_index=False):
+def _coo_to_sparse_series(A, dense_index=False, sparse_series=True):
    """
    Convert a scipy.sparse.coo_matrix to a SparseSeries.


can you add a doc-string here (types too if you can!)

Done. I'm not really sure on two things:

- The type for A is 'scipy.sparse.coo.coo_matrix', but we can't import sparse.
- The return type is Union[Series, SparseSeries], but importing SparseSeries would cause a circular import.

So I left types off for those.

can't you just use the string? (I think that works)

same use the string

Can you? Are these types actually checked in our CI? I'd rather not introduce invalid types.

yes they should be
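As a minimal sketch of the "use the string" suggestion: annotations written as strings are not evaluated at runtime, and a `TYPE_CHECKING` guard keeps the import visible to static checkers only. The function `coo_shape` below is a hypothetical stand-in, not the `_coo_to_sparse_series` helper from this PR, and whether a given CI setup actually checks such annotations is outside the scope of the example.

```python
from typing import TYPE_CHECKING, Tuple

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so SciPy need not be
    # importable (or imported) at runtime.
    import scipy.sparse


def coo_shape(A: "scipy.sparse.coo_matrix") -> "Tuple[int, int]":
    """Hypothetical stand-in for a helper that accepts a COO matrix."""
    return tuple(A.shape)
```

At runtime the annotations remain plain strings in `coo_shape.__annotations__`, so no import cycle or optional dependency is triggered by defining the function.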

jreback · 2019-05-19T17:59:55Z

pandas/core/sparse/scipy_sparse.py

    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
-    s = s.to_sparse()  # TODO: specify kind?
+    if sparse_series:


why exactly do you need sparse_series flag? why can't we just do the astype after calling this routine?

This is called from both Series.sparse and SparseSeries.

Previously, this went coo_matrix -> SparseSeries -> Series[sparse], which caused an undesired warning for Series.sparse.from_coo(). Once SparseSeries is gone we can remove the keyword.

ok can you add a todo about this then, this is not obvious at all
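For context, the public entry point discussed here can be exercised as below. The sketch assumes SciPy is installed and a pandas version with the `Series.sparse` accessor; the matrix values are arbitrary. The result is a regular `Series` holding sparse values rather than a `SparseSeries`.

```python
import pandas as pd
from scipy import sparse

# A small COO matrix with three stored values.
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

# Build a Series with sparse values, indexed by (row, column).
s = pd.Series.sparse.from_coo(A)
```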

mitar · 2019-05-26T02:57:42Z

doc/source/user_guide/sparse.rst


-You can apply NumPy *ufuncs* to ``SparseArray`` and get a ``SparseArray`` as a result.
+Sparse-specific properties, like ``density``, are available on the ``.sparse`` accssor.


accssor typo.

jreback · 2019-05-26T15:24:36Z

doc/source/user_guide/sparse.rst

@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``:

   df = pd.DataFrame(np.random.randn(10000, 4))
   df.iloc[:9998] = np.nan
-   sdf = df.to_sparse()
+   sdf = df.astype(pd.SparseDtype("float", np.nan))


@TomAugspurger can you respond to this

doc/source/user_guide/sparse.rst

jreback · 2019-05-26T15:26:12Z

doc/source/user_guide/sparse.rst

+Sparse Calculation
+------------------
+
+You can apply NumPy `ufuncs <https://docs.scipy.org/doc/numpy/reference/ufuncs.html>`_


is there a reason we are recommending people work directly with SparseArray? the unit of computation is generally the Series, no?

This was here before, just moved. Whether or not it makes sense, I dunno. Depends on whether or not you need / want an index I suppose.
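Both options can be sketched side by side. The example assumes a pandas version that exposes `pd.arrays.SparseArray` (the thread itself uses the older top-level `pd.SparseArray` spelling); the values are arbitrary.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, np.nan, np.nan, -2.0, np.nan])
abs_arr = np.abs(arr)  # SparseArray in, SparseArray out

ser = pd.Series(arr, index=list("abcde"))
abs_ser = np.abs(ser)  # Series in, Series out, index preserved
```

Working on the `Series` keeps the labels, while working on the array directly avoids carrying an index around.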

doc/source/user_guide/sparse.rst

jreback · 2019-05-26T15:27:37Z

doc/source/user_guide/sparse.rst

-~~~~~~~~~~~~
-
-A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
+:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.


:class:`MultiIndex`

do we have the doc inventory for scipy? can you add a reference to coo_matrix?

We do have SciPy in our intersphinx.
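A short usage sketch of `Series.sparse.to_coo`, assuming SciPy is installed. The data and level names are illustrative; the level-to-axis split (`A`, `B` as rows and `C`, `D` as columns) is just one possible choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [
        (1, 2, "a", 0),
        (1, 2, "a", 1),
        (1, 1, "b", 0),
        (1, 1, "b", 1),
        (2, 1, "b", 0),
        (2, 1, "b", 1),
    ],
    names=["A", "B", "C", "D"],
)
ss = s.astype(pd.SparseDtype("float64", np.nan))

# Returns the COO matrix plus the row and column labels it was built from.
A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)
```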

jreback · 2019-05-26T15:29:14Z

pandas/core/sparse/scipy_sparse.py

@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ),
    return sparse_matrix, rows, columns


-def _coo_to_sparse_series(A, dense_index=False):
+def _coo_to_sparse_series(A, dense_index=False, sparse_series=True):
    """
    Convert a scipy.sparse.coo_matrix to a SparseSeries.


can't you just use the string? (I think that works)

same use the string

jreback · 2019-05-26T15:29:54Z

pandas/core/sparse/scipy_sparse.py

    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
-    s = s.to_sparse()  # TODO: specify kind?
+    if sparse_series:


ok can you add a todo about this then, this is not obvious at all

jreback · 2019-05-26T15:31:08Z

pandas/tests/arrays/sparse/test_array.py

@@ -215,6 +215,7 @@ def test_scalar_with_index_infer_dtype(self, scalar, dtype):
        assert exp.dtype == dtype

    @pytest.mark.parametrize("fill", [1, np.nan, 0])
+    @pytest.mark.filterwarnings("ignore:Sparse:FutureWarning")


I think you don't need these as a prior PR added this to setup.cfg

The setup.cfg has an error::: config to elevate unhandled warnings to errors. We still need these otherwise the tests would fail.

We have a single test asserting that SparseSeries.__init__ warns, and explicitly ignore the warnings elsewhere.
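The pattern described here (one place asserts the warning, everything else ignores it explicitly while other warnings are errors) can be sketched with the standard library alone. `LegacySparseThing` is a hypothetical class, not pandas code, and the stdlib filters stand in for the pytest marker and setup.cfg entry used in the PR. Note that the `message` filter is matched against the start of the warning text.

```python
import warnings


class LegacySparseThing:
    """Hypothetical deprecated class that warns on construction."""

    def __init__(self):
        warnings.warn(
            "SparseThing is deprecated and will be removed in a future version.",
            FutureWarning,
            stacklevel=2,
        )


def construct_quietly():
    """Elevate warnings to errors, but ignore the known deprecation."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        warnings.filterwarnings("ignore", message="Sparse", category=FutureWarning)
        return LegacySparseThing()


def construct_and_collect():
    """Record the categories of warnings emitted during construction."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LegacySparseThing()
    return [w.category for w in caught]
```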

jreback · 2019-05-29T02:15:39Z

thanks @TomAugspurger

jorisvandenbossche

There are still a bunch of :okwarnings: needed in older whatsnew files.

jorisvandenbossche · 2019-05-29T07:00:11Z

doc/source/user_guide/sparse.rst

-~~~~~~~~~~~~~~~
-
-.. versionadded:: 0.20.0
+Use :meth:`DataFrame.sparse.from_coo` to create a ``DataFrame`` with sparse values from a sparse matrix.


This should be from_spmatrix ?
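For comparison, the `DataFrame` constructor mentioned earlier in the thread is `DataFrame.sparse.from_spmatrix`. A minimal sketch, assuming SciPy is installed; the identity matrix is just convenient example input.

```python
import pandas as pd
from scipy import sparse

mat = sparse.eye(3, format="csr")

# Each column of the resulting DataFrame holds sparse values.
df = pd.DataFrame.sparse.from_spmatrix(mat)
density = df.sparse.density
```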

simonjayhawkins · 2019-05-29T13:51:03Z

pandas/core/sparse/series.py

 class SparseSeries(Series):
    """Data structure for labeled, sparse floating point data

+    .. deprectaed:: 0.25.0


TomAugspurger · 2019-05-29T14:00:27Z

Thanks. I'll make a PR with these doc updates.

…

On Wed, May 29, 2019 at 8:51 AM Simon Hawkins wrote:

In pandas/core/sparse/series.py:

 class SparseSeries(Series):
    """Data structure for labeled, sparse floating point data

+    .. deprectaed:: 0.25.0

typo

simonjayhawkins · 2019-05-29T14:06:24Z

pandas/core/sparse/frame.py



 class SparseDataFrame(DataFrame):
    """
    DataFrame containing sparse floating point data in the form of SparseSeries
    objects

+    .. deprectaed:: 0.25.0


TomAugspurger added 2 commits April 18, 2019 13:28

DEPR: Deprecate SparseSeries and SparseDataFrame

c32e5ff

TomAugspurger commented Apr 18, 2019

pandas/util/testing.py (outdated)

pandas/core/sparse/series.py (outdated)

gfyoung added the Deprecate and Sparse labels Apr 18, 2019

TomAugspurger mentioned this pull request May 14, 2019

API: DataFrame.sparse accessor #25682

Merged

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

836d19b

fixup

c0d6cf2

jorisvandenbossche mentioned this pull request May 14, 2019

SparseDataFrame.to_parquet fails with new error #26378

Closed

TomAugspurger added 9 commits May 14, 2019 14:47

fixup

8f06d88

fixup

380c7c0

fixup

21569e2

docs

6a81837

remove change

12a8329

fixed merge conflict

01c7710

pickle

e9b9b29

fixups

b295ce1

fixups

ccf71db

TomAugspurger commented May 15, 2019

doc lint

7e6fbd6

TomAugspurger added 4 commits May 15, 2019 10:23

fix pytables

865f1aa

temp set error

9915c48

skip doctests

30f3670

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

b043243

jreback requested changes May 16, 2019

jorisvandenbossche reviewed May 16, 2019

TomAugspurger added 4 commits May 16, 2019 11:27

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

b2aef95

fixups

706c5dc

fixup

13d30d2

updates

c5fa3fb

jreback requested changes May 19, 2019

jreback added this to the 0.25.0 milestone May 19, 2019

TomAugspurger added 4 commits May 20, 2019 14:06

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

101c425

fixups

b76745f

return

f153400

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

0c49ddc

mitar reviewed May 26, 2019

jreback requested changes May 26, 2019

TomAugspurger added 3 commits May 28, 2019 09:08

fixups

1903f67

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

0b03ac2

Merge remote-tracking branch 'upstream/master' into depr-sparse-depr

12d8d83

jreback approved these changes May 29, 2019

jreback merged commit e7ad884 into pandas-dev:master May 29, 2019

jsexauer mentioned this pull request May 29, 2019

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jorisvandenbossche reviewed May 29, 2019

simonjayhawkins reviewed May 29, 2019

TomAugspurger deleted the depr-sparse-depr branch May 29, 2019 14:02

simonjayhawkins reviewed May 29, 2019

jreback mentioned this pull request Nov 22, 2019

DEPR: deprecations log for removed issues #13777

Closed

