
SparseSeries.__array__ only returns non-fills #14167

Closed
jnothman opened this issue Sep 6, 2016 · 17 comments
Labels
API Design, Sparse
Comments

@jnothman
Contributor

jnothman commented Sep 6, 2016

Code Sample, a copy-pastable example if possible

>>> np.array(pd.SparseSeries([0, 1], fill_value=0))
array([1])

Expected Output

array([0, 1])

This should really be consistent with Series rather than just returning the non-fill values (i.e. rather than being equivalent to np.array(pd.SparseSeries([np.nan, 1]))).

Output of pd.show_versions()

Pandas 0.18.1 alone should be relevant.

Apologies, I've not checked whether this is fixed in master. Just passing on issues from scikit-learn/scikit-learn#7352.

@jreback
Contributor

jreback commented Sep 7, 2016

__array__ densifying is not what you would want. It must be explicit.

In [7]: pd.SparseSeries([0, 1], fill_value=0).get_values()
Out[7]: array([0, 1])

In [8]: np.array(pd.SparseSeries([0, 1], fill_value=0).to_dense())
Out[8]: array([0, 1])

@jreback jreback closed this as completed Sep 7, 2016
@jreback jreback added the Sparse and API Design labels Sep 7, 2016
@jreback jreback added this to the No action milestone Sep 7, 2016
@shoyer
Member

shoyer commented Sep 7, 2016

I agree with @jnothman -- I think this is a bug with SparseSeries. It should not implement a __array__ method like this.

@jnothman
Contributor Author

jnothman commented Sep 7, 2016

I think you're going to confuse users when they pass things to objects that expect array-likes. If you want to raise NotImplemented in __array__, that's fine; your library should be explicit. Returning something semantically obscure, as you do now, is not helpful.

@sinhrks
Member

sinhrks commented Sep 7, 2016

I understand the current implementation makes applying ufuncs a little more efficient. I agree that having np.array(SparseSeries) return the dense values is more natural from the end user's point of view.

@jnothman
Contributor Author

jnothman commented Sep 7, 2016

But correctness/appropriateness depends on which ufunc it is anyway, I think.

@shoyer
Member

shoyer commented Sep 7, 2016

@sinhrks are you saying you did this intentionally for ufunc support? This is not the right way to do that (though the right way may not be possible until/if __numpy_ufunc__).

@sinhrks
Member

sinhrks commented Sep 7, 2016

@shoyer No, the spec hasn't changed for quite a while. ufuncs were supported prior to #13853.

@shoyer
Member

shoyer commented Sep 7, 2016

OK, I am going to reopen this as something we should discuss fixing for pandas 2.0.

If it isn't possible to implement ufuncs on sparse objects without an incorrect implementation of __array__, then IMO we shouldn't implement the ufunc interface at all. This is a variation on the theme of pandas as a leaky abstraction over numpy, which is the main motivation for pandas 2.0 in the first place.
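(The __numpy_ufunc__ proposal mentioned above eventually landed in NumPy 1.13 as __array_ufunc__. A minimal sketch of how that protocol lets a sparse container apply a unary ufunc to its stored values and its fill value separately, without densifying — ToySparse is a hypothetical illustration, not pandas code:)

```python
import numpy as np

class ToySparse:
    """Hypothetical sparse container: stores only the non-fill values."""
    def __init__(self, sp_values, fill_value=0):
        self.sp_values = np.asarray(sp_values)
        self.fill_value = fill_value

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Handle only plain unary calls on this object; defer everything else.
        if method != "__call__" or len(inputs) != 1 or inputs[0] is not self:
            return NotImplemented
        # Apply the ufunc to the stored values and the fill value separately,
        # so the sparse structure (number of stored points) stays intact.
        return ToySparse(ufunc(self.sp_values, **kwargs),
                         fill_value=ufunc(self.fill_value))

s = ToySparse([-1, 2], fill_value=-3)
r = np.abs(s)
print(r.sp_values, r.fill_value)  # [1 2] 3
```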

@shoyer shoyer reopened this Sep 7, 2016
@jreback jreback modified the milestones: 1.0, No action Sep 7, 2016
@jreback
Contributor

jreback commented Sep 7, 2016

Well, if we can agree what exactly __array__ is supposed to do for non-numpy-supported things (e.g. Categorical, datetime w/tz and sparse), then sure. It seems it's basically .values, IOW a densified numpy object array (in this case it could preserve dtype), as close as possible, though these all lose information. It's just the way to lose less information.

But surely this should be fixed long before 2.0. Deferring things that can clearly be fixed beforehand is not useful.

@shoyer
Member

shoyer commented Sep 7, 2016

__array__ is a special method that exists for the benefit of NumPy to call to coerce something into a NumPy array.

I see two viable options for what SparseSeries.__array__ should do:

  1. Convert into a dense array, including fill values, as @jnothman suggests. This is the approach used by some array-like types (e.g., dask.array).
  2. Raise TypeError. This is what the matrix classes from scipy.sparse do.
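(Option 1 can be sketched with a toy container, not pandas code, whose __array__ reconstructs the full dense array, fill values included:)

```python
import numpy as np

class ToySparse:
    """Hypothetical sparse container: stores only the non-fill values."""
    def __init__(self, sp_values, sp_index, length, fill_value=0):
        self.sp_values = np.asarray(sp_values)
        self.sp_index = np.asarray(sp_index)  # positions of the stored values
        self.length = length
        self.fill_value = fill_value

    def __array__(self, dtype=None):
        # Option 1: densify, including the fill values.
        out = np.full(self.length, self.fill_value,
                      dtype=dtype or self.sp_values.dtype)
        out[self.sp_index] = self.sp_values
        return out

s = ToySparse([1], sp_index=[1], length=2, fill_value=0)
print(np.array(s))  # [0 1]
```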

@wesm
Member

wesm commented Sep 7, 2016

I can't remember the original reasoning for having __array__ return the sparse values, but with fresh eyes I agree with @shoyer to either convert to dense or do not allow any implicit behavior.

@hexgnu
Contributor

hexgnu commented Jan 15, 2018

I was going to submit a PR changing __array__ to return the dense values... but the problem I see is this:

If you call a numpy function like np.abs(sparse_series), you end up in a really odd spot where npoints on the sparse index doesn't match the data anymore.

I think it'd probably be best to document that __array__ returns only the non-fill elements and leave it at that.

@jreback
Contributor

jreback commented Jan 15, 2018

@hexgnu it's a bit more complicated here: __array__ can/should return a dense view of the data, but you can handle the wrapping/unwrapping back to a SparseSeries through things like __array_wrap__ (which is defined in Series).
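(A rough sketch of the __array_wrap__ mechanism being described, using a toy class rather than actual pandas internals, and assuming NumPy's legacy ufunc-output wrapping: the ufunc densifies the input via __array__, computes on the ndarray, then hands the result back to __array_wrap__ for re-wrapping.)

```python
import numpy as np

class Wrapped:
    """Toy array-like that rewraps ufunc results into itself."""
    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None):
        # Ufuncs densify the input through this method...
        return self.values if dtype is None else self.values.astype(dtype)

    def __array_wrap__(self, result, context=None, return_scalar=False):
        # ...then pass the raw ndarray result back here for re-wrapping.
        return Wrapped(result)

w = np.abs(Wrapped([-1, 2]))
print(type(w).__name__, w.values)  # Wrapped [1 2]
```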

@hexgnu
Contributor

hexgnu commented Jan 15, 2018

Yeah, I was digging into __array_wrap__ although I couldn't figure it out today while at the coffee shop... maybe I'll pick it up after I rejigger some of my other outstanding PRs.

@hexgnu
Contributor

hexgnu commented Jan 23, 2018

So I was digging into this a bit more.

This causes some other pretty major issues with groupby.

For instance

df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 1, 0, 0]})
sdf = df.to_sparse(fill_value=0)

sdf.groupby('a').mean()   # returns only dense portions, because groupby relies on np.asarray
sdf.groupby('a').count()  # causes segfault

Cross ref: #5078

@jorisvandenbossche
Member

Regarding the last comment: note that sdf.groupby('a').mean() seems to work, but as far as I can see it returns an incorrect result?

In [42]: df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 1, 0, 0]})
    ...: sdf = df.to_sparse(fill_value=0)

In [43]: sdf
Out[43]: 
   a  b
0  0  0
1  1  1
2  0  0
3  0  0

In [47]: sdf.groupby('a').mean()
Out[47]: 
   b
a   
1  0     # <---- should be 1?

In [44]: g = sdf.groupby('a')

In [45]: g.groups
Out[45]: {1: Int64Index([0], dtype='int64')}

Exactly because of this "only returning dense", it is pointing to the wrong values.

@TomAugspurger
Contributor

TomAugspurger commented Oct 17, 2018

This was fixed by #22325
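(For reference, in current pandas, 0.24 and later, the sparse extension array densifies under np.asarray with fill values included, matching option 1 above; SparseSeries itself was later removed in pandas 1.0. A quick check:)

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 1], fill_value=0)
print(np.asarray(arr))  # [0 1]
```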

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, 0.25.0 Feb 9, 2019