
SparseSeries.__array__ only returns non-fills #14167

Closed
jnothman opened this issue Sep 6, 2016 · 17 comments
Labels
API Design, Sparse
Comments

@jnothman
Contributor

jnothman commented Sep 6, 2016

Code Sample, a copy-pastable example if possible

>>> np.array(pd.SparseSeries([0, 1], fill_value=0))
array([1])

Expected Output

array([0, 1])

This should really be consistent with Series rather than just returning the non-fill values (i.e. rather than being equivalent to np.array(pd.SparseSeries([np.nan, 1]))).

Output of pd.show_versions()

Pandas 0.18.1 alone should be relevant.

Apologies, I've not checked whether this is fixed in master. Just passing on issues from scikit-learn/scikit-learn#7352.

@jreback
Contributor

jreback commented Sep 7, 2016

__array__ densifying is not what you would want. It must be explicit.

In [7]: pd.SparseSeries([0, 1], fill_value=0).get_values()
Out[7]: array([0, 1])

In [8]: np.array(pd.SparseSeries([0, 1], fill_value=0).to_dense())
Out[8]: array([0, 1])

@jreback jreback closed this as completed Sep 7, 2016
@jreback jreback added the Sparse and API Design labels Sep 7, 2016
@jreback jreback added this to the No action milestone Sep 7, 2016
@shoyer
Member

shoyer commented Sep 7, 2016

I agree with @jnothman -- I think this is a bug with SparseSeries. It should not implement a __array__ method like this.

@jnothman
Contributor Author

jnothman commented Sep 7, 2016

I think you're going to confuse users when they pass things to objects that expect array-likes. If you want to raise NotImplemented in __array__, that's fine; your library should be explicit. Returning something semantically obscure, as you do now, is not helpful.

@sinhrks
Member

sinhrks commented Sep 7, 2016

I understand the current implementation makes applying ufuncs a little more efficient. I agree that having np.array(SparseSeries) return the dense values is more natural from the end user's point of view.

@jnothman
Contributor Author

jnothman commented Sep 7, 2016

But correctness/appropriateness depends on which ufunc it is anyway, I think.

@shoyer
Member

shoyer commented Sep 7, 2016

@sinhrks are you saying you did this intentionally for ufunc support? This is not the right way to do that (though the right way may not be possible until/if __numpy_ufunc__).

@sinhrks
Member

sinhrks commented Sep 7, 2016

@shoyer No, the spec hasn't changed for quite a while. ufuncs were supported prior to #13853.

@shoyer
Member

shoyer commented Sep 7, 2016

OK, I am going to reopen this as something we should discuss fixing for pandas 2.0.

If it isn't possible to implement ufuncs on sparse objects without an incorrect implementation of __array__, then IMO we shouldn't implement the ufunc interface at all. This is a variation on the theme of pandas as a leaky abstraction over numpy, which is the main motivation for pandas 2.0 in the first place.
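(The __numpy_ufunc__ proposal mentioned above eventually landed in NumPy 1.13 as __array_ufunc__. A minimal sketch of how that protocol lets a sparse container apply a unary ufunc to its stored values and its fill value separately, without densifying — ToySparse is a hypothetical illustration, not pandas code:)

```python
import numpy as np

class ToySparse:
    """Hypothetical sparse container: stores only the non-fill values."""
    def __init__(self, sp_values, fill_value=0):
        self.sp_values = np.asarray(sp_values)
        self.fill_value = fill_value

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Handle only plain unary calls on this object; defer everything else.
        if method != "__call__" or len(inputs) != 1 or inputs[0] is not self:
            return NotImplemented
        # Apply the ufunc to the stored values and the fill value separately,
        # so the sparse structure (number of stored points) stays intact.
        return ToySparse(ufunc(self.sp_values, **kwargs),
                         fill_value=ufunc(self.fill_value))

s = ToySparse([-1, 2], fill_value=-3)
r = np.abs(s)
print(r.sp_values, r.fill_value)  # [1 2] 3
```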

@shoyer shoyer reopened this Sep 7, 2016
@jreback jreback modified the milestones: 1.0, No action Sep 7, 2016
@jreback
Contributor

jreback commented Sep 7, 2016

Well, if we can agree what exactly __array__ is supposed to do for non-numpy-supported things (e.g. Categorical, datetime w/tz and sparse), then sure. It seems it's basically .values, IOW a densified numpy object array (in this case it could preserve dtype), as close as possible, though these all lose information. It's just the way to lose less information.

But surely this should be fixed long before 2.0. Deferring things that can clearly be fixed beforehand is not useful.

@shoyer
Member

shoyer commented Sep 7, 2016

__array__ is a special method that exists for the benefit of NumPy to call to coerce something into a NumPy array.

I see two viable options for what SparseSeries.__array__ should do:

  1. Convert into a dense array, including fill values, as @jnothman suggests. This is the approach used by some array-like types (e.g., dask.array).
  2. Raise TypeError. This is what the matrix classes from scipy.sparse do.
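(Option 1 can be sketched with a toy container, not pandas code, whose __array__ reconstructs the full dense array, fill values included:)

```python
import numpy as np

class ToySparse:
    """Hypothetical sparse container: stores only the non-fill values."""
    def __init__(self, sp_values, sp_index, length, fill_value=0):
        self.sp_values = np.asarray(sp_values)
        self.sp_index = np.asarray(sp_index)  # positions of the stored values
        self.length = length
        self.fill_value = fill_value

    def __array__(self, dtype=None):
        # Option 1: densify, including the fill values.
        out = np.full(self.length, self.fill_value,
                      dtype=dtype or self.sp_values.dtype)
        out[self.sp_index] = self.sp_values
        return out

s = ToySparse([1], sp_index=[1], length=2, fill_value=0)
print(np.array(s))  # [0 1]
```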

@wesm
Member

wesm commented Sep 7, 2016

I can't remember the original reasoning for having __array__ return the sparse values, but with fresh eyes I agree with @shoyer to either convert to dense or do not allow any implicit behavior.

@hexgnu
Contributor

hexgnu commented Jan 15, 2018

I was going to submit a PR changing __array__ to return the dense values... but the problem I see is this:

If you call a numpy function like np.abs(sparse_series), you end up in a really odd spot where npoints on the sparse index doesn't match the data anymore.

I think it'd probably be best to document that __array__ returns only the non-fill elements and leave it at that.

@jreback
Contributor

jreback commented Jan 15, 2018

@hexgnu it's a bit more complicated here: __array__ can/should return a dense view of the data, but you can handle the wrapping/unwrapping back to a SparseSeries through things like __array_wrap__ (which is defined in Series).
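(A rough sketch of the __array_wrap__ mechanism being described, using a toy class rather than actual pandas internals, and assuming NumPy's legacy ufunc-output wrapping: the ufunc densifies the input via __array__, computes on the ndarray, then hands the result back to __array_wrap__ for re-wrapping.)

```python
import numpy as np

class Wrapped:
    """Toy array-like that rewraps ufunc results into itself."""
    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None):
        # Ufuncs densify the input through this method...
        return self.values if dtype is None else self.values.astype(dtype)

    def __array_wrap__(self, result, context=None, return_scalar=False):
        # ...then pass the raw ndarray result back here for re-wrapping.
        return Wrapped(result)

w = np.abs(Wrapped([-1, 2]))
print(type(w).__name__, w.values)  # Wrapped [1 2]
```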

@hexgnu
Contributor

hexgnu commented Jan 15, 2018

Yeah, I was digging into __array_wrap__ although I couldn't figure it out today while at the coffee shop... maybe I'll pick it up after I rejigger some of my other outstanding PRs.

@hexgnu
Contributor

hexgnu commented Jan 23, 2018

So I was digging into this a bit more.

This causes some other pretty major issues with groupby.

For instance

df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 1, 0, 0]})
sdf = df.to_sparse(fill_value=0)

sdf.groupby('a').mean()   # returns only dense portions, because groupby relies on np.asarray
sdf.groupby('a').count()  # causes segfault

Cross ref: #5078

@jorisvandenbossche
Member

Regarding the last comment: note that sdf.groupby('a').mean() seems to work, but as far as I can see it returns an incorrect result?

In [42]: df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 1, 0, 0]})
    ...: sdf = df.to_sparse(fill_value=0)

In [43]: sdf
Out[43]: 
   a  b
0  0  0
1  1  1
2  0  0
3  0  0

In [47]: sdf.groupby('a').mean()
Out[47]: 
   b
a   
1  0     # <---- should be 1?

In [44]: g = sdf.groupby('a')

In [45]: g.groups
Out[45]: {1: Int64Index([0], dtype='int64')}

Exactly because of this "only returning dense", it is pointing to the wrong values.

@TomAugspurger
Contributor

TomAugspurger commented Oct 17, 2018

This was fixed by #22325
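(For reference, in current pandas, 0.24 and later, the sparse extension array densifies under np.asarray with fill values included, matching option 1 above; SparseSeries itself was later removed in pandas 1.0. A quick check:)

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 1], fill_value=0)
print(np.asarray(arr))  # [0 1]
```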

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, 0.25.0 Feb 9, 2019