BUG: setting a sparse column in a frame buggy #8131

Closed
jreback opened this Issue Aug 28, 2014 · 4 comments

Contributor

jreback commented Aug 28, 2014

from SO

thought this was well tested.....

import pandas as pd

df = pd.DataFrame({'c_1': ['a', 'b', 'c'], 'n_1': [1., 2., 3.]})
df['new_column'] = pd.Series([0, 0, 1]).to_sparse(fill_value=0)
# AssertionError: Shape of new values must be compatible with manager shape

jreback added this to the 0.15.0 milestone Aug 28, 2014

jreback modified the milestone: 0.15.1, 0.15.0 Sep 9, 2014

Contributor

jreback commented Sep 9, 2014

Contributor

immerrr commented Sep 10, 2014

Let me check that...

Contributor

immerrr commented Sep 11, 2014

The issue is in df._sanitize_column, which returns a 2d dense array containing only the non-fill elements:

In [9]: df = pd.DataFrame({'c_1': list('abc')})

In [10]: sp_col = pd.Series([0,0,1]).to_sparse(fill_value=0)

In [11]: df._sanitize_column('n', sp_col)
Out[11]: array([[1]])

Hacking sanitize column is easy, but it uncovers yet another issue with ndarray subclassing:

In [54]: sp_arr = pd.SparseArray([0,0,1], fill_value=0)

In [55]: sp_arr
Out[55]: 
[0, 0, 1.0]
Fill: 0
IntIndex
Indices: array([2], dtype=int32)


In [56]: np.asarray(sp_arr)
Out[56]: array([ 1.])

This happens because np.asarray checks at the C level whether sp_arr provides the PEP 3118 buffer interface (which ndarray does) and uses that representation, which contains only the non-fill elements. That is unfortunate, because an inheriting class cannot override it at the Python level (see python issue).
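
The same effect can be reproduced with the standard library alone. The following is only a rough analogy, not pandas or NumPy code: SparseBytes and from_dense are hypothetical names, and bytearray stands in for ndarray as the base class that exports a C-level buffer. The subclass overrides __len__ and __iter__ to present a dense view, but memoryview reads the raw buffer and ignores those overrides.

```python
class SparseBytes(bytearray):
    """Toy sparse container (hypothetical): the raw buffer holds only non-fill values."""

    @classmethod
    def from_dense(cls, dense, fill_value=0):
        positions = [i for i, v in enumerate(dense) if v != fill_value]
        obj = cls([dense[i] for i in positions])  # buffer = non-fill values only
        obj.positions = positions
        obj.fill_value = fill_value
        obj.dense_length = len(dense)
        return obj

    def __len__(self):
        # Python-level override: report the dense length
        return self.dense_length

    def __iter__(self):
        # Python-level override: re-insert the fill values
        stored = dict(zip(self.positions, bytearray.__iter__(self)))
        return iter([stored.get(i, self.fill_value)
                     for i in range(self.dense_length)])


sp = SparseBytes.from_dense([0, 0, 1], fill_value=0)

dense_view = list(sp)                # goes through __iter__ -> [0, 0, 1]
buffer_view = list(memoryview(sp))   # goes through the C buffer -> [1]
print(dense_view, buffer_view)
```

Anything that consumes the object through the buffer interface sees only the stored values, which mirrors what np.asarray(sp_arr) returns above.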

Contributor

jreback commented Sep 11, 2014

yep, prob SparseArray just needs to be tested for (similar to what was just done with Categorical),
and passed through directly.
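
A minimal sketch of that idea, under loud assumptions: ToySparse and sanitize_column below are hypothetical stand-ins written for illustration, not pandas' SparseArray or DataFrame._sanitize_column, and this is not necessarily how the eventual fix was implemented. It only shows the shape of "test for the type and pass it through" before any dense conversion happens.

```python
class ToySparse:
    """Hypothetical sparse container: keeps only the non-fill values."""

    def __init__(self, dense, fill_value=0):
        self.fill_value = fill_value
        self.length = len(dense)
        self.stored = {i: v for i, v in enumerate(dense) if v != fill_value}

    def __len__(self):
        return self.length


def sanitize_column(value, expected_length):
    """Hypothetical sanitizer: check the length, densify only non-sparse input."""
    if len(value) != expected_length:
        raise ValueError("length of values does not match length of index")
    if isinstance(value, ToySparse):
        # Pass the sparse container through untouched instead of
        # converting it, which would expose only the stored values.
        return value
    return list(value)


sp_col = ToySparse([0, 0, 1], fill_value=0)
passed_through = sanitize_column(sp_col, 3)
densified = sanitize_column((1.0, 2.0, 3.0), 3)
```

The check has to come before any generic array conversion, since that conversion is exactly the step that loses the fill values.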

jreback modified the milestone: 0.15.0, 0.15.1 Sep 17, 2014

jreback closed this in #8291 Sep 17, 2014
