BUG: Sparse incorrectly handle fill_value #12797

Closed
sinhrks opened this Issue Apr 4, 2016 · 5 comments

Member

sinhrks commented Apr 4, 2016

Sparse appears to conflate missing values (NaN) with fill_value. Based on the docs, I understand fill_value to be a user-specified value that is omitted from the sparse internal representation. It may therefore differ from the missing value (NaN).

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Wrong: the 2nd and last elements should be NaN
pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1.,  0.,  0.,  3.,  0.])

# Wrong: the 2nd element (B) should be NaN
orig = pd.Series([1, np.nan, 0, 3, np.nan], index=list('ABCDE'))
sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    0.0
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

Expected Output

pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
# array([ 1.,  nan,  0.,  3.,  nan])

sparse = orig.to_sparse(fill_value=0)
sparse.reindex(['A', 'B', 'C'])
# A    1.0
# B    NaN
# C    0.0
# dtype: float64
# BlockIndex
# Block locations: array([0], dtype=int32)
# Block lengths: array([1], dtype=int32)

output of pd.show_versions()

Current master.

The fix itself looks straightforward, but it breaks some tests that use dubious comparisons.

@sinhrks sinhrks added this to the 0.18.1 milestone Apr 4, 2016

Contributor

jreback commented Apr 4, 2016

Hmm, I think it's using np.nan as the missing value indicator, which is right. THEN you fill those locations using the fill_value, not the other way around.
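
As a rough illustration of this reading (a pure-Python sketch, not pandas internals; `fill_missing` is a hypothetical helper), treating NaN as the missing marker and then filling those slots with fill_value reproduces the output reported above:

```python
import math


def fill_missing(values, fill_value):
    """Treat NaN as the missing marker, then replace it with fill_value."""
    return [
        float(fill_value) if math.isnan(v) else float(v)
        for v in values
    ]


data = [1, math.nan, 0, 3, math.nan]
print(fill_missing(data, fill_value=0))  # [1.0, 0.0, 0.0, 3.0, 0.0]
```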

Member

sinhrks commented Apr 4, 2016

@jreback I may be misunderstanding, but fill_value becomes the missing value indicator if provided (the np.nan positions are included in the SparseIndex indices).

pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0)
[1.0, nan, 0, 3.0, nan]
Fill: 0
IntIndex
Indices: array([0, 1, 3, 4], dtype=int32)

Thus I feel it is natural for .to_dense to return np.nan as it is, not fill_value.
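
A small pure-Python model may make this clearer. It is only a sketch of the storage scheme as I understand it, not the actual SparseArray code, and `to_sparse_parts` and `to_dense` are hypothetical names. Only positions equal to fill_value are dropped from the index, so NaN stays among the stored values and should come back unchanged when densifying:

```python
import math


def to_sparse_parts(values, fill_value):
    """Return (sp_values, indices), omitting only entries equal to fill_value."""
    # NaN != fill_value is True, so NaN positions are kept in the index.
    indices = [i for i, v in enumerate(values) if v != fill_value]
    sp_values = [float(values[i]) for i in indices]
    return sp_values, indices


def to_dense(sp_values, indices, length, fill_value):
    """Rebuild the dense list: stored values at their indices, fill_value elsewhere."""
    dense = [float(fill_value)] * length
    for i, v in zip(indices, sp_values):
        dense[i] = v
    return dense


data = [1, math.nan, 0, 3, math.nan]
sp_values, indices = to_sparse_parts(data, fill_value=0)
print(indices)  # [0, 1, 3, 4]
dense = to_dense(sp_values, indices, len(data), fill_value=0)
print(dense)    # [1.0, nan, 0.0, 3.0, nan]
```

Note that the `!=` test in this sketch would need special handling if fill_value itself were NaN, since NaN never compares equal to anything.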

Contributor

jreback commented Apr 4, 2016

In your example the 0 (the element at index 2) is the missing one.

In [5]: pd.SparseArray([1, np.nan, 0, 3, np.nan], fill_value=0).to_dense()
Out[5]: array([ 1.,  0.,  0.,  3.,  0.])

ahh so you think this should be
Out[5]: array([ 1., np.nan, 0., 3., np.nan])

yes that is prob right.

Member

sinhrks commented Apr 4, 2016

Ah sorry, added Expected Output section.

Contributor

jreback commented Apr 4, 2016

yep that looks right.

Yeah, I think the comparison tests equate NaN with the missing value, when in fact the fill_value entries are the missing ones.
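
A comparison that keeps the two notions apart could look roughly like this (a sketch only; `dense_equal` is a hypothetical helper, not something taken from the pandas test suite). NaN matches only NaN, so a NaN slot is never considered equal to a fill_value slot:

```python
import math


def dense_equal(left, right):
    """Element-wise equality where NaN matches NaN but nothing else."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        a_nan, b_nan = math.isnan(a), math.isnan(b)
        if a_nan or b_nan:
            # One-sided NaN means the slots differ.
            if a_nan != b_nan:
                return False
        elif a != b:
            return False
    return True


expected = [1.0, math.nan, 0.0, 3.0, math.nan]
print(dense_equal(expected, [1.0, math.nan, 0.0, 3.0, math.nan]))  # True
print(dense_equal(expected, [1.0, 0.0, 0.0, 3.0, 0.0]))            # False
```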
