Method dropna does not work on SparseDataFrames #21172

Closed
babky opened this issue May 22, 2018 · 6 comments

@babky
Contributor

commented May 22, 2018

The dropna method may return a wrong result on a SparseDataFrame. The following code

import pandas as pd

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')

outputs

import pandas as pd

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
   F1  F2
0 NaN   0
1 NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
    F1
0  NaN
1  NaN

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
   F1
0 NaN
1 NaN

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

Problem description

The dropna method behaves differently on a SparseDataFrame than on a dense DataFrame. In the first batch of commands the all-NaN column F1 is never dropped, and a column that contains data is dropped instead (F3 in the first three examples, F2 in the last two). The second batch, which calls to_dense() first, shows the correct behaviour.

Expected Output

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2
0   0
1   1

   F2
0   0
1   1

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64  
OS: Linux       
OS-release: 4.15.0-20-generic
machine: x86_64
processor:
byteorder: little                                                                
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
                                                                                 
pandas: 0.23.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2                                                                   
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None                                                                     
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2                                                                  
pytz: 2018.4
blosc: None
bottleneck: None
tables: None                                                                     
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None                                                                   
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1       
sqlalchemy: 1.2.7
pymysql: None     
psycopg2: None    
jinja2: 2.10
s3fs: None           
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@WillAyd

Member

commented May 22, 2018

Can you simplify your example? Ideally focus on one copy / paste-able example that outlines the issue and expected result.

Be sure to check out the contributing guide for proper bug reporting:

https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests

@babky

Contributor Author

commented May 22, 2018

The most prominent example is this one:

import pandas as pd

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))

outputs

   F1
0 NaN
1 NaN

In other words, dropna returns the all-NaN field F1 and drops the field F2. It should have dropped F1 and kept F2.
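For reference, here is a minimal sketch of the semantics I would expect from dropna(axis=1, how='all'), written in plain Python so that it runs without any particular pandas version. The dict of lists and the helper names `is_missing` and `drop_all_missing_columns` are hypothetical stand-ins for illustration, not pandas API.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, as in the examples above."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_all_missing_columns(columns):
    """Keep only the columns that contain at least one non-missing value.

    `columns` maps a column name to its list of values. It is a
    hypothetical stand-in for a data frame, not a pandas object.
    """
    return {
        name: values
        for name, values in columns.items()
        if not all(is_missing(v) for v in values)
    }


frame = {"F1": [float("nan"), float("nan")], "F2": [0, 1]}
result = drop_all_missing_columns(frame)
print(result)  # {'F2': [0, 1]}
```

With this reading, F1 is dropped because every value in it is missing, and F2 is kept, which matches what the dense DataFrame returns.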

@TomAugspurger

Contributor

commented May 22, 2018

Is this new in 0.23?

@babky

Contributor Author

commented May 22, 2018

Exactly. I tried it in a new virtualenv: pandas 0.23.0 has this issue and 0.22.0 does not.

@WillAyd

Member

commented May 22, 2018

Thanks for the example / report. Investigation / PRs are always welcome!

@WillAyd added Missing-data and removed Needs Info labels May 22, 2018

@babky

Contributor Author

commented May 22, 2018

To conclude, I've isolated the issue to the following code. A mask, a SparseArray marking the fields that should be preserved, is created, and nonzero is called on it. However, nonzero is buggy even in 0.22.0. In both versions this code

import pandas as pd

sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])
sa = sa.nonzero()
print(sa)

outputs 0, 3, 7. The correct result would be 2, 5, 9, that is, each index shifted by two. In pandas 0.22.0 the bug did not surface in dropna because to_dense() was used in the process.

One way to resolve this is to use to_dense(): dropna() would then work, but SparseArray would remain buggy. Fixing SparseArray itself would be of greater value.
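The shift by two can be reproduced with a toy model in plain Python. This is only a sketch under an assumption: that a sparse array stores the non-NaN values together with their original positions, and that the wrong result comes from computing nonzero on the stored values without mapping back through those positions. The function names are hypothetical and this is not the pandas implementation.

```python
import math


def to_sparse(dense):
    """Split a dense list into (positions, values) of the non-NaN entries."""
    positions = [i for i, v in enumerate(dense) if not math.isnan(v)]
    values = [dense[i] for i in positions]
    return positions, values


def nonzero_unmapped(positions, values):
    """Indices into the stored values only (reproduces the reported output)."""
    return [k for k, v in enumerate(values) if v != 0]


def nonzero_mapped(positions, values):
    """Indices into the original array, mapped through the stored positions."""
    return [positions[k] for k, v in enumerate(values) if v != 0]


nan = float("nan")
dense = [nan, nan, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
positions, values = to_sparse(dense)
print(nonzero_unmapped(positions, values))  # [0, 3, 7]
print(nonzero_mapped(positions, values))    # [2, 5, 9]
```

The two leading NaNs are not stored, so every index computed on the stored values is off by two. When the result is used as a column mask in dropna, it selects the wrong columns.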

@babky referenced this issue May 22, 2018

Merged

Fix nonzero of a SparseArray #21175

4 of 4 tasks complete

@jreback added this to the 0.24.0 milestone Nov 11, 2018
