concat() on Sparse dataframe returns strange results #12174

Closed
JeanLescut opened this Issue Jan 29, 2016 · 10 comments


JeanLescut commented Jan 29, 2016

I opened a Stack Overflow question here:
http://stackoverflow.com/questions/35083277/pandas-concat-on-sparse-dataframes-a-mystery

Someone there asked me to open an issue on GitHub.
To summarise: I don't really understand what's going on after a concat of two sparse DataFrames.
After such a concat, df.density or df.memory_usage, for example, will throw an error.
Moreover, the basic structure of the sparse result seems strange.
I'm sorry the bug isn't better defined.
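
For reference, a minimal sketch of the kind of concat being described (not from the original report; it assumes the pre-1.0 to_sparse API and the behaviour reported later in this thread on pandas 0.17.0):

import pandas as pd

# Hypothetical miniature: two sparse frames sharing fill_value=0
a = pd.DataFrame({'A': [0, 1, 0]}).to_sparse(fill_value=0)
b = pd.DataFrame({'B': [0, 0, 2]}).to_sparse(fill_value=0)

out = pd.concat([a, b], axis=1)
print(out.default_fill_value)  # reported to come back as nan instead of 0
out.density                    # reported to raise on affected versions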

jreback (Contributor) commented Jan 29, 2016

dupe of #10536

When making a report, you'd ideally include a simple copy-pastable example as well as the output of pd.show_versions().


jreback closed this Jan 29, 2016

JeanLescut commented Jan 29, 2016

Hi,

I don't understand; issue #10536 says that:

The [code] above does work correctly for SparseDataFrames.

Moreover, pd.show_versions() gives me:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 8.0.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None

jreback (Contributor) commented Jan 29, 2016

Well, then please show an example.

JeanLescut commented Jan 29, 2016

This is a pretty long example, sorry about that:
Maybe it's not a bug and I'm just missing something(?)

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'C': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
                    'D': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'E': [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0]},
                   index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

df2 = pd.DataFrame({'F': [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
                    'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0],
                    'H': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'J': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]},
                   index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

print("df1 sparse size =", df1.memory_usage().sum(), "Bytes, density =", df1.density)
print(type(df1))
print('default_fill_value =', df1.default_fill_value)
print(df1.values)

print("df2 sparse size =", df2.memory_usage().sum(), "Bytes, density =", df2.density)
print(type(df2))
print('default_fill_value =', df2.default_fill_value)
print(df2.values)

result = pd.concat([df1, df2], axis=1)

print(type(result))  # Seems alright
print('default_fill_value =', result.default_fill_value)  # Why is the default fill value not 0?
print(result.values)  # What are those "nan" blocks?
# result.density  # Throws an error
# result.memory_usage  # Throws an error

jreback (Contributor) commented Jan 29, 2016

It's the same issue I pointed to.


kawochen referenced this issue Jan 29, 2016: BUG: Sparse master issue #10627 (open)
jreback (Contributor) commented Jan 29, 2016

Added to the master issue. Sparse structures haven't gotten a lot of TLC; we need someone to step up and give them some :)

JeanLescut commented Jan 29, 2016

Yes, #10627 is certainly the root cause.

yonatanp commented Mar 6, 2017

In case this ever helps someone, here is my favorite workaround for fixing semi-sparse DataFrames:

import pandas as pd

def fix_broken_semi_sparse_columns_inplace(df, fill_value=0):
    """
    Things happen in a DataFrame's life that get some of its columns into a
    "semi-sparse" state. The tell-tale sign of this illness is a column that is
    of type `SparseSeries` but has no `sp_values` member. This function takes
    `df`, identifies its broken columns, and then cuts, fixes, and reinserts
    them in the same order. Hopefully this is more efficient than just calling
    df.to_sparse(fill_value=fill_value).
    """
    # TODO: what should we do if `df` is not a SparseDataFrame?
    #       Right now we raise to make sure we notice if it happens.
    if not isinstance(df, pd.SparseDataFrame):
        raise Exception("df is not a SparseDataFrame (while not necessarily bad, it's unexpected and wasn't tested)")
    columns = df.columns.tolist()
    for index, column in enumerate(columns):
        # we only want SparseSeries
        if not isinstance(df[column], pd.SparseSeries):
            continue
        # we only want broken ones
        if hasattr(df[column], "sp_values"):
            continue
        # cut-fix-insert: extract the column as a one-column frame,
        # re-sparsify it with the desired fill value, and put it back
        df_of_col = df[[column]]
        df_of_col_fixed = df_of_col.to_sparse(fill_value=fill_value)
        col_fixed = df_of_col_fixed[column]
        del df[column]
        df.insert(index, column, col_fixed)
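
For example, applied to a concat result like the one earlier in this thread (a hypothetical usage sketch; it assumes the pre-1.0 SparseDataFrame API, and the small frames here stand in for the original example):

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0]}).to_sparse(fill_value=0)
df2 = pd.DataFrame({'B': [0, 0, 2]}).to_sparse(fill_value=0)
result = pd.concat([df1, df2], axis=1)

fix_broken_semi_sparse_columns_inplace(result, fill_value=0)
# Per the workaround's intent, the broken columns are rebuilt in place,
# so result.density should evaluate without raising afterwards.
print(result.density)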

jreback (Contributor) commented Mar 6, 2017

@yonatanp not sure what you're showing here. This issue was fixed in 0.18.1.
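
For anyone checking whether they have the fix, the installed version is easy to confirm (pd.show_versions() also prints it):

import pandas as pd
print(pd.__version__)  # the fix above is reported as landing in 0.18.1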

mdgoldberg commented Mar 19, 2017

I still have this issue when using pd.concat(..., axis=1) on SparseDataFrames with the same fill value:

import pandas as pd, numpy as np
df = pd.SparseDataFrame(np.random.choice([0., 1.], size=(5,5)), default_fill_value=0.)
df.density, df.default_fill_value
# results in density around 0.5, default_fill_value = 0
concatted = pd.concat((df, df), axis=1)
concatted.density, concatted.default_fill_value
# results in same density around 0.5, default_fill_value = nan
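
A possible workaround (not from the thread; it assumes the pre-1.0 API, where a round-trip through a dense frame rebuilds the sparse blocks with an explicit fill value):

import pandas as pd, numpy as np

df = pd.SparseDataFrame(np.random.choice([0., 1.], size=(5, 5)), default_fill_value=0.)
concatted = pd.concat((df, df), axis=1)

# Round-trip: densify, then re-sparsify with the fill value we actually want.
# This costs one dense copy, so it only suits frames that fit in memory densely.
fixed = concatted.to_dense().to_sparse(fill_value=0.)
print(fixed.default_fill_value)  # 0.0 again
print(fixed.density)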
