
concat() on Sparse dataframe returns strange results #12174

Closed
JeanLescut opened this issue Jan 29, 2016 · 10 comments
Labels
Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type

Comments

@JeanLescut

I opened a Stack Overflow question here:
http://stackoverflow.com/questions/35083277/pandas-concat-on-sparse-dataframes-a-mystery

And someone asked me to open an issue on GitHub.
To summarise, I don't really understand what's going on after a concat of two sparse DataFrames...
After such a concat, df.density or df.memory_usage, for example, will throw an error.
Moreover, the basic structure of the sparse result seems strange...
I'm sorry the bug is not better defined.
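For readers on modern pandas (where the SparseDataFrame container has been removed), density and memory usage are exposed through the Sparse extension dtype and the `.sparse` accessor instead; a minimal sketch, assuming pandas >= 1.0:

```python
import pandas as pd

# Sparse columns in modern pandas use the Sparse extension dtype
# rather than the removed SparseDataFrame container.
df = pd.DataFrame({"A": [0, 1, 0, 0],
                   "B": [0, 0, 0, 0]}).astype(pd.SparseDtype("int64", 0))

print(df.sparse.density)        # ratio of stored (non-fill) points: 0.125
print(df.memory_usage().sum())  # works without error on sparse columns
```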

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Duplicate Report Duplicate issue or pull request Sparse Sparse Data Type labels Jan 29, 2016
@jreback
Contributor

jreback commented Jan 29, 2016

dupe of #10536

When making a report, please include a simple copy-pastable example as well as the output of pd.show_versions().

@jreback jreback closed this as completed Jan 29, 2016
@JeanLescut
Author

Hi,

I don't understand; issue #10536 says that:

The [code] above does work correctly for SparseDataFrames.

Moreover, pd.show_versions() gives me:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 8.0.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jan 29, 2016

Well, then please show an example.

@JeanLescut
Author

This is a pretty long example, sorry about that.
Maybe it's not a bug and I'm missing something?

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'C': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
              'D': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'E': [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0]},
            index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

df2 = pd.DataFrame({'F': [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
              'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0],
              'H': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              'J': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]},
            index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

print("df1 sparse size =", df1.memory_usage().sum(),"Bytes, density =", df1.density)
print(type(df1))
print('default_fill_value =', df1.default_fill_value)
print(df1.values)

print("df2 sparse size =", df2.memory_usage().sum(),"Bytes, density =", df2.density)
print(type(df2))
print('default_fill_value =', df2.default_fill_value)
print(df2.values)

result = pd.concat([df1,df2], axis=1)

print(type(result)) # Seems alright
print('default_fill_value =', result.default_fill_value) # The default fill value is not 0 ???
print(result.values) # What are those "nan" blocks?
# result.density # Throws an error
# result.memory_usage # Throws an error

@jreback
Contributor

jreback commented Jan 29, 2016

It's the same issue as the one I pointed to.

@kawochen kawochen mentioned this issue Jan 29, 2016
@jreback
Contributor

jreback commented Jan 29, 2016

Added to the master issue. The sparse structures have not gotten a lot of TLC; we need someone to step up and give them some :)

@JeanLescut
Author

Yes, #10627 is certainly the root cause.

@yonatanp

yonatanp commented Mar 6, 2017

In case this ever helps someone, here is my favorite workaround for fixing semi-sparse DataFrames:

def fix_broken_semi_sparse_columns_inplace(df, fill_value=0):
    """
    Things happen in a DataFrame's life that get some of its columns into a "semi-sparse" state.
    The tell-tale of this illness is a column of type `SparseSeries` that has no `sp_values` member.
    This function takes `df`, identifies its broken columns, and then cuts, fixes, and reinserts them in the same order.
    Hopefully this is more efficient than just calling df.to_sparse(fill_value=fill_value).
    """
    # TODO: what should we do if `df` is not a SparseDataFrame?
    #       Right now we raise to make sure we notice if it happens.
    if not isinstance(df, pd.SparseDataFrame):
        raise Exception("df is not a SparseDataFrame (while not necessarily bad, it's unexpected and wasn't tested)")
    columns = df.columns.tolist()
    for index, column in enumerate(columns):
        # we only want SparseSeries
        if not isinstance(df[column], pd.SparseSeries):
            continue
        # we only want broken ones
        if hasattr(df[column], "sp_values"):
            continue
        # cut, fix, reinsert
        df_of_col = df[[column]]
        df_of_col_fixed = df_of_col.to_sparse(fill_value=fill_value)
        col_fixed = df_of_col_fixed[column]
        del df[column]
        df.insert(index, column, col_fixed)

@jreback
Contributor

jreback commented Mar 6, 2017

@yonatanp not sure what you are showing here. This issue was fixed in 0.18.1

@mdgoldberg
Copy link

I still have this issue when using pd.concat(..., axis=1) on SparseDataFrames with the same fill value:

import pandas as pd, numpy as np
df = pd.SparseDataFrame(np.random.choice([0., 1.], size=(5,5)), default_fill_value=0.)
df.density, df.default_fill_value
# results in density around 0.5, default_fill_value = 0
concatted = pd.concat((df, df), axis=1)
concatted.density, concatted.default_fill_value
# results in same density around 0.5, default_fill_value = nan
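For comparison, with the Sparse extension dtype that replaced SparseDataFrame in pandas 1.0, concat along axis=1 preserves each column's fill value; a minimal sketch, assuming pandas >= 1.0:

```python
import numpy as np
import pandas as pd

# Build sparse columns via the Sparse extension dtype with fill value 0.0.
dense = pd.DataFrame(np.random.choice([0.0, 1.0], size=(5, 5)))
sparse = dense.astype(pd.SparseDtype("float64", 0.0))

out = pd.concat([sparse, sparse], axis=1)
# Every concatenated column keeps its Sparse[float64, 0.0] dtype,
# so the fill value is not silently switched to nan.
print(out.dtypes.unique())
```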
