concat() on Sparse dataframe returns strange results #12174

Closed
JeanLescut opened this Issue Jan 29, 2016 · 10 comments


JeanLescut commented Jan 29, 2016

I opened a Stack Overflow question here:
http://stackoverflow.com/questions/35083277/pandas-concat-on-sparse-dataframes-a-mystery

Someone there asked me to open an issue on GitHub.
To summarise: I don't really understand what's going on after a concat of two sparse DataFrames.
After such a concat, df.density or df.memory_usage, for example, will throw an error.
Moreover, the basic structure of the sparse result seems strange.
I'm sorry the bug isn't better defined.
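
For reference, a minimal sketch of the kind of concat being described (not from the original report; it assumes the pre-1.0 to_sparse API and the behaviour reported later in this thread on pandas 0.17.0):

import pandas as pd

# Hypothetical miniature: two sparse frames sharing fill_value=0
a = pd.DataFrame({'A': [0, 1, 0]}).to_sparse(fill_value=0)
b = pd.DataFrame({'B': [0, 0, 2]}).to_sparse(fill_value=0)

out = pd.concat([a, b], axis=1)
print(out.default_fill_value)  # reported to come back as nan instead of 0
out.density                    # reported to raise on affected versions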

jreback (Contributor) commented Jan 29, 2016

dupe of #10536

When making a report, you'd ideally include a simple copy-pastable example as well as the output of pd.show_versions().


jreback closed this Jan 29, 2016

JeanLescut commented Jan 29, 2016

Hi,

I don't understand; issue #10536 says that:

The [code] above does work correctly for SparseDataFrames.

Moreover, pd.show_versions() gives me:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 8.0.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None

jreback (Contributor) commented Jan 29, 2016

Well, then please show an example.

JeanLescut commented Jan 29, 2016

This is a pretty long example, sorry about that:
Maybe it's not a bug and I'm just missing something(?)

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'C': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
                    'D': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'E': [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0]},
                   index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

df2 = pd.DataFrame({'F': [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
                    'G': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0],
                    'H': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'I': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    'J': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6]},
                   index=['a','b','c','d','e','f','g','h','i','j','k','l']).to_sparse(fill_value=0)

print("df1 sparse size =", df1.memory_usage().sum(), "Bytes, density =", df1.density)
print(type(df1))
print('default_fill_value =', df1.default_fill_value)
print(df1.values)

print("df2 sparse size =", df2.memory_usage().sum(), "Bytes, density =", df2.density)
print(type(df2))
print('default_fill_value =', df2.default_fill_value)
print(df2.values)

result = pd.concat([df1, df2], axis=1)

print(type(result))  # Seems alright
print('default_fill_value =', result.default_fill_value)  # Why is the default fill value not 0?
print(result.values)  # What are those "nan" blocks?
# result.density  # Throws an error
# result.memory_usage  # Throws an error

jreback (Contributor) commented Jan 29, 2016

It's the same issue I pointed to.


kawochen referenced this issue Jan 29, 2016: BUG: Sparse master issue #10627 (open)
jreback (Contributor) commented Jan 29, 2016

Added to the master issue. Sparse structures haven't gotten a lot of TLC; we need someone to step up and give them some :)

JeanLescut commented Jan 29, 2016

Yes, #10627 is certainly the root cause.

yonatanp commented Mar 6, 2017

In case this ever helps someone, here is my favorite workaround for fixing semi-sparse DataFrames:

import pandas as pd

def fix_broken_semi_sparse_columns_inplace(df, fill_value=0):
    """
    Things happen in a DataFrame's life that get some of its columns into a
    "semi-sparse" state. The tell-tale sign of this illness is a column that is
    of type `SparseSeries` but has no `sp_values` member. This function takes
    `df`, identifies its broken columns, and then cuts, fixes, and reinserts
    them in the same order. Hopefully this is more efficient than just calling
    df.to_sparse(fill_value=fill_value).
    """
    # TODO: what should we do if `df` is not a SparseDataFrame?
    #       Right now we raise to make sure we notice if it happens.
    if not isinstance(df, pd.SparseDataFrame):
        raise Exception("df is not a SparseDataFrame (while not necessarily bad, it's unexpected and wasn't tested)")
    columns = df.columns.tolist()
    for index, column in enumerate(columns):
        # we only want SparseSeries
        if not isinstance(df[column], pd.SparseSeries):
            continue
        # we only want broken ones
        if hasattr(df[column], "sp_values"):
            continue
        # cut-fix-insert: extract the column as a one-column frame,
        # re-sparsify it with the desired fill value, and put it back
        df_of_col = df[[column]]
        df_of_col_fixed = df_of_col.to_sparse(fill_value=fill_value)
        col_fixed = df_of_col_fixed[column]
        del df[column]
        df.insert(index, column, col_fixed)
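
For example, applied to a concat result like the one earlier in this thread (a hypothetical usage sketch; it assumes the pre-1.0 SparseDataFrame API, and the small frames here stand in for the original example):

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 0]}).to_sparse(fill_value=0)
df2 = pd.DataFrame({'B': [0, 0, 2]}).to_sparse(fill_value=0)
result = pd.concat([df1, df2], axis=1)

fix_broken_semi_sparse_columns_inplace(result, fill_value=0)
# Per the workaround's intent, the broken columns are rebuilt in place,
# so result.density should evaluate without raising afterwards.
print(result.density)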

jreback (Contributor) commented Mar 6, 2017

@yonatanp not sure what you're showing here. This issue was fixed in 0.18.1.
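
For anyone checking whether they have the fix, the installed version is easy to confirm (pd.show_versions() also prints it):

import pandas as pd
print(pd.__version__)  # the fix above is reported as landing in 0.18.1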

mdgoldberg commented Mar 19, 2017

I still have this issue when using pd.concat(..., axis=1) on SparseDataFrames with the same fill value:

import pandas as pd, numpy as np
df = pd.SparseDataFrame(np.random.choice([0., 1.], size=(5,5)), default_fill_value=0.)
df.density, df.default_fill_value
# results in density around 0.5, default_fill_value = 0
concatted = pd.concat((df, df), axis=1)
concatted.density, concatted.default_fill_value
# results in same density around 0.5, default_fill_value = nan
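
A possible workaround (not from the thread; it assumes the pre-1.0 API, where a round-trip through a dense frame rebuilds the sparse blocks with an explicit fill value):

import pandas as pd, numpy as np

df = pd.SparseDataFrame(np.random.choice([0., 1.], size=(5, 5)), default_fill_value=0.)
concatted = pd.concat((df, df), axis=1)

# Round-trip: densify, then re-sparsify with the fill value we actually want.
# This costs one dense copy, so it only suits frames that fit in memory densely.
fixed = concatted.to_dense().to_sparse(fill_value=0.)
print(fixed.default_fill_value)  # 0.0 again
print(fixed.density)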
