DataFrame[SparseArray] coerces to dense on sum(axis=1) #28487

scottgigante · 2019-09-17T20:07:23Z

Code Sample, a copy-pastable example if possible

import pandas as pd
from scipy import sparse
X_sp = sparse.coo_matrix((2**30, 2**10))
X_pd = pd.DataFrame.sparse.from_spmatrix(X_sp)
X_sp.sum(axis=1)
X_sp.sum(axis=0)
X_pd.sum(axis=1)
X_pd.sum(axis=0)

Problem description

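The new sparse DataFrame (a `DataFrame` with `SparseArray` columns) is coerced to a dense array when computing the sum. With the shape above, `sum(axis=1)` fails with a `MemoryError` and `sum(axis=0)` hangs until interrupted.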

>>> import pandas as pd
>>> from scipy import sparse
>>> X_sp = sparse.coo_matrix((2**30, 2**10))
>>> X_pd = pd.DataFrame.sparse.from_spmatrix(X_sp)
>>> X_sp.sum(axis=1)
matrix([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])
>>> X_sp.sum(axis=0)
matrix([[0., 0., 0., ..., 0., 0., 0.]])
>>> X_pd.sum(axis=1)
Traceback (most recent call last):
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/frame.py", line 7908, in _reduce
    values = self.values
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/generic.py", line 5443, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 822, in as_array
    arr = mgr._interleave()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
    result = np.empty(self.shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (1024, 1073741824) and data type float64

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/generic.py", line 11585, in stat_func
    min_count=min_count,
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/frame.py", line 7953, in _reduce
    result = f(data.values)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/generic.py", line 5443, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 822, in as_array
    arr = mgr._interleave()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
    result = np.empty(self.shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (1024, 1073741824) and data type float64
>>> X_pd.sum(axis=0)
# hangs forever
^C
Traceback (most recent call last):
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/frame.py", line 7908, in _reduce
    values = self.values
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/generic.py", line 5443, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 822, in as_array
    arr = mgr._interleave()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
    result = np.empty(self.shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (1024, 1073741824) and data type float64

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/generic.py", line 11585, in stat_func
    min_count=min_count,
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/frame.py", line 7935, in _reduce
    result = opa.get_result()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
    self.apply_series_generator()
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/apply.py", line 308, in apply_series_generator
    results[i] = self.f(v)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/frame.py", line 7893, in f
    return op(x, axis=axis, skipna=skipna, **kwds)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/nanops.py", line 70, in _f
    return f(*args, **kwargs)
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/nanops.py", line 495, in nansum
    values, skipna, fill_value=0, mask=mask
  File "/home/scottgigante/sandbox/lib/python3.7/site-packages/pandas/core/nanops.py", line 309, in _get_values
    values = values.copy()
KeyboardInterrupt
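For scale, the size of the dense array that the traceback tries to allocate can be worked out directly from the shape and dtype in the `MemoryError` message. This is only illustrative arithmetic; the variable names are made up for the example.

```python
import numpy as np

# Shape and dtype taken from the MemoryError message above.
n_rows, n_cols = 2**30, 2**10
itemsize = np.dtype("float64").itemsize  # 8 bytes per element

dense_bytes = n_rows * n_cols * itemsize
dense_tib = dense_bytes / 2**40

print(f"{dense_bytes} bytes = {dense_tib} TiB")
```

The all-zero sparse matrix stores no values at all, so any code path that materialises `.values` is what makes the operation infeasible.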

Expected Output

The output should be computed successfully as in the scipy case.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.10-arch1-1-ARCH
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.0
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.0.1
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : 2.7.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

TomAugspurger · 2019-09-17T20:53:25Z

Can you verify: this is an issue with both SparseDataFrame and DataFrame[Sparse]?

scottgigante · 2019-09-21T16:36:04Z

Yes, confirmed. Same result for X_pd_depr = pd.SparseDataFrame(X_sp).

gpascualg · 2019-09-29T16:53:58Z

I can also confirm this on:

Python: 3.6.6
Pandas: 0.23.4
Numpy: 1.15.1

TheVidAllMayThe · 2020-04-28T14:52:24Z

I've come across this same issue.

I've been using the following method to sum the columns as a workaround:

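import pandas as pd


def _sum_sparse_columns(df: pd.DataFrame) -> pd.Series:
    """Sum across the columns of each row (like df.sum(axis=1)) without densifying."""
    idx = df.index
    df = df.copy(deep=False)  # Shallow copy so the caller's column labels are not overwritten
    df.columns = range(len(df.columns))  # Otherwise an exception is thrown when converting to a scipy matrix
    mat = df.sparse.to_coo()
    return pd.Series(
        [x[0, 0] for x in mat.sum(axis=1)],
        index=idx
    )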
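A variant of the same idea that handles either axis might look like the sketch below. `sparse_sum` is a hypothetical helper name, not a pandas API. It assumes every column is sparse with a fill value of 0 (as produced by `from_spmatrix`), since `DataFrame.sparse.to_coo()` requires that.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_sum(df: pd.DataFrame, axis: int = 0) -> pd.Series:
    """Hypothetical helper: sum a sparse DataFrame via scipy without densifying it."""
    mat = df.sparse.to_coo()  # assumes all columns are sparse with fill_value 0
    totals = np.asarray(mat.sum(axis=axis)).ravel()  # flatten scipy's 2-D result
    labels = df.columns if axis == 0 else df.index
    return pd.Series(totals, index=labels)


# Small 4x3 matrix with three stored values, so the result is easy to check by hand.
X_sp = sparse.coo_matrix(([1.0, 2.0, 3.0], ([0, 2, 2], [0, 1, 2])), shape=(4, 3))
X_pd = pd.DataFrame.sparse.from_spmatrix(X_sp)

row_sums = sparse_sum(X_pd, axis=1)  # one value per row
col_sums = sparse_sum(X_pd, axis=0)  # one value per column
```

The result is a dense `Series` with one entry per row or column, so memory use scales with the length of that axis rather than with the full shape.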

mzeitlin11 · 2021-09-12T17:39:31Z

This looks good now for the axis=0 case on master. The axis=1 case is a lot harder since storage is column-oriented (and it seems a less useful operation). A benchmark would probably be good to add for the axis=0 case.
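To illustrate why column-oriented storage makes the two axes differ, here is a rough sketch, not how pandas implements reductions internally. Each column is its own `SparseArray`, so a column total needs only that column's stored values, while a row total has to gather contributions from every column. The sketch assumes a fill value of 0 and uses `sp_index.to_int_index().indices`, a low-level attribute that may change between pandas versions.

```python
import numpy as np
import pandas as pd
from scipy import sparse

X_sp = sparse.coo_matrix(([1.0, 2.0, 3.0], ([0, 2, 2], [0, 1, 2])), shape=(4, 3))
df = pd.DataFrame.sparse.from_spmatrix(X_sp)

# axis=0: one SparseArray per column, so each total is local to that column.
col_sums = {}
for label, ser in df.items():
    arr = ser.array
    n_fill = len(arr) - arr.sp_values.size  # positions holding the fill value
    col_sums[label] = arr.sp_values.sum() + n_fill * arr.fill_value

# axis=1: scatter every column's stored values into one accumulator per row.
row_sums = np.zeros(len(df))
for _, ser in df.items():
    arr = ser.array
    assert arr.fill_value == 0  # assumption: unstored entries contribute nothing
    positions = arr.sp_index.to_int_index().indices
    np.add.at(row_sums, positions, arr.sp_values)
```

The axis=1 loop still allocates only one dense vector of length `len(df)`, which is far smaller than the full dense array.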

jbrockmendel · 2021-12-29T20:29:19Z

We can now use _reduce_axis1 to avoid a transpose in some cases, but I'm finding there are tricky corner cases to be worked out. Good topic for a medium-experience contributor.

jorisvandenbossche added the Sparse Sparse Data Type label Sep 18, 2019

mroeschke added the Bug label Apr 25, 2020

jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 21, 2020

impredicative mentioned this issue Feb 7, 2021: BUG: MemoryError: Unable to allocate #39629 (Closed)

mzeitlin11 added Benchmark Performance (ASV) benchmarks and removed Bug labels Sep 12, 2021

mzeitlin11 added this to the Contributions Welcome milestone Sep 12, 2021

mzeitlin11 changed the title ~~DataFrame[SparseArray] coerces to dense on sum~~ DataFrame[SparseArray] coerces to dense on sum(axis=1) Sep 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel added the Performance Memory or execution speed performance label Jun 7, 2023

jbrockmendel mentioned this issue Aug 15, 2023: PERF: axis=1 reductions with EA dtypes #54341 (Merged)
