PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

Merged: 18 commits into pandas-dev:master on Mar 22, 2020

Conversation

@rth (Contributor) commented Mar 19, 2020:

This optimizes DataFrame.sparse.from_spmatrix performance, using an approach proposed by @jorisvandenbossche in #32196 (comment).

Builds on top of #32821 and adds another ~4.5x speed-up in addition to that PR. Run times (in seconds) were measured by running pd.DataFrame.sparse.from_spmatrix on a random sparse CSR array of the given n_samples, n_features with density=0.01:

n_samples  n_features      PR   master   Speed-up master/PR
   100       100000     2.3624  10.1247               4.29x
 10000        10000     0.2391   1.1037               4.62x
100000          100     0.0031   0.0134               4.32x

with the benchmarking code below:

import pandas as pd
import numpy as np
import scipy.sparse

from neurtu import timeit, delayed


def bench_cases():
    for n_samples, n_features in [(100, 100000), (10000, 10000), (100000, 100)]:
        X = scipy.sparse.rand(
            n_samples, n_features, random_state=0, density=0.01, format="csr"
        )
        tags = {"n_samples": n_samples, "n_features": n_features, "label": "PR"}

        yield delayed(pd.DataFrame.sparse.from_spmatrix, tags=tags.copy())(X)


res = timeit(bench_cases())

res = (
    res.reset_index()
    .set_index(["n_samples", "n_features", "label"])["wall_time"]
    .unstack(-1)[["PR"]]
    .round(4)
)
# res["master"] = np.array([10.1247, 1.1037, 0.0134])

# res["Speed-up master/PR"] = (res["master"] / res["PR"]).round(2).astype("str") + "x"
print(res)

Closes #32196, although further optimization might be possible: around 90% of the remaining run time is spent in DataFrame._from_arrays, which goes deeper into pandas internals. Maybe some checks could be disabled there, but that looks less straightforward.
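For context, a minimal standalone sketch of the approach, assembled from the diff excerpts quoted further down in this thread. SparseArray._simple_new, DataFrame._from_arrays, and the check_integrity flag on IntIndex (which this PR adds) are pandas internals, so this only illustrates the shape of the change, not the exact merged code:

import scipy.sparse
from pandas import DataFrame, SparseDtype
from pandas.arrays import SparseArray
from pandas._libs.sparse import IntIndex


def from_spmatrix_sketch(mat):
    # CSC layout makes each column a contiguous slice of .data / .indices.
    data = mat.tocsc()
    data.sort_indices()  # required: check_integrity=False skips the sortedness check
    n_rows, n_columns = data.shape
    indices, indptr, array_data = data.indices, data.indptr, data.data
    dtype = SparseDtype(array_data.dtype, 0)
    arrays = []
    for i in range(n_columns):
        sl = slice(indptr[i], indptr[i + 1])
        idx = IntIndex(n_rows, indices[sl], check_integrity=False)
        arrays.append(SparseArray._simple_new(array_data[sl], idx, dtype))
    # Assemble the DataFrame column by column, skipping per-column copies/validation.
    return DataFrame._from_arrays(arrays, columns=range(n_columns), index=range(n_rows))


df = from_spmatrix_sketch(scipy.sparse.rand(10, 5, density=0.3, random_state=0, format="csr"))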

@simonjayhawkins added the Performance (memory or execution speed) and Sparse (sparse data type) labels on Mar 19, 2020
sl = slice(indptr[i], indptr[i + 1])
idx = IntIndex(n_rows, indices[sl], check_integrity=False)
arr = SparseArray._simple_new(data[sl], idx, dtype)
arrays.append(arr)
@rth (author) commented:

FWIW, I also tried a generator here to avoid pre-allocating all the arrays, but it doesn't really matter: most of the remaining run time is in DataFrame._from_arrays.

result.columns = columns
return result
n_rows, n_columns = data.shape
data.sort_indices()
@rth commented Mar 19, 2020:

It might already be done in tocsc, but that's a scipy implementation detail, and it doesn't cost much. We need to make sure indices are sorted, since we create IntIndex with check_integrity=False, which used to check for this.
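To illustrate the invariant on the scipy side (sort_indices and has_sorted_indices are public scipy.sparse APIs):

import scipy.sparse

m = scipy.sparse.rand(5, 3, density=0.5, random_state=0, format="csr")
csc = m.tocsc()
csc.sort_indices()  # ensure row indices within each column are ascending
assert csc.has_sorted_indices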

A reviewer (Member) replied:

Maybe add a comment about that inline?

@@ -227,15 +227,24 @@ def from_spmatrix(cls, data, index=None, columns=None):
         1    0.0  1.0  0.0
         2    0.0  0.0  1.0
         """
-        from pandas import DataFrame
+        from pandas import DataFrame, SparseDtype
+        from . import IntIndex, SparseArray

         data = data.tocsc()
@rth (author) commented:

This could be tocsc(copy=False) for newer scipy versions, but the performance gain would be minimal relative to the total runtime anyway.

@jreback (Contributor) left a review:

pls add an asv with a fixed random seed

@TomAugspurger (Contributor) left a review:

Thanks, looks nice! The asv would go in benchmarks/sparse.py.

(Resolved review comment thread on doc/source/whatsnew/v1.1.0.rst, now outdated)
rth and others added 3 commits on March 19, 2020 (Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>)
@rth commented Mar 19, 2020:

> pls add an asv with a fixed random seed

There is actually already an asv benchmark for this:

$ asv continuous --bench SparseDataFrameConstructor upstream/master HEAD
[...]
[100.00%] ··· sparse.SparseDataFrameConstructor.time_from_scipy                                                                 117±1ms
       before           after         ratio
     [dbd7a5d3]       [508fda51]
     <master>         <opt-sparse-init>
-         117±1ms      26.8±0.07ms     0.23  sparse.SparseDataFrameConstructor.time_from_scipy

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

which shows an improvement comparable to the benchmarks in the PR description above.

@TomAugspurger left a review:

Thanks for confirming that we have an existing benchmark for this.

@simonjayhawkins added this to the 1.1 milestone on Mar 19, 2020
@jorisvandenbossche (Member) commented:

> Around 90% of remaining run time happens in DataFrame._from_arrays which goes deeper into pandas internals. Maybe some checks could be disabled there, but that looks less straightforward.

Yeah, it's not really optimized for the case where you know you have all well-behaved arrays that you just want to put in a DataFrame.
Investigating this a bit, one factor seems to be the unnecessary consolidation -> #32826

(Resolved review comment thread on pandas/core/arrays/sparse/accessor.py, now outdated)
@WillAyd (Member) left a review:

minor nits but otherwise lgtm

(Resolved review comment thread on pandas/_libs/sparse.pyx, now outdated)
data.sort_indices()
indices = data.indices
indptr = data.indptr
data = data.data
A reviewer (Member) commented:

Can you assign to a different variable? This in theory could cause mypy complaints (though ndarray resolving to Any at the moment probably allows it to pass)
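Judging from the later diff excerpts, the rename landed as array_data; roughly (fragment for illustration only):

indices = data.indices
indptr = data.indptr
# Bind the values to a new name instead of rebinding `data`, so its
# static type stays csc_matrix throughout the function.
array_data = data.data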

rth and others added 2 commits on March 19, 2020 (Co-authored-by: William Ayd <william.ayd@icloud.com>)
@rth commented Mar 19, 2020:

Thanks @WillAyd, I addressed your comments.

"Web and docs" CI job failed with the following, but I don't think it's due to this PR,

 Check ipython directive errors
0s
##[error]Process completed with exit code 1.
Run ! grep -B1 "^<<<-------------------------------------------------------------------------$" sphinx.log
  FutureWarning,
<<<-------------------------------------------------------------------------
##[error]Process completed with exit code 1.

@jorisvandenbossche (Member) commented:

It's not very visible, but the actual error is:

 >>>-------------------------------------------------------------------------
Warning in /home/runner/work/pandas/pandas/doc/source/user_guide/scale.rst at block ending on line 254
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/fsspec/implementations/local.py:33: FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
  FutureWarning,
<<<-------------------------------------------------------------------------

so indeed not from this PR.

@jorisvandenbossche commented:

See #32832

idx = IntIndex(n_rows, indices[sl], check_integrity=False)
arr = SparseArray._simple_new(array_data[sl], idx, dtype)
arrays.append(arr)
return DataFrame._from_arrays(arrays, columns=columns, index=index)
@jorisvandenbossche (Member) commented:

This line will be able to use the verify_integrity=False from #32858 (or if this PR goes in first, I can add it in that PR)
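That is, once #32858 is in, presumably a one-line change along these lines (fragment, assuming the keyword that PR adds):

return DataFrame._from_arrays(
    arrays, columns=columns, index=index, verify_integrity=False
)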

@rth commented Mar 20, 2020:

@jreback Do you have any other comments on this? I addressed your review comment.

@jorisvandenbossche (Member) commented:

@rth I merged #32858, so you can update this now to use the verify_integrity=False

@@ -45,8 +45,7 @@ def time_sparse_array(self, dense_proportion, fill_value, dtype):
 class SparseDataFrameConstructor:
     def setup(self):
         N = 1000
-        self.arr = np.arange(N)
-        self.sparse = scipy.sparse.rand(N, N, 0.005)
+        self.sparse = scipy.sparse.rand(N, N, 0.005, random_state=0)
@jorisvandenbossche commented:

given what we said in the other PR, you can remove this random state again (it's covered by the global setup function which will be run for every benchmark)
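For reference, asv runs a module-level setup function before every benchmark in a module, so a single seed there covers all of them; a minimal sketch of such a hook (the actual pandas helper lives in the shared benchmark utilities):

import numpy as np


def setup(*args, **kwargs):
    # asv calls this module-level hook before each benchmark in the module,
    # so every run starts from the same random state.
    np.random.seed(1234)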

@rth commented Mar 20, 2020:

Thanks @jorisvandenbossche. After merging in #32858, the benchmark results are:

$ asv continuous --bench SparseDataFrameConstructor upstream/master HEAD
...
[100.00%] ··· sparse.SparseDataFrameConstructor.time_from_scipy                                                               104±0.3ms
       before           after         ratio
     [3b406a32]       [e063c8a5]
     <opt-sparse-init~2^2>       <opt-sparse-init>
-       104±0.3ms       12.0±0.3ms     0.12  sparse.SparseDataFrameConstructor.time_from_scipy

(previous results in #32825 (comment))

@jreback left a review:

lgtm. pls merge master and ping on green.

@@ -224,6 +224,9 @@ Performance improvements
 - The internal index method :meth:`~Index._shallow_copy` now copies cached attributes over to the new index,
   avoiding creating these again on the new index. This can speed up many operations that depend on creating copies of
   existing indexes (:issue:`28584`, :issue:`32640`, :issue:`32669`)
+- Performance improvement when creating sparse :class:`DataFrame` from
+  ``scipy.sparse`` matrices using the :meth:`DataFrame.sparse.from_spmatrix`
+  constructor (:issue:`32196`).
A reviewer (Contributor) commented:

add the issue number from joris PR as well

@rth (author) replied:

Added the PRs by Joris; there have been 5 PRs on this in total.

Ideally another what's new entry should be added, since the PRs by @jorisvandenbossche also made initialization of extension arrays faster under some conditions, as far as I understand. Though I would rather not add it here, nor am I competent to formulate it accurately.

@rth commented Mar 22, 2020:

With #32856 merged, there is a new failure for non-unique column names. In particular, TestFrameAccessor::test_from_spmatrix_columns[columns2], calling

pd.DataFrame.sparse.from_spmatrix(mat, columns=['a', 'a'])

fails with:

pandas/core/arrays/sparse/accessor.py:251: in from_spmatrix
    return DataFrame._from_arrays(
pandas/core/frame.py:1919: in _from_arrays
    mgr = arrays_to_mgr(
pandas/core/internals/construction.py:77: in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
pandas/core/internals/managers.py:1689: in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arrays = [[0, 0.7151164908889105, 0, 0, 0.7790091888007832, 0.8726851819885113, 0.8634012065028459, 0, 0, 0.6582977581693065]
F...13, 0, 0, 0.6245750826925724, 0, 0, 0.6870139075931206]
Fill: 0
IntIndex
Indices: array([1, 2, 3, 6, 9], dtype=int32)
]
names = ['a', 'a'], axes = [['a', 'a'], RangeIndex(start=0, stop=10, step=1)]

    def form_blocks(arrays, names, axes):
        # put "leftover" items in float bucket, where else?
        # generalize?
        items_dict = defaultdict(list)
        extra_locs = []
    
        names_idx = ensure_index(names)
        if names_idx.equals(axes[0]):
            names_indexer = np.arange(len(names_idx))
        else:
>           assert names_idx.intersection(axes[0]).is_unique
E           AssertionError

Any suggestions, @jorisvandenbossche?

@jorisvandenbossche (Member) commented:

Hmm, it's strange that it would be due to #32856, because it's failing inside form_blocks before the location that was changed in that PR. Maybe the verify_integrity=False causes it?

@rth commented Mar 22, 2020:

> Hmm, it's strange that it would be due to #32856, because it's failing inside form_blocks before the location that was changed in that PR. Maybe the verify_integrity=False causes it?

Yes, sorry, you are right. This failure was already present 2 days ago before I merged master today.

@@ -228,14 +228,29 @@ def from_spmatrix(cls, data, index=None, columns=None):
         2    0.0  0.0  1.0
         """
         from pandas import DataFrame
         from pandas._libs.sparse import IntIndex

         data = data.tocsc()
         index, columns = cls._prep_index(data, index, columns)
@jorisvandenbossche (Member) commented Mar 22, 2020:

I think the problem is that this _prep_index doesn't actually return Index objects when columns or index is not None (and passing actual Index objects is required when doing verify_integrity=False).

@rth (author) replied:

OK, I see. Thanks for confirming that passing duplicate column names in an Index object is expected to work in _from_arrays. Will look into it. Thanks Joris!

@jorisvandenbossche commented Mar 22, 2020:

That seems to be it, as doing pd.DataFrame.sparse.from_spmatrix(mat, columns=pd.Index(['a', 'a'])) instead of pd.DataFrame.sparse.from_spmatrix(mat, columns=['a', 'a']) works.

You can add an ensure_index call here on both index and columns (from pandas.core.indexes.api import ensure_index).
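A quick runnable illustration of the suggestion (ensure_index passes real Index objects through and coerces list-likes, preserving duplicates):

import pandas as pd
from pandas.core.indexes.api import ensure_index

cols = ensure_index(["a", "a"])
print(type(cols))      # an Index object, not a plain list
print(cols.is_unique)  # False: duplicate labels are preserved

In from_spmatrix, the fix would call ensure_index on both index and columns after _prep_index, so that _from_arrays(verify_integrity=False) always receives Index objects.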

@rth commented Mar 22, 2020:

Thanks @jorisvandenbossche and @jreback ! CI is green now.

@jreback merged commit d6de714 into pandas-dev:master on Mar 22, 2020
@jreback commented Mar 22, 2020:

thanks @rth very nice

Linked issue (#32196): DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices?