PERF: faster placement creating extension blocks from arrays #32856

jorisvandenbossche · 2020-03-20T09:13:09Z

When creating a DataFrame from many arrays stored in ExtensionBlocks, a significant share of the time is spent inside BlockPlacement, in the np.require call on the passed list. Specifying the placement as a slice instead makes creating the BlockPlacement much faster. This only delays the conversion to an array, but converting the slice to an array inside BlockPlacement later, when needed, is still faster than initially creating a BlockPlacement from a 1-element list/array.

From investigating #32196 (comment)

@rth this reduces it by another third! (only for the dataframe creation, to be clear)
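To illustrate the idea described above (keep a slice cheaply at construction time and only materialise the indices when they are needed), here is a minimal pure-Python sketch. `LazyPlacement` is a hypothetical stand-in written for this illustration; it is not the pandas `BlockPlacement` implementation and it assumes slices with explicit start and stop.

```python
from typing import List, Optional, Union


class LazyPlacement:
    """Hypothetical stand-in for a placement object (not the pandas class)."""

    def __init__(self, value: Union[slice, List[int]]) -> None:
        self._slice: Optional[slice] = None
        self._indices: Optional[List[int]] = None
        if isinstance(value, slice):
            # Cheap path: store the slice and defer building the indices.
            self._slice = value
        else:
            # Eager path: validate and copy the sequence up front.
            self._indices = [int(i) for i in value]

    @property
    def indices(self) -> List[int]:
        # Materialise on first access only, then reuse the cached list.
        # Assumes the slice has explicit start and stop values.
        if self._indices is None:
            s = self._slice
            self._indices = list(range(s.start, s.stop, s.step or 1))
        return self._indices


lazy = LazyPlacement(slice(3, 4, 1))  # nothing materialised yet
eager = LazyPlacement([3])            # converted immediately
print(lazy.indices == eager.indices)  # True
```

Both forms describe the same single column; the difference is only when the conversion cost is paid.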

jorisvandenbossche · 2020-03-20T09:13:57Z

Using the same example from #32826. With:

import numpy as np
import pandas as pd

# 10000 sparse columns of 1000 rows each
arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)]
index = pd.Index(range(len(arrays[0])))
columns = pd.Index(range(len(arrays)))

it gives

In [4]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)  
113 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
72.9 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- PR

jorisvandenbossche · 2020-03-20T10:15:52Z

This is also indirectly covered by the sparse benchmark, but adding some benchmarks specifically for _from_arrays:

       before           after         ratio
     [e31354f0]       [6d6e822a]
     <master>         <perf-placement>
-      14.5±0.7ms       9.58±0.6ms     0.66  frame_ctor.FromArrays.time_frame_from_arrays_sparse

asv_bench/benchmarks/frame_ctor.py

jreback · 2020-03-20T10:45:23Z

pandas/core/internals/managers.py

-            make_block(array, klass=ObjectValuesExtensionBlock, placement=[i])
+            make_block(
+                array, klass=ObjectValuesExtensionBlock, placement=slice(i, i + 1, 1)
+            )


rather than changing this here
simply convert a single integer into a slice at a lower level

Ah, indeed, I can create the slice inside the BlockPlacement constructor. But don't we then want to explicitly pass a single integer instead of a list of 1 integer?

we require list / slice

To compare, I just pushed a commit that does both.
Personally, I like passing it as an integer. It's another 33% faster compared to passing it as a slice or 1-element list, and it also makes explicit at construction time that the block covers a single column.

Whoops, I missed that you proposed a single integer yourself (for some reason I thought you wanted to catch the single-element list in BlockPlacement). Updated.
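The suggestion in this thread (accept a single integer and turn it into a slice at a lower level) can be sketched as a small normalisation step. `normalize_placement` is a hypothetical helper invented for illustration; the actual conversion in pandas lives inside the BlockPlacement constructor and may differ in detail.

```python
from typing import List, Sequence, Union

PlacementInput = Union[int, slice, Sequence[int]]


def normalize_placement(value: PlacementInput) -> Union[slice, List[int]]:
    """Hypothetical helper: map a single integer to a length-1 slice."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it to avoid silent surprises.
        raise TypeError("placement must be an int, slice, or sequence of ints")
    if isinstance(value, int):
        # A single column i is equivalent to slice(i, i + 1, 1).
        return slice(value, value + 1, 1)
    if isinstance(value, slice):
        return value
    # Anything else is treated as a sequence of integer positions.
    return [int(i) for i in value]


print(normalize_placement(5))    # slice(5, 6, 1)
print(normalize_placement([5]))  # [5]
```

With this shape, call sites can pass `i` directly, which also documents that the block holds exactly one column.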

pandas/core/internals/managers.py

jbrockmendel · 2020-03-20T15:07:35Z

another option would be to pass BlockPlacement objects. If we do that consistently, we could remove the checks/casting in the constructor/property

jorisvandenbossche · 2020-03-20T15:15:58Z

Sorry, I don't understand that option. You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

jbrockmendel · 2020-03-20T15:35:17Z

You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

Yah, the thought was about creating the BlockPlacement object and passing it to make_block rather than having it created later. Can ignore as orthogonal, but might trim some overhead.

jorisvandenbossche · 2020-03-20T15:38:29Z

passing it to make_block rather than having it created later.

It shouldn't matter much, I think: it is just passed through until the ExtensionBlock init, which does: if not isinstance(placement, BlockPlacement): placement = BlockPlacement(placement)

So we would first need to eliminate all other places where we pass a slice/array as placement to block creation, and then it would only eliminate one isinstance call
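The pass-through behaviour described here can be sketched with stand-in classes. `Placement`, `Block`, and this `make_block` are hypothetical simplifications, not the pandas internals; the only part taken from the discussion is the isinstance check in the block's init.

```python
class Placement:
    """Hypothetical placement wrapper that counts constructions."""

    created = 0

    def __init__(self, value):
        Placement.created += 1
        self.value = value


class Block:
    """Hypothetical block mirroring the isinstance check quoted above."""

    def __init__(self, values, placement):
        # Wrap only if the caller did not already pass a Placement.
        if not isinstance(placement, Placement):
            placement = Placement(placement)
        self.values = values
        self.placement = placement


def make_block(values, placement):
    # The placement is passed through untouched; Block.__init__ wraps it.
    return Block(values, placement)


b1 = make_block([1.0], 0)          # Placement created inside Block
p = Placement(1)                   # Placement created by the caller
b2 = make_block([2.0], p)          # reused as-is
print(Placement.created)           # 2
```

Either way one `Placement` is constructed per block, so pre-building it mainly moves the work and could at most save the isinstance check once every caller does the same.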

jbrockmendel · 2020-03-20T15:45:23Z

So we would first need to eliminate all other places where we pass a slice/array

Yes, it is not a trivial idea.

and then it would only eliminate one isinstance call

We could also make mgr_locs not-a-property, so get marginally faster lookups.

But again, ignore as orthogonal.
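For the side remark about making mgr_locs a plain attribute instead of a property, a rough way to measure the lookup difference is sketched below with the standard-library `timeit` module. `WithProperty`, `WithAttribute`, and `compare` are hypothetical names for illustration; the numbers depend on the interpreter and machine, so no result is claimed here.

```python
import timeit


class WithProperty:
    """Exposes the placement through a property (extra function call)."""

    def __init__(self, locs):
        self._locs = locs

    @property
    def locs(self):
        return self._locs


class WithAttribute:
    """Exposes the placement as a plain instance attribute."""

    def __init__(self, locs):
        self.locs = locs


def compare(number: int = 200_000) -> dict:
    # Time repeated lookups of the same attribute name on both variants.
    a = WithProperty(slice(0, 1, 1))
    b = WithAttribute(slice(0, 1, 1))
    return {
        "property": timeit.timeit(lambda: a.locs, number=number),
        "attribute": timeit.timeit(lambda: b.locs, number=number),
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f} s")
```

Any gain would be per lookup, which is why the comment treats it as marginal and orthogonal to this PR.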

jorisvandenbossche · 2020-03-20T15:49:12Z

Yep, any other comments on the PR itself?

jbrockmendel · 2020-03-20T15:51:04Z

LGTM

WillAyd · 2020-03-20T15:59:21Z

asv_bench/benchmarks/frame_ctor.py

+        self.columns = pd.Index(range(N_cols))
+
+    def time_frame_from_arrays_float(self):
+        self.df = DataFrame._from_arrays(


I wouldn't change this if there is no other feedback, but I don't think you need the assignment here in any of the benchmarks

True (was only mimicking the other benchmarks in this file)
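The two forms under discussion can be sketched as an asv-style benchmark class, where asv calls `setup` and then times each `time_*` method. `FromArraysSketch` and `build_frame` are hypothetical stand-ins so that the example runs without pandas; in the real benchmark the timed call is DataFrame._from_arrays.

```python
def build_frame(arrays, columns):
    """Hypothetical stand-in for the constructor being benchmarked."""
    return dict(zip(columns, arrays))


class FromArraysSketch:
    """asv-style layout: setup() prepares inputs, time_* methods are timed."""

    def setup(self):
        n_rows, n_cols = 100, 50
        self.arrays = [[float(i)] * n_rows for i in range(n_cols)]
        self.columns = list(range(n_cols))

    def time_build_with_assignment(self):
        # Form used in the diff: the result is kept on the instance.
        self.result = build_frame(self.arrays, self.columns)

    def time_build_without_assignment(self):
        # Form suggested in review: the result is simply discarded.
        build_frame(self.arrays, self.columns)


bench = FromArraysSketch()
bench.setup()
bench.time_build_without_assignment()
print(hasattr(bench, "result"))  # False
```

Both methods exercise the same call; the assignment only keeps the result alive after the method returns.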

jreback · 2020-03-21T20:20:20Z

asv_bench/benchmarks/frame_ctor.py

+        self.columns = pd.Index(range(N_cols))
+
+    def time_frame_from_arrays_float(self):
+        self.df = DataFrame._from_arrays(


hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)

jreback · 2020-03-21T20:21:05Z

thanks @jorisvandenbossche

PERF: faster placement creating extension blocks from arrays

b813d2e

jorisvandenbossche added the Performance (Memory or execution speed performance) label Mar 20, 2020

jorisvandenbossche added this to the 1.1 milestone Mar 20, 2020

jorisvandenbossche mentioned this pull request Mar 20, 2020

PERF: allow to skip validation/sanitization in DataFrame._from_arrays #32858

Merged

add _from_arrays specific benchmark

6d6e822

rth reviewed Mar 20, 2020

asv_bench/benchmarks/frame_ctor.py

jreback requested changes Mar 20, 2020

jorisvandenbossche added 2 commits March 20, 2020 12:00

convert single index inside BlockPlacement

ff4eeb6

just use integer

ca1a5fe

jbrockmendel reviewed Mar 20, 2020

pandas/core/internals/managers.py

WillAyd reviewed Mar 20, 2020

jorisvandenbossche added 2 commits March 20, 2020 21:11

Merge remote-tracking branch 'upstream/master' into perf-placement

3b0c546

use verify_integrity in the benchmarks

7dfeb19

jreback approved these changes Mar 21, 2020

jreback added the Internals (Related to non-user accessible pandas implementation) label Mar 21, 2020

jreback merged commit 00ae98d into pandas-dev:master Mar 21, 2020

jorisvandenbossche deleted the perf-placement branch March 21, 2020 21:05

rth mentioned this pull request Mar 22, 2020

PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

Merged

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

PERF: faster placement creating extension blocks from arrays (pandas-…

149dfd7

…dev#32856)

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020

PERF: faster placement creating extension blocks from arrays (pandas-…

cba3a5d

…dev#32856)
