Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: faster placement creating extension blocks from arrays #32856

Merged
merged 6 commits into from
Mar 21, 2020

Conversation

jorisvandenbossche
Copy link
Member

When creating a DataFrame from many arrays stored in ExtensionBlocks, it seems quite some time is taken inside BlockPlacement using np.require on the passed list. Specifying the placement as a slice instead gives a much faster creation of the BlockPlacement. This delays the conversion to an array, though, but afterwards the conversion of the slice to an array inside BlockPlacement when neeeded is faster than an initial creation of a BlockPlacement from a list/array of 1 element.

From investigating #32196 (comment)

@rth this reduces it with another third! (only from the dataframe creation, to be clear)

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Mar 20, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Mar 20, 2020
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 20, 2020

Using the same example from #32826. With:

arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)]
index = pd.Index(range(len(arrays[0])))  
columns = pd.Index(range(len(arrays)))    

it gives

In [4]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)  
113 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
72.9 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- PR

@jorisvandenbossche
Copy link
Member Author

This is also indirectly covered by the sparse benchmark, but adding some benchmarks specifically for _from_arrays:

       before           after         ratio
     [e31354f0]       [6d6e822a]
     <master>         <perf-placement>
-      14.5±0.7ms       9.58±0.6ms     0.66  frame_ctor.FromArrays.time_frame_from_arrays_sparse

make_block(array, klass=ObjectValuesExtensionBlock, placement=[i])
make_block(
array, klass=ObjectValuesExtensionBlock, placement=slice(i, i + 1, 1)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than changing this here
simply convert a single integer into a slice at a lower level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed, I can create the slice inside BlockPlacement constructor. But don't we then want to explicitly pass a single integer instead of a list of 1 integer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we require list / slice

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compare, I just pushed a commit that does both.
Personally, I like passing it as an integer. It's another 33% faster compared to passing it as a slice or 1-element list, and it makes it also explicit when constructing it that it is about a single column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I missed you proposed a single integer yourself (for some reason I thought you wanted to catch the single element list in BlockPlacement). Updated.

@jbrockmendel
Copy link
Member

another option would be to pass BlockPlacement objects. If we do that consistently, we could remove the checks/casting in the constructor/property

@jorisvandenbossche
Copy link
Member Author

Sorry, I don't understand that option. You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

@jbrockmendel
Copy link
Member

You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)

Yah, the thought was about creating the BlockPlacement object and passing it to make_block rather than having it created later. Can ignore as orthogonal, but might trim some overhead.

@jorisvandenbossche
Copy link
Member Author

passing it to make_block rather than having it created later.

It shouldn't matter much I think, it is just passed through until ExtensionBlock init, and there it's doing a if not isinstance(placement, BlockPlacement): placement = BlockPlacement(placement)

So we would first need to eliminate all other places where we pass a slice/array as placement to block creation, and then it would only elimiate one isinstance call

@jbrockmendel
Copy link
Member

So we would first need to eliminate all other places where we pass a slice/array

Yes, it is not a trivial idea.

and then it would only elimiate one isinstance call

We could also make mgr_locs not-a-property, so get marginally faster lookups.

But again, ignore as orthogonal.

@jorisvandenbossche
Copy link
Member Author

Yep, any other comments on the PR itself?

@jbrockmendel
Copy link
Member

LGTM

self.columns = pd.Index(range(N_cols))

def time_frame_from_arrays_float(self):
self.df = DataFrame._from_arrays(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change this if no other feedback but I don't think you need the assignment here in any of the benchmarks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True (was only mimicking the other benchmarks in this file)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)

self.columns = pd.Index(range(N_cols))

def time_frame_from_arrays_float(self):
self.df = DataFrame._from_arrays(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)

@jreback jreback added the Internals Related to non-user accessible pandas implementation label Mar 21, 2020
@jreback jreback merged commit 00ae98d into pandas-dev:master Mar 21, 2020
@jreback
Copy link
Contributor

jreback commented Mar 21, 2020

thanks @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants