
PERF: allow to skip validation/sanitization in DataFrame._from_arrays #32858

Conversation

jorisvandenbossche
Member

For cases where you know you have valid data (e.g. you just created it yourself, or it has already been validated), it can be useful to skip the validation checks when creating a DataFrame from arrays.

An example use case is #32825

From investigating #32196 (comment)

@rth this gives another 20% improvement on the dataframe creation part. Together with #32856, that amounts to a bit more than a 2x improvement on dataframe creation (once the sparse arrays are created)

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Mar 20, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Mar 20, 2020
@jorisvandenbossche
Member Author

In [1]: arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)] 
   ...: index = pd.Index(range(len(arrays[0])))   
   ...: columns = pd.Index(range(len(arrays)))

In [2]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)   
119 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns, verify_integrity=False)    
98.1 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
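The benchmark above can be reproduced at a smaller scale. A minimal sketch of the pattern (note that `_from_arrays` is a private constructor, so its exact signature may differ between pandas versions):

```python
import numpy as np
import pandas as pd

# Small-scale version of the benchmark above: many sparse columns
# that are already valid, equal-length arrays.
arrays = [
    pd.arrays.SparseArray(np.random.randint(0, 2, 100), dtype="float64")
    for _ in range(50)
]
index = pd.Index(range(len(arrays[0])))
columns = pd.Index(range(len(arrays)))

# Default path: inputs are validated and homogenized.
df = pd.DataFrame._from_arrays(arrays, index=index, columns=columns)

# Fast path added by this PR: the caller vouches for the inputs,
# so validation/sanitization is skipped.
df_fast = pd.DataFrame._from_arrays(
    arrays, index=index, columns=columns, verify_integrity=False
)
```

Both calls produce the same frame; the second simply trusts the caller and avoids the per-column sanitization pass.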

@rth
Contributor

rth commented Mar 20, 2020

Very nice, thanks @jorisvandenbossche! It's good that it applies to extension arrays in general, not just sparse frames. Though I guess the use case of a very large number of columns (>10k) is less common outside of sparse.

@rth
Contributor

rth commented Mar 20, 2020

Actually my comment was more about #32856; I got confused among your multiple performance-improvement PRs :) This is nice too!

Optional dtype to enforce for all arrays.
verify_integrity : bool, default True
Validate and homogenize all input. If set to False, it is assumed
that all elements of `arrays` are actual arrays to be stored in
Contributor


By "actual arrays" do you mean numpy ndarray or pandas arrays? Might be worth specifying.

Member Author


Either of those. Basically the array as it is stored in a block. I'll mention that it needs to be one of the two.
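To illustrate the point about accepted array kinds, a hedged sketch: with `verify_integrity=False` each element of `arrays` should already be a NumPy `ndarray` or a pandas `ExtensionArray`, i.e. exactly the object that gets stored in a block (again, `_from_arrays` is private API and may change):

```python
import numpy as np
import pandas as pd

# One NumPy ndarray and one pandas ExtensionArray: both are
# "actual arrays" in the docstring's sense, stored directly
# in the DataFrame's internal blocks.
np_col = np.arange(4, dtype="float64")
ea_col = pd.array([1, 2, 3, 4], dtype="Int64")

df = pd.DataFrame._from_arrays(
    [np_col, ea_col],
    columns=pd.Index(["a", "b"]),
    index=pd.Index(range(4)),
    verify_integrity=False,
)
```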

@jorisvandenbossche
Member Author

I think your comment applies to both ;)

@simonjayhawkins
Member

Nice docstring. 😄

@TomAugspurger
Contributor

What order do we want to do this and #32825 in?

Seems like we merge this first and then update #32825 to pass verify_integrity=False?

@jorisvandenbossche
Member Author

Doesn't matter too much; it can be added either here or there.
I would also like to add verify_integrity to the benchmark I am adding in #32856

@TomAugspurger
Contributor

SGTM. I think this can go in since CI is passing :)

@jorisvandenbossche jorisvandenbossche merged commit 3b406a3 into pandas-dev:master Mar 20, 2020
@jorisvandenbossche jorisvandenbossche deleted the perf-arrays-skip-sanitize branch March 20, 2020 20:06