PERF: construct DataFrame with string array and dtype=str #36432

Merged: 4 commits, Sep 19, 2020


@topper-123 (Contributor) commented Sep 17, 2020

Avoid the inefficient call to arr.astype() when dtype is str, and use ensure_string_array instead.

Performance example:

>>> import numpy as np
>>> import pandas as pd
>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object).reshape(500_000, 2)
>>> %timeit pd.DataFrame(x, dtype=str)
391 ms ± 17.7 ms per loop  # master
11.9 ms ± 131 µs per loop  # after this PR

xref #35519, #36304 & #36317.
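
To illustrate why a dedicated string-ensuring pass can beat a generic astype call, here is a minimal pure-Python sketch. It assumes the gain comes from leaving elements that are already str untouched and converting only the rest. The function name ensure_str_sketch and its na_value parameter are hypothetical and exist only for illustration; this is not the pandas implementation of ensure_string_array.

```python
import numpy as np


def ensure_str_sketch(arr: np.ndarray, na_value=np.nan) -> np.ndarray:
    """Hypothetical illustration only, not pandas' ensure_string_array.

    Returns an object array of the same shape in which existing str
    elements are reused as-is, missing values become ``na_value``, and
    everything else is converted with ``str()``.
    """
    out = np.empty(arr.shape, dtype=object)
    for idx in np.ndindex(arr.shape):
        val = arr[idx]
        if isinstance(val, str):
            # Already a string: keep the same object, no conversion needed.
            out[idx] = val
        elif val is None or (isinstance(val, float) and val != val):
            # None or NaN: treat as missing (an assumption of this sketch).
            out[idx] = na_value
        else:
            out[idx] = str(val)
    return out


mixed = np.array(["a", 1, None, 2.5], dtype=object).reshape(2, 2)
converted = ensure_str_sketch(mixed)
```

When every element is already a str, as in the benchmark above, a pass like this only performs type checks, which is the case the PR targets.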

@jorisvandenbossche (Member)

Nice! Add a benchmark for it?
And is this speed-up only for constructing from an array, or also e.g. from a dict {'col': x}?

@topper-123 (Contributor Author)

> And is this speed-up only for constructing from an array, or also e.g. from a dict {'col': x}?

This PR only affects the path for multi-dimensional arrays; {'col': x} takes the 1-dim path, so it is not affected by this change. The 1-dim path was, however, helped by #36317.
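
For reference, the two construction routes discussed here look like this from the caller's side. This is a small sketch using only the public constructor; the variable names are illustrative, and which internal path each call takes is as described in the comment above, not something the snippet itself verifies.

```python
import numpy as np
import pandas as pd

# Six Python str values in a 1-D object array.
values = np.array([str(u) for u in range(6)], dtype=object)

# 2-D ndarray input: the multi-dimensional path that this PR speeds up.
df_2d = pd.DataFrame(values.reshape(3, 2), dtype=str)

# Dict of 1-D arrays: the 1-dim path, which this PR does not touch.
df_dict = pd.DataFrame({"col": values}, dtype=str)
```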

@topper-123 (Contributor Author)

> Nice! Add a benchmark for it?

Yeah, I'll add it once the Travis checks work again.

@jreback added the Performance and Strings labels Sep 17, 2020
@jreback added this to the 1.2 milestone Sep 17, 2020
@topper-123 force-pushed the DataFrame_construction_str_perf branch from 9d30ee4 to 51bb0ad September 18, 2020 08:44
@topper-123 force-pushed the DataFrame_construction_str_perf branch from 51bb0ad to 256ad70 September 19, 2020 07:15
@topper-123 (Contributor Author)

Updated and added ASVs.
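
A benchmark for this case could look roughly like the sketch below, which follows the usual asv convention of a setup method plus time_-prefixed methods. The class name, method name, and array size are hypothetical; the benchmark actually added in this PR may differ.

```python
import numpy as np
import pandas as pd


class FromStringArray:
    """Hypothetical asv-style benchmark, not the one added in the PR."""

    def setup(self):
        # Build a 2-D object array of Python str values once per run.
        n = 100_000
        values = [str(u) for u in range(n)]
        self.arr = np.array(values, dtype=object).reshape(n // 2, 2)

    def time_frame_from_2d_str_array(self):
        # asv times the body of methods whose names start with "time_".
        pd.DataFrame(self.arr, dtype=str)
```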

@jreback merged commit 605efc6 into master Sep 19, 2020
@jreback (Contributor) commented Sep 19, 2020

thanks very nice!

@topper-123 deleted the DataFrame_construction_str_perf branch September 19, 2020 20:06
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020