PERF: creating string Series/Arrays from sequence with many strings #36304

Merged
merged 2 commits into from Sep 12, 2020

topper-123
Contributor

@topper-123 topper-123 commented Sep 12, 2020

Improves the performance of pandas._libs.lib.ensure_string_array in cases with many string elements.

Examples:

>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object)
>>> %timeit pd.Series(x, dtype=str)
344 ms ± 59.7 ms per loop  # v1.1.0
157 ms ± 7.04 ms per loop  # v1.1.1 and master
22.6 ms ± 191 µs per loop  # this PR
>>> %timeit pd.Series(x, dtype="string")
357 ms ± 40.2 ms per loop  # v1.1.0
148 ms ± 713 µs per loop  # v1.1.1 and master
26.3 ms ± 291 µs per loop  # this PR
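
For anyone who wants to repeat the measurement outside IPython, here is a minimal sketch using the standard-library timeit module. The helper name best_construction_time is hypothetical and not part of pandas. It reports the best of several runs rather than the mean ± standard deviation that %timeit prints, so the numbers will not match the ones above exactly.

```python
import timeit

import numpy as np
import pandas as pd


def best_construction_time(n, dtype, repeat=5):
    """Return the best wall-clock time (seconds) to build a Series of n strings.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Same input shape as in the session above: an object array of Python strings.
    x = np.array([str(u) for u in range(n)], dtype=object)
    timer = timeit.Timer(lambda: pd.Series(x, dtype=dtype))
    # One construction per repetition; the minimum is the least noisy estimate.
    return min(timer.repeat(repeat=repeat, number=1))


if __name__ == "__main__":
    for dtype in (str, "string"):
        seconds = best_construction_time(1_000_000, dtype)
        print(f"dtype={dtype!r}: {seconds * 1e3:.1f} ms")
```

Results depend heavily on the installed pandas version and the machine, so compare runs only against each other on the same setup.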

#35519 is the cause of the improvement from 1.1.0 to 1.1.1.

Together with #35519, this PR considerably reduces the overhead of working with strings in pandas for code that repeatedly instantiates new Series/PandasArrays with dtype str/StringDtype, which is probably quite common.
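
The description above does not spell out how the speedup is achieved. As an assumption for illustration only, one common way to make this kind of conversion cheap is to skip elements that are already strings, so that only non-string values pay for a str() call and a write. The pure-Python sketch below shows that idea; the function name ensure_strings_sketch is hypothetical, missing-value handling is left out, and this is not the Cython code in pandas._libs.lib.

```python
def ensure_strings_sketch(values):
    """Return a new list in which every element is a str.

    Hypothetical illustration only; not pandas' implementation.
    """
    result = list(values)  # copy so the caller's sequence is left untouched
    for i, val in enumerate(result):
        if isinstance(val, str):
            # Fast path: already a string, nothing to convert or write back.
            continue
        result[i] = str(val)
    return result
```

With an input such as ["a", 1, 2.5], only the last two elements are converted; an input consisting entirely of strings goes through the loop without any conversion work.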

@topper-123 topper-123 changed the title PERF: creating Series/Arrays with dtype str/StringDtype from sequence with many strings PERF: creating string Series/Arrays from sequence with many strings Sep 12, 2020
@jbrockmendel
Member

nice. Does this have/merit an asv?

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels Sep 12, 2020
@jreback jreback added this to the 1.2 milestone Sep 12, 2020
Contributor

@jreback jreback left a comment

nice, ping when ready (if we need an asv) or if we have sufficient.

@topper-123
Contributor Author

There are ASVs for both str and "string" in asv_bench/benchmarks/strings.py.
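
For readers unfamiliar with asv, a benchmark covering both dtypes could look roughly like the sketch below. It follows asv's general conventions (a params list, a setup method, and time_* methods), but the class name, method name, and input size are hypothetical; this is not the content of asv_bench/benchmarks/strings.py.

```python
import numpy as np
import pandas as pd


class StringConstruction:
    # Hypothetical benchmark class; asv runs each time_* method once per
    # entry in `params`, passing the entry as an argument.
    params = ["str", "string"]
    param_names = ["dtype"]

    def setup(self, dtype):
        # Build the input outside the timed region.
        self.values = np.array([str(u) for u in range(10_000)], dtype=object)

    def time_series_construction(self, dtype):
        # Only this call is timed by asv.
        pd.Series(self.values, dtype=dtype)
```

Building the input in setup keeps array creation out of the measurement, so the timing reflects only the Series construction.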

@jreback jreback merged commit 6100425 into pandas-dev:master Sep 12, 2020
@jreback
Contributor

jreback commented Sep 12, 2020

thanks @topper-123 very nice!


3 participants