Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: StringArray construction #36325

Merged
merged 4 commits into from
Sep 17, 2020

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Sep 13, 2020

Currently, when constructing through Series(data, dtype="string"), pandas first converts to strings/NA, then does a check that all scalars are actually strings or NA. The check is not needed in cases where we explicitly already have converted.

Performance example:

>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object)
>>> %timeit pd.Series(x, dtype="string")
357 ms ± 40.2 ms per loop  # v1.1.0
148 ms ± 713 µs per loop  # after #35519
26.3 ms ± 291 µs per loop  # after #36304
12.6 ms ± 115 µs per loop  # this PR

xref #35519, #36304 & #36317.

@jreback jreback added Performance Memory or execution speed performance Strings String extension data type and string data labels Sep 13, 2020
@@ -122,6 +122,9 @@ class StringArray(PandasArray):

copy : bool, default False
Whether to copy the array of data.
convert : bool, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence), but even there we don't do coercion. I think this actually needs to be done in the caller. cc @jorisvandenbossche @TomAugspurger

Copy link
Contributor Author

@topper-123 topper-123 Sep 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want this parameter generally in the actual constructors (you can instead do this in `_from_sequence)

The decision to validate or convert needs to be in the __init__, because the validation already is there and it has no way to know from the input array, whether the input array only contains strings and NA's. So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.

An alternative is to always run the conversion step in StringArray.__init__, I.e. non-string scalars. In that case validation is not needed at all. It would not cost in performance, after #36162 has been merged:

>>> x = np.array([str(u) for u in range(1_000_000)], dtype=object)
>>> %timeit pd._libs.lib.ensure_string_array(x, copy=False)
11.5 ms ± 145 µs per loop
>>> %timeit pd._libs.lib.is_string_array(x)
13.3 ms ± 80.5 µs per loop

Copy link
Member

@jorisvandenbossche jorisvandenbossche Sep 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to recall we had discussion about trying to avoid this double validation, but don't directly find anything on the original PR #27949

So it needs a flag to tell it if it should validate or convert because validation after conversion is wasteful.

Alternative would be a flag to disable validation (instead of a flag to enable conversion),then _from_sequence can do the conversion and signal to the constructor it doesn't need to validate.

In general, I prefer the Array class constructors to be minimal and performant (they are not meant for day-to-day users, those should use pd.array(..)), eg the IntegerArray constructor only checks that the passed values are actually an ndarray and have the correct dtype (and not any other coercion or conversion). But of course, there it is a relatively cheap check, while for object dtype array this check is expensive.
So given that limiation for StringArray using object dtype, I am fine with adding such a "fastpath" keyword in this case (but I would use something like validate or verify_integrity)

@jbrockmendel
Copy link
Member

rebase/lint checks

@TomAugspurger
Copy link
Contributor

I'd prefer not to have a public argument that allows people to create an invalid object.

The check is not needed in cases where we explicitly already have converted.

Can you say a bit more / give an example where we're doing the conversion / validation twice? Keep in mind that even if we know that we have all strings, we want to normalize all the NA values to pd.NA.

@jorisvandenbossche
Copy link
Member

Can you say a bit more / give an example where we're doing the conversion / validation twice?

_from_sequence converts the input to an object array of all strings (or nans), and then __init__ validates that the input are all strings (which we know is the case, as _from_sequence already ensured the conversion)

@TomAugspurger
Copy link
Contributor

Right, Thanks.

Thoughts on refactoring _from_sequence to use object.__new__(StringArray) and then manually initializing by setting _ndarray and _dtype (and skipping the conversion / checks we don't need?) My preference would be to keep this out of the public API given the relative immaturity of StringArray.

@topper-123
Copy link
Contributor Author

I'll look into this tonight or tomorrow.

@topper-123
Copy link
Contributor Author

I’ve made a new version.

@jreback jreback added this to the 1.2 milestone Sep 17, 2020
@jreback jreback merged commit b4d0ae5 into pandas-dev:master Sep 17, 2020
@jreback
Copy link
Contributor

jreback commented Sep 17, 2020

thanks @topper-123

@topper-123 topper-123 deleted the StringArray_constrution_perf branch September 17, 2020 17:10
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants