
API: Add string extension type #27949

Merged — 59 commits, merged on Oct 5, 2019
Conversation

@TomAugspurger (Contributor)

This adds a new extension type 'string' for storing string data.

The data model is essentially unchanged from master. Strings are still
stored in an object-dtype ndarray. Scalar elements are still Python
strs, and np.nan is still used as the missing value.

Things are pretty well contained. The major changes outside the new array are

  1. docs
  2. core/strings.py, to handle things correctly (mostly, returning string dtype when there's string input).

No rush on reviewing this. Just parking it here for now.
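As a minimal sketch of what the new dtype looks like from the user side (assuming the constructor spellings as merged, i.e. `pd.array(..., dtype="string")`; the exact missing-value sentinel is whatever pandas uses for this dtype):

```python
import pandas as pd

# Construct a string-dtype array; non-string scalars are rejected
# by the array's validation.
arr = pd.array(["a", "b", None], dtype="string")
print(arr.dtype)   # string

# Scalar elements are plain Python strs; missing values propagate.
s = pd.Series(arr)
print(s.str.upper())
```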

@TomAugspurger added labels: Dtype Conversions, Strings, ExtensionArray (Aug 16, 2019)
@jbrockmendel (Member)

Does the .str accessor only show up on string-dtype Series, or also object-coincidentally-all-string Series as in master?

@TomAugspurger (Contributor, Author)

Also on object-dtype as in master.

This PR doesn't change the current behavior at all (modulo new bugs).
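To illustrate the unchanged behavior (a quick sketch, assuming pandas with the merged string dtype installed):

```python
import pandas as pd

# .str works on object-dtype Series whose values happen to be strings,
# exactly as on master...
s_obj = pd.Series(["cat", "dog"])                   # dtype: object
print(s_obj.str.upper().tolist())                   # ['CAT', 'DOG']

# ...and also on the new string dtype.
s_str = pd.Series(["cat", "dog"], dtype="string")
print(s_str.str.upper())
```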

@mroeschke (Member)

May be a more suitable question for an issue, but dtype=str will still point to object in this PR and in the future(?)

@TomAugspurger (Contributor, Author) commented Aug 16, 2019

dtype=str will still point to object in this PR and in the future(?)

It'll have to for now (in my first commit I allowed dtype='str' as well as dtype='string', but that of course broke things).

We'll want to think of a migration path for making 'str' mean StringDtype, and probably we'll want to infer ['a', 'b'] as string rather than object.
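The current split between the two spellings can be seen directly (a sketch, assuming pandas as of this PR, where 'str' still means object dtype):

```python
import pandas as pd

# dtype=str (or 'str') still maps to NumPy's object dtype for now...
s1 = pd.Series(["a", "b"], dtype=str)
print(s1.dtype)   # object (at the time of this PR)

# ...while 'string' opts in to the new extension dtype.
s2 = pd.Series(["a", "b"], dtype="string")
print(s2.dtype)   # string
```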

@TomAugspurger (Contributor, Author) commented Aug 16, 2019

I was curious about how much the string validation (ruling out pd.array(['a', 1], dtype="string")) costs.

relative:

[plot: StringArray construction overhead relative to NumPy]

absolute:

[plot: absolute construction timings]

Code:
from timeit import default_timer as tic

import numpy as np
import pandas as pd
import pandas.util.testing as tm

ns = [0, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
times = []
for n in ns:
    data = tm.makeStringIndex(n).tolist()
    t0 = tic()
    np.array(data, dtype=object)  # baseline: plain object-dtype ndarray
    t1 = tic()
    pd.core.arrays.StringArray._from_sequence(data)  # with string validation
    t2 = tic()

    times.append((n, t1 - t0, t2 - t1))

df = pd.DataFrame(times, columns=['n', 'numpy', 'pandas']).set_index('n')
df.div(df['numpy'], axis=0).plot(
    logx=True, title="Overhead Relative to NumPy"
)

I believe there are two sources of relative slowdown:

  1. Per-value validation that the scalars are strings or NA.
  2. Ensuring that we have a consistent NA value (no mix of None and np.nan). Right now, this is done with an isna() (a linear scan over the values) and a setitem if needed. I suspect that this could be rolled into the per-value scalar validation if desired, but I'm less concerned about perf right now.
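The "roll the NA normalization into the per-value scan" idea could be sketched roughly like this (a hypothetical helper for illustration, not the PR's actual code):

```python
import numpy as np

def validate_and_normalize(values):
    """Hypothetical single-pass sketch: check that every scalar is a str
    or NA, and canonicalize the missing value to np.nan, in one loop
    instead of a separate isna() scan plus setitem."""
    out = np.empty(len(values), dtype=object)
    for i, v in enumerate(values):
        if isinstance(v, str):
            out[i] = v
        elif v is None or (isinstance(v, float) and v != v):
            out[i] = np.nan  # single canonical NA value
        else:
            raise ValueError(f"non-string value at position {i}: {v!r}")
    return out
```

For example, `validate_and_normalize(["a", None, float("nan")])` yields an object ndarray with np.nan in both missing slots, while `validate_and_normalize(["a", 1])` raises.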

@TomAugspurger (Contributor, Author)

(CI is passing now).

@jbrockmendel (Member)

are you looking to get this merged quickly? i.e. can this go on the "review later this week" pile or does it belong in the "later today" pile?

@TomAugspurger (Contributor, Author) commented Aug 19, 2019 via email

@jreback (Contributor) left a comment

is the intention to have this have an encoding (maybe parameterized)? or is that a unicode type? IOW shouldn't this be string[utf8]? (and what you have here is a base class for this); we could certainly default dtype='string' -> string[utf8].

also rebase when you have a chance. have only glanced, but looks pretty good so far.

@TomAugspurger (Contributor, Author)

is the intention to have this have an encoding (maybe parameterized)? or is that a unicode type?

Mmm, this type is for in-memory strings, so I don't think an encoding is necessary. I don't think we're exposing the actual storage anywhere, which is when the encoding would matter.

If we had a ByteArray, then yes I think that should be parametrized by encoding (with UTF8 as the default).

@pep8speaks commented Sep 9, 2019

Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-04 14:03:39 UTC

@WillAyd (Member) left a comment

For the sake of clarity this doesn't offer any real speed or memory savings over the object dtype right? Just a better way to ensure that an array only contains strings?

@jorisvandenbossche (Member)

I am a bit biased as I have a clear preference, but trying to summarize the arguments for "string" vs "text":

Reasons to go with "string":

  • It matches our own existing terminology (string methods with str accessor)
  • It matches Python terminology (str type, the scalar value of this array is a python string)
  • It matches what some other libraries / languages do (Arrow, Julia (they also have a Char), C++, ..; R uses character)

Reasons to go with "text":

  • A less subtle difference with "dtype=str" (which for now keeps giving the object dtype)
  • .. other?

SQL has VARCHAR (so also character), Postgres also has a non-SQL-standard TEXT type (but the equivalent of what we are building here is not TEXT but VARCHAR, AFAIU)

i think i would be more amenable to (an alias of string) if we start showing a FutureWarning if passing 'str' as a dtype (in preparation for the switch)

IMO, we shouldn't do that now, let's first get some experience with it as an opt-in experimental feature.
But I think we agree the long term goal is to change "str" to mean this StringDtype at some point? For me, that is another reason to go with "string" now, as that will cause less confusion in the future (at some point, "str" and "string" can be aliases then).

@jreback (Contributor) commented Sep 30, 2019

yep i’ll retract my support for Text and go with String

but i think we need to make it very clear that for now str is not the same as string

@TomAugspurger (Contributor, Author)

Yep, I think we'll want dtype='str' to mean this new EA dtype in the future.

And we'll likely want to deprecate .str on object-dtype columns. But that's down the road.

I'll make the switch back to StringDtype / StringArray.

self._validate()

def _validate(self):
"""Validate that we only store NA or strings."""
Review comment (Contributor):

does this pass if you pass a StringArray itself? (and do we infer correctly in is_string_array?)

@TomAugspurger (Contributor, Author):

does this pass if you passing a StringArray itself?

Done.

and do we infer correctly in is_string_array

Cython actually raises on lib.is_string_array(StringArray) since it's expecting an ndarray. Not sure what's best here. IIUC, we only use len(values) and values.dtype.
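For reference, a small check of what lib.is_string_array accepts (pandas private API; the StringArray behavior described above is assumed from the discussion, so only the ndarray case is exercised here):

```python
import numpy as np
from pandas._libs import lib

# lib.is_string_array expects an object-dtype ndarray; per the
# discussion, passing an ExtensionArray such as StringArray raises.
# Converting to an ndarray first works:
values = np.array(["a", "b"], dtype=object)
print(lib.is_string_array(values))   # True
```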

else:
dtype = None

result = arr._constructor_expanddim(
Review comment (Contributor):

does arr.dtype work here?

@TomAugspurger (Contributor, Author):

Not always: #27953

@jreback merged commit 9cfb8b5 into pandas-dev:master on Oct 5, 2019
@jreback (Contributor) commented Oct 5, 2019

thanks @TomAugspurger very nice. IIRC there are a couple of followups.

@jreback (Contributor) commented Oct 5, 2019

did this have an issue to close?

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Oct 7, 2019
@TomAugspurger (Contributor, Author) commented Oct 7, 2019 via email
