
TSK: Follow-up things for stringdtype #25693

Open · 5 of 13 tasks
mhvk opened this issue Jan 25, 2024 · 13 comments

@mhvk (Contributor) commented Jan 25, 2024:

Proposed new feature or change:

A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.

  1. The docs for StringDType describe a size argument that was needed during development but not for the actual release. It can be removed; this should happen before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses the arena? See below. It can probably be put back if needed...)
  2. The add ufunc needs a promoter so that addition with a Python string works (see the sketch after this list).
  3. Add a cython interface for the stringdtype C API.
  4. It is likely better to use a flag for "long strings" (stored outside the numpy array proper) instead of one for short strings (stored inside), so that an all-zero entry correctly implies a zero-length string (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  5. Refactor the flags in the vstring implementation to use bitfields. This will improve clarity and eliminate complicated bitflag math.
  6. Possibly, the arena should have more recoverable flags/size information (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  7. Investigate refactoring new_stringdtype_instance into tp_init
  8. Replace the long #define in casts.c with templating (or .c.src)
  9. Replace ufunc wrappers with passing functions into *auxdata (see here, here, here, and here) [e.g., minimum and maximum; the various comparison functions; the templated unary functions; find, rfind and maybe count; startswith and endswith; lstrip, rstrip and strip, plus whitespace versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796
  10. Check array2string formatter overrides.
  11. Adjust error messages referring to "object array" (e.g., a.view('2u8') currently errors with "Cannot change data-type for object array.").
  12. Have some helper functions that make it easy for StringDType ufuncs to use out arguments, also for in-place operations.
  13. See whether null-handling code in ufunc loops and casts can be consolidated into a helper function to reduce code duplication.
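For item 2, a minimal sketch of the target behavior (whether np.add already accepts a Python str operand here depends on your NumPy version; this only illustrates what the promoter should enable, not a claim about the current state):

import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["numpy", "string"], dtype=StringDType())
# With the promoter in place, mixing with a Python str should work:
print(np.add(arr, "!"))  # ['numpy!' 'string!']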

Things possibly for longer term

  • Support in structured arrays (perhaps not super useful, but could be seen as similar to object).
  • Expose more of the currently private NpyString API. This will depend on feedback from users.
  • Fix longdouble to string, which is listed as broken in a TODO item in casts.c. Isn't Dragon4 able to deal with it?
  • Add a DType API callback that triggers when the initial filling of a newly created array finishes (e.g., after PyArray_FromAny fills a new array). We could use this to trim the arena buffer to the exact size the array needs. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25 (see the sketch after this list).
  • Might it make sense, on 64-bit systems where the size is normally 16 bytes, to have an 8-byte version (short strings up to 7 bytes, arena allocations only for long ones; this might use the size argument...)?
  • In principle, .view(StringDType()) could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
  • Dealing with array scalars vs str scalars - see also more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897.
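To make the over-allocation point concrete, a small sketch of exponential growth with factor 1.25 (the 256-byte starting size is an assumption for illustration, not the actual implementation's value):

# Grow an arena-style buffer by 1.25x until it can hold 1000 bytes
size = 256
while size < 1000:
    size = int(size * 1.25)
print(size)  # 1220 -> 220 bytes of slack that a post-fill trim hook could reclaim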

Small warts, possibly not solvable

  • StringDType is added via add_dtype_helper late in the initialization of multiarraymodule; can this be made easier?
  • Can the cases of having and not having the GIL be factored out, so that one doesn't get the kind of hybrid code in load_non_nullable_string with its has_gil argument?
  • Having dtype.hasobject be true is logical, but the name is not quite right.
@ngoldbaum (Member):

Added a few things I had on my personal list. Thanks for opening this and for all the review this week!

@ngoldbaum ngoldbaum changed the title Follow-up things for stringdtype TSK: Follow-up things for stringdtype Jan 31, 2024
@ngoldbaum (Member):

@asmeurer pointed out to me that we should add a proper scalar type. Right now stringdtype's scalar is str, which doesn't inherit from np.generic, so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

We can fix this by defining a proper scalar type that wraps a single packed string entry and exposes an API that is duck-compatible with str, making use of the ufunc implementations we're adding.

This may also lead to performance improvements, since scalar indexing won't need to copy to Python's internal string representation.

I don't think this needs to happen for 2.0 since this is something we can improve later and I doubt it will have major user-facing impacts, since object string arrays also effectively have a str scalar.
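The status quo described above is easy to demonstrate (output as of NumPy 2.0):

import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["hello"], dtype=StringDType())
print(type(arr[0]))                    # <class 'str'>
print(isinstance(arr[0], np.generic))  # False: the oddball in the type hierarchy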

@seberg (Member) commented Feb 6, 2024:

so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

I personally don't see this as a big issue. Although it might be friendly to not convert to a scalar as often as we currently do when it is a string scalar (the timeline is bad for that too, though).

@mhvk (Contributor, author) commented Feb 6, 2024:

I think the main problem may be that people generally expect that they can treat array[something] as array-ish, with a .shape, etc. In that sense we probably do need a scalar type (or, perhaps preferably, should just not drop to a scalar in the first place...).

If we're going to be "not quite right" for 2.0, should we perhaps err on the side of not creating a str object?

@ngoldbaum (Member):

Are you saying that we should create a 0D array instead?

@ngoldbaum (Member):

If so, I think it's much more important for the scalar to be duck-compatible with str than with ndarray, especially if the goal is to replace object string arrays in downstream packages.

@mhvk (Contributor, author) commented Feb 6, 2024:

Yes, I was; I really dislike array scalars and wish everything were 0-D arrays instead. But you make a good point that we want to make sure that object arrays can easily be replaced...

EDIT: because I dislike how the type that comes out of __getitem__ or .sum() depends on the arguments. Why should axis=None give me a different instance than axis=-1? And it is just a hassle if one subclasses ndarray...
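A quick demonstration of the inconsistency described in the edit above:

import numpy as np

a = np.ones((2, 3))
print(type(a.sum(axis=-1)))    # <class 'numpy.ndarray'>
print(type(a.sum(axis=None)))  # <class 'numpy.float64'> -- dropped to a scalar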

@asmeurer (Member) commented Feb 7, 2024:

I also completely agree that scalars are terrible for a dozen different reasons, and wish numpy just had 0-D arrays. But I guess this was too ambitious for NumPy 2.0. As it stands, scalars exist. Nathan makes a good point that object scalars are already kind of specially broken because they aren't even numpy types, and this new dtype replaces object str arrays for various use cases (to be fair, it also replaces np.str_ arrays for many use cases). But at least if you are working with object arrays, you probably (or at least hopefully) are doing it on purpose and can be careful about this. For every other dtype, the scalar is at least array-like in most respects. And this is documented at https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_:

The object type is also special because an array containing object_ items does not return an object_ object on item access, but instead returns the actual object that the array item refers to.

Making stringdtype scalars subclass from both str and generic seems like the best option from a usability point of view, although if actually making it a real str subclass is not good for performance, just making it duck-type and defining __instancecheck__ is probably good enough (see the sketch below).
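A minimal sketch of that duck-typing route (the names here are hypothetical, not NumPy's actual implementation):

class _DuckStrMeta(type):
    # Make isinstance checks against the scalar type also accept plain str,
    # without the scalar actually subclassing str.
    def __instancecheck__(cls, obj):
        return isinstance(obj, str) or super().__instancecheck__(obj)

class StringScalar(metaclass=_DuckStrMeta):
    pass  # str-compatible methods would be implemented via the stringdtype ufuncs

print(isinstance("hello", StringScalar))  # True
print(isinstance(3, StringScalar))        # False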

@seberg (Member) commented Feb 7, 2024:

Even subclassing from generic gives you crazy things, like indexing becoming string indexing rather than array indexing! In other words: IMO, part of the problem with scalars is that they pretend to be array-likes, even though it is currently vital for much code that they do (because we create them too often).
Plus, the ABI is likely odd, so care has to be taken (numpy's existing strings pull it off, so it's not clear whether it is easy or not).
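The existing np.str_ scalar illustrates the tension: it inherits str behavior for indexing while also carrying array-scalar attributes:

import numpy as np

s = np.str_("hello")
print(s[0])     # h  -- string indexing, not array indexing
print(s.shape)  # () -- yet it also has array-scalar attributes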

I envisioned that we could create a new, fully abstract np.NumPyScalar class that you can register with. But I am not sure it matters here.

In short, I don't care about the hierarchy at all, but I agree there are two things:

  1. You cannot write np.array(scalar) or np.array(str_array[0]) without a dtype, because for backward-compatibility reasons we do not pick the new dtype for those (see the demonstration after this list).
  2. Just like for object dtype, the fact that we don't preserve 0-D arrays is more harmful for this DType than for numerical dtypes, where the scalars somewhat duck-type as arrays.
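For point 1, a quick demonstration of the backward-compatibility fallback:

import numpy as np
from numpy.dtypes import StringDType

scalar = np.array(["abc"], dtype=StringDType())[0]  # a plain Python str
print(np.array(scalar).dtype)  # <U3 -- falls back to fixed-width, not StringDType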

I suggest moving discussion about point 2 to gh-24897. I could imagine making my try-0d-preservation opt-in more aggressive for all but the current dtypes (minus object, maybe), even if we don't dare change it otherwise without 2.0.
And yes, maybe we should never convert to scalars and instead make that something users always have to do explicitly with .item() or so, but I am not sure that is realistic.

@MarkusSintonen commented May 23, 2024:

Support in structured arrays (perhaps not super useful, but could be seen as similar to object)

I'm excited about StringDType, but noticed that it doesn't work with structured arrays, which is a bummer. We use structured arrays to efficiently identify "good values" via a boolean mask that is carried together with the actual data. It looks something like this:
numpy.dtype([("values", some_type), ("is_valid", numpy.dtype("?"))])

But this doesn't work with some_type=StringDType() 😞 I see there is the StringDType(na_object=None) way to define something like this, but it won't work for efficiently finding the positions of good/bad values, compared to carrying a mask in a struct.
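For reference, the masking pattern works today with fixed-width fields; the gap is specifically StringDType (the dtypes here are illustrative):

import numpy as np

# Works with a fixed-width string field; fails if "U5" were StringDType()
dt = np.dtype([("values", "U5"), ("is_valid", "?")])
arr = np.array([("a", True), ("", False)], dtype=dt)
print(arr["values"][arr["is_valid"]])  # ['a'] -- only the valid entries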

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType, but it seems to be much slower than the "O" or "U" dtypes in certain cases. Are you aware of these issues?

import random
import timeit

import numpy
from numpy.dtypes import StringDType

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype=StringDType(), count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))     # 0.078

@ngoldbaum (Member) commented May 23, 2024:

Supporting structured dtypes is possible in principle but we punted on it for the initial version because it would add a bunch of complexity. The structured dtype code inside numpy (let alone in external libraries) makes strong assumptions about dtype instances being interchangeable, which aren't valid for stringdtype.

One path forward would be adding a mode to stringdtype that always heap-allocates. This isn't great for performance or cache locality, but does mean that the packed strings always store heap pointers and the dtype instance does not store any sidecar data. I'm not sure how hard that would be - the structured dtype code is quite complicated and I'm not sure how easy it is to determine whether a stringdtype value lives inside a structured dtype vs a stand-alone stringdtype instance.

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Thanks for the report about unique being slow; I opened #26510 to track that. Please also open follow-up issues if you have other performance reports; it's very much appreciated, especially with a script to reproduce what you're seeing.

@MarkusSintonen commented May 23, 2024:

Thanks, and thanks for your great work!

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Variable-width strings are the thing we are interested in :) Essentially, improving performance for columnar data processing that may contain free-form strings. There is the "O" dtype for this, but it has some issues; e.g., the above shows how numpy.unique (for str "O" arrays) is quite slow compared to Python's own set. Potentially the more native StringDType could help there.

@seberg (Member) commented May 23, 2024:

FWIW, I think we should allow putting stringdtype into structs.

I think the main issue is that stringdtype requires a new dtype instance to be created in some places, and that property must be "inherited" by the structured dtype.
That might be as easy as implementing the hook, but it may also need a new dtype flag and auditing some other paths (I am not sure; things tend to turn out harder than you think).

But I am not sure anyone will prioritize working on this soon; if anyone is interested, we will definitely help them get started.

unique in NumPy doesn't use hashes (for the time being, at least), but a custom sort would presumably make it comparable to the old strings (I don't know whether faster or slower). But the comparison to Python has to be done with care: Python strings cache their hash, and here you have identical objects that are known to be equal without even looking at the string contents!
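A small demonstration of that caveat: random.choices reuses the same four str objects, so set() can short-circuit on identity and reuse cached hashes:

import random

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
# Every element is one of the four original objects, so equality checks can
# short-circuit on identity and each hash is computed at most once.
print(all(any(s is o for o in options) for s in lst))  # True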
