
TSK: Follow-up things for stringdtype #25693

Open · 5 of 13 tasks
mhvk opened this issue Jan 25, 2024 · 13 comments

@mhvk (Contributor) commented Jan 25, 2024:

Proposed new feature or change:

A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.

  1. The docs for StringDType describe a size argument that was needed during development but not for the actual release. It can be removed; this should happen before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses the arena? See below. It can probably be put back if needed...)
  2. The add ufunc needs a promoter so that addition with a Python string works (see the sketch after this list).
  3. Add a cython interface for the stringdtype C API.
  4. It is likely better to use a flag for "long strings" (stored outside the numpy array proper) instead of one for short strings (stored inside), so that an all-zero entry correctly implies a zero-length string (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  5. Refactor the flags in the vstring implementation to use bitfields. This will improve clarity and eliminate complicated bitflag math.
  6. Possibly, the arena should have more recoverable flags/size information (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  7. Investigate refactoring new_stringdtype_instance into tp_init
  8. Replace the long #define in casts.c with templating (or .c.src)
  9. Replace ufunc wrappers with passing functions into *auxdata (see here, here, here, and here) [e.g., minimum and maximum; the various comparison functions; the templated unary functions; find, rfind and maybe count; startswith and endswith; lstrip, rstrip and strip, plus whitespace versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796
  10. Check array2string formatter overrides.
  11. Adjust error messages referring to "object array" (e.g., a.view('2u8') currently errors with "Cannot change data-type for object array.").
  12. Have some helper functions that make it easy for StringDType ufuncs to use out arguments, also for in-place operations.
  13. See whether null-handling code in ufunc loops and casts can be consolidated into a helper function to reduce code duplication.
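For item 2, a minimal sketch of the target behavior (whether np.add already accepts a Python str operand here depends on your NumPy version; this only illustrates what the promoter should enable, not a claim about the current state):

import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["numpy", "string"], dtype=StringDType())
# With the promoter in place, mixing with a Python str should work:
print(np.add(arr, "!"))  # ['numpy!' 'string!']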

Things possibly for longer term

  • Support in structured arrays (perhaps not super useful, but could be seen as similar to object).
  • Expose more of the currently private NpyString API. This will depend on feedback from users.
  • Fix longdouble to string, which is listed as broken in a TODO item in casts.c. Isn't Dragon4 able to deal with it?
  • Add a DType API callback that triggers when the initial filling of a newly created array finishes (e.g., after PyArray_FromAny fills a new array). We could use this to trim the arena buffer to the exact size the array needs. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25 (see the sketch after this list).
  • Might it make sense, on 64-bit systems where the size is normally 16 bytes, to have an 8-byte version (short strings up to 7 bytes, arena allocations only for long ones; this might use the size argument...)?
  • In principle, .view(StringDType()) could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
  • Dealing with array scalars vs str scalars - see also more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897.
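To make the over-allocation point concrete, a small sketch of exponential growth with factor 1.25 (the 256-byte starting size is an assumption for illustration, not the actual implementation's value):

# Grow an arena-style buffer by 1.25x until it can hold 1000 bytes
size = 256
while size < 1000:
    size = int(size * 1.25)
print(size)  # 1220 -> 220 bytes of slack that a post-fill trim hook could reclaim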

Small warts, possibly not solvable

  • StringDType is added via add_dtype_helper late in the initialization of multiarraymodule; can this be made easier?
  • Can the cases of having and not having the GIL be factored out, so that one doesn't get the kind of hybrid code in load_non_nullable_string with its has_gil argument?
  • Having dtype.hasobject be true is logical, but the name is not quite right.
@ngoldbaum (Member):

Added a few things I had on my personal list. Thanks for opening this and for all the review this week!

@ngoldbaum ngoldbaum changed the title Follow-up things for stringdtype TSK: Follow-up things for stringdtype Jan 31, 2024
@ngoldbaum (Member):

@asmeurer pointed out to me that we should add a proper scalar type. Right now stringdtype's scalar is str, which doesn't inherit from np.generic, so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

We can fix this by defining a proper scalar type that wraps a single packed string entry and exposes an API that is duck-compatible with str, making use of the ufunc implementations we're adding.

This may also lead to performance improvements, since scalar indexing won't need to copy to Python's internal string representation.

I don't think this needs to happen for 2.0 since this is something we can improve later and I doubt it will have major user-facing impacts, since object string arrays also effectively have a str scalar.
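The status quo described above is easy to demonstrate (output as of NumPy 2.0):

import numpy as np
from numpy.dtypes import StringDType

arr = np.array(["hello"], dtype=StringDType())
print(type(arr[0]))                    # <class 'str'>
print(isinstance(arr[0], np.generic))  # False: the oddball in the type hierarchy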

@seberg (Member) commented Feb 6, 2024:

so stringdtype ends up as an oddball in the API from the perspective of the type hierarchy of the scalar types.

I personally don't see this as a big issue. Although it might be friendly to not convert to a scalar as often as we currently do when it is a string scalar (the timeline is bad for that too, though).

@mhvk (Contributor, author) commented Feb 6, 2024:

I think the main problem may be that people generally expect that they can treat array[something] as array-ish, with a .shape, etc. In that sense we probably do need a scalar type (or, perhaps preferably, should just not drop to a scalar in the first place...).

If we're going to be "not quite right" for 2.0, should we perhaps err on the side of not creating a str object?

@ngoldbaum (Member):

Are you saying that we should create a 0D array instead?

@ngoldbaum (Member):

If so, I think it's much more important for the scalar to be duck-compatible with str than with ndarray, especially if the goal is to replace object string arrays in downstream packages.

@mhvk (Contributor, author) commented Feb 6, 2024:

Yes, I was; I really dislike array scalars and wish everything were 0-D arrays instead. But you make a good point that we want to make sure that object arrays can easily be replaced...

EDIT: because I dislike how the type that comes out of __getitem__ or .sum() depends on the arguments. Why should axis=None give me a different instance than axis=-1? And it is just a hassle if one subclasses ndarray...
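A quick demonstration of the inconsistency described in the edit above:

import numpy as np

a = np.ones((2, 3))
print(type(a.sum(axis=-1)))    # <class 'numpy.ndarray'>
print(type(a.sum(axis=None)))  # <class 'numpy.float64'> -- dropped to a scalar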

@asmeurer (Member) commented Feb 7, 2024:

I also completely agree that scalars are terrible for a dozen different reasons, and wish numpy just had 0-D arrays. But I guess this was too ambitious for NumPy 2.0. As it stands, scalars exist. Nathan makes a good point that object scalars are already kind of specially broken because they aren't even numpy types, and this new dtype replaces object str arrays for various use cases (to be fair, it also replaces np.str_ arrays for many use cases). But at least if you are working with object arrays, you probably (or at least hopefully) are doing it on purpose and can be careful about this. For every other dtype, the scalar is at least array-like in most respects. And this is documented at https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_:

The object type is also special because an array containing object_ items does not return an object_ object on item access, but instead returns the actual object that the array item refers to.

Making stringdtype scalars subclass from both str and generic seems like the best option from a usability point of view, although if actually making it a real str subclass is not good for performance, just making it duck-type and defining __instancecheck__ is probably good enough (see the sketch below).
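A minimal sketch of that duck-typing route (the names here are hypothetical, not NumPy's actual implementation):

class _DuckStrMeta(type):
    # Make isinstance checks against the scalar type also accept plain str,
    # without the scalar actually subclassing str.
    def __instancecheck__(cls, obj):
        return isinstance(obj, str) or super().__instancecheck__(obj)

class StringScalar(metaclass=_DuckStrMeta):
    pass  # str-compatible methods would be implemented via the stringdtype ufuncs

print(isinstance("hello", StringScalar))  # True
print(isinstance(3, StringScalar))        # False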

@seberg (Member) commented Feb 7, 2024:

Even subclassing from generic gives you crazy things, like indexing becoming string indexing rather than array indexing! In other words: IMO, part of the problem with scalars is that they pretend to be array-likes, even though it is currently vital for much code that they do (because we create them too often).
Plus, the ABI is likely odd, so care has to be taken (numpy's existing strings pull it off, so it's not clear whether it is easy or not).
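The existing np.str_ scalar illustrates the tension: it inherits str behavior for indexing while also carrying array-scalar attributes:

import numpy as np

s = np.str_("hello")
print(s[0])     # h  -- string indexing, not array indexing
print(s.shape)  # () -- yet it also has array-scalar attributes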

I envisioned that we could create a new, fully abstract np.NumPyScalar class that you can register with. But I am not sure it matters here.

In short, I don't care about the hierarchy at all, but I agree there are two things:

  1. You cannot write np.array(scalar) or np.array(str_array[0]) without a dtype, because for backward-compatibility reasons we do not pick the new dtype for those (see the demonstration after this list).
  2. Just like for object dtype, the fact that we don't preserve 0-D arrays is more harmful for this DType than for numerical dtypes, where the scalars somewhat duck-type as arrays.
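For point 1, a quick demonstration of the backward-compatibility fallback:

import numpy as np
from numpy.dtypes import StringDType

scalar = np.array(["abc"], dtype=StringDType())[0]  # a plain Python str
print(np.array(scalar).dtype)  # <U3 -- falls back to fixed-width, not StringDType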

I suggest moving discussion about point 2 to gh-24897. I could imagine making my try-0d-preservation opt-in more aggressive for all but the current dtypes (minus object, maybe), even if we don't dare change it otherwise without 2.0.
And yes, maybe we should never convert to scalars and instead make that something users always have to do explicitly with .item() or so, but I am not sure that is realistic.

@MarkusSintonen commented May 23, 2024:

Support in structured arrays (perhaps not super useful, but could be seen as similar to object)

I'm excited about StringDType, but noticed that it doesn't work with structured arrays, which is a bummer. We use structured arrays to efficiently identify "good values" via a boolean mask that is carried together with the actual data. It looks something like this:
numpy.dtype([("values", some_type), ("is_valid", numpy.dtype("?"))])

But this doesn't work with some_type=StringDType() 😞 I see there is the StringDType(na_object=None) way to define something like this, but it won't work for efficiently finding the positions of good/bad values, compared to carrying a mask in a struct.
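For reference, the masking pattern works today with fixed-width fields; the gap is specifically StringDType (the dtypes here are illustrative):

import numpy as np

# Works with a fixed-width string field; fails if "U5" were StringDType()
dt = np.dtype([("values", "U5"), ("is_valid", "?")])
arr = np.array([("a", True), ("", False)], dtype=dt)
print(arr["values"][arr["is_valid"]])  # ['a'] -- only the valid entries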

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType, but it seems to be much slower than the "O" or "U" dtypes in certain cases. Are you aware of these issues?

import random
import timeit

import numpy
from numpy.dtypes import StringDType

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype=StringDType(), count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))     # 0.078

@ngoldbaum (Member) commented May 23, 2024:

Supporting structured dtypes is possible in principle but we punted on it for the initial version because it would add a bunch of complexity. The structured dtype code inside numpy (let alone in external libraries) makes strong assumptions about dtype instances being interchangeable, which aren't valid for stringdtype.

One path forward would be adding a mode to stringdtype that always heap-allocates. This isn't great for performance or cache locality, but does mean that the packed strings always store heap pointers and the dtype instance does not store any sidecar data. I'm not sure how hard that would be - the structured dtype code is quite complicated and I'm not sure how easy it is to determine whether a stringdtype value lives inside a structured dtype vs a stand-alone stringdtype instance.

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Thanks for the report about unique being slow; I opened #26510 to track that. Please also open follow-up issues if you have other performance reports; it's very much appreciated, especially with a script to reproduce what you're seeing.

@MarkusSintonen commented May 23, 2024:

Thanks, and thanks for your great work!

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Variable-width strings are the thing we are interested in :) Essentially, improving performance for columnar data processing that may contain free-form strings. There is the "O" dtype for this, but it has some issues; e.g., the above shows how numpy.unique (for str "O" arrays) is quite slow compared to Python's own set. Potentially the more native StringDType could help there.

@seberg (Member) commented May 23, 2024:

FWIW, I think we should allow putting stringdtype into structs.

I think the main issue is that stringdtype requires a new dtype instance to be created in some places, and that property must be "inherited" by the structured dtype.
That might be as easy as implementing the hook, but it may also need a new dtype flag and auditing some other paths (I am not sure; things tend to turn out harder than you think).

But I am not sure anyone will prioritize working on this soon; if anyone is interested, we will definitely help them get started.

unique in NumPy doesn't use hashes (for the time being, at least), but a custom sort would presumably make it comparable to the old strings (I don't know whether faster or slower). But the comparison to Python has to be done with care: Python strings cache their hash, and here you have identical objects that are known to be equal without even looking at the string contents!
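A small demonstration of that caveat: random.choices reuses the same four str objects, so set() can short-circuit on identity and reuse cached hashes:

import random

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
# Every element is one of the four original objects, so equality checks can
# short-circuit on identity and each hash is computed at most once.
print(all(any(s is o for o in options) for s in lst))  # True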
