TSK: Follow-up things for stringdtype #25693
Comments
Added a few things I had on my personal list. Thanks for opening this and for all the review this week!
@asmeurer pointed out to me that we should add a proper scalar type. Right now stringdtype's scalar is We can fix this by defining a proper scalar type that wraps a single packed string entry and exposes an API that is duck-compatible with This may also lead to performance improvements, since scalar indexing won't need to copy to Python's internal string representation.

I don't think this needs to happen for 2.0, since this is something we can improve later, and I doubt it will have major user-facing impacts, since object string arrays also effectively have a
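To make the idea concrete, here is a minimal, purely illustrative sketch of such a scalar: it stays duck-compatible with Python strings by subclassing `str` while remembering which dtype it came from. The `StringScalar` name, its `dtype` attribute, and the `"vstring"` placeholder are hypothetical, not NumPy API.

```python
# Purely illustrative sketch: a scalar type that is duck-compatible with str.
# "StringScalar" and its "dtype" attribute are hypothetical, not NumPy API.
class StringScalar(str):
    """Wrap a single string value while remembering where it came from."""

    def __new__(cls, value, dtype=None):
        obj = super().__new__(cls, value)
        obj._dtype = dtype  # e.g. the StringDType instance that produced it
        return obj

    @property
    def dtype(self):
        return self._dtype


s = StringScalar("hello", dtype="vstring")  # "vstring" stands in for a real dtype
assert isinstance(s, str)    # duck-compatible: passes isinstance checks for str
assert s.upper() == "HELLO"  # inherits every str method unchanged
```

Because the subclass adds only an extra attribute, any code that accepts `str` keeps working; the real implementation would wrap the packed string entry directly instead of copying.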
I personally don't see this as a big issue. Although it might be friendly to not convert to scalar as often as we currently do if it is a string scalar (bad timeline too, though).
I think the main problem may be that people generally expect that they can treat

If we're going to be "not quite right" for 2.0, should we perhaps err on the side of not creating a
Are you saying that we should create a 0-D array instead?
If so, I think it's much more important for the scalar to be duck-compatible with
Yes, I was. I really dislike array scalars and wish everything was 0-D arrays instead. But you make a good point that we want to make sure that object arrays can easily be replaced...

EDIT: because I dislike how the type that comes out of
I also completely agree that scalars are terrible for a dozen different reasons, and wish numpy just had 0-d arrays. But I guess this was too ambitious for NumPy 2.0. As it stands, scalars exist. Nathan makes a good point that object scalars are already kind of specially broken because they aren't even numpy types, and this new dtype replaces object
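For reference, the point that object scalars "aren't even numpy types" is easy to see directly (a small check, assuming NumPy is installed): indexing an object array hands back the stored Python object itself, while other dtypes return NumPy scalar types.

```python
import numpy as np

o = np.array(["x", "y"], dtype=object)
print(type(o[0]))  # the stored Python object comes back as-is (plain str)

f = np.array([1.0, 2.0])
print(type(f[0]))  # a proper NumPy scalar type (numpy.float64)
```

So code written against object string arrays already sees plain `str` scalars, which is why replacing them with stringdtype (whose scalar is also string-like) is comparatively low-risk.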
Making stringdtype scalars subclass from both
Even subclassing from generic gives you crazy things: for example, indexing must be string indexing and not array indexing! In other words: IMO, part of the problem with scalars is that they pretend to be array-likes, even though it is currently vital for much code that they do (because we create them too often). I envisioned we could create a new

In short, I don't care about the hierarchy at all, but I agree there are two things:
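The indexing quirk mentioned above can be demonstrated directly (assuming NumPy is installed): a `np.str_` scalar indexes like a string, while a 0-d array refuses integer indexing entirely.

```python
import numpy as np

s = np.str_("hello")
print(s[0])  # "h" — indexing a string scalar is *string* indexing

a = np.array("hello")  # the 0-d array alternative
print(a.ndim)          # 0
try:
    a[0]               # 0-d arrays reject integer indexing outright
except IndexError as exc:
    print("IndexError:", exc)
```

This is the tension in a nutshell: the scalar behaves like a `str`, the 0-d array behaves like an array, and code that expects one often breaks on the other.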
I suggest moving discussion about point 2 to gh-24897. I could imagine making my try-0d-preservation opt-in more aggressively for all but current dtypes (minus object maybe), even if we don't dare change it otherwise without 2.0.
I'm excited about

But this wouldn't now work with

I also did some basic performance tests:

```python
import random
import timeit

import numpy
from numpy.dtypes import StringDType

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype=StringDType(), count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))     # 0.078
```
Supporting structured dtypes is possible in principle, but we punted on it for the initial version because it would add a bunch of complexity. The structured dtype code inside numpy (let alone in external libraries) makes strong assumptions about dtype instances being interchangeable, which aren't valid for stringdtype.

One path forward would be adding a mode to stringdtype that always heap-allocates. This isn't great for performance or cache locality, but it does mean that the packed strings always store heap pointers and the dtype instance does not store any sidecar data. I'm not sure how hard that would be - the structured dtype code is quite complicated, and I'm not sure how easy it is to determine whether a stringdtype value lives inside a structured dtype vs a stand-alone stringdtype instance.

Another thing I have in mind is to eventually add a fixed-width string mode to stringdtype, which we could also straightforwardly support in structured dtypes, although if you need variable-width strings that doesn't help much.

Thanks for the report about unique being slow; I opened #26510 to track that. Please also open follow-up issues with reports about performance if you have others - it's very appreciated, especially with a script to reproduce what you're seeing.
Thanks, and thanks for your great work!

Variable-width strings are the thing we are interested in :) Essentially, improving performance for columnar data processing that may have free-form strings. There is the
FWIW, I think we should allow putting stringdtype into structs. I think the main issue is that stringdtype requires a new dtype instance to be created in some places, and that property must be "inherited" by the structured dtype. But I am not sure if anyone will prioritize working on this soon; if anyone is interested, I think we will definitely help out to get started.
Proposed new feature or change:

A number of points arose in the discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.

- `StringDType` has a size argument that was needed in development but not for actual release. It can be removed. This should be done before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses the arena? See below. We can probably put it back if needed...)
- The `add` ufunc needs a promoter so addition with a Python string works.
- `new_stringdtype_instance` into `tp_init`.
- `#define` in `casts.c` with templating (or `.c.src`).
- `*auxdata` (see here, here, here, and here) [e.g., `minimum` and `maximum`; the various comparison functions; the templated unary functions; `find`, `rfind`, and maybe `count`; `startswith` and `endswith`; `lstrip`, `rstrip`, and `strip`, plus `whitespace` versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796.
- `array2string` formatter overrides.
- (`a.view('2u8')` currently errors with `"Cannot change data-type for object array."`).
- `StringDType` ufuncs to use `out` arguments, also for in-place operations.

Things possibly for longer term:

- (`object`).
- `NpyString` API. This will depend on feedback from users.
- `longdouble` to string, which is listed as broken in a `TODO` item in `casts.c`. Isn't `dragon` able to deal with it?
- (`PyArray_FromAny` finishes filling a new array). We could use this to trim the arena buffer to the exact size needed by the array. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25. (the `size` argument...)
- `.view(StringDType())` could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
- `str` scalars - see also the more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897.

Small warts, possibly not solvable:

- `StringDType` is added to `add_dtype_helper` late in the initialization of `multiarraymodule`; can this be easier?
- `load_non_nullable_string` with its `has_gil` argument.
- Having `dtype.hasobject` be true is logical, but it is not quite the right name.