
NEP: add NEP 55 for a variable width string dtype #24483

Merged
merged 3 commits into numpy:main on Aug 29, 2023

Conversation

ngoldbaum
Member

This is the content of NEP 55. Per NEP 0 I will post to the mailing list and update the PR to include links to the discussions as soon as this is submitted. Please let me know if there's anything else I missed in the NEP process.

@WarrenWeckesser
Member

WarrenWeckesser commented Aug 21, 2023

This looks like a great addition to NumPy!

I haven't read the full NEP in depth (I started skimming near the end); so far I have just a couple of technical questions that I think deserve clarification in the NEP to avoid confusion:

  • The struct npy_static_string has a len field, and a pointer to a null-terminated string. Why the redundancy of null-termination and storing the length? And how does the string type handle embedded zeros if the string is null-terminated? E.g. s = "ABC\0DEF\0\0" is a valid Python string with length 9. How would this string look as an element of an array of StringDType?

  • When one creates a StringDType with a na_object, how are the instances of the na_object stored in the array? All you have in the array is the npy_static_string struct. What convention is used to indicate that it is actually the na_object? Edit (partially answering my own question): I reread this:

    We do allow NULL npy_static_string entries in the array buffer, representing either an empty string or a missing data sentinel, depending on the parameters of the StringDType instance associated with the array, ...

    I interpret this to mean: if na_object is specified, then buf == NULL means the array element is the missing data sentinel. If na_object was not specified, then buf == NULL implies the array element is an empty string. (Presumably the code should enforce len == 0 in that case.) Is that correct?

@mhvk
Contributor

mhvk commented Aug 22, 2023

Really nice and very clear NEP! My only question was the one already asked by @WarrenWeckesser: why a null-terminated string when there is a len?

@ngoldbaum
Member Author

ngoldbaum commented Aug 22, 2023

Thanks for the quick feedback!

Regarding storing the NA object, you're right: it's represented as a NULL in the array buffer. If the dtype instance doesn't have an NA sentinel, a NULL represents the empty string. Using NULL has the nice property that np.empty just needs to be a call to calloc.
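
As a minimal sketch of why this works (the struct layout and names below are illustrative assumptions, not the proposed API):

#include <stddef.h>
#include <stdlib.h>

/* Illustrative layout only: the field names follow the NEP discussion,
   but the exact integer type of len is not final. */
typedef struct {
    size_t len;   /* length in bytes */
    char *buf;    /* NULL encodes "empty string" (or NA, see below) */
} npy_static_string;

int main(void)
{
    size_t n = 1000;
    /* calloc zero-fills the allocation, so every element is already the
       all-zero struct {0, NULL}: an empty string, or the missing-data
       sentinel if the dtype instance was created with an na_object.
       No extra initialization pass is needed. */
    npy_static_string *arr = calloc(n, sizeof(npy_static_string));
    if (arr == NULL) {
        return 1;
    }
    free(arr);
    return 0;
}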

Regarding storing null-terminated strings, that choice is to ease working with C APIs that expect NULL-terminated strings and also to more clearly separate array elements in the sidecar buffer. That said, the point about Python supporting strings with embedded nulls is a good one and probably negates any benefit of storing null-terminated strings. For what it's worth, the stringdtype prototype does support python strings with embedded nulls and doesn't depend on the null-termination character:

In [1]: import numpy as np

In [2]: from stringdtype import StringDType

In [3]: arr = np.array([r"hello\0world"], dtype=StringDType())

In [4]: print(arr)
['hello\\0world']

In [5]: print(arr[0])
hello\0world

There's no particularly important reason why the strings have to be null-terminated. If the consensus is not to do that, I can add a C API function that takes an npy_static_string and returns a null-terminated char* buffer (which must be manually de-allocated) for cases where a null-terminated buffer is needed or an external library expects a C string.
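
As a rough sketch of what such a helper could look like (the function name, signature, and struct layout are assumptions for illustration, not part of the proposed API):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same illustrative npy_static_string layout as in the sketch above. */
typedef struct {
    size_t len;
    char *buf;
} npy_static_string;

/* Hypothetical helper: copy the packed string into a freshly allocated,
   null-terminated buffer for C APIs that expect a C string. The caller
   must free() the result. Note that a consumer that stops at the first
   '\0' will only see the part of the string before any embedded null. */
char *
npy_static_string_to_cstr(const npy_static_string *s)
{
    size_t len = (s->buf == NULL) ? 0 : s->len;
    char *out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }
    if (len > 0) {
        memcpy(out, s->buf, len);
    }
    out[len] = '\0';
    return out;
}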

I'm going to wait a week or so before updating the text, so I can collect more comments on the current draft and then clarify the points that have been raised so far. At that point, it would be nice to get this quickly merged so that we can continue discussion on the mailing list, per NEP 0.

@WarrenWeckesser
Member

WarrenWeckesser commented Aug 22, 2023

Your example needs a small tweak to make your point. Because of the r prefix, r"hello\0world" includes the literal character '\' followed by the character '0' (i.e. ASCII code 48); it does not have an embedded null character. But fixing that does demonstrate your point 👍, e.g.

In [13]: from stringdtype import StringDType

In [14]: data = ['hello\0world', 'ABC\0DEF\0\0']

In [15]: arr = np.array(data, dtype=StringDType())

In [16]: print(arr)
['hello\x00world' 'ABC\x00DEF\x00\x00']

That also shows that trailing null characters are accepted, which is another improvement over the existing fixed-length strings. Note in the following that, because of the null-padding convention used by the fixed-length strings, the trailing null bytes of the input 'ABC\0DEF\0\0' are lost:

In [17]: b = np.array(data)

In [18]: print(b)
['hello\x00world' 'ABC\x00DEF']

@ngoldbaum
Member Author

But fixing that does demonstrate your point 👍

Argh, sorry, thanks for double-checking me and confirming.

@mhvk
Contributor

mhvk commented Aug 23, 2023

FWIW, my sense would be not to include the extra nulls.

More generally, in the NEP you mention the option to (eventually) store short strings in the structure itself. This seems nice! It might be good to explain how you would ensure that this can happen. E.g., do you envisage that, say, the highest bit of len becomes a flag? Might it be good to reserve the highest byte for flags?

@ngoldbaum
Member Author

ngoldbaum commented Aug 23, 2023

E.g., do you envisage that, say, the highest bit of len becomes a flag? Might it be good to reserve the highest byte for flags?

Good point.

Most references I can find describing the small string optimization are working with dynamic strings that have size and capacity fields, so if the most significant bit of size is 1 and capacity is 0, then it's a small string (since size would be greater than capacity). We don't store a capacity, so we would need to use a bit in the size. We could do bitmasking, or make the size a signed integer and have negative sizes indicate a small string. That probably argues more strongly in favor of using a 64-bit length, since someone is much less likely to notice an arbitrary limit of 2^63 bytes per element than one of 2^31. Also, using the small string optimization means the extra overhead for long strings can be balanced by storing small strings directly in the array.
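
For instance, a minimal sketch of the flag-bit idea might look like the following; this is purely illustrative, and the NEP does not commit to this exact encoding:

#include <stdbool.h>
#include <stdint.h>

/* Reserve the most significant bit of a 64-bit size field as a
   "stored inline" flag. This caps string lengths at 2**63 - 1 bytes, a
   limit nobody is likely to notice, and leaves room to carve out more
   flag bits later if needed. */
#define NPY_STRING_INLINE_FLAG (UINT64_C(1) << 63)

static inline bool
is_inline_string(uint64_t size_field)
{
    return (size_field & NPY_STRING_INLINE_FLAG) != 0;
}

static inline uint64_t
string_length(uint64_t size_field)
{
    /* mask off the flag bit to recover the actual length */
    return size_field & ~NPY_STRING_INLINE_FLAG;
}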

I'll make sure to update the relevant text with this as well.

@ngoldbaum
Member Author

ngoldbaum commented Aug 28, 2023

If there aren’t any more comments it would be nice to get this merged as a draft. Having a rendered version on the website makes it a lot easier to read.

@mattip
Member

mattip commented Aug 29, 2023

NEP 0 states

At the earliest convenience, the PR should be merged (regardless of whether it is accepted during discussion). Additional PRs may be made by the Author to update or expand the NEP, or by maintainers to set its status, discussion URL, etc.

so in it goes

Thanks @ngoldbaum

mattip merged commit 5ba26f0 into numpy:main on Aug 29, 2023
52 checks passed
Comment on lines +454 to +459
By default, zeroed out entries in the array buffer represent empty
strings. However, if the DType instance was created with an ``na_object`` field,
zeroed-out entries represent missing data. By making this choice, a zero-filled
newly allocated buffer returned by ``calloc`` does not need any additional
post-processing to produce an empty array. This choice also means casts between
different missing data representations are views.
Member

To clarify, does this mean that NumPy strings will make no distinction between empty strings '' and missing data?

I don't know if there are many use cases for this, but this is a potential point of incompatibility with strings in pandas (via object arrays), which do make this distinction.

Member Author

Thanks for looking the proposal over!

I can add some additional text to clarify, but the short answer is that you can have empty strings and NAs in the same array:

In [1]: import numpy as np

In [2]: from stringdtype import StringDType

In [3]: dt = StringDType(na_object=np.nan)

In [4]: arr = np.array(["", np.nan, "string with embedded\0\0 and trailing nulls \0\0\0"], dtype=dt)

In [5]: arr
Out[5]: 
array(['', nan,
       'string with embedded\x00\x00 and trailing nulls \x00\x00\x00'],
      dtype=StringDType(na_object=nan))

Under the hood, this is handled by representing empty strings like:

npy_static_string empty_string = {0, "\0"};

and "null" strings like:

npy_static_string null_string = {0, NULL};

If no NA object is set on the dtype instance, these two have the same meaning - empty string - but if an NA object is set, they have a different in-memory representation so we can tell them apart and insert empty strings or NAs when someone accesses those values from Python.
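
In other words, the disambiguation boils down to a check along these lines (a sketch with hypothetical names; the actual logic lives in the dtype implementation):

#include <stdbool.h>
#include <stddef.h>

/* Same illustrative layout as the earlier sketches. */
typedef struct {
    size_t len;
    char *buf;
} npy_static_string;

/* buf == NULL is the missing-data sentinel only when the dtype instance
   defines an na_object; otherwise {0, NULL} is just another spelling of
   the empty string {0, "\0"}. */
static bool
entry_is_missing(const npy_static_string *s, bool dtype_has_na_object)
{
    return dtype_has_na_object && s->buf == NULL;
}

static bool
entry_is_empty(const npy_static_string *s, bool dtype_has_na_object)
{
    return s->len == 0 && !entry_is_missing(s, dtype_has_na_object);
}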

Member

OK great, good to know :)

Comment on lines +115 to +119
As a result of this discussion, improving support for string data was added to
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.
Member

WarrenWeckesser commented Sep 2, 2023


@ngoldbaum, I'm reading some of the references to have better context for a review of the NEP. I don't see any comment in the roadmap about strings supporting missing data, and I didn't see it in the 2017 mega-thread. Was missing data for strings specifically discussed somewhere else? Or are you referring to the items about MaskedArrays in the Maintenance section of the roadmap, where a bullet point is "dtypes that support missing values"? I guess the implicit suggestion is to use missing values instead of masked arrays?

Member Author

ngoldbaum commented Sep 2, 2023


I'm not sure offhand if missing data support is mentioned in the discussions you're looking at. I think it was brought up, but you're right it didn't make it into the roadmap and I was probably a bit too strong with the wording here.

The main motivation for adding it wasn't previous discussions so much as writing a functioning version of Pandas that supports the new dtype. It may be possible to have a version of Pandas that handles missing data without explicit support in the dtype, but I don't think it's possible to do so without substantial performance implications.

If you'd like, I can update it and add text explicitly mentioning that.

Member

The text in the NEP seems to suggest that adding the capability for missing data was discussed in the April 2017 discussion, but I didn't see any mention of missing data in there.

To be clear, I'm definitely in favor of adding some form of missing data or "not a string" capability. But for the purpose of reviewing the NEP, I'm interested in what has already been discussed about this feature, to ensure that any specific needs or use-cases that have already been mentioned are addressed.

Member Author

It did come up, e.g. here. I'd also argue that the requirement that it needs to replace usages of object arrays implicitly requires supporting missing data, since object arrays get that for free (albeit with no type checking).

Member

Ah, so this line from @rkern

  • Maybe add a dtype similar to object_ that only permits unicode/str
    (2.x/3.x) strings (and maybe None to represent missing data a la pandas).

Thanks for finding that. That counts as "discussion" I guess, but I thought I would find a bit more.

Member Author

There may be others; that was just a quick search on my phone...

Member

When I say something, it Has Been Discussed. 😉

More seriously, consider it a proxy for all of the design discussion that went into Pandas' choice to use object arrays for strings, and then into Arrow's choice to support nulls in StringArrays. They have a real use case, and improving their situation should (I think) be a core motivation of this NEP.

NaN-like Sentinels
++++++++++++++++++

A NaN-like sentinel returns itself as the result of comparison operations. This
Member

@ngoldbaum, for the next time you edit the NEP: I think you meant "arithmetic operations" here, not "comparison operations". Comparing a floating point nan with something else returns a plain boolean (False, except for !=), not the nan itself.
