
NEP: add NEP 55 for a variable width string dtype #24483

Merged
merged 3 commits into numpy:main on Aug 29, 2023

Conversation

ngoldbaum
Member

This is the content of NEP 55. Per NEP 0 I will post to the mailing list and update the PR to include links to the discussions as soon as this is submitted. Please let me know if there's anything else I missed in the NEP process.

@WarrenWeckesser
Member

WarrenWeckesser commented Aug 21, 2023

This looks like a great addition to NumPy!

I haven't read the full NEP in depth (I started skimming near the end); so far I have just a couple of technical questions that I think deserve clarification in the NEP to avoid confusion:

  • The struct npy_static_string has a len field, and a pointer to a null-terminated string. Why the redundancy of null-termination and storing the length? And how does the string type handle embedded zeros if the string is null-terminated? E.g. s = "ABC\0DEF\0\0" is a valid Python string with length 9. How would this string look as an element of an array of StringDType?

  • When one creates a StringDType with a na_object, how are the instances of the na_object stored in the array? All you have in the array is the npy_static_string struct. What convention is used to indicate that it is actually the na_object? Edit (partially answering my own question): I reread this:

    We do allow NULL npy_static_string entries in the array buffer, representing either an empty string or a missing data sentinel, depending on the parameters of the StringDType instance associated with the array, ...

    I interpret this to mean: if na_object is specified, then buf == NULL means the array element is the missing data sentinel. If na_object was not specified, then buf == NULL implies the array element is an empty string. (Presumably the code should enforce len == 0 in that case.) Is that correct?

@mhvk
Contributor

mhvk commented Aug 22, 2023

Really nice and very clear NEP! My only question was the one already asked by @WarrenWeckesser: why a null-terminated string when there is a len?

@ngoldbaum
Member Author

ngoldbaum commented Aug 22, 2023

Thanks for the quick feedback!

Regarding storing the NA object, you're right: it's represented as a NULL in the array buffer. If the dtype instance doesn't have an NA sentinel, a NULL represents the empty string. Using NULL has the nice property that np.empty just needs to be a call to calloc.
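
As a minimal sketch of why this works (the struct layout and names below are illustrative assumptions, not the proposed API):

#include <stddef.h>
#include <stdlib.h>

/* Illustrative layout only: the field names follow the NEP discussion,
   but the exact integer type of len is not final. */
typedef struct {
    size_t len;   /* length in bytes */
    char *buf;    /* NULL encodes "empty string" (or NA, see below) */
} npy_static_string;

int main(void)
{
    size_t n = 1000;
    /* calloc zero-fills the allocation, so every element is already the
       all-zero struct {0, NULL}: an empty string, or the missing-data
       sentinel if the dtype instance was created with an na_object.
       No extra initialization pass is needed. */
    npy_static_string *arr = calloc(n, sizeof(npy_static_string));
    if (arr == NULL) {
        return 1;
    }
    free(arr);
    return 0;
}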

Regarding storing null-terminated strings, that choice is to ease working with C APIs that expect NULL-terminated strings and also to more clearly separate array elements in the sidecar buffer. That said, the point about Python supporting strings with embedded nulls is a good one and probably negates any benefit of storing null-terminated strings. For what it's worth, the stringdtype prototype does support python strings with embedded nulls and doesn't depend on the null-termination character:

In [1]: import numpy as np

In [2]: from stringdtype import StringDType

In [3]: arr = np.array([r"hello\0world"], dtype=StringDType())

In [4]: print(arr)
['hello\\0world']

In [5]: print(arr[0])
hello\0world

There's no particularly important reason why the strings have to be null-terminated. If the consensus is not to do that, I can add a C API function that takes an npy_static_string and returns a null-terminated char* buffer (which must be manually de-allocated) for cases where a null-terminated buffer is needed or an external library expects a C string.
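
As a rough sketch of what such a helper could look like (the function name, signature, and struct layout are assumptions for illustration, not part of the proposed API):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same illustrative npy_static_string layout as in the sketch above. */
typedef struct {
    size_t len;
    char *buf;
} npy_static_string;

/* Hypothetical helper: copy the packed string into a freshly allocated,
   null-terminated buffer for C APIs that expect a C string. The caller
   must free() the result. Note that a consumer that stops at the first
   '\0' will only see the part of the string before any embedded null. */
char *
npy_static_string_to_cstr(const npy_static_string *s)
{
    size_t len = (s->buf == NULL) ? 0 : s->len;
    char *out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }
    if (len > 0) {
        memcpy(out, s->buf, len);
    }
    out[len] = '\0';
    return out;
}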

I'm going to wait a week or so before updating the text, so I can collect more comments on the current draft and then clarify the points that have been raised so far. At that point, it would be nice to get this quickly merged so that we can continue discussion on the mailing list, per NEP 0.

@WarrenWeckesser
Member

WarrenWeckesser commented Aug 22, 2023

Your example needs a small tweak to make your point. Because of the r prefix, r"hello\0world" includes the literal character '\' followed by the character '0' (i.e. ASCII code 48); it does not have an embedded null character. But fixing that does demonstrate your point 👍, e.g.

In [13]: from stringdtype import StringDType

In [14]: data = ['hello\0world', 'ABC\0DEF\0\0']

In [15]: arr = np.array(data, dtype=StringDType())

In [16]: print(arr)
['hello\x00world' 'ABC\x00DEF\x00\x00']

That also shows that trailing null characters are accepted, which is another improvement over the existing fixed-length strings. Note in the following that, because of the null-padding convention used by the fixed-length strings, the trailing null bytes of the input 'ABC\0DEF\0\0' are lost:

In [17]: b = np.array(data)

In [18]: print(b)
['hello\x00world' 'ABC\x00DEF']

@ngoldbaum
Member Author

But fixing that does demonstrate your point 👍

Argh, sorry, thanks for double-checking me and confirming.

@mhvk
Contributor

mhvk commented Aug 23, 2023

FWIW, my sense would be not to include the extra nulls.

More generally, in the NEP you mention the option to (eventually) store short strings in the structure itself. This seems nice! It might be good to explain how you would ensure that this can happen. E.g., do you envisage that, say, the highest bit of len becomes a flag? Might it be good to reserve the highest byte for flags?

@ngoldbaum
Member Author

ngoldbaum commented Aug 23, 2023

E.g., do you envisage that, say, the highest bit of len becomes a flag? Might it be good to reserve the highest byte for flags?

Good point.

Most references I can find describing the small string optimization are working with dynamic strings that have size and capacity fields, so if the most significant bit of size is 1 and capacity is 0, then it's a small string (since size would be greater than capacity). We don't store a capacity, so we would need to use a bit in the size. We could do bitmasking, or make the size a signed integer and have negative sizes indicate a small string. That probably argues more strongly in favor of using a 64-bit length, since someone is much less likely to notice an arbitrary limit of 2^63 bytes per element than one of 2^31. Also, using the small string optimization means the extra overhead for long strings can be balanced by storing small strings directly in the array.
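
For instance, a minimal sketch of the flag-bit idea might look like the following; this is purely illustrative, and the NEP does not commit to this exact encoding:

#include <stdbool.h>
#include <stdint.h>

/* Reserve the most significant bit of a 64-bit size field as a
   "stored inline" flag. This caps string lengths at 2**63 - 1 bytes, a
   limit nobody is likely to notice, and leaves room to carve out more
   flag bits later if needed. */
#define NPY_STRING_INLINE_FLAG (UINT64_C(1) << 63)

static inline bool
is_inline_string(uint64_t size_field)
{
    return (size_field & NPY_STRING_INLINE_FLAG) != 0;
}

static inline uint64_t
string_length(uint64_t size_field)
{
    /* mask off the flag bit to recover the actual length */
    return size_field & ~NPY_STRING_INLINE_FLAG;
}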

I'll make sure to update the relevant text with this as well.

@ngoldbaum
Member Author

ngoldbaum commented Aug 28, 2023

If there aren’t any more comments it would be nice to get this merged as a draft. Having a rendered version on the website makes it a lot easier to read.

@mattip
Member

mattip commented Aug 29, 2023

NEP 0 states

At the earliest convenience, the PR should be merged (regardless of whether it is accepted during discussion). Additional PRs may be made by the Author to update or expand the NEP, or by maintainers to set its status, discussion URL, etc.

so in it goes

Thanks @ngoldbaum

mattip merged commit 5ba26f0 into numpy:main on Aug 29, 2023
52 checks passed
Comment on lines +454 to +459
By default, zeroed out entries in the array buffer represent empty
strings. However, if the DType instance was created with an ``na_object`` field,
zeroed-out entries represent missing data. By making this choice, a zero-filled
newly allocated buffer returned by ``calloc`` does not need any additional
post-processing to produce an empty array. This choice also means casts between
different missing data representations are views.
Member

To clarify, does this mean that NumPy strings will make no distinction between empty strings '' and missing data?

I don't know if there are many use cases for this, but this is a potential point of incompatibility with strings in pandas (via object arrays), which do make this distinction.

Member Author

Thanks for looking the proposal over!

I can add some additional text to clarify, but the short answer is that you can have empty strings and NAs in the same array:

In [1]: import numpy as np

In [2]: from stringdtype import StringDType

In [3]: dt = StringDType(na_object=np.nan)

In [4]: arr = np.array(["", np.nan, "string with embedded\0\0 and trailing nulls \0\0\0"], dtype=dt)

In [5]: arr
Out[5]: 
array(['', nan,
       'string with embedded\x00\x00 and trailing nulls \x00\x00\x00'],
      dtype=StringDType(na_object=nan))

Under the hood, this is handled by representing empty strings like:

npy_static_string empty_string = {0, "\0"};

and "null" strings like:

npy_static_string null_string = {0, NULL};

If no NA object is set on the dtype instance, these two have the same meaning - empty string - but if an NA object is set, they have a different in-memory representation so we can tell them apart and insert empty strings or NAs when someone accesses those values from Python.
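
In other words, the disambiguation boils down to a check along these lines (a sketch with hypothetical names; the actual logic lives in the dtype implementation):

#include <stdbool.h>
#include <stddef.h>

/* Same illustrative layout as the earlier sketches. */
typedef struct {
    size_t len;
    char *buf;
} npy_static_string;

/* buf == NULL is the missing-data sentinel only when the dtype instance
   defines an na_object; otherwise {0, NULL} is just another spelling of
   the empty string {0, "\0"}. */
static bool
entry_is_missing(const npy_static_string *s, bool dtype_has_na_object)
{
    return dtype_has_na_object && s->buf == NULL;
}

static bool
entry_is_empty(const npy_static_string *s, bool dtype_has_na_object)
{
    return s->len == 0 && !entry_is_missing(s, dtype_has_na_object);
}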

Member

OK great, good to know :)

Comment on lines +115 to +119
As a result of this discussion, improving support for string data was added to
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.
Member

WarrenWeckesser commented Sep 2, 2023


@ngoldbaum, I'm reading some of the references to have better context for a review of the NEP. I don't see any comment in the roadmap about strings supporting missing data, and I didn't see it in the 2017 mega-thread. Was missing data for strings specifically discussed somewhere else? Or are you referring to the items about MaskedArrays in the Maintenance section of the roadmap, where a bullet point is "dtypes that support missing values"? I guess the implicit suggestion is to use missing values instead of masked arrays?

Member Author

ngoldbaum commented Sep 2, 2023


I'm not sure offhand if missing data support is mentioned in the discussions you're looking at. I think it was brought up, but you're right it didn't make it into the roadmap and I was probably a bit too strong with the wording here.

The main motivation for adding it wasn't previous discussions so much as writing a functioning version of Pandas that supports the new dtype. It may be possible to have a version of Pandas that handles missing data without explicit support in the dtype, but I don't think it's possible to do so without substantial performance implications.

If you'd like, I can update it and add text explicitly mentioning that.

Member

The text in the NEP seems to suggest that adding the capability for missing data was discussed in the April 2017 discussion, but I didn't see any mention of missing data in there.

To be clear, I'm definitely in favor of adding some form of missing data or "not a string" capability. But for the purpose of reviewing the NEP, I'm interested in what has already been discussed about this feature, to ensure that any specific needs or use-cases that have already been mentioned are addressed.

Member Author

It did come up, e.g. here. I'd also argue that the requirement that it needs to replace usages of object arrays implicitly requires supporting missing data, since object arrays get that for free (albeit with no type checking).

Member

Ah, so this line from @rkern

  • Maybe add a dtype similar to object_ that only permits unicode/str
    (2.x/3.x) strings (and maybe None to represent missing data a la pandas).

Thanks for finding that. That counts as "discussion" I guess, but I thought I would find a bit more.

Member Author

There may be others; that was just a quick search on my phone...

Member

When I say something, it Has Been Discussed. 😉

More seriously, consider it a proxy for all of the design discussion that went into Pandas' choice to use object arrays for strings, and then into Arrow's choice to support nulls in StringArrays. They have a real use case, and improving their situation should (I think) be a core motivation of this NEP.

NaN-like Sentinels
++++++++++++++++++

A NaN-like sentinel returns itself as the result of comparison operations. This
Member

@ngoldbaum, for the next time you edit the NEP: I think you meant "arithmetic operations" here, not "comparison operations". Comparing a floating point nan with something else returns a plain boolean (False, except for !=), not the nan itself.
