-
-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEP: add NEP 55 for a variable width string dtype #24483
Conversation
This looks like a great addition to NumPy! I haven't read the full NEP in depth (started skimming near the end); so far I have just a couple technical questions that I think deserve clarification in the NEP to avoid confusion:
|
Really nice and very clear NEP! My only question was the one already asked by @WarrenWeckesser, why a null-terminated string when there is a |
Thanks for the quick feedback! Regarding storing the NA object, you're right, it's represented as a NULL in the array buffer. If the dtype instance doesn't have an NA sentinel a NULL represents the empty string. Using NULL has the nice property that Regarding storing null-terminated strings, that choice is to ease working with C APIs that expect NULL-terminated strings and also to more clearly separate array elements in the sidecar buffer. That said, the point about Python supporting strings with embedded nulls is a good one and probably negates any benefit of storing null-terminated strings. For what it's worth, the
There's no particularly important reason why the strings have to be null-terminated. If the consensus is not to do that, I can add a C API function that takes an I'm going to wait a week or so to update the text so I can get more comments on the current draft before updating to clarify the points that have been raised so far. At that point, it would be nice to get this quickly merged so that we can continue discussion on the mailing list, per NEP 0. |
Your example needs a small tweak to make your point. Because of the
That also shows that trailing null characters are accepted, which is another improvement over the existing fixed-length strings. Note in the following that, because of the null-padding convention used by the fixed-length strings, the trailing null bytes of the input
|
Argh, sorry, thanks for double-checking me and confirming. |
FWIW, my sense would be not to include the extra nulls. More general, in the NEP, you mention the option to (eventually) store short strings in the structure itself. This seems nice! It might be good to tell how you ensure that this can happen. E.g., do you envisage that, say, the highest bit of |
Good point. Most references I can find describing the small string optimization are working with dynamic strings that have size and capacity fields, so if the most significant bit of size is 1 and capacity is 0, then it's a small string (since size would be greater than capacity). We don't store a capacity, so I guess we would need to use a bit in the size. We could do bitmasking or make the size a signed integer, and make negative sizes indicate a small string. That probably argues more strongly in favor of using a 64 bit length, since it's much less likely someone would notice the arbitrary limit of 2^63 bytes per element than 2^31. Also using the small string optimization means the extra overhead for long strings can be balanced by storing small strings directly in the array. I'll make sure to update the relevant text with this as well. |
If there aren’t any more comments it would be nice to get this merged as a draft. Having a rendered version on the website makes it a lot easier to read. |
NEP 0 states
so in it goes Thanks @ngoldbaum |
By default, zeroed out entries in the array buffer represent empty | ||
strings. However, if the DType instance was created with an ``na_object`` field, | ||
zeroed-out entries represent missing data. By making this choice, a zero-filled | ||
newly allocated buffer returned by ``calloc`` does not need any additional | ||
post-processing to produce an empty array. This choice also means casts between | ||
different missing data representations are views. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, does this mean that NumPy strings will make no distinction between empty strings ''
and missing data?
I don't know if there are many use cases for this, but this is a potential point of incompatibility with strings in pandas (via object arrays), which do make this distinction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking the proposal over!
I can add some additional text to clarify but short answer is you can have empty strings and NAs in the same array:
In [1]: import numpy as np
In [2]: from stringdtype import StringDType
In [3]: dt = StringDType(na_object=np.nan)
In [4]: arr = np.array(["", np.nan, "string with embedded\0\0 and trailing nulls \0\0\0"], dtype=dt)
In [5]: arr
Out[5]:
array(['', nan,
'string with embedded\x00\x00 and trailing nulls \x00\x00\x00'],
dtype=StringDType(na_object=nan))
Under the hood, this is handled by representing empty strings like:
npy_static_string empty_string = {0, "\0"};
and "null" strings like:
npy_static_string null_string = {0, NULL};
If no NA object is set on the dtype instance, these two have the same meaning - empty string - but if an NA object is set, they have a different in-memory representation so we can tell them apart and insert empty strings or NAs when someone accesses those values from Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK great, good to know :)
As a result of this discussion, improving support for string data was added to | ||
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better | ||
suited to memory-mapping bytes with any or no encoding, and a variable-width | ||
string DType that supports missing data to replace usages of object string | ||
arrays. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ngoldbaum, I'm reading some of the references to have better context for a review of the NEP. I don't see any comment in the roadmap about strings supporting missing data, and I didn't see it in the 2017 mega-thread. Was missing data for strings specifically discussed somewhere else? Or are you referring to the items about MaskedArray
s in the Maintenance section of the roadmap, where a bullet point is "dtypes that support missing values"? I guess the implicit suggestion is to use missing values instead of masked arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure offhand if missing data support is mentioned in the discussions you're looking at. I think it was brought up, but you're right it didn't make it into the roadmap and I was probably a bit too strong with the wording here.
The main morivation for adding it wasn't previous discussions so much as writing a functioning version of Pandas that supports the new dtype. It may be possible to have a version of Pandas that handles missing data without explicit support in the dtype, but I don't think it's possible to do without substantial performance implications.
If you'd like I can update it and text explicitly mentioning that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text in the NEP seems to suggest that adding the capability for missing data was discussed in the April 2017 discussion, but I didn't see any mention of missing data in there.
To be clear, I'm definitely in favor of adding some form of missing data or "not a string" capability. But for the purpose of reviewing the NEP, I'm interested in what has already been discussed about this feature, to ensure that any specific needs or use-cases that have already been mentioned are addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It did come up, e.g. here. I'd also argue that the requirement that it needs to replace usages of object arrays implicitly requires supporting missing data, since object arrays get that for free (albeit with no type checking).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so this line from @rkern
- Maybe add a dtype similar to
object_
that only permitsunicode/str
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
Thanks for finding that. That counts as "discussion" I guess, but I thought I would find a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be others, that was just a quick search by me on my phone...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say something, it Has Been Discussed. 😉
More seriously, consider it a proxy for all of the design discussion that went behind Pandas' choice to use object arrays for strings and then for Arrow to support nulls in StringArray
s. They have a real use case, and improving their situation should (I think) be a core motivation of this NEP.
NaN-like Sentinels | ||
++++++++++++++++++ | ||
|
||
A NaN-like sentinel returns itself as the result of comparison operations. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ngoldbaum, for the next time you edit the NEP: I think you meant "arithmetic operations" here, not "comparison operations". Comparing a floating point nan with something else always returns False, not itself.
This is the content of NEP 55. Per NEP 0 I will post to the mailing list and update the PR to include links to the discussions as soon as this is submitted. Please let me know if there's anything else I missed in the NEP process.