NEP: Update NEP 55 #25162

ngoldbaum · 2023-11-16T17:15:32Z

This substantially updates the text of NEP 55 to reflect the current state of the implementation. The old sections that outlined the plan for implementing an arena allocator and handling thread safety concerns have been replaced with a description of what ended up getting implemented.

I've updated the text elsewhere to reflect these changes as well as upstream changes in pandas - they moved away from object string arrays to PyArrow backed string arrays by default.

I also added three new diagrams that visually explain the memory layout. I created these using draw.io and to avoid issues with font rendering clicked the setting to export text as svg. This makes the svg kinda large although still less than a megabyte per file. Not sure if that's an issue. I could export pngs too.

I'll be pinging the mailing list separately along with a call to test out stringdtype and discuss how we should go about final approval of the NEP but I want to update the installation instructions first.

~~Ping @WarrenWeckesser since most of these updates are in response to your prior review. The offer to add you as a co-author on the NEP still stands!~~ Added Warren after a request over e-mail.

ngoldbaum · 2023-11-16T17:19:48Z

One note - I haven't made a PR for the cosmetic changes and bugfixes I made to stringdtype while updating the NEP text yet, so the e-mail to the mailing list will need to wait on that too.

mhvk

Very nice to see the short strings implemented! The comments are all small and mostly fine to ignore them. At some level, I'm not sure the arena allocator truly belongs in the NEP - it seems a bit of an implementation detail, where one takes over memory allocation for a bit of extra efficiency (and leads to questions like mine that are probably not so relevant for the NEP...). But for read-only data, a different scheme may be more efficient, and since you keep this opaque anyway, that could always be done later.

mhvk · 2023-11-17T03:40:52Z

doc/neps/nep-0055-string_dtype.rst

+
+   typedef struct _short_string_buffer {
+      char buf[sizeof(_npy_static_string_t) - 1];
+      unsigned char size_and_flags;


Might it make sense to define size and flags as separate bit fields?

char buf[sizeof(_npy_static_string_t) - 1]; unsigned char size : 4; unsigned char flags : 4;

Though looking at https://en.cppreference.com/w/cpp/language/bit_field, it seems the order of packing is implementation dependent... On the other hand, you do not seem to worry about saving this information, so maybe still an idea? It is nicely explicit.

Yeah, maybe... we haven't really used bit-fields much, but maybe that is just a remnant of a past where they didn't exist...

(In either case, I don't care much about it though, these are implementation details, that are OK to change.)

Might it make sense to define size and flags as separate bit fields?

Neat, I wasn't aware of this feature. I'm not sure it actually helps much besides making the code clearer in this case since we're packing the bits into a char anyway, so using a bitfield doesn't save us any RAM, it just makes it explicit what each of the bits are for.

you do not seem to worry about saving this information

In principle we could use them for serialization but right now we don't save the bitflags.

Yes, it would just be for clarity (which of course is important!)

For now I'd like to focus on other things, would you be ok with coming back to this later as a refactor? I definitely agree clarity is important, I just have a lot of stuff that I'd like to get done in time for NumPy 2.0!

mhvk · 2023-11-17T03:47:50Z

doc/neps/nep-0055-string_dtype.rst

+arena where the last arena allocation ended. The arena is filled using an
+exponentially expanding buffer, allowing amortized O(1) insertion.
+
+Each string entry in the arena is prepended by a size, stored either in a


Why is the size information duplicated?

mhvk · 2023-11-17T03:56:57Z

doc/neps/nep-0055-string_dtype.rst

+Empty Strings and Missing Data
++++++++++++++++++++++++++++++
+
+The empty string will be defined to be a zero-filled


In principle, a short string with size = 0 is also empty. Will it compare equal to the all-zero item here? Naively I would have expected no flags to indicate a short string to make all-zero work as empty.

There are size checks that short-circuit before creating a short string with size zero, so in practice that can't happen unless shenanigans are happening mutating the array buffer outside of the control of stringdtype.

We don't actually depend anywhere on empty string needing to always be a zero-filed packed buffer, so I loosened that requirement in the text here. Now it's just any string with size zero, including short strings. Missing strings are the same, except the have the missing string flag set.

mhvk · 2023-11-17T03:59:43Z

doc/neps/nep-0055-string_dtype.rst

+Arena strings store string data in a heap-allocated arena buffer that is managed
+by the ``StringDType`` instance attached to the array. In this example, the
+string contents are "Numpy is a very cool library", stored at offset ``0x94C``
+in the arena allocation. Note that the ``size`` is stored twice, once in the


Here, it is mentioned explicitly that the size is stored twice, but it would be good to give a reason! Is it to keep track of the full allocated space? I.e., if a string is shrunk, does this size stay the same?

This is stored to re-use the arena allocation later. We need to store the size to see if a future allocation for the string that owns space in the arena will fit. This only happens if the string is mutated to be the same size or smaller than the original contents.

We could simplify the design by not storing the sizes and not reusing space in the arena when we can and instead whenever a string is mutated to be larger than the original contents always do a heap allocation and never try to reuse the space in the arena after that. That might or might not save memory depending on the situation by not storing the sizes in the arena buffer.

Yeah, I figured it was something like this. It perhaps is a bit more detail than needed, since in the end there can be multiple allocators. Since you wrote the text and it explains things nicely, do keep it, but perhaps follow @seberg's suggestion for a bit of a reorganization.

Yeah I'll reorganize and try to make sure the points you got tripped up on clearer. Thanks for the read!

I added a paragraph explaining why we store sizes in the arena and pointing out that we could also make a simpler choice if we'd like at the cost of doing heap allocations more often when a string is mutated

seberg

I have not read the allocator explanations as carefully as Marten probably, but I am not worried about them.

I have two concrete suggestions (but if you prefer, happy to just merge this to iterate more easily!):

The missing value, ufunc support, etc. Feels more fitting as subsections to "Python API". It is probably also easier to understand the C part if we first mention these (can we just move those sections up?)
It would be nice to emphasize (very briefly!) the new C-API functions we want to add right now (it would be totally fine to even say that C-API is locked down so much that users cannot write ufuncs initially). I think this means:
- Mention how to aquire the lock
- Make the pack/unpack functions public (as proper functions)
- Omit the "opaque" element struct (so we can change the itemsize if we want to)
- ...

Maybe some of the Memory handling stuff could then just be moved to the "Implementation" section, and the short C-API part of the detailed description can just say to read that for details.

That would also help with assessing the NEP, I think. Because I am mainly worried about public API, not about implementation. If the API has a few functions to aquire a lock (and the allocator struct) from the descriptors then I suspect you could swap out the implementation completely (even without arena allocator, etc.)

seberg · 2023-11-17T09:10:33Z

doc/neps/nep-0055-string_dtype.rst

@@ -372,12 +377,12 @@ DType using the DType class.
 We will also extend the ``NPY_TYPES`` enum in the C API with an ``NPY_VSTRING``
 entry (there is already an ``NPY_STRING`` entry). This should not interfere with
 legacy user-defined DTypes since the integer type numbers for these data types
-begin at 256, so in principle there is still room for hundreds more builtin
+begin at 256. In principle there is still room for hundreds more builtin
 DTypes in the integer range available in the ``NPY_TYPES`` enum.


A bit random: Happy if we do also happy if not because it turns out to be a burden in practice. (We should make the actual DTypes more accessible anyway.)

seberg · 2023-11-17T09:24:28Z

doc/neps/nep-0055-string_dtype.rst

+   struct npy_unpacked_static_string {
+       size_t size;
+       const char *buf;
+   };


I think the plan is to make this part public, in which case, maybe we need to mention the way to load/store it:

int npy_unpack_string( PyArray_Descr/StringDescr *descr, void *ptr, npy_unpacked_static_string *out_unpacked);

as new public API Functions (forgot what we called it)? (or did I miss it)

Maybe it would actually be better to move this to a public C-API section, which can come first as an intro (so memory details are hidden).
I am also think that npy_packed_static_string should remain private! It is useful for us internally, but for now I think we should not be beholden to the size staying fixed!

In principle, some further details beyond what is "public" could then also be moved to the implementation.

In either case, I don't think that needs to be reordered, just thoughts.

seberg · 2023-11-17T09:28:00Z

doc/neps/nep-0055-string_dtype.rst

+
+   typedef struct _short_string_buffer {
+      char buf[sizeof(_npy_static_string_t) - 1];
+      unsigned char size_and_flags;


Yeah, maybe... we haven't really used bit-fields much, but maybe that is just a remnant of a past where they didn't exist...

(In either case, I don't care much about it though, these are implementation details, that are OK to change.)

ngoldbaum · 2023-11-17T16:40:02Z

but if you prefer, happy to just merge this to iterate more easily!

I think I can integrate your and Marten's suggestions today so let's not merge quite yet.

The missing value, ufunc support, etc. Feels more fitting as subsections to "Python API". It is probably also easier to understand the C part if we first mention these (can we just move those sections up?)

Makes sense, I'll move them up.

It would be nice to emphasize (very briefly!) the new C-API functions we want to add right now (it would be totally fine to even say that C-API is locked down so much that users cannot write ufuncs initially)

Sure, I'll add a "C API for StringDType" section and try to keep the public API limited to packing/unpacking from a char * pointer to the beginning of an array element and acquiring and releasing the lock. I see your point about not setting all this in stone and leaving it an implementation detail and I'll leave out making the rest of the string C API public.

ngoldbaum · 2023-11-20T18:53:25Z

See the last pushes which should integrate the review comments.

I also updated the stringdtype prototype in the numpy-user-dtypes repo and it should fully match the latest version of the text.

I rearranged the text, hopefully this flows in a more sensible way.

I'm now proposing only exposing an API to acquire and release the mutex (along with variants that know how to deal with multiple descriptors that may or may not be the same object) and pack and unpack strings. Users can then manipulate unpacked strings as they wish.

I also made the npy_packed_static_string struct opaque.

I added some discussion about why we're storing sizes in the arena allocation.

I ended up deciding not to use the bitfield syntax since I'd really like to get this done in time for NumPy 2.0 and have a lot of other stuff that needs to get done before that happens. We can definitely come back to that in the future as a refactor to improve clarity.

Let me know if you have any additional concerns before I ping the mailing list to ask for a final round of review.

seberg

Thanks, it is clearer now I think. What I still think would help (but I don't want to wait or point it too much), is to have a clearer overview. One thing I notice (but maybe that is just the hasty reader), is that e.g. NpyString_acquire_allocator shows up in the example, but I don't have a clear overview yet. A similar thing maybe, is that to me it would be easier to see the full list of flags somewhere early on (even if unexplained!).

In either case though, I would be OK with merging it soon for annoucements, we can always tweak a bit (and in the end, I think it is good, maybe Marten has more concrete feedback, though).

Maybe overstepping, but I could imagine the C-API section something like:

To access or write to an array typed as StringDType, we provide the main public sturcture representing the string data:

npy_static_string...

This is the API to work with the strings, but is not how they are stored (packed). To access or write to the data, users must first first lock it and fetch an allocator, from the arrays dtype (descriptor in C). This ensures thread-safety and allows for custom allocation strategies. only then can users access the data.
The reason for these choices are outlined in the following sections.

The initial public API consists of:

npy_string_allocator *NpyString_acquire_allocator(PyArrar_Descr *descr) to prepare writing to an array. With further API to deal with multiple arrays/descriptors at once.
NpyString_release_allocator(descr or allocator?) to relase it again.
NpyString_load(npy_string_allocator *allocator, npy_packed_string *data_ptr, npy_static_string *sdata) reads the array data from data_ptr into the static string sdata.
NpyString_pack(npy_string_allocator *allocator, npy_packed_string *data_ptr, char *utf8_string, size_t length) to pack a new utf8 string into the array.
NpyString_pack_null(npy_string_allocator *allocator, npy_packed_string *data_ptr) to set the array entry to the masked value.

Above ny_packed_string and npy_string_allocator or opaque structs for the time being. The following is a full example using all of these functions and outlining how to deal with masked arrays:
...

The reason we provide functions to lock multiple arrays at once is...

doc/neps/nep-0055-string_dtype.rst

ngoldbaum · 2023-11-20T22:10:24Z

Thanks for the comments! I'll do one more pass on this to address your comments since I want this to be as clear as possible before more eyes are on it.

ngoldbaum · 2023-11-21T22:04:54Z

Updated again with an overview of the C API following your suggestion. I also moved the C API and memory layout sections to the very end so you have context about the rest of the design before getting to that content.

ngoldbaum · 2023-11-22T16:08:57Z

Let's merge this so we can have an updated version on the website. Thanks so much for your review @seberg and @mhvk!

ngoldbaum added 4 commits November 10, 2023 07:40

DOC: respond to warren's comment from september

2715a00

DOC: update nep 55 text

9249feb

DOC: finish NEP 55 updates

34ff2c0

Merge branch 'main' into nep-55

69d9a2c

NEP: Add Warren as a NEP 55 author

d836130

mhvk reviewed Nov 17, 2023

View reviewed changes

seberg reviewed Nov 17, 2023

View reviewed changes

melissawm added the component: NEP label Nov 17, 2023

ngoldbaum added 2 commits November 20, 2023 11:43

NEP: reorganizing and responding to review comments

b7ca718

NEP: flesh out description of mutex acquire/release functions

e330b2f

seberg approved these changes Nov 20, 2023

View reviewed changes

Add an overview of the C API

e56895c

ngoldbaum merged commit b0bb396 into numpy:main Nov 22, 2023
60 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEP: Update NEP 55 #25162

NEP: Update NEP 55 #25162

ngoldbaum commented Nov 16, 2023 •

edited

Loading

ngoldbaum commented Nov 16, 2023

mhvk left a comment

mhvk Nov 17, 2023

seberg Nov 17, 2023

ngoldbaum Nov 17, 2023

mhvk Nov 17, 2023

ngoldbaum Nov 17, 2023

mhvk Nov 17, 2023

mhvk Nov 17, 2023

ngoldbaum Nov 17, 2023

ngoldbaum Nov 17, 2023

mhvk Nov 17, 2023

ngoldbaum Nov 17, 2023

mhvk Nov 17, 2023

ngoldbaum Nov 17, 2023

ngoldbaum Nov 20, 2023

seberg left a comment

seberg Nov 17, 2023

seberg Nov 17, 2023

seberg Nov 17, 2023

seberg Nov 17, 2023

ngoldbaum commented Nov 17, 2023

ngoldbaum commented Nov 20, 2023

seberg left a comment

ngoldbaum commented Nov 20, 2023

ngoldbaum commented Nov 21, 2023

ngoldbaum commented Nov 22, 2023

NEP: Update NEP 55 #25162

NEP: Update NEP 55 #25162

Conversation

ngoldbaum commented Nov 16, 2023 • edited Loading

ngoldbaum commented Nov 16, 2023

mhvk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum commented Nov 17, 2023

ngoldbaum commented Nov 20, 2023

seberg left a comment

Choose a reason for hiding this comment

ngoldbaum commented Nov 20, 2023

ngoldbaum commented Nov 21, 2023

ngoldbaum commented Nov 22, 2023

ngoldbaum commented Nov 16, 2023 •

edited

Loading