Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEP: Update NEP 55 #25162

Merged
merged 8 commits into from
Nov 22, 2023
Merged

NEP: Update NEP 55 #25162

merged 8 commits into from
Nov 22, 2023

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Nov 16, 2023

This substantially updates the text of NEP 55 to reflect the current state of the implementation. The old sections that outlined the plan for implementing an arena allocator and handling thread safety concerns have been replaced with a description of what ended up getting implemented.

I've updated the text elsewhere to reflect these changes as well as upstream changes in pandas - they moved away from object string arrays to PyArrow backed string arrays by default.

I also added three new diagrams that visually explain the memory layout. I created these using draw.io and to avoid issues with font rendering clicked the setting to export text as svg. This makes the svg kinda large although still less than a megabyte per file. Not sure if that's an issue. I could export pngs too.

I'll be pinging the mailing list separately along with a call to test out stringdtype and discuss how we should go about final approval of the NEP but I want to update the installation instructions first.

Ping @WarrenWeckesser since most of these updates are in response to your prior review. The offer to add you as a co-author on the NEP still stands! Added Warren after a request over e-mail.

@ngoldbaum
Copy link
Member Author

One note - I haven't made a PR for the cosmetic changes and bugfixes I made to stringdtype while updating the NEP text yet, so the e-mail to the mailing list will need to wait on that too.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice to see the short strings implemented! The comments are all small and mostly fine to ignore them. At some level, I'm not sure the arena allocator truly belongs in the NEP - it seems a bit of an implementation detail, where one takes over memory allocation for a bit of extra efficiency (and leads to questions like mine that are probably not so relevant for the NEP...). But for read-only data, a different scheme may be more efficient, and since you keep this opaque anyway, that could always be done later.


typedef struct _short_string_buffer {
char buf[sizeof(_npy_static_string_t) - 1];
unsigned char size_and_flags;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it make sense to define size and flags as separate bit fields?

char buf[sizeof(_npy_static_string_t) - 1];
unsigned char size : 4;
unsigned char flags : 4;

Though looking at https://en.cppreference.com/w/cpp/language/bit_field, it seems the order of packing is implementation dependent... On the other hand, you do not seem to worry about saving this information, so maybe still an idea? It is nicely explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe... we haven't really used bit-fields much, but maybe that is just a remnant of a past where they didn't exist...

(In either case, I don't care much about it though, these are implementation details, that are OK to change.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it make sense to define size and flags as separate bit fields?

Neat, I wasn't aware of this feature. I'm not sure it actually helps much besides making the code clearer in this case since we're packing the bits into a char anyway, so using a bitfield doesn't save us any RAM, it just makes it explicit what each of the bits are for.

you do not seem to worry about saving this information

In principle we could use them for serialization but right now we don't save the bitflags.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would just be for clarity (which of course is important!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I'd like to focus on other things, would you be ok with coming back to this later as a refactor? I definitely agree clarity is important, I just have a lot of stuff that I'd like to get done in time for NumPy 2.0!

arena where the last arena allocation ended. The arena is filled using an
exponentially expanding buffer, allowing amortized O(1) insertion.

Each string entry in the arena is prepended by a size, stored either in a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the size information duplicated?

Empty Strings and Missing Data
++++++++++++++++++++++++++++++

The empty string will be defined to be a zero-filled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, a short string with size = 0 is also empty. Will it compare equal to the all-zero item here? Naively I would have expected no flags to indicate a short string to make all-zero work as empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are size checks that short-circuit before creating a short string with size zero, so in practice that can't happen unless shenanigans are happening mutating the array buffer outside of the control of stringdtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually depend anywhere on empty string needing to always be a zero-filed packed buffer, so I loosened that requirement in the text here. Now it's just any string with size zero, including short strings. Missing strings are the same, except the have the missing string flag set.

Arena strings store string data in a heap-allocated arena buffer that is managed
by the ``StringDType`` instance attached to the array. In this example, the
string contents are "Numpy is a very cool library", stored at offset ``0x94C``
in the arena allocation. Note that the ``size`` is stored twice, once in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it is mentioned explicitly that the size is stored twice, but it would be good to give a reason! Is it to keep track of the full allocated space? I.e., if a string is shrunk, does this size stay the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is stored to re-use the arena allocation later. We need to store the size to see if a future allocation for the string that owns space in the arena will fit. This only happens if the string is mutated to be the same size or smaller than the original contents.

We could simplify the design by not storing the sizes and not reusing space in the arena when we can and instead whenever a string is mutated to be larger than the original contents always do a heap allocation and never try to reuse the space in the arena after that. That might or might not save memory depending on the situation by not storing the sizes in the arena buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figured it was something like this. It perhaps is a bit more detail than needed, since in the end there can be multiple allocators. Since you wrote the text and it explains things nicely, do keep it, but perhaps follow @seberg's suggestion for a bit of a reorganization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'll reorganize and try to make sure the points you got tripped up on clearer. Thanks for the read!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a paragraph explaining why we store sizes in the arena and pointing out that we could also make a simpler choice if we'd like at the cost of doing heap allocations more often when a string is mutated

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not read the allocator explanations as carefully as Marten probably, but I am not worried about them.

I have two concrete suggestions (but if you prefer, happy to just merge this to iterate more easily!):

  • The missing value, ufunc support, etc. Feels more fitting as subsections to "Python API". It is probably also easier to understand the C part if we first mention these (can we just move those sections up?)
  • It would be nice to emphasize (very briefly!) the new C-API functions we want to add right now (it would be totally fine to even say that C-API is locked down so much that users cannot write ufuncs initially). I think this means:
    • Mention how to aquire the lock
    • Make the pack/unpack functions public (as proper functions)
    • Omit the "opaque" element struct (so we can change the itemsize if we want to)
    • ...

Maybe some of the Memory handling stuff could then just be moved to the "Implementation" section, and the short C-API part of the detailed description can just say to read that for details.

That would also help with assessing the NEP, I think. Because I am mainly worried about public API, not about implementation. If the API has a few functions to aquire a lock (and the allocator struct) from the descriptors then I suspect you could swap out the implementation completely (even without arena allocator, etc.)

@@ -372,12 +377,12 @@ DType using the DType class.
We will also extend the ``NPY_TYPES`` enum in the C API with an ``NPY_VSTRING``
entry (there is already an ``NPY_STRING`` entry). This should not interfere with
legacy user-defined DTypes since the integer type numbers for these data types
begin at 256, so in principle there is still room for hundreds more builtin
begin at 256. In principle there is still room for hundreds more builtin
DTypes in the integer range available in the ``NPY_TYPES`` enum.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit random: Happy if we do also happy if not because it turns out to be a burden in practice. (We should make the actual DTypes more accessible anyway.)

struct npy_unpacked_static_string {
size_t size;
const char *buf;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the plan is to make this part public, in which case, maybe we need to mention the way to load/store it:

int
npy_unpack_string(
    PyArray_Descr/StringDescr *descr, void *ptr,
    npy_unpacked_static_string *out_unpacked);

as new public API Functions (forgot what we called it)? (or did I miss it)

Maybe it would actually be better to move this to a public C-API section, which can come first as an intro (so memory details are hidden).
I am also think that npy_packed_static_string should remain private! It is useful for us internally, but for now I think we should not be beholden to the size staying fixed!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, some further details beyond what is "public" could then also be moved to the implementation.

In either case, I don't think that needs to be reordered, just thoughts.


typedef struct _short_string_buffer {
char buf[sizeof(_npy_static_string_t) - 1];
unsigned char size_and_flags;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe... we haven't really used bit-fields much, but maybe that is just a remnant of a past where they didn't exist...

(In either case, I don't care much about it though, these are implementation details, that are OK to change.)

@ngoldbaum
Copy link
Member Author

but if you prefer, happy to just merge this to iterate more easily!

I think I can integrate your and Marten's suggestions today so let's not merge quite yet.

The missing value, ufunc support, etc. Feels more fitting as subsections to "Python API". It is probably also easier to understand the C part if we first mention these (can we just move those sections up?)

Makes sense, I'll move them up.

It would be nice to emphasize (very briefly!) the new C-API functions we want to add right now (it would be totally fine to even say that C-API is locked down so much that users cannot write ufuncs initially)

Sure, I'll add a "C API for StringDType" section and try to keep the public API limited to packing/unpacking from a char * pointer to the beginning of an array element and acquiring and releasing the lock. I see your point about not setting all this in stone and leaving it an implementation detail and I'll leave out making the rest of the string C API public.

@ngoldbaum
Copy link
Member Author

See the last pushes which should integrate the review comments.

I also updated the stringdtype prototype in the numpy-user-dtypes repo and it should fully match the latest version of the text.

I rearranged the text, hopefully this flows in a more sensible way.

I'm now proposing only exposing an API to acquire and release the mutex (along with variants that know how to deal with multiple descriptors that may or may not be the same object) and pack and unpack strings. Users can then manipulate unpacked strings as they wish.

I also made the npy_packed_static_string struct opaque.

I added some discussion about why we're storing sizes in the arena allocation.

I ended up deciding not to use the bitfield syntax since I'd really like to get this done in time for NumPy 2.0 and have a lot of other stuff that needs to get done before that happens. We can definitely come back to that in the future as a refactor to improve clarity.

Let me know if you have any additional concerns before I ping the mailing list to ask for a final round of review.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it is clearer now I think. What I still think would help (but I don't want to wait or point it too much), is to have a clearer overview. One thing I notice (but maybe that is just the hasty reader), is that e.g. NpyString_acquire_allocator shows up in the example, but I don't have a clear overview yet. A similar thing maybe, is that to me it would be easier to see the full list of flags somewhere early on (even if unexplained!).

In either case though, I would be OK with merging it soon for annoucements, we can always tweak a bit (and in the end, I think it is good, maybe Marten has more concrete feedback, though).

Maybe overstepping, but I could imagine the C-API section something like:

To access or write to an array typed as StringDType, we provide the main public sturcture representing the string data:

npy_static_string...

This is the API to work with the strings, but is not how they are stored (packed). To access or write to the data, users must first first lock it and fetch an allocator, from the arrays dtype (descriptor in C). This ensures thread-safety and allows for custom allocation strategies. only then can users access the data.
The reason for these choices are outlined in the following sections.

The initial public API consists of:

  • npy_string_allocator *NpyString_acquire_allocator(PyArrar_Descr *descr) to prepare writing to an array. With further API to deal with multiple arrays/descriptors at once.
  • NpyString_release_allocator(descr or allocator?) to relase it again.
  • NpyString_load(npy_string_allocator *allocator, npy_packed_string *data_ptr, npy_static_string *sdata) reads the array data from data_ptr into the static string sdata.
  • NpyString_pack(npy_string_allocator *allocator, npy_packed_string *data_ptr, char *utf8_string, size_t length) to pack a new utf8 string into the array.
  • NpyString_pack_null(npy_string_allocator *allocator, npy_packed_string *data_ptr) to set the array entry to the masked value.

Above ny_packed_string and npy_string_allocator or opaque structs for the time being. The following is a full example using all of these functions and outlining how to deal with masked arrays:
...

The reason we provide functions to lock multiple arrays at once is...

doc/neps/nep-0055-string_dtype.rst Show resolved Hide resolved
doc/neps/nep-0055-string_dtype.rst Outdated Show resolved Hide resolved
doc/neps/nep-0055-string_dtype.rst Outdated Show resolved Hide resolved
doc/neps/nep-0055-string_dtype.rst Show resolved Hide resolved
doc/neps/nep-0055-string_dtype.rst Show resolved Hide resolved
@ngoldbaum
Copy link
Member Author

Thanks for the comments! I'll do one more pass on this to address your comments since I want this to be as clear as possible before more eyes are on it.

@ngoldbaum
Copy link
Member Author

Updated again with an overview of the C API following your suggestion. I also moved the C API and memory layout sections to the very end so you have context about the rest of the design before getting to that content.

@ngoldbaum
Copy link
Member Author

Let's merge this so we can have an updated version on the website. Thanks so much for your review @seberg and @mhvk!

@ngoldbaum ngoldbaum merged commit b0bb396 into numpy:main Nov 22, 2023
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants