Better naming of public symbols and small string optimization #86

ngoldbaum · 2023-09-15T20:25:08Z

This refactors things to better match the naming scheme in the NEP.

It also adds small string optimization following @WarrenWeckesser's prototype.

I first refactored everything outside the implementation of the string struct to not use direct field access, then I redefined the old struct to be a union following Warren's design and refactored the string implementation.

The only behavior change relative to the current NEP draft is that np.empty(string_array) always returns an array filled with empty strings, no matter what NA object is set for the dtype. This is both for expediency (I'd need to add a new empty initialization hook in the numpy dtype API to keep the old behavior) and because on second thought it makes more sense.

I changed the naming a bit relative to Warren's proof-of-concept to make things make more sense to me but other than that there should only be minimal changes, if any. The changes to support small string optimization should be contained to static_string.c if you want to only look at that.

@seberg if you have a chance it would be great to have you look at this too.

seberg

This looks good to me and a general improvement. I still have some worries around thread-safety, etc. but they are unrelated.

In theory, one could even spin this further and allow users to "bloat" the storage size if they know that they have a lot of strings with size 8 and want to fit them.

I do wonder slightly if it may be useful to keep the old code mostly intact and use a pattern of:

unpacked_string str;
npy_load_str(*pointer, &str);

Where unpacked_string is defined as:

struct {
   size_t length;
   char *buf;
}

(maybe with the NA info also). Not sure that is actually simpler though. The union typing seems a bit complicated, but since the main struct is effectively opaque, it seems fine.

One thing I do wonder: Could we make the public API be more minimal (i.e. actual functions that unpack/pack to a representation we are certain about; although the potential need for locking is another issue)?
I am slightly worried that we may not be ready to ABI freeze this. I suppose the question is how important user defined ufuncs are here (and how important it is to promise that we won't just break them in a minor but not bug-fix release).

Did you already ensure things work with big endian systems? It looks a bit like that may need tweaks/additional bit shifts.

seberg · 2023-09-18T13:45:48Z

stringdtype/stringdtype/src/static_string.h

+
+typedef struct _npy_static_string_t {
+    char *buf;
+    size_t size;


We will limit 32bit string sizes a bit, but I guess that is fine.

I don't think special-casing the struct layout on 32 bit systems is worth adding any complexity here. I'm planning to add better errors for when someone tries to construct a string array that hits these limits.

seberg · 2023-09-18T13:47:22Z

stringdtype/stringdtype/src/static_string.h

+    _npy_static_string_u base;
+} npy_static_string;
+
+// room for two more flags with values 0x20 and 0x10


Technically, we do have more room if it is not a short string, I guess. (just thinking loudly)

seberg · 2023-09-18T14:05:14Z

stringdtype/stringdtype/src/dtype.c

+            }
+            else if (res == -2) {
+                // this should never happen
+                assert(0);


It seems like this can happen if users provide nonsense for the na_object?!

Maybe a helper would make sense, if we have the two different return values? Or pass int nogil and set the error already in npy_string_newsize?
(since more than MemoryError is possible here, unless we check the size for validity first.)

It seems like this can happen if users provide nonsense for the na_object?!

It would have to be a nonsense string that passes PyUnicode_Check()

You're right that the -2 return value here is a an API wart. I added it as a way for me to catch programming errors. Maybe I should replace that return case with a debug abort inside the static string library and make it so this only fails if you run out of memory?

Yeah, true. The user would have to try very hard... In principle tha twould mean we have to hceck the size up-front?
TBH, it is also just a minor wart if we raise a MemoryError rather than a size error in practice.

So OK either way, I though passing nogil and setting the error inside might be a way, but not sure it is much better.

I'd like to fix this in another PR but will get to it.

ngoldbaum · 2023-09-18T18:20:52Z

Thank you for the careful look over the code, I really appreciate you taking the time.

The union typing seems a bit complicated, but since the main struct is effectively opaque

I tried to make it an opaque struct but then opaque structs also don't have a size known at compile time, so that can't work. This is on the edge of my C knowledge - is it possible to have an opaque sized struct?

Maybe I could make the "public" struct something like:

typedef struct opaque_npy_static_string {
    char[sizeof(private_union_type)]
}

I'd cast to the union inside the static_string library.

I like your idea for the public API to return an unpacked struct with a size and buf field. How about something like:

unpacked_string str;
int is_null = npy_load_str(*pointer, &str);

I like the idea of combining the null checking and unpacking, since you definitely need to do null checking before accessing a struct field. This API lets you use the return value of the unpacking operation to structure control flow for that case.

I am slightly worried that we may not be ready to ABI freeze this. I suppose the question is how important user defined ufuncs are here (and how important it is to promise that we won't just break them in a minor but not bug-fix release).

Yeah, I think this API definitely needs more people than me working with it before we declare it stable. Do you see a problem with marking the C API and binary layout as experimental when we merge it in NumPy?

Did you already ensure things work with big endian systems? It looks a bit like that may need tweaks/additional bit shifts.

I tested a slightly modified version of Warren's prototype on a bigendian powerpc debian QEMU image and that worked. I haven't yet tested my code on that image, I was planning to do so this week.

seberg · 2023-09-19T06:58:27Z

like your idea for the public API to return an unpacked struct with a size and buf field. How about something like:

Looks good if you like it. The only thing I am still slightly worried is that we may end up needing some form of locking additionally.

About the opacity of the struct, I am happy with it, was just briefly surprised. For public API we might want to think about doing the version you gave there (for public API even the struct size may be unimportant, actually).

Do you see a problem with marking the C API and binary layout as experimental when we merge it in NumPy?

No, I don't see it as a problem at all. Might be nice to add a warning in a header somewhere where we define the structs.
The one problem would be about pickling/unpickling (or npy), but I suspect we will use objects for now anyway.
(Sure, someon will likely complain, but I think that will be a sign of success ;).)

QEMU image and that worked. I haven't yet tested my code on that image, I was planning to do so this week.

Yeah, no worries, was just curious. TBH, this are effectively minor tweaks that NumPy CI would flush out either way!

ngoldbaum · 2023-09-19T18:20:53Z

In the NEP the locking is handled by the arena allocator. In the next iteration I'm going to make all data access that requires touching the heap be done by functions that take both an arena allocator struct and a heap pointer. The allocator figures out if the pointer is in a contiguous read-only storage (e.g. an array that hasn't yet been mutated) and if so allows read access to the data and if not, guards access to the heap data with a fine-grained read/write lock. Writes only happen after all reads are complete.

Does that all sounds reasonable for the locking? The way this will shake out in the C API is that people will need to pass a reference to an allocator which will be a member of the dtype struct to an API function along with an ss struct to get access to the string data. Hopefully I'll have something more concrete for you to look at soon!

seberg · 2023-09-20T07:16:27Z

Right, it probably is reasonable for locking (I am not 100% sure, but starting to mix up two different issues here).

ngoldbaum · 2023-09-21T17:42:11Z

I've updated this to use a packing/unpacking API that I really like. Now the details of the endian-dependant union implementation are hidden in static_string.c and we only publicly expose two structs and two constants:

typedef struct npy_packed_static_string {
    char packed_buffer[sizeof(char *) + sizeof(size_t)];
} npy_packed_static_string;

typedef struct npy_static_string {
    size_t size;
    const char *buf;
} npy_static_string;

// represents the empty string and can be passed safely to npy_static_string                                                                            
// API functions                                                                                                                                        
extern const npy_packed_static_string *NPY_EMPTY_STRING;
// represents a sentinel value, *CANNOT* be passed safely to npy_static_string                                                                          
// API functions, use npy_string_isnull to check if a value is null before                                                                              
// working with it.                                                                                                                                     
extern const npy_packed_static_string *NPY_NULL_STRING;

I found a nice pattern where you unpack a pointer to a npy_packed_static_string into a stack-allocated npy_static_string, and then work with the stack-allocated data.

I think this more clearly delineates data access and data storage. Data are stored inside the packed struct, which must be freed if we want to overwrite it, but are viewed with a stack-allocated unpacked struct, which should never be freed and are just zero-copy views onto the packed string data or heap-allocated string data.

ngoldbaum added 6 commits August 31, 2023 10:46

refactor to avoid get_loop use in string to unicode cast

1bb3bdf

replace ss with npy_static_string

73ae9f9

rename len to size in npy_static_string struct

7ea409b

access string internals via functions

74e97fd

namespace static string sentinel macros

56ff099

implement small string optimization

97f3da5

seberg reviewed Sep 18, 2023

View reviewed changes

ngoldbaum added 3 commits September 21, 2023 11:20

refactor to use a packing/unpacking API

f86e5ce

fix clearing errors

245f0e3

move macros for short string implementation out of the header

7972b7b

ngoldbaum merged commit 7c5bcbb into numpy:main Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Better naming of public symbols and small string optimization #86

Better naming of public symbols and small string optimization #86

Uh oh!

ngoldbaum commented Sep 15, 2023

Uh oh!

seberg left a comment

Uh oh!

seberg Sep 18, 2023

Uh oh!

ngoldbaum Sep 18, 2023

Uh oh!

seberg Sep 18, 2023

Uh oh!

seberg Sep 18, 2023

Uh oh!

seberg Sep 18, 2023

Uh oh!

ngoldbaum Sep 18, 2023

Uh oh!

seberg Sep 19, 2023

Uh oh!

ngoldbaum Sep 21, 2023

Uh oh!

ngoldbaum commented Sep 18, 2023

Uh oh!

seberg commented Sep 19, 2023

Uh oh!

ngoldbaum commented Sep 19, 2023

Uh oh!

seberg commented Sep 20, 2023

Uh oh!

ngoldbaum commented Sep 21, 2023 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Better naming of public symbols and small string optimization #86

Better naming of public symbols and small string optimization #86

Uh oh!

Conversation

ngoldbaum commented Sep 15, 2023

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ngoldbaum commented Sep 18, 2023

Uh oh!

seberg commented Sep 19, 2023

Uh oh!

ngoldbaum commented Sep 19, 2023

Uh oh!

seberg commented Sep 20, 2023

Uh oh!

ngoldbaum commented Sep 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ngoldbaum commented Sep 21, 2023 •

edited

Loading