Skip to content

Better naming of public symbols and small string optimization #86

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 22, 2023

Conversation

ngoldbaum
Copy link
Member

This refactors things to better match the naming scheme in the NEP.

It also adds small string optimization following @WarrenWeckesser's prototype.

I first refactored everything outside the implementation of the string struct to not use direct field access, then I redefined the old struct to be a union following Warren's design and refactored the string implementation.

The only behavior change relative to the current NEP draft is that np.empty(string_array) always returns an array filled with empty strings, no matter what NA object is set for the dtype. This is both for expediency (I'd need to add a new empty initialization hook in the numpy dtype API to keep the old behavior) and because on second thought it makes more sense.

I changed the naming a bit relative to Warren's proof-of-concept to make things make more sense to me but other than that there should only be minimal changes, if any. The changes to support small string optimization should be contained to static_string.c if you want to only look at that.

@seberg if you have a chance it would be great to have you look at this too.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and a general improvement. I still have some worries around thread-safety, etc. but they are unrelated.

In theory, one could even spin this further and allow users to "bloat" the storage size if they know that they have a lot of strings with size 8 and want to fit them.

I do wonder slightly if it may be useful to keep the old code mostly intact and use a pattern of:

unpacked_string str;
npy_load_str(*pointer, &str);

Where unpacked_string is defined as:

struct {
   size_t length;
   char *buf;
}

(maybe with the NA info also). Not sure that is actually simpler though. The union typing seems a bit complicated, but since the main struct is effectively opaque, it seems fine.

One thing I do wonder: Could we make the public API be more minimal (i.e. actual functions that unpack/pack to a representation we are certain about; although the potential need for locking is another issue)?
I am slightly worried that we may not be ready to ABI freeze this. I suppose the question is how important user defined ufuncs are here (and how important it is to promise that we won't just break them in a minor but not bug-fix release).

Did you already ensure things work with big endian systems? It looks a bit like that may need tweaks/additional bit shifts.


typedef struct _npy_static_string_t {
char *buf;
size_t size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will limit 32bit string sizes a bit, but I guess that is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think special-casing the struct layout on 32 bit systems is worth adding any complexity here. I'm planning to add better errors for when someone tries to construct a string array that hits these limits.

_npy_static_string_u base;
} npy_static_string;

// room for two more flags with values 0x20 and 0x10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we do have more room if it is not a short string, I guess. (just thinking loudly)

}
else if (res == -2) {
// this should never happen
assert(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this can happen if users provide nonsense for the na_object?!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a helper would make sense, if we have the two different return values? Or pass int nogil and set the error already in npy_string_newsize?
(since more than MemoryError is possible here, unless we check the size for validity first.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this can happen if users provide nonsense for the na_object?!

It would have to be a nonsense string that passes PyUnicode_Check()

You're right that the -2 return value here is a an API wart. I added it as a way for me to catch programming errors. Maybe I should replace that return case with a debug abort inside the static string library and make it so this only fails if you run out of memory?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, true. The user would have to try very hard... In principle tha twould mean we have to hceck the size up-front?
TBH, it is also just a minor wart if we raise a MemoryError rather than a size error in practice.

So OK either way, I though passing nogil and setting the error inside might be a way, but not sure it is much better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to fix this in another PR but will get to it.

@ngoldbaum
Copy link
Member Author

Thank you for the careful look over the code, I really appreciate you taking the time.

The union typing seems a bit complicated, but since the main struct is effectively opaque

I tried to make it an opaque struct but then opaque structs also don't have a size known at compile time, so that can't work. This is on the edge of my C knowledge - is it possible to have an opaque sized struct?

Maybe I could make the "public" struct something like:

typedef struct opaque_npy_static_string {
    char[sizeof(private_union_type)]
}

I'd cast to the union inside the static_string library.

I like your idea for the public API to return an unpacked struct with a size and buf field. How about something like:

unpacked_string str;
int is_null = npy_load_str(*pointer, &str);

I like the idea of combining the null checking and unpacking, since you definitely need to do null checking before accessing a struct field. This API lets you use the return value of the unpacking operation to structure control flow for that case.

I am slightly worried that we may not be ready to ABI freeze this. I suppose the question is how important user defined ufuncs are here (and how important it is to promise that we won't just break them in a minor but not bug-fix release).

Yeah, I think this API definitely needs more people than me working with it before we declare it stable. Do you see a problem with marking the C API and binary layout as experimental when we merge it in NumPy?

Did you already ensure things work with big endian systems? It looks a bit like that may need tweaks/additional bit shifts.

I tested a slightly modified version of Warren's prototype on a bigendian powerpc debian QEMU image and that worked. I haven't yet tested my code on that image, I was planning to do so this week.

@seberg
Copy link
Member

seberg commented Sep 19, 2023

like your idea for the public API to return an unpacked struct with a size and buf field. How about something like:

Looks good if you like it. The only thing I am still slightly worried is that we may end up needing some form of locking additionally.

About the opacity of the struct, I am happy with it, was just briefly surprised. For public API we might want to think about doing the version you gave there (for public API even the struct size may be unimportant, actually).

Do you see a problem with marking the C API and binary layout as experimental when we merge it in NumPy?

No, I don't see it as a problem at all. Might be nice to add a warning in a header somewhere where we define the structs.
The one problem would be about pickling/unpickling (or npy), but I suspect we will use objects for now anyway.
(Sure, someon will likely complain, but I think that will be a sign of success ;).)

QEMU image and that worked. I haven't yet tested my code on that image, I was planning to do so this week.

Yeah, no worries, was just curious. TBH, this are effectively minor tweaks that NumPy CI would flush out either way!

@ngoldbaum
Copy link
Member Author

In the NEP the locking is handled by the arena allocator. In the next iteration I'm going to make all data access that requires touching the heap be done by functions that take both an arena allocator struct and a heap pointer. The allocator figures out if the pointer is in a contiguous read-only storage (e.g. an array that hasn't yet been mutated) and if so allows read access to the data and if not, guards access to the heap data with a fine-grained read/write lock. Writes only happen after all reads are complete.

Does that all sounds reasonable for the locking? The way this will shake out in the C API is that people will need to pass a reference to an allocator which will be a member of the dtype struct to an API function along with an ss struct to get access to the string data. Hopefully I'll have something more concrete for you to look at soon!

@seberg
Copy link
Member

seberg commented Sep 20, 2023

Right, it probably is reasonable for locking (I am not 100% sure, but starting to mix up two different issues here).

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Sep 21, 2023

I've updated this to use a packing/unpacking API that I really like. Now the details of the endian-dependant union implementation are hidden in static_string.c and we only publicly expose two structs and two constants:

typedef struct npy_packed_static_string {
    char packed_buffer[sizeof(char *) + sizeof(size_t)];
} npy_packed_static_string;

typedef struct npy_static_string {
    size_t size;
    const char *buf;
} npy_static_string;

// represents the empty string and can be passed safely to npy_static_string                                                                            
// API functions                                                                                                                                        
extern const npy_packed_static_string *NPY_EMPTY_STRING;
// represents a sentinel value, *CANNOT* be passed safely to npy_static_string                                                                          
// API functions, use npy_string_isnull to check if a value is null before                                                                              
// working with it.                                                                                                                                     
extern const npy_packed_static_string *NPY_NULL_STRING;

I found a nice pattern where you unpack a pointer to a npy_packed_static_string into a stack-allocated npy_static_string, and then work with the stack-allocated data.

I think this more clearly delineates data access and data storage. Data are stored inside the packed struct, which must be freed if we want to overwrite it, but are viewed with a stack-allocated unpacked struct, which should never be freed and are just zero-copy views onto the packed string data or heap-allocated string data.

@ngoldbaum ngoldbaum merged commit 7c5bcbb into numpy:main Sep 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants