-
-
Notifications
You must be signed in to change notification settings - Fork 9
Better naming of public symbols and small string optimization #86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me and a general improvement. I still have some worries around thread-safety, etc. but they are unrelated.
In theory, one could even spin this further and allow users to "bloat" the storage size if they know that they have a lot of strings with size 8 and want to fit them.
I do wonder slightly if it may be useful to keep the old code mostly intact and use a pattern of:
unpacked_string str;
npy_load_str(*pointer, &str);
Where unpacked_string
is defined as:
struct {
size_t length;
char *buf;
}
(maybe with the NA info also). Not sure that is actually simpler though. The union typing seems a bit complicated, but since the main struct is effectively opaque, it seems fine.
One thing I do wonder: Could we make the public API be more minimal (i.e. actual functions that unpack/pack to a representation we are certain about; although the potential need for locking is another issue)?
I am slightly worried that we may not be ready to ABI freeze this. I suppose the question is how important user defined ufuncs are here (and how important it is to promise that we won't just break them in a minor but not bug-fix release).
Did you already ensure things work with big endian systems? It looks a bit like that may need tweaks/additional bit shifts.
|
||
typedef struct _npy_static_string_t { | ||
char *buf; | ||
size_t size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will limit 32bit string sizes a bit, but I guess that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think special-casing the struct layout on 32 bit systems is worth adding any complexity here. I'm planning to add better errors for when someone tries to construct a string array that hits these limits.
_npy_static_string_u base; | ||
} npy_static_string; | ||
|
||
// room for two more flags with values 0x20 and 0x10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we do have more room if it is not a short string, I guess. (just thinking loudly)
} | ||
else if (res == -2) { | ||
// this should never happen | ||
assert(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this can happen if users provide nonsense for the na_object
?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a helper would make sense, if we have the two different return values? Or pass int nogil
and set the error already in npy_string_newsize
?
(since more than MemoryError is possible here, unless we check the size for validity first.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this can happen if users provide nonsense for the na_object?!
It would have to be a nonsense string that passes PyUnicode_Check()
You're right that the -2
return value here is a an API wart. I added it as a way for me to catch programming errors. Maybe I should replace that return case with a debug abort inside the static string library and make it so this only fails if you run out of memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, true. The user would have to try very hard... In principle tha twould mean we have to hceck the size up-front?
TBH, it is also just a minor wart if we raise a MemoryError rather than a size error in practice.
So OK either way, I though passing nogil
and setting the error inside might be a way, but not sure it is much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to fix this in another PR but will get to it.
Thank you for the careful look over the code, I really appreciate you taking the time.
I tried to make it an opaque struct but then opaque structs also don't have a size known at compile time, so that can't work. This is on the edge of my C knowledge - is it possible to have an opaque sized struct? Maybe I could make the "public" struct something like:
I'd cast to the union inside the I like your idea for the public API to return an unpacked struct with a
I like the idea of combining the null checking and unpacking, since you definitely need to do null checking before accessing a struct field. This API lets you use the return value of the unpacking operation to structure control flow for that case.
Yeah, I think this API definitely needs more people than me working with it before we declare it stable. Do you see a problem with marking the C API and binary layout as experimental when we merge it in NumPy?
I tested a slightly modified version of Warren's prototype on a bigendian powerpc debian QEMU image and that worked. I haven't yet tested my code on that image, I was planning to do so this week. |
Looks good if you like it. The only thing I am still slightly worried is that we may end up needing some form of locking additionally. About the opacity of the struct, I am happy with it, was just briefly surprised. For public API we might want to think about doing the version you gave there (for public API even the struct size may be unimportant, actually).
No, I don't see it as a problem at all. Might be nice to add a warning in a header somewhere where we define the structs.
Yeah, no worries, was just curious. TBH, this are effectively minor tweaks that NumPy CI would flush out either way! |
In the NEP the locking is handled by the arena allocator. In the next iteration I'm going to make all data access that requires touching the heap be done by functions that take both an arena allocator struct and a heap pointer. The allocator figures out if the pointer is in a contiguous read-only storage (e.g. an array that hasn't yet been mutated) and if so allows read access to the data and if not, guards access to the heap data with a fine-grained read/write lock. Writes only happen after all reads are complete. Does that all sounds reasonable for the locking? The way this will shake out in the C API is that people will need to pass a reference to an allocator which will be a member of the dtype struct to an API function along with an |
Right, it probably is reasonable for locking (I am not 100% sure, but starting to mix up two different issues here). |
I've updated this to use a packing/unpacking API that I really like. Now the details of the endian-dependant union implementation are hidden in typedef struct npy_packed_static_string {
char packed_buffer[sizeof(char *) + sizeof(size_t)];
} npy_packed_static_string;
typedef struct npy_static_string {
size_t size;
const char *buf;
} npy_static_string;
// represents the empty string and can be passed safely to npy_static_string
// API functions
extern const npy_packed_static_string *NPY_EMPTY_STRING;
// represents a sentinel value, *CANNOT* be passed safely to npy_static_string
// API functions, use npy_string_isnull to check if a value is null before
// working with it.
extern const npy_packed_static_string *NPY_NULL_STRING; I found a nice pattern where you unpack a pointer to a I think this more clearly delineates data access and data storage. Data are stored inside the packed struct, which must be freed if we want to overwrite it, but are viewed with a stack-allocated unpacked struct, which should never be freed and are just zero-copy views onto the packed string data or heap-allocated string data. |
This refactors things to better match the naming scheme in the NEP.
It also adds small string optimization following @WarrenWeckesser's prototype.
I first refactored everything outside the implementation of the string struct to not use direct field access, then I redefined the old struct to be a union following Warren's design and refactored the string implementation.
The only behavior change relative to the current NEP draft is that
np.empty(string_array)
always returns an array filled with empty strings, no matter what NA object is set for the dtype. This is both for expediency (I'd need to add a new empty initialization hook in the numpy dtype API to keep the old behavior) and because on second thought it makes more sense.I changed the naming a bit relative to Warren's proof-of-concept to make things make more sense to me but other than that there should only be minimal changes, if any. The changes to support small string optimization should be contained to
static_string.c
if you want to only look at that.@seberg if you have a chance it would be great to have you look at this too.