
BUG: casting not using configurable memory functions from NEP 49? #22686

Open
bluescarni opened this issue Nov 28, 2022 · 19 comments

@bluescarni

Describe the issue:

Hello!

I am trying to use the new configurable memory routines from NEP 49, described here:

https://numpy.org/devdocs/reference/c-api/data_memory.html#configurable-memory-routines-in-numpy-nep-49

My use case is that I want to store non-trivial C++ objects as elements of NumPy arrays. In order to correctly manage the lifetime of my C++ objects (including correct construction and destruction, without leaking memory), I override the default malloc/calloc/realloc/free implementations so that metadata is associated with every array data segment allocated by NumPy. This metadata keeps track of which C++ objects have been constructed in the data segment. Then, in the basic NumPy primitives for my C++ dtype (e.g., get/setitem, nonzero, copyswap, etc.), I query the metadata of the data segment and construct C++ objects in-place as needed. The free function override takes care of calling the destructors of the constructed C++ objects before de-allocating the data segment.

For reference, here is the implementation of the overriding memory functions + metadata handling:

https://github.com/bluescarni/heyoka.py/blob/b661040030ff4d05c0dc2e03c8c58217d9ca0d71/heyoka/numpy_memory.cpp
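
In rough outline, the registration side of that scheme looks like the following. This is only a minimal sketch of the NEP 49 C API from the docs linked above, not the actual heyoka.py code: the tracking_* function names are placeholders and the metadata bookkeeping is elided.

#include <cstddef>
#include <cstdlib>
#include <Python.h>
#include <numpy/arrayobject.h>

// Placeholder allocation functions: a real implementation would reserve extra
// space for per-segment metadata and record which elements were constructed.
static void *tracking_malloc(void *ctx, std::size_t size)
{
    return std::malloc(size);
}

static void *tracking_calloc(void *ctx, std::size_t nelem, std::size_t elsize)
{
    return std::calloc(nelem, elsize);
}

static void *tracking_realloc(void *ctx, void *ptr, std::size_t new_size)
{
    return std::realloc(ptr, new_size);
}

static void tracking_free(void *ctx, void *ptr, std::size_t size)
{
    // A real implementation would destroy any C++ objects recorded as
    // constructed in this segment before releasing it.
    std::free(ptr);
}

// NEP 49 handler: name, version, and the allocator table.
static PyDataMem_Handler tracking_handler = {
    "cpp_object_tracking_allocator",
    1,
    {
        nullptr, // ctx
        tracking_malloc,
        tracking_calloc,
        tracking_realloc,
        tracking_free,
    },
};

// Install the handler; the data segments of arrays created afterwards go through it.
void install_tracking_handler()
{
    PyObject *capsule = PyCapsule_New(&tracking_handler, "mem_handler", nullptr);
    PyObject *old = PyDataMem_SetHandler(capsule);
    Py_XDECREF(old);
    Py_DECREF(capsule); // PyDataMem_SetHandler keeps its own reference
}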

And here is the code for exposing my C++ type (called mppp::real) to Python and NumPy:

https://github.com/bluescarni/heyoka.py/blob/b661040030ff4d05c0dc2e03c8c58217d9ca0d71/heyoka/expose_real.cpp

So far, most of NumPy's basic functionality seems to work fine: I can create new arrays with my custom C++ dtype, access elements, etc. See here for a WIP testing suite I am working on:

https://github.com/bluescarni/heyoka.py/blob/b661040030ff4d05c0dc2e03c8c58217d9ca0d71/heyoka/test.py#L5399

However, I am running into an issue when I attempt to cast arrays of my custom dtype to/from other NumPy dtypes.

What happens is that the casting functions seem to receive from/to arrays in which the data segment has NOT been allocated via the configurable memory routines from NEP 49. This of course is a big issue, because there is no metadata associated with the data segments, and my whole scheme for managing the lifetimes of my C++ objects comes to a crashing halt.

I have spent a few hours with a debug build of NumPy and a debugger. Although I haven't fully understood the intricacies of the casting logic, I think I have managed to pinpoint where the data segments passed to the casting functions originate from, and it seems to be here:

https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/dtype_transfer.c#L2865

I.e., from a direct PyMem_Malloc() call (rather than leveraging the configurable memory routines from NEP 49). The to/from buffers that are passed to the casting functions are defined a bit below:

https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/dtype_transfer.c#L2876

(from_buffer and to_buffer).

Am I wrong in expecting that these two buffers should be allocated via the NEP 49 functions?

Reproduce the code example:

No easy reproducer, sorry - but I included links in the issue description.

Error message:

No response

NumPy/Python version information:

1.25.0.dev0+20.gd4b2d4f80 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]

Context for the issue:

No response

@seberg
Member

seberg commented Nov 29, 2022

Thanks for the interest in wrapping such objects! Maybe @mattip thinks differently, but I think that approach is fundamentally broken, unfortunately. That is, the DType itself should not control the allocation scheme (we could allow it to, but I suspect it is not a great approach, also because generally we may want traversal and not just deallocation).

Just yesterday I looked into reorganizing how NumPy clears memory; that would replace the current refcounting scheme (this is a work in progress) and would be available to new user DTypes (but only NEP 41-42 style ones).
That should solve wrapping C++ objects (I am still assuming that initializing to zero is generally fine, although I am considering narrowing that down a bit).

The new API is still unstable and hidden behind an "experimental" flag, but I don't really expect massive changes, just that ABI breaks will happen and some things might just crash because of an old code path that doesn't work.

What is your timeline? Do you want to meet up and talk about such a new API, or help with developing it? There is a bit more interest in developing new-style DTypes now, so I hope that will help settle the API and fill in the missing pieces relatively quickly.

@bluescarni
Author

bluescarni commented Nov 29, 2022

Hi @seberg , thanks for the reply!

To clear up misconceptions, the approach I outlined above is not about having dtype-specific memory management strategies. Rather, I believe that the problem I am seeing is that the NEP 49 memory allocation functions are not used consistently throughout the NumPy codebase, so situations can arise in which NumPy arrays (or at least their data segments) are allocated without going through the customisable memory allocation routines.

Perhaps this is all working as intended; in that case my approach is destined to fail.

I would be very interested in learning more/chatting about the new dtype API - does it solve the general issue of wrapping C++ types with arbitrary constructors/destructors/assignment operators? Can I read about it somewhere?

@seberg
Member

seberg commented Nov 29, 2022

To clear up misconceptions, the approach I outlined above is not about having dtype-specific memory management strategies.

But your dtypes need a specific memory allocation/deallocation function to work? So if I write a "pinned memory" allocator that replaces yours, your dtype will break. I suppose the DType could try to guess at that, but that seems rather involved?
So I was saying: OK, your DType could indicate which memory allocator to use, which is plausible to me, but I don't quite see it for the C++ use-case.

You can have a look at the NEPs, maybe starting from NEP 42 (some parts may be a bit outdated, but that is basically what we have now as an experimental header you can use).

The new API can get a generic deallocation function. I had not prioritized that for a long time, because an early argument was that maybe we should just ask everyone to store Python objects and support that explicitly. That is also a solution (one that doesn't quite work yet), but I now think that it is easier to just allow you to hook into deallocation.

Simply put: for now I will create a new dtype.get_clear_function(rather_many_args), which would allow you to have a custom function to clear array data (clear, as in deallocate and set to 0 again).

One thing I would like to keep, but am not quite sure about for C++ objects, is to generally only zero the array data as initialization for "objects with references". Maybe with some limits (i.e. only some functions like clear would need to support zeroed memory, not all).

We are creating an example repo at https://github.com/numpy/numpy-user-dtypes; the examples there won't be production ready for a while, of course. I have some more examples myself and a new one coming, which I may add there (the new one isn't public yet).

@bluescarni
Author

But your dtypes need a specific memory allocation/deallocation function to work? So if I write a "pinned memory" allocator that replaces yours, your dtype will break. I suppose the DType could try to guess at that, but that seems rather involved? So I was saying: OK, your DType could indicate which memory allocator to use, which is plausible to me, but I don't quite see it for the C++ use-case.

You are right that if you override my (de)allocation functions, the whole machinery will break down.

You can have a look at the NEPs, maybe starting from NEP 42 (some parts may be a bit outdated, but that is basically what we have now as an experimental header you can use).

The new API can get a generic deallocation function. I had not prioritized that for a long time, because an early argument was that maybe we should just ask everyone to store Python objects and support that explicitly. That is also a solution (one that doesn't quite work yet), but I now think that it is easier to just allow you to hook into deallocation.

Thanks for the pointer, I will do some reading.

One data point: I think it's vital to ensure that people can create NumPy views on memory buffers created on the C++ side. I.e., I would really like to be able to create a NumPy array that has as its base object a capsule storing internally a std::vector<T> (or a pointer/reference to it), so that the NumPy array points to the memory in the std::vector<T> and allows me to avoid needless copying around. Forcing people to store Python objects in a NumPy array would break this ability to have zero-copy interoperability with C++ buffers.
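
To make the pattern concrete, here is a minimal sketch of the kind of zero-copy view I mean (written with pybind11; the function and variable names are illustrative, not taken from heyoka.py):

#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
#include <vector>

namespace py = pybind11;

py::array_t<double> view_of_vector()
{
    // Heap-allocate the vector so its lifetime can be handed to the capsule.
    auto *vec = new std::vector<double>{1.0, 2.0, 3.0};

    // The capsule becomes the base object of the array: it destroys the vector
    // once the last NumPy view of the buffer is gone.
    py::capsule base(vec, [](void *p) {
        delete static_cast<std::vector<double> *>(p);
    });

    // The array aliases vec->data() directly; no copy of the buffer is made.
    return py::array_t<double>(static_cast<py::ssize_t>(vec->size()),
                               vec->data(), base);
}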

Simply put: for now I will create a new dtype.get_clear_function(rather_many_args), which would allow you to have a custom function to clear array data (clear, as in deallocate and set to 0 again).

One thing I would like to keep, but am not quite sure about for C++ objects, is to generally only zero the array data as initialization for "objects with references". Maybe with some limits (i.e. only some functions like clear would need to support zeroed memory, not all).

We are creating an example repo at https://github.com/numpy/numpy-user-dtypes; the examples there won't be production ready for a while, of course. I have some more examples myself and a new one coming, which I may add there (the new one isn't public yet).

From the little time I have spent in the NumPy codebase, it seems like it's peppered with liberal uses of memset(), calloc(), realloc() etc. when dealing with array data segments. I.e., it seems to me (but perhaps I am wrong) that there is an implicit assumption that bit-copying and zeroing memory is enough to copy/init the underlying data, which is most definitely not the case for C++ objects. Is the plan to go through the NumPy codebase and replace these occurrences with a more abstract, dtype-dependent API?

@mattip
Member

mattip commented Nov 29, 2022

Am I wrong in expecting that these two buffers should be allocated via the NEP 49 functions?

The dtype auxdata is part of the dtype, not part of the "data segment" in the sense described in NEP 49, which was meant to allow overriding the memory management scheme for the contiguous chunk numpy.ndarray.data. The auxdata lives on the dtype, and so is not covered by the NEP 49 memory management scheme.

I am not sure how far we can stretch the basic NumPy premise that data is stored as a single contiguous segment to cover a different premise where each data element is to be interpreted as a C++ object. Stretching it to cover the case where the data is a C struct is hard enough. If a user-defined dtype needs more layers to support such an idea, personally I would vote for keeping things simple. The casting machinery is quite hard to understand as-is, without further complications.

From the little time I have spent in the NumPy codebase, it seems like it's peppered with liberal uses of memset(), calloc(), realloc() etc

I would be interested in seeing where those are used on the data segment. We have a passing test that sets a NEP 49 memory management strategy and re-runs the entire test suite with it. Additionally, when I implemented NEP 49 I tried to carefully check the places that use those functions to make sure they do not touch the data segment of a managed array (they may be used for some local, temporary arrays). Did I miss any?

which is most definitely not the case for C++ objects

In summary: I am not convinced you can safely use NumPy to store C++ objects in the data segment without some large changes. Perhaps you could get away with storing pointers to C++ objects, in something like an object array.

@seberg
Member

seberg commented Nov 29, 2022

"littered with" is a bit of an exaggeration, I think :). It is a maze, but there are a couple of bad functions mostly written 20 years ago that are just broken. Then there are a few places that do incorrect safety checks.

But beyond that, there should only be a handful of places that really need adjustment for custom initialization to work.
We will probably have to live with holes/bugs, but as long as we fix those up as they come up, I don't see it as daunting.

I am not sure about realloc: it should be OK for C++ objects, no? I guess we could have some hook to "re-init" when a realloc has happened? There are not super many places where we realloc, but there are a few where it is used and does make sense.

One caveat I see is double clearing. It would be easier if we could assume that it is fine to call the dtype.clear_buffer_data() function twice on the same memory. OTOH, I am not sure how bad it would actually be to find another solution for that.

@seberg
Member

seberg commented Nov 29, 2022

Ah, maybe one point about memmove, etc.: the builtin NumPy DTypes assume they can use those, yes. But a custom DType should be able to avoid code paths that call them explicitly. So if you are hoping to hook into them through a memory allocator, you are definitely out of luck, yes. But through the DType it should be OK.

@bluescarni
Author

Hi @seberg and @mattip and thanks a lot for your replies.

The dtype auxdata is part of the dtype, not part of the "data segment" in the sense described in NEP 49, which was meant to allow overriding the memory management scheme for the contiguous chunk numpy.ndarray.data. The auxdata lives on the dtype, and so is not covered by the NEP 49 memory management scheme.

I think I more or less understand the rationale. My surprise comes from the fact that these pointers, originating from a PyMem_Malloc() call

https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/dtype_transfer.c#L2876

are eventually passed into the user-defined casting functions (I mean, those that are registered via PyArray_RegisterCastFunc()). I had always assumed that the casting functions, like several (most?) other functions that can be registered for a custom dtype, accept as input data pointers from a data segment covered by NEP 49, but if that cannot be relied upon, then my bad :)
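
For reference, this is the registration path I am talking about; a minimal sketch with illustrative names (the actual conversion logic is omitted):

#include <Python.h>
#include <numpy/arrayobject.h>

// Signature imposed by PyArray_RegisterCastFunc (PyArray_VectorUnaryFunc).
// Note that 'from' and 'to' are raw element pointers: in the buffered casting
// path they can point into temporary PyMem_Malloc'ed scratch space rather than
// into a data segment allocated through the NEP 49 handler.
static void real_to_double(void *from, void *to, npy_intp n,
                           void *fromarr, void *toarr)
{
    // ... convert n elements of the custom dtype into doubles ...
}

// 'real_descr' stands for the descriptor of the registered custom dtype.
int register_real_to_double_cast(PyArray_Descr *real_descr)
{
    return PyArray_RegisterCastFunc(real_descr, NPY_DOUBLE, &real_to_double);
}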

I would be interested in seeing where those are used on the data segment. We have a passing test that sets a NEP 49 memory management strategy and re-runs the entire test suite with it. Additionally, when I implemented NEP 49 I tried to carefully check the places that use those functions to make sure they do not touch the data segment of a managed array (they may be used for some local, temporary arrays). Did I miss any?

My expectations for what a data segment is were probably wrong, so I cannot really claim you missed anything :)

In summary: I am not convinced you can safely use NumPy to store C++ objects in the data segment without some large changes. Perhaps you could get away with storing pointers to C++ objects, in something like an object array.

My impression from working on this is that custom memory allocation strategies can get you 90% of the way there. As long as you are careful about keeping track of where in the data segment you have constructed C++ objects, there's no reason why it should not work (assuming of course that all the relevant buffer allocations are tracked under this machinery).

"littered with" is a bit of an exaggeration, I think :)

Apologies, I realised this could come across as a bit on the snarky side and wasn't meant to, so I edited it :)

But beyond that, there should only be a handful of places that really need adjustment for custom initialization to work.
We will probably have to live with holes/bugs, but as long as we fix those up as they come up, I don't see it as daunting.

Still, it seems like the dtype auxdata should then consider the possibility of having to initialise memory buffers in a custom fashion, or did I misunderstand?

I am not sure about realloc: it should be OK for C++ objects, no? I guess we could have some hook to "re-init" when a realloc has happened? There are not super many places where we realloc, but there are a few where it is used and does make sense.

Unfortunately realloc() won't work for C++ objects.

@seberg
Member

seberg commented Nov 29, 2022

Please ignore "dtype auxdata"; it is old API that needs to be replaced. Just say DType class and dtype instance. We can tack on anything to the DType class, including a custom init; we just need to make sure it is actually called. And that shouldn't be implausibly hard.

OK, so for realloc() things are probably tricky, because other C++ objects could point back to them. But there are two styles of "objects":

  1. A pointer to a C++ object living elsewhere (similar to a Python object). Realloc'ing the pointers themselves doesn't matter.
  2. The full object struct, which may contain objects elsewhere. This is not fine if:
    • it contains internal pointers (they would need to be fixed up);
    • someone from outside may point into our array. I suspect this should just not be allowed; if you need that, use 1.

I don't know if there is a way to tell whether a C++ object is safe for direct storage in a C-style array? But how does std::vector resizing work? If you can store it directly (as in 2.) in an std::vector, I would assume you can also do so in a NumPy array.

@mattip
Member

mattip commented Nov 29, 2022

Here is one example of a problematic code stanza: in PyArray_AssignRawScalar it is assumed that PyArray_malloc can safely create a single element that is then copied into the dst array. Perhaps, in addition to PyDataType_REFCHK, we should also check for NPY_NEEDS_INIT to avoid this code?

    /*
     * Make a copy of the src data if it's a different dtype than 'dst'
     * or isn't aligned, and the destination we're copying to has
     * more than one element. To avoid having to manage object lifetimes,
     * we also skip this if 'dst' has an object dtype.
     */
    if ((!PyArray_EquivTypes(PyArray_DESCR(dst), src_dtype) ||
                !(npy_is_aligned(src_data, npy_uint_alignment(src_dtype->elsize)) &&
                  npy_is_aligned(src_data, src_dtype->alignment))) &&
            PyArray_SIZE(dst) > 1 &&
            !PyDataType_REFCHK(PyArray_DESCR(dst))) {

Do we have a dtype with NPY_NEEDS_INIT that is not an object dtype in our tests?

@seberg
Member

seberg commented Nov 29, 2022

I have to figure out what it is that C++ objects need that makes realloc, etc. so hard. (Pointers to objects should be fine, I think, but is there a way to make more than that work? And are there any constraints we have for such pointers?)
OTOH, we use realloc only for constructing new arrays from iterators or files. We have the Python-side realloc function, but we could just refuse to call that on DTypes which contain C++ objects; it really isn't the most important API to begin with. Or we could not use realloc under the hood for those.

I am not sure why you want more than the PyDataType_REFCHK here? The new data is initialized (currently to 0, but that could be changed). The REFCHK alludes to "Python reference", but I think it actually means "you cannot just use memcpy on this, use the custom copy functions instead".

We do break that in a few places (i.e. check NEEDS_PYAPI rather than REFCHK), but I don't think we do here?

@mattip
Member

mattip commented Nov 29, 2022

I guess @bluescarni could test if PyDataType_REFCHK is sufficient by adding the NPY_NEEDS_REFCHK flag to the new dtype. Then the code path here would avoid PyArray_CastRawArrays -> PyArray_GetDTypeTransferFunction -> define_cast_for_descrs -> _multistep_cast_auxdata_clone_int (the latter is the problematic one)

@seberg
Member

seberg commented Nov 29, 2022

I don't see the issue right now? As of now, I assume it is OK if a DType:

  • Can do custom initialization (right now this is just 0'ing of memory)
  • Can do custom clearing of memory (working on that, but iterating the API a bit)

I would assume those two things are enough for most use-cases, even many C++ objects, although using typed Python objects is maybe just as good for anything larger.

Now, I don't want to pretend that the above isn't problematic. We have the problem of "ownership", for example in the HPy sense, or DTypes that may wish to have a custom memory arena (strings). For those, we need to figure out how such buffering works (maybe it is fine, because they attach the arena to the dtype itself, maybe not).

Of course there could be other interesting reasons for requiring a dtype specific allocation. Such dtypes couldn't be stored in a structured dtype (unless they all share that allocation scheme), though!
And yes, they most definitely would have to either avoid this type of cast completely, or we would need to use a more elaborate buffer allocation scheme here (and elsewhere).

@bluescarni
Author

bluescarni commented Nov 29, 2022

@seberg realloc() cannot work in C++ for (at least) these reasons:

  • if realloc() shrinks the memory buffer, the destructors of the elements at the end of the buffer are never called;
  • if realloc() increases the size of the buffer, the default constructors of the new elements are never called;
  • if realloc() has to copy data, it does so via memcpy(), which in C++ is valid only for objects which are trivially-copyable (see https://en.cppreference.com/w/cpp/types/is_trivially_copyable).

Thus, from a strictly language-lawyer point of view, realloc() can never work in C++, even though it may work in practice for trivially constructible, copyable and destructible types (e.g., built-in arithmetic types such as double or int).
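
For what it's worth, whether a type falls into that "works in practice" category can be checked at compile time with the trait linked above; a small sketch (C++17):

#include <string>
#include <type_traits>

struct point {
    double x, y; // plain aggregate: trivially copyable
};

// Types for which memcpy()/realloc()-style moves are well-defined.
static_assert(std::is_trivially_copyable_v<double>);
static_assert(std::is_trivially_copyable_v<point>);

// Types that need explicit construction/destruction (e.g. anything managing
// heap memory, such as std::string) fail the check.
static_assert(!std::is_trivially_copyable_v<std::string>);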

std::vector works more or less like this:

  • a memory buffer is allocated (e.g., via malloc() or new), but not initialised,
  • the elements of the vector are constructed one by one using the placement new operator (see https://en.wikipedia.org/wiki/Placement_syntax),
  • if the vector is up-sized, a new chunk of memory is allocated and its elements are copy constructed from the old elements, while the old elements are destroyed. The new elements at the end are default-constructed;
  • if the vector is down-sized, the elements at the end are destroyed.

In other words, in order to properly support C++ types in a NumPy array, one must be able to specify how to construct, assign and destroy individual array elements.
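
As a minimal sketch of that construct/destroy-in-raw-storage scheme (using std::string as a stand-in for a non-trivial element type), this is the kind of work a dtype-level init/clear hook would have to do on array data:

#include <cstdlib>
#include <memory>
#include <new>
#include <string>

int main()
{
    constexpr std::size_t n = 4;

    // Raw, uninitialized storage: no std::string objects exist here yet,
    // so zeroing or memcpy'ing this buffer would not create valid elements.
    void *buf = std::malloc(n * sizeof(std::string));
    auto *elems = static_cast<std::string *>(buf);

    // "init": construct each element in place with placement new.
    for (std::size_t i = 0; i < n; ++i) {
        ::new (static_cast<void *>(elems + i)) std::string("element");
    }

    // ... the buffer can now be used as an array of n std::string objects ...

    // "clear": run the destructors explicitly, then release the raw storage.
    for (std::size_t i = 0; i < n; ++i) {
        std::destroy_at(elems + i);
    }
    std::free(buf);
}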

@bluescarni
Author

I guess @bluescarni could test if PyDataType_REFCHK is sufficient by adding the NPY_NEEDS_REFCHK flag to the new dtype. Then the code path here would avoid PyArray_CastRawArrays -> PyArray_GetDTypeTransferFunction -> define_cast_for_descrs -> _multistep_cast_auxdata_clone_int (the latter is the problematic one)

@mattip you mean the NPY_ITEM_REFCOUNT flag?

@bluescarni
Author

Adding the NPY_ITEM_REFCOUNT flag results in this exception at import time:

ValueError: Failed to register dtype for <class 'heyoka.core.real'>: Legacy user dtypes using `NPY_ITEM_IS_POINTER` or `NPY_ITEM_REFCOUNT` are unsupported.  It is possible to create such a dtype only if it is a structured dtype with names and fields hardcoded at registration time.

@seberg
Member

seberg commented Nov 29, 2022

@bluescarni you should give up on old-style dtypes for this. The only thing they can possibly support, through ugly hacks, is storing Python objects inside.

@bluescarni the point being: so long as your object is std::is_trivially_copyable, we are fine :). If it is not, I think you are also fine for most objects created via new as the "storage scheme" (unless the object is trivial, chances are that putting it behind a pointer isn't that terrible).

  • If memory is downsized by realloc, we clear the chunk that is not used before the realloc.
  • If memory is upsized, we initialize that memory (currently maybe to NULL, but as I said, we could look into making that more nifty).

The problem only arises if you need to copy the elements, since we use realloc in places where we want the kernel to resize for us efficiently. But of course we can work around it everywhere (worst case, it is inefficient). We can easily do the "allocate something larger and copy the data" scheme in NumPy.

@mattip
Member

mattip commented Nov 29, 2022

It still seems problematic to me to create a copy of the src data in PyArray_AssignRawScalar. @seberg, are you sure the alignment issue justifies the whole allocate-a-tmp-scalar dance? I would be interested in seeing a benchmark. The only other use of PyArray_CastRawArrays is in PyArray_HolidaysConverter, which is specific to datetime dtypes. While that won't totally remove the need for PyArray_GetDTypeTransferFunction, I think it simplifies this particular code path a bit.

@seberg
Member

seberg commented Nov 29, 2022

@mattip I don't see what is problematic. We make copies all the time; there is nothing special here. The only special thing is that the temporary copy isn't wrapped into a full-blown array object?

Now, the point of that is of course the assumption that copying the data from a scalar to the array may be much faster than doing the equivalent cast. I expect:

  • This probably doesn't matter at all for numeric casts (they are memory bound, and it doesn't seem like we have scalar_to_contig special code paths that we don't also have for casts). (Likely such paths are just not worthwhile in practice because they don't help with the memory-bound task, although there are more important things than this.)
  • May matter a lot if the scalar is unaligned (which should be rare, except maybe for complex, because we impose a larger than normal alignment for copy).
  • May be a big deal for complicated casts (in the current machinery) like arr_int8[3] = np.array(3, dtype=">f8"), where byte-swapping is needed (but this is niche).

So, I dunno if it is overall worthwhile. Maybe the special cases for effectively stride == 0 here would also be better pushed into some lower machinery.

Also, I bet we can get more of a speed improvement out of SIMDifying the existing casts (with compiler hints only, if necessary) than we would lose if we just removed this...
