NEP: propose new nep for allocator policies #18805
Conversation
cc: @pentschev @jakirkham
Silly question: Is there a webpage for this NEP rendered somewhere by the CI?
The NEP is rendered at https://19289-908607-gh.circle-artifacts.com/0/doc/neps/_build/html/nep-0049.html
Thanks, just some comments. While I always see that there is a lot of interest in this, I am a bit fuzzy on the gains (and maybe also if there are alternatives such as subclassing, which doesn't sound like a particularly good alternative). It would be nice to add a (very short) paragraph about which end-user scenarios can gain a lot (I assume these are things where we ensure that arrays are allocated in memory shared between CPU and GPU/FPGA?)
One thing I also wonder is whether it might make sense at some point to have an array creation function where the handler is passed explicitly?
to the ``ndarray->data`` pointer were already breaking the current memory
management strategy (backed by ``npy_alloc_cache``) and should restore
``ndarray->data`` before calling ``Py_DECREF``. As mentioned above, the change
in size should not impact end-users.
This is not quite true. The old strategy was backward compatible. The worst thing that could happen is that you allocate a huge chunk, create an array with it, set OWNDATA, and then the free will put it into a "too large" bucket (wasting memory). But overall, there was never any API change in the way the cache was implemented.
This PR has a tiny chance of that, because if an external project would replace `ndarray->data` (i.e. realloc or use free/malloc), that would have worked before but it will stop working if the allocation strategy was modified.
Now, I am not worried about that. Nobody should be doing that, and it can only be a problem if you use one project modifying the allocation strategy AND a project modifying `ndarray->data` together.
Your PR makes sure that if the user provides the data (and may have allocated it), we will end up using the normal `free` strategy, so no issue exists for arrays created from passed-in data.
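(For illustration only: a minimal sketch of the kind of external code being described here, an extension that swaps out `ndarray->data` directly. The helper name `swap_data_buffer` is hypothetical; the point is that it assumes the data came from `malloc` and is compatible with `free`:)

/* Hypothetical extension code that replaces an array's data buffer in place.
 * Under the old cache-backed strategy NumPy's deallocation ultimately called
 * free(), so this happened to keep working; once a custom allocator is
 * installed, NumPy would call that allocator's free on memory it never
 * allocated. */
#include <stdlib.h>
#include <string.h>
#include <numpy/arrayobject.h>

static int
swap_data_buffer(PyArrayObject *arr, size_t nbytes)
{
    char *new_buf = malloc(nbytes);
    if (new_buf == NULL) {
        return -1;
    }
    memcpy(new_buf, PyArray_DATA(arr), nbytes);
    free(PyArray_DATA(arr));                      /* assumes malloc-backed data */
    ((PyArrayObject_fields *)arr)->data = new_buf;
    return 0;
}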
(same also applies to `->shape`, but that is even crazier, since the shape and strides were always allocated together, so you have to dig into NumPy code to get it right. Overall, it makes me almost wonder if we could get away with breaking ABI and replacing the allocation strategy there with `PyMem_Malloc()` by default, but I guess it doesn't matter.)
I didn't change the shape/strides allocation in the PR. Should I make it explicit in the NEP that the policy only touches the data and not the shape/strides?
Yeah, I think I misread the abstract a bit and got sidetracked. Maybe rearranging the second paragraph of the abstract slightly would make the intent clearer (brainstorming here):
This NEP proposes a mechanism to override the memory management strategy used for `ndarray->data` with user-provided alternatives. This allocation holds the array's data and can be very large. As this is the data that operations work on, it can benefit from custom allocation strategies to guarantee data alignment or placement in memory shared with the GPU.
EDIT: To be clear, I think it's fine, but maybe it could be a bit clearer there.
Updated.
included later in the NEP. The added ``struct`` will increase the size of the
``ndarray`` object. It is one of the major drawbacks of this approach. We can
be reasonably sure that the change in size will have a minimal impact on
end-user code because NumPy version 1.20 already changed the object size.
Seems fine here; in line with my questions below, it might be nice to have one paragraph that explains how users will benefit (that is probably the users of the libraries that modify the allocation strategy).
I assume these are things like: alignment guarantees or NUMA core pinning can speed up their code. Allocating in memory shared by GPU and CPU speeds up certain operations?
Alignment guarantees or NUMA core pinning can speed up their code
Yes. There are libraries like `pnumpy` that would like to keep data on a specific core, or hardware where unaligned memory is expensive (like ARMv7).
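To make the alignment use-case concrete, here is a sketch of what 64-byte-aligned data allocation routines from such a library could look like. The function signatures are an assumption based on the draft handler discussed in this thread, and a realloc routine is omitted because there is no portable aligned realloc:

#include <stdlib.h>
#include <string.h>

/* Sketch of 64-byte-aligned allocation routines a library like pnumpy might
 * register as the ndarray data allocator.  Signatures are assumed from the
 * draft under review and may not match the final NEP. */
static void *
aligned64_alloc(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, size) != 0) {
        return NULL;
    }
    return p;
}

static void *
aligned64_zeroed_alloc(size_t nelems, size_t elsize)
{
    void *p = aligned64_alloc(nelems * elsize);
    if (p != NULL) {
        memset(p, 0, nelems * elsize);
    }
    return p;
}

static void
aligned64_free(void *ptr, size_t size)
{
    (void)size;   /* the size hint is not needed for a plain free() */
    free(ptr);
}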
Can you guarantee "pinned" memory in the CUDA sense as well? I believe that means that the memory is banned from swapping so the CPU doesn't have to be used. I've seen this special memory type used for passing seamlessly between CPU and GPU; however, the allocation of "formally" pinned memory in this sense normally uses the CUDA C API.
I think using pinned memory would hit some snags due to `memcpy`. A future enhancement might be to enable user-supplied `memcpy` routines and use them inside NumPy when creating a copy of an ndarray.
Just FYI. CuPy recently added a few APIs to allocate pinned memory that backs a NumPy array:
For using pinned memory more conveniently, we also provide a few high-level APIs in the cupyx namespace, including cupyx.empty_pinned(), cupyx.empty_like_pinned(), cupyx.zeros_pinned(), and cupyx.zeros_like_pinned(). They return NumPy arrays backed by pinned memory. If CuPy’s pinned memory pool is in use, the pinned memory is allocated from the pool.
https://docs.cupy.dev/en/stable/user_guide/memory.html#memory-management. I think PyCUDA also has similar APIs.
Indeed, memcpy could be an issue for D2H transfers, but not H2D (at least in CuPy, `arr_cp[:] = arr_np_on_pinned` would handle this properly). It's something I am working to circumvent (by calling `arr_np_on_pinned = cp.asnumpy(arr_cp, out=arr_np_on_pinned)`).
doc/neps/nep-0049.rst
Outdated
These were discussed in `issue 17467`_. `PR 5457`_ proposed a
global interface for specifying aligned allocations. But alignment can be
crucial for some applications, but in general is just extra overhead, so it
should be configurable by the user/app.
Hmmm, not sure I follow. Is it that you want to distinguish between small/large arrays, but that would be difficult or impossible with a global "please align to XY" setting?
Yes
Maybe. I think that is out of scope for the NEP.
The NEP has been published to the mailing list http://numpy-discussion.10968.n7.nabble.com/NEP-49-Data-allocation-strategies-td49185.html
doc/neps/nep-0049.rst
Outdated
The ``numpy.ndarray`` requires additional memory allocations
to hold ``numpy.ndarray.strides``, ``numpy.ndarray.shape`` and
``numpy.ndarray.data`` attributes. These attributes are specially allocated
after creating the python object in ``__new__`` method. The ``strides`` and
``shape`` are stored in a piece of memory allocated internally.
As per #18805 (comment), can we rephrase/remove this paragraph please? I misread this too and thought the new allocator interface would change how `shape` and `strides` are allocated, which I couldn't care less about.
doc/neps/nep-0049.rst
Outdated
own. Two such use-cases are to ensure data alignment and to pin certain
allocations to certain NUMA cores. This desire for alignment was disucssed
multiple times on the mailing list `in 2005`_, and in `issue 5312`_ in 2014,
which led to `PR 5457` and more mailing list discussions here_ `and here`_. In
nitpick: we need to link to `PR 5457`.
doc/neps/nep-0049.rst
Outdated
a comment on the issue `from 2017`_, a user described how 64-byte alignment
improved performance by 40x.

Also related is `issue 14177` around the use of ``madvise`` and huge pages on
nitpick: we need to link to issue 14177
doc/neps/nep-0049.rst
Outdated
enable exploration of this rich area of possible optimizations. The intention
is to create a flexible enough interface without burdening normative users.

_`issue 5312`: https://github.com/numpy/numpy/issues/5312
The hyperlinks are rendered incorrectly. I think this is the correct syntax:
- _`issue 5312`: https://github.com/numpy/numpy/issues/5312
+ .. _issue 5312: https://github.com/numpy/numpy/issues/5312
(Same applies to the links below.)
doc/neps/nep-0049.rst
Outdated
----------------

The new functions can only be accessed via the NumPy C-API. An example is
included later in the NEP. The added ``struct`` will increase the size of the
nitpick
- included later in the NEP. The added ``struct`` will increase the size of the
+ included later in this NEP. The added ``struct`` will increase the size of the
doc/neps/nep-0049.rst
Outdated
When the hook is called, the GIL will be held by the calling
thread. The hook should be written to be reentrant, if it performs
operations that might cause new allocation events (such as the
creation/destruction numpy objects, or creating/destroying Python
- creation/destruction numpy objects, or creating/destroying Python
+ creation/destruction NumPy objects, or creating/destroying Python
doc/neps/nep-0049.rst
Outdated
thread. The hook should be written to be reentrant, if it performs
operations that might cause new allocation events (such as the
creation/destruction numpy objects, or creating/destroying Python
objects which might cause a gc)
Better be explicit:
- objects which might cause a gc)
+ objects which might cause a garbage collection).
doc/neps/nep-0049.rst
Outdated
Related Work
------------

The NEP is being tracked by the pnumpy_ project and a `comment in the PR`_
- The NEP is being tracked by the pnumpy_ project and a `comment in the PR`_
+ This NEP is being tracked by the pnumpy_ project and a `comment in the PR`_
doc/neps/nep-0049.rst
Outdated
Implementation
--------------

The NEP has been implemented in `PR 17582`_.
- The NEP has been implemented in `PR 17582`_.
+ This NEP has been implemented in `PR 17582`_.
doc/neps/nep-0049.rst
Outdated
``PyArray_malloc_aligned`` and friends were added to NumPy with the random API
refactor. and are used there for performance.
Question: is there a link to `PyArray_malloc_aligned`, or is it internal?
- ``PyArray_malloc_aligned`` and friends were added to NumPy with the random API
- refactor. and are used there for performance.
+ ``PyArray_malloc_aligned`` and friends were added to NumPy with the ``random`` API
+ refactor and are used there for performance.
Question: is there a link to PyArray_malloc_aligned, or is it internal?
It is internal
Here is the rendered NEP. It would be nice to merge this as a draft now; we can fix more before it reaches final status.
Sorry, I only had a chance to look at this now. It looks good; I left a few minor suggestions on typo and formatting fixes, as well as a comment on struct alignment.
doc/neps/nep-0049.rst
Outdated
structure to hold pointers to functions used to manage the data memory. The
calls are wrapped by internal routines to call :c:func:`PyTraceMalloc_Track`,
:c:func:`PyTraceMalloc_Untrack`, and will use the
:c:func:`PyDataMem_EventHookFunc` mechanism already present in NumPy for
- :c:func:`PyDataMem_EventHookFunc` mechanism already present in NumPy for
+ :c:func:`PyDataMem_EventHookFunc` mechanism already present in NumPy for
this was removed, are you looking at an older version?
Sorry, this may have been from an unfinished prior review.
doc/neps/nep-0049.rst
Outdated
.. code-block:: c

    typedef struct {
        char name[128];  /* multiple of 64 to keep the struct unaligned */
Should this read "... keep the struct aligned", or am I misunderstanding something here?
fixed
doc/neps/nep-0049.rst
Outdated
This NEP proposes a mechanism to override the memory management strategy used
for ``ndarray->data`` with user-provided alternatives. This allocation holds
the arrays data and is can be very large. As accessing this data often becomes
- the arrays data and is can be very large. As accessing this data often becomes
+ the arrays data and can be very large. As accessing this data often becomes
fixed
doc/neps/nep-0049.rst
Outdated
the arrays data and is can be very large. As accessing this data often becomes
a performance bottleneck, custom allocation strategies to guarantee data
alignment or pinning allocations to specialized memory hardware can enable
hardware-specific optimizations. The other allocatations remain unchanged.
- hardware-specific optimizations. The other allocatations remain unchanged.
+ hardware-specific optimizations. The other allocations remain unchanged.
fixed
doc/neps/nep-0049.rst
Outdated
Users may wish to override the internal data memory routines with ones of their
own. Two such use-cases are to ensure data alignment and to pin certain
allocations to certain NUMA cores. This desire for alignment was disucssed
- allocations to certain NUMA cores. This desire for alignment was disucssed
+ allocations to certain NUMA cores. This desire for alignment was discussed
fixed
doc/neps/nep-0049.rst
Outdated
Sets a new allocation policy. If the input value is NULL, will reset
the policy to the default. Returns the previous policy, NULL if the
previous policy was the default. We wrap the user-provided functions
so they will still call the Python and NumPy memory management callback
hooks. All the function pointers must be filled in, NULL is not accepted.
- Sets a new allocation policy. If the input value is NULL, will reset
- the policy to the default. Returns the previous policy, NULL if the
- previous policy was the default. We wrap the user-provided functions
- so they will still call the Python and NumPy memory management callback
- hooks. All the function pointers must be filled in, NULL is not accepted.
+ Sets a new allocation policy. If the input value is ``NULL``, will reset
+ the policy to the default. Returns the previous policy, ``NULL`` if the
+ previous policy was the default. We wrap the user-provided functions
+ so they will still call the Python and NumPy memory management callback
+ hooks. All the function pointers must be filled in, ``NULL`` is not accepted.
fixed
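As a usage sketch of the set/restore semantics this docstring describes, reusing the hypothetical `aligned64_*` routines sketched earlier in this conversation. The setter name, handler layout, and field order are assumptions based on the draft in this PR, not a final API:

/* Hypothetical usage: install a custom handler, create arrays, then restore
 * the previous policy.  All function pointers must be non-NULL per the
 * docstring; aligned64_realloc is assumed to be defined elsewhere. */
PyDataMem_Handler aligned64_handler = {
    "aligned64_allocator",      /* name, padded to a multiple of 64 bytes */
    aligned64_alloc,            /* alloc */
    aligned64_zeroed_alloc,     /* zeroed_alloc */
    aligned64_free,             /* free */
    aligned64_realloc,          /* realloc (not shown in the earlier sketch) */
};

const PyDataMem_Handler *old = PyDataMem_SetHandler(&aligned64_handler);
/* ... ndarray data allocated while installed comes from aligned64_alloc ... */
PyDataMem_SetHandler(old);      /* passing NULL instead would reset to the default */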
Thanks, Matti, let's put it in. Editing a NEP later if necessary is easier. Anyone can continue commenting here for a while, especially for in-line comments.
@mattip Thanks for putting this together! Not sure if we should continue the discussion here or in a new thread, but do you think a Python API for this NEP is in scope? At least an API for querying the handler name (i.e. wrapping …)
In the spirit of minimality, I would say we can probably start off with … A NumPy function could return the name itself more naturally, or check whether it was allocated with the default allocator of course.
Thanks, @seberg. I had my own pet projects in mind and was not speaking for CuPy 😅 I had in mind something as simple as …
For CuPy, we actually have to figure out a way to use this, because currently we do not rely on any NumPy C API and I think it's preferable if we stay with the status quo. This discussion has yet to happen.
We could expose a python function from …
I am not opposed, just wasn't sure it actually adds much. Agreed about the namespace; if we can put it into some submodule namespace I am happy about just adding it! We also could have more …
You might want to check our integration. It is based on https://github.com/inaccel/numpy-allocator. This is how we use it: …
Thanks, @mattip @seberg. I don't have any preference for the namespace as long as it's guaranteed to be a public API 🙂 I think relying on …
@leofang, a good idea/proposal of where to put such an API would help a lot in getting it in smoothly, whether you care about where that is or not. Not sure that capsules are useful here; they are probably more useful for special attributes.
.. code-block:: c

    typedef struct {
        char name[128];  /* multiple of 64 to keep the struct aligned */
I wasn't aware this had any effect - doesn't an array in a struct only contribute its element size to the alignment?
@eric-wieser I understood it as relying on the assumption that the struct itself is 64-byte (cache-line) aligned. In that case the interesting part of the struct will remain aligned if `name` is a multiple of 64 bytes.
But, I might be misunderstanding it and I think you are right that it may make sense to add some attributes/pragma to ensure the struct itself is cache-line aligned. (Although, that is not part of the header.)
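A sketch of what making that explicit could look like (this is not part of the proposed header; the attribute shown is the usual GCC/Clang spelling, MSVC would need `__declspec(align(64))` instead, and the struct name here is a placeholder):

/* Sketch: ensure the handler struct type itself is cache-line aligned,
 * rather than relying only on the name[128] padding.  Not part of the
 * current proposal. */
#if defined(__GNUC__) || defined(__clang__)
#define CACHELINE_ALIGNED __attribute__((aligned(64)))
#else
#define CACHELINE_ALIGNED
#endif

typedef struct {
    char name[128];   /* multiple of 64 to keep the rest of the struct aligned */
    /* ... allocation function pointers ... */
} ExampleHandler CACHELINE_ALIGNED;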
On another note - Python doesn't use static char buffers for any of its object names, as I remember. Is there a reason for doing so here rather than using a `char*`?
How can we make it clear who allocates and frees the content of the `char *` and prevent corruption?
I just posted to the mailing list with a related comment, which I'll repeat here for ease.
It's easy to think of allocators that have other state besides the name, for which we need to solve this problem anyway:
Consider an allocator that aligns to `N` bytes, where `N` is configurable from a python call in someone else's extension module. Where do they store `N`? They can hide it in `PyDataMem_Handler::name` but that's obviously an abuse of the API. They can store it as a global variable, but then obviously the idea of tracking the allocator used to construct an array doesn't work, as the state ends up changing with the global allocator.

The easy way out here would be to add a `void* context` field to the structure, and pass it into all the methods. This doesn't really solve the problem though, as now there's no way to cleanup any allocations used to populate `context`, or worse decrement references to python objects stored within `context`.

I think we want to bundle `PyDataMem_Handler` in a `PyObject` somehow, either via a new C type, or by using the PyCapsule API which has the cleanup and state hooks we need. `PyDataMem_GetHandlerName` would then return this PyObject rather than an opaque name.
So I guess more concretely, my suggestion is:
typedef void *(PyDataMem_AllocFunc)(PyObject *self, size_t size);
typedef void *(PyDataMem_ZeroedAllocFunc)(PyObject *self, size_t nelems, size_t elsize);
typedef void (PyDataMem_FreeFunc)(PyObject *self, void *ptr, size_t size);
typedef void *(PyDataMem_ReallocFunc)(PyObject *self, void *ptr, size_t size);

typedef struct {
    PyObject_HEAD
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
} PyDataMem_HandlerObject;

// I haven't used the CAPI in a while, but there's not much boilerplate here anyway
PyType_Spec PyDataMem_HandlerTypeSpec =
{
    .name = "numpy.core.datamemhandler",
    .basicsize = sizeof(PyDataMem_HandlerObject),
    .itemsize = 0,
    .flags = Py_TPFLAGS_BASETYPE
};
Users who are fine with a stateless allocator would use:
PyDataMem_HandlerObject *handler = PyObject_New(PyDataMem_HandlerObject, &PyDataMem_HandlerType);
and then can set the fields as appropriate.
Users who want a stateful allocator can then just create a derived type with some extra state, e.g.:
typedef struct {
    PyDataMem_HandlerObject handler;
    npy_uintp alignment;
} MyAlignedHandlerObject;

// similar boilerplate to above
PyType_Spec MyAlignedHandlerObject_HandlerTypeSpec = ...;
Responded on the mailing list.
    if (strncmp(real, "originally allocated", 20) != 0) {
        fprintf(stdout, "uh-oh, unmatched shift_free, "
                "no appropriate prefix\\n");
        /* Make gcc crash by calling free on the wrong address */
I assume by gcc you mean "the c runtime"?
fixed
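For readers without the full NEP example at hand: the prefix checked in the excerpt above is written by the matching allocation routine. A minimal sketch of that pairing follows; the names come from the excerpt, but the bodies here are a reconstruction, not the NEP's exact code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Over-allocate by 64 bytes, record a marker string in the hidden prefix,
 * and hand back the shifted pointer. */
static void *
shift_alloc(size_t size)
{
    char *real = malloc(size + 64);
    if (real == NULL) {
        return NULL;
    }
    snprintf(real, 64, "originally allocated %zu", size);
    return real + 64;
}

/* Shift back to the real start, verify the marker, then free.  A missing
 * marker means the pointer was not produced by shift_alloc. */
static void
shift_free(void *ptr, size_t size)
{
    (void)size;
    char *real = (char *)ptr - 64;
    if (strncmp(real, "originally allocated", 20) != 0) {
        fprintf(stdout, "uh-oh, unmatched shift_free, no appropriate prefix\n");
        /* the NEP example deliberately frees the wrong address here so the
         * C runtime aborts loudly on the mismatch */
        free(ptr);
    }
    else {
        free(real);
    }
}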
Added a mention of this in the NEP.
Related to PR gh-17582. Once this gets some review I will propose it on the mailing list.