Buffer Protocol Support for Custom DType Array #18442

RaulMurillo · 2021-02-18T18:00:20Z

Feature

I am defining a Python extension type, as well as adding built-in support for this new type to NumPy with an extension module that allows generating ndarrays of this custom dtype. It has a similar implementation to quaternion package (so I will use this type for illustration).

This new data type also implements the “buffer protocol”, so applying memoryview on an object of this type gets the expected output. However, if an array with objects of this type is created, applying memoryview raises a ValueError due to the fact that this type is not supported by the buffer protocol of the NumPy library.
Is there any way to obtain the memory view of a NumPy array containing custom extension types?

Similar issues #4983 and PR #15309 expose this problem for the datetime dtype, and also solution proposed in #4983 (comment) seems to be in the right direction.

However, it would be desirable to support buffer for any specialized types that implement themselves this protocol (in the same manner as numpy scalar types do). Finally, my purpose is to integrate this type in a Cython module to speed-up the code, but the typed memoryviews used in this language require the NumPy array buffer support when passing array arguments to Cython functions, so in this case solution from #4983 (comment) is not an option either.

Reproducing code example:

>>> import numpy as np
>>> import quaternion
>>> a = np.quaternion()
>>> memoryview(a)
<memory at 0x7f75c1350b80>
>>> memoryview(np.asarray(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot include dtype 'q' in a buffer

The buffer protocol is implemented in the quaternion class/type, as described here and in the official documentation.
When trying to get the memory view of an array containing objects of this class, ValueError is risen.

NumPy/Python version information:

NumPy Version: 1.20.1

Python: 3.8

The text was updated successfully, but these errors were encountered:

seberg · 2021-02-18T20:57:55Z

Just so you are aware of it, we/I are in the process of revamping the dtype API, which will lead to a new API (and eventually force you to move).

I am aware of this limitation, but have treated it as a backburner, partially because I am not sure how often it actually makes sense to export it! For example, given quaternion you can use arr.view("d,d,d,d") and pass that, since the buffer protocol is not able to represent "quaternion" (as a type, meaning "behaviour") in any case.

But, we definitely can move the creation of the buffer format string to be accessible to user DTypes in the future (I will only aim for after the new API, which is hopefully not far off now).

Another thing that should work (but I am not immediately sure how to write in cython), is to not use FORMAT compeltely (i.e. not request it in the buffer protocol). In that case, NumPy will export the buffer happily. (You probably need NumPy 1.20 for that, it was a recent fix, but I think it is included there.)

Just curious, but can you say what you are working on? I am trying to see what kind of DTypes will benefit from all my changes (and keep in mind what kind of holes in the API might crop up for them).

seberg · 2021-02-18T20:58:46Z

Forgot to include a link to the new DType NEPs (42 might be more interesting, but this is the entry point): https://numpy.org/neps/nep-0041-improved-dtype-support.html

RaulMurillo · 2021-02-18T23:06:30Z

I am working on an extension of posit datatype in NumPy. In the case you are unaware of this, it is an alternative format to floating-point arithmetic (see Beating Floating Point at its Own Game) which provides higher accuracy and seems to be really useful in areas such as numerical analysis or Deep Learning (see https://github.com/RaulMurillo/deep-pensieve).

Currently, there are few libraries that support this format (only in software emulation), most of them in C/C++. However, none of them provide an interface for array handling as NumPy does, and other characteristics of the library such as broadcasting and ufuncs could help in the development of applications and the format itself, IMO.

I referred to quaternion because it is a well-known extension package, and also this posit package is still WIP and not publicly available (but it will be in the near future, I will link it here when it is). I supposed your solution of using view as double DType, but I am afraid this does not perform as desired with operations like multiplication, since the multiplication for complex numbers is different as for vectors (in fact, this operation is redefined in the C files of the library).

A similar problem arises with posits. The posit types in this package are implemented in C language as a typedef uint8_t, uint16_t, uint32_t for the different precisions, but just as a compact storage format. All the arithmetic for posits is re-defined using bitwise operations on unsigned integers (I use SoftPosit library as C backend).
In fact, I was able to compile a Cython program using this view trick, but the operations were performed as unsigned integers, obtaining then unexpected results. There is no a NumPy/Python DType that behaves like posits in terms of arithmetic operations.

I think the API changes you mention could be quite beneficial for this package. During its development, I ran into many of the issues mentioned in the User Impact section of NEP 41.

seberg · 2021-02-19T00:25:21Z

Cool, nice to see interest, and this type of developments!

What I am not certain is how the buffer protocol can even be used for this reasonably. A way like Eric suggested in the datetime issue, might be something that can work? By adding a type-name, the "type information" is at least encoded in some form (but then you need to teach that to cython/consumers).
The buffer protocol doesn't support non-standard C types, so that NumPy cannot just add it anyway. I am not sure if that can be extended, I doubt it was attempted yet :).

So I would think you will always need some view or C casts, and custom code to do the translation?

RaulMurillo · 2021-02-19T11:47:00Z

A way like Eric suggested in the datetime issue, might be something that can work?

I tested generating a view of the array with that method:

dt = arr.dtype
    new_dt = np.dtype([
        ('__numpy_value', (np.void, dt.itemsize)),
        ('__numpy_dtype', [
            (str(dt), [])
        ])
    ])
arr.view(new_dt)

A memoryview can be created from this array, but then it is not possible to assign it to a typed memoryview:

cdef posit16[:, :] arr_view = arr.view(new_dt)

ValueError: Buffer dtype mismatch, expected 'posit16' but got end

Another problem is that this can not solve intrinsic calls to memoryview method, for example, when adding types to Cython functions. As an illustration, consider the following Cython code, where array_1 is a NumPy array of np.posit16 custom-defined DType.

def foo(posit16[:, :] array_1):
...

ValueError: cannot include dtype 'k' in a buffer

Here, it tries to create a MemoryView at function call, which probably involves the call to memoryview(array_1) (I guess this from the error message since it is arisen from the core of NumPy library).

The buffer protocol doesn't support non-standard C types, so that NumPy cannot just add it anyway. I am not sure if that can be extended, I doubt it was attempted yet :).

So I would think you will always need some view or C casts, and custom code to do the translation?

Since I implemented the buffer protocol in the simple DType (I can call memoryview(np.posit16(1.5)) and get a correct result, although it is completely useless since a posit number behaves like a scalar), I guess it would not be so difficult to implement some kind of recursion in the buffer protocol for ndarrays with custom DTypes that already implement it for single values such as scalars.
Might I be wrong, but this seems to me similar to how the ScalarArrayTypes implemented in NumPy deal with this protocol:

numpy/numpy/core/src/multiarray/scalartypes.c.src

Line 2449 in 740a8cf

@name@_getbuffer(PyObject *self, Py_buffer *view, int flags)

RaulMurillo · 2021-02-19T13:32:19Z

By the way, I forgot to mention that I made the posit types defined in Python (C-API level) available to Cython as an extern extension type as shown in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html?highlight=complex#external-extension-types
Passing a single posit value or operating with it is straight forward in Cython, but it fails with ndarrays. For this reason I think the problem could be solved if Numpy could generate a memoryview of this or any custom DType.

seberg · 2021-02-19T14:52:16Z

Well, datetime aside, the buffer protocol can represent our dtypes, so we don't have these problems.

The only problem really is the "format", and we could very much allow your dtype to effectively have a method dtype.buffer_format() which returns the correct buffer format string. I am not worried about that part, it is a straight forward addition (one we could work on right now, although it is private API at the moment).

My current guess is that the path of least resistance may be to really define such a special "format" string for your own dtype (NumPy could provide a guideline). And then teach Cython that posit16[:, :] requires a specially formatted "structured" format string. Something slightly weird like: 'T{T{2x:posit.posit16:}:__numpy_dtype__:}' Although honestly, there is no reason for __numpy_dtype__ at all. Since we can allow you to use whatever you like, with a note to please make sure that it is unique (e.g. fully qualified python object name).

RaulMurillo · 2021-02-22T17:22:00Z

I did some tests on the matter.

By slightly modifying the _buffer_format_string() function in NumPy's source

numpy/numpy/core/src/multiarray/buffer.c

Lines 418 to 422 in 0eb9f54

    
           default: 
        
               PyErr_Format(PyExc_ValueError, 
        
                            "cannot include dtype '%c' in a buffer", 
        
                            descr->type); 
        
               return -1;

as follows

default:
    if (descr->type == 'k'){
        if (_append_char(str, 'k') < 0) return -1;
    }
    else{
        PyErr_Format(PyExc_ValueError,
                    "cannot include dtype '%c' in a buffer",
                    descr->type);
        return -1;
    }

(here, 'k' is the descriptor type of the custom dtype) it is possible to generate a memoryview object from a ndarray of such type. One difference with Eric's method (#4983 (comment)) is that the "format" is now the corresponding descriptor type char, in contrast to the previous weird string:

>>> import numpy as np
>>> import posit
>>> arr = np.arange(5).astype(np.posit16)
>>> arr_view = memoryview(arr)
>>> arr_view
<memory at 0x7efff724ca00>
>>> arr_view.format
'k'

I guess this method is simpler and could be implemented for custom DTypes, assigning the corresponding descriptor type to the format of the view.

However, neither this method nor Eric's allows accessing the internal data of an object from a memoryview. The reason for this is that Python' memoryobject doesn't support it for other types:

>>> arr_view[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: memoryview: format k not supported

The same issue also occurs with, for example, np.float16, as it is not supported in pure Python.

Finally, regarding to Cython, we can see that using posits with this compiler (or any other custom DType) has the same problems that if we would like to use half-precision floats (see https://stackoverflow.com/questions/47421443/using-half-precision-numpy-floats-in-cython).
I am not sure if this has an easy solution, but probably this is more a Cython issue than a NumPy one. The solution mentioned in Stack Overflow for performing math calculations (that is the same problem I found with posits if using arr.view(np.uint16)) is not valid for this case, and it should be necessary another trick (for example, overriding unit16 operators) to make it work. However, if a pure-NumPy solution was found for np.float16 type, my intuition tells me that it would also be useful for posits or any other user-defined type.

seberg · 2021-02-22T17:47:45Z

@RaulMurillo sure, NumPy could allow you put whatever you like into the exported buffer format (you already can do that for your scalars anyway). And I don't mind ~~doing~~ allowing it, even though I personally would aim at the new API myself right now.

One issue with it is that your k could collide with someone else's k (single characters are really not that great at being unique).

So, before we make it convenient to create such clashes, it might be useful to have some loose formalism about it. Maybe that is just introducing a new character code that signals that the :name: should be used to identify the actual type. Or something like Eric's suggestion. Since this is a buffer protocol, which goes beyond NumPy, floating an idea like that on python-ideas or similar would likely make sense. There may already be solutions out there for all I know.

RaulMurillo · 2021-02-22T22:05:14Z

@seberg, completely agree that checking for descr->type == 'k' is not the most suitable way to implement the buffer protocol for these DTypes, just was a proof of concept, and a method like the one you mention should be preferable.

I think the question was answered, so I will close the issue. Please re-open if there is a need.

RaulMurillo changed the title ~~Buffer Protocol Support for Custom dtype Array~~ Buffer Protocol Support for Custom DType Array Feb 18, 2021

RaulMurillo closed this as completed Feb 22, 2021

ngoldbaum mentioned this issue Mar 30, 2023

ENH: Support new-style custom dtypes in the buffer protocol #23500

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Buffer Protocol Support for Custom DType Array #18442

Buffer Protocol Support for Custom DType Array #18442

RaulMurillo commented Feb 18, 2021

seberg commented Feb 18, 2021

seberg commented Feb 18, 2021

RaulMurillo commented Feb 18, 2021

seberg commented Feb 19, 2021

RaulMurillo commented Feb 19, 2021 •

edited

RaulMurillo commented Feb 19, 2021

seberg commented Feb 19, 2021

RaulMurillo commented Feb 22, 2021

seberg commented Feb 22, 2021 •

edited

RaulMurillo commented Feb 22, 2021

Buffer Protocol Support for Custom DType Array #18442

Buffer Protocol Support for Custom DType Array #18442

Comments

RaulMurillo commented Feb 18, 2021

Feature

Reproducing code example:

NumPy/Python version information:

seberg commented Feb 18, 2021

seberg commented Feb 18, 2021

RaulMurillo commented Feb 18, 2021

seberg commented Feb 19, 2021

RaulMurillo commented Feb 19, 2021 • edited

RaulMurillo commented Feb 19, 2021

seberg commented Feb 19, 2021

RaulMurillo commented Feb 22, 2021

seberg commented Feb 22, 2021 • edited

RaulMurillo commented Feb 22, 2021

RaulMurillo commented Feb 19, 2021 •

edited

seberg commented Feb 22, 2021 •

edited