Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer Protocol Support for Custom DType Array #18442

Closed
RaulMurillo opened this issue Feb 18, 2021 · 10 comments
Closed

Buffer Protocol Support for Custom DType Array #18442

RaulMurillo opened this issue Feb 18, 2021 · 10 comments

Comments

@RaulMurillo
Copy link

Feature

I am defining a Python extension type, as well as adding built-in support for this new type to NumPy with an extension module that allows generating ndarrays of this custom dtype. It has a similar implementation to quaternion package (so I will use this type for illustration).

This new data type also implements the “buffer protocol”, so applying memoryview on an object of this type gets the expected output. However, if an array with objects of this type is created, applying memoryview raises a ValueError due to the fact that this type is not supported by the buffer protocol of the NumPy library.
Is there any way to obtain the memory view of a NumPy array containing custom extension types?

Similar issues #4983 and PR #15309 expose this problem for the datetime dtype, and also solution proposed in #4983 (comment) seems to be in the right direction.

However, it would be desirable to support buffer for any specialized types that implement themselves this protocol (in the same manner as numpy scalar types do). Finally, my purpose is to integrate this type in a Cython module to speed-up the code, but the typed memoryviews used in this language require the NumPy array buffer support when passing array arguments to Cython functions, so in this case solution from #4983 (comment) is not an option either.

Reproducing code example:

>>> import numpy as np
>>> import quaternion
>>> a = np.quaternion()
>>> memoryview(a)
<memory at 0x7f75c1350b80>
>>> memoryview(np.asarray(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot include dtype 'q' in a buffer

The buffer protocol is implemented in the quaternion class/type, as described here and in the official documentation.
When trying to get the memory view of an array containing objects of this class, ValueError is risen.

NumPy/Python version information:

NumPy Version: 1.20.1

Python: 3.8

@seberg
Copy link
Member

seberg commented Feb 18, 2021

Just so you are aware of it, we/I are in the process of revamping the dtype API, which will lead to a new API (and eventually force you to move).

I am aware of this limitation, but have treated it as a backburner, partially because I am not sure how often it actually makes sense to export it! For example, given quaternion you can use arr.view("d,d,d,d") and pass that, since the buffer protocol is not able to represent "quaternion" (as a type, meaning "behaviour") in any case.

But, we definitely can move the creation of the buffer format string to be accessible to user DTypes in the future (I will only aim for after the new API, which is hopefully not far off now).

Another thing that should work (but I am not immediately sure how to write in cython), is to not use FORMAT compeltely (i.e. not request it in the buffer protocol). In that case, NumPy will export the buffer happily. (You probably need NumPy 1.20 for that, it was a recent fix, but I think it is included there.)

Just curious, but can you say what you are working on? I am trying to see what kind of DTypes will benefit from all my changes (and keep in mind what kind of holes in the API might crop up for them).

@seberg
Copy link
Member

seberg commented Feb 18, 2021

Forgot to include a link to the new DType NEPs (42 might be more interesting, but this is the entry point): https://numpy.org/neps/nep-0041-improved-dtype-support.html

@RaulMurillo
Copy link
Author

I am working on an extension of posit datatype in NumPy. In the case you are unaware of this, it is an alternative format to floating-point arithmetic (see Beating Floating Point at its Own Game) which provides higher accuracy and seems to be really useful in areas such as numerical analysis or Deep Learning (see https://github.com/RaulMurillo/deep-pensieve).

Currently, there are few libraries that support this format (only in software emulation), most of them in C/C++. However, none of them provide an interface for array handling as NumPy does, and other characteristics of the library such as broadcasting and ufuncs could help in the development of applications and the format itself, IMO.

I referred to quaternion because it is a well-known extension package, and also this posit package is still WIP and not publicly available (but it will be in the near future, I will link it here when it is). I supposed your solution of using view as double DType, but I am afraid this does not perform as desired with operations like multiplication, since the multiplication for complex numbers is different as for vectors (in fact, this operation is redefined in the C files of the library).

A similar problem arises with posits. The posit types in this package are implemented in C language as a typedef uint8_t, uint16_t, uint32_t for the different precisions, but just as a compact storage format. All the arithmetic for posits is re-defined using bitwise operations on unsigned integers (I use SoftPosit library as C backend).
In fact, I was able to compile a Cython program using this view trick, but the operations were performed as unsigned integers, obtaining then unexpected results. There is no a NumPy/Python DType that behaves like posits in terms of arithmetic operations.

I think the API changes you mention could be quite beneficial for this package. During its development, I ran into many of the issues mentioned in the User Impact section of NEP 41.

@RaulMurillo RaulMurillo changed the title Buffer Protocol Support for Custom dtype Array Buffer Protocol Support for Custom DType Array Feb 18, 2021
@seberg
Copy link
Member

seberg commented Feb 19, 2021

Cool, nice to see interest, and this type of developments!

What I am not certain is how the buffer protocol can even be used for this reasonably. A way like Eric suggested in the datetime issue, might be something that can work? By adding a type-name, the "type information" is at least encoded in some form (but then you need to teach that to cython/consumers).
The buffer protocol doesn't support non-standard C types, so that NumPy cannot just add it anyway. I am not sure if that can be extended, I doubt it was attempted yet :).

So I would think you will always need some view or C casts, and custom code to do the translation?

@RaulMurillo
Copy link
Author

RaulMurillo commented Feb 19, 2021

A way like Eric suggested in the datetime issue, might be something that can work?

I tested generating a view of the array with that method:

dt = arr.dtype
    new_dt = np.dtype([
        ('__numpy_value', (np.void, dt.itemsize)),
        ('__numpy_dtype', [
            (str(dt), [])
        ])
    ])
arr.view(new_dt)

A memoryview can be created from this array, but then it is not possible to assign it to a typed memoryview:

cdef posit16[:, :] arr_view = arr.view(new_dt)

ValueError: Buffer dtype mismatch, expected 'posit16' but got end

Another problem is that this can not solve intrinsic calls to memoryview method, for example, when adding types to Cython functions. As an illustration, consider the following Cython code, where array_1 is a NumPy array of np.posit16 custom-defined DType.

def foo(posit16[:, :] array_1):
...

ValueError: cannot include dtype 'k' in a buffer

Here, it tries to create a MemoryView at function call, which probably involves the call to memoryview(array_1) (I guess this from the error message since it is arisen from the core of NumPy library).

The buffer protocol doesn't support non-standard C types, so that NumPy cannot just add it anyway. I am not sure if that can be extended, I doubt it was attempted yet :).

So I would think you will always need some view or C casts, and custom code to do the translation?

Since I implemented the buffer protocol in the simple DType (I can call memoryview(np.posit16(1.5)) and get a correct result, although it is completely useless since a posit number behaves like a scalar), I guess it would not be so difficult to implement some kind of recursion in the buffer protocol for ndarrays with custom DTypes that already implement it for single values such as scalars.
Might I be wrong, but this seems to me similar to how the ScalarArrayTypes implemented in NumPy deal with this protocol:

@name@_getbuffer(PyObject *self, Py_buffer *view, int flags)

@RaulMurillo
Copy link
Author

By the way, I forgot to mention that I made the posit types defined in Python (C-API level) available to Cython as an extern extension type as shown in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html?highlight=complex#external-extension-types
Passing a single posit value or operating with it is straight forward in Cython, but it fails with ndarrays. For this reason I think the problem could be solved if Numpy could generate a memoryview of this or any custom DType.

@seberg
Copy link
Member

seberg commented Feb 19, 2021

Well, datetime aside, the buffer protocol can represent our dtypes, so we don't have these problems.

The only problem really is the "format", and we could very much allow your dtype to effectively have a method dtype.buffer_format() which returns the correct buffer format string. I am not worried about that part, it is a straight forward addition (one we could work on right now, although it is private API at the moment).

My current guess is that the path of least resistance may be to really define such a special "format" string for your own dtype (NumPy could provide a guideline). And then teach Cython that posit16[:, :] requires a specially formatted "structured" format string. Something slightly weird like: 'T{T{2x:posit.posit16:}:__numpy_dtype__:}' Although honestly, there is no reason for __numpy_dtype__ at all. Since we can allow you to use whatever you like, with a note to please make sure that it is unique (e.g. fully qualified python object name).

@RaulMurillo
Copy link
Author

I did some tests on the matter.

By slightly modifying the _buffer_format_string() function in NumPy's source

default:
PyErr_Format(PyExc_ValueError,
"cannot include dtype '%c' in a buffer",
descr->type);
return -1;
as follows

default:
    if (descr->type == 'k'){
        if (_append_char(str, 'k') < 0) return -1;
    }
    else{
        PyErr_Format(PyExc_ValueError,
                    "cannot include dtype '%c' in a buffer",
                    descr->type);
        return -1;
    }

(here, 'k' is the descriptor type of the custom dtype) it is possible to generate a memoryview object from a ndarray of such type. One difference with Eric's method (#4983 (comment)) is that the "format" is now the corresponding descriptor type char, in contrast to the previous weird string:

>>> import numpy as np
>>> import posit
>>> arr = np.arange(5).astype(np.posit16)
>>> arr_view = memoryview(arr)
>>> arr_view
<memory at 0x7efff724ca00>
>>> arr_view.format
'k'

I guess this method is simpler and could be implemented for custom DTypes, assigning the corresponding descriptor type to the format of the view.

However, neither this method nor Eric's allows accessing the internal data of an object from a memoryview. The reason for this is that Python' memoryobject doesn't support it for other types:

>>> arr_view[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: memoryview: format k not supported

The same issue also occurs with, for example, np.float16, as it is not supported in pure Python.

Finally, regarding to Cython, we can see that using posits with this compiler (or any other custom DType) has the same problems that if we would like to use half-precision floats (see https://stackoverflow.com/questions/47421443/using-half-precision-numpy-floats-in-cython).
I am not sure if this has an easy solution, but probably this is more a Cython issue than a NumPy one. The solution mentioned in Stack Overflow for performing math calculations (that is the same problem I found with posits if using arr.view(np.uint16)) is not valid for this case, and it should be necessary another trick (for example, overriding unit16 operators) to make it work. However, if a pure-NumPy solution was found for np.float16 type, my intuition tells me that it would also be useful for posits or any other user-defined type.

@seberg
Copy link
Member

seberg commented Feb 22, 2021

@RaulMurillo sure, NumPy could allow you put whatever you like into the exported buffer format (you already can do that for your scalars anyway). And I don't mind doing allowing it, even though I personally would aim at the new API myself right now.

One issue with it is that your k could collide with someone else's k (single characters are really not that great at being unique).

So, before we make it convenient to create such clashes, it might be useful to have some loose formalism about it. Maybe that is just introducing a new character code that signals that the :name: should be used to identify the actual type. Or something like Eric's suggestion. Since this is a buffer protocol, which goes beyond NumPy, floating an idea like that on python-ideas or similar would likely make sense. There may already be solutions out there for all I know.

@RaulMurillo
Copy link
Author

@seberg, completely agree that checking for descr->type == 'k' is not the most suitable way to implement the buffer protocol for these DTypes, just was a proof of concept, and a method like the one you mention should be preferable.

I think the question was answered, so I will close the issue. Please re-open if there is a need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants