Support vectorcall protocol #390

Merged (42 commits) on Mar 31, 2023

@fangerer (Contributor) commented Dec 15, 2022

Resolves #389.

This PR enables HPy extensions to implement the vectorcall protocol for HPy types.
To be clear, this is about providing the possibility to implement the vectorcall protocol on the receiver side (i.e. on the type) as in PEP 590.

In contrast to the C API, I tried to make the common case as easy as possible.
Implementing the vectorcall protocol in HPy is now (in the common case) very simple and looks like this:

HPyDef_VECTORCALL(SomeObject_vectorcall)
static HPy
SomeObject_vectorcall_impl(HPyContext *ctx, HPy callable, HPy *args, HPy_ssize_t nargsf, HPy kwnames)
{
    // ...
}

static HPyDef *SomeObject_defines[] = { &SomeObject_vectorcall, NULL };

That's it.

As you might notice, this means that every instance of the type automatically uses the same vectorcall function implementation. However, one important improvement of PEP 590 is that you can have different (possibly specialized) vectorcall function implementations per object.
To provide this flexibility, I've introduced the API function HPyVectorcall_Set, which sets an arbitrary vectorcall function on an object. For example:

HPyVectorcall_FUNCTION(Point_special_vectorcall)
static HPy
Point_special_vectorcall_impl(HPyContext *ctx, HPy callable, HPy *args, HPy_ssize_t nargsf, HPy kwnames)
{
    // ...
}

HPyDef_SLOT(Point_new, HPy_tp_new)
static HPy Point_new_impl(HPyContext *ctx, HPy cls, HPy *args, HPy_ssize_t nargs, HPy kw)
{
    // ...
    HPyVectorcall_Set(ctx, h_point, &Point_special_vectorcall);
    // ...
}

As indicated in the above example, HPyVectorcall_Set is meant to be used in the object constructor, but there is no restriction on when it can be used.
Using the macro HPyVectorcall_FUNCTION is encouraged since it generates the appropriate CPython trampoline and fills in the required HPyVectorcall struct.

Some more explanation

HPyDef_VECTORCALL(SYM) is an alias for HPyDef_SLOT(SYM, HPy_tp_vectorcall_default). So, this just defines an HPy-specific slot HPy_tp_vectorcall_default which is the default vectorcall function that will be used for all objects.
If ctx_Type_FromSpec recognizes this slot, the following happens behind the scenes:

  1. An additional field (of type vectorcallfunc) will be added (at the end) to the CPython object. This increases the basic size (by sizeof(vectorcallfunc)). It is appended to the object because otherwise the *_AsStruct calls would return an incorrect pointer.
  2. Flag Py_TPFLAGS_HAVE_VECTORCALL is set automatically
  3. Member __vectorcalloffset__ will be added to the C API slots automatically (using the offset of the hidden field).
  4. In case the type also has a custom slot HPy_tp_new, we assume that HPy_New will be used for allocation, which will take care of writing the default vectorcall function pointer to the object (see ctx_type.c:1408).
  5. In case HPy_tp_new is not provided, we wrap the inherited tp_new function with hpyobject_new (see ctx_type.c:265) which takes care of that.
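
The layout described in steps 1 and 3 can be modelled in plain C. This is only a sketch under stated assumptions: `PointData`, `call_fn`, `hidden_field_offset`, `extended_basicsize`, `alloc_with_hidden_field` and `read_hidden_field` are hypothetical names, not HPy or CPython API, and the model ignores the object header that a real CPython object carries. It shows why appending the function pointer leaves the user's struct pointer untouched while the basic size grows.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user struct; stands in for what an *_AsStruct call returns. */
typedef struct {
    long x;
    long y;
} PointData;

/* Hypothetical stand-in for CPython's vectorcallfunc pointer type. */
typedef int (*call_fn)(void *self);

/* Round the user's basic size up so the hidden field is pointer-aligned. */
static size_t hidden_field_offset(size_t basicsize)
{
    size_t align = sizeof(call_fn);
    return (basicsize + align - 1) / align * align;
}

/* New basic size: the original payload plus one function pointer at the end. */
static size_t extended_basicsize(size_t basicsize)
{
    return hidden_field_offset(basicsize) + sizeof(call_fn);
}

/* Allocate a zeroed object and write the default call function into the
   hidden field, as the allocation path in steps 4 and 5 is described to do. */
static void *alloc_with_hidden_field(size_t basicsize, call_fn default_fn)
{
    unsigned char *obj = calloc(1, extended_basicsize(basicsize));
    if (obj == NULL)
        return NULL;
    memcpy(obj + hidden_field_offset(basicsize), &default_fn, sizeof default_fn);
    return obj;
}

/* Read the hidden field back through its offset, which is the role that
   __vectorcalloffset__ plays for the interpreter. */
static call_fn read_hidden_field(const void *obj, size_t basicsize)
{
    call_fn fn;
    memcpy(&fn, (const unsigned char *)obj + hidden_field_offset(basicsize), sizeof fn);
    return fn;
}

/* Example default call function: uses the user data at the start of the object. */
static int default_call(void *self)
{
    const PointData *p = self;
    return (int)(p->x + p->y);
}
```

Because the pointer is stored after the payload, code that casts the object to `PointData *` keeps working unchanged. That is also why the scheme does not extend to var objects, whose trailing items would occupy the same region.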

Restrictions

  • Python <3.8 just doesn't know about the vectorcall protocol. So, using this in HPy and running on Python <3.8 will just not use the vectorcall function implementation. However, we will still install the corresponding tp_call function and delegate to the vectorcall impl.
  • Because of point 1 above, it is not possible to use HPyDef_VECTORCALL with var objects (i.e. where itemsize > 0). Right now, this is not a big restriction since HPy does not really support them. However, it is still possible to do it manually, similar to how it is done in the C API (add a field HPyVectorcall vectorcall to the type's struct, define member __vectorcalloffset__, set flag HPy_TPFLAGS_HAVE_VECTORCALL, ...). This is also covered by a test.

Misc

From a performance point of view, object creation should not be significantly slower (compared to CPython's vectorcall API) because if (1) the vectorcall protocol is not implemented, we just do an additional type flag check, and if (2) the protocol is implemented, we might do an additional write to the hidden field in case the user overwrites the default function.

I have not written documentation for this yet. I will do so in a follow-up PR.

@fangerer (Contributor, Author) commented

@hodgestar left some comments in the IRC channel. I'm posting them here for documentation:

Every time I look at the old C API for it, I go "arg" a lot. It feels more like a performance hack that got exposed than an API. However, I'm also not sure what to do about it.

@fangerer Do you have an important / good example use case for the per-instance vector call? What would prevent people who want per-instance calls from just adding their own C function pointer to their struct and doing it themselves?

It feels like we know that a better way is to have our "argument clinic-esque" API for JITs and similar, but that is a lot of work. :/

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

Would it be possible to remove HPy_VECTORCALL_ARGUMENTS_OFFSET from our API and, for example, make a new rule that one can always overwrite args[0] (i.e. pass the actual array instead of a pointer to the second element)?

@fangerer (Contributor, Author) commented

@hodgestar: Here are my answers:

Do you have an important / good example use case for the per-instance vector call?

I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created."
So, the real world example is "calls to classes"

Would it be possible to remove HPy_VECTORCALL_ARGUMENTS_OFFSET from our API and, for example, make a new rule that one can always overwrite args[0] (i.e. pass the actual array instead of a pointer to the second element)?

Sure and sounds good to me since it makes it very clear.
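
For context, the CPython convention that this exchange proposes to simplify can be modelled in plain C. In CPython, PY_VECTORCALL_ARGUMENTS_OFFSET is the top bit of nargsf, and when it is set the callee may temporarily overwrite args[-1]. The sketch below mirrors that idea with hypothetical names (`ARGS_OFFSET_FLAG`, `nargs_from_nargsf`, `may_prepend`, `call_with_self`, `sum_args`, `MAX_ARGS`) and plain `long` values instead of object handles, so it is an illustration rather than the HPy or CPython implementation.

```c
#include <stddef.h>

/* Model of the flag: the most significant bit of the size_t argument count. */
#define ARGS_OFFSET_FLAG ((size_t)1 << (8 * sizeof(size_t) - 1))

/* Assumed upper bound for the copying fallback in this sketch. */
#define MAX_ARGS 8

/* Strip the flag to obtain the number of positional arguments. */
static size_t nargs_from_nargsf(size_t nargsf)
{
    return nargsf & ~ARGS_OFFSET_FLAG;
}

/* Nonzero if the caller allows the callee to temporarily overwrite args[-1]. */
static int may_prepend(size_t nargsf)
{
    return (nargsf & ARGS_OFFSET_FLAG) != 0;
}

/* Hypothetical target function: sums all arguments it receives. */
static long sum_args(const long *args, size_t nargs)
{
    long total = 0;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* Bound-method-style forwarding: call sum_args with `self` prepended. */
static long call_with_self(long self, long *args, size_t nargsf)
{
    size_t nargs = nargs_from_nargsf(nargsf);
    if (may_prepend(nargsf)) {
        /* Fast path: reuse the slot in front of the arguments, then restore it. */
        long saved = args[-1];
        args[-1] = self;
        long result = sum_args(args - 1, nargs + 1);
        args[-1] = saved;
        return result;
    }
    /* Slow path: the slot is not ours to touch, so copy into a new array. */
    if (nargs > MAX_ARGS)
        return -1;
    long copy[MAX_ARGS + 1];
    copy[0] = self;
    for (size_t i = 0; i < nargs; i++)
        copy[i + 1] = args[i];
    return sum_args(copy, nargs + 1);
}
```

Dropping the flag, as suggested in the question, removes the bit masking and the two code paths from every callee, at the cost of requiring a convention that both caller and callee can rely on.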

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

I'm not sure if we even need to take caution concerning compatibility with arg clinic. I think there are two aspects:

  1. Assume an extension author already implements the vectorcall protocol using this PR and then we introduce arg clinic. Compatibility would mean that the extension doesn't need to be migrated. I think that's easily possible.
  2. We want to (internally) use the arg clinic calling machinery to call vectorcall functions. IMO, that is also possible since we just need to implement arg clinic in a way that it can call the vectorcall signature.

Or did I misunderstand your comment?

@steve-s (Contributor) commented Dec 16, 2022

Do you have an important / good example use case for the per-instance vector call?

I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created."
So, the real world example is "calls to classes"

The author of nanobind asks for this in CPython stable ABI in here: https://discuss.python.org/t/ideas-for-forward-compatible-and-fast-extension-libraries-in-python-3-12. IIRC he mentioned somewhere that nanobind uses/can use this for all functions (my possibly wrong understanding: every function is a separate object with vectorcall).

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

I think there is one more thing to vectorcall (and again, maybe I just misunderstand it :-)). Citing from PEP-590:

Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class cls, at least one intermediate object is created for each call in the sequence type.__call__, cls.__new__, cls.__init__.

@wjakob commented Jan 4, 2023

I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data.

I can give one example from nanobind: its function object dispatches calls to C++ using either a simple implementation (only positional arguments supported) or a complex implementation (with handling of default values, keyword arguments, variable argument count, etc.) that is significantly slower. When a new function object is created, it sets the appropriate vector call dispatcher based on the properties of the function.
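
The pattern described here, choosing a call dispatcher once at object creation, can be sketched in plain C. This is not nanobind's code: `func_obj`, `dispatch_simple`, `dispatch_complex` and `func_obj_init` are hypothetical names, arguments are plain `long` values, and "complex" handling is reduced to filling in default values for missing trailing arguments.

```c
#include <stddef.h>

struct func_obj;

/* Per-object call function, playing the role of a per-instance vectorcall. */
typedef long (*dispatch_fn)(const struct func_obj *self,
                            const long *args, size_t nargs);

typedef struct func_obj {
    dispatch_fn dispatch;   /* chosen once, when the object is created */
    size_t arity;           /* number of parameters the function takes */
    const long *defaults;   /* NULL, or `arity` default values */
} func_obj;

/* Fast path: requires exactly `arity` positional arguments. */
static long dispatch_simple(const func_obj *self, const long *args, size_t nargs)
{
    if (nargs != self->arity)
        return -1;
    long total = 0;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* Slow path: fills missing trailing arguments from the defaults table. */
static long dispatch_complex(const func_obj *self, const long *args, size_t nargs)
{
    if (nargs > self->arity)
        return -1;
    long total = 0;
    for (size_t i = 0; i < self->arity; i++)
        total += (i < nargs) ? args[i] : self->defaults[i];
    return total;
}

/* Constructor: pick the dispatcher based on the function's properties. */
static void func_obj_init(func_obj *f, size_t arity, const long *defaults)
{
    f->arity = arity;
    f->defaults = defaults;
    f->dispatch = (defaults != NULL) ? dispatch_complex : dispatch_simple;
}
```

With one function pointer per class, every call would have to branch on the object's properties first. Storing the pointer per object moves that decision to construction time.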

@mattip (Contributor) commented Jan 31, 2023

This has conflicts. Does it replace #251?

@fangerer (Contributor, Author) commented

Does it replace #251?

@mattip: No, this PR is basically about supporting something like Py_tp_vectorcall_offset in the type spec (in other words: this is about how to implement the callee). PR #251 is about how to call something (like HPy_CallTupleDict).

As far as I got from the discussions here and in the dev calls: we are still not sure if this is the way to go.
@hodgestar argued that the vectorcall protocol mostly looks like an exposed implementation detail but with a little bit of extra functionality (in particular, the fact that you can have different call function implementations per object).

I still think the changes in this PR make sense for the following reasons:

  • In CPython, you need to specify the tp_vectorcall_offset that points into the C struct of the type and then the constructor needs to set the function. This is way simpler here in HPy since we introduced a default function.
  • A major difference is the calling convention: the keywords calling convention already takes an array of arguments but the keywords are still in a dict. The vectorcall calling convention would pass any args in a C array and then provide only the keyword names in a tuple (or maybe also in a C array).
  • As mentioned, the vectorcall protocol allows different callee function implementations per instance.
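
The calling-convention difference in the second point can be illustrated with a small model. In the vectorcall layout defined by PEP 590, the positional arguments come first in the array, the keyword argument values follow, and the keyword names are passed separately in the same order. The sketch below uses plain `long` values and C strings in place of object handles and a names tuple; `find_keyword` is a hypothetical helper, not an HPy or CPython function.

```c
#include <stddef.h>
#include <string.h>

/* Look up a keyword argument value by name in a vectorcall-style layout:
     args[0 .. nargs-1]          positional arguments
     args[nargs .. nargs+nkw-1]  keyword values, in the order of kwnames
   Returns a pointer to the value, or NULL if the name is not present. */
static const long *find_keyword(const long *args, size_t nargs,
                                const char *const *kwnames, size_t nkw,
                                const char *name)
{
    for (size_t i = 0; i < nkw; i++) {
        if (strcmp(kwnames[i], name) == 0)
            return &args[nargs + i];
    }
    return NULL;
}
```

For a call such as `f(1, 2, width=30, height=40)`, the array would hold `{1, 2, 30, 40}` with `nargs == 2` and the names `{"width", "height"}`. No dictionary has to be built by the caller, which is the cost the dict-based keywords convention pays on every call.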

Anyway, before merging this, I would like to have more feedback, in particular from @antocuni.

This has conflicts.

They would be easy to resolve if we decide to merge this.

@antocuni (Collaborator) left a comment

In general it looks very good to me, thanks for doing this work!
And thanks for the many and detailed comments, they really helped to understand what's going on.
I think there are some details to be discussed though; I raised some concerns also in the inline comments below.

The biggest question IMHO is: do we really need separate slots for HPy_tp_call and HPy_tp_vectorcall_default? The two signatures are already very similar: if I understand correctly, the only difference is that HPyFunc_keywords takes a full dictionary of keywords, while HPyFunc_vectorcallfunc takes a list of keyword names.

We could "tweak" the existing HPyFunc_KEYWORDS calling convention to be compatible with vectorcall: that way, the end user would just implement HPy_tp_call, and we would automatically use tp_vectorcall under the hood.
If we go in that direction, the only "real" difference between supporting vectorcall or not is the ability to set a per-object function.

hpy/devel/include/hpy/cpython/hpyfunc_trampolines.h (outdated)
typedef struct {
    cpy_vectorcallfunc cpy_trampoline;
    HPyFunc_vectorcallfunc impl;
} HPyVectorcall;
(Collaborator) commented

Why do you need a special struct for this instead of reusing HPySlot?

(Contributor, Author) commented

Because in the manual case, you need to include this struct (I've renamed it to HPyCallFunction) in the type's struct like this:

typedef struct {
    int member0;
    int member1;
    // ...
    HPyCallFunction callfunc;
    // ...
} MyObject;  // placeholder type name, not from the PR

In this case, the struct shouldn't be larger than necessary, which would be the case if we used HPyDef. For consistency, I'm also using HPyCallFunction * in HPyVectorcall_Set (the name of this function is subject to change). We could use a slot definition there, but that seems more confusing.

(Collaborator) commented

ok, got it and it makes a lot of sense.
The only drawback which I see is that by doing this we are wasting a pointer "forever": HPyCallFunction contains a pointer to the cpy_trampoline, which is needed right now to support CPython, but it's not needed at all by alternative implementations and it might not be needed even by CPython itself in the future (in case they decide to support hpy natively).

Note that this is slightly different from the HPySlot case, because HPySlots exist in a fixed number (and are stored in static data), while this imposes an extra cost on the runtime objects.

However, I don't really know how to solve this issue, so I suppose it is fine to keep it as is for now and see whether we can improve it later.
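
The per-object cost raised in this comment can be made concrete with a small size comparison. The structs below are hypothetical models (`call_function_two`, `call_function_one`, `object_two`, `object_one` are not HPy types); they only mirror the two-pointer shape shown in the snippet above and a one-pointer alternative.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two function pointer types. */
typedef long (*impl_fn)(void *ctx, void *self);
typedef long (*trampoline_fn)(void *self);

/* Two-pointer shape: CPython trampoline plus the actual implementation. */
typedef struct {
    trampoline_fn cpy_trampoline;
    impl_fn impl;
} call_function_two;

/* One-pointer shape: what an implementation without trampolines would need. */
typedef struct {
    impl_fn impl;
} call_function_one;

/* The same user object embedding either variant. */
typedef struct {
    int member0;
    int member1;
    call_function_two callfunc;
} object_two;

typedef struct {
    int member0;
    int member1;
    call_function_one callfunc;
} object_one;

/* Extra bytes paid by every instance for carrying the trampoline pointer. */
static size_t per_object_overhead(void)
{
    return sizeof(object_two) - sizeof(object_one);
}
```

On common platforms the difference is one pointer per instance, which is small for a single object but is paid by every object of the type for its whole lifetime.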

(Contributor, Author) commented

Right, that's a good point. We don't really need both function pointers in general.
I'll think about how we can reduce that to a single sizeof(void *) slot but yes, let's merge this PR and do it in a follow-up (since the PR is already pretty large).

@fangerer (Contributor, Author) commented

Not done yet, but I pushed in case people are interested in the progress, and to trigger the tests.

@fangerer force-pushed the fa/vectorcall branch 2 times, most recently from a9b8453 to 05cb57c, on March 20, 2023 16:21
@fangerer (Contributor, Author) commented Mar 21, 2023

Big update on the PR. I've addressed most of @antocuni's points. Here is a summary:

  • HPy provides just one calling protocol (by defining slot HPy_tp_call) and this is mapped to CPython's vectorcall protocol. It is not possible to define the legacy tp_call protocol in an HPyType_Spec.
  • I've aligned function signatures as far as possible. In particular, HPyFunc_keywords now is:
    typedef HPy (*HPyFunc_keywords)(HPyContext *ctx, HPy self, const HPy *args, size_t nargs, HPy kwnames);
    
    Therefore, we are now also implementing HPyFunc_KEYWORDS with METH_FASTCALL | METH_KEYWORDS to avoid unnecessary and slow argument conversion (see C API function signature _PyCFunctionFastWithKeywords). We already use METH_FASTCALL for HPyFunc_VARARGS.
  • I've removed almost every reference to or mention of vectorcall (this should in particular make @hodgestar happy 😄); for us, the vectorcall protocol is just an implementation detail.
  • I've introduced helper function HPyHelpers_PackArgsAndKeywords to convert from fastcall/vectorcall calling convention to the legacy convention (with args tuple and keywords dict).
  • I've written documentation for all new features. In particular, I added a section to the Porting Guide that should explain how to use the HPy calling protocol.
  • I could remove any special casing for Python 3.7 since we dropped support for it already.
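
The conversion that a helper like HPyHelpers_PackArgsAndKeywords performs, going from the flat fastcall/vectorcall layout to the legacy "positional args plus keywords mapping" shape, can be modelled as follows. The real helper's signature is not shown in this thread, so everything here is an assumption made for illustration: `legacy_args`, `kw_entry`, `pack_args_and_keywords` and `MAX_KEYWORDS` are hypothetical, values are plain `long`, and the "dict" is a small array of key/value pairs.

```c
#include <stddef.h>

/* Assumed capacity of the keyword table in this sketch. */
#define MAX_KEYWORDS 8

/* One key/value pair of the modelled keywords dictionary. */
typedef struct {
    const char *key;
    long value;
} kw_entry;

/* Legacy-style arguments: a positional sequence plus a keywords mapping. */
typedef struct {
    const long *positional;
    size_t npositional;
    kw_entry keywords[MAX_KEYWORDS];
    size_t nkeywords;
} legacy_args;

/* Split a flat argument array into positional arguments and key/value pairs.
   args[0 .. nargs-1] are positional; args[nargs + i] belongs to kwnames[i].
   Returns 0 on success, -1 if there are more keywords than the model holds. */
static int pack_args_and_keywords(const long *args, size_t nargs,
                                  const char *const *kwnames, size_t nkw,
                                  legacy_args *out)
{
    if (nkw > MAX_KEYWORDS)
        return -1;
    out->positional = args;
    out->npositional = nargs;
    out->nkeywords = nkw;
    for (size_t i = 0; i < nkw; i++) {
        out->keywords[i].key = kwnames[i];
        out->keywords[i].value = args[nargs + i];
    }
    return 0;
}
```

The packing step is the work that the new HPyFunc_keywords signature avoids in the common case. It is only paid by code that explicitly asks for the legacy shape.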

Some other remarks:

  • I've replaced argument HPy_ssize_t nargs with size_t nargs, mostly because (a) I think a negative count doesn't make sense, and (b) to be compatible with CPython's vectorcall signature.
  • We don't use the PY_VECTORCALL_ARGUMENTS_OFFSET flag. nargs will always just be the positional argument count.
  • I DID NOT align function signatures of HPy_tp_new and HPy_tp_init with HPyFunc_keywords because I feared that this could have a significant performance impact since CPython is calling tp_new directly and unpacking the keywords dict is an expensive operation. I plan to do some measurements. Anyway, I would like to do this in a separate PR.
  • As described in the porting guide, there is still the possibility to manually define the __vectorcalloffset__. Should we also intercept that and use a different name?

@fangerer requested a review from antocuni on March 21, 2023 17:29
@antocuni (Collaborator) left a comment

Very good job! LGTM :)

@fangerer merged commit 3da37d6 into hpyproject:master on Mar 31, 2023
@fangerer deleted the fa/vectorcall branch on March 31, 2023 14:00
@fangerer mentioned this pull request on Apr 3, 2023