
Stack allocated HPyContext


We first discussed this in our HPy dev call on July 8th, 2021; the discussion hasn't been intense since then but the topic showed up here and there. We recently had more discussion at our Berlin meetup, so I think it's a good time to summarize everything we talked about so far.

Introduction and Motivation

Since the HPy context is NOT opaque and is crucial for backwards compatibility, it is very important to design it well and to be aware that the design decisions made now will be there forever (or at least for a long time). We still make breaking changes to HPy since we consider it to be in an early phase and there aren't many packages yet that would need to be fixed by such changes. However, I think this is slowly changing and we need to agree on the final context structure ASAP.

Right now, HPyContext is a big (generated) structure looking like this (see also: autogen_ctx.h):

struct _HPyContext_s {
    const char *name; // used just to make debugging and testing easier
    void *_private;   // used by implementations to store custom data
    int ctx_version;
    // roughly 80 built-in handles; more are being added
    HPy h_None;
    // ... 

    // roughly 150 context functions; more are being added
    HPy (*ctx_Module_Create)(HPyContext *ctx, HPyModuleDef *def);
    // ... 
};

HPyContext is right now already quite large and it will grow further since we keep adding functionality to HPy.

As far as I know, all Python implementations supporting HPy are currently just allocating one universal context and one debug context in (native) heap memory. For this reason, the size of HPyContext is currently no problem.

However, if we want to allocate HPyContext on the stack, its size becomes a concern.

Why would one want to stack-allocate the context? First of all, HPy only guarantees that the received HPyContext * (and its contents) is valid for the current call. This is because we wanted to keep the possibility to provide per-call data in the context. Having this opportunity is IMO very powerful; I will explain that in detail in a later section. The common way to provide per-call data is to allocate the data structures on the stack for every call. Since calls may be very frequent and are most certainly crucial for performance, we can only have a stack-allocated HPyContext if the structure is reasonably small (let's say, a few words).
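
To make this contract concrete, here is a minimal sketch (hypothetical extension code, not taken from an actual package) of what extensions must not do with the received context:

/* WRONG: the context must not outlive the call it was passed to */
static HPyContext *cached_ctx;

static HPy my_func_impl(HPyContext *ctx, HPy self)
{
    cached_ctx = ctx;                  /* invalid: ctx may be stack-allocated per call */
    return HPy_Dup(ctx, ctx->h_None);  /* fine: ctx is only used within this call */
}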

Proposal

The proposed structure for HPyContext, preparing it for stack allocation, is inspired by JNIEnv (see jni.h). JNIEnv is very minimal and basically just contains a pointer to the function table. Hence, the idea is to move all members that will be the same for each call into separate data structures:

struct _HPyContext_s {
    const struct _HPyFunctionTable_s *fun_table;
    const struct _HPyBuiltinHandleTable_s *handles;
    void *_private;   // used by implementations to store custom data
};

/* information about the context that is rarely used and mainly for debugging purposes */
struct _HPyContextInfo_s {
    const char *name;
    int ctx_version;
};

/* table of handles to built-in objects */
struct _HPyBuiltinHandleTable_s {
    // roughly 80 built-in handles; more are being added
    HPy h_None;
    // ... 
};

/* the context function table */
struct _HPyFunctionTable_s {
    // roughly 150 context functions; more are being added
    HPy (*ctx_Module_Create)(HPyContext *ctx, HPyModuleDef *def);
    // ... 
};
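
To illustrate the difference, here is a sketch of how member access changes under the proposed layout, using the members shown above (the accessor macros at the end are hypothetical, not existing HPy API):

HPy none  = ctx->h_None;                                  /* current layout: one load   */
HPy none2 = ctx->handles->h_None;                         /* proposed layout: two loads */

HPy mod  = ctx->ctx_Module_Create(ctx, def);              /* current layout  */
HPy mod2 = ctx->fun_table->ctx_Module_Create(ctx, def);   /* proposed layout */

/* The extra hop could be hidden behind accessor macros so that extension
   code does not have to change if the layout changes again: */
#define HPy_CTX_HANDLE(ctx, name)  ((ctx)->handles->name)
#define HPy_CTX_FUNC(ctx, name)    ((ctx)->fun_table->name)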

Requirements for HPyContext

  1. Provide call-specific data
  2. Provide thread-local data
  3. Support sub-interpreters
  4. Carry data to be able to do upcall specialization
  5. Provide liveness scopes for handles
  6. Call C functions of other HPy extensions
  7. Fast access (ideally just one indirection) to context members (mostly handles and functions)
  8. Low overhead for preparing the HPyContext for a downcall

Discussion

Stack-allocated context

Referring to the list of requirements, a stack-allocated HPyContext is able to fulfill many of them. Providing call-specific data is easy since we can just allocate a fresh context on the stack for every downcall (see the sketch below). To ensure that preparing the context for a downcall has low overhead, the handle and function tables as well as the context meta info are simply shared. Since the context already provides call-specific data, that data can also be thread-local. Sub-interpreters are supported as well. Handle scopes are also possible: the context may have its own handle table that is used during the downcall.
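
A rough sketch of how a runtime could prepare such a per-call context before a downcall; the table names and _HPyCallData_s are hypothetical, only the context struct follows the proposal above:

static HPy call_extension_func(HPyFunc_noargs func, HPy self)
{
    struct _HPyCallData_s call_data = { 0 };   /* per-call / per-thread data  */
    struct _HPyContext_s ctx = {
        .fun_table = &global_function_table,   /* shared, immutable           */
        .handles   = &global_handle_table,     /* shared, immutable           */
        ._private  = &call_data,               /* call-specific data          */
    };
    return func(&ctx, self);                   /* ctx is valid only during this call */
}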

However, a stack-allocated context performs worse on two points: First, access to context members gets a bit slower due to the additional indirection. To get a built-in handle, we now need two memory loads (load the built-in handle table pointer, then load the built-in handle). This could be an unacceptable performance regression. Second, support for upcall specialization is a bit annoying since the called runtime function cannot just use the context's pointer for caching (it will be different for every downcall); the context needs to carry some extra data (maybe a token) for that.

Calling C functions of other HPy extensions is just the same as with every other context.

Heap-allocated context

We can achieve most of the goals with heap-allocated contexts as well. However, that does not happen automatically; we need context caching and the necessary management. The idea is: every time a downcall happens, we fetch a currently unused (heap-allocated) context from some (lock-free) cache and patch it appropriately. The preparation for the downcall will also have low overhead since, as for the stack-allocated context, we reuse all built-in handles and function pointers and everything except the call-specific data (see the sketch below). To have thread-local data, we can just have a context cache per thread; the same applies to sub-interpreters. The big advantage of this approach is the fast member access since there is just one indirection starting from the context pointer.
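
For illustration, a rough sketch of a simple per-thread context cache; the helper init_shared_members is hypothetical, and a real implementation would likely keep a pool of contexts rather than a single spare one:

#include <stdlib.h>

static _Thread_local HPyContext *spare_ctx;   /* one spare context per thread */

static HPyContext *acquire_context(void)
{
    HPyContext *ctx = spare_ctx;
    if (ctx != NULL) {
        spare_ctx = NULL;             /* take it out of the cache */
    } else {
        ctx = malloc(sizeof(*ctx));
        init_shared_members(ctx);     /* hypothetical: fill in handles and function pointers */
    }
    ctx->_private = NULL;             /* patch call-specific data here */
    return ctx;
}

static void release_context(HPyContext *ctx)
{
    if (spare_ctx == NULL)
        spare_ctx = ctx;              /* keep it for the next downcall on this thread */
    else
        free(ctx);
}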

Support for upcall specialization is still a bit annoying since the called runtime function still cannot just use the context's pointer for caching, because we are fetching a pre-allocated context from the context cache (it doesn't need to be the same one every time). So, we need to carry some extra data (maybe a token) here as well.
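
In both variants, one possible shape of that extra data (purely illustrative, not a concrete proposal) would be a stable token stored next to the per-call members, so the runtime can key its specialization caches on the token rather than on the context pointer:

struct _HPyContext_s {
    const struct _HPyFunctionTable_s *fun_table;
    const struct _HPyBuiltinHandleTable_s *handles;
    void *_private;
    void *specialization_token;   /* stable per-call-site / per-interpreter key for caching */
};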

Summary

Stack-allocated context

  • Satisfies requirements 1., 2., 3., 5., 8. out of the box.
  • Major (expected) drawback: the additional indirection (and implied performance penalty) for accessing built-in handles and functions.

Heap-allocated context

  • Satisfies requirements 1., 2., 3., 5., 7., 8. (but not out of the box).
  • Major drawback: context caching is strictly required and this may be very complex.

Making a Decision

In order to decide whether we should switch to a stack-allocated context (using the proposed structure), we need to benchmark the performance impact of the additional indirection.

My expectation is that since HPyContext is then stack-allocated and the stack is mostly in the CPU's L1 cache, the first indirection is very cheap (just a few CPU cycles), whereas this could be much worse for a heap-allocated context. I expect the second indirection to be about as expensive as accessing the heap-allocated struct, with a reasonable chance of even better performance since we can now mark the whole built-in handle and function tables as constant, which puts fewer caching restrictions on them. But that remains to be shown.

References

Please see also some description/analysis about JNIEnv here:
