Skip to content
This repository
branch: master
Fetching contributors…

Octocat-spinner-32-eaf2f5

Cannot retrieve contributors at this time

file 381 lines (282 sloc) 14.221 kb

SEP 200 - Extensible type objects ("C-level duck typing")

Author Dag Sverre Seljebotn
Status Draft

Overview

Often Python extensions needs to communicate things on the ABI level about Python objects. In essence, one would like more slots in PyTypeObject for a custom purpose. Dictionary lookups are both impractical, slow, and require the GIL.

The solution today is often to rely on PyObject_TypeCheck. However, this works against standardizing things across libraries, and creates one-to-many situations (only one implementor of an API, but many consumers), rather than many-to-many where there can be multiple implementors and multiple consumers of a mutually agreed-upon standard (think "domain-specific PEP 3118s").

To overcome this problem, the usual approach is to propose a PEP. However, that process is a) slow, b) not suitable for very domain-specific tasks, and c) the result is not backwards-compatible with earlier versions of Python.

This SEP introduces a metaclass on the C level which can be used to provide effectively an unlimited number of custom slots in the PyTypeObject, without relying on the normal type inheritance.

Examples uses:

  • Make the NumPy array truly polymorphic at the C level, exporting various vtables
  • Implement "fast callables" on the C-level, so that, e.g., Cython can call np.sin from C code with very little overhead, because it finds a function pointer with a "dd->d" signature tagged to the Python object
  • Standardize the Cython cdef class vtable
  • Standardize the internal memoryview representation of Cython (which is much faster than passing around Py_buffer)

Implementation and lack of runtime dependency

An implementation is available from https://github.com/dagss/pyextensibletype

Every type exporting custom slots a) sets a new metaclass extensibletype as its metaclass (the ob_type->ob_type of the object instance), and b) sets tp_flags bit 22, which we hijack for this purpose (ob_type->tp_flags |= PyExtensibleType_TPFLAGS_IS_EXTENSIBLE). The metaclass is necesarry to ensure that the custom slots table is inherited to subclasses created in Python. See the discussion section for justifications for also using the flag bit.

Requiring a common run-time dependency would be a significant disadvantage. Instead, each participating provider library should contain the code for the extensibletype metaclass, provided in a short header file (include/extensibletype.h). The first such module that gets initialized puts the metaclass in the extensibletype_v1 attribute in the _extensibletype module; later modules simply look up the metaclass there.

The custom slot table

The metaclass is used to annotate each participating type with a table of custom slots:

typedef struct {
    uintptr_t id;
    union {
        void *pointer;
        Py_ssize_t objoffset;
        uintptr_t flags;
    } data;
} PyCustomSlot;

The id identifies the purpose of the slot. Users querying an object for a C-level interface scan this custom slot table in its type for IDs they recognize.

The meaning of the data attribute is defined by the custom slot in question. For instance data.pointer can contain a pointer to a vtable (like tp_as_buffer) or C-level run-time type information; data.objoffset contains an offset which can be added to a PyObject* to access an object field, or data.flags contain some flags for the object.

Custom slots are not required to have any order; types are expected to know which custom slot is most performance critical and put that first in the list.

The ID comes in two categories, allocated ID and pointer ID.

Allocated IDs

These are indicated by the least significant bit being 1. The ID space is partitioned as described in a separate section below. Typically an ID is defined statically (e.g., a SEP 2xx defines it). However, one can also reserve some parts of the ID space for IDs allocated incrementally at run-time.

Pointer ID

These are casted memory addresses. The least significant bit must be zero (i.e. an address aligned to at least 2 bytes, which should be the case for any relevant object). For instance, if there's a run-time object specifying an "interface" that both the exporter and consumer has access to, then they can use its address be able to communicate.

This SEP does not specify a specific scheme for such interface creation or any other interning scheme. There could in fact be multiple such schemes operating in parallel, since they never claim the same address.

Interning through a table lookup (using running IDs) should use some section of the ID space reserved for the interning mechanism instead.

Slot expected position

Slots can define an expected position, which consumers check before scanning the entire list. For instance, if the NumPy array grows four different C vtables, then the last one can indicate that it is expected to be found at the 4th slot. Other less featureful array types supporting that slot may pre-pad the table with {0x1, NULL} entries to put the slot in its preferred position.

The value of the expected position is in that the CPU can continue execution with the slot in that position while the check goes on in parallel in the CPU pipeline, not simply that fewer elements are scanned.

The expected position scheme should mean there's almost never branch misses for the typical usecases, since each group of custom slots likely to be found together can negotiate on a position in the table which the branch predictor should assume. For weirder objects that, e.g., supports both NumPy vtables and acts as a native callable, there will of course be conflicts and the table must be scanned.

Consumer API

The consumer API is found in include/customslots.h. It is meant to be forwards-compatible with other ways of implementing the same concepts (like a PEP), and so doesn't mention the metaclass explicitly.

All functions can be called without holding the GIL, as long as one has a guaranteed reference to the type object.

PyCustomSlots_Check(obj)
Does the object support the protocol? Should be checked before using any of the other below.
PyCustomSlots_Count(obj)
How many custom slots does the object support?
PyCustomSlot *PyCustomSlots_Table(obj)
Get a pointer to the table
PyCustomSlot *PyCustomSlots_Find(PyObject *obj, uintptr_t id, Py_ssize_t expected_pos)
Search the table for a matching slot; returns NULL if none is found. Pass the slots' expected position to expected_pos (or 0 if none is defined).

Provider API

The provider API is found in include/extensibletype.h, and requires detailed knowledge of the implementation mechanism (so go read it).

To allow sub-classing Python side, the "object struct" must be based on PyHeapTypeObject rather than PyTypeObject. A typical type object follows (full example in demo/provider_c_code.h).

Note: Even if the binary layout follows that of heap-allocated types, there is nothing heap-allocated about a typical exporter type. Also, in the example below, one could set tp_as_number to 0, but the PyNumberMethods struct would still have to be present.

PyHeapExtensibleTypeObject MyProvider_Type =
{
    /* PyHeapTypeObject etp_heaptype */
    {
        /* PyTypeObject ht_type */
        {
            PyVarObject_HEAD_INIT(0, 0),
            "myprovidertype", /*tp_name*/,
            sizeof(MyProvider_Object), /* tp_basicsize */
            0,                        /* tp_itemsize */
            ...
            &MyProvider_Type.etp_heaptype.as_number, /*tp_as_number*/
            &MyProvider_Type.etp_heaptype.as_sequence, /*tp_as_sequence*/
            &MyProvider_Type.etp_heaptype.as_mapping, /*tp_as_mapping*/
            ...
            &MyProvider_Type.etp_heaptype.as_buffer, /*tp_as_buffer*/
            ...
        },

        /* PyNumberMethods as_number */
        {
            0, /*nb_add*/
            ...
        },

        ...

        0, /* ht_name */
        0 /* ht_slots */

    }, /* end of PyHeapTypeObject */

    2, /* etp_custom_slot_count */
    my_custom_slots /* etp_custom_slot_table */
};
static int PyExtensibleType_Ready(PyHeapExtensibleTypeObject *type, Py_ssize_t slot_table_size)

Called to initialize a statically allocated extensible type. The slot_table_size is used in the case of subclassing another extensible type (see subclassing rules below).

Before calling this function, etp_count and etp_custom_slots should be filled in.

The function a) imports the extensibletype metaclass and sets type->ob_type to it, b) patches the etp_custom_slots table in response to inheritance, c) calls PyType_Ready, d) updates tp_flags.

Note: In the current implementation, subclassing from another extensible type (step d) is simply not implemented, and will raise an exception. This support can be added when it is needed.

PyTypeObject *PyExtensibleType_Import()
Get hold of the extensibletype metaclass directly. There's normally no need to call this.

Subclassing

Statically allocated C subclasses: Since etp_custom_slots is statically allocated, it should be over-allocated and padded with slots with 0 as ID. The number of non-zero slots should be filled in etp_count, while the table size is passed to PyExtensibleType_Ready. The table is then modified to inherit the custom slots just like the built-in slots:

  • Slots are inherited from the parent class by prepending them to the table. The PyCustomSlot struct is simply copied by value.
  • If the same ID is present in the custom slot table of the child, the parent slot is not inherited.
  • If the final number of slots is larger than the count passed to PyExtensibleType_Ready, an exception is raised.

Heap-allocated Python classes: The metaclass ensures that the custom slots of the parent is copied also to Python classes inheriting from classes with custom slots. However, there is no mechanism for changing the table of custom slots (the table pointer is simply set to the table of the superclass).

Libraries can however subclass the extensibletype metaclass in order to (somehow) provide the ability for Python subclasses to modify the table (like a __customslots__ class attribute or similar).

Benchmark results

The penalty of a branch-predicted table lookup in a micro-benchmark was 0.54 ns for one particular test on a 1.87 GHz (Intel Core i7 Q 840).

Changing to a format where the table was embedded directly, loosing one pointer indirection, did not change the numbers at all. Also, because the var-object resizeability is already used up for the method table in heap-allocated types, this would be somewhat intricated.

There was no difference between checking ob_type->tp_flags and checking for a metaclass; ob_type->ob_type. For the metaclass checking strategy, there was no difference between only being able to match the metaclass itself, or also having the possibility of matching a metaclass subclass (as long as that possibility isn't taken, i.e. the direct match is likely).

The custom slot ID space

As mentioned above, when least significant bit is 1 the slot IDs are statically assigned.

For static assignment we assume that the uintptr_t is at least 32 bits; any higher bits should always be 0.

The most significant 8 bits (of the lower 32) denote a "registrar". Each registrar determines the use of the remaining 23 bits, but a recommendation, from most to least significant, is:

  • 8 bits: Registrar (required)
  • 16 bits: Which custom slot "idea"
  • 7 bits: Which backwards-incompatible version of the idea
  • 1 bit: Should be set to 1 for static IDs (*PS! required*)

Special IDs

  • 0x00000000: Empty table position (use for trailing slots when over-allocating table)
  • 0x00000001: Use this if skipping table slots is needed

ID space (most significant 8 bits)

  • 0x00: Reserved
  • 0x01: For internal/personal use, never use in released libraries
  • 0x02: Cython
  • 0x03: NumPy
  • 0x04: NumFOCUS SEPs
  • 0x05-...: Whoever asks

Discussion

Hijacking bit 22 in tp_flags has the following advantages:

  • Consumers don't have to call any PyCustomSlots_Init to import a reference to the metaclass
  • Consumers don't have to carry along a metaclass implementation just in case they are imported before the first provider. (Keep in mind that if the NumPy C API is refactored to be based on this mechanism, there will be a lot of consumers.)
  • It is (probably) microscopically faster if you need to subclass the metaclass for some reason. No effect if you're not subclassing the metaclass though (due to branch prediction working its wonders)

The disadvantage is of course that we hijack a flag, and we have no guarantee that other Python libraries are not doing the same.

If a new Python version uses all available flag bits (and this SEP is not accomodated by any PEPs in the meantime), one can switch to walking ob_type and tp_base rather than checking tp_flags.

As for inclusion as a PEP, that only works for new Python versions. Python-dev was consulted on the question [1], and Nick Coghlan's response [2] indicated that a PEP might not be entirely impossible but should require a working implementation based on meta-classes first.

[1] http://mail.python.org/pipermail/python-dev/2012-May/119481.html
[2] http://mail.python.org/pipermail/python-dev/2012-May/119518.html
Something went wrong with that request. Please try again.