String builder API #214

antocuni · 2021-06-10T13:08:17Z

We need to design an API to build str and bytes object.
Before reading this issue, it is strongly suggested to read the document which summarizes the current CPython API for doing that and how they are used in the real world.

General goals

Whatever API we come with, it needs to meet the following goals:

must have 0 or almost 0 overhead on CPython in CPython-ABI mode
must offer a clear and easy way to implement all the real-world usage patterns described here
must be easy and reasonable to implement for alternative implementations

In this issue, I'll use HPyBytes_* and HPyStr_*. See #213 for more discussion about naming.

Raw-buffer vs opaque API

There are many real-world patterns in which it is necessary to offer a raw buffer interface, so we need to provide it. However, we need to decide whether we want to always expose the raw buffer or only when the user explicitly needs/requires it.

Always exposing the raw buffer makes the API smaller and simpler to use.

Exposing it only upon request might allow for a more efficient implementation. For example, GraalPython could maybe use a java.lang.StringBuilder?

Open question: can PyPy/GraalPython/etc. optimize things better in the non-buffer case?

Proposal 1: "functional interface"

This is probably the most similar to the current API. The implementation on CPython is straightforward since HPyBytesBuilder and HPyStrBuilder can just be thin wrappers around PyBytesObject* and PyUnicodeObject*:

    HPyBytesBuilder b = HPyBytesBuilder_New(ctx, size);
    char *buf = HPyBytesBuilder_GetBuffer(ctx, b);
    buf[0] = 'h';
    strcpy(buf[1], "ello world");
    HPyBytesBuilder_WriteChar(b, 0, 'H');
    HPy s = HPyBytesBuilder_Build(ctx, b);

The biggest drawback is that at the time of _New, the implementation does not know yet whether the user will request a raw buffer or not.

Moreover, for the unicode case you would need different versions of HPyStr_GetBuffer, one for each kind:
HPyStr_GetBuffer_UCS{1,2,4}, and you get undefined behavior if you call the wrong one (this is a problem which exists also on the currenct CPython API, of course).

Proposal 2: type-safe API

For the vast majority of real-world usage of PyUnicode_New, the maxchar and consequently the kind are fixed and known at compile time. We could take advantage of that and propose an API which is more type-safe at the C level, e.g.:

    char *buf1;
    HPy_UCS1 *buf2;
    HPy_UCS2 *buf3;
    HPy_UCS2 *buf4;

    HPyStrBuilder b0 = HPyStrBuilder_New(ctx, size, maxchar);
    HPyStrBuilder b1 = HPyStrBuilder_ASCII(ctx, size, &buf1);
    HPyStrBuilder b2 = HPyStrBuilder_UCS1(ctx, size, &buf2);
    HPyStrBuilder b3 = HPyStrBuilder_UCS2(ctx, size, &buf3);
    HPyStrBuilder b4 = HPyStrBuilder_UCS4(ctx, size, &buf4);
    ...

In this proposal, the C type of HPyStrBuilder is always the same but it provides several different constructors: if you construct the builder with HPyStrBuilder_New you can only use the opaque API to write content into it. If you need a raw-buffer interface, you can use one of the other constructors, which also return a pointer to a buffer of the correct type.

Some additional notes:

for the _ASCII, _UCS1, etc. cases, you don't need to specify maxchar: it is already implicit.
The implementation knowns in advance whether it needs to provide a C-level buffer or not.
It makes it impossible to use a raw-buffer interface if you don't know in advance the kind.

Luckily, there is only one real world usage of the pattern described in point 3 above, inside PyICU. A stripped-down version of the code is this:

PyObject *result = PyUnicode_New(len32, max_char);
...
switch (PyUnicode_KIND(result)) {
  case PyUnicode_1BYTE_KIND:
    for (int32_t i = 0; i < len32; ++i)
        PyUnicode_1BYTE_DATA(result)[i] = (Py_UCS1) (utf16[i]);
    break;

  case PyUnicode_2BYTE_KIND:
    u_memcpy((UChar *) PyUnicode_2BYTE_DATA(result), utf16, len16);
    break;

  case PyUnicode_4BYTE_KIND: {
    u_strToUTF32((UChar32 *) PyUnicode_4BYTE_DATA(result), len32, NULL,
                 utf16, len16, &status);
    ...
    break;
  }

The code above could be easily rewritten in the following way, which is only slightly more inconvenient. In this example, HPyStr_KIND is a macro which pre-computes the kind given the max_char:

HPyStrBuilder b_result;
...
switch (HPyStr_KIND(max_char)) {
  case HPyUnicode_1BYTE_KIND:
    HPy_UCS1 *ucs1_buffer;
    b_result = HPyStrBuilder_UCS1(ctx, len32, &ucs1_buffer);
    for (int32_t i = 0; i < len32; ++i)
        ucs1_buffer[i] = (Py_UCS1) (utf16[i]);
    break;

  case HPyUnicode_2BYTE_KIND:
    HPy_UCS2 *ucs2_buffer;
    b_result = HPyStrBuilder_UCS2(ctx, len32, &ucs2_buffer);
    u_memcpy((UChar *) ucs2_buffer, utf16, len16);
    break;

  case PyUnicode_4BYTE_KIND: {
    HPy_UCS4 *ucs4_buffer;
    b_result = HPyStrBuilder_UCS4(ctx, len32, &ucs4_buffer);
    u_strToUTF32((UChar32 *) ucs4_buffer, len32, NULL,
                 utf16, len16, &status);
    ...
    break;
  }

HPy result = HPyStrBuilder_Build(ctx, b_result);

Proposal 3: "buffer inside a struct"

Instead of returning the builder and the buffer separately, we could pack everything inside a struct. E.g.:

    typedef struct {
        // on CPython, _private contains the result of
        // PyBytes_FromStringAndSize(NULL, size); on alternative impl, it
        // contains a pointer to whatever the impl wants, similar to
        // HPyTupleBuilder&co.
        HPy_ssize_t _private;
        char *buffer;
    } HPyBytesBuilder;

    HPyBytesBuilder b1, b2;
    HPyBytesBuilder_Init(ctx, &b1, size);
    assert(b1->buffer == NULL);
    HPyBytesBuilder_WriteChar(ctx, b1, 0, 'H');

    HPyBytesBuilder_InitWithBuffer(ctx, &b2, size);
    b2->buffer[0] = 'H';

In the example above I used _Init and _InitWithBuffer, but we could also turn it into a flag.

The biggest advantage is that it scales very for the unicode case:

    typedef struct {
        HPy_ssize_t _private;
        int kind;
        union {
            void *data;
            HPy_UCS1 *ucs1;
            HPy_UCS2 *ucs2;
            HPy_UCS4 *ucs4;
        };
    } HPyStrBuilder;

    HPyStrBuilder b1, b2;
    HPyStrBuilder_Init(ctx, &b1, size, maxchar);
    HPyStrBuilder_WriteChar(ctx, b1, 0, 'H');
    
    HPyStrBuilder_InitWithBuffer(ctx, &b2, size, 0xFFFF); // maxchar==0xFFFF ==> UCS2
    b2->ucs2[0] = 'H';

In this scenario, the user needs to take care of not accessing the wrong field of the union, but this is the same problem that they have now with PyUnicode_{1,2,4}BYTE_DATA.

The HPyStrBuilder struct also makes it very convenient to access the kind, without having to do a function call.

If we want to go one step further, we could remove the union and make ucs{1,2,4} real fields, one of them pointing to the buffer and the others being NULL: this way, if you try to access the wrong buffer by mistake, you get a nice segfault.

Fixed-size vs growing buffers

All the proposals above assume that the size of the string is known in advance: there is no support for growing buffers.

To achieve the result, some real word code
uses PyUnicode_Join. In theory, alternative implementations like PyPy could provide a faster alternative, if we expose the proper API.

Open questions:

do we want support for this use case?
Does it need to be integrated with the builders which we are designing
here, or it should be a completely different API?
are we aware of any existing C extension which would benefit from such a
feature, i.e. which build growing strings in C in performance-critical
code?

The text was updated successfully, but these errors were encountered:

TeamSpen210 · 2021-06-15T22:42:19Z

pickle actually contains what could be an excellent use case for a resizable builder. _Pickler_Write accumulates binary data into a bytes object, resizing it as required. (The frame part is unrelated, it's part of the pickle format.) In the case of file-dump mode, this is periodically ouputted, otherwise it collects the full pickle.

It's not necessarily in performance-critical code, but Cython uses an inlined/optimised version of PyUnicode_Join for implementing F-strings.

steve-s · 2023-01-31T21:43:17Z

Note I'd like to remove docs/misc/str-builder-api.rst from user documentation, but leave a trace of it here.

antocuni added the API design label Jun 10, 2021

This was referenced Jun 10, 2021

Request for Comments: Added UnicodeBuilder #205

Draft

Introduce BytesBuilder #181

Open

timfel added this to the Version 0.9 milestone Nov 1, 2022

fangerer modified the milestones: ABI version 1, ABI version 2 Jan 30, 2023

steve-s mentioned this issue Jan 31, 2023

Small documentation improvements & quickstart guide #399

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String builder API #214

String builder API #214

antocuni commented Jun 10, 2021 •

edited

Loading

TeamSpen210 commented Jun 15, 2021

steve-s commented Jan 31, 2023

String builder API #214

String builder API #214

Comments

antocuni commented Jun 10, 2021 • edited Loading

General goals

Raw-buffer vs opaque API

Proposal 1: "functional interface"

Proposal 2: type-safe API

Proposal 3: "buffer inside a struct"

Fixed-size vs growing buffers

TeamSpen210 commented Jun 15, 2021

steve-s commented Jan 31, 2023

antocuni commented Jun 10, 2021 •

edited

Loading