Request for Comments: Added UnicodeBuilder #205

cklein · 2021-04-11T13:34:38Z

RfC

This is a very early draft for an implementation of UnicodeBuilder.
As @antocuni pointed out in cklein#1, a having a discussion around the API first is a good idea, so take the implementation details with a grain of salt and have a focus on the API.
Maybe it's easier to have a discussion on #181 first.

The first iteration (first commit) was very close to ListBuilder and TupleBuilder although it seemed to make sense to preallocate memory.

This changed in the second iteration and it turned out to be a lot like HPyTracker, so it took over some naming conventions.
One of the main differenes is that it uses a C-Python list under the hood to store objects.

Currently there's no way to add objects beyond the initial capacity.
If this is continuing to use a C-Python list, the implementation could switch back from PyList_SET_ITEM to PyList_Append
which automatically handles reallocation including overallocation.
Or the storage could be setup similar to HPy_Tracker and share the reallocation strategy.
A third option would be to internally use a HPy_Tracker, it already provides everything that's needed.

The idea turned out to be very closely related to HPyTracker, so I blatantly stole some code there.

hodgestar · 2021-04-13T09:46:58Z

@cklein Thank you for starting this (its really great to have code to look at to kick-off the design discussion). Apologies for taking a bit long to reply -- life got in the way for awhile.

My 2c on the API design:

I think we want an API that works more with C strings than HPy handles that point to strings.

For me the important use case is "I'm trying to efficiently construct a Python string from a possibly unknown number of C strings". That's currently hard in HPy because unlike in the old C API, we don't want to expose the raw character buffer because that constrains the Python implementation to store strings in that format.

In particular, it would be good to be able to use UnicodeBuilder when implementing HPyUnicode_FromFormat. CPython internally has an _PyUnicodeWriter which fulfills a similar role to UnicodeBuilder (although I don't propose we copy that API -- it got a bit messy over the years).

I think the structure of the current PR is pretty good though, and perhaps we can fulfill the above needs with just a few changes to the API:

In HPyUnicodeBuilder HPyUnicodeBuilder_New(HPyContext *ctx, HPy_ssize_t size), make the size be an initial size for the buffer in bytes (I think @antocuni already suggested this)
In int ctx_UnicodeBuilder_Add(HPyContext *ctx, HPyUnicodeBuilder builder, HPy h_item) replace HPy h_item with const char * buffer that contains a UTF8 string and perhaps add _NAdd that takes a buffer and a length (with a less horrible name than NAdd).
Possibly add _SetSlice (like @antocuni suggested) in case someone wants to write into adhoc places (although I think concatenating is the most common, and we can always add _SetSlice a bit later).

Then the buffer management is all hidden behind the API and _Build can pass the buffer or whatever it is storing directly to the underlying Python implementation which can incorporate it into a Python string without making copies.

cklein · 2021-04-13T16:47:19Z

Thanks a lot for the feedback @hodgestar.
First no worries, for me it's not a time critical thing and I don't expect anybody to respond in realtime :-)
I'm going to have a look at _PyUnicodeWriter later, but in the meantime I started an implementation that operates on a c string instead of a list of Python strings (footgun loaded...). Feedback appreciated, I also don't mind rewriting it.

There are some open questions:

Should it trust the input to be 0 terminated (using strlen vs strlen)?
Should the API require a size or should there be a default value?
Some implementation detail, how should the reallocation strategy work?

… Python strings

antocuni · 2021-06-10T14:51:49Z

@cklein sorry for the very late response. Some part of the delay was caused by real-life issues, but a good part of it was because I spent a lot of time trying to think of the appropriate API that we want for HPyUnicodeBuilder&co.
See #214 for details. Once we have finalized the API design, we can go back to this PR :)

cklein and others added 2 commits April 11, 2021 15:23

Added StringBuilder

1943680

Renamed StringBuilder to UnicodeBuilder

de2e78a

cklein force-pushed the cklein-stringbuilder branch from e11a645 to 914b447 Compare April 11, 2021 13:39

Implementation with a preallocated list

cd4c539

The idea turned out to be very closely related to HPyTracker, so I blatantly stole some code there.

cklein force-pushed the cklein-stringbuilder branch from 914b447 to cd4c539 Compare April 11, 2021 14:51

cklein force-pushed the cklein-stringbuilder branch from 68df196 to 74e2eef Compare April 13, 2021 16:49

Changed UnicodeBuilders to operate on a c string instead of a list of…

c2ed4f2

… Python strings

cklein force-pushed the cklein-stringbuilder branch from 74e2eef to c2ed4f2 Compare April 13, 2021 17:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request for Comments: Added UnicodeBuilder #205

Request for Comments: Added UnicodeBuilder #205

cklein commented Apr 11, 2021 •

edited

Loading

hodgestar commented Apr 13, 2021 •

edited

Loading

cklein commented Apr 13, 2021

antocuni commented Jun 10, 2021

Request for Comments: Added UnicodeBuilder #205

Are you sure you want to change the base?

Request for Comments: Added UnicodeBuilder #205

Conversation

cklein commented Apr 11, 2021 • edited Loading

RfC

hodgestar commented Apr 13, 2021 • edited Loading

cklein commented Apr 13, 2021

antocuni commented Jun 10, 2021

cklein commented Apr 11, 2021 •

edited

Loading

hodgestar commented Apr 13, 2021 •

edited

Loading