Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for Comments: Added UnicodeBuilder #205

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

cklein
Copy link
Contributor

@cklein cklein commented Apr 11, 2021

RfC

This is a very early draft for an implementation of UnicodeBuilder.
As @antocuni pointed out in cklein#1, a having a discussion around the API first is a good idea, so take the implementation details with a grain of salt and have a focus on the API.
Maybe it's easier to have a discussion on #181 first.

The first iteration (first commit) was very close to ListBuilder and TupleBuilder although it seemed to make sense to preallocate memory.

This changed in the second iteration and it turned out to be a lot like HPyTracker, so it took over some naming conventions.
One of the main differenes is that it uses a C-Python list under the hood to store objects.

Currently there's no way to add objects beyond the initial capacity.
If this is continuing to use a C-Python list, the implementation could switch back from PyList_SET_ITEM to PyList_Append
which automatically handles reallocation including overallocation.
Or the storage could be setup similar to HPy_Tracker and share the reallocation strategy.
A third option would be to internally use a HPy_Tracker, it already provides everything that's needed.

The idea turned out to be very closely related to HPyTracker,
so I blatantly stole some code there.
@hodgestar
Copy link
Contributor

hodgestar commented Apr 13, 2021

@cklein Thank you for starting this (its really great to have code to look at to kick-off the design discussion). Apologies for taking a bit long to reply -- life got in the way for awhile.

My 2c on the API design:

I think we want an API that works more with C strings than HPy handles that point to strings.

For me the important use case is "I'm trying to efficiently construct a Python string from a possibly unknown number of C strings". That's currently hard in HPy because unlike in the old C API, we don't want to expose the raw character buffer because that constrains the Python implementation to store strings in that format.

In particular, it would be good to be able to use UnicodeBuilder when implementing HPyUnicode_FromFormat. CPython internally has an _PyUnicodeWriter which fulfills a similar role to UnicodeBuilder (although I don't propose we copy that API -- it got a bit messy over the years).

I think the structure of the current PR is pretty good though, and perhaps we can fulfill the above needs with just a few changes to the API:

  • In HPyUnicodeBuilder HPyUnicodeBuilder_New(HPyContext *ctx, HPy_ssize_t size), make the size be an initial size for the buffer in bytes (I think @antocuni already suggested this)

  • In int ctx_UnicodeBuilder_Add(HPyContext *ctx, HPyUnicodeBuilder builder, HPy h_item) replace HPy h_item with const char * buffer that contains a UTF8 string and perhaps add _NAdd that takes a buffer and a length (with a less horrible name than NAdd).

  • Possibly add _SetSlice (like @antocuni suggested) in case someone wants to write into adhoc places (although I think concatenating is the most common, and we can always add _SetSlice a bit later).

Then the buffer management is all hidden behind the API and _Build can pass the buffer or whatever it is storing directly to the underlying Python implementation which can incorporate it into a Python string without making copies.

@cklein
Copy link
Contributor Author

cklein commented Apr 13, 2021

Thanks a lot for the feedback @hodgestar.
First no worries, for me it's not a time critical thing and I don't expect anybody to respond in realtime :-)
I'm going to have a look at _PyUnicodeWriter later, but in the meantime I started an implementation that operates on a c string instead of a list of Python strings (footgun loaded...). Feedback appreciated, I also don't mind rewriting it.

There are some open questions:

  • Should it trust the input to be 0 terminated (using strlen vs strlen)?
  • Should the API require a size or should there be a default value?
  • Some implementation detail, how should the reallocation strategy work?

@antocuni
Copy link
Collaborator

@cklein sorry for the very late response. Some part of the delay was caused by real-life issues, but a good part of it was because I spent a lot of time trying to think of the appropriate API that we want for HPyUnicodeBuilder&co.
See #214 for details. Once we have finalized the API design, we can go back to this PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants