Add UTF codecs and validators to the Stdlib
#10660
Comments
I only had a brief glance at the API, but I confirm; upstream wants this :) I'll try to give a fuller look tomorrow. Thanks!
In your description you speak of decoders and validators but the proposal also includes encoders (
Finally, I couldn't come up with any obvious use-cases for this |
Thanks @nojb for having a look.
Yes. I adapted the issue title to reflect this.
Yes it differs. But other Both UTF-8 and UTF-16 are variable length encodings. We could have the same design as other
Encoding in fixed size buffers to I also have opened issues dbuenzli/stdlib-utf#1 and dbuenzli/stdlib-utf#2 which need discussion. |
Returning the encoding length seems fine to me. What I am less sure about is whether to signal error by returning a negative value. On the one hand, it is a clever way of returning the extra information without any allocation or exception. On the other hand, the general style of the stdlib is to always signal an error either by raising or by using an option type (in this case an option type would be heavy, but I guess raising is a possibility).
Maybe it's a matter of viewpoint. Personally I don't see that case as being an error or particularly exceptional. It will happen, so I really don't want to raise. Note that it's the same interface If there is strong consensus on not returning negative numbers then I'd rather suggest returning
Also maybe we can change the names, rather than |
Also, I want to add that none of the use cases I personally have in mind would suffer from that (basically I would call the And for using the functions to reimplement existing
I think this is the last point that remains to be decided before moving on to PR-stage. Let's try to make a decision. Returning 0 seems fine to me. @gasche @alainfrisch do you have an opinion? The question is: given the proposed signature

    val set_utf_8_uchar : t -> int -> Uchar.t -> int
    (** [set_utf_8_uchar b i u] UTF-8 encodes [u] at index [i] in [b].
        If the result value [n] is:
        {ul
        {- [n > 0], [n] is the number of bytes that were used for encoding
           [u]. A new character can be encoded at [i + n].}
        {- [n < 0], [b] was left untouched because there was no space left
           at [i] to encode [u]. [-n] is the total number of bytes needed for
           encoding [u].}} *)

we are discussing what to return when there is not enough space:
I believe we have a soft consensus on 2. but having one or two extra opinions would be helpful!
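To make the contract in the doc comment above concrete, here is a minimal, hand-written sketch of the original "negative result" variant. It is only an illustration of the documented behaviour, not the code from the proposal's repository; the helper `utf_8_len` and the bounds handling are assumptions made for this example.

```ocaml
(* Hypothetical helper: number of bytes needed to UTF-8 encode [u]. *)
let utf_8_len u =
  match Uchar.to_int u with
  | c when c <= 0x007F -> 1
  | c when c <= 0x07FF -> 2
  | c when c <= 0xFFFF -> 3
  | _ -> 4

(* Sketch of the documented contract: returns [n > 0] bytes written,
   or [-n] when [n] bytes would be needed and [b] is left untouched. *)
let set_utf_8_uchar b i u =
  let len = utf_8_len u in
  if i < 0 || i > Bytes.length b then invalid_arg "index out of bounds"
  else if i + len > Bytes.length b then -len
  else begin
    let c = Uchar.to_int u in
    let set k v = Bytes.set b (i + k) (Char.chr v) in
    (match len with
     | 1 -> set 0 c
     | 2 ->
       set 0 (0xC0 lor (c lsr 6));
       set 1 (0x80 lor (c land 0x3F))
     | 3 ->
       set 0 (0xE0 lor (c lsr 12));
       set 1 (0x80 lor ((c lsr 6) land 0x3F));
       set 2 (0x80 lor (c land 0x3F))
     | _ ->
       set 0 (0xF0 lor (c lsr 18));
       set 1 (0x80 lor ((c lsr 12) land 0x3F));
       set 2 (0x80 lor ((c lsr 6) land 0x3F));
       set 3 (0x80 lor (c land 0x3F)));
    len
  end

(* Caller side: the sign of the result distinguishes the two cases. *)
let () =
  let b = Bytes.create 2 in
  match set_utf_8_uchar b 0 (Uchar.of_int 0x20AC) with
  | n when n > 0 -> Printf.printf "wrote %d bytes\n" n
  | n -> Printf.printf "need %d bytes, buffer untouched\n" (-n)
```

The caller has to inspect the sign of the result, which is the point under discussion in this thread.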
I think the original design |
I think we should also consider:
Personally I'm starting to lean towards the return |
I strongly dislike the idea of encoding a 2-case result type into an integer, with different meanings for positive and negative values. I don't see concrete cases where the caller would need to know the exact number of bytes which are missing to encode the uchar, and if the need arises, we have a function that gives the answer, with some small extra cost. The approach with returning 0 when there is not enough space at least has a uniform meaning (number of bytes actually written). It's slightly unsatisfactory having to check for a specific value to know whether the function succeeded or not, but I'm not opposed to it; it is a good compromise. I wonder what would be the performance impact of always calling Otherwise, what about returning a |
Thanks @alainfrisch for your input.
I personally don't find any weirdness in the I changed the proposal to "return the number of written bytes". Personally I see nothing non-idiomatic in that. Again it's just the signature of dbuenzli/stdlib-utf@d40aa57 Amends the proposal.
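For readers following along, a short sketch of how the "number of written bytes" convention can be used to fill a fixed-size buffer. It assumes the functions as they later shipped in the OCaml 4.14 `Stdlib` (`Bytes.set_utf_8_uchar` returning 0 when there is not enough space, and `Uchar.utf_8_byte_length`); `encode_prefix` is a hypothetical name introduced for this example.

```ocaml
(* Requires OCaml >= 4.14. [encode_prefix] is a hypothetical helper. *)

(* Encode as many characters of [us] as fit in [b], starting at index 0.
   Returns the number of bytes written and the characters left over. *)
let encode_prefix (b : Bytes.t) (us : Uchar.t list) : int * Uchar.t list =
  let rec loop i = function
    | [] -> i, []
    | u :: rest as pending ->
      (* Guard the index ourselves so the call is always in bounds. *)
      if i >= Bytes.length b then i, pending
      else
        let n = Bytes.set_utf_8_uchar b i u in
        if n = 0 then i, pending (* no space: [b] untouched at [i] *)
        else loop (i + n) rest
  in
  loop 0 us

(* If the caller wants to know how much space was missing, it can ask. *)
let bytes_needed (pending : Uchar.t list) : int =
  List.fold_left (fun acc u -> acc + Uchar.utf_8_byte_length u) 0 pending
```

With this convention a single comparison against 0 tells the caller whether to flush or grow the buffer, and the exact size is still available on demand.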
I have slightly changed the functions on decodes not to mention bytes as the All the proposed additions can be found here, I'm glad if people manage to complain before I PR (open an issue on the tracker). I hope to be able to do that sometime next week according to this plan so that this hopefully gets in before we all get frozen. |
Since we have `Uchar.t` and UTF encoders in `Buffer`, I have been thinking (for a long time) it would be nice to add UTF decoders to the `Stdlib`. For now I focus on simply providing an API for `Bytes`, the rest can follow later.

I think I finally found a design I'm happy with. It is efficient (no allocations, no exceptions) and can easily be used to devise wasteful higher-level abstractions (e.g. `Uutf`-like folders or `Seq` based iterators). It also has good properties for best-effort decodes.

Since these things are a pain to tweak and work on with a compiler build on your knees I developed this in this repository. Note that there's a single API candidate there, just a few different implementations of UTF-8 decoding that I wanted to try, see the README for all the details.
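As an illustration of the kind of higher-level abstraction mentioned above, here is a small sketch of a folder built on the low-level decoder. It assumes the API as it later landed in OCaml 4.14 (`Bytes.get_utf_8_uchar` and the `Uchar.utf_decode_*` accessors), which may differ in detail from the candidate in the repository; `fold_utf_8`, `code_points` and `count_errors` are hypothetical names.

```ocaml
(* Requires OCaml >= 4.14. All three functions are hypothetical helpers. *)

(* Fold [f] over every decode found in [b]. Malformed bytes are reported
   through the decode value itself; nothing is raised. *)
let fold_utf_8 f acc b =
  let len = Bytes.length b in
  let rec loop acc i =
    if i >= len then acc
    else
      let d = Bytes.get_utf_8_uchar b i in
      (* The decode length is always at least 1, so the loop terminates. *)
      loop (f acc i d) (i + Uchar.utf_decode_length d)
  in
  loop acc 0

(* Best-effort decode: invalid sequences come out as U+FFFD (Uchar.rep). *)
let code_points b =
  List.rev
    (fold_utf_8
       (fun acc _ d -> Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
       [] b)

(* Count the malformed sequences without building any intermediate list. *)
let count_errors b =
  fold_utf_8
    (fun n _ d -> if Uchar.utf_decode_is_valid d then n else n + 1)
    0 b
```

The folder itself stays allocation-free in the decoding step; only `code_points` allocates, because it chooses to build a list.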
Before I consider making a PR, I'd like to:
For 3. I suggest to use `pat`, it was the fastest on my machine and it lets the compiler nicely work out the dispatch table for us by using character range patterns :-)

Also I still need to work out a little test suite.

Feel free to open issues and/or make PRs there as well if you see fit.
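To show what "character range patterns" means here, a minimal sketch of dispatching on the first byte of a UTF-8 sequence. This is not the repository's `pat` implementation, only an example of the technique; `utf_8_lead_length` is a hypothetical name, and the ranges follow the well-formed UTF-8 byte table from the Unicode standard.

```ocaml
(* Hypothetical helper: expected sequence length for a UTF-8 lead byte,
   or 0 if the byte cannot start a well-formed sequence. The compiler
   turns the range patterns into the dispatch for us. *)
let utf_8_lead_length = function
  | '\x00' .. '\x7F' -> 1 (* ASCII *)
  | '\xC2' .. '\xDF' -> 2 (* 0xC0 and 0xC1 would be overlong *)
  | '\xE0' .. '\xEF' -> 3
  | '\xF0' .. '\xF4' -> 4 (* above 0xF4 exceeds U+10FFFF *)
  | _ -> 0                (* continuation byte or invalid lead *)
```

A full decoder would go on to check the continuation bytes for each case; the sketch only covers the first dispatch step.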