[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes #3988

Conversation
I like this proposal :)
> #### Hold off on developing Char further and remove it from stdlib.builtin
>
> `Char` is currently expensive to create and use compared to a `StringSlice`
Do we have numbers on this? For the creation, but also for using the different methods on it. Let's avoid optimizing things without data.
> Do we have numbers on this?

I don't need numbers to know that the bit-shifting, masking, and for-loops involved in decoding UTF-32 from UTF-8 are expensive compared to a pointer and a length, which is what `Span` is and what `StringSlice` uses underneath.
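To illustrate the kind of work I mean, here is a minimal sketch (illustrative only, not the stdlib's actual decoder) of the masking and shifting needed to decode just a 2-byte UTF-8 sequence to UTF-32; 3- and 4-byte sequences need proportionally more of it:

```mojo
fn decode_2byte_utf32(b0: UInt8, b1: UInt8) -> UInt32:
    # 110xxxxx 10yyyyyy  ->  00000000 00000000 00000xxx xxyyyyyy
    var hi = UInt32(Int(b0 & 0x1F))
    var lo = UInt32(Int(b1 & 0x3F))
    return (hi << 6) | lo
```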
> also using different methods on it
Comparing a 16-byte SIMD vector is going to be more expensive* than using count-leading-zeros (most CPUs have a specialized circuit for it) and a bitwise OR (1 micro-op) together with a comparison against the ASCII max (see #3896).

*: In the context in which this function is used, the number of bytes in a sequence is a prerequisite for a lot of follow-up code, so the throughput advantage of SIMD is not realized: its latency stalls the pipeline. I have done benchmarking and found such cases in #3697 and #3528.
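As a sketch of the scalar trick being described (assuming `count_leading_zeros` from the stdlib `bit` module; this is an illustration, not the exact code from #3896):

```mojo
from bit import count_leading_zeros

fn utf8_sequence_length(first_byte: UInt8) -> Int:
    # The number of leading ones in the first byte encodes the sequence
    # length: 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    var leading_ones = Int(count_leading_zeros(~first_byte))
    # ASCII (0xxxxxxx) has zero leading ones but is a 1-byte sequence,
    # hence the comparison against the ASCII max.
    return 1 if first_byte <= 0x7F else leading_ones
```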
> Let's avoid optimizing things without data.
A pointer and a length is always going to be less expensive than transforming data when an algorithmic throughput difference is not part of the equation. That could be the case, for example, when transforming to another mathematical plane to avoid solving differential equations, but it is not the case here IMO.
Hi Martin, thanks for taking the time to write up a proposal on this topic! 🙂 Apologies for the length of this response ("if I'd had more time, I would have written a shorter letter" and all that 😌). Before responding to your proposal, let me share a bit about where my head is at regarding how we handle these string processing APIs. For some context on my recent work in this area: we had a discussion internally on the stdlib team and came to a tentative consensus to move forward with the name `Codepoint`. My current thinking is a modification to what @owenhilyard proposed in Discord here:
More directly, I'm currently thinking we should do the following:
Then, eventually, re-add a `Char` type.

Now, some initial thoughts on your proposal 🙂
Minor clarification: this is correct re. Swift's `Character` type. To that end, that's partly why I'm in favor of renaming our `Char` type: using terminology like `Char` leaves it ambiguous whether a "character" means a codepoint or a grapheme cluster.
I think both types of iteration (over decoded codepoints and over grapheme clusters) have legitimate uses. As a simple example, our Unicode case conversion logic maps a single codepoint at a time. So insofar as your argument is that codepoint-level operations need to stay fast and available, I agree. Regarding the general thrust of the rest of the proposal:
Re. (2): One idea we've discussed a bit internally (but haven't had time to implement yet) is using named arguments within slicing expressions:

```mojo
var str = StringSlice("abcef")
var bytes: Span[Byte] = str[byte=2:5]
var cps: StringSlice = str[codepoints=1:3]
var chars: StringSlice = str[chars=1:5]  # Grapheme clusters
```

I still have some reservations about this, but I like that it strikes a balance between being a relatively terse syntax for indexing, while also staying absolutely clear and readable about what indexing semantics are being used (helping the programmer stay aware of behavior and potential performance concerns). I'm unsure, though, what I think the default should be for an unadorned slice like `str[1:3]`.

Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all, instead of waiting to write something more polished. Thank you again Martin for helping to articulate a possible design direction here 🙂 — like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.
Hi @ConnorGray
Thanks for writing; I actually like reading through well-thought-out ideas. Your "unordered thoughts" match some of my essays lol
+1 on all 3 of these
This one I'm not so fond of because I like the idea of parametrizing more. I'll expand later
I think we should have the method be
I seriously think it is not necessary to deprecate the default iteration over unicode codepoints, given it is IMO the sane default (and practically free using
100% Agree on this
Also on this, but I just think that the default should be what is most often used. I also think it's quite straightforward to do:

```mojo
a = "123"
for item in a:
    c = Codepoint(item)
    # or, for what I've read them being used for very often
    # in our unicode casing code and golang's stdlib
    # (some people call them "runes"):
    utf32_value, utf8_length = Codepoint.parse_utf32_utf8_length(item)
    ...
```
IMHO this will be much more work than parametrizing and adding
Quite the opposite: the optimization trickles down to all users and libraries that interact with the generic
That section was to propose it in the case that we wanted a

*: This wouldn't include when
It's just a matter of bitcasting the underlying bytes to the proper encoding datatype. For example, the data for a UTF-16 string is actually stored as `UInt16` values underneath.
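A hedged sketch of what that zero-copy reinterpretation could look like (assumes a little-endian, even-length, properly aligned buffer; not an API from the proposal):

```mojo
fn main():
    # UTF-16LE bytes for "hi": each code unit is two bytes.
    var buf = List[UInt8](0x68, 0x00, 0x69, 0x00)
    # Zero-copy "cast": reinterpret the byte buffer as UInt16 code units.
    var units = buf.unsafe_ptr().bitcast[UInt16]()
    for i in range(len(buf) // 2):
        print(units[i])  # 104, 105 ('h', 'i')
```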
Apologies again for being intense sometimes hehe. String handling is what I still love about Python, and I want to make it even better in Mojo 🔥
Thanks @martinvuyk, I've added this to the next design discussion meeting.
The thing is how much memory you are willing to use to avoid O(n) indexing.
Overall, I agree with @ConnorGray. I think that Mojo benefits by being correct by default, and then having the ability to drop down to get more performance after you read about the hazards. In this case, that means defaulting to graphemes and giving codepoint and byte access as options. People dealing with large bodies of text will need to understand unicode and what possibilities exist for their problem, or they can just throw CPU at the problem. When people think of
We can make an alias, so that people who reach for
The reason I think that we may want to be parameterized over encoding is that you want not only ASCII, but also UTF-16 (JS interop with WASM, Java) and UCS-2 (Windows). It means the stdlib-side string handling code is going to be gross, but I would rather go for a solution we can adapt than one where we have "oops, we can't talk to that". This is especially true since I've seen some RAG DB research that wants to use UTF-32 as the input encoding, since it's easier to only need to do "codepoints -> graphemes" on a GPU.
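A hypothetical sketch of what parameterizing the buffer over its encoding could look like (`EncodedString` and its `encoding` parameter are illustrative names, not part of the proposal):

```mojo
struct EncodedString[encoding: StringLiteral]:
    # The encoding is a compile-time property of the buffer, so code
    # mixing encodings fails to type-check instead of misreading bytes.
    var _data: List[UInt8]

    fn __init__(out self, owned data: List[UInt8]):
        self._data = data^

alias Utf8String = EncodedString["utf-8"]
alias Utf16String = EncodedString["utf-16"]
```

Converting between instantiations would then transcode (copy) the whole buffer once, which matches the "make everything a single encoding for internal processing" workflow described below.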
@martinvuyk these "casts" are non-trivial amounts of compute. There are only a few of these "casts" that are cheap, like ASCII -> UTF-8. There are also issues with characters that are impossible to represent in ASCII. I think having the buffer parameterized and converting the whole thing when you want to is a better idea, since I think most code will take input, make everything a single encoding for internal processing, and then potentially convert back to a desired output encoding. These are also potentially fairly large copies to compute, since UTF-16 and UTF-32 usually take more memory than the equivalent UTF-8. I think this makes sense because the encoding is a property of the data in the buffer, not of the function called on the data in the buffer.

@leb-kuchen
I think it is either O(n) memory or O(n) time. In my opinion it is not worth it, and iteration should be faster in 90% of the cases. I think of the following API

Graphemes are designed for segmentation, not really for indexing. It is still possible to design an API this way, but it is not where graphemes shine.
I think with low-level programming for Mojo you want to be closer to the metal than even LLVM and MLIR. What would the concept of passes in this framework look like? Since I was first introduced to Mojo, the impression I get is that you want orders of magnitude in performance.
My thought is that something like this will happen: if the iteration isn't in the critical path, then nobody will notice, and it will silently do the thing which is correct 99% of the time, which is iterating over visible symbols. Very few people not well versed in unicode expect "👍🏻" and "👍" to result in different numbers of loop iterations. If it is in the critical path, then it shows up in profiling, and someone googles "Mojo string iteration slow" and lands on a docs page or blog post about how we decided to do graphemes by default, which tells the user to make use of
IMO you are underestimating how much code in the wild is a series of small/medium/large text-processing steps, and how little developers care about performance once a project works. (I think) The impact of these defaults will be big, and not only for large text.
Sidenote on this: graphemes are an extension of the unicode standard; using only codepoints is also correct. And code migrated from Python that does arithmetic based on codepoints might have issues with Mojo having different defaults.
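To make the codepoint/grapheme distinction concrete, a minimal sketch (assuming `StringSlice.as_bytes()`; `count_codepoints` is an illustrative helper, not a stdlib API) that counts codepoints by skipping UTF-8 continuation bytes. "👍" counts as 1 codepoint while "👍🏻" counts as 2 (base + skin-tone modifier), even though both render as a single grapheme:

```mojo
fn count_codepoints(s: StringSlice) -> Int:
    var bytes = s.as_bytes()
    var n = 0
    for i in range(len(bytes)):
        # Every byte that is not a continuation byte (10xxxxxx)
        # starts a new codepoint.
        if (bytes[i] & 0xC0) != 0x80:
            n += 1
    return n
```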
@owenhilyard I don't mean converting from UTF-8 to these encodings, I mean casting the buffer to e.g. a `Span[UInt16]`. I'm willing to change my mind on the defaults if we can have some benchmarks which show a relatively low impact for common use-cases. My main focus with this proposal is to have
[External] [stdlib] [NFC] Reorganize UTF-8 utils Reorganize UTF-8 utils into their own files and tests. It is bloating `string_slice.mojo` and `test_string_slice.mojo`, and in the case that #3988 gets accepted we will need to build similar infrastructure for UTF-16 and UTF-32. Closes #4246 MODULAR_ORIG_COMMIT_REV_ID: 478ed82ea126b4195721f5d246513040cc42b538
> Graphemes are more expensive to slice because one can't use faster algorithms,
> only skipping each unicode codepoint one at a time and checking the last byte
> to see if it's an extended cluster *.
agreed; but graphemes are what most humans care about
Not in every case. When designing a database index, or something where the size of the string keys needs to be known or at least limited to X bytes, one does care about unicode codepoints or even just bytes (when using pure ASCII). There is also taking a `StringSlice` of an HTTP header, for example, where the encoding is ISO-8859-1. IMO we should be able to offer a zero-copy way to read them.
The thing with Mojo is that our user base will range from systems people all the way up to, hopefully, even web development, where graphemes are what is most natural to think about. My main issue with going graphemes by default is that it will force the cost onto people who do not want it. We could keep String with such defaults, but maybe we should let StringSlice be much more generic over string data.
In my opinion this proposal has many great ideas, but I still remain skeptical towards parameterized encoding.
> ### Python
>
> Python currently uses UTF-32 for its string type. So the slicing and indexing
I don't think this is true anymore. They use the format outlined in PEP 393, which can be UTF-8, UTF-16, or UTF-32, or some combination (new representations are created as needed, as copies). Also, I think it has some way of preserving bogus codepoints.

Python strings are immutable, which makes the PEP 393 approach easier.
It still kind of is UTF-32 (unicode-ish), just with tricks to reduce it to Latin-1 or BMP when bigger codepoints aren't present. That is why I phrased it like that. But the indexing and slicing still just follow unicode codepoint logic, not UTF-8 nor UTF-16.

> UTF-8, UTF-16, or UTF-32 or some combination

AFAIK that is not the case; they are just shorter ranges of the unicode character list and use UInt8, UInt16, or UInt32 to store the unicode codepoint values, not encoded ones.

> Python strings are immutable, which makes the PEP393 approach easier.

100%, having in-place mutability makes everything more complicated.
I'm going to be writing up an actual proposal for handling String going into 1.0. Here are my thoughts so far. I'm fairly against making the default iteration order be grapheme clusters, or making the "character" type grapheme clusters, for the following reasons (among others):
Basically it feels like defaulting to EGCs is making things much more complicated in service of a particular set of operations that are still impossible to do properly even after all your blood, sweat, and tears. To do any of this stuff "correctly" you need to know at least the user's language/writing system and details about the active font, and making the core "string" type dependent on either is... probably a bad idea.

I'm unsure about unicode normalization-based equality comparison. It's not as bad as grapheme clusters, but there are strings that are equivalent when normalized but very much not equivalent from the perspective of a reader/user.
I think we should maybe add graphemes later on, but I wouldn't bet on it, especially now that extensions are becoming a possibility. I think we should go for codepoint slicing and indexing by default, because byte indexing and slicing is unsafe for UTF-8 and should not be exposed by default. Ideally I'd like this proposal to move forward by having indexing parametrized for
I think we should go for raw byte comparison as we currently do, and create methods for different "case-insensitive-ish" comparisons (see #5389). We can have several performance choices for each, but the default should be the fastest and simplest.

PS: @barcharcraz if you're interested, you can look at my backlog of PRs; I've been trying to fix a bunch of things that are wrong with our string for a while now.
Is byte indexing and slicing really unsafe so long as you don't edit anything and when you actually look at the slice you're able to skip over any bogus stuff at the beginning and end? Maybe it's not useful but I'm not sure how unsafe it is. I'm not really sure String should provide
I think it makes more sense to have an encoding parameter than it does an indexing parameter. Really the indexing parameter should be on
Yes, I saw quite a collection, some of which look like good ideas. I'll try and get through them tomorrow (🫰🏼)
There is an example in #5281. If you really implement high-performance stuff, then you will make assumptions based on the first byte or continuation bytes of a sequence. We shouldn't be exposing that by default and hoping developers read the documentation (highly unlikely IMO).
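As an illustration of the hazard (a sketch, not the example from #5281): slicing at an arbitrary byte offset can split a multi-byte sequence, leaving a dangling continuation byte that code assuming valid UTF-8 will misparse:

```mojo
fn main():
    var s = String("é")  # U+00E9 encodes to two UTF-8 bytes: 0xC3 0xA9
    var bytes = s.as_bytes()
    print(bytes[0], bytes[1])  # 195 169
    # A byte-level slice starting at offset 1 would begin with the lone
    # continuation byte 0xA9, which is not valid UTF-8 on its own.
```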
If you don't have an indexing parameter, then how would functions like

Anyway, I don't really care much about the indexing parameter; I proposed it more to have an option for all kinds, including graphemes. But the default indexing, slicing, find, and length should be unicode codepoints due to safety, and if byte indexing is desired then there is always
Me too (see #3653). Especially now that the iterator abstractions are maturing, we might start pushing for other patterns like #5398. I'm stoked to see what this all will look like when it comes together :)
This is a proposal on the main goals, followed by a proposed approach to get there.
I would like us to reach a consensus on this topic given it will affect many projects, and involve a lot of work to fully support.
Everyone I can think of who is involved with strings or who should be part of this conversation: @JoeLoser @ConnorGray @lsh @jackos @mzaks @bgreni @thatstoasty @leb-kuchen