Conversation

@martinvuyk
Contributor

This is a proposal on the main goals, followed by a proposed approach to get there.

I would like us to reach a consensus on this topic given it will affect many projects, and involve a lot of work to fully support.

Everyone I can think of who is involved with strings or who should be part of this conversation: @JoeLoser @ConnorGray @lsh @jackos @mzaks @bgreni @thatstoasty @leb-kuchen

@martinvuyk martinvuyk changed the title String, ASCII, Unicode, UTF, Graphemes [stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes Feb 5, 2025
@gryznar
Contributor

gryznar commented Feb 5, 2025

I like this proposal :)


#### Hold off on developing Char further and remove it from stdlib.builtin

`Char` is currently expensive to create and use compared to a `StringSlice`
Contributor

@gabrieldemarmiesse gabrieldemarmiesse Feb 9, 2025


Do we have numbers on this? The creation but also using different methods on it. Let's avoid optimizing things without data.

Contributor Author

Do we have numbers on this?

I don't need numbers to know that the bit-shifting, masking, and for loops involved in decoding UTF-32 from UTF-8 are expensive compared to a pointer and a length, which is what Span is and what StringSlice uses underneath.

also using different methods on it

Comparing a 16-byte SIMD vector is going to be more expensive* than using count-leading-zeros (most CPUs have a specialized circuit) and bitwise OR-ing (1 micro-op) with a comparison against the ASCII maximum (see #3896).

*: In the context in which this function is used, where the number of bytes for a sequence is a prerequisite for a lot of follow-up code, the throughput advantage of SIMD is not realized because its latency stalls the pipeline. I have benchmarked and found such cases in #3697 and #3528.
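To illustrate the count-leading-ones idea being referenced, here is a hedged Python sketch (not the actual Mojo implementation in #3896): the byte length of a UTF-8 sequence can be read off the leading byte's high bits.

```python
def utf8_seq_len(leading_byte: int) -> int:
    """Bytes in the UTF-8 sequence starting at this byte.

    The leading byte encodes the length in its high bits:
    0xxxxxxx -> 1 (ASCII), 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    """
    if leading_byte < 0x80:  # ASCII fast path: compare against the ASCII max
        return 1
    # Count the leading one-bits: invert the byte, then measure how many
    # high bits remain (a CPU would use one count-leading-zeros instruction).
    return 8 - ((~leading_byte) & 0xFF).bit_length()

assert utf8_seq_len(b"a"[0]) == 1
assert utf8_seq_len("ñ".encode("utf-8")[0]) == 2
assert utf8_seq_len("€".encode("utf-8")[0]) == 3
assert utf8_seq_len("😀".encode("utf-8")[0]) == 4
```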

Let's avoid optimizing things without data.

A pointer and a length are always going to be less expensive than transforming data when an algorithmic throughput difference is not part of the equation. That could be the case, for example, when transforming into another mathematical domain to avoid solving differential equations directly, but it is not the case here IMO.

@ConnorGray
Collaborator

Hi Martin, thanks for taking the time to write up a proposal on this topic! 🙂

Apologies for the length of this response—"if I'd had more time, I would have written a shorter letter" and all that😌

Before responding to your proposal, let me share a bit about where my head is at regarding how we handle these string processing APIs. For some context on my recent work in this area: we had a discussion internally on the stdlib team and came to a tentative consensus to move forward with the name Char / char_* for types that operate on Unicode codepoints, with the understanding that we could rename if needed for clarity. It sounds like the current name may not be ideal, and so we're open to renaming it 🙂

My current thinking is a modification to what @owenhilyard proposed in Discord here:

I'm somewhat torn. For most people, they seem to want to iterate over symbols in text, graphemes. However, that can have a substantial perf cost. Char also has potential for confusion with c_char, since c_char is usually a byte. What if we kept the name of the functions as .characters() but make the function return a Grapheme iterator, but also provided .codepoints() and .bytes()? I personally prefer to make the API most people will reach for the most correct in terms of what most people want (iteration over symbols they can see), and then if you have perf issues you can use codepoints or bytes directly.

More directly, I'm currently thinking we should do the following:

  • Rename Char to Codepoint for minimal ambiguity
  • Keep StringSlice.as_bytes() -> Span[Byte]
  • Rename .chars() to StringSlice.codepoints() -> CodepointsIter (iterator over Codepoint)
  • Rename .char_slices() to StringSlice.codepoint_slices() (iterator over single-codepoint StringSlice pointers)
  • Eventually introduce a grapheme cluster iterator called .characters() or .graphemes(), along with Character (owning) and CharacterSlice (view) types.
    • (If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

Then, eventually, re-add a StringSlice.__iter__() that either calls .codepoint_slices() or .graphemes(), pending discussion about the merits of each of those (performance vs correctness tradeoff, essentially).
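To make the three proposed granularities concrete, here is an illustrative Python check (Python's stdlib has no UAX #29 grapheme segmenter, so the cluster count is stated, not computed):

```python
s = "e\u0301"  # "é" written as 'e' + combining acute accent: one visible character

bytes_len = len(s.encode("utf-8"))  # what an .as_bytes() iterator would see
codepoints_len = len(s)             # what a .codepoints() iterator would see
assert (bytes_len, codepoints_len) == (3, 2)
# A .characters()/.graphemes() iterator (UAX #29) would yield 1 cluster here.
```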


Now some initial thoughts on your proposal 🙂

They [Swift] have a Character that I think inspired our current Char type, it is a generic representation of a Character in any encoding, that can be from one codepoint up to any grapheme cluster.

Minor clarification: This is correct re. Swift's Character containing one or more clustered codepoints, but the Char type I added was actually inspired by Rust's char type, which is an unsigned 32-bit integer. I hadn't intended for our Char to eventually support storing an unbounded number of codepoints, because that would have meant it would need to allocate, which wasn't optimal for the common cases where iterating over codepoints is sufficient.

To that end, that's partly why I'm in favor of renaming our Char type to Codepoint. Char has the advantage of being "recognizable" as an English-ish word, but I've come around to the opinion that we're better off using API naming that more closely adheres to the underlying model and inherent complexity of UTF-8 and Unicode, instead of trying to mask that complexity by pretending that a "Character" and Unicode scalar value are equivalent.

Using terminology like Codepoint might be unfamiliar at first to folks not already knowledgeable about how computers represent and process text. But I think part of Mojo's broader language philosophy is not shying away from and trying to paper over complexity. Instead, I think it's better to both teach and make that complexity feel manageable with APIs that model it well 🙂

Value vs. Reference

Our current Char type uses a u32 as storage, every time an iterator that
yields Char is used, an instance is parsed from the internal UTF-8 encoded
StringSlice (into UTF-32).

The default iterator for String returns a StringSlice which is a view into
the character in the UTF-8 encoded StringSlice. This is much more efficient
and does not add any complexity into the type system nor developer headspace.

I think both types of iteration (over decoded u32 codepoints, and over single-codepoint StringSlice views) are useful, and supporting both is valuable. I agree that iteration over single-character StringSlice is likely the most common for any kind of general-purpose string processing. But for certain specialized use cases, e.g. within parsers, low-level string manipulation algorithms, character encoding converters, etc. being able to compare codepoint values directly can be the most natural way to express the desired logic.

As a simple example, our Unicode case conversion logic maps a single Char codepoint to a sequence of replacement Char that are the uppercase expression of the original Char. That is an algorithm most naturally expressed in terms of single codepoint values, because the Unicode casing tables powering it are based on codepoints, not UTF-8 sequences.
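Python's str.upper() exhibits the same one-codepoint-to-several expansion that such casing tables produce (illustrative only; the Mojo implementation differs):

```python
# A single codepoint can expand to multiple codepoints under case mapping:
assert "ß".upper() == "SS"   # U+00DF LATIN SMALL LETTER SHARP S -> "SS"
assert "ﬁ".upper() == "FI"   # U+FB01 LATIN SMALL LIGATURE FI -> "FI"
assert len("ß") == 1 and len("ß".upper()) == 2
```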

So insofar as your argument is that Char should not be used frequently (and even then only by advanced users), I think this is another argument in favor of renaming Char to Codepoint, to de-emphasize it for use by less knowledgeable users.


Regarding the general thrust of the rest of the proposal:

  1. I like the idea of having a dedicated ASCIIString type that would enable us to perform encoding-specific optimizations 😄
    • I think it would be reasonable to start with a struct ASCIIString that is its own dedicated type, and then work later to assess the performance and ergonomics tradeoff in parameterizing StringSlice on an encoding parameter.
  2. I'm less certain that having a parameter for controlling indexing behavior is optimal.
  3. I'm unclear on the semantics of the proposed iter(data).chars[encoding.UTF8]() method — could you elaborate that section?
    • My understanding is that the StringSlice would have some specific encoding it's stored in memory as, so I'm uncertain what "casting" that encoding during iteration would do?

Re. (2): One idea we've discussed a bit internally (but haven't had time to implement yet) is using named arguments within __getitem__ methods, to enable a syntax like:

var str = StringSlice("abcef")

var bytes: Span[Byte] = str[byte=2:5]

var cps: StringSlice = str[codepoints=1:3]

var chars: StringSlice = str[chars=1:5] # Grapheme clusters

I still have some reservations about this, but I like that it strikes a balance between being a relatively terse syntax for indexing, while also staying absolutely clear and readable about what indexing semantics are being used (helping the programmer stay aware of behavior and potentially performance concerns).

I'm unsure though what the default for str[1:5] should be: codepoints or grapheme clusters.


Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all, instead of waiting to write something more polished. Thank you again Martin for helping to articulate a possible design direction here 🙂 — like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.

@martinvuyk
Contributor Author

Hi @ConnorGray

Apologies for the length of this response—"if I'd had more time, I would have written a shorter letter" and all that😌

Thanks for writing; I actually like reading through well-thought-out ideas. Your "unordered thoughts" match some of my essays lol

More directly, I'm currently thinking we should do the following:

  • Rename Char to Codepoint for minimal ambiguity
  • Keep StringSlice.as_bytes() -> Span[Byte]
  • Rename .chars() to StringSlice.codepoints() -> CodepointsIter (iterator over Codepoint)

+1 on all 3 of these

  • Rename .char_slices() to StringSlice.codepoint_slices() (iterator over single-codepoint StringSlice pointers)

This one I'm not so fond of because I like the idea of parametrizing more. I'll expand later

  • Eventually introduce a grapheme cluster iterator called .characters() or .graphemes(), along with Character (owning) and CharacterSlice (view) types.
    • (If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

I think we should have the method be .graphemes() because IMO .characters() is more ambiguous. IMO Grapheme and GraphemeSlice won't be necessary: when iterating over a String or StringSlice with Indexing.GRAPHEME, it would just return a StringSlice whose __len__() might be greater than 1.

Then, eventually, re-add a StringSlice.__iter__() that either calls .codepoint_slices() or .graphemes(), pending discussion about the merits of each of those (performance vs correctness tradeoff, essentially).

I seriously think it is not necessary to deprecate the default iteration over unicode codepoints given it is IMO the sane default (and practically free using StringSlice). It also follows Python's default, which IMO should also count. This is also solved by parametrizing String and StringSlice in a way in which we don't have to make that decision for people.
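For reference, the Python default this argument appeals to yields one codepoint at a time:

```python
s = "añ😀"
pieces = list(s)  # Python iterates codepoint by codepoint
assert pieces == ["a", "ñ", "😀"]
# The pieces have different UTF-8 widths, like the StringSlice views would:
assert [len(p.encode("utf-8")) for p in pieces] == [1, 2, 4]
```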


Using terminology like Codepoint might be unfamiliar at first to folks not already knowledgeable about how computers represent and process text. But I think part of Mojo's broader language philosophy is not shying away from and trying to paper over complexity. Instead, I think it's better to both teach and make that complexity feel manageable with APIs that model it well 🙂

100% Agree on this

I think both types of iteration (over decoded u32 codepoints, and over single-codepoint StringSlice views) are useful, and supporting both is valuable.

Also on this, but I just think that the default should be what is most often used. I also think it's quite straightforward to do:

a = "123"
for item in a:
    c = Codepoint(item)
    # or for what I've read them being used for very often
    # in our unicode casing code and golang's stdlib
    # some people call them "runes"
    utf32_value, utf8_length = Codepoint.parse_utf32_utf8_length(item)
    ...

  • I think it would be reasonable to start with a struct ASCIIString that is its own dedicated type, and then work later to assess the performance and ergonomics tradeoff in parameterizing StringSlice on an encoding parameter.

IMHO this will be much more work than parametrizing and adding constraints on every branch that is not UTF-8 and progressively adding optimizations, because it means duplicating every API and docstring since we don't have good inheritance-like mechanisms yet.

  2. I'm less certain that having a parameter for controlling indexing behavior is optimal.

Quite the opposite: the optimization trickles down to all users and libraries which interact with the generic String. The indexing is very easy to change with a rebind, or if we add some APIs to go back and forth. ascii(String("123")) would return a rebound string, which is free. Any function which uses any sort of string manipulation and whose signature accepts a generic String would benefit from the perf. gains of e.g. changing to an ASCIIString for your specific use case, since you know only those sequences will be there. Just as the performance hit of going to full grapheme support will only happen where needed (determined by the end user: the programmer, not us).

  3. I'm unclear on the semantics of the proposed iter(data).chars[encoding.UTF8]() method — could you elaborate that section?

That section was to propose the case in which we wanted a Char type which can have any one of the 4 encodings underneath*. In the case in which you wanted to iterate over Char of a different encoding than the current string for which iter(some_str) was called, you could pass the encoding parameter to the iterator. All 4 encodings can be packed inside a UInt32 and as such have the same fixed-size stack-allocated representation. And IMO this still makes sense because a character in ASCII is just 1 byte, in UTF-8 1-4, UTF-16 1-2, UTF-32 1. The underlying methods could be parametrized for each encoding and constrained where the methods don't make sense for certain cases.

*: This wouldn't apply when using Indexing.GRAPHEME, but that might actually be more of an argument in favor of renaming the type to Codepoint as you proposed.
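The per-encoding widths can be checked in Python (counting code units, not visible characters; "😀" is a single codepoint outside the BMP):

```python
cp = "😀"  # U+1F600, one codepoint
assert len(cp.encode("utf-8")) == 4           # 4 one-byte UTF-8 code units
assert len(cp.encode("utf-16-le")) // 2 == 2  # surrogate pair: 2 UTF-16 units
assert len(cp.encode("utf-32-le")) // 4 == 1  # always exactly 1 UTF-32 unit
# In every case the encoded form of one codepoint fits in 32 bits.
```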

  • My understanding is that the StringSlice would have some specific encoding it's stored in memory as, so I'm uncertain what "casting" that encoding during iteration would do?

It's just a matter of bitcasting the underlying bytes to the proper encoding's datatype. For example, UTF-16 data would normally be stored in a List[UInt16], but the String type uses a List[UInt8] (and a StringSlice points to it), so the only necessary step when doing anything is bitcasting the internal pointer, e.g. utf16_buffer = rebind[Span[UInt16, __origin_of(some_str)]](some_str.as_bytes()). Both String and StringSlice remain untouched, except inside the bodies of the methods where the casting needs to happen.
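Python's memoryview.cast can sketch this reinterpretation (assuming the machine's native byte order matches the stored data, which the real Span rebind would likewise have to pin down):

```python
import sys

# Reinterpret a byte buffer as 16-bit code units without copying, analogous
# to rebind[Span[UInt16, ...]](some_str.as_bytes()):
enc = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
data = "héllo".encode(enc)          # UTF-16 text held in a plain byte buffer
units = memoryview(data).cast("H")  # zero-copy view as unsigned 16-bit units
assert list(units) == [ord(c) for c in "héllo"]
```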


Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all, instead of waiting to write something more polished. Thank you again Martin for helping to articulate a possible design direction here 🙂 — like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.

Apologies again for being intense sometimes hehe. String handling is what I still love about Python, and I want to make it even better in Mojo 🔥

@jackos
Collaborator

jackos commented Feb 13, 2025

thanks @martinvuyk I've added to next design discussion meeting

@YkdWaWEzVmphR1Z1

YkdWaWEzVmphR1Z1 commented Feb 14, 2025

The thing is how much memory you are willing to use to get better than O(n) indexing.
The most efficient solution is probably to store a subset of breakpoints.
However, segmentation is not as predictable as encoding, so the efficiency will vary.
For ASCII I think it is a good idea; Rust is experimenting with an ASCII Char type, just because of how painful UTF-8 conversions are.
But I don't think a separate Grapheme type is a good idea. This would essentially be String.graphemes().next(), i.e. a wrapper around String or StringSlice.

@owenhilyard
Contributor

Overall, I agree with @ConnorGray. I think that Mojo benefits by being correct by default, and then having the ability to drop down to get more performance after you read about the hazards. In this case, that means defaulting to Graphemes, and giving codepoint and byte access options. People dealing with large bodies of text will need to understand unicode and what possibilities exist for their problem, or they can just throw CPU at the problem. When people think of Character, 90% of developers mean Graphemes, so Char should be a buffer slice (or just a pointer if we're fine re-validating it on use).

(If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

We can make an alias, so that people who reach for CharacterSlice get the GraphemeSlice type back and can either ignore that or look a bit further.

I like the idea of having a dedicated ASCIIString type that would enable us to perform encoding-specific optimizations 😄

The reason I think that we may want to be parameterized over encoding is that you want not only ASCII but also UTF-16 (JS, interop with WASM, Java) and UCS-2 (Windows). It means the stdlib-side string handling code is going to be gross, but I would rather go for a solution we can adapt than one where we have "oops, we can't talk to that". This is especially true since I've seen some RAG DB research which wants to use UTF-32 as the input encoding, since it's easier to only need to do "codepoints -> graphemes" on a GPU.

That section was to propose the case in which we wanted a Char type which can have any one of the 4 encodings underneath*. In the case in which you wanted to iterate over Char of a different encoding than the current string for which iter(some_str) was called, you could pass the encoding parameter to the iterator. All 4 encodings can be packed inside a UInt32 and as such have the same fixed-size stack-allocated representation. And IMO this still makes sense because a character in ASCII is just 1 byte, in UTF-8 1-4, UTF-16 1-2, UTF-32 1. The underlying methods could be parametrized for each encoding and constrained where the methods don't make sense for certain cases.

@martinvuyk these "casts" are non-trivial amounts of compute. There are only a few of these "casts" that are cheap, like ASCII -> UTF-8. There are also issues with characters which are impossible to represent in ASCII. I think having the buffer parameterized and converting the whole thing when you want to is a better idea, since I think most code will take input, make everything a single encoding for internal processing, and then potentially convert back to a desired output encoding. These are also potentially fairly large copies to compute, since UTF-16 and UTF-32 usually take more memory than the equivalent UTF-8. I think this makes sense because the encoding is a property of the data in the buffer, not the function called on the data in the buffer.
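A quick Python check of the size point; for mostly-ASCII text the wider encodings cost more, so whole-buffer conversions are real copies, not views:

```python
text = "naïve café ☕"  # 12 codepoints, mostly ASCII
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")
# Each conversion re-encodes every codepoint and allocates a new buffer:
assert len(utf8) < len(utf16) < len(utf32)
assert (len(utf8), len(utf16), len(utf32)) == (16, 24, 48)
```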

@leb-kuchen

The thing is how much memory you are willing to use to get better than O(n) indexing. The most efficient solution is probably to store a subset of breakpoints. However, segmentation is not as predictable as encoding, so the efficiency will vary. For ASCII I think it is a good idea; Rust is experimenting with an ASCII Char type, just because of how painful UTF-8 conversions are. But I don't think a separate Grapheme type is a good idea. This would essentially be String.graphemes().next(), i.e. a wrapper around String or StringSlice.

O(n) indexing is doable by parsing while we go. If we want better indexing, that's where indexes come into play.
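A Python sketch of "parsing while we go": find the n-th codepoint's byte offset in O(n) time and O(1) extra memory by skipping UTF-8 continuation bytes (illustrative, not any stdlib's implementation):

```python
def byte_offset_of_codepoint(buf: bytes, n: int) -> int:
    """Byte offset of the n-th codepoint in UTF-8 data, by linear scan."""
    seen = 0
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:  # continuation bytes look like 10xxxxxx; skip them
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

buf = "añ😀b".encode("utf-8")
assert byte_offset_of_codepoint(buf, 1) == 1  # "ñ" starts after 1-byte "a"
assert byte_offset_of_codepoint(buf, 2) == 3  # "😀" starts after 2-byte "ñ"
assert byte_offset_of_codepoint(buf, 3) == 7  # "b" comes after the 4-byte "😀"
```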

@YkdWaWEzVmphR1Z1

YkdWaWEzVmphR1Z1 commented Feb 15, 2025

O(n) indexing is doable by parsing while we go. If we want better indexing, that's where indexes come into play.

I think it is either O(n) memory or O(n) time. In my opinion it is not worth it, and iteration should be faster in 90% of the cases.
It would also limit the extensibility of String if this API was introduced.

I think of the following API

  • graphemes
  • is_grapheme_boundary
  • ceil_grapheme_boundary
  • floor_grapheme_boundary

Graphemes are designed for segmentation, not really for indexing. It is still possible to design an API this way, but it is not where graphemes shine.

@rcghpge
Contributor

rcghpge commented Feb 15, 2025

I think with low-level programming for Mojo you want to get closer to the metal than even LLVM and MLIR. What would the concept of passes in this framework look like? Since I was first introduced to Mojo, the impression I get is that you want orders of magnitude in performance.

@owenhilyard
Contributor

O(n) indexing is doable by parsing while we go. If we want better indexing, that's where indexes come into play.

I think it is either O(n) memory or O(n) time. In my opinion it is not worth it, and iteration should be faster in 90% of the cases. It would also limit the extensibility of String if this API was introduced.

I think of the following API

  • graphemes
  • is_grapheme_boundary
  • ceil_grapheme_boundary
  • floor_grapheme_boundary

Graphemes are designed for segmentation, not really for indexing. It is still possible to design an API this way, but it is not where graphemes shine.

My thought is that something like this will happen:

If the iteration isn't in the critical path, then nobody will notice and it will silently do the thing which is correct 99% of the time, which is iterating over visible symbols. Very few people not well versed in unicode expect "👍🏻" and "👍" to result in different numbers of loop iterations.
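That loop-count surprise is easy to reproduce with Python's codepoint iteration (Python has no built-in grapheme iterator, which would yield one symbol for both):

```python
plain = "\U0001F44D"            # 👍
toned = "\U0001F44D\U0001F3FB"  # 👍🏻: thumbs-up + skin-tone modifier
# Codepoint iteration gives different loop counts for visually similar input:
assert len(list(plain)) == 1
assert len(list(toned)) == 2
```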

If it is in the critical path, then it shows up in profiling, and someone googles "Mojo string iteration slow" and lands on a docs page or blog post about how we decided to do graphemes by default, which tells the user to make use of .codepoints() or .bytes() if they want faster iteration in exchange for having to deal with some parts of unicode themselves.

@martinvuyk
Contributor Author

Overall, I agree with @ConnorGray. I think that Mojo benefits by being correct by default, and then having the ability to drop down to get more performance after you read about the hazards. In this case, that means defaulting to Graphemes, and giving codepoint and byte access options. People dealing with large bodies of text will need to understand unicode and what possibilities exist for their problem, or they can just throw CPU at the problem.

IMO you are underestimating how much code in the wild is a series of small/medium/large text-processing steps, and how little developers care about performance once a project works. (I think) the impact of these defaults will be big, and not only for large text.

Mojo benefits by being correct by default

Sidenote on this: graphemes are an extension of the unicode standard; using only codepoints is also correct. And code that is migrated from Python and does arithmetic based on codepoints might have issues with Mojo having different defaults.
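As a concrete instance of the migration concern, Python code routinely does codepoint arithmetic like this, and would behave differently under grapheme-based defaults:

```python
s = "👍🏻!"  # 3 codepoints, 2 grapheme clusters
assert len(s) == 3    # Python counts codepoints
assert s[:1] == "👍"  # codepoint slicing can split a visible symbol
assert s[2] == "!"
```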

That section was to propose the case in which we wanted a Char type which can have any one of the 4 encodings underneath*. In the case in which you wanted to iterate over Char of a different encoding than the current string for which iter(some_str) was called, you could pass the encoding parameter to the iterator. All 4 encodings can be packed inside a UInt32 and as such have the same fixed-size stack-allocated representation. And IMO this still makes sense because a character in ASCII is just 1 byte, in UTF-8 1-4, UTF-16 1-2, UTF-32 1. The underlying methods could be parametrized for each encoding and constrained where the methods don't make sense for certain cases.

@martinvuyk these "casts" are non-trivial amounts of compute. There are only a few of these "casts" that are cheap, like ASCII -> UTF-8. There are also issues with characters which are impossible to represent in ASCII. I think having the buffer parameterized and converting the whole thing when you want to is a better idea, since I think most code will take input, make everything a single encoding for internal processing, and then potentially convert back to a desired output encoding. These are also potentially fairly large copies to compute, since UTF-16 and UTF-32 usually take more memory than the equivalent UTF-8. I think this makes sense because the encoding is a property of the data in the buffer, not the function called on the data in the buffer.

@owenhilyard I don't mean converting from UTF-8 to these encodings, I mean casting the buffer to e.g. a Span[UInt16], doing your editing, and voilà. It's just using a buffer of bytes as the storage for all the other encodings, not only UTF-8. It's a way for us to avoid touching String's or StringSlice's internal data storage format (List[UInt8], Span[UInt8]).


I'm willing to change my mind on the defaults if we can have some benchmarks which show a relatively low impact for common use-cases. My main focus with this proposal is to have String parametrized so that we can have proper utf16 and utf32 encoding support and not a haphazard set of APIs, while also allowing us to bring back some cool ASCII and other encoding-specific optimizations.

@JoeLoser JoeLoser added the needs-discussion Need discussion in order to move forward label Feb 21, 2025
modularbot pushed a commit that referenced this pull request Apr 2, 2025
[External] [stdlib] [NFC] Reorganize UTF-8 utils

Reorganize UTF-8 utils into their own files and tests. It is bloating
`string_slice.mojo` and `test_string_slice.mojo`, and in the case that
#3988 gets accepted we will need to
build similar infrastructure for UTF-16 and UTF-32.

Closes #4246

MODULAR_ORIG_COMMIT_REV_ID: 478ed82ea126b4195721f5d246513040cc42b538

Graphemes are more expensive to slice because one can't use faster algorithms,
only skipping each unicode codepoint one at a time and checking the last byte
to see if it's an extended cluster *.
Collaborator

agreed; but graphemes are what most humans care about

Contributor Author

@martinvuyk martinvuyk Apr 26, 2025

Not in every case. When designing a database index, or something where the size of the string keys needs to be known or at least limited to X bytes, one does care about unicode codepoints or even just bytes (when using pure ASCII). There is also taking a StringSlice of an HTTP header, for example, where the encoding is ISO-8859-1. IMO we should be able to offer a zero-copy way to read them.

The thing with Mojo is that our user base will range from systems people all the way up to, hopefully, even web development, where graphemes are what is most natural to think about. My main issue with going graphemes-by-default is that it will force the cost onto people who will not want it. We could keep String with such defaults, but maybe we should let StringSlice be much more generic over string data.
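ISO-8859-1 (Latin-1) is a convenient case because its 256 byte values map 1:1 onto U+0000..U+00FF, which is what makes a zero-copy view plausible; Python's decode below copies, but it shows the identity:

```python
header = bytes([0x48, 0x6F, 0x6C, 0x61, 0x20, 0xF1])  # "Hola ñ" in ISO-8859-1
text = header.decode("latin-1")
assert text == "Hola ñ"
# Every byte value equals its codepoint, so no transformation is needed:
assert [ord(c) for c in text] == list(header)
```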

@martinvuyk martinvuyk requested a review from a team as a code owner April 27, 2025 14:00
@martinvuyk martinvuyk requested a review from lattner April 27, 2025 14:01
@YkdWaWEzVmphR1Z1

YkdWaWEzVmphR1Z1 commented Apr 28, 2025

In my opinion this proposal has many great ideas, but I still remain skeptical towards parameterized encodings.
I will make a counter-proposal which involves a custom encoding (UTF8-G) and a bit-packed structure at the front.
This should enable fast indexing for graphemes while minimizing overhead for ASCII and other graphemes of length 1.

modularbot pushed a commit that referenced this pull request May 6, 2025
[External] [stdlib] [NFC] Reorganize UTF-8 utils
Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>

@martinvuyk martinvuyk force-pushed the string-ascii-unicode-utf-grapheme branch from a262ba3 to 78590d3 on May 12, 2025 00:52

### Python

Python currently uses UTF-32 for its string type. So the slicing and indexing
Contributor

I don't think this is true anymore. They use the format outlined in PEP 393, which can be UTF-8, UTF-16, or UTF-32 or some combination (new representations are created as needed, as copies). Also I think it has some way of preserving bogus codepoints.

Python strings are immutable, which makes the PEP 393 approach easier.

Contributor Author

It still kind of is UTF-32 (unicode-ish), just with tricks to reduce it to latin-1 or the BMP when bigger codepoints aren't present. That is why I phrased it like that. But the indexing and slicing still just follow unicode codepoint logic, not utf-8 nor utf-16.

UTF-8, UTF-16, or UTF-32 or some combination

AFAIK that is not the case; they are just shorter ranges of the unicode character list and use UInt8, UInt16, or UInt32 to store the unicode codepoint values, not encoded ones.
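Whatever storage width CPython picks per PEP 393, the user-visible model stays codepoint-based, which is the point being made here:

```python
s = "a😀b"  # the astral codepoint forces the widest internal representation
# Indexing and slicing address codepoints directly, in O(1):
assert s[1] == "😀"
assert s[0:2] == "a😀"
assert len(s) == 3
```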

Python strings are immutable, which makes the PEP 393 approach easier.

100% having in-place mutability makes everything more complicated

@barcharcraz
Contributor

I'm going to be writing up an actual proposal for handling String going into 1.0. Here are my thoughts so far

I'm fairly against making the default iteration order be grapheme clusters, or making the "character" type grapheme clusters for the following reasons (among others):

  • If we make String be fully based on grapheme clusters (copying Swift) then we will need to introduce a new type to deal with filesystem paths. Technically even just enforcing UTF-8 would require a new type, but doing things like normalized comparisons will make the issue way worse on all non-Apple platforms. Swift uses String for filesystem things; I think they get away with it because all the macOS filesystem APIs are normalization-insensitive.
  • grapheme clusters are NOT "user perceived characters" or what the "user expects to see" In particular:
    • they are not appropriate places to break text: while it's probably true that grapheme clusters shouldn't be split by a line break, it's not true that grapheme cluster boundaries are always appropriate places to put a break.
    • They aren't appropriate to decide what to delete when you press "backspace" because there are breaks after some combining characters that are written/edited as a unit in certain scripts.
    • they are not appropriate for counting the number of "characters" in a string, because different users have different ideas about what they want counted in different languages/scripts/situations/applications. The cluster segmentation rules can also change from Unicode version to Unicode version, so using them for "string length" means your strings will change length with updates to the UCD.
    • they aren't appropriate for moving the cursor about, since often you do, actually, want to edit "between" combining marks (or even expand precomposed characters during cursor movement).
    • see L2/23-140, L2/23-141
  • grapheme clusters are only useful when you want a rough estimation of "character" that doesn't need to know about the user's language/situation/font/document/time of day/breakfast preferences/etc; I'm not sure where the idea that "grapheme cluster sequence" is the "appropriate" way to think of a string in a modern Unicode world came from, but I don't really understand the rationale, beyond looking good in cute blog posts about how "🇦🇮🏳️‍🌈🇯🇵".length is "surprising"

Basically it feels like defaulting to EGCs is making things much more complicated in service of a particular set of operations that are still impossible to do properly even after all your blood, sweat and tears. To do any of this stuff "correctly" you need to know at least the user's language/writing system and details about the active font, and making the core "string" type dependent on either is ..... probably a bad idea.
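
The counting point is easy to demonstrate: Python's `len` is a codepoint count, a grapheme count would require UAX #29 segmentation (typically a third-party library), and the UTF-8 byte count is different again. A sketch of the flag example above, built with explicit escapes so the codepoints are visible:

```python
import unicodedata

# "🇦🇮" (2 regional indicators) + "🏳️‍🌈" (flag + VS16 + ZWJ + rainbow) + "🇯🇵"
s = ("\U0001F1E6\U0001F1EE"
     "\U0001F3F3\uFE0F\u200D\U0001F308"
     "\U0001F1EF\U0001F1F5")
print(len(s))                  # 8 codepoints, though it renders as 3 "flags"
print(len(s.encode("utf-8")))  # 30 bytes in UTF-8
for cp in s:
    print(f"U+{ord(cp):04X}", unicodedata.name(cp, "<unnamed>"))
```

So the same string is 3, 8, or 30 "long" depending on which unit you pick, which is exactly why "string length" is not a single well-defined number.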

I'm unsure about unicode normalization-based equality comparison. It's not as bad as grapheme clusters, but there are strings that are equivalent when normalized but very much not equivalent from the perspective of a reader/user.
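
The distinction can be sketched with Python's `unicodedata`: canonical equivalence (NFC/NFD) really is "same text, different codepoints", while compatibility equivalence (NFKC/NFKD) equates strings a reader would consider different:

```python
import unicodedata

# Canonical equivalence: identical rendering, different codepoint sequences.
composed = "\u00e9"     # "é" as a single precomposed codepoint
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
print(composed == decomposed)  # False: raw comparison sees different sequences
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True

# Compatibility equivalence: normalized-equal, but visually distinct to a reader.
print(unicodedata.normalize("NFKC", "\u2460"))  # "①" normalizes to "1"
```

The second case is the kind of "equal after normalization but not to the user" pair mentioned above.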

@martinvuyk
Contributor Author

I think we should maybe add graphemes later on but I wouldn't bet on it especially now that extensions are becoming a possibility. I think we should go for codepoint slicing and indexing by default because byte indexing and slicing is unsafe for UTF-8 and it should not be exposed by default.

Ideally I'd like this proposal to move forward by having indexing parametrized for StringSlice (and let people implement extensions for graphemes). Even better if I were allowed to add Encoding into the mix as well. I think that having a native way to handle different encodings will give big performance and correctness improvements (I've already seen people building with from_utf8 on HTTP headers, which are in latin-1, for example). We could even get native zero-copy Python <-> Mojo StringSlices

I'm unsure about unicode normalization-based equality comparison. It's not as bad as grapheme clusters, but there are strings that are equivalent when normalized but very much not equivalent from the perspective of a reader/user.

I think we should go for raw byte comparison as we currently are and create methods for different "case insensitive-ish" comparisons (see #5389). We can have several performance choices for each, but the default should be the fastest and simplest.

PS: @barcharcraz If you're interested you can look at my backlog of PRs, I've been trying to fix a bunch of things that are wrong with our string for a while now

@barcharcraz
Contributor

I think we should maybe add graphemes later on but I wouldn't bet on it especially now that extensions are becoming a possibility. I think we should go for codepoint slicing and indexing by default because byte indexing and slicing is unsafe for UTF-8 and it should not be exposed by default.

Is byte indexing and slicing really unsafe so long as you don't edit anything, and so long as, when you actually look at the slice, you're able to skip over any bogus stuff at the beginning and end? Maybe it's not useful, but I'm not sure how unsafe it is. I'm not really sure String should provide __getitem__ at all.

Ideally I'd like this proposal to move forward by having indexing parametrized for StringSlice (and let people implement extensions for graphemes). Even better if I would be allowed to add Encoding into the mix as well. I think that having a native way to handle different encodings will give big performance and correctness improvements (I've already seen people building with from_utf8 using http headers which are in latin-1 for example). We could even get Python <-> Mojo native zero copy StringSlices

I think it makes more sense to have an encoding parameter than it does an indexing parameter. Really the indexing parameter should be on __getitem__ and not the slice. I think we should have separate indexing methods (or methods returning an iterator) on string slice for each kind of indexing you might want, and then only have __getitem__ on slices of "bytestring/ascii" encoding.

I'm unsure about unicode normalization-based equality comparison. It's not as bad as grapheme clusters, but there are strings that are equivalent when normalized but very much not equivalent from the perspective of a reader/user.

I think we should go for raw byte comparison as we currently are and create methods for different "case insensitive-ish" comparisons (see #5389). We can have several performance choices for each, but the default should be the fastest and simplest.

PS: @barcharcraz If you're interested you can look at my backlog of PRs, I've been trying to fix a bunch of things that are wrong with our string for a while now

Yes, I saw quite a collection, some of which look like good ideas. I'll try and get through them tomorrow (🫰🏼)

@martinvuyk
Contributor Author

Is byte indexing and slicing really unsafe so long as you don't edit anything and when you actually look at the slice you're able to skip over any bogus stuff at the beginning and end?

There is an example in #5281. If you really implement high performance stuff then you will make assumptions based on the first byte or continuation bytes of a sequence. We shouldn't be exposing that by default and hope developers read documentation (highly unlikely IMO).
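
A minimal Python sketch of that failure mode, with `bytes.decode` standing in for any strict UTF-8 consumer: a byte slice can land in the middle of a multi-byte sequence and hand invalid UTF-8 to the next stage.

```python
s = "h\u00e9llo"          # "héllo"; "é" is the two bytes C3 A9 in UTF-8
raw = s.encode("utf-8")   # b'h\xc3\xa9llo'
chunk = raw[:2]           # a byte slice that cuts the "é" sequence in half
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e)
```

Codepoint-based slicing can never produce such a chunk, which is the safety argument for making it the default.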

I think it makes more sense to have an encoding parameter than it does an indexing parameter. Really the indexing parameter should be on __getitem__ and not the slice. I think we should have separate indexing methods (or methods returning an iterator) on string slice for each kind of indexing you might want, and then only have __getitem__ on slices of "bytestring/ascii" encoding.

If you don't have an indexing parameter, then how would functions like __len__, find, __getitem__(idx), and __getitem__(Slice) all coordinate in tandem without allowing developer error? It would be very easy for people to write str_slice[:str_slice.find(";")] and mistakenly use one indexing scheme for the slice and another for the find call. A lot of code would break if one of these methods had a different indexing scheme than the others.
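
A Python sketch of that mistake, using an offset into the encoded bytes to stand in for a hypothetical byte-indexed find while slicing stays codepoint-based:

```python
s = "h\u00e9llo;rest"                    # "héllo;rest"
cp_idx = s.find(";")                     # 5: codepoint index
byte_idx = s.encode("utf-8").find(b";")  # 6: byte index ("é" is 2 bytes)
print(repr(s[:cp_idx]))    # 'héllo'   — consistent indexing, correct
print(repr(s[:byte_idx]))  # 'héllo;'  — mixed schemes, silently off by one
```

The mixed-scheme version doesn't crash; it just returns the wrong slice, which is the worst kind of bug to hand users by default.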

Anyway, I don't really care much about the indexing parameter; I proposed it more to have an option for all kinds, including graphemes. But the default indexing, slicing, find, and length should use Unicode codepoints for safety, and if byte indexing is desired there is always .as_bytes() at hand (or we provide a keyword __getitem__, e.g. [bytes=:3] and find[bytes=True](...), or whatever).

I think we should have separate indexing methods (or methods returning an iterator)

Me too (see #3653). Especially now that the iterator abstractions are maturing we might start pushing for other patterns like #5398. I'm stoked to see what this all will look like when it comes together :)


Labels

needs-discussion Need discussion in order to move forward
