[Stdlib] Document StringSlice[byte=] subscript performance in docstring #6251
msaelices wants to merge 3 commits into modular:main
Add a note to StringSlice.__getitem__(byte=) docstring explaining that the UTF-8 boundary check runs in release builds, and pointing users to as_bytes() for performance-sensitive byte-scanning loops. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Pull request overview
Updates the StringSlice.__getitem__(byte=) docstring to clarify performance implications of byte subscripting, specifically calling out the UTF-8 boundary check and recommending as_bytes() for tight byte-scanning loops.
Changes:
- Added a docstring note that `__getitem__(byte=)` performs a UTF-8 codepoint-boundary check in typical builds.
- Pointed users to `as_bytes()` / `Span[Byte]` for performance-sensitive byte iteration.
Comments suppressed due to low confidence (1)
mojo/stdlib/std/collections/string/string_slice.mojo:1133
- The docstring describes this as "byte-level indexing" that may return a partial/invalid UTF-8 sequence and says the return is a "String containing a single byte", but the implementation requires a UTF-8 start byte (via `debug_assert[assert_mode="safe"]`) and returns a `StringSlice` whose length is the full codepoint byte length. Please update the docstring to reflect that this returns the codepoint slice at a given byte offset (and that the offset must be a codepoint boundary).
"""Gets a single byte at the specified byte index.
This performs byte-level indexing, not character (codepoint) indexing.
For strings containing multi-byte UTF-8 characters, this may return a
partial or invalid character sequence. For proper character access, use
`codepoint_slices()` or iterate over the string directly.
This method includes a UTF-8 codepoint boundary check that runs even
in release builds. For performance-sensitive byte-scanning loops (e.g.
parsers, regex engines), use `as_bytes()` to access the underlying
`Span[Byte]` directly, which avoids this overhead.
Parameters:
I: A type that can be used as an index.
Args:
byte: The byte index (0-based). Negative indices count from the end.
Returns:
A new String containing a single byte at the specified position.
"""
This method includes a UTF-8 codepoint boundary check that runs even
in release builds. For performance-sensitive byte-scanning loops (e.g.
The new note says the UTF-8 boundary check "runs even in release builds", but the code is a debug_assert[assert_mode="safe"], which can be compiled out when assertions are disabled (e.g. assert_mode=none). Consider rewording to something like "a debug_assert[assert_mode=\"safe\"] boundary check (enabled by default, including in release builds)" to avoid overstating the guarantee.
Suggested change:
- This method includes a UTF-8 codepoint boundary check that runs even
- in release builds. For performance-sensitive byte-scanning loops (e.g.
+ This method includes a `debug_assert[assert_mode="safe"]` UTF-8
+ codepoint boundary check (enabled by default, including in release
+ builds). For performance-sensitive byte-scanning loops (e.g.
Specify that the check is a debug_assert[assert_mode="safe"] rather than saying it "runs even in release builds", since it can be compiled out with -D ASSERT=none. Signed-off-by: Manuel Saelices <msaelices@gmail.com>
!sync
Summary
Add a note to the `StringSlice.__getitem__(byte=)` docstring explaining that the UTF-8 boundary check runs in release builds, and pointing users to `as_bytes()` for performance-sensitive byte-scanning loops.
Assisted-by: AI