Skip to content

[Stdlib] Document StringSlice[byte=] subscript performance in docstring#6251

Open
msaelices wants to merge 3 commits intomodular:mainfrom
msaelices:stringslice-byte-docstring
Open

[Stdlib] Document StringSlice[byte=] subscript performance in docstring#6251
msaelices wants to merge 3 commits intomodular:mainfrom
msaelices:stringslice-byte-docstring

Conversation

@msaelices
Copy link
Copy Markdown
Contributor

@msaelices msaelices commented Mar 22, 2026

Summary

Add a note to StringSlice.__getitem__(byte=) docstring explaining that the UTF-8 boundary check runs in release builds, and pointing users to as_bytes() for performance-sensitive byte-scanning loops.

Assisted-by: AI

Add a note to StringSlice.__getitem__(byte=) docstring explaining that
the UTF-8 boundary check runs in release builds, and pointing users to
as_bytes() for performance-sensitive byte-scanning loops.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@github-actions github-actions bot added mojo-stdlib Tag for issues related to standard library waiting-on-review labels Mar 22, 2026
@msaelices msaelices marked this pull request as ready for review March 22, 2026 23:23
@msaelices msaelices requested a review from a team as a code owner March 22, 2026 23:23
Copilot AI review requested due to automatic review settings March 22, 2026 23:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the StringSlice.__getitem__(byte=) docstring to clarify performance implications of byte subscripting, specifically calling out the UTF-8 boundary check and recommending as_bytes() for tight byte-scanning loops.

Changes:

  • Added a docstring note that __getitem__(byte=) performs a UTF-8 codepoint-boundary check in typical builds.
  • Pointed users to as_bytes() / Span[Byte] for performance-sensitive byte iteration.
Comments suppressed due to low confidence (1)

mojo/stdlib/std/collections/string/string_slice.mojo:1133

  • The docstring describes this as "byte-level indexing" that may return a partial/invalid UTF-8 sequence and says the return is a "String containing a single byte", but the implementation requires a UTF-8 start byte (via debug_assert[assert_mode="safe"]) and returns a StringSlice whose length is the full codepoint byte length. Please update the docstring to reflect that this returns the codepoint slice at a given byte offset (and that the offset must be a codepoint boundary).
        """Gets a single byte at the specified byte index.

        This performs byte-level indexing, not character (codepoint) indexing.
        For strings containing multi-byte UTF-8 characters, this may return a
        partial or invalid character sequence. For proper character access, use
        `codepoint_slices()` or iterate over the string directly.

        This method includes a UTF-8 codepoint boundary check that runs even
        in release builds. For performance-sensitive byte-scanning loops (e.g.
        parsers, regex engines), use `as_bytes()` to access the underlying
        `Span[Byte]` directly, which avoids this overhead.

        Parameters:
            I: A type that can be used as an index.

        Args:
            byte: The byte index (0-based). Negative indices count from the end.

        Returns:
            A new String containing a single byte at the specified position.
        """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1120 to +1121
This method includes a UTF-8 codepoint boundary check that runs even
in release builds. For performance-sensitive byte-scanning loops (e.g.
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new note says the UTF-8 boundary check "runs even in release builds", but the code is a debug_assert[assert_mode="safe"], which can be compiled out when assertions are disabled (e.g. assert_mode=none). Consider rewording to something like "a debug_assert[assert_mode=\"safe\"] boundary check (enabled by default, including in release builds)" to avoid overstating the guarantee.

Suggested change
This method includes a UTF-8 codepoint boundary check that runs even
in release builds. For performance-sensitive byte-scanning loops (e.g.
This method includes a `debug_assert[assert_mode="safe"]` UTF-8
codepoint boundary check (enabled by default, including in release
builds). For performance-sensitive byte-scanning loops (e.g.

Copilot uses AI. Check for mistakes.
Specify that the check is a debug_assert[assert_mode="safe"] rather
than saying it "runs even in release builds", since it can be compiled
out with -D ASSERT=none.

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
@msaelices msaelices changed the title [Stdlib] Document byte= subscript performance in docstring [Stdlib] Document StringSlice[byte=] subscript performance in docstring Mar 23, 2026
@NathanSWard
Copy link
Copy Markdown
Contributor

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

imported-internally Signals that a given pull request has been imported internally. mojo-stdlib Tag for issues related to standard library

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants