segmenter: add OffsetInBytes and LengthInBytes to Grapheme, Line, and Word#241
segmenter: add OffsetInBytes and LengthInBytes to Grapheme, Line, and Word#241benoitkugler merged 1 commit intogo-text:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the segmenter package API to expose UTF-8 byte offsets/lengths for each produced segment, enabling callers to slice the original string/[]byte input without converting to []rune first (Fixes #240).
Changes:
- Track UTF-8 byte position alongside rune position inside the internal
attributeIterator. - Expose
OffsetInBytesandLengthInBytesonGrapheme,Line, andWord. - Add a test validating that byte slicing via the new fields matches the segment
Textfor several UTF-8 inputs across all init modes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| segmenter/segmenter.go | Adds internal byte-position tracking and exposes byte offset/length on segment structs. |
| segmenter/segmenter_test.go | Adds coverage ensuring byte offsets/lengths reproduce the expected substring for graphemes/lines/words. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
89b991a to
5229631
Compare
|
I think it's ready. Please take a look, thanks |
|
Thank you for this contribution. I'm not a big fan however of the additional complexity incurred by invalid utf8 handling. Have you concrete use case you must support ? Could we instead require valid utf8 is passed to the segmenter ? That would align more closely to the scope of the library (implicitly defined by the use of rune slices). |
In Guigui (a GUI framework made with Ebitengine), I need to implement a text auto-wrapping, and this requires a segmenter. Now I am using uniseg, but this is relatively inactive.
Yeah, I think this makes sense. I think there are some options, but the simplest way is let it panic:
This enables to remove some checks like Another option is just silently ignore such runes, but this sounds a little risky. What do you think? |
|
What about returning an error from |
I slightly prefer panics since the condition when to panic is very clear and a user can check before invoking InitWithString/Bytes. Also this would break a comaptibility with v0.3.4. But if you want to change the signature, I'm fine to do so. |
|
Also, if |
|
Ah, my mistake, I've overlooked the fact that I'm in favor of the panic then. Whats @andydotxyz @whereswaldon think about this question ? |
|
Please don't (ever) panic in a library. Unless it is truly an unrecoverable fatal flaw in the input. We could log it aggressively if returning an error is not an option. |
|
So if we don't panic, don't return errors, and don't handle invalid sequences in the current way, the only way is to just ignore such sequences or log them, and the result of |
|
Perhaps this case justifies an API change ? |
Only |
True. Almost nobody uses the API. I've not used these yet neither.
It's ok, but I feel like this is inconsistent. An invalid rune like U+FFFF can come to |
Yes, you're right. So we would have an undefined behavior in case someone uses |
|
Updated this PR. |
This was my understanding but U+FFFF is treated as a one valid rune in Go. An invalid rune is for example U+7FFFFFFF. https://go.dev/play/p/-Axax5XcldD |
benoitkugler
left a comment
There was a problem hiding this comment.
Thank you for the simplification, I'm really happy with this change !
… Word Track UTF-8 byte positions alongside rune positions in the attribute iterator, and expose them as OffsetInBytes and LengthInBytes fields on Grapheme, Line, and Word structs. This allows users to efficiently extract segments from byte slices or strings without O(n) conversion. Fixes go-text#240
whereswaldon
left a comment
There was a problem hiding this comment.
Thank you for implementing this. Looks reasonable to me!
… Word (go-text#241) Track UTF-8 byte positions alongside rune positions in the attribute iterator, and expose them as OffsetInBytes and LengthInBytes fields on Grapheme, Line, and Word structs. This allows users to efficiently extract segments from byte slices or strings without O(n) conversion. Fixes go-text#240
Track UTF-8 byte positions alongside rune positions in the attribute iterator, and expose them as OffsetInBytes and LengthInBytes fields on Grapheme, Line, and Word structs. This allows users to efficiently extract segments from byte slices or strings without O(n) conversion.
Fixes #240