misc/codepoint_width: handle partially ill-formed UTF-8 #17792
Merged
Previously the function just bailed on invalid input. This change instead makes it count how many replacement characters would be shown by a terminal that follows the Unicode specification's recommendation here: https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G66453
Fixes #17773 (comment)
I wasn't sure if it's better to go through all the call sites of `bstr_decode_utf8` to adjust them to this `bstr *` change, so I just split out an inner function, but there are fewer than 10 call sites, so I could also change the behavior of `bstr_decode_utf8` itself.

Now whether this is correct: it seems to work and the examples from the spec pass. I also found https://hsivonen.fi/broken-utf-8/ while researching, and that post links a test page that I quickly converted into assertions for the test here: afishhh@a7d4080. Those also pass, but I left them out since there are a lot of them and I thought it might be overkill.
My terminal (kitty) also passes these tests, don't know about others.
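To make the counting rule concrete, here is a minimal standalone sketch of the "one U+FFFD per maximal subpart" practice that the linked section of the spec recommends. It is not the code from this PR: `count_replacements` is a hypothetical name, it works on a raw byte buffer rather than a `bstr`, and it only counts replacements without computing any display width.

```c
#include <stddef.h>

// Hypothetical helper, not mpv's implementation: counts how many U+FFFD
// replacement characters a decoder would emit for `len` bytes at `s` when
// it substitutes one U+FFFD per maximal subpart of an ill-formed sequence.
size_t count_replacements(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i++];
        if (b < 0x80)
            continue; // ASCII, always well-formed

        int need;                           // continuation bytes still expected
        unsigned char lo = 0x80, hi = 0xBF; // allowed range for the next byte

        if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0; // rejects overlong 3-byte forms
            if (b == 0xED) hi = 0x9F; // rejects UTF-16 surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90; // rejects overlong 4-byte forms
            if (b == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
        } else {
            // 80..C1 and F5..FF can never start a well-formed sequence,
            // so the byte is replaced on its own.
            count++;
            continue;
        }

        // Consume continuation bytes for as long as they keep the sequence
        // a valid prefix; only the second byte has a restricted range.
        while (need > 0 && i < len && s[i] >= lo && s[i] <= hi) {
            i++;
            need--;
            lo = 0x80;
            hi = 0xBF;
        }

        // Stopped early (bad byte or end of input): the bytes consumed so
        // far form one maximal subpart and yield a single U+FFFD. The
        // offending byte is left in place and re-examined as a new lead.
        if (need > 0)
            count++;
    }
    return count;
}
```

For the spec's example byte sequence `61 F1 80 80 E1 80 C2 62 80 63 80 BF 64`, this sketch returns 6, matching the six U+FFFD characters in the spec's expected output.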
Can also confirm `\xff+aaaaaaaaaaaaaaa...` in `term-status-msg` no longer fills the terminal with junk.