Remove variation selector workaround #682

rolandwalker · 2017-07-21T01:22:26Z

Remove workaround for VARIATION SELECTOR-16 which was added in #679, and move failing part of test to a TODO. This is accomplished in part by adding a --subshell flag to test_case.

About variation selectors in general:

- A VS changes the character(s) before it, like a combining character.
- But a combining character can be composed with any other character, whereas each possible VS modification is a rule defined separately and explicitly in Unicode.
- When a modification succeeds, it should be clear that from the POV of unicode_width() the VS "disappears" and has width zero.
- When a VS is ineffective, the Unicode spec also defines it as having width zero (http://unicode.org/faq/unsup_char.html#3).
- Due to the complexity of VS sequences, unicode_width() simply can't get all of the cases correct (and tig shouldn't worry much about that). This is because unicode_width() statelessly considers a single codepoint, but VS sequences are of varying length.
- Example of a complex VS sequence: PERSON WITH BLOND HAIR, EMOJI MODIFIER FITZPATRICK TYPE-5, ZERO WIDTH JOINER, MALE SIGN, VARIATION SELECTOR-16, which together form the single emoji 👱🏾‍♂️.
- Yes, it is nuts that a glyph can be specified by a variable-length sequence of codepoints, and that the codepoints themselves may be given by a variable-length UTF-8 encoding.

edit: the reason that an approach like wcwidth() can still get the job done is that sequences are artfully defined: in the one above only 2 of the codepoints have width, and the final glyph is meant to correspond to the sum of the widths. But inevitably there are edge cases and rendering issues.
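
To make the "stateless, one codepoint at a time" point concrete, here is a minimal Python sketch. It is not tig's unicode_width() or libc's wcwidth(): `codepoint_width` and `string_width` are hypothetical names, and the width rules (zero for marks and format characters, two for East Asian Wide/Fullwidth, one otherwise) are simplifying assumptions built on Python's `unicodedata` tables. The per-codepoint values it prints may differ from what a given libc or terminal reports.

```python
import unicodedata

# Assumption: these general categories are treated as zero width.
# Mn = non-spacing mark, Me = enclosing mark, Cf = format character.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def codepoint_width(ch: str) -> int:
    """Width of one codepoint, with no knowledge of its neighbours."""
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        # Covers VARIATION SELECTOR-16 (Mn) and ZERO WIDTH JOINER (Cf).
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(text: str) -> int:
    """Sum the per-codepoint widths; a sequence is never treated as a unit."""
    return sum(codepoint_width(ch) for ch in text)


if __name__ == "__main__":
    # The sequence from the description, one escape per codepoint.
    sequence = "\U0001F471\U0001F3FE\u200D\u2642\uFE0F"
    for ch in sequence:
        name = unicodedata.name(ch, "?")
        print(f"U+{ord(ch):04X} {name:<40} width={codepoint_width(ch)}")
    print("summed width:", string_width(sequence))
```

Whatever table is used, the selector itself contributes nothing to the sum, so appending VARIATION SELECTOR-16 to a string never changes the computed width. Whether the summed value matches what the terminal actually draws is the part a stateless function cannot guarantee.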


jonas · 2017-07-26T22:42:08Z

So one fix might be to keep a context between calls to unicode_width()? I agree that it is better to remove the workaround and keep the test failure as a future improvement.
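
For illustration only, keeping a context between calls could look roughly like the Python sketch below. Everything here is hypothetical: `WidthContext`, `feed`, and `contextual_width` are invented names, and the three joining rules are deliberately crude assumptions rather than the full Unicode emoji sequence rules. It also assumes the renderer draws a joined sequence as one glyph with the width of its first codepoint.

```python
import unicodedata

ZERO_WIDTH_JOINER = "\u200D"
# Assumption: marks and format characters never occupy a column.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def is_emoji_modifier(ch: str) -> bool:
    """True for the Fitzpatrick skin tone modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


class WidthContext:
    """Remembers the previous codepoint between width queries."""

    def __init__(self) -> None:
        self.prev = None  # no codepoint seen yet

    def feed(self, ch: str) -> int:
        """Return the columns this codepoint adds, given what came before."""
        prev, self.prev = self.prev, ch
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            return 0  # VS-16, ZWJ, combining marks
        if prev == ZERO_WIDTH_JOINER:
            return 0  # joined into the preceding glyph
        if prev is not None and is_emoji_modifier(ch):
            # Simplification: any preceding codepoint is treated as a valid base.
            return 0
        return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def contextual_width(text: str) -> int:
    """Width of a string using one fresh context for the whole string."""
    ctx = WidthContext()
    return sum(ctx.feed(ch) for ch in text)
```

Under these assumptions the five-codepoint emoji sequence from the description gets the same width as its first codepoint alone. The cost is that every caller has to thread the context through, and the rules have to track Unicode revisions.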

rolandwalker · 2017-07-26T23:44:05Z

The bullets were meant to argue against chasing perfection. unicode_width() is efficient. If deeper knowledge about Unicode strings were really needed, a library would be better.

And as to this bug, I suspect that the locus of the issue will turn out to be in glibc or similar.

rolandwalker added 6 commits July 20, 2017 20:55

shorten TODO message

8e6c2d2

simplify restore-after-test_tig with subshell

6695edb

add --subshell= flag to test_case

cdcdf84

reduce emoji test to case that always passes

32da8da

by not including VARIATION SELECTOR-16 in tested outputs

remove hardcoded workaround from unicode_width

26206d1

reintroduce failing unicode test as a TODO

76b0934

plus one other related TODO

rolandwalker force-pushed the rm-var-sel-workaround branch from c4bbbde to 76b0934 Compare July 21, 2017 10:42

jonas merged commit 7e9afc7 into jonas:master Jul 26, 2017

rolandwalker deleted the rm-var-sel-workaround branch July 26, 2017 23:44

jonas mentioned this pull request Dec 11, 2017

Does not display some Unicode characters #747

Closed
