Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle surrogate pair for unicode >= U+10000 #616

Closed
xianpingge opened this issue Dec 15, 2016 · 7 comments
Closed

Handle surrogate pair for unicode >= U+10000 #616

xianpingge opened this issue Dec 15, 2016 · 7 comments

Comments

@xianpingge
Copy link

Bug:
mintty displays unicode characters >= U+10000 as a blank box. E.g.,
to display the music symbol G clef U+1D11E (𝄞) :

echo "printf '\U1d11e'; read dummy" > hello.sh; chmod +x hello.sh
mintty.exe -o 'Font=Segoe UI Symbol' ./hello.sh

Diagnosis:

In wintext.c , win_text() represents U+1D11E as 2 wchar's (i.e., a surrogate pair,
https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates ).**
However, the pair is broken up into 2 halves when calling text_out().

Proposed fix:

Check each character to see whether it consists of 1 wchar or 2 wchar's,
and call text_out() accordingly.
See the attached patch.

wintext-c-surrogate-pair-patch.txt

@mintty
Copy link
Owner

mintty commented Dec 16, 2016

In general, mintty does handle surrogate pairs, but there may be a gap in the code for handling combining characters, which might hit also non-combining characters in certain cases. To be checked.
Did you check your patch actually fixes display for you? Which font do you use?

@mintty
Copy link
Owner

mintty commented Dec 16, 2016

Checking again, I don't think the musical symbol alone can trigger the loop for the combining case that you patched (did you trace?). Although you have made a valid point, I guess your symptom is more likely explained by missing font support. Did you try option FontRender=uniscribe by the way?
In any case, for further analysis, I'd need to be able to display that symbol at all, e.g. in notepad at least, so I need a font that includes it.

@mintty
Copy link
Owner

mintty commented Dec 16, 2016

Somehow (to be checked), all non-BMP characters seem to be getting handled by the combining characters code branch, so your patch does indeed stabilise non-BMP display (would have worked a while, than break mysteriously). Changing and further investigating.

@xianpingge
Copy link
Author

Thanks for looking into this issue.

I tested my patch and it fixed the display for me. I used the Windows 10's seguisym.ttf ( -o 'Font=Segoe UI Symbol' ) which covers, among other characters, both latin1 and the music symbol G clef U+1D11E.
Here are some other fonts that include U+1D11E:

NotoSansSymbols-Regular.ttf ( -o 'Font=Noto Sans Symbols' ), available at https://www.google.com/get/noto/

HanaMinA.ttf ( -o 'Font=HanaMinA' ), available at
http://fonts.jp/hanazono/
(BTW, HanaMinA.ttf and HanaMinB.ttf from this site cover CJK, CJK Ext-A, Ext-B, Ext-C, Ext-D, and Ext-E.)

Per your instruction, I also tested the option '-o FontRender=uniscribe'; the music G clef symbol was
displayed before my patch. I also tested this option with some CJK characters outside of BMP;
they were displayed, but they seemed to be overlaid with the fallback blank-boxes.
Attached please find the screenshots:

For the music G clef symbol tests:

11-clef-before-patch
12-clef-after-patch
13-clef-uniscribe-before-patch
14-clef-uniscribe-after-patch

For the CJK characters tests:

21-cjk-before-patch
22-cjk-after-patch
23-cjk-uniscribe-before-patch
24-cjk-uniscribe-after-patch

@mintty
Copy link
Owner

mintty commented Dec 18, 2016

For the records, and in case you're interested:
The root cause was that all non-BMP characters are handled like combining characters during processing, mainly in termline.c. As there certainly was a good reason for this, I'm not going to change it.
But the effect of this internal handling to get the TATTR_COMBING flag set for output processing can be fixed with the patch below. I guess this is the more proper fix for the issue; either of the patches actually fix it, and your patch is correct anyway.
My patch is not yet complete, though, as it does not consider that there are also combining non-BMP characters. Some revision of width handling code will be necessary to complete this.

--- a/src/term.c
+++ b/src/term.c
@@ -1221,7 +1221,8 @@ term_paint(void)
           else {
             textattr[textlen] = dd->attr;
             text[textlen++] = dd->chr;
-            attr.attr |= TATTR_COMBINING;
+            if ((dd->chr & 0xFC00) != 0xDC00)
+              attr.attr |= TATTR_COMBINING;
           }
         }
       }

@mintty
Copy link
Owner

mintty commented Dec 21, 2016

Eliminated one remaining quirk in surrogate handling (they were not considered for the bounding rectangle when drawing the character, yielding blanks and cursor artefacts especially when editing).
Please check the current repository version.

@mintty
Copy link
Owner

mintty commented Dec 22, 2016

Fixed in 2.7.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants