Miscellaneous Unicode bugfixes and optimizations #338

steven676 · 2014-06-09T03:16:49Z

This patch series contains various bug fixes for combining character and East Asian wide character support, notably:

The second crash reported in #145
Text is rendered incorrectly (characters are drawn on top of one another) if the cursor is moved into the middle of a line with many East Asian wide characters; in extreme cases, the cursor can disappear entirely (regression caused by 9f1d71b and not addressed by e69a1b4)
Combining diacritics in the last column of the screen or at the end of a selection weren't displayed

There are also additional test cases for Unicode support and a couple of small performance improvements.

Lightly tested on various emulators; in addition, most of these patches have been running for some time on a phone running 4.4 without problems.

These are a set of specially-crafted text files designed to exercise most of the corner cases in the Unicode support code, particularly FullUnicodeLine.setChar()'s support for overwriting screen columns with changes in sequence lengths and display widths. For best (worst?) results, use the tests in vttest mode; the lines of exactly 80 columns will be more likely to uncover bugs that way.

…quences The Unicode standard requires that, when dealing with ill-formed UTF-8 (version 6.2, page 96): If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes [...], it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence. This implies that when we hit a byte in the input stream which cannot fit into the sequence currently being decoded, we must attempt to decode that byte again after resetting our decoder state.

We currently assume that the column we're inserting a char into is the start of a display character, and compute the length of the subsequence of mText storing the column's contents accordingly. This breaks, however, when inserting into the second column spanned by an East Asian wide character. Detect this case and handle it specially when we need to find the start of the next independent column's contents (such as when computing the length of the existing sequence stored in this column). Also, to preserve column alignment, pad the column before with a space; if the character being inserted is a wide character, clobber the next column's contents too. Fixes the second coming of #145, and potentially other difficult-to-reproduce bugs concerning East Asian wide character support as well.

At the moment, TranscriptScreen's drawTextRun() tracks the screen column corresponding to the character it's working on at the moment by updating the column after processing each character; this results in combining characters being considered part of the column after the one they should apply to. For most purposes, this does not matter -- Android's own text drawing routines don't care about our column count, so the combining characters will be drawn with the correct base character. However, there are corner cases where this causes problems: * Combining characters in the last screen column will be ignored entirely (after processing the base character, the column count exceeds the number of columns, so processing stops) * If the last column of a text run (sequence with the same style, or a selection range) has combining characters, they will be erroneously drawn outside the run (the run ends after the base character, so the combining characters aren't passed to the text drawing routine as part of the run) Instead, update a next-column counter after processing each character, and use it to set the column count once we hit the next spacing character. This requires us to add array bounds checking to the loop, as it's now possible to fall off the end of the array looking for the next column; that has the side effect of papering over many possible crash bugs involving line buffer corruption.

For use in vttest mode.

…deLine getLine() is called frequently during screen drawing and other operations, usually with a request for the contents of an entire line. We have a fast path for this case with basic lines; something similar works for FullUnicodeLines, so let's do that too. Also, update the documentation to reflect the fact that the array returned from getLine() will not be null-terminated if the array size exactly matches the length of the requested text.

While modern Korean is typically written using precomposed syllable blocks, Unicode also provides individual conjoining Hangul syllables (jamo) which are supposed to combine to form complete syllable blocks. Since at least Android 4.1, the Android text rendering services handle conjoining jamo correctly -- but in order for a syllable block to be rendered correctly, we need to pass its components to the text renderer as a unit. The easiest way to achieve this is to treat medial vowels and final consonants as combining characters; that way, UnicodeTranscript will automatically store them in the same column along with the initial consonant, and all our text-processing code will treat the syllable block as a unit. This behavior is strictly incorrect (isolated individual medials and finals shouldn't behave as combining characters), but it's the easiest way to get conjoining jamo working in the common cases. On platforms where we don't trust Android to render conjoining jamo correctly, we instead treat medials and finals as regular spacing characters of width 2. This is also strictly incorrect (the Unicode character database says medials and finals are East Asian neutral/narrow), but it's the best match for the way Android renders the individual jamo.

Text runs beginning or ending with East Asian wide characters can currently behave strangely when selected: * An East Asian wide character immediately before the displayed selection is sometimes copied along with the highlighted text, despite not being highlighted; * An East Asian wide character at the end of the displayed selection is sometimes omitted from the copied text, despite being highlighted. Both of these problems have the same root cause: EmulatorView's reporting of selection bounds does not take the width of the underlying character into account, even though the drawing of the selection highlight does. In other words, it's possible for EmulatorView to ask for half of an East Asian wide character to be selected, but the highlight won't reflect this (either the entire character is highlighted, or none of it is). Preventing EmulatorView from reporting selection bounds like this would require making the text selection routines aware of the underlying characters they operate on (as opposed to just working from screen positions). Instead: * Have UnicodeTranscript include half-selected East Asian wide characters at the end of a selection in copied text. * Ensure that a half-selected East Asian wide character at the beginning of a selection is highlighted. * Adjust UnicodeTranscript.getLineColor() to ensure that the number of columns of color information returned in a partial line always matches the number of columns in the text that getLine() would return. * Ensure that TranscriptScreen.internalGetTranscriptText() doesn't truncate the returned text when there are half-selected East Asian wide characters at both the beginning and end of the selection range. This requires us to avoid bounds checking based on selection width (the naive computation based on screen positions would be off by two, omitting the last character, and we lack the information here to do better); using a try-catch clause to replace the bounds check is likely to paper over some crash bugs involving line buffer corruption. This ensures a consistent user experience: half-selected East Asian wide characters are always highlighted and always included in the copied text, whether at the beginning or end of the selection.

* PaintRenderer.drawTextRun(): compute cursor visibility solely in units of screen position instead of mixing screen positions and indexes into the line buffer; in a long run of East Asian wide characters, this ensures correct placement of text after the cursor and prevents the cursor from disappearing midway through the line. * PaintRenderer.drawTextRun(): do not assume the display character under the cursor contains only one UTF-16 code unit; this fixes rendering of non-BMP characters and combining diacritics at the cursor. * TranscriptScreen.drawText(): make cursorIncr reflect all code units at the cursor position, including combining diacritics.

jackpal · 2014-06-09T04:11:34Z

Thanks! Very nice patch series, as always!

steven676 added 10 commits June 8, 2014 14:50

TermSession: add a fast path for ASCII in write(int codePoint)

c903ed9

Add test of combining characters at the end of a line

09ead3a

For use in vttest mode.

jackpal merged commit a2b2af1 into jackpal:master Jun 9, 2014

agnostic-apollo mentioned this pull request Feb 22, 2024

Fixed: Widen TerminalRow's index so very long lines don't overflow it. termux/termux-app#3840

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Miscellaneous Unicode bugfixes and optimizations #338

Miscellaneous Unicode bugfixes and optimizations #338

steven676 commented Jun 9, 2014

jackpal commented Jun 9, 2014

Miscellaneous Unicode bugfixes and optimizations #338

Miscellaneous Unicode bugfixes and optimizations #338

Conversation

steven676 commented Jun 9, 2014

jackpal commented Jun 9, 2014