Skip to content
This repository has been archived by the owner on Jan 14, 2022. It is now read-only.

Miscellaneous Unicode bugfixes and optimizations #338

Merged
merged 10 commits into from Jun 9, 2014

Conversation

steven676
Copy link
Contributor

This patch series contains various bug fixes for combining character and East Asian wide character support, notably:

  • The second crash reported in #145
  • Text is rendered incorrectly (characters are drawn on top of one another) if the cursor is moved into the middle of a line with many East Asian wide characters; in extreme cases, the cursor can disappear entirely (regression caused by 9f1d71b and not addressed by e69a1b4)
  • Combining diacritics in the last column of the screen or at the end of a selection weren't displayed

There are also additional test cases for Unicode support and a couple of small performance improvements.

Lightly tested on various emulators; in addition, most of these patches have been running for some time on a phone running 4.4 without problems.

These are a set of specially-crafted text files designed to exercise
most of the corner cases in the Unicode support code, particularly
FullUnicodeLine.setChar()'s support for overwriting screen columns with
changes in sequence lengths and display widths.  For best (worst?)
results, use the tests in vttest mode; the lines of exactly 80 columns
will be more likely to uncover bugs that way.
…quences

The Unicode standard requires that, when dealing with ill-formed UTF-8
(version 6.2, page 96):

    If the converter encounters an ill-formed UTF-8 code unit sequence
    which starts with a valid first byte, but which does not continue
    with valid successor bytes [...], it must not consume the successor
    bytes as part of the ill-formed subsequence whenever those successor
    bytes themselves constitute part of a well-formed UTF-8 code unit
    subsequence.

This implies that when we hit a byte in the input stream which cannot
fit into the sequence currently being decoded, we must attempt to decode
that byte again after resetting our decoder state.
We currently assume that the column we're inserting a char into is the
start of a display character, and compute the length of the subsequence
of mText storing the column's contents accordingly.  This breaks,
however, when inserting into the second column spanned by an East Asian
wide character.

Detect this case and handle it specially when we need to find the start
of the next independent column's contents (such as when computing the
length of the existing sequence stored in this column).  Also, to
preserve column alignment, pad the column before with a space; if the
character being inserted is a wide character, clobber the next column's
contents too.

Fixes the second coming of #145, and potentially other
difficult-to-reproduce bugs concerning East Asian wide character support
as well.
At the moment, TranscriptScreen's drawTextRun() tracks the screen column
corresponding to the character it's working on at the moment by updating
the column after processing each character; this results in combining
characters being considered part of the column after the one they should
apply to.

For most purposes, this does not matter -- Android's own text drawing
routines don't care about our column count, so the combining characters
will be drawn with the correct base character.  However, there are
corner cases where this causes problems:

* Combining characters in the last screen column will be ignored
  entirely (after processing the base character, the column count
  exceeds the number of columns, so processing stops)
* If the last column of a text run (sequence with the same style, or a
  selection range) has combining characters, they will be erroneously
  drawn outside the run (the run ends after the base character, so
  the combining characters aren't passed to the text drawing routine as
  part of the run)

Instead, update a next-column counter after processing each character,
and use it to set the column count once we hit the next spacing
character.  This requires us to add array bounds checking to the loop,
as it's now possible to fall off the end of the array looking for the
next column; that has the side effect of papering over many possible
crash bugs involving line buffer corruption.
…deLine

getLine() is called frequently during screen drawing and other
operations, usually with a request for the contents of an entire line.
We have a fast path for this case with basic lines; something similar
works for FullUnicodeLines, so let's do that too.

Also, update the documentation to reflect the fact that the array
returned from getLine() will not be null-terminated if the array size
exactly matches the length of the requested text.
While modern Korean is typically written using precomposed syllable
blocks, Unicode also provides individual conjoining Hangul syllables
(jamo) which are supposed to combine to form complete syllable blocks.
Since at least Android 4.1, the Android text rendering services handle
conjoining jamo correctly -- but in order for a syllable block to be
rendered correctly, we need to pass its components to the text renderer
as a unit.

The easiest way to achieve this is to treat medial vowels and final
consonants as combining characters; that way, UnicodeTranscript will
automatically store them in the same column along with the initial
consonant, and all our text-processing code will treat the syllable
block as a unit.  This behavior is strictly incorrect (isolated
individual medials and finals shouldn't behave as combining characters),
but it's the easiest way to get conjoining jamo working in the common
cases.

On platforms where we don't trust Android to render conjoining jamo
correctly, we instead treat medials and finals as regular spacing
characters of width 2.  This is also strictly incorrect (the Unicode
character database says medials and finals are East Asian
neutral/narrow), but it's the best match for the way Android renders
the individual jamo.
Text runs beginning or ending with East Asian wide characters can
currently behave strangely when selected:

* An East Asian wide character immediately before the displayed
  selection is sometimes copied along with the highlighted text, despite
  not being highlighted;
* An East Asian wide character at the end of the displayed selection is
  sometimes omitted from the copied text, despite being highlighted.

Both of these problems have the same root cause: EmulatorView's
reporting of selection bounds does not take the width of the underlying
character into account, even though the drawing of the selection
highlight does.  In other words, it's possible for EmulatorView to ask
for half of an East Asian wide character to be selected, but the
highlight won't reflect this (either the entire character is
highlighted, or none of it is).

Preventing EmulatorView from reporting selection bounds like this would
require making the text selection routines aware of the underlying
characters they operate on (as opposed to just working from screen
positions).  Instead:

* Have UnicodeTranscript include half-selected East Asian wide characters
  at the end of a selection in copied text.
* Ensure that a half-selected East Asian wide character at the beginning
  of a selection is highlighted.
* Adjust UnicodeTranscript.getLineColor() to ensure that the number of
  columns of color information returned in a partial line always matches
  the number of columns in the text that getLine() would return.
* Ensure that TranscriptScreen.internalGetTranscriptText() doesn't
  truncate the returned text when there are half-selected East Asian
  wide characters at both the beginning and end of the selection range.
  This requires us to avoid bounds checking based on selection width
  (the naive computation based on screen positions would be off by two,
  omitting the last character, and we lack the information here to do
  better); using a try-catch clause to replace the bounds check is
  likely to paper over some crash bugs involving line buffer corruption.

This ensures a consistent user experience: half-selected East Asian wide
characters are always highlighted and always included in the copied
text, whether at the beginning or end of the selection.
* PaintRenderer.drawTextRun(): compute cursor visibility solely
  in units of screen position instead of mixing screen positions and
  indexes into the line buffer; in a long run of East Asian wide
  characters, this ensures correct placement of text after the cursor
  and prevents the cursor from disappearing midway through the line.
* PaintRenderer.drawTextRun(): do not assume the display character under
  the cursor contains only one UTF-16 code unit; this fixes rendering of
  non-BMP characters and combining diacritics at the cursor.
* TranscriptScreen.drawText(): make cursorIncr reflect all code units at
  the cursor position, including combining diacritics.
@jackpal jackpal merged commit a2b2af1 into jackpal:master Jun 9, 2014
@jackpal
Copy link
Owner

jackpal commented Jun 9, 2014

Thanks! Very nice patch series, as always!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants