Commit 81ef4f4

Replace unsupported whitespace with space glyph
This follows the word of the Unicode standard:

  http://unicode.org/faq/unsup_char.html

"""
Q: Which characters should be displayed as a visible but blank space?

A: This is the easy one: all the characters that have the White_Space property, also generically known as “whitespace characters”. This set includes SPACE, of course, but also such characters as the tab control character, NO-BREAK SPACE, LINE SEPARATOR, and so on. For the full list, see the White_Space values in PropList.txt.
"""

However, I'm not sure we want to do it this way. Note that White_Space, as of Unicode 7.0, includes:

  $ grep '; White_Space' PropList.txt
  0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
  0020          ; White_Space # Zs       SPACE
  0085          ; White_Space # Cc       <control-0085>
  00A0          ; White_Space # Zs       NO-BREAK SPACE
  1680          ; White_Space # Zs       OGHAM SPACE MARK
  2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
  2028          ; White_Space # Zl       LINE SEPARATOR
  2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
  202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
  205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
  3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

That's in fact all of GC=Zs/Zp/Zl plus U+0009..000D and U+0085.

Of those, all the GC=Zs ones have a compatibility decomposition to space already, so they were getting this treatment already, with the benefit that clients could override that fallback by overriding the decompose_compatibility() function, and in fact LibreOffice already does that. If we commit this change, clients won't be able to override that anymore.

So this change is essentially about ASCII control chars 9..D and U+0085 NEL as well as U+2028/U+2029 LINE/PARAGRAPH SEPARATOR. Perhaps I should limit this change to just those?

My personal feeling is that those characters are better either always rendered as space or never rendered as space. Relying on whether the font supports them is only one particular reading of the Unicode standard.

Unicode says to show a space if "the rendering system doesn't fully support them". We can also read this as "if the client did indeed pass them to HarfBuzz". I think I like that reading for the newline-like characters.
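The set discussed in the message can be written down as a small predicate. The following is only a sketch transcribed from the PropList.txt listing above; the function names `is_white_space_7_0` and `lacks_space_decomposition` are hypothetical and are not part of HarfBuzz. The second function encodes the message's own claim about which characters the change actually affects, not an independently verified property.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a HarfBuzz API): true if `u` is in the
// White_Space set quoted from PropList.txt for Unicode 7.0.
constexpr bool
is_white_space_7_0 (uint32_t u)
{
  return (u >= 0x0009u && u <= 0x000Du) ||
         u == 0x0020u || u == 0x0085u || u == 0x00A0u || u == 0x1680u ||
         (u >= 0x2000u && u <= 0x200Au) ||
         u == 0x2028u || u == 0x2029u || u == 0x202Fu || u == 0x205Fu ||
         u == 0x3000u;
}

// Hypothetical helper: the subset the commit message says the change is
// "essentially about", i.e. the White_Space characters that are not GC=Zs.
constexpr bool
lacks_space_decomposition (uint32_t u)
{
  return (u >= 0x0009u && u <= 0x000Du) || u == 0x0085u ||
         u == 0x2028u || u == 0x2029u;
}

// Compile-time sanity checks against the listing.
static_assert (is_white_space_7_0 (0x00A0u), "NO-BREAK SPACE is listed");
static_assert (!is_white_space_7_0 (0x200Bu), "U+200B is not in the listing");
static_assert (lacks_space_decomposition (0x2028u), "LINE SEPARATOR is GC=Zl");
```

Limiting the change "to just those", as the message suggests, would amount to testing the second predicate instead of the first.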
1 parent 1eff435 commit 81ef4f4

2 files changed: +15 -1 lines changed


src/hb-ot-shape-normalize.cc

Lines changed: 10 additions & 1 deletion
@@ -100,6 +100,13 @@ compose_unicode (const hb_ot_shape_normalize_context_t *c,
   return c->unicode->compose (a, b, ab);
 }

+static inline bool
+is_whitespace (const hb_glyph_info_t &info)
+{
+  return HB_UNICODE_GENERAL_CATEGORY_IS_SEPARATOR (_hb_glyph_info_get_general_category (&info)) ||
+         hb_in_range (info.codepoint, 0x0009u, 0x000Du) || info.codepoint == 0x0085u;
+}
+
 static inline void
 set_glyph (hb_glyph_info_t &info, hb_font_t *font)
 {
@@ -198,7 +205,7 @@ decompose_current_character (const hb_ot_shape_normalize_context_t *c, bool shor
 {
   hb_buffer_t * const buffer = c->buffer;
   hb_codepoint_t u = buffer->cur().codepoint;
-  hb_codepoint_t glyph;
+  hb_codepoint_t glyph, space_glyph;

   /* Kind of a cute waterfall here... */
   if (shortest && c->font->get_glyph (u, 0, &glyph))
@@ -209,6 +216,8 @@ decompose_current_character (const hb_ot_shape_normalize_context_t *c, bool shor
     next_char (buffer, glyph);
   else if (decompose_compatibility (c, u))
     skip_char (buffer);
+  else if (is_whitespace (buffer->cur()) && c->font->get_glyph (0x0020u, 0, &space_glyph))
+    next_char (buffer, space_glyph); /* http://unicode.org/faq/unsup_char.html */
   else
     next_char (buffer, glyph); /* glyph is initialized in earlier branches. */
 }
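
The new branch in the waterfall can be illustrated outside HarfBuzz with a toy font. This is a minimal sketch under stated assumptions: `toy_font_t`, `is_control_whitespace`, and `choose_glyph` are hypothetical names, the decomposition branches of the real function are omitted, and the whitespace test covers only the codepoint ranges the patch checks directly (the patch also accepts GC=Zs/Zl/Zp via the general category).

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a font's character map; not hb_font_t.
struct toy_font_t
{
  std::map<uint32_t, uint32_t> cmap; // codepoint -> glyph id

  bool get_glyph (uint32_t u, uint32_t *glyph) const
  {
    auto it = cmap.find (u);
    if (it == cmap.end ())
      return false;
    *glyph = it->second;
    return true;
  }
};

// Only the controls the patch tests by codepoint: U+0009..000D and U+0085.
static bool
is_control_whitespace (uint32_t u)
{
  return (u >= 0x0009u && u <= 0x000Du) || u == 0x0085u;
}

// Simplified waterfall: direct lookup first, then the space-glyph
// fallback for unsupported whitespace, then glyph 0 (.notdef).
static uint32_t
choose_glyph (const toy_font_t &font, uint32_t u)
{
  uint32_t glyph = 0;
  if (font.get_glyph (u, &glyph))
    return glyph;
  if (is_control_whitespace (u) && font.get_glyph (0x0020u, &glyph))
    return glyph;
  return 0;
}
```

With a font that maps only U+0020 and U+0041, `choose_glyph` returns the space glyph for U+000A and U+0085, and glyph 0 for any other unmapped codepoint. If the font has no space glyph either, the fallback does not apply, which mirrors the `&&` in the patched condition.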

src/hb-unicode-private.hh

Lines changed: 5 additions & 0 deletions
@@ -313,5 +313,10 @@ extern HB_INTERNAL const hb_unicode_funcs_t _hb_unicode_funcs_nil;
            FLAG (HB_UNICODE_GENERAL_CATEGORY_ENCLOSING_MARK) | \
            FLAG (HB_UNICODE_GENERAL_CATEGORY_NON_SPACING_MARK)))

+#define HB_UNICODE_GENERAL_CATEGORY_IS_SEPARATOR(gen_cat) \
+        (FLAG (gen_cat) & \
+         (FLAG (HB_UNICODE_GENERAL_CATEGORY_LINE_SEPARATOR) | \
+          FLAG (HB_UNICODE_GENERAL_CATEGORY_PARAGRAPH_SEPARATOR) | \
+          FLAG (HB_UNICODE_GENERAL_CATEGORY_SPACE_SEPARATOR)))


 #endif /* HB_UNICODE_PRIVATE_HH */
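
The macro above tests category membership with a bitmask: each category becomes one bit, and a single AND against the OR of the accepted bits answers the question. The sketch below shows the same pattern with a toy enum. The names `toy_category_t`, `toy_flag`, and `toy_is_separator` are hypothetical, and the enum values are illustrative only, not HarfBuzz's actual general-category values.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative categories; the numbering is an assumption made for this
// sketch and does not match hb_unicode_general_category_t.
enum toy_category_t
{
  TOY_CONTROL,
  TOY_LETTER,
  TOY_LINE_SEPARATOR,
  TOY_PARAGRAPH_SEPARATOR,
  TOY_SPACE_SEPARATOR
};

// One bit per category, in the spirit of the FLAG() macro used above.
constexpr uint32_t
toy_flag (toy_category_t c)
{
  return 1u << static_cast<unsigned> (c);
}

// Membership test: a single AND against the mask of separator bits.
constexpr bool
toy_is_separator (toy_category_t c)
{
  return (toy_flag (c) &
          (toy_flag (TOY_LINE_SEPARATOR) |
           toy_flag (TOY_PARAGRAPH_SEPARATOR) |
           toy_flag (TOY_SPACE_SEPARATOR))) != 0;
}

static_assert (toy_is_separator (TOY_SPACE_SEPARATOR), "Zs-like is a separator");
static_assert (!toy_is_separator (TOY_LETTER), "letters are not separators");
```

This only works while every category value is smaller than the bit width of the mask type, which is why the sketch shifts an unsigned 32-bit one.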
