readline: fix character width calculation #13918

TimothyGu · 2017-06-26T06:46:44Z

Fixes width calculation of non-spacing marks, commonly seen in Unicode Normalization Form D. Example: 'a\u0301' ('á'.normalize('NFD')), 'ру́сский язы́к' (Unicode doesn't have many precomposed accented Cyrillic letters).
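
To make the examples above concrete, here is a minimal sketch in plain JavaScript. It is not the code touched by this PR: `countCodeUnits` and `countBaseCodePoints` are hypothetical helpers, and the sketch assumes a runtime that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper: the naive measure, one column per UTF-16 code unit.
function countCodeUnits(str) {
  return str.length;
}

// Hypothetical helper: one column per code point, skipping nonspacing (Mn)
// and enclosing (Me) marks. East Asian wide characters are ignored here.
function countBaseCodePoints(str) {
  let columns = 0;
  for (const ch of str) {
    if (!/^[\p{Mn}\p{Me}]$/u.test(ch)) columns++;
  }
  return columns;
}

// NFD splits the precomposed letter into 'a' followed by U+0301.
const decomposed = '\u00e1'.normalize('NFD');
// Cyrillic text with two combining acute accents (U+0301).
const russian = 'ру\u0301сский язы\u0301к';

console.log(countCodeUnits(decomposed));      // 2
console.log(countBaseCodePoints(decomposed)); // 1
console.log(countBaseCodePoints(russian));    // 12
```

A cursor that advances by the first measure ends up one column too far to the right for every combining mark in the line.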

Outdated information

Not sure where to add tests for this feature though. readline has the following tests:

test-readline-csi.js
test-readline-emit-keypress-events.js
test-readline-interface.js
test-readline.js
test-readline-keys.js
test-readline-reopen.js
test-readline-set-raw-mode.js
test-readline-undefined-columns.js
test-icu-stringwidth.js

None of them seems to fit this bug, which lies in the glue between getStringWidth and readline; hence the WIP.

The second commit changes how widths of certain characters are determined:

Categorize all nonspacing marks (Mn) and enclosing marks (Me) as 0-width.
Categorize all spacing marks (Mc) as non-0-width.
Do not treat all unassigned code points as 0-width; instead, let ICU select the default for that block.

These decisions were made partly by following the behavior of GNOME Terminal. Testing on other terminals is of course welcome.
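
The category rules listed above can be sketched roughly as follows. This is an illustration in JavaScript, not the C++ in src/node_i18n.cc: `sketchColumnWidth` is a hypothetical name, East Asian Width is ignored entirely, and the fallback of one column stands in for whatever default ICU would pick.

```javascript
// Hypothetical sketch: column width of a single code point, based only on
// its General_Category. Wide and ambiguous characters are not handled.
function sketchColumnWidth(ch) {
  // Nonspacing (Mn) and enclosing (Me) marks attach to the previous
  // character and take no column of their own.
  if (/^[\p{Mn}\p{Me}]$/u.test(ch)) return 0;
  // Spacing marks (Mc) occupy a column.
  if (/^\p{Mc}$/u.test(ch)) return 1;
  // Everything else, including unassigned code points, gets a default
  // width instead of being forced to 0 (assumed to be 1 in this sketch).
  return 1;
}

console.log(sketchColumnWidth('\u0301')); // 0, COMBINING ACUTE ACCENT (Mn)
console.log(sketchColumnWidth('\u20dd')); // 0, COMBINING ENCLOSING CIRCLE (Me)
console.log(sketchColumnWidth('\u0903')); // 1, DEVANAGARI SIGN VISARGA (Mc)
```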

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
commit message follows commit guidelines

Affected core subsystem(s)

readline

cjihrig · 2017-06-26T14:16:18Z

None of them seems to fit this bug, which lies in the glue between getStringWidth and readline; hence the WIP.

Why not just create a new test?

jasnell · 2017-06-26T17:05:13Z

src/node_i18n.cc

@@ -603,9 +603,10 @@ static void ToASCII(const FunctionCallbackInfo<Value>& args) {
 // consideration.
 static int GetColumnWidth(UChar32 codepoint,
                          bool ambiguous_as_full_width = false) {
-  if (!u_isdefined(codepoint) ||


Why remove the u_isdefined() check?

My terminal (GNOME Terminal) actually displays unassigned characters (in a weird box form) so they are more than 0-width. Indeed, UAX #11 specifies that

Unassigned code points in ranges intended for CJK ideographs are classified as Wide.

while

All other unassigned code points are by default classified as Neutral.

I removed the u_isdefined() check here so that these unassigned characters can use the generic ICU routine below, which does the right thing.
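
As a rough illustration of the UAX #11 default quoted above (not of ICU's actual routine), the rule for unassigned code points might look like the sketch below. The function name is hypothetical, the ranges are written from memory of UAX #11 and should be checked against the current revision, and mapping Neutral to one column is an assumption of this sketch.

```javascript
// Ranges that UAX #11 reserves for CJK ideographs (assumed; verify against
// the current revision of the report before relying on them).
const CJK_RESERVED_RANGES = [
  [0x3400, 0x4DBF],   // CJK Unified Ideographs Extension A
  [0x4E00, 0x9FFF],   // CJK Unified Ideographs
  [0xF900, 0xFAFF],   // CJK Compatibility Ideographs
  [0x20000, 0x2FFFD], // Supplementary Ideographic Plane
  [0x30000, 0x3FFFD], // Tertiary Ideographic Plane
];

// Hypothetical helper: default column width for an unassigned code point.
// Wide inside the reserved CJK ranges, otherwise Neutral (one column here).
function defaultWidthForUnassigned(codePoint) {
  const wide = CJK_RESERVED_RANGES.some(
    ([lo, hi]) => codePoint >= lo && codePoint <= hi
  );
  return wide ? 2 : 1;
}
```

With the old `!u_isdefined()` shortcut, both cases would have been reported as 0 columns.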

I suspected that was the case. Ok :)

jasnell

nice... thank you!

TimothyGu · 2017-06-27T00:11:42Z

@cjihrig

Why not just create a new test?

Err... why did I not think of that...

TimothyGu · 2017-06-27T01:35:32Z

Tests for 0-width characters are added too.

CI: https://ci.nodejs.org/job/node-test-pull-request/8836/

Trott · 2017-06-27T04:08:56Z

Stopped the Raspberry Pi devices in CI because they had been running for 2.5 hours and had no new console messages for over an hour. Not sure if this will self-correct or if we need Build WG intervention...

Trott · 2017-06-27T04:10:09Z

Not looking like the Pi issue is self-correcting. Will have to either wait for the issue to get resolved before proceeding, or decide that this can land without the Pi run.
¯\_(ツ)_/¯

TimothyGu · 2017-06-28T05:25:50Z

Will have to either wait for the issue to get resolved before proceeding, or decide that this can land without the Pi run.

Given that the other machines are universally green (except for macOS's fickle nature), and that this PR has no architecture-specific components, I'd say this can land without confirmation from the RPi.

- Categorize all nonspacing marks (Mn) and enclosing marks (Me) as 0-width.
- Categorize all spacing marks (Mc) as non-0-width.
- Treat soft hyphens (a format character Cf) as non-0-width.
- Do not treat all unassigned code points as 0-width; instead, let ICU select the default for that character per UAX #11.
- Avoid getting the General_Category of a character multiple times as it is an intensive operation.

Refs: http://unicode.org/reports/tr11/

jasnell · 2017-06-29T02:33:19Z

CI: https://ci.nodejs.org/job/node-test-pull-request/8858/

PR-URL: #13918 Reviewed-By: James M Snell <jasnell@gmail.com>

jasnell · 2017-06-29T05:20:36Z

Landed in 01aeb38 and f4b5b70

MylesBorins · 2017-08-14T20:51:04Z

Should this be backported to v6.x-staging? If yes please follow the guide and raise a backport PR, if no let me know or add the dont-land-on label.

nodejs-github-bot added the readline Issues and PRs related to the built-in readline module. label Jun 26, 2017

TimothyGu force-pushed the readline-nsm branch from 6abff8b to ea1d8a2 Compare June 26, 2017 07:35

jasnell reviewed Jun 26, 2017

jasnell approved these changes Jun 26, 2017

TimothyGu force-pushed the readline-nsm branch from ea1d8a2 to 9201cd6 Compare June 27, 2017 01:17

TimothyGu changed the title ~~WIP readline: properly handle 0-width characters~~ readline: fix character width calculation Jun 27, 2017

readline: properly handle 0-width characters

8a3e46a

TimothyGu force-pushed the readline-nsm branch from 9201cd6 to 6e9fdd8 Compare June 28, 2017 07:38

jasnell pushed a commit that referenced this pull request Jun 29, 2017

readline: properly handle 0-width characters

01aeb38

PR-URL: #13918 Reviewed-By: James M Snell <jasnell@gmail.com>

jasnell closed this Jun 29, 2017

addaleax pushed a commit that referenced this pull request Jun 29, 2017

readline: properly handle 0-width characters

5ee0e20

PR-URL: #13918 Reviewed-By: James M Snell <jasnell@gmail.com>

addaleax mentioned this pull request Jun 29, 2017

v8.2.0 proposal #13744

Merged

TimothyGu deleted the readline-nsm branch June 30, 2017 06:23

addaleax pushed a commit that referenced this pull request Jul 11, 2017

readline: properly handle 0-width characters

c461079

PR-URL: #13918 Reviewed-By: James M Snell <jasnell@gmail.com>

addaleax pushed a commit that referenced this pull request Jul 18, 2017

readline: properly handle 0-width characters

b4b27b2

PR-URL: #13918 Reviewed-By: James M Snell <jasnell@gmail.com>

MylesBorins added the backport-requested-v6.x label Aug 14, 2017
