readline: use icu based string width calculation #9040

jasnell · 2016-10-11T22:23:30Z

Checklist

make -j8 test (UNIX), or vcbuild test nosign (Windows) passes
tests and/or benchmarks are included
commit message follows commit guidelines

Affected core subsystem(s)

readline, internal

Description of change

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine whether a character is full width. This allows the algorithm to correctly identify emoji as full width, ensures the algorithm will continue to function properly as new Unicode code points are added, and is faster.

This was originally part of a proposal to add a new unicode module, but has been split out.

Refs: #8075

jasnell · 2016-10-11T22:39:06Z

Technically semver-minor because it now allows a single code point to be passed in as the argument to getStringWidth().

jasnell · 2016-10-12T02:40:44Z

CI: https://ci.nodejs.org/job/node-test-pull-request/4484/

bnoordhuis · 2016-10-12T08:52:32Z

This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved.

jasnell · 2016-10-12T13:39:31Z

Intl and non-intl builds already behave differently in many ways. This
would be no different. Even v8 already acts differently if ICU isn't
present.

bnoordhuis · 2016-10-12T13:56:19Z

Intl and non-intl builds already behave differently in many ways.

Oh? In what way?

Even v8 already acts differently if ICU isn't present.

"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is.

jasnell · 2016-10-12T15:14:50Z

Oh? In what way?

Off the top of my head:

Presence of Intl
Results of String.prototype.normalize (try '\u1E9B\u0323'.normalize('NFD') === '\u1E9B\u0323'.normalize('NFC') in a default build vs. --without-intl build)
Behavior of locale specific methods like toLocaleUpperCase(), localeCompare(), and toLocaleString()
ICU's 10x faster punycode implementation used if ICU is present (recent change)

There's likely more.
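
The first two items in the list can be checked from a script. This is a minimal sketch that uses only public JavaScript APIs; the helper name `describeIntlSupport` is hypothetical, and what it reports depends on how the running binary was built.

```javascript
// Hypothetical helper: probes two of the intl / non-intl differences listed above.
function describeIntlSupport() {
  const s = '\u1E9B\u0323';
  return {
    // The Intl global is expected only when the binary was built with ICU.
    hasIntlGlobal: typeof Intl === 'object',
    // With real normalization the NFD and NFC forms of this string differ;
    // a build whose normalize() leaves the string untouched reports true.
    nfdEqualsNfc: s.normalize('NFD') === s.normalize('NFC'),
  };
}

console.log(describeIntlSupport());
```

Running the same script under a default build and a --without-intl build should make the divergence visible.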

It's not just about a more accurate implementation; performance is also significantly improved with the new algorithm.

this PR:

$ ./node --expose-internals benchmark/misc/stringwidth.js
misc/stringwidth.js test="a" millions=5: 6.288441921473474
misc/stringwidth.js test="丁" millions=5: 6.230841486482881
misc/stringwidth.js test="👸🏿" millions=5: 4.1727458183443265
misc/stringwidth.js test="👅" millions=5: 5.148377225269059
misc/stringwidth.js test="\n" millions=5: 6.285715980343315
misc/stringwidth.js test="‎f‏" millions=5: 5.2453819193365865
misc/stringwidth.js test="‎\n∊⃒" millions=5: 4.174633509425163

vs. latest v6:

$ node --expose-internals benchmark/misc/stringwidth.js
misc/stringwidth.js test="a" millions=5: 1.3992433734181065
misc/stringwidth.js test="丁" millions=5: 1.7113687595793867
misc/stringwidth.js test="👸🏿" millions=5: 1.5577253551652364
misc/stringwidth.js test="👅" millions=5: 1.6861030856604042
misc/stringwidth.js test="\n" millions=5: 1.7522551193962668
misc/stringwidth.js test="‎f‏" millions=5: 1.5648816109285903
misc/stringwidth.js test="‎\n∊⃒" millions=5: 1.459896127806548

(the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine)
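
To put the two runs side by side, the figures can be divided pairwise. This is only a sketch of the arithmetic; it assumes a larger figure means a higher rate, which the comment implies but the output does not state.

```javascript
// Figures copied from the two benchmark runs quoted above, in the same order.
const thisPR = [6.288441921473474, 6.230841486482881, 4.1727458183443265,
                5.148377225269059, 6.285715980343315, 5.2453819193365865,
                4.174633509425163];
const v6 = [1.3992433734181065, 1.7113687595793867, 1.5577253551652364,
            1.6861030856604042, 1.7522551193962668, 1.5648816109285903,
            1.459896127806548];

// Assumption: higher is better, so each ratio is the speedup for one test string.
const ratios = thisPR.map((rate, i) => rate / v6[i]);
console.log(ratios.map((r) => r.toFixed(2) + 'x').join(' '));
```

Under that assumption the ratios fall between roughly 2.7x and 4.5x across the seven test strings.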

Fishrock123 · 2016-10-13T14:10:51Z

lib/internal/readline.js

 /**
 * Tries to remove all VT control characters. Use to estimate displayed
 * string width. May be buggy due to not running a real state machine
 */
 function stripVTControlCharacters(str) {
-  str = str.replace(new RegExp(functionKeyCodeReAnywhere.source, 'g'), '');
-  return str.replace(new RegExp(metaKeyCodeReAnywhere.source, 'g'), '');
+  return str.replace(ansi, '');


doesn't need /g?

it uses /g.
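
The reason the flag matters can be shown with the pattern quoted elsewhere in this review. A minimal sketch: `firstOnly` is a hypothetical copy of the pattern without the g flag, built only to show what would go wrong.

```javascript
// Pattern as quoted in this review for lib/internal/readline.js.
const ansi =
  /[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g;

function stripVTControlCharacters(str) {
  // replace() with a global regex removes every match, not only the first.
  return str.replace(ansi, '');
}

const colored = '\u001b[31mred\u001b[0m';
// Hypothetical non-global copy of the same pattern.
const firstOnly = new RegExp(ansi.source);

console.log(JSON.stringify(stripVTControlCharacters(colored)));
console.log(JSON.stringify(colored.replace(firstOnly, '')));
```

The first call strips both escape sequences; the second leaves the trailing reset sequence in place, which would inflate any width estimate.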

Fishrock123 · 2016-10-13T14:12:46Z

lib/internal/readline.js



 module.exports = {
  emitKeys,
-  getStringWidth,
-  isFullWidthCodePoint,


maybe we should initially assign no-ops with comments?

how come? that does not seem very practical.

Fishrock123 · 2016-10-13T14:13:35Z

lib/internal/readline.js

  stripVTControlCharacters
 };

+if (process.binding('config').hasIntl) {


not detectable from process.config?

Using process.config is unreliable because there are userland packages that override it. That's why process.binding('config') was introduced.
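
process.binding('config') is internal to core. For code outside core, a rough equivalent can be sketched from public properties; `hasIntlSupport` is a hypothetical name and this is not what the PR itself does.

```javascript
// Hypothetical userland check. Core reads process.binding('config').hasIntl instead.
function hasIntlSupport() {
  // process.versions.icu is expected to be set only on builds that include ICU,
  // and the Intl global is likewise only present on such builds.
  return typeof process.versions.icu === 'string' && typeof Intl === 'object';
}

console.log(hasIntlSupport());
```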

Fishrock123 · 2016-10-13T14:15:00Z

lib/internal/readline.js

+    return icu.getStringWidth(stripVTControlCharacters(str));
+  };
+  module.exports.isFullWidthCodePoint = function isFullWidthCodePoint(code) {
+    if (typeof code !== 'number')


shouldn't this remain isNaN() ... or rather, Number.isNaN()?
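
The three checks being compared accept and reject different inputs. A small sketch; the helper name `classify` and the sample values are chosen here for illustration only.

```javascript
// Hypothetical helper: evaluates each candidate guard against one input.
function classify(code) {
  return {
    typeofNotNumber: typeof code !== 'number', // rejects every non-number, accepts NaN
    globalIsNaN: isNaN(code),                  // coerces first, so '19969' passes as a number
    numberIsNaN: Number.isNaN(code),           // true only for the NaN value itself
  };
}

for (const code of [0x4e01, NaN, '丁', '19969', undefined]) {
  console.log(String(code).padEnd(10), classify(code));
}
```

The typeof guard lets NaN through, while the global isNaN() lets numeric strings through, so the choice depends on which inputs the function is expected to tolerate.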

Fishrock123 · 2016-10-13T14:16:27Z

lib/readline.js

+      // if the key.sequence is half of a surrogate pair,
+      // refresh the line so the character is displayed appropriately.
+      const ch = key.sequence.codePointAt(0);
+      if (ch >= 0xd800 && ch <= 0xdfff)


maybe add a comment about these values?

is the existing comment on line 128 and 129 not enough?
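
For readers unfamiliar with the two constants: 0xD800 through 0xDFFF is the UTF-16 surrogate range, so the check fires only when the sequence starts with an unpaired half. A minimal sketch, with `isLoneSurrogate` as a hypothetical helper name:

```javascript
// Hypothetical helper mirroring the range check in the diff above.
function isLoneSurrogate(sequence) {
  const ch = sequence.codePointAt(0);
  // 0xD800-0xDBFF are high (lead) surrogates, 0xDC00-0xDFFF are low (trail) ones.
  // A complete pair decodes to a code point above 0xFFFF and fails this test.
  return ch >= 0xd800 && ch <= 0xdfff;
}

const tongue = '\u{1F445}'; // stored in UTF-16 as the pair D83D DC45
console.log(isLoneSurrogate(tongue));    // complete pair
console.log(isLoneSurrogate(tongue[0])); // high half only
console.log(isLoneSurrogate(tongue[1])); // low half only
console.log(isLoneSurrogate('a'));       // ordinary BMP character
```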

jasnell · 2016-10-21T21:00:23Z

New CI: https://ci.nodejs.org/job/node-test-pull-request/4611/
Failure on windows... trying again: https://ci.nodejs.org/job/node-test-pull-request/4619/

jasnell · 2016-10-22T15:25:44Z

Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review.
@trevnorris @addaleax ... PTAL, does this LGTY?

bnoordhuis

FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.

bnoordhuis · 2016-10-23T10:24:44Z

lib/internal/readline.js

+// Adopted from https://github.com/chalk/ansi-regex/blob/master/index.js
+// License: MIT, authors: @sindresorhus, Qix-, and arjunmehta
+const ansi =
+  /[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g;


This regex could use a comment explaining what it tries to match/capture. Also, four space indent.

four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.
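
One way to document the pattern without breaking the 80-column limit is to assemble it from commented pieces. This is a sketch of that alternative, not what the PR does; the piece descriptions are a reading of the pattern, not text from the ansi-regex project.

```javascript
// The same pattern as in the diff, rebuilt from annotated pieces.
const pieces = [
  '[\\u001b\\u009b]',                 // introducer: ESC (0x1B) or 8-bit CSI (0x9B)
  '[[()#;?]*',                        // any run of the characters [ ( ) # ; ?
  '(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?',  // optional numeric parameters separated by ';'
  '[0-9A-ORZcf-nqry=><]',             // one terminating character from this set
];
const ansi = new RegExp(pieces.join(''), 'g');

console.log(JSON.stringify('\u001b[1;31mhi\u001b[0m'.replace(ansi, '')));
console.log(JSON.stringify('\u001b[2Kdone'.replace(ansi, '')));
```

The trade-off is the one raised in the reply: the pieces are easier to annotate, but the pattern can no longer be read in one glance.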

bnoordhuis · 2016-10-23T10:30:32Z

src/node_i18n.cc

+// newer wide characters. wcwidth, on the other hand, uses a fixed
+// algorithm that does not take things like emoji into proper
+// consideration.
+static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) {


Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.

bnoordhuis · 2016-10-23T10:31:22Z

src/node_i18n.cc

+// algorithm that does not take things like emoji into proper
+// consideration.
+static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) {
+  const int eaw = u_getIntPropertyValue(codepoint, UCHAR_EAST_ASIAN_WIDTH);


Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.

bnoordhuis · 2016-10-23T10:32:23Z

src/node_i18n.cc

+      return 2;
+    case U_EA_AMBIGUOUS:
+      if (ambiguousFull)
+        return 2;


If the fall-through is intentional, can you add // fallthrough comments?
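
The shape being discussed can be sketched in JavaScript. Only the wide and ambiguous branches are visible in the hunk; the remaining cases, the string category names, and the `zeroWidth` parameter are assumptions made for illustration.

```javascript
// Hypothetical inputs: `eaw` stands in for the ICU East Asian Width value and
// `zeroWidth` for the control / emoji-modifier / combining-class test.
function getColumnWidth(eaw, zeroWidth, ambiguousAsFullWidth = false) {
  switch (eaw) {
    case 'FULLWIDTH':
    case 'WIDE':
      return 2;
    case 'AMBIGUOUS':
      if (ambiguousAsFullWidth) {
        return 2;
      }
      // fallthrough: ambiguous characters are otherwise handled as narrow
    case 'NEUTRAL':
    case 'HALFWIDTH':
    case 'NARROW':
    default:
      return zeroWidth ? 0 : 1;
  }
}

console.log(getColumnWidth('WIDE', false));
console.log(getColumnWidth('AMBIGUOUS', false, true));
console.log(getColumnWidth('AMBIGUOUS', false));
console.log(getColumnWidth('NEUTRAL', true));
```

The explicit comment marks the missing break as deliberate, which is what the review asks for.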

bnoordhuis · 2016-10-23T10:32:47Z

src/node_i18n.cc

+      if (u_iscntrl(codepoint) ||
+          u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER) ||
+          u_getCombiningClass(codepoint) > 0)
+        return 0;


Maybe put braces around the body, slightly easier to read.

bnoordhuis · 2016-10-23T10:33:47Z

src/node_i18n.cc

+  }
+
+  bool ambiguousFull = args[1]->BooleanValue();
+  bool expandEmojiSequence = args[2]->BooleanValue();


Style: snake_case

bnoordhuis · 2016-10-23T10:35:12Z

src/node_i18n.cc

+  bool expandEmojiSequence = args[2]->BooleanValue();
+
+  TwoByteValue value(env->isolate(), args[0].As<String>());
+  UChar* str = reinterpret_cast<UChar*>(*value);


The reinterpret_cast shouldn't be necessary, should it?

bnoordhuis · 2016-10-23T10:37:58Z

src/node_i18n.cc

+        n > 0 && p == 0x200d &&  // 0x200d == emoji sequence continuation
+        (u_hasBinaryProperty(c, UCHAR_EMOJI_PRESENTATION) ||
+         u_hasBinaryProperty(c, UCHAR_EMOJI_MODIFIER)))
+      continue;


jasnell · 2016-10-23T15:14:30Z

Updated to address the nits.

jasnell · 2016-10-23T17:39:24Z

@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/
/cc @jbergstroem ... I'll tweak this a bit later to get it added to the main CI group.

jbergstroem · 2016-10-23T18:58:53Z

@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests.

@bnoordhuis

Updated. Needs another review from @bnoordhuis

bnoordhuis

LGTM with a comment and a style nit.

bnoordhuis · 2016-10-24T19:58:16Z

src/node_i18n.cc

+  if (args[0]->IsNumber()) {
+    args.GetReturnValue().Set(
+        GetColumnWidth(args[0]->Uint32Value(),
+        ambiguous_as_full_width));


Can you line up the argument?

bnoordhuis · 2016-10-24T20:00:36Z

src/node_i18n.cc

+    return;
+  }
+
+  TwoByteValue value(env->isolate(), args[0].As<String>());


TwoByteValue's constructor takes a Local<Value>. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.

srl295

Looks good, with a couple of minor comments

srl295 · 2016-10-24T21:26:35Z

src/node_i18n.cc

+    // The expand_emoji_sequence option allows the caller to skip this
+    // check and count each code within an emoji sequence separately.
+    if (!expand_emoji_sequence &&
+        n > 0 && p == 0x200d &&  // 0x200d == emoji sequence continuation


U+200D is a ZWJ (zero width joiner), please use that term
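
The check in this hunk can be sketched in JavaScript with Unicode property escapes. `measuredCodePoints` is a hypothetical helper that returns which code points would still be measured; it does not compute widths and is not the C++ code under review.

```javascript
const ZWJ = 0x200d; // ZERO WIDTH JOINER
const joinable = /[\p{Emoji_Presentation}\p{Emoji_Modifier}]/u;

// Hypothetical helper: lists the code points that would be counted.
function measuredCodePoints(str, expandEmojiSequence = false) {
  const kept = [];
  let previous; // undefined until the first code point has been seen
  for (const ch of str) { // for...of walks by code point, not by UTF-16 unit
    const c = ch.codePointAt(0);
    // Skip an emoji that directly follows a ZWJ unless the caller opts out.
    const skip = !expandEmojiSequence && previous === ZWJ && joinable.test(ch);
    previous = c;
    if (!skip) kept.push(c);
  }
  return kept;
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man ZWJ woman ZWJ girl
console.log(measuredCodePoints(family).length);
console.log(measuredCodePoints(family, true).length);
```

With the sequence collapsed, only the first emoji and the joiners remain to be measured; with expansion enabled, every code point is counted separately.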

srl295 · 2016-10-24T21:27:25Z

src/node_i18n.cc

+    // in advance if a particular sequence is going to be supported.
+    // The expand_emoji_sequence option allows the caller to skip this
+    // check and count each code within an emoji sequence separately.
+    if (!expand_emoji_sequence &&


This seems like a reasonable way of doing this calculation

srl295 · 2016-10-24T21:29:16Z

src/node_i18n.cc

+      return 2;
+    case U_EA_AMBIGUOUS:
+      if (ambiguous_as_full_width) {
+        return 2;


It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here

jasnell · 2016-10-24T23:42:35Z

Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/

jasnell · 2016-10-25T15:25:29Z

had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/

jasnell · 2016-10-25T15:59:49Z

CI is green except for unrelated flaky failures. Landing

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine whether a character is full width. This allows the algorithm to correctly identify emoji as full width, ensures the algorithm will continue to function properly as new Unicode code points are added, and is faster. This was originally part of a proposal to add a new unicode module, but has been split out.

Refs: #8075
PR-URL: #9040
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>

jasnell · 2016-10-25T16:02:25Z

Landed in 72547fe

jasnell · 2016-10-25T18:21:28Z

Shrinking it down ought to be fine. I'll do that this next week

bnoordhuis · 2016-10-25T18:45:20Z

@jasnell g++ spotted a bug:

../src/node_i18n.cc: In function 'void node::i18n::GetStringWidth(const v8::FunctionCallbackInfo<v8::Value>&)':
../src/node_i18n.cc:543:15: warning: 'c' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (!expand_emoji_sequence &&
         ~~~~~~~~~~~~~~~~~~~~~~~~~
         n > 0 && p == 0x200d &&  // 0x200d == ZWJ (zero width joiner)

I think you need a U16_NEXT call outside the loop.

jasnell · 2016-10-25T18:47:42Z

Hmm.. Ok, away from the laptop at the moment but will look at that shortly

srl295 · 2016-10-25T19:45:34Z

@bnoordhuis good catch - otherwise p=<undefined>

A U16_NEXT(str, n, value.length(), c); before the loop will initialize c.
May also obviate the n>0 check within the loop.
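
The suggested restructuring, decoding the first code point before entering the loop so that the "previous" value is always defined, can be sketched in JavaScript. `u16Next` and `codePointPairs` are hypothetical names, `u16Next` only imitates what ICU's U16_NEXT macro does, and this is not taken from #9280.

```javascript
// Decodes the code point starting at UTF-16 index i and returns the next index,
// roughly what ICU's U16_NEXT does for a UChar buffer.
function u16Next(str, i) {
  const cp = str.codePointAt(i);
  return { cp, next: i + (cp > 0xffff ? 2 : 1) };
}

// Hypothetical helper: yields [previous, current] pairs of code points.
function codePointPairs(str) {
  const pairs = [];
  if (str.length === 0) return pairs;
  let { cp: c, next: n } = u16Next(str, 0); // prime c before the loop
  while (n < str.length) {
    const p = c; // always initialized, so no n > 0 guard is needed
    ({ cp: c, next: n } = u16Next(str, n));
    pairs.push([p, c]);
  }
  return pairs;
}

console.log(codePointPairs('a\u{1F445}b'));
```

Because the first code point is consumed up front, the loop body never reads a previous value that was not set.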

bnoordhuis · 2016-10-25T20:32:49Z

#9280

MylesBorins · 2017-05-15T14:54:09Z

/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered?

jasnell · 2017-05-15T15:03:01Z

I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this

srl295 · 2017-08-31T18:15:51Z

no pressing reason

jasnell added readline Issues and PRs related to the built-in readline module. i18n-api Issues and PRs related to the i18n implementation. labels Oct 11, 2016

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. readline Issues and PRs related to the built-in readline module. labels Oct 11, 2016

jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 11, 2016

bnoordhuis mentioned this pull request Oct 12, 2016

Add --without-intl builds to CI matrix nodejs/build#419

Closed

Fishrock123 reviewed Oct 13, 2016

rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 Compare October 18, 2016 17:02

jasnell force-pushed the icu-stringwidth branch from cc56bd3 to 5e5522c Compare October 21, 2016 20:58

jasnell force-pushed the icu-stringwidth branch from 5e5522c to bf886f8 Compare October 21, 2016 21:39

bnoordhuis previously requested changes Oct 23, 2016

bnoordhuis approved these changes Oct 24, 2016

srl295 approved these changes Oct 24, 2016

jasnell force-pushed the icu-stringwidth branch from 5dcbf74 to 924eef3 Compare October 24, 2016 23:41

jasnell force-pushed the icu-stringwidth branch from 924eef3 to 518c711 Compare October 25, 2016 15:24

jasnell force-pushed the icu-stringwidth branch from 518c711 to 876bb83 Compare October 25, 2016 15:51

jasnell closed this Oct 25, 2016

evanlucas mentioned this pull request Nov 3, 2016

v7.1.0 proposal - 2016-11-08 #9438

Merged

gibfahn mentioned this pull request Apr 24, 2017

Potential Semver Minor Backports nodejs/Release#188

Closed

31 tasks

gibfahn added the dont-land-on-v6.x label May 16, 2017
