url: enforce valid UTF-8 in WHATWG parser #11436

Closed
wants to merge 4 commits into
base: master
from

8 participants
@TimothyGu
Member

TimothyGu commented Feb 17, 2017

This commit implements the Web IDL USVString conversion, which mandates that all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT CHARACTER. It also disallows using Symbols as USVString, per spec.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

url
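
As a rough illustration of the behavior described above, the following sketch uses the public WHATWG URL API as it ships in current Node.js releases (it is not code from this PR, and the variable names are only illustrative). A lone surrogate in the input should come out as U+FFFD, while a well-formed surrogate pair is left alone.

```javascript
// Sketch only: assumes a Node.js release where URL and URLSearchParams
// are available as globals.
const loneSurrogate = '\uD800';        // unpaired lead surrogate
const pairedEmoji = '\uD83D\uDE00';    // well-formed surrogate pair (U+1F600)

// URLSearchParams takes its init string as a USVString, so the lone
// surrogate is replaced before the query string is parsed.
const params = new URLSearchParams(`a=${loneSurrogate}&b=${pairedEmoji}`);
const paramA = params.get('a');
const paramB = params.get('b');

// In a URL, the replacement character is then percent-encoded as UTF-8.
const encodedSearch = new URL(`http://example.com/?q=${loneSurrogate}`).search;

console.log(paramA === '\uFFFD', paramB === pairedEmoji, encodedSearch);
```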

TimothyGu Feb 17, 2017

Member

Some background on why I chose to implement it in C++. I first implemented a fairly optimized JS version, but found a significant disparity between its best-case performance (validsurr) and its worst-case performance (allinvalid): roughly 7x in the results below. This, to me, is unacceptable. The C++ implementation is a bit slower for the best case (about 10%), but a lot faster for the worst case (about 3.7x).

JS implementation
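function toUSVString(V) {
  // Per Web IDL, a Symbol cannot be converted to a USVString.
  if (typeof V === 'symbol')
    throw new TypeError();

  const S = String(V);
  const n = S.length;
  var U = '';
  var lastPos = 0;  // Start of the run of code units not yet copied to U.
  for (var i = 0; i < n; ++i) {
    const c = S.charCodeAt(i);

    if (c < 0xD800 || c > 0xDFFF) {
      // Not a surrogate: nothing to do.
      continue;
    } else if (0xDC00 <= c && c <= 0xDFFF || i === n - 1) {
      // Lone trail surrogate, or a lead surrogate at the end of the string.
      if (lastPos < i)
        U += S.slice(lastPos, i);
      lastPos = i + 1;
      U += '\ufffd';
    } else {
      const d = S.charCodeAt(i + 1);
      if (0xDC00 <= d && d <= 0xDFFF) {
        // Well-formed surrogate pair: skip the trail surrogate.
        ++i;
        continue;
      } else {
        // Lead surrogate not followed by a trail surrogate.
        if (lastPos < i)
          U += S.slice(lastPos, i);
        lastPos = i + 1;
        U += '\ufffd';
      }
    }
  }
  // No replacement was made: return the stringified value, not V itself,
  // so that the result is always a string.
  if (lastPos === 0)
    return S;
  if (lastPos < n)
    return U + S.slice(lastPos, n);
  return U;
}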
Benchmark results
h/usvstring.js n=50000000 input="valid" type="js": 15,533,604.407488512
h/usvstring.js n=50000000 input="validsurr" type="js": 17,423,172.615309086
h/usvstring.js n=50000000 input="someinvalid" type="js": 3,411,650.9908266817
h/usvstring.js n=50000000 input="allinvalid" type="js": 2,352,265.169112214
h/usvstring.js n=50000000 input="valid" type="cpp": 16,506,065.368628675
h/usvstring.js n=50000000 input="validsurr" type="cpp": 15,627,668.561148917
h/usvstring.js n=50000000 input="someinvalid" type="cpp": 8,808,868.726046104
h/usvstring.js n=50000000 input="allinvalid" type="cpp": 8,750,259.202803183
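
For checking an optimized version like the one above, a compact and slower reference can be handy. The sketch below is not from the PR, and the name toUSVStringReference is hypothetical. It relies on string iteration yielding a well-formed surrogate pair as one two-unit string and a lone surrogate as a one-unit string.

```javascript
// Hypothetical reference helper; slower than the loop in the PR, but short
// enough to verify by eye.
function toUSVStringReference(value) {
  if (typeof value === 'symbol')
    throw new TypeError('Cannot convert a Symbol value to a USVString');

  let result = '';
  // for...of walks the string by code point, so only unpaired surrogates
  // show up as single code units in the surrogate range.
  for (const ch of String(value)) {
    const code = ch.charCodeAt(0);
    const isLoneSurrogate =
      ch.length === 1 && code >= 0xD800 && code <= 0xDFFF;
    result += isLoneSurrogate ? '\uFFFD' : ch;
  }
  return result;
}

console.log(toUSVStringReference('a\uD800b') === 'a\uFFFDb');
```

Unlike an early return of the unconverted argument, this version always returns a string, including for inputs such as numbers.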

targos Feb 17, 2017

Member

I think we already have a valid implementation of toUSVString somewhere in the project. It is used when you call buffer.toString('utf8').

TimothyGu Feb 17, 2017

Member

@targos, it is a lot slower (10x slower for validsurr) because of the string-to-Buffer conversion.

targos Feb 17, 2017

Member

Of course, but can't we reuse it directly on the string, without the conversion?

TimothyGu Feb 17, 2017

Member

but can't we reuse it directly on the string, without the conversion?

No, unfortunately.

> Buffer.prototype.toString.call('a')
TypeError: this.utf8Slice is not a function
> Buffer.prototype.utf8Slice.call('a')
TypeError: argument should be a Buffer

That notwithstanding, the actual conversion is done under the hood at buffer creation time in StringBytes::Write, which uses the same mechanism as the Utf8Value class. That mechanism would work as well, except it is less efficient than the implementation in this PR.
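
The Buffer round trip discussed above can be observed from JS. This is only a sketch of the slower alternative, not what the PR does, and the helper name viaBuffer is hypothetical. The replacement happens when the string is encoded to UTF-8, that is, at buffer creation time.

```javascript
// Hypothetical helper: normalize a string by round-tripping it through a
// UTF-8 Buffer. Buffer is a Node.js global.
function viaBuffer(str) {
  // Encoding to UTF-8 is where a lone surrogate becomes U+FFFD.
  return Buffer.from(str, 'utf8').toString('utf8');
}

// A lone surrogate is written out as the UTF-8 bytes of U+FFFD (EF BF BD).
const replacementBytes = Buffer.from('\uD800', 'utf8');

console.log(replacementBytes, viaBuffer('a\uD800b') === 'a\uFFFDb');
```

The extra allocation and the two encoding passes are a plausible source of the slowdown mentioned earlier in the thread.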

targos Feb 17, 2017

Member

I didn't mean use the JS function, rather the underlying V8 C++ implementation. I'm sorry for the confusion, I didn't see your comment in node_url.cc that explains why you are not doing that.

@@ -48,8 +48,7 @@ TwoByteValue::TwoByteValue(Isolate* isolate, Local<Value> value) {
   const size_t storage = string->Length() + 1;
   AllocateSufficientStorage(storage);
-  const int flags =
-      String::NO_NULL_TERMINATION | String::REPLACE_INVALID_UTF8;
+  const int flags = String::NO_NULL_TERMINATION;

mscdex Feb 17, 2017

Contributor

I'm not sure this is right, since this would affect more than just the WHATWG URL implementation?

bnoordhuis Feb 17, 2017

Member

Indeed, this is not an acceptable change. You could turn the flags into an optional argument that defaults to what is now or make it a template trait.

EDIT: Objection withdrawn.

TimothyGu Feb 17, 2017

Member

As I've explained in 424722af4541cd0eec1bf299aa8afcdff6284f52, this flag is actually a no-op. TwoByteValue::TwoByteValue uses v8::String::Write as opposed to v8::String::WriteUtf8. The flag is only respected by WriteUtf8, so applying it here in TwoByteValue is misleading.
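
A JS-level analogy may help here (it is not the V8 C++ API, and the helper name utf16Units is hypothetical). Copying UTF-16 code units carries a lone surrogate through unchanged, and only encoding to UTF-8 has to decide what to do with it.

```javascript
// Hypothetical helper: copy a string's UTF-16 code units verbatim, loosely
// analogous to a two-byte copy of the string contents.
function utf16Units(str) {
  const units = new Uint16Array(str.length);
  for (let i = 0; i < str.length; i++)
    units[i] = str.charCodeAt(i);
  return units;
}

const sample = 'a\uD800';

// UTF-16 view: the lone surrogate 0xD800 survives as-is.
const sampleUnits = utf16Units(sample);

// UTF-8 view: TextEncoder (a global in current Node.js releases) replaces
// the lone surrogate with U+FFFD, encoded as EF BF BD.
const utf8Bytes = new TextEncoder().encode(sample);

console.log(sampleUnits, utf8Bytes);
```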

bnoordhuis Feb 19, 2017

Member

Oh, you're right, I missed that it's the UTF-16 version. Objection withdrawn.

@@ -598,8 +598,7 @@ exports.WPT = {
     try {
       fn();
     } catch (err) {
-      if (err instanceof Error)
-        err.message = `In ${desc}:\n  ${err.message}`;
+      console.error(`In ${desc}:`);

bnoordhuis Feb 17, 2017

Member

Left-over debug code?

TimothyGu Feb 17, 2017

Member

No, it was intentional, since it seems that changing the message after the Error is constructed doesn't change its stack. (See 56f6b8dbad36a46f07668b2b35574dc367e6d146.)

joyeecheung Feb 17, 2017

Member

Hmm, yes, to patch the error message we would have to extend the Error class, something like:

class WPTError extends Error {
  constructor(desc, err) {
    super(`In ${desc}, ${err.name}: ${err.message}`);
    Error.captureStackTrace(this, WPTError);
  }
}

throw new WPTError(desc, err);

But then console.error() works too, though we would see two error traces this way?

jasnell Feb 17, 2017

Member

hmm.. it appears to:

> var m = new Error('test')
undefined
> m.message = 'foo'
'foo'
> m
Error: foo
    at repl:1:9
    at ContextifyScript.Script.runInThisContext (vm.js:23:33)
    at REPLServer.defaultEval (repl.js:340:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.onLine (repl.js:537:10)
    at emitOne (events.js:101:20)
    at REPLServer.emit (events.js:189:7)
    at REPLServer.Interface._onLine (readline.js:238:10)
    at REPLServer.Interface._line (readline.js:582:8)

joyeecheung Feb 17, 2017

Member

Oh, it looks like the Error.captureStackTrace call in the AssertionError constructor "freezes" the stack. To recapture the stack, we need to:

diff --git a/test/common.js b/test/common.js
index 5f7dc25..f74d5f0 100644
--- a/test/common.js
+++ b/test/common.js
@@ -598,8 +598,10 @@ exports.WPT = {
     try {
       fn();
     } catch (err) {
-      if (err instanceof Error)
+      if (err instanceof Error) {
         err.message = `In ${desc}:\n  ${err.message}`;
+        Error.captureStackTrace(err);
+      }
       throw err;
     }
   },

TimothyGu Feb 17, 2017

Member

@joyeecheung, calling Error.captureStackTrace() there will only capture the stack of WPT.test, not the actual stack from err.

joyeecheung Feb 18, 2017

Member

Hmm, yes, this probably deserves another PR to sort out. I am fine with a console.error workaround.
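
A minimal sketch of that workaround follows. The function runWithDescription is a hypothetical stand-in for the WPT.test() helper in test/common.js, not its actual code. The description is reported separately and the error is rethrown untouched, so its message and stack still point at the real failure.

```javascript
// Hypothetical stand-in for the WPT test wrapper discussed above.
function runWithDescription(desc, fn) {
  try {
    fn();
  } catch (err) {
    // Print the context on stderr instead of rewriting err.message, then
    // rethrow the very same error object.
    console.error(`In ${desc}:`);
    throw err;
  }
}

const originalError = new Error('boom');
let caughtError;
try {
  runWithDescription('sample test', () => { throw originalError; });
} catch (err) {
  caughtError = err;
}

console.log(caughtError === originalError, caughtError.message);
```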

jasnell Feb 17, 2017

Member

I'm wondering if a more efficient solution would be to make a USVStringValue alternative to Utf8Value (see https://github.com/nodejs/node/blob/master/src/node_url.cc#L1321). When the input is provided to the Parse() function within node_url.cc, it is currently interpreted as a UTF-8 string using Utf8Value. This ends up writing the string bytes out, which means that if we do the toUSVString() and then pass it off to Parse(), we'll be doing the write twice. By doing the conversion at that point, we can avoid the extra trip across the JS/C++ boundary, we cover all of the setters, and there would be no need to export the toUSVString() function.

TimothyGu Feb 17, 2017

Member

I'm wondering if a more efficient solution would be to make a USVStringValue alternative to Utf8Value (see https://github.com/nodejs/node/blob/master/src/node_url.cc#L1321).

Certainly. I'll see what I can do for this.

there would be no need to export the toUSVString() function.

I think it'll still be needed for the URLSearchParams interface, which is implemented entirely in JS.

jasnell Feb 17, 2017

Member

I think it'll still be needed for the URLSearchParams interface

True. That said, I'm working on a C/C++ querystring parser implementation and evaluating the performance. It will likely make the most sense to keep that impl in JS land but we'll see what kind of numbers I can get.

TimothyGu Feb 20, 2017

Member

@jasnell, after investigating, I realized that a TwoByteValue-based class for USVString is not helpful, at least in the context of this PR. The current C++ Parse() method used by the setters already uses Utf8Value, which does the same thing as ToUSVString(), so there is nothing to change there. On the other hand, certain setters operate on the stringified parameter in JS before calling the binding's Parse(), so an efficient ToUSVString() method is needed for those methods anyway.

While the double conversion cannot be gotten rid of in the setters, the domainTo*() JS methods are a different case. I've simply gotten rid of the JS wrappers to handle String conversion fully in C++.

TimothyGu Feb 20, 2017

Member

Tried to address comments as far as possible. Changes:

PTAL.

CI: https://ci.nodejs.org/job/node-test-pull-request/6504/

Review of an older version

TimothyGu commented Feb 22, 2017

TimothyGu commented Feb 23, 2017

TimothyGu commented Feb 25, 2017

@bnoordhuis, ping?

@@ -1351,6 +1368,41 @@ namespace url {
v8::NewStringType::kNormal).ToLocalChecked());
}
static void ToUSVString(const FunctionCallbackInfo<Value>& args) {
Environment* env = Environment::GetCurrent(args);
CHECK_GE(args.Length(), 2);

joyeecheung Feb 25, 2017

Member

CHECK_EQ?

TimothyGu Feb 25, 2017

Member

All the existing functions use CHECK_GE, and I don't see a reason to be stricter than what this function actually uses.

joyeecheung Feb 26, 2017

Member

Yeah, there are functions in other files using EQ... not sure if we have a convention or not. I just think GE implies there could be more args, which doesn't seem to be the case for this one, though I don't feel very strongly about this.

const size_t n = value.length();
const int64_t start = args[1]->IntegerValue(env->context()).FromJust();
CHECK_GE(start, 0);

joyeecheung Feb 25, 2017

Member

Maybe another CHECK_LT(start, n)? Doesn't do any harm even if start is larger though.

TimothyGu Feb 25, 2017

Member

I'm only checking start >= 0 because I'm converting start to a size_t, which is an unsigned type. In C++, signed-to-unsigned conversion, though a defined operation, acts oddly when the signed value is negative. On the other hand, violating n >= start is a more benign case, and I don't think it warrants the full effect of a runtime assertion (i.e. crashing).
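
To make the "acts oddly" part concrete, the conversion can be modeled with BigInt. This is a sketch that assumes a 64-bit size_t (it is narrower on 32-bit targets), and the helper name asSizeT is hypothetical. The conversion wraps modulo 2^64, so a small negative start turns into a huge positive offset.

```javascript
// Hypothetical model of an int64_t -> size_t conversion, assuming size_t
// is 64 bits wide.
function asSizeT(int64Value) {
  // BigInt.asUintN(64, x) reduces x modulo 2^64, as the C++ conversion does.
  return BigInt.asUintN(64, BigInt(int64Value));
}

console.log(asSizeT(5));   // non-negative values are unchanged
console.log(asSizeT(-1));  // wraps around to 2^64 - 1
```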

joyeecheung Feb 26, 2017

Member

Oh yep, not worth a full on abort ;)

@bnoordhuis

LGTM with a style nit.


TimothyGu added some commits Feb 4, 2017

test: fix WPT.test()'s error handling
Changing err.message after the construction of Error doesn't seem to
change err.stack.
src: remove misleading flag in TwoByteValue
String::REPLACE_INVALID_UTF8 is only applied in V8's
String::WriteUtf8() (i.e. Utf8Value).
url: enforce valid UTF-8 in WHATWG parser
This commit implements the Web IDL USVString conversion, which mandates
all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT
CHARACTER. It also disallows Symbols to be used as USVString per spec.

Certain functions call into C++ methods in the binding that use the
Utf8Value class to access string arguments. Utf8Value already does the
normalization using V8's String::WriteUtf8, so in those cases, instead of
doing the full USVString normalization, only a symbol check is done
(`'' + val`, which uses ES's ToString, versus `String()` which has
special provisions for symbols).
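
The difference between the two stringification forms mentioned in the commit message can be seen in plain JS. This sketch is independent of the PR's code, and the variable names are only illustrative.

```javascript
const sym = Symbol('tag');

// String() has a special case for Symbols and returns a descriptive string.
const viaString = String(sym);

// Concatenation goes through the abstract ToString operation, which throws
// a TypeError for Symbols.
let concatError = null;
try {
  void ('' + sym);
} catch (err) {
  concatError = err;
}

console.log(viaString, concatError instanceof TypeError);
```

This is why `'' + val` alone is enough as a symbol check in the code paths where the C++ side already handles the surrogate replacement.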

TimothyGu Feb 28, 2017

Member

@bnoordhuis, @jasnell, comments addressed. Will land tomorrow if nothing comes up.

CI: https://ci.nodejs.org/job/node-test-commit/8160/

TimothyGu commented Mar 1, 2017

Landed in 7ceea2a...6123ed5.

@TimothyGu TimothyGu closed this Mar 1, 2017

@TimothyGu TimothyGu deleted the TimothyGu:url-usvstring branch Mar 1, 2017

TimothyGu added a commit that referenced this pull request Mar 1, 2017

test: fix WPT.test()'s error handling
Changing err.message after the construction of Error doesn't seem to
change err.stack.

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

TimothyGu added a commit that referenced this pull request Mar 1, 2017

src: remove misleading flag in TwoByteValue
String::REPLACE_INVALID_UTF8 is only applied in V8's
String::WriteUtf8() (i.e. Utf8Value).

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

TimothyGu added a commit that referenced this pull request Mar 1, 2017

url: enforce valid UTF-8 in WHATWG parser
This commit implements the Web IDL USVString conversion, which mandates
all unpaired Unicode surrogates be turned into U+FFFD REPLACEMENT
CHARACTER. It also disallows Symbols to be used as USVString per spec.

Certain functions call into C++ methods in the binding that use the
Utf8Value class to access string arguments. Utf8Value already does the
normalization using V8's String::WriteUtf8, so in those cases, instead of
doing the full USVString normalization, only a symbol check is done
(`'' + val`, which uses ES's ToString, versus `String()` which has
special provisions for symbols).

PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

TimothyGu added a commit that referenced this pull request Mar 1, 2017

benchmark: add USVString conversion benchmark
PR-URL: #11436
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

@TimothyGu TimothyGu moved this from Existing spec to Done in WHATWG URL implementation Mar 1, 2017

evanlucas Mar 7, 2017

Member

This is not landing cleanly on v7.x-staging. Want to submit a backport PR?
