
Proposal for a method to determine if a JavaScript string is a valid USV string


mathiasbynens/proposal-is-usv-string

 
 


String is USV String

Champions: Guy Bedford

Author: Guy Bedford

Status: draft

Problem Statement

ECMAScript strings may contain unpaired UTF-16 surrogates, unlike strings in many other Unicode implementations.

WebIDL, on the other hand, defines a USVString as a list of Unicode Scalar Values: code points in the ranges [0, 0xD7FF] and [0xE000, 0x10FFFF] only. Surrogate code points fall outside these ranges, so a USVString can never contain an unpaired surrogate.

In addition, the WebAssembly Interface Types proposal also restricts string values to lists of Unicode Scalar Values, as polled in a recent CG meeting.

Since passing JavaScript strings to platform and WebAssembly APIs is a very common use case, strings regularly need to be validated, both within the platform and in userland code.
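To make the two ranges concrete, here is a minimal sketch of a scalar value test for a single code point. The helper name `isScalarValue` is hypothetical and is not part of this proposal or of WebIDL.

```javascript
// Hypothetical helper (not part of the proposal): returns true when
// `codePoint` is a Unicode Scalar Value, that is, an integer in
// [0, 0xD7FF] or [0xE000, 0x10FFFF].
function isScalarValue(codePoint) {
  return Number.isInteger(codePoint) &&
    ((codePoint >= 0 && codePoint <= 0xD7FF) ||
     (codePoint >= 0xE000 && codePoint <= 0x10FFFF));
}

// The whole surrogate block 0xD800..0xDFFF is excluded,
// while the code points on either side of it are allowed.
const boundaryChecks = [0xD7FF, 0xD800, 0xDFFF, 0xE000].map(isScalarValue);
```

A string is then a valid USV string exactly when every code point obtained by decoding its UTF-16 code units passes this test.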

Proposal

The proposal is to define in ECMA-262 a static String method that checks whether a given ECMAScript string is a valid USV string.

Because this check is so common at the boundary between JavaScript and WebIDL or Wasm interfaces, a builtin should ease integration: callers can decide to throw or to run a conversion as necessary, without having to write custom validation code first.
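The "throw or convert" choice can be sketched as follows. Everything here is an assumption for illustration: the proposal does not name the method, so `isUSVString` is a userland stand-in, and `requireUSVString` and `toUSVString` are hypothetical wrappers. The stand-in relies on the `u` flag, under which a well-formed surrogate pair is matched as a single code point above U+FFFF, so the character class only matches unpaired surrogates.

```javascript
// Matches a single unpaired surrogate. With the `u` flag, a valid
// surrogate pair is read as one code point and falls outside this class.
const LONE_SURROGATE = /[\uD800-\uDFFF]/u;

// Hypothetical userland stand-in for the proposed builtin check.
function isUSVString(str) {
  return !LONE_SURROGATE.test(str);
}

// Policy 1: reject strings that are not valid USV strings.
function requireUSVString(str) {
  if (!isUSVString(str)) {
    throw new TypeError('String contains an unpaired surrogate');
  }
  return str;
}

// Policy 2: convert, replacing each unpaired surrogate with U+FFFD.
function toUSVString(str) {
  return isUSVString(str) ? str : str.replace(/[\uD800-\uDFFF]/gu, '\uFFFD');
}
```

With a cheap builtin check, the conversion path only runs for the strings that actually need it.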

Algorithm

The validation algorithm is effectively the standard UTF-16 validation algorithm: it iterates through the string pairing UTF-16 surrogates, and fails validation on any unpaired surrogate, meaning a lead surrogate that is not followed by a trail surrogate, or a trail surrogate that is not preceded by a lead surrogate.

The equivalent algorithm in JavaScript is likely something along the lines of:

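// The function name is illustrative only.
function isUSVString(str) {
  let i = 0;
  while (i < str.length) {
    const surrogatePrefix = str.charCodeAt(i) & 0xFC00;
    // Lead (high) surrogate: must be followed by a trail (low) surrogate
    if (surrogatePrefix === 0xD800) {
      // Validate surrogate pair, double increment.
      // Past the end, charCodeAt returns NaN and (NaN & 0xFC00) is 0.
      if ((str.charCodeAt(i + 1) & 0xFC00) !== 0xDC00)
        return false;
      i += 2;
    }
    // Trail (low) surrogate with no preceding lead surrogate
    else if (surrogatePrefix === 0xDC00) {
      return false;
    }
    // Non-surrogate code unit (below 0xD800 or above 0xDFFF), single increment
    else {
      i += 1;
    }
  }
  return true;
}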
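As a cross-check on the code-unit loop, the same result can be reached by iterating code points instead. This is a sketch of an alternative formulation, not the proposed specification text, and the name `isUSVStringByCodePoint` is hypothetical. It relies on the string iterator, which yields a well-formed surrogate pair as one code point and an unpaired surrogate as a single code unit.

```javascript
// Hypothetical alternative: walk the string by code point and reject
// any code point that lies in the surrogate block 0xD800..0xDFFF.
function isUSVStringByCodePoint(str) {
  for (const ch of str) {
    const codePoint = ch.codePointAt(0);
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;
  }
  return true;
}

// Sample inputs paired with the result a USV check should give.
const samples = [
  ['', true],               // empty string
  ['abc', true],            // no surrogates
  ['\uE000\uFFFD', true],   // BMP code units above the surrogate block
  ['\uD83D\uDE00', true],   // well-formed pair
  ['\uD83D', false],        // lone lead surrogate at the end
  ['a\uD83Db', false],      // lead surrogate followed by a non-surrogate
  ['\uDE00', false],        // lone trail surrogate
  ['\uDE00\uD83D', false],  // pair in the wrong order
];

const allAsExpected = samples.every(
  ([input, expected]) => isUSVStringByCodePoint(input) === expected
);
```

The sample with code units above 0xDFFF is worth keeping in any test suite, since those values are valid and are easy to misclassify when masking with 0xFC00.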

FAQs

Isn't this possible today without needing a builtin API?

It is, but a builtin serves two major use cases: integration with other specifications, and userland code that, for example, interfaces with WebAssembly.

For users, it avoids having to write custom validators when handling string input in its various forms. Instead, a fully correct platform API makes it easy to establish the USV guarantee that subsequent processing can rely on.

For other platform specifications, being able to reference an ECMA-262 method for validation could also ease integration, since many APIs are now unifying on USV strings as a standard. For example, the need for such a method previously came up in integration with the TextEncoder API, which a specification like this could assist with.
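The TextEncoder case can be illustrated with existing behavior: `TextEncoder.prototype.encode` takes a USVString, so an unpaired surrogate is silently replaced with U+FFFD before encoding. The snippet below is only a demonstration of that existing conversion and assumes an environment where `TextEncoder` and `TextDecoder` are available as globals.

```javascript
const input = 'a\uD800b'; // contains an unpaired lead surrogate

// encode() converts its argument to a USVString first, so the lone
// surrogate becomes U+FFFD, which is EF BF BD in UTF-8.
const bytes = new TextEncoder().encode(input);
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
// hex is "61 ef bf bd 62"

// Decoding gives back the replaced string, not the original input.
const roundTripped = new TextDecoder().decode(bytes);
const lossy = roundTripped !== input; // true: data changed without any error
```

A validation method would let callers detect this situation before encoding, rather than discovering afterwards that the round trip was lossy.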

Why is the proposal to add a static String method?

A static method on String seems like the safest home for such a method. Adding new methods to String.prototype is likely quite risky, but could be considered as well.

Are the primary benefits performance?

Performance is not the only motivation, but a builtin method should still be able to run faster than validation in user code.

In the future, implementations might even associate a flag bit with string data and maintain it through string operations, making the check effectively zero cost. That would be entirely up to implementations, though, and is outside the scope of this specification.
