Deal in scalar values #3

mathiasbynens · 2014-04-11T12:30:48Z

Should we disallow lone surrogates as per whatwg/encoding@4abe74d? cc @jwerle

jwerle · 2014-04-11T12:36:09Z

Is the goal to be whatwg/encoding compliant ?

mathiasbynens · 2014-10-03T09:06:17Z

Ok, there is now a separate WTF-8 encoding specified (thanks to @SimonSapin) for UTF-8 with added support for lone-surrogate byte sequences. JS library: https://github.com/mathiasbynens/wtf-8

So let’s make utf8.js deal with actual UTF-8 as per the encoding standard.

jwerle · 2014-10-03T11:48:56Z

awesome !

jwerle · 2015-01-08T13:27:03Z

nice !

cblair · 2015-12-23T19:09:57Z

I have a question about this enhancement. I'm getting the 'Error: Lone surrogate XXX is not a scalar value' error in some tests that feed in random strings. The string is not valid as is, but I think it should be encodable into a string that is valid. The error itself is correct, but does utf8.js have the ability to encode lone surrogate code points if they're not in a pair?

The https://simonsapin.github.io/wtf-8/ page states that WTF-8 (great name) 'encodes surrogate code points if they are not in a pair'. I don't think this was added to utf8.js, but hopefully I'm missing something.

For example, the binary string in byteArray has the values 237,185,186,242,135,162,159,226,188,154,237,185,186,226,188,154 at utf8.js:70. At byteIndex == 3, codepoint == 56954 (0xDE7A), which is a lone low surrogate. This throws the error.
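The arithmetic behind that code point can be checked by hand. This is a minimal sketch that assumes the standard three-byte UTF-8 layout; the helper names are hypothetical and are not part of utf8.js.

```javascript
// First three bytes of the reported input: 0xED 0xB9 0xBA.
const firstSequence = [237, 185, 186];

// Three-byte UTF-8 layout: 1110xxxx 10yyyyyy 10zzzzzz.
// Hypothetical helper: no validation, it only recombines the payload bits.
function decodeThreeByteSequence([b1, b2, b3]) {
  return ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
}

// Surrogate code points occupy U+D800 through U+DFFF.
function isSurrogate(codePoint) {
  return codePoint >= 0xd800 && codePoint <= 0xdfff;
}

const codePoint = decodeThreeByteSequence(firstSequence);
console.log(codePoint === 0xde7a); // true (56954 in decimal)
console.log(isSurrogate(codePoint)); // true
```

Because the value falls in the surrogate range, a decoder that deals only in scalar values has to reject it.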

For reproducibility, the original string (value) is "7bm68oein+K8mu25uuK8mg==", base64-encoded. I hit this error with the following JS:

utf8.decode(atob(value))

Thanks so much.

SimonSapin · 2015-12-23T22:06:05Z

https://simonsapin.github.io/wtf-8/#motivation has some background on what surrogate code points are and how they came to be.

The Unicode standard defines byte sequences in <ED, A0...BF, 80...BF> (that would otherwise represent surrogate code points in U+D800...U+DFFF) to be ill-formed in UTF-8. utf8.js deliberately rejects them.
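To make the byte pattern concrete, here is a rough sketch that scans a byte array for sequences of the form <ED, A0...BF, 80...BF>. The function name is hypothetical, and this is not how utf8.js performs its check; it only illustrates the range described above, using the byte values reported earlier in this thread.

```javascript
// Hypothetical helper: returns the start index of every byte triple that
// matches <ED, A0..BF, 80..BF>, i.e. an encoded surrogate code point.
function findSurrogateSequences(bytes) {
  const hits = [];
  for (let i = 0; i + 2 < bytes.length; i++) {
    const isLead = bytes[i] === 0xed;
    const secondInRange = bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf;
    const thirdInRange = bytes[i + 2] >= 0x80 && bytes[i + 2] <= 0xbf;
    if (isLead && secondInRange && thirdInRange) {
      hits.push(i);
    }
  }
  return hits;
}

// Byte values from the report above.
const reported = [237, 185, 186, 242, 135, 162, 159, 226, 188, 154,
                  237, 185, 186, 226, 188, 154];
console.log(findSurrogateSequences(reported)); // [ 0, 10 ]
```

The input contains two such sequences, so a strict decoder fails on the very first one.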

Similarly, JavaScript strings are arbitrary 16-bit sequences and are not necessarily well-formed in UTF-16.
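A short illustration of that point: the sketch below walks a JavaScript string one 16-bit code unit at a time and reports the positions of surrogates that are not part of a valid pair. The function name is hypothetical.

```javascript
// Hypothetical helper: returns the indices of unpaired surrogate code units.
function findLoneSurrogates(str) {
  const positions = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: well-formed only if a low surrogate follows.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
      } else {
        positions.push(i);
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate without a preceding high surrogate.
      positions.push(i);
    }
  }
  return positions;
}

console.log(findLoneSurrogates('a\uD83D\uDE00')); // [] (valid pair)
console.log(findLoneSurrogates('a\uDE7Ab'));      // [ 1 ]
```

Both literals are legal JavaScript strings, but only the first is well-formed UTF-16.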

WTF-8 is designed to be able to encode any 16-bit sequences (such as JS strings) in a way compatible with UTF-8, but it is not UTF-8. You probably shouldn’t be using WTF-8.
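The difference can be sketched in a few lines. The code below is an illustration only: the function names and the `strict` option are hypothetical, and it does not reproduce the API of either utf8.js or the wtf-8 library. It encodes a JavaScript string code point by code point; in strict mode it refuses lone surrogates, as UTF-8 requires, and otherwise it emits them as three-byte sequences the way WTF-8 does.

```javascript
// Generalized UTF-8 encoding of a single code point (surrogates included).
function encodeCodePoint(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical encoder. String iteration joins valid surrogate pairs into one
// code point and yields any lone surrogate on its own.
function encodeString(str, { strict = true } = {}) {
  const out = [];
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (strict && cp >= 0xd800 && cp <= 0xdfff) {
      throw new Error('Lone surrogate U+' + cp.toString(16).toUpperCase() +
                      ' is not a scalar value');
    }
    out.push(...encodeCodePoint(cp));
  }
  return out;
}

console.log(encodeString('\uDE7A', { strict: false })); // [ 237, 185, 186 ]
console.log(encodeString('\uD83D\uDE00'));              // [ 240, 159, 152, 128 ]
```

With the default `strict: true`, `encodeString('\uDE7A')` throws instead of producing bytes, which mirrors the behavior discussed in this thread.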

cblair · 2015-12-28T21:49:33Z

Thanks Simon, makes sense. Our trouble is that we're using utf8 in some of our functional test code, and we want to feed these illegal code points into our production code. This exception stops us from doing that.

The short-term solution for us is to hold back at version 2.0.0. But maybe there'll be an allowed case for doing decodes on this in the future. Maybe I'll propose a PR at some time. :)

SimonSapin · 2015-12-28T22:05:48Z

Out of curiosity: why do you need this?

cblair · 2015-12-28T22:57:12Z

We have a String class for some internal code that's driven off our own specs and internal requirements. So we have to allow utf8 decodes of bad stuff in our test code, so that we can verify our String code catches it. The utf8 code is being too full-featured here; our internal code has to implement that feature itself!

I admit, it's kind of a fringe use case. But it's kind of nice to be able to say specifically 'Decode this, despite some bad stuff I'm putting in'. For our use, we just want the bytes and the right amount of them, right or wrong.
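For test code of this kind, one option is a small permissive decoder kept alongside the tests. The sketch below is hypothetical and is not a proposed utf8.js API: it checks only the lead and continuation byte patterns and deliberately does not reject surrogates, overlong forms, or values above U+10FFFF.

```javascript
// Hypothetical test helper: decodes bytes into code points, letting
// surrogate code points through instead of throwing.
function decodeLenient(bytes) {
  const codePoints = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let needed; // number of continuation bytes that must follow
    let cp;     // accumulated code point
    if (lead < 0x80) { needed = 0; cp = lead; }
    else if ((lead & 0xe0) === 0xc0) { needed = 1; cp = lead & 0x1f; }
    else if ((lead & 0xf0) === 0xe0) { needed = 2; cp = lead & 0x0f; }
    else if ((lead & 0xf8) === 0xf0) { needed = 3; cp = lead & 0x07; }
    else throw new Error('Invalid lead byte at index ' + i);

    for (let k = 1; k <= needed; k++) {
      const b = bytes[i + k]; // undefined if the input is truncated
      if ((b & 0xc0) !== 0x80) {
        throw new Error('Missing or invalid continuation byte at index ' + (i + k));
      }
      cp = (cp << 6) | (b & 0x3f);
    }
    codePoints.push(cp);
    i += needed + 1;
  }
  return codePoints;
}

// Byte values from the report earlier in this thread.
const sample = [237, 185, 186, 242, 135, 162, 159, 226, 188, 154,
                237, 185, 186, 226, 188, 154];
console.log(decodeLenient(sample).map((cp) => cp.toString(16)));
// [ 'de7a', '8789f', '2f1a', 'de7a', '2f1a' ]
```

The result has five code points for sixteen bytes, two of which are the lone surrogate U+DE7A that the strict decoder rejects.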

mathiasbynens mentioned this issue Nov 14, 2014

exposed ucs2encode() + allowed return of codepoint as array from utf8dec... #6

Closed

mathiasbynens mentioned this issue Jan 1, 2015

Codepoint arrays and binary strings #7

Open

mathiasbynens added a commit that referenced this issue Jan 8, 2015

Deal in scalar values

f6735d4

Ref. #3.

mathiasbynens closed this as completed in f776c67 Jan 8, 2015

mathiasbynens added a commit that referenced this issue Jan 8, 2015

Deal in scalar values

d6bd4f1

Closes #3.

mathiasbynens added a commit that referenced this issue Jan 8, 2015

Deal in scalar values

b9e9786

Closes #3.

mathiasbynens added a commit that referenced this issue Jan 8, 2015

Deal in scalar values

728b895

Closes #3.
