Deal in scalar values #3

Closed
mathiasbynens opened this issue Apr 11, 2014 · 9 comments

Comments

@mathiasbynens
Owner

Should we disallow lone surrogates as per whatwg/encoding@4abe74d? cc @jwerle

@jwerle

jwerle commented Apr 11, 2014

Is the goal to be whatwg/encoding compliant ?

@mathiasbynens
Owner Author

Ok, there is now a separate WTF-8 encoding specified (thanks to @SimonSapin) for UTF-8 with added support for lone-surrogate byte sequences. JS library: https://github.com/mathiasbynens/wtf-8

So let’s make utf8.js deal with actual UTF-8 as per the encoding standard.

@jwerle

jwerle commented Oct 3, 2014

awesome !

mathiasbynens added a commit that referenced this issue Jan 8, 2015
mathiasbynens added a commit that referenced this issue Jan 8, 2015
mathiasbynens added a commit that referenced this issue Jan 8, 2015
mathiasbynens added a commit that referenced this issue Jan 8, 2015
@jwerle

jwerle commented Jan 8, 2015

nice !

@cblair

cblair commented Dec 23, 2015

I had a question about this enhancement: I'm getting the 'Error: Lone surrogate XXX is not a scalar value' error after some tests that feed in random strings. The string is not valid as is, but I think it should be encodable into a string that is valid. The error itself is legitimate, but does utf8.js have the ability to encode lone surrogate code points if they're not in a pair?

The https://simonsapin.github.io/wtf-8/ page states that WTF-8 (great name) 'encodes surrogate code points if they are not in a pair'. I don't think this was added to utf8.js, but hopefully I'm missing something.

For example, the byte string "\xED\xB9\xBA\xF2\x87\xA2\x9F\xE2\xBC\x9A\xED\xB9\xBA\xE2\xBC\x9A" in byteArray has values 237,185,186,242,135,162,159,226,188,154,237,185,186,226,188,154 at utf8.js:70. At byteIndex == 3, codepoint == 56954 (0xDE7A). This throws the error.

For reproducibility, the original string (value) is "7bm68oein+K8mu25uuK8mg==", Base64-encoded. I hit this error with the following JS:

utf8.decode(atob(value))

Thanks so much.
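
The arithmetic in the report above can be checked by hand. This is a minimal sketch, not code from utf8.js: `decodeThreeByteSequence` and `isSurrogate` are hypothetical helper names, and the decoder deliberately skips all validity checks so that the surrogate value stays visible.

```javascript
// Hypothetical helper, not part of utf8.js: decode one three-byte UTF-8
// sequence into its code point without any validity checks.
function decodeThreeByteSequence(b1, b2, b3) {
  return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
}

// Surrogate code points occupy U+D800..U+DFFF.
function isSurrogate(codePoint) {
  return codePoint >= 0xD800 && codePoint <= 0xDFFF;
}

// First three bytes from the report: 237, 185, 186 (0xED 0xB9 0xBA).
const codePoint = decodeThreeByteSequence(237, 185, 186);
console.log(codePoint, '0x' + codePoint.toString(16).toUpperCase()); // 56954 0xDE7A
console.log(isSurrogate(codePoint)); // true
```

The first three bytes therefore decode to a low surrogate with no preceding high surrogate, which is why a strict UTF-8 decoder rejects the input.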

@SimonSapin

https://simonsapin.github.io/wtf-8/#motivation has some background on what surrogate code points are and how they came to be.

The Unicode standard defines byte sequences in <ED, A0...BF, 80...BF> (that would otherwise represent surrogate code points in U+D800...U+DFFF) to be ill-formed in UTF-8. utf8.js deliberately rejects them.
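
As an illustration of that byte pattern, here is a minimal sketch of a scanner for it. `findEncodedSurrogate` is a hypothetical name and is not how utf8.js is implemented; it only looks for <ED, A0...BF, 80...BF> and does no other UTF-8 validation. Scanning byte by byte is safe for this pattern because 0xED can never be a continuation byte.

```javascript
// Hypothetical helper: return the index of the first byte sequence of the
// form <ED, A0..BF, 80..BF> in an array of byte values, or -1 if none exists.
function findEncodedSurrogate(bytes) {
  for (let i = 0; i + 2 < bytes.length; i++) {
    const isLead = bytes[i] === 0xED;
    const isSecond = bytes[i + 1] >= 0xA0 && bytes[i + 1] <= 0xBF;
    const isThird = bytes[i + 2] >= 0x80 && bytes[i + 2] <= 0xBF;
    if (isLead && isSecond && isThird) {
      return i;
    }
  }
  return -1;
}

// Bytes from the report earlier in this thread.
const reported = [237, 185, 186, 242, 135, 162, 159, 226, 188, 154,
                  237, 185, 186, 226, 188, 154];
console.log(findEncodedSurrogate(reported)); // 0
```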

Similarly, JavaScript strings are arbitrary 16-bit sequences and are not necessarily well-formed in UTF-16.

WTF-8 is designed to be able to encode any 16-bit sequences (such as JS strings) in a way compatible with UTF-8, but it is not UTF-8. You probably shouldn’t be using WTF-8.
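
To make the point about JS strings concrete, here is a minimal sketch that reports whether a string contains a lone surrogate, that is, whether it fails to be well-formed UTF-16. `findLoneSurrogate` is a hypothetical helper name and is not part of utf8.js or wtf-8.

```javascript
// Hypothetical helper: return the index of the first lone surrogate code
// unit in a JS string, or -1 if the string is well-formed UTF-16.
function findLoneSurrogate(string) {
  for (let i = 0; i < string.length; i++) {
    const unit = string.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: only valid when a low surrogate follows immediately.
      const next = i + 1 < string.length ? string.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // Skip the low half of a valid pair.
        continue;
      }
      return i;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      return i;
    }
  }
  return -1;
}

console.log(findLoneSurrogate('abc'));          // -1
console.log(findLoneSurrogate('\uD83D\uDE00')); // -1 (valid pair)
console.log(findLoneSurrogate('a\uDE7A'));      // 1 (lone low surrogate)
```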

@cblair

cblair commented Dec 28, 2015

Thanks Simon, makes sense. Our trouble is that we're using utf8 in some of our functional test code, and we want to feed these illegal code points into our production code. This exception stops us from doing that.

Short-term solution for us is to hold back to version 2.0.0. But maybe there'll be an allowed case for doing decodes on this in the future. Maybe I'll propose a PR at some time. :)

@SimonSapin

Out of curiosity: why do you need this?

@cblair

cblair commented Dec 28, 2015

We have a String class for some internal code that's driven off our own specs and internal requirements. So we have to allow utf8 decodes of bad stuff in our test code, so that we can verify our String code catches it. The utf8 code is being too full-featured here: our internal code has to implement that feature!

I admit, it's kind of a fringe use case. But it's kind of nice to be able to say specifically 'Decode this, despite some bad stuff I'm putting in'. For our use, we just want the bytes and the right amount of them, right or wrong.
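
For test code with that requirement, one option is a small lenient decoder kept outside utf8.js. The following is only a sketch under stated assumptions: `lenientDecode` is a hypothetical function, it takes an array of byte values rather than a byte string, it lets surrogate code points through as lone code units, and it does not check for overlong encodings. The result is not UTF-8 decoding as the encoding standard defines it.

```javascript
// Hypothetical lenient decoder for test code only; not part of utf8.js.
// Surrogate code points are passed through instead of being rejected.
// Bad lead bytes, bad continuation bytes, and truncated input still throw.
function lenientDecode(bytes) {
  let result = '';
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i++];
    let needed;
    let codePoint;
    if (lead < 0x80) {
      needed = 0;
      codePoint = lead;
    } else if ((lead & 0xE0) === 0xC0) {
      needed = 1;
      codePoint = lead & 0x1F;
    } else if ((lead & 0xF0) === 0xE0) {
      needed = 2;
      codePoint = lead & 0x0F;
    } else if ((lead & 0xF8) === 0xF0) {
      needed = 3;
      codePoint = lead & 0x07;
    } else {
      throw new Error('Invalid lead byte at index ' + (i - 1));
    }
    for (; needed > 0; needed--) {
      if (i >= bytes.length || (bytes[i] & 0xC0) !== 0x80) {
        throw new Error('Invalid continuation byte at index ' + i);
      }
      codePoint = (codePoint << 6) | (bytes[i++] & 0x3F);
    }
    if (codePoint > 0x10FFFF) {
      throw new Error('Code point out of range');
    }
    // String.fromCodePoint accepts surrogate values and emits them as-is.
    result += String.fromCodePoint(codePoint);
  }
  return result;
}

// Bytes from the report earlier in this thread.
const decoded = lenientDecode([237, 185, 186, 242, 135, 162, 159, 226, 188, 154,
                               237, 185, 186, 226, 188, 154]);
console.log(decoded.length);                     // 6
console.log(decoded.charCodeAt(0).toString(16)); // de7a
```

One caveat with this approach: if the input encodes a lone high surrogate directly followed by a lone low surrogate, the two code units become indistinguishable from a real pair in the resulting JS string.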
