skip first BOM test is invalid #2

gsnedders · 2013-04-10T13:31:35Z

JSON transmits abstract Unicode strings, so a leading U+FEFF is a ZWNBSP and not a BOM. As such, the expectation should be that it is tokenized as "\ufefffoo\ufeff".

Of course, if you were to serialize it as UTF-16BE and then parse it as UTF-16, then yes, you would have a BOM, but this is not the case.
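A rough way to see the difference, sketched in Python rather than in any particular test harness (the variable names here are made up for illustration): parsing the JSON escape gives a string that still starts with U+FEFF, and it only turns into a BOM if the string is written out as bytes and then decoded without a known byte order.

```python
import json

# The test input as it would sit in the JSON file: escapes, not raw bytes.
raw_json = '"\\ufefffoo\\ufeff"'
parsed = json.loads(raw_json)

# JSON parsing yields code points; the leading U+FEFF is still present.
assert parsed == "\ufefffoo\ufeff"
assert len(parsed) == 5

# Serialize with an explicit byte order: no BOM is added, U+FEFF is just content.
as_utf16be = parsed.encode("utf-16-be")
assert as_utf16be[:2] == b"\xfe\xff"

# Decoding with a known byte order keeps the leading U+FEFF.
assert as_utf16be.decode("utf-16-be") == "\ufefffoo\ufeff"

# Decoding as plain UTF-16 (byte order unknown) consumes the same bytes as a BOM.
assert as_utf16be.decode("utf-16") == "foo\ufeff"
```

Only the last step involves a BOM at all, and the tokenizer tests never go through it.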

/prod @davidflanagan given it's his TC.

davidflanagan · 2013-04-10T16:54:15Z

I gather you're referring to this? https://github.com/html5lib/html5lib-tests/blob/master/tokenizer/domjs.test#L21

The test worked in a JS-based test harness, using, I assume, JSON.parse(). But JS does use UTF-16 internally.
Is it failing for other languages?

It seems silly to have test cases encoded in JSON, I guess. Maybe each input needs to be in its own binary file?

I don't know the Unicode details of JSON, or what an "abstract Unicode string" is or what a ZWNBSP is either, so I'm unqualified to fix this. Perhaps removing the test is the best we can do.

gsnedders · 2013-04-10T17:17:12Z

Yes. Using JSON.parse on it would get something from which it shouldn't be stripped: you only want to strip a leading U+FEFF from a string when you have it in either UTF-8 or an encoding with an unknown endianness (e.g., UTF-16 but not UTF-16BE). What you have in JS is a series of UTF-16 code units, and hence they are not in any encoding with an unknown endianness (there is no concept of endianness, as you just have a 16-bit unit!).
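To illustrate the code-unit point, here is a small Python sketch; the list of integers is only a stand-in for how a JS string holds its contents, not how any engine actually stores it. Byte order shows up only once the 16-bit units are written out as bytes.

```python
import struct

# "\ufefffoo" as a sequence of 16-bit code units (plain integers, no byte order).
code_units = [0xFEFF, 0x0066, 0x006F, 0x006F]

# Endianness only appears when the units are serialized to bytes.
big_endian = struct.pack(">4H", *code_units)
little_endian = struct.pack("<4H", *code_units)

assert big_endian[:2] == b"\xfe\xff"
assert little_endian[:2] == b"\xff\xfe"

# A decoder that does not know the byte order reads those first two bytes
# to pick one, and consumes them in the process.
assert big_endian.decode("utf-16") == "foo"
assert little_endian.decode("utf-16") == "foo"
```

With the integers themselves there is nothing for a BOM to disambiguate, which is why the U+FEFF should be left in place.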

Perhaps JSON isn't great, but we've got enough implementations that it'd be a pain to change it.

By "abstract Unicode string" I just mean a sequence of Unicode codepoints that are detached from any encoding (as in the unicode type in Python 2, for example — there is some internal encoding used to represent it in memory, but it's not in any way visible). And a ZWNBSP is what U+FEFF is when it isn't a BOM.

Mostly I'm just curious if anyone will complain if I change the expectation to "\uFEFFfoo\uFEFF". @hsivonen, @abarth?

davidflanagan · 2013-04-10T17:30:06Z

> you only want to strip a leading U+FEFF from a string when you have it in either UTF-8

And that is the test I was trying to write here, constrained by the limitations of the test infrastructure.

Inverting the sense of the test will obviously mean that my parser will now fail. But I don't know of anyone using it, so that might not be a big deal.

I'm not confident that this is testable with the current JSON-based infrastructure, and think it might be better to remove the test rather than trying to fix it. It was just a bonus thing I added because it improved the test coverage of my implementation.

gsnedders · 2013-04-10T17:36:48Z

FWIW, html5lib-python tests stripping the BOM (which is done in the input stream processing, as part of decoding the incoming datastream) in custom tests for the input stream. The vast majority of the pre-processing doesn't apply to the tokenizer tests as they don't need to be decoded before being run.

abarth · 2013-04-10T18:22:05Z

Fine with me. We don't run the tokenizer tests.

hsivonen · 2013-04-11T06:41:26Z

I think BOM handling belongs on the encoding conversion layer, so I think it would be appropriate not to test it on the tokenization layer.
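A minimal sketch of that layering in Python, under stated assumptions: `decode_input` and `tokenize_characters` are hypothetical names invented for this example, not functions from html5lib or any other parser, and the UTF-8 fallback stands in for real encoding sniffing.

```python
import codecs


def decode_input(data: bytes) -> str:
    """Hypothetical input-stream step: sniff a BOM, pick the encoding, drop the BOM."""
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM: assume UTF-8 for this sketch.
    return data.decode("utf-8")


def tokenize_characters(text: str) -> list:
    """Hypothetical stand-in for a tokenizer: passes the text through untouched."""
    return [["Character", text]]


# Bytes go through the decoder first, so the leading BOM is gone before tokenizing.
from_bytes = tokenize_characters(decode_input(b"\xef\xbb\xbffoo\xef\xbb\xbf"))
assert from_bytes == [["Character", "foo\ufeff"]]

# An already-decoded string (what a JSON test file provides) is tokenized as is.
from_string = tokenize_characters("\ufefffoo\ufeff")
assert from_string == [["Character", "\ufefffoo\ufeff"]]
```

The tokenizer never needs to know about BOMs in this arrangement; stripping is tested wherever `decode_input` (or its real equivalent) is tested.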


gsnedders mentioned this issue Apr 10, 2013

skip first BOM test is invalid html5lib/html5lib-python#14

Closed

gsnedders closed this as completed in f6a1b20 Apr 11, 2013

inikulin added a commit to inikulin/parse5 that referenced this issue Sep 1, 2015

Spec changes - Tokenizer: BOM should be stripped by the decoder and should be processed as is by tokenizer ref: html5lib/html5lib-tests#2

7804e49

sundouzis pushed a commit to sundouzis/parse5 that referenced this issue Sep 13, 2022

Spec changes - Tokenizer: BOM should be stripped by the decoder and should be processed as is by tokenizer ref: html5lib/html5lib-tests#2

65699de
