Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip first BOM test is invalid #2

Closed
gsnedders opened this issue Apr 10, 2013 · 6 comments
Closed

skip first BOM test is invalid #2

gsnedders opened this issue Apr 10, 2013 · 6 comments
Labels

Comments

@gsnedders
Copy link
Member

JSON transmits abstract Unicode strings, so a leading U+FEFF is a ZWNBSP and not a BOM. As such, the expectation should be that it is tokenized as "\ufefffoo\uffeff".

Of course, if you were to serialize it as UTF-16-BE and then parse it as UTF-16, then yes, you would have a BOM, but this is not the case.

/prod @davidflanagan given it's his TC.

@davidflanagan
Copy link
Contributor

I gather you're referring to this? https://github.com/html5lib/html5lib-tests/blob/master/tokenizer/domjs.test#L21

The test worked in a JS-based test harness, using, I assume JSON.parse(). But JS does use UTF-16 internally.
Is it failing for other languages?

It seems silly to have test cases encoded in JSON, I guess. Maybe each input needs to be in its own binary file?

I don't know the Unicode details of JSON, or what an "abstract Unicode string" is or what a ZWNBSP is either, so I'm unqualified to fix this. Perhaps removing the test is the best we can do.

@gsnedders
Copy link
Member Author

Yes. Using JSON.parse on it would get something from which it shouldn't be stripped: you only want to strip a leading U+FEFF from a string when you have it in either UTF-8 or an encoding with an unknown endianness (e.g., UTF-16 but not UTF-16-BE). What you have in JS is a series of UTF-16 codeunits, and hence they are not in any encoding with an unknown endianness (there is no concept of endianness, as you just have a 16-bit unit!).

Perhaps JSON isn't great, but we've got enough implementations that it'd be a pain to change it.

By "abstract Unicode string" I just mean a sequence of Unicode codepoints that are detached from any encoding (as in the unicode type in Python 2, for example — there is some internal encoding used to represent it in memory, but it's not in any way visible). And a ZWNBSP is what U+FEFF is when it isn't a BOM.

Mostly I'm just curious if anyone will complain if I change the expectation to "\uFFEFfoo\uFFEF". @hsivonen, @abarth?

@davidflanagan
Copy link
Contributor

you only want to strip a leading U+FEFF from a string when you have it in either UTF-8

And that is the test I was trying to write here, constrained by the limitations of the test infrastructure.

Inverting the sense of the test will obviously mean that my parser will now fail. But I don't know of anyone using it, so that might not be a big deal.

I'm not confident that this is testable with the current JSON-based infrastructure, and think it might be better to remove the test rather than trying to fix it. It was just a bonus thing I added because it improved the test coverage of my implementation.

@gsnedders
Copy link
Member Author

FWIW, html5lib-python tests stripping the BOM (which is done in the input stream processing, as part of decoding the incoming datastream) in custom tests for the input stream. The vast majority of the pre-processing doesn't apply to the tokenizer tests as they don't need to be decoded before being run.

@abarth
Copy link
Contributor

abarth commented Apr 10, 2013

Fine with me. We don't run the tokenizer tests.

@hsivonen
Copy link
Member

I think BOM handling belongs on the encoding conversion layer, so I think it would be appropriate not to test it on the tokenization layer.

inikulin added a commit to inikulin/parse5 that referenced this issue Sep 1, 2015
sundouzis pushed a commit to sundouzis/parse5 that referenced this issue Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants