Proposed erratum: if a UTF-8 grammar begins with a BOM, the BOM must be ignored #174

ndw · 2023-04-13T09:07:55Z

E002: Ignore UTF-8 BOM

Implementations are instructed to ignore the BOM if it occurs at the beginning of a grammar encoded in UTF-8.

The second paragraph in the description of the grammar is changed to:

A grammar is an optional prolog, followed by a sequence of one or more rules, surrounded and separated by spacing and comments. Spacing and comments are entirely optional, except that rules must be separated by at least one of either (error S01). If an input grammar encoded in UTF-8 begins with a byte order mark (BOM), the BOM must be ignored.
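A minimal sketch of the behaviour this text asks for, assuming the grammar arrives as raw bytes already known to be UTF-8. The function name `decode_grammar` is hypothetical and is not taken from any ixml implementation.

```python
import codecs


def decode_grammar(data: bytes) -> str:
    """Decode a UTF-8 grammar, ignoring one leading BOM if present."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Any U+FEFF that is not at the very start is left untouched.
    return data.decode("utf-8")
```

In Python, decoding with the built-in `"utf-8-sig"` codec has the same effect for a leading BOM, so the explicit check above is mainly there to make the rule visible.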

graydon2014 · 2023-04-13T14:13:44Z

I would like to say this is not sufficient, because the input could have been constructed by concatenating UTF-8 files which each begin with a BOM.

ndw · 2023-04-13T15:25:46Z

That feels like a bug in the program you're using to concatenate them. And critically, U+FEFF has another meaning when it's not at the beginning of the file: there it's a ZERO WIDTH NO-BREAK SPACE. I bet the Unicode consortium regrets that decision!
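To illustrate the distinction, here is a small sketch, assuming two hypothetical grammar fragments that were each saved with a leading UTF-8 BOM and then concatenated byte for byte. Only the first U+FEFF sits at the start of the result; the second one survives decoding as an ordinary character.

```python
import codecs

# Two hypothetical grammar fragments, each saved with a leading UTF-8 BOM.
part_a = codecs.BOM_UTF8 + "a: 'x'.\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "b: 'y'.\n".encode("utf-8")

# Naive byte-level concatenation, as a careless tool might do.
combined = part_a + part_b

# The "utf-8-sig" codec drops only a BOM at the very start of the input.
text = combined.decode("utf-8-sig")

leading_bom = text.startswith("\ufeff")  # no BOM left at the start
interior_count = text.count("\ufeff")    # U+FEFF characters still in the text
```

Under these assumptions the decoded text no longer starts with U+FEFF, but one U+FEFF remains in the middle, which a parser following the erratum would treat as input rather than as a BOM.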

cmsmcq · 2023-05-08T23:52:35Z

Should there also be a rule about BOMs in the input string? Since we sometimes speak as if we believe ixml could be used to parse binary data, perhaps any rule about BOMs in the input string should use SHOULD, not MUST. Or am I missing something?

ndw mentioned this issue Apr 13, 2023

Invisible XML parsers should ignore a BOM on UTF-8 inputs #175

Closed

ndw mentioned this issue May 9, 2023

Proposed erratum E003: Ignore UTF-8 BOM #178

Merged

ndw closed this as completed in #178 Jun 14, 2023
