Ignore UTF BOM at the beginning of the stream #14

formatz · 2019-09-02T12:02:17Z

When the JSON file stream contains a BOM sequence at the beginning, the parser throws an exception and cannot parse the file, even if the encoding is UTF-8.

With this change, the parser ignores the BOM sequence if one is found.

I included BOM sequences for UTF-16 (BE/LE) and UTF-32 (BE/LE) too.

halaxa · 2019-09-02T18:01:45Z

That is a really useful feature, thanks. Could you please add tests for all 5 BOM cases? Additionally, I am afraid that adding this many ifs might slow things down, as the Lexer is a performance-critical part. What about just adding the BOM bytes to the existing list of whitespace chars? That should be constant in speed.

        // Lexer.php, lines 39-42
        ${' '} = 0;
        ${"\n"} = 0;
        ${"\r"} = 0;
        ${"\t"} = 0;
        // add it here

True, it would ignore more than just BOMs, but I think we can safely accept that.
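
The whitespace-table idea above can be sketched in isolation. This is a minimal illustration, not the library's actual Lexer code: the `$whitespace` array and the `skipLeadingWhitespace()` helper are hypothetical names, and a plain array lookup stands in for the variable-variable table shown in the snippet.

```php
<?php
// Hypothetical stand-in for the Lexer's whitespace table (assumption, not library code).
$whitespace = [
    ' '    => 0,
    "\n"   => 0,
    "\r"   => 0,
    "\t"   => 0,
    "\xEF" => 0, // UTF-8 BOM, byte 1
    "\xBB" => 0, // UTF-8 BOM, byte 2
    "\xBF" => 0, // UTF-8 BOM, byte 3
];

// Skips every leading byte found in the table; one isset() lookup per byte.
function skipLeadingWhitespace(string $bytes, array $whitespace): string
{
    $length = strlen($bytes);
    $i = 0;
    while ($i < $length && isset($whitespace[$bytes[$i]])) {
        $i++;
    }
    return substr($bytes, $i);
}

$input = "\xEF\xBB\xBF  {\"a\": 1}";
echo skipLeadingWhitespace($input, $whitespace), PHP_EOL; // prints {"a": 1}
```

The lookup costs the same whether or not the BOM bytes are in the table, which is the constant-speed property mentioned above. The trade-off is also visible here: any of the three bytes is skipped on its own, not only the full sequence.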

formatz · 2019-09-02T20:43:43Z

Thank you for your quick response and for maintaining this project.

First, I removed UTF-16 and UTF-32, because with these charsets you cannot parse byte by byte to detect the separator chars. Multiple bytes must be parsed even for chars from the ASCII table. Sorry, my fault.
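
To illustrate the point about UTF-16, here is a small sketch that writes the character `{` out as raw bytes in three encodings. The byte values are standard; the variable names are only for illustration.

```php
<?php
// The character "{" (U+007B) as raw bytes in three encodings.
$utf8    = "{";     // 7B
$utf16le = "{\x00"; // 7B 00
$utf16be = "\x00{"; // 00 7B

$samples = ['UTF-8' => $utf8, 'UTF-16LE' => $utf16le, 'UTF-16BE' => $utf16be];
foreach ($samples as $name => $bytes) {
    // strlen() counts bytes, bin2hex() shows them.
    echo $name, ': ', strlen($bytes), ' byte(s), hex ', bin2hex($bytes), PHP_EOL;
}
```

A parser that inspects one byte at a time sees an extra NUL byte next to every ASCII character in the UTF-16 variants, so it cannot recognise the separators without decoding two bytes together.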

Second, we cannot detect a BOM with a single-byte check, because BOM sequences are multi-byte. We must compare against the buffer. I suppose checking the position number first is quicker than always comparing strings. It does not seem to have a performance impact on the test results.
Open to your suggestions on this point.
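
The position check described here could look roughly like the following. This is a hedged sketch, not the diff from this pull request: `stripBomFromFirstChunk()` and the `UTF8_BOM` constant are hypothetical names, and it assumes the first chunk read from the stream holds at least three bytes.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: drops a UTF-8 BOM, but only when the chunk starts at stream position 0.
function stripBomFromFirstChunk(string $chunk, int $position): string
{
    // The cheap integer test runs first; the string comparison only happens once per stream.
    if ($position === 0 && strncmp($chunk, UTF8_BOM, 3) === 0) {
        return substr($chunk, 3);
    }
    return $chunk;
}

echo stripBomFromFirstChunk("\xEF\xBB\xBF[1,2]", 0), PHP_EOL; // prints [1,2]
```

Because the integer comparison short-circuits the condition, every chunk after the first pays only for `$position === 0`.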

halaxa · 2019-09-09T19:00:29Z

For the sake of code tree simplicity I'd prefer this way - treating the BOM bytes as whitespace. Would you agree?

formatz · 2019-10-15T12:16:29Z

Thanks

halaxa · 2019-10-15T12:24:18Z

You're welcome. It would have been merged sooner but I was waiting for your response.

formatz · 2019-10-15T13:15:56Z

Sorry I forgot to respond. I'll try to be quicker next time.

halaxa · 2019-10-15T16:35:11Z

No problem, just explaining :)

halaxa · 2019-10-15T16:36:22Z

I am looking forward to next time.

Ignore UTF BOM at the beginning of the stream

b4b13e3

formatz added 3 commits September 2, 2019 22:29

add whitelist paths in phpunit config, useful for code coverage report

460c796

Remove UTF-16 and UTF-32 bytes detection - the file must be converted to UTF-8 to be compatible with byte stream parser

1c6c851

Add test for UTF-8 BOM detection

4aa339e

Treat UTF-8 BOM as simple whitespace

7ab2bb4

halaxa merged commit c865e03 into halaxa:master Oct 15, 2019
