Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore UTF BOM at the beginning of the stream #14

Merged
merged 5 commits into from
Oct 15, 2019
Merged

Ignore UTF BOM at the beginning of the stream #14

merged 5 commits into from
Oct 15, 2019

Conversation

formatz
Copy link
Contributor

@formatz formatz commented Sep 2, 2019

When the json file stream contains a BOM sequence at the begining, the parser throws an exception and cannot parse the file, even if the encoding is UTF-8.

With this change, the parser ignore the BOM sequence if found.

I included BOM sequences for UTF-16 (BE/LE) and UTF-32 (BE/LE) too.

@halaxa
Copy link
Owner

halaxa commented Sep 2, 2019

That is really useful feature, thanks. Could you please add tests for all 5 BOM cases? Additionally, I am afraid that adding this many ifs might slow things down, as Lexer is performance critical part. What about just add the BOM bytes to an existing list of whitespace chars? That should be constant in speed.

        // Lexer.php, lines 39-42
        ${' '} = 0;
        ${"\n"} = 0;
        ${"\r"} = 0;
        ${"\t"} = 0;
        // add it here

True, it would ignore more than just BOMs, but I think we can safely accept that.

@formatz
Copy link
Contributor Author

formatz commented Sep 2, 2019

Thank you for your quick response and to maintain this project.

Firstly I removed UTF-16 and UTF-32 because with this charsets you cannot parse byte per byte to detect the char separators. Multiple bytes must be parsed even with ascii table chars. Sorry my fault.

Secondly we cannot check if a BOM is present because the BOM sequences are multi-bytes. We must use the buffer to compare with. I suppose the position number check do a quicker test than always comparing strings. It seems not to have a performance impact on the tests results.
Open to your suggestion for this point.

@halaxa
Copy link
Owner

halaxa commented Sep 9, 2019

For the sake of code tree simplicity I'd prefer this way - treating the BOM bytes as whitespace. Would you agree?

@halaxa halaxa merged commit c865e03 into halaxa:master Oct 15, 2019
@formatz
Copy link
Contributor Author

formatz commented Oct 15, 2019

Thanks

@halaxa
Copy link
Owner

halaxa commented Oct 15, 2019

You're welcome. It would have been merged sooner but I was waiting for your response.

@formatz
Copy link
Contributor Author

formatz commented Oct 15, 2019

Sorry I forgot to respond. I'll try to be quicker next time.

@halaxa
Copy link
Owner

halaxa commented Oct 15, 2019

No problem, just explaining :)

@halaxa
Copy link
Owner

halaxa commented Oct 15, 2019

I am looking forward to next time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants