-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore UTF BOM at the beginning of the stream #14
Conversation
That is really useful feature, thanks. Could you please add tests for all 5 BOM cases? Additionally, I am afraid that adding this many // Lexer.php, lines 39-42
${' '} = 0;
${"\n"} = 0;
${"\r"} = 0;
${"\t"} = 0;
// add it here True, it would ignore more than just BOMs, but I think we can safely accept that. |
… to UTF-8 to be compatible with byte stream parser
Thank you for your quick response and to maintain this project. Firstly I removed UTF-16 and UTF-32 because with this charsets you cannot parse byte per byte to detect the char separators. Multiple bytes must be parsed even with ascii table chars. Sorry my fault. Secondly we cannot check if a BOM is present because the BOM sequences are multi-bytes. We must use the buffer to compare with. I suppose the position number check do a quicker test than always comparing strings. It seems not to have a performance impact on the tests results. |
For the sake of code tree simplicity I'd prefer this way - treating the BOM bytes as whitespace. Would you agree? |
Thanks |
You're welcome. It would have been merged sooner but I was waiting for your response. |
Sorry I forgot to respond. I'll try to be quicker next time. |
No problem, just explaining :) |
I am looking forward to next time. |
When the json file stream contains a BOM sequence at the begining, the parser throws an exception and cannot parse the file, even if the encoding is UTF-8.
With this change, the parser ignore the BOM sequence if found.
I included BOM sequences for UTF-16 (BE/LE) and UTF-32 (BE/LE) too.