-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(PUP-5819) Error on byte order mark in lexer #4635
(PUP-5819) Error on byte order mark in lexer #4635
Conversation
e16a204
to
243b7b6
Compare
Rebased on to fixes in PUP-1780 - test order issue and mocha.... |
243b7b6
to
c386a8b
Compare
Uh oh, tests failing because Ruby is missing.... I don't think that is my fault :-) respinning. |
There's a lot of errors like:
A rebase is probably good too, to get the new .travis.yml. |
I just figured out what is going wrong - the platform is using UTF-16 and it reacts badly to the test data. The Ruby 1.9.3 works fine for some reason - perhaps running on something where UTF-8 is default (or some other Encoding that does not barf on that input. There is a PR up that will force file reading to use UTF-8, the problem will go away then. Maybe I should move the check earlier to avoid the problem; the reported issues are raised when the illegal String is given to the StringScanner. |
c386a8b
to
6415b7e
Compare
Rebased and approach changed to asserting the string before creating the scanner. Also with test that checks if a UTF-16 string with the content in question can be given to the lexer with expected error rather than an encoding error (which travis flagged earlier). |
Looks as this one needs to be rebased on master. It's not parented by the PUP-1780 commits that went in. |
Before this, a byte order mark at the begining of the input would cause the lexer to deliver the unsupported characters as an OTHER token. This leads to a Syntax Error. This is bad because some of the BOMs are non printable characters and users will not know what is wrong. Especially when files having been edited on a Windows platform, and later cause problem on a n*x box where users are less familiar with seeing BOMs. This PR adds an assertion when the lexer is about to deliver an OTHER token. This BOM assertio checks if the starting 2-4 bytes of the input is one of the 10 BOMs in wide use. When it errors it will show the name of the format e.g. UTF-16, or UTF-32, and the bytes that defines that BOM. The user is enciuraged to remove this sequence at the beginning of the file in the error message. Ruby, after first not supportig wide characters shifted and instead made it difficult to work with the raw bytes. Some gymnastics are required to represent and process the characters as several of the sequeces form invalid strings in the UTF-8 encoding we are using in puppet source.
Before this, when running on a platform using UTF-16, the string constructed with various high order bits set would cause an error to be raised as the result was invalid UTF-16 string- This changes that by: * Using bytes expressed as FixInts and converting them to strings using pack('C*') (unsigned bytes) in the tests * Using bytes values when comparing The beahvior of UTF-16 is simulated by force encoding strings using UTF-16, and then feeding them to the lexer.
6415b7e
to
3317bda
Compare
…mark-in-lexer (PUP-5819) Error on byte order mark in lexer
- Divergent code was introduced in master branch when #4635 was merged prior to code being merged up from stable, that includes #4585 (PUP-5728). The stable code includes some tests validating lexer behavior for UTF-8 that was updated in master. The prior merge commit fixes merge conflicts introduced by this issue, and this commit adds fixes for the two failing tests related to this problem.
See the commit e16a204 for details about this PR.