(PUP-5819) Error on byte order mark in lexer #4635

hlindberg · 2016-02-04T23:57:27Z

See the commit e16a204 for details about this PR.

hlindberg · 2016-02-05T02:37:27Z

Rebased on to fixes in PUP-1780 - test order issue and mocha....

hlindberg · 2016-02-05T17:33:30Z

Uh oh, tests failing because Ruby is missing.... I don't think that is my fault :-) respinning.

thallgren · 2016-02-08T16:37:27Z

There's a lot of errors like:

rspec './spec/unit/pops/parser/lexer2_spec.rb[1:180:2]' # Lexer2 when dealing with non UTF-8 and Byte Order Marks (BOMs) errors on the byte order mark for UTF_16_1 '[FE FF]'

A rebase is probably good too, to get the new .travis.yml.

hlindberg · 2016-02-08T17:55:56Z

I just figured out what is going wrong - the platform is using UTF-16 and it reacts badly to the test data. The Ruby 1.9.3 works fine for some reason - perhaps running on something where UTF-8 is default (or some other Encoding that does not barf on that input.

There is a PR up that will force file reading to use UTF-8, the problem will go away then.

Maybe I should move the check earlier to avoid the problem; the reported issues are raised when the illegal String is given to the StringScanner.

hlindberg · 2016-02-08T21:24:21Z

Rebased and approach changed to asserting the string before creating the scanner. Also with test that checks if a UTF-16 string with the content in question can be given to the lexer with expected error rather than an encoding error (which travis flagged earlier).

thallgren · 2016-02-09T08:11:42Z

Looks as this one needs to be rebased on master. It's not parented by the PUP-1780 commits that went in.

Before this, a byte order mark at the begining of the input would cause the lexer to deliver the unsupported characters as an OTHER token. This leads to a Syntax Error. This is bad because some of the BOMs are non printable characters and users will not know what is wrong. Especially when files having been edited on a Windows platform, and later cause problem on a n*x box where users are less familiar with seeing BOMs. This PR adds an assertion when the lexer is about to deliver an OTHER token. This BOM assertio checks if the starting 2-4 bytes of the input is one of the 10 BOMs in wide use. When it errors it will show the name of the format e.g. UTF-16, or UTF-32, and the bytes that defines that BOM. The user is enciuraged to remove this sequence at the beginning of the file in the error message. Ruby, after first not supportig wide characters shifted and instead made it difficult to work with the raw bytes. Some gymnastics are required to represent and process the characters as several of the sequeces form invalid strings in the UTF-8 encoding we are using in puppet source.

Before this, when running on a platform using UTF-16, the string constructed with various high order bits set would cause an error to be raised as the result was invalid UTF-16 string- This changes that by: * Using bytes expressed as FixInts and converting them to strings using pack('C*') (unsigned bytes) in the tests * Using bytes values when comparing The beahvior of UTF-16 is simulated by force encoding strings using UTF-16, and then feeding them to the lexer.

…mark-in-lexer (PUP-5819) Error on byte order mark in lexer

- Divergent code was introduced in master branch when #4635 was merged prior to code being merged up from stable, that includes #4585 (PUP-5728). The stable code includes some tests validating lexer behavior for UTF-8 that was updated in master. The prior merge commit fixes merge conflicts introduced by this issue, and this commit adds fixes for the two failing tests related to this problem.

hlindberg added the Language label Feb 4, 2016

hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch 2 times, most recently from e16a204 to 243b7b6 Compare February 5, 2016 02:36

hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from 243b7b6 to c386a8b Compare February 5, 2016 16:54

hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from c386a8b to 6415b7e Compare February 8, 2016 21:22

hlindberg added 2 commits February 9, 2016 17:33

hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from 6415b7e to 3317bda Compare February 9, 2016 16:36

thallgren added a commit that referenced this pull request Feb 9, 2016

Merge pull request #4635 from hlindberg/PUP-5819_error-on-byte-order-…

234810e

…mark-in-lexer (PUP-5819) Error on byte order mark in lexer

thallgren merged commit 234810e into puppetlabs:master Feb 9, 2016

hlindberg deleted the PUP-5819_error-on-byte-order-mark-in-lexer branch September 16, 2017 08:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(PUP-5819) Error on byte order mark in lexer #4635

(PUP-5819) Error on byte order mark in lexer #4635

hlindberg commented Feb 4, 2016

hlindberg commented Feb 5, 2016

hlindberg commented Feb 5, 2016

thallgren commented Feb 8, 2016

hlindberg commented Feb 8, 2016

hlindberg commented Feb 8, 2016

thallgren commented Feb 9, 2016

(PUP-5819) Error on byte order mark in lexer #4635

(PUP-5819) Error on byte order mark in lexer #4635

Conversation

hlindberg commented Feb 4, 2016

hlindberg commented Feb 5, 2016

hlindberg commented Feb 5, 2016

thallgren commented Feb 8, 2016

hlindberg commented Feb 8, 2016

hlindberg commented Feb 8, 2016

thallgren commented Feb 9, 2016