Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PUP-5819) Error on byte order mark in lexer #4635

Conversation

hlindberg
Copy link
Contributor

See the commit e16a204 for details about this PR.

@hlindberg hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch 2 times, most recently from e16a204 to 243b7b6 Compare February 5, 2016 02:36
@hlindberg
Copy link
Contributor Author

Rebased on to fixes in PUP-1780 - test order issue and mocha....

@hlindberg hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from 243b7b6 to c386a8b Compare February 5, 2016 16:54
@hlindberg
Copy link
Contributor Author

Uh oh, tests failing because Ruby is missing.... I don't think that is my fault :-) respinning.

@thallgren
Copy link
Contributor

There's a lot of errors like:

rspec './spec/unit/pops/parser/lexer2_spec.rb[1:180:2]' # Lexer2 when dealing with non UTF-8 and Byte Order Marks (BOMs) errors on the byte order mark for UTF_16_1 '[FE FF]'

A rebase is probably good too, to get the new .travis.yml.

@hlindberg
Copy link
Contributor Author

I just figured out what is going wrong - the platform is using UTF-16 and it reacts badly to the test data. The Ruby 1.9.3 works fine for some reason - perhaps running on something where UTF-8 is default (or some other Encoding that does not barf on that input.

There is a PR up that will force file reading to use UTF-8, the problem will go away then.

Maybe I should move the check earlier to avoid the problem; the reported issues are raised when the illegal String is given to the StringScanner.

@hlindberg hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from c386a8b to 6415b7e Compare February 8, 2016 21:22
@hlindberg
Copy link
Contributor Author

Rebased and approach changed to asserting the string before creating the scanner. Also with test that checks if a UTF-16 string with the content in question can be given to the lexer with expected error rather than an encoding error (which travis flagged earlier).

@thallgren
Copy link
Contributor

Looks as this one needs to be rebased on master. It's not parented by the PUP-1780 commits that went in.

Before this, a byte order mark at the begining of the input would cause
the lexer to deliver the unsupported characters as an OTHER token. This
leads to a Syntax Error.
This is bad because some of the BOMs are non printable characters and
users will not know what is wrong. Especially when files having been
edited on a Windows platform, and later cause problem on a n*x box where
users are less familiar with seeing BOMs.

This PR adds an assertion when the lexer is about to deliver an OTHER
token. This BOM assertio checks if the starting 2-4 bytes of the input
is one of the 10 BOMs in wide use. When it errors it will show the name
of the format e.g. UTF-16, or UTF-32, and the bytes that defines that
BOM. The user is enciuraged to remove this sequence at the beginning of
the file in the error message.

Ruby, after first not supportig wide characters shifted and instead made
it difficult to work with the raw bytes. Some gymnastics are required to
represent and process the characters as several of the sequeces form
invalid strings in the UTF-8 encoding we are using in puppet source.
Before this, when running on a platform using UTF-16, the string
constructed with various high order bits set would cause an error to be
raised as the result was invalid UTF-16 string-

This changes that by:
* Using bytes expressed as FixInts and converting them to strings using
pack('C*') (unsigned bytes) in the tests
* Using bytes values when comparing

The beahvior of UTF-16 is simulated by force encoding strings using
UTF-16, and then feeding them to the lexer.
@hlindberg hlindberg force-pushed the PUP-5819_error-on-byte-order-mark-in-lexer branch from 6415b7e to 3317bda Compare February 9, 2016 16:36
thallgren added a commit that referenced this pull request Feb 9, 2016
…mark-in-lexer

(PUP-5819) Error on byte order mark in lexer
@thallgren thallgren merged commit 234810e into puppetlabs:master Feb 9, 2016
Iristyle added a commit that referenced this pull request Feb 9, 2016
 - Divergent code was introduced in master branch when #4635 was merged
   prior to code being merged up from stable, that includes #4585
   (PUP-5728). The stable code includes some tests validating lexer
   behavior for UTF-8 that was updated in master.

   The prior merge commit fixes merge conflicts introduced by this
   issue, and this commit adds fixes for the two failing tests related
   to this problem.
@hlindberg hlindberg deleted the PUP-5819_error-on-byte-order-mark-in-lexer branch September 16, 2017 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants