Couldn't parse Python file with cp1252 encoding from xlwt #431

Closed
nedbat opened this issue Oct 15, 2015 · 3 comments

Labels
bug, xml

Comments

nedbat commented Oct 15, 2015

Originally reported by Murat Knecht (Bitbucket: muratk, GitHub: muratk)


Coverage choked on a file from the xlwt library.

Couldn't parse 'venv/lib/python2.7/site-packages/xlwt/BIFFRecords.py' as Python source: ''charmap' codec can't decode byte 0x9d in position 68292: character maps to <undefined>' at line 0

To reproduce, install these dependencies in a virtualenv named env in the working directory:

argparse==1.2.1
coverage==4.0
wsgiref==0.1.2
xlwt==0.7.5

Then run this (a boiled-down version of what coverage does, afaict):

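from coverage import phystokens

# Read the file as raw bytes.
with open("./env/lib/python2.7/site-packages/xlwt/BIFFRecords.py", "rb") as f:
    raw = f.read()

# Detect the declared source encoding the way coverage does on Python 2.
enc = phystokens._source_encoding_py2(raw)
print("encoding: {}".format(enc))

# Decode to Unicode, then compile the Unicode text.
uni = raw.decode(enc, "replace")
phystokens.compile_unicode(uni, "<string>", "exec")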

When using compile directly on raw, it works.

Possibly related to #157.


nedbat commented Oct 15, 2015

Somehow, utf8 is getting mixed into this:

>>> b"\x93hi\x94".decode("cp1252").encode("utf8")
'\xe2\x80\x9chi\xe2\x80\x9d'

The xlwt code has curly quotes in its docstrings (\x93 and \x94 in cp1252). Encoded as utf8, the closing quote contains a \x9d byte. Those utf8 bytes are then being interpreted as cp1252, which has no character at \x9d.
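The round trip can be reproduced without xlwt or coverage. This is a minimal sketch using only the built-in codecs; the variable names are illustrative and not taken from coverage's code.

```python
# Curly quotes as they appear in a cp1252-encoded source file.
original = b"\x93hi\x94"

# Decode correctly as cp1252, then re-encode as utf8.
as_utf8 = original.decode("cp1252").encode("utf8")
print(repr(as_utf8))

# Misreading those utf8 bytes as cp1252 fails on the 0x9d byte,
# because Python's cp1252 codec leaves 0x9d undefined.
try:
    as_utf8.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The opening quote survives the misreading (0x9c is a valid cp1252 character), so only the closing quote triggers the error.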

nedbat commented Oct 15, 2015

Original comment by Murat Knecht (Bitbucket: muratk, GitHub: muratk)


It looks like a compile bug in that it re-encodes the already Unicode source with the encoding specified in the header … which does not make sense. Nevertheless, most combinations of loading the file and dumping it into compile work, so coverage might want to use a workaround.
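One possible workaround is to blank out the coding declaration in the already-decoded text before handing it to compile, so there is no header left to re-interpret. The sketch below is an assumption about how that could look, not necessarily the change coverage made. The helper name neutralize_coding_declaration and the sample source are hypothetical, and the regex follows the PEP 263 pattern.

```python
import re

# PEP 263 pattern for a coding declaration; it only counts on the first two lines.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def neutralize_coding_declaration(source):
    """Hypothetical helper: replace a coding declaration in decoded source text."""
    lines = source.splitlines(True)  # keep line endings so line numbers do not shift
    for i, line in enumerate(lines[:2]):
        if CODING_RE.match(line):
            ending = line[len(line.rstrip("\r\n")):]
            lines[i] = "# (coding declaration removed)" + ending
    return "".join(lines)


# Made-up stand-in for a cp1252 source file containing curly quotes.
raw = b'# -*- coding: cp1252 -*-\nQUOTED = "\x93hi\x94"\n'
text = raw.decode("cp1252")

cleaned = neutralize_coding_declaration(text)
code = compile(cleaned, "<string>", "exec")
namespace = {}
exec(code, namespace)
```

The declaration is replaced rather than deleted so that line numbers in the compiled code still match the original file. On Python 3, compile ignores the declaration in a str anyway, so the helper only matters for the Python 2 behavior described in this issue.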

nedbat commented Oct 17, 2015

Fixed in 8b27dd77d0f1 (bb) (for 4.0.2)

nedbat closed this as completed Oct 17, 2015
nedbat added the minor, bug, and xml labels Jun 23, 2018