Find encoding for Python files #1526
Merged
8 commits
8ec813a Add IPython.utils.openpy to decode Python files. (takluyver)
bb13c29 Fix for %run on a Python file using non-default encoding. (takluyver)
6357257 Use openpy module for %loadpy magic. (takluyver)
198fed8 Add file required for Unicode test. (takluyver)
9c22fc1 Add encoding cookie to test_run. (takluyver)
0cb09db Remove unused encoding declaration regex in IPython.core.magic. (takluyver)
6811272 Add docstrings for read_py_file and read_py_url. (takluyver)
b9b8a6f Add tests for IPython.utils.openpy (takluyver)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# encoding: iso-8859-5 | ||
# (Unlikely to be the default encoding for most testers.) | ||
# ������������������� <- Cyrillic characters | ||
from __future__ import unicode_literals | ||
u = '����' | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,192 @@ | ||
""" | ||
Tools to open .py files as Unicode, using the encoding specified within the file, | ||
as per PEP 263. | ||
|
||
Much of the code is taken from the tokenize module in Python 3.2. | ||
""" | ||
from __future__ import absolute_import | ||
|
||
import __builtin__ | ||
import io | ||
from io import TextIOWrapper | ||
import re | ||
import urllib | ||
|
||
cookie_re = re.compile(ur"coding[:=]\s*([-\w.]+)", re.UNICODE) | ||
cookie_comment_re = re.compile(ur"^\s*#.*coding[:=]\s*([-\w.]+)", re.UNICODE) | ||
|
||
try: | ||
# Available in Python 3 | ||
from tokenize import detect_encoding | ||
except ImportError: | ||
from codecs import lookup, BOM_UTF8 | ||
|
||
# Copied from Python 3.2 tokenize | ||
def _get_normal_name(orig_enc): | ||
"""Imitates get_normal_name in tokenizer.c.""" | ||
# Only care about the first 12 characters. | ||
enc = orig_enc[:12].lower().replace("_", "-") | ||
if enc == "utf-8" or enc.startswith("utf-8-"): | ||
return "utf-8" | ||
if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \ | ||
enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")): | ||
return "iso-8859-1" | ||
return orig_enc | ||
|
||
# Copied from Python 3.2 tokenize | ||
def detect_encoding(readline): | ||
""" | ||
The detect_encoding() function is used to detect the encoding that should | ||
be used to decode a Python source file. It requires one argument, readline, | ||
in the same way as the tokenize() generator. | ||
|
||
It will call readline a maximum of twice, and return the encoding used | ||
(as a string) and a list of any lines (left as bytes) it has read in. | ||
|
||
It detects the encoding from the presence of a utf-8 bom or an encoding | ||
cookie as specified in pep-0263. If both a bom and a cookie are present, | ||
but disagree, a SyntaxError will be raised. If the encoding cookie is an | ||
invalid charset, raise a SyntaxError. Note that if a utf-8 bom is found, | ||
'utf-8-sig' is returned. | ||
|
||
If no encoding is specified, then the default of 'utf-8' will be returned. | ||
""" | ||
bom_found = False | ||
encoding = None | ||
default = 'utf-8' | ||
def read_or_stop(): | ||
try: | ||
return readline() | ||
except StopIteration: | ||
return b'' | ||
|
||
def find_cookie(line): | ||
try: | ||
line_string = line.decode('ascii') | ||
except UnicodeDecodeError: | ||
return None | ||
|
||
matches = cookie_re.findall(line_string) | ||
if not matches: | ||
return None | ||
encoding = _get_normal_name(matches[0]) | ||
try: | ||
codec = lookup(encoding) | ||
except LookupError: | ||
# This behaviour mimics the Python interpreter | ||
raise SyntaxError("unknown encoding: " + encoding) | ||
|
||
if bom_found: | ||
if codec.name != 'utf-8': | ||
# This behaviour mimics the Python interpreter | ||
raise SyntaxError('encoding problem: utf-8') | ||
encoding += '-sig' | ||
return encoding | ||
|
||
first = read_or_stop() | ||
if first.startswith(BOM_UTF8): | ||
bom_found = True | ||
first = first[3:] | ||
default = 'utf-8-sig' | ||
if not first: | ||
return default, [] | ||
|
||
encoding = find_cookie(first) | ||
if encoding: | ||
return encoding, [first] | ||
|
||
second = read_or_stop() | ||
if not second: | ||
return default, [first] | ||
|
||
encoding = find_cookie(second) | ||
if encoding: | ||
return encoding, [first, second] | ||
|
||
return default, [first, second] | ||
|
||
try: | ||
# Available in Python 3.2 and above. | ||
from tokenize import open | ||
except ImportError: | ||
# Copied from Python 3.2 tokenize | ||
def open(filename): | ||
"""Open a file in read only mode using the encoding detected by | ||
detect_encoding(). | ||
""" | ||
buffer = io.open(filename, 'rb') # Tweaked to use io.open for Python 2 | ||
encoding, lines = detect_encoding(buffer.readline) | ||
buffer.seek(0) | ||
text = TextIOWrapper(buffer, encoding, line_buffering=True) | ||
text.mode = 'r' | ||
return text | ||
|
||
def strip_encoding_cookie(filelike): | ||
"""Generator to pull lines from a text-mode file, skipping the encoding | ||
cookie if it is found in the first two lines. | ||
""" | ||
it = iter(filelike) | ||
try: | ||
first = next(it) | ||
if not cookie_comment_re.match(first): | ||
yield first | ||
second = next(it) | ||
if not cookie_comment_re.match(second): | ||
yield second | ||
except StopIteration: | ||
return | ||
|
||
for line in it: | ||
yield line | ||
|
||
def read_py_file(filename, skip_encoding_cookie=True): | ||
"""Read a Python file, using the encoding declared inside the file. | ||
|
||
Parameters | ||
---------- | ||
filename : str | ||
The path to the file to read. | ||
skip_encoding_cookie : bool | ||
If True (the default), and the encoding declaration is found in the first | ||
two lines, that line will be excluded from the output - compiling a | ||
unicode string with an encoding declaration is a SyntaxError in Python 2. | ||
|
||
Returns | ||
------- | ||
A unicode string containing the contents of the file. | ||
""" | ||
with open(filename) as f: # the open function defined in this module. | ||
Review comment: For this function and the next, let's add at least proper docstrings (i.e. with full Parameters and Returns descriptions), as they are likely to be useful for others in general. |
||
if skip_encoding_cookie: | ||
return "".join(strip_encoding_cookie(f)) | ||
else: | ||
return f.read() | ||
|
||
def read_py_url(url, errors='replace', skip_encoding_cookie=True): | ||
"""Read a Python file from a URL, using the encoding declared inside the file. | ||
|
||
Parameters | ||
---------- | ||
url : str | ||
The URL from which to fetch the file. | ||
errors : str | ||
How to handle decoding errors in the file. Options are the same as for | ||
bytes.decode(), but here 'replace' is the default. | ||
skip_encoding_cookie : bool | ||
If True (the default), and the encoding declaration is found in the first | ||
two lines, that line will be excluded from the output - compiling a | ||
unicode string with an encoding declaration is a SyntaxError in Python 2. | ||
|
||
Returns | ||
------- | ||
A unicode string containing the contents of the file. | ||
""" | ||
response = urllib.urlopen(url) | ||
buffer = io.BytesIO(response.read()) | ||
encoding, lines = detect_encoding(buffer.readline) | ||
buffer.seek(0) | ||
text = TextIOWrapper(buffer, encoding, errors=errors, line_buffering=True) | ||
text.mode = 'r' | ||
if skip_encoding_cookie: | ||
return "".join(strip_encoding_cookie(text)) | ||
else: | ||
return text.read() |
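
The detection rules described in the `detect_encoding` docstring above can be tried directly on in-memory bytes. This is a minimal sketch that assumes Python 3, where the diff's `try` branch imports `detect_encoding` from the standard `tokenize` module rather than using the copied fallback; the helper name `sniff` is hypothetical and not part of the PR.

```python
import io
import tokenize


def sniff(source_bytes):
    """Return the encoding tokenize.detect_encoding reports for raw source bytes.

    `sniff` is a hypothetical helper for illustration only.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# A PEP 263 cookie on the first line names the encoding.
cookie_enc = sniff(b"# encoding: iso-8859-5\nx = 1\n")

# A UTF-8 byte order mark without a cookie is reported as 'utf-8-sig'.
bom_enc = sniff(b"\xef\xbb\xbfx = 1\n")

# With neither a BOM nor a cookie, the default 'utf-8' is returned.
default_enc = sniff(b"x = 1\n")

# A cookie naming an unknown codec raises SyntaxError, as the docstring says.
try:
    sniff(b"# coding: no-such-codec\n")
    unknown_raises = False
except SyntaxError:
    unknown_raises = True

print(cookie_enc, bom_enc, default_enc, unknown_raises)
```

Because `detect_encoding` only needs a `readline` callable, the same call works for a file opened in binary mode and for a `BytesIO` wrapped around a downloaded response, which is how `read_py_url` uses it.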
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
import io | ||
import os.path | ||
import nose.tools as nt | ||
|
||
from IPython.utils import openpy | ||
|
||
mydir = os.path.dirname(__file__) | ||
nonascii_path = os.path.join(mydir, '../../core/tests/nonascii.py') | ||
|
||
def test_detect_encoding(): | ||
f = open(nonascii_path, 'rb') | ||
enc, lines = openpy.detect_encoding(f.readline) | ||
nt.assert_equal(enc, 'iso-8859-5') | ||
|
||
def test_read_file(): | ||
read_specified_enc = io.open(nonascii_path, encoding='iso-8859-5').read() | ||
read_detected_enc = openpy.read_py_file(nonascii_path, skip_encoding_cookie=False) | ||
nt.assert_equal(read_detected_enc, read_specified_enc) | ||
assert u'encoding: iso-8859-5' in read_detected_enc | ||
|
||
read_strip_enc_cookie = openpy.read_py_file(nonascii_path, skip_encoding_cookie=True) | ||
assert u'encoding: iso-8859-5' not in read_strip_enc_cookie | ||
|
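
The cookie-stripping behaviour that `test_read_file` checks can be sketched without touching the file system. The block below is an illustration written for Python 3, not the PR's code: the regex mirrors `cookie_comment_re` from the diff, the generator is a compact restatement of `strip_encoding_cookie`, and the sample source text is an assumption made up for the example.

```python
import io
import re

# Same pattern as cookie_comment_re in the diff, written as a Python 3 str pattern.
COOKIE_COMMENT_RE = re.compile(r"^\s*#.*coding[:=]\s*([-\w.]+)")


def strip_encoding_cookie(lines):
    """Yield lines, dropping an encoding cookie found in the first two lines."""
    it = iter(lines)
    for _ in range(2):
        try:
            line = next(it)
        except StopIteration:
            return
        if not COOKIE_COMMENT_RE.match(line):
            yield line
    # Everything after the first two lines passes through unchanged.
    yield from it


# Hypothetical source file content: an ASCII cookie line plus one Cyrillic literal,
# stored as ISO-8859-5 bytes.
raw = b"# encoding: iso-8859-5\n" + "u = 'Привет'\n".encode("iso-8859-5")

# Decode with the declared encoding, then strip the cookie from the decoded text.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="iso-8859-5").read()
stripped = "".join(strip_encoding_cookie(io.StringIO(text)))

# The stripped text still compiles and defines the same value.
namespace = {}
exec(compile(stripped, "<sketch>", "exec"), namespace)
print(repr(stripped), namespace["u"])
```

Python 3 accepts a cookie inside an already-decoded string, so the stripping matters mainly for the Python 2 case described in the `read_py_file` docstring.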
Note that GitHub displays this file using a default encoding (probably latin-1 or cp1252), so these characters don't look like Cyrillic characters. They are compared with a literal in the UTF-8 encoded test file.
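
The display problem described in this comment is easy to reproduce. The sketch below assumes Python 3 and uses a sample word chosen for illustration (not the actual contents of nonascii.py): it encodes Cyrillic text as ISO-8859-5 and then decodes the same bytes with the wrong and the right codec.

```python
# Sample text is an assumption for illustration, not taken from nonascii.py.
cyrillic = "Привет"

# Bytes as they would be stored in a file declared as iso-8859-5.
raw = cyrillic.encode("iso-8859-5")

# Decoding with a viewer's default (latin-1 here) gives one character per byte,
# but they are accented Latin letters and symbols, not Cyrillic.
as_latin1 = raw.decode("latin-1")

# Decoding with the declared encoding recovers the original text.
as_declared = raw.decode("iso-8859-5")

print(as_latin1)
print(as_declared)
```

This is why the tests compare against a literal stored in a UTF-8 encoded test file instead of relying on how the ISO-8859-5 file is rendered.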