Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Handle dvi font names as ASCII bytestrings #6977
Conversation
mdboom
added the
needs_review
label
Aug 25, 2016
tacaswell
added this to the
2.0.1 (next bug fix release)
milestone
Aug 25, 2016
jkseppan
referenced
this pull request
Aug 27, 2016
Open
savefig to pdf: 'str' object has no attribute 'decode' #6516
jkseppan
requested a review
from anntzer
Dec 26, 2016
| @@ -526,8 +527,7 @@ class DviFont(object): | ||
| __slots__ = ('texname', 'size', 'widths', '_scale', '_vf', '_tfm') | ||
| def __init__(self, scale, tfm, texname, vf): | ||
| - if six.PY3 and isinstance(texname, bytes): | ||
| - texname = texname.decode('ascii') | ||
| + assert(isinstance(texname, bytes)) |
anntzer
Dec 26, 2016
Contributor
No parentheses here. (And in general I think we should avoid bare asserts in favor of raising explicit exceptions.)
| @@ -510,7 +511,7 @@ class DviFont(object): | ||
| Name of the font as used internally by TeX and friends. This | ||
| is usually very different from any external font names, and | ||
| :class:`dviread.PsfontsMap` can be used to find the external | ||
| - name of the font. | ||
| + name of the font. ASCII bytestring. |
anntzer
Dec 26, 2016
Contributor
If we can't use numpydoc/napoleon-style docstrings I would at least put the type at a more prominent place.
| enc = find_tex_file(result.encoding) | ||
| return result._replace(filename=fn, encoding=enc) | ||
| def _parse(self, file): | ||
| - """Parse each line into words.""" | ||
| + """Parse each line into words and process them.""" |
anntzer
Dec 26, 2016
Contributor
This docstring seems rather pointless (it is no news that a method named _parse parses and processes). I would either make it more explicit (what is the parsed format?) or just get rid of it.
| line = line.strip() | ||
| - if line == '' or line.startswith('%'): | ||
| + if line == b'' or line.startswith(b'%'): | ||
| continue | ||
| words, pos = [], 0 | ||
| while pos < len(line): |
anntzer
Dec 26, 2016
Contributor
If I understand the logic of this method correctly it should be possible to rewrite this loop simply as something like (untested)
re.findall(b'("[^"]*"|[^ ]*)', line)
right? (as findall returns nonoverlapping matches)
| @@ -861,19 +878,23 @@ def _register(self, words): | ||
| # http://tex.stackexchange.com/questions/10826/ | ||
| # http://article.gmane.org/gmane.comp.tex.pdftex/4914 | ||
| + # input must be bytestrings (the file format is ASCII) | ||
| + for word in words: | ||
| + assert(isinstance(word, bytes)) |
| + # input must be bytestrings (the file format is ASCII) | ||
| + for word in words: | ||
| + assert(isinstance(word, bytes)) | ||
| + |
anntzer
Dec 26, 2016
Contributor
Again, the logic of this function may perhaps be more easily expressed using regexes.
|
@jkseppan Not sure why you asked me to review the PR (I don't know that much about that part of the codebase) but I had a quick look for now. |
|
@anntzer Many thanks for the review! I thought this had been waiting for quite a while and I noticed your name in the git history for dviread.py, so I figured you have at least taken a look at it recently. |
|
No worries. |
|
I changed a bunch of the docstrings to the numpydoc format, and rebased because I couldn't get the documentation build to work otherwise. |
| - filename = word | ||
| + # input must be bytestrings (the file format is ASCII) | ||
| + for word in words: | ||
| + assert isinstance(word, bytes) |
tacaswell
Dec 29, 2016
Owner
A while ago where was an effort to remove assert from all of the main code base and replace with if not ...: raise blocks. Are these here for testing reasons or run-time checks?
jkseppan
Dec 29, 2016
Member
This is a private function and should only be called from the same class, so if this triggers it's an internal error in the code. While writing this code I felt this was more obvious than the later errors you get from mixing bytestrings and strings, but perhaps this is not really needed.
|
The test failures look real |
| @@ -927,26 +981,27 @@ def _parse(self, file): | ||
| state = 0 | ||
| for line in file: | ||
| - comment_start = line.find('%') | ||
| + line = six.b(line) | ||
| + comment_start = line.find(b'%') | ||
| if comment_start > -1: | ||
| line = line[:comment_start] | ||
| line = line.strip() | ||
| if state == 0: |
anntzer
Dec 30, 2016
Contributor
I tried googling without much success the format of enc files but it looks like this is looking for patterns for the form
/FooEncoding [ /abc/def/ghi ] def
and returning the slash-separated words within the brackets. At least the searching for brackets-surrounded fragments can clearly be rewritten using re.findall as above.
|
That seems to have helped with Python 3 failures, but now we get test failures on Python 2.7. |
|
Now the Travis build passed after I restarted one of the jobs. |
codecov-io
commented
Jan 1, 2017
•
Current coverage is 62.10% (diff: 91.89%)@@ master #6977 diff @@
==========================================
Files 109 174 +65
Lines 46648 56012 +9364
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 31060 34786 +3726
- Misses 15588 21226 +5638
Partials 0 0
|
| + Used for verifying against the dvi file. | ||
| + design_size : int | ||
| + Design size of the font (unknown units) | ||
| + width : dict |
QuLogic
Jan 2, 2017
Member
Assuming they are all of similar description and the original just eschewed repeating all that, this can be combined with the next two as:
width, height, depth : dict
Dimensions of each character, ...
| with open(filename, 'rt') as file: | ||
| self._parse(file) | ||
| def __getitem__(self, texname): | ||
| + assert(isinstance(texname, bytes)) |
| try: | ||
| result = self._font[texname] | ||
| except KeyError: | ||
| - result = self._font[texname.decode('ascii')] | ||
| + matplotlib.verbose.report(textwrap.fill | ||
| + ('A PostScript file for the font whose TeX name is "%s" ' |
QuLogic
Jan 2, 2017
Member
It seems weird to me to wrap between a function call and its opening parenthesis.
| + 'package manager.' % (texname.decode('ascii'), | ||
| + self._filename), | ||
| + break_on_hyphens=False, break_long_words=False), | ||
| + 'helpful') |
QuLogic
Jan 2, 2017
Member
Especially because this line is in a different function call than the ones above it.
| for line in file: | ||
| + line = six.b(line) |
QuLogic
Jan 2, 2017
Member
Why not open the file in binary mode in __init__ instead of decoding and re-encoding here? Also, six.b is supposed to be applied on string literals exclusively; this should use an explicit encode if you really want to do this.
| + font_files = [word.lstrip(b'<') | ||
| + for word in words | ||
| + if word.startswith(b'<') | ||
| + and not re.match(encoding_re, word)] |
QuLogic
Jan 2, 2017
Member
Not sure that 3 list comprehensions over words was the best change, efficiency-wise.
| - if comment_start > -1: | ||
| - line = line[:comment_start] | ||
| - line = line.strip() | ||
| + lines = (line[:line.find(b'%')] if b'%' in line else line.strip() |
QuLogic
Jan 2, 2017
Member
The previous code ran strip after removing the comment as well; this can be shortened a bit: line.split(b'%')[0].strip().
| + lines = (line[:line.find(b'%')] if b'%' in line else line.strip() | ||
| + for line in file) | ||
| + data = b''.join(lines) | ||
| + match = re.search(six.b(r'\['), data) |
| + if not match: | ||
| + raise ValueError("Cannot locate beginning of encoding in {}" | ||
| + .format(file)) | ||
| + data = data[match.span()[1]:] |
| + raise ValueError("Cannot locate beginning of encoding in {}" | ||
| + .format(file)) | ||
| + data = data[match.span()[1]:] | ||
| + match = re.search(six.b(r'\]'), data) |
| + if not match: | ||
| + raise ValueError("Cannot locate end of encoding in {}" | ||
| + .format(file)) | ||
| + data = data[:match.span()[0]] |
| - raise ValueError("Broken name in encoding file: " + w) | ||
| - | ||
| - return result | ||
| + return re.findall(six.b(r'/([^][{}<>\s]+)'), data) |
| @skip_if_command_unavailable(["kpsewhich", "-version"]) | ||
| def test_dviread(): | ||
| dir = os.path.join(os.path.dirname(__file__), 'baseline_images', 'dviread') | ||
| with open(os.path.join(dir, 'test.json')) as f: | ||
| correct = json.load(f) | ||
| + for entry in correct: | ||
| + entry['text'] = [[a, b, c, six.b(d), e] |
|
The Travis failures are in test_scales.py (fix in #7726), and for some reason the OSX builder is missing LaTeX and Ghostscript. |
|
The osx build does not install latex because it takes way to long. IMHO the tests should be robust to this and skip when latex is not available |
|
I was wrong: it seems that the actual failure in the OSX case is from test_scales.py too, even though there are error messages about missing latex. |
jkseppan
added some commits
Aug 25, 2016
|
Rebased because of conflicts with recently merged PRs |
QuLogic
referenced
this pull request
Feb 10, 2017
Open
"savefig" bug with unicode characters (version 2.0.0) #8039
| @@ -698,9 +736,9 @@ def _write_afm_font(self, filename): | ||
| self.writeObject(fontdictObject, fontdict) | ||
| return fontdictObject | ||
| - def embedTeXFont(self, texname, fontinfo): |
jkseppan
Feb 12, 2017
Member
I don't think that's something that an external user would ever have called. Let's mark the function private with an underscore and document the change.
| @@ -1596,12 +1633,6 @@ def check_gc(self, gc, fillcolor=None): | ||
| gc._fillcolor = orig_fill | ||
| gc._effective_alphas = orig_alphas | ||
| - def tex_font_mapping(self, texfont): |
| - with open(filename, 'rt') as file: | ||
| + self._filename = filename | ||
| + if six.PY3 and isinstance(filename, bytes): | ||
| + self._filename = filename.decode('ascii', errors='replace') |
jkseppan
Feb 12, 2017
Member
I suppose the system encoding is more correct, but conversions like that make me somewhat wary. It's not really enough to specify UTF-8, you have to know which representation to choose for characters where you have a choice. (For example, the Wikipedia page on HFS+: "File and folder names in HFS Plus are [...] normalized to a form very nearly the same as Unicode Normalization Form D (NFD)". At least at one time the Linux HFS+ implementation didn't follow this.)
QuLogic
Feb 12, 2017
Member
The correct encoding depends on where the bytestring originates. If it's out of a TeX file, I wouldn't be surprised if ASCII were good enough considering the esoteric requirements like fitting in 8 characters.
If it's something the user supplies, then there's really no good default and they really should have done it themselves. If the "user" is us, then we really need to fix that end instead.
jkseppan
Feb 13, 2017
Member
I think the only current users in our code are the PDF backend and the text2path code, both of which just pass in the location of "pdftex.map". I initially thought this might need to be made customizable but I've never seen that as a feature request.
| - word = word.lstrip('<') | ||
| - if word.startswith('[') or word.endswith('.enc'): | ||
| + empty_re = re.compile(br'%|\s*$') | ||
| + word_re = re.compile( |
tacaswell
Feb 11, 2017
Owner
I learned quite a bit about regular expressions understanding this pattern.
For example, you can make them readable and name groups!
| + psname = w.group('eff2') or w.group('eff1') | ||
| + | ||
| + for w in words: | ||
| + eff = w.group('eff1') or w.group('eff2') |
tacaswell
Feb 11, 2017
Owner
Everywhere else these are listed in reverse order to how they are in the expression, does that matter?
jkseppan
Feb 12, 2017
Member
The named groups are mutually exclusive so the order doesn't matter for correctness, but there may be a very slight performance difference. I think I was listing the groups in an approximate order of probability of occurrence, e.g. effects are almost certainly quoted because they include arguments, but file and font names are almost certainly not quoted. I can see how this would be confusing; I'll add a comment.
| + raise ValueError("Cannot locate beginning of encoding in {}" | ||
| + .format(file)) | ||
| + data = data[beginning:] | ||
| + end = data.find(b']') |
|
That was fun to review! I left 1 question about order of an I am overall |
tacaswell
modified the milestone: 2.1 (next point release), 2.0.1 (next bug fix release)
Feb 11, 2017
|
Also re-milestoned for 2.1. It isn't really fixing a regression (these issues are very long standing) and makes some pretty substantial changes to subtle code so seems higher-risk. I am open to arguments that this should be backported though. |
|
Also jkseppan#6 |
jkseppan
added some commits
Feb 12, 2017
|
I think the issue dates back to when Matplotlib started supporting Python 3. I think this started out as a smaller fix, but the earlier round of review suggested some improvements that I agreed with. I think the 2.1 milestone is appropriate. |
jkseppan commentedAug 25, 2016
Dvi is a binary format that includes some ASCII strings such as
TeX names of some fonts. The associated files such as psfonts.map
need to be ASCII too. This patch changes their handling to keep
them as binary strings all the time.
This avoids the ugly workaround
which is essentially saying that texname is sometimes a string,
sometimes a bytestring. The workaround masks real KeyErrors,
leading to incomprehensible error messages such as in #6516.