Handle dvi font names as ASCII bytestrings #6977

Merged
merged 24 commits into from Feb 26, 2017

Conversation

Projects
None yet
7 participants
Member

jkseppan commented Aug 25, 2016

Dvi is a binary format that includes some ASCII strings such as
TeX names of some fonts. The associated files such as psfonts.map
need to be ASCII too. This patch changes their handling to keep
them as binary strings all the time.

This avoids the ugly workaround

    try:
        result = some_mapping[texname]
    except KeyError:
        result = some_mapping[texname.decode('ascii')]

which is essentially saying that texname is sometimes a string,
sometimes a bytestring. The workaround masks real KeyErrors,
leading to incomprehensible error messages such as in #6516.

mdboom added the needs_review label Aug 25, 2016

jkseppan requested a review from anntzer Dec 26, 2016

lib/matplotlib/dviread.py
@@ -526,8 +527,7 @@ class DviFont(object):
__slots__ = ('texname', 'size', 'widths', '_scale', '_vf', '_tfm')
def __init__(self, scale, tfm, texname, vf):
- if six.PY3 and isinstance(texname, bytes):
- texname = texname.decode('ascii')
+ assert(isinstance(texname, bytes))
@anntzer

anntzer Dec 26, 2016

Contributor

No parentheses here. (And in general I think we should avoid bare asserts in favor of raising explicit exceptions.)

lib/matplotlib/dviread.py
@@ -510,7 +511,7 @@ class DviFont(object):
Name of the font as used internally by TeX and friends. This
is usually very different from any external font names, and
:class:`dviread.PsfontsMap` can be used to find the external
- name of the font.
+ name of the font. ASCII bytestring.
@anntzer

anntzer Dec 26, 2016

Contributor

If we can't use numpydoc/napoleon-style docstrings I would at least put the type at a more prominent place.

lib/matplotlib/dviread.py
enc = find_tex_file(result.encoding)
return result._replace(filename=fn, encoding=enc)
def _parse(self, file):
- """Parse each line into words."""
+ """Parse each line into words and process them."""
@anntzer

anntzer Dec 26, 2016

Contributor

This docstring seems rather pointless (it is no news that a method named _parse parses and processes). I would either make it more explicit (what is the parsed format?) or just get rid of it.

lib/matplotlib/dviread.py
line = line.strip()
- if line == '' or line.startswith('%'):
+ if line == b'' or line.startswith(b'%'):
continue
words, pos = [], 0
while pos < len(line):
@anntzer

anntzer Dec 26, 2016

Contributor

If I understand the logic of this method correctly it should be possible to rewrite this loop simply as something like (untested)

re.findall(b'("[^"]*"|[^ ]*)', line)

right? (as findall returns nonoverlapping matches)

lib/matplotlib/dviread.py
@@ -861,19 +878,23 @@ def _register(self, words):
# http://tex.stackexchange.com/questions/10826/
# http://article.gmane.org/gmane.comp.tex.pdftex/4914
+ # input must be bytestrings (the file format is ASCII)
+ for word in words:
+ assert(isinstance(word, bytes))
@anntzer

anntzer Dec 26, 2016

Contributor

Same remark as above re: assertions.

lib/matplotlib/dviread.py
+ # input must be bytestrings (the file format is ASCII)
+ for word in words:
+ assert(isinstance(word, bytes))
+
@anntzer

anntzer Dec 26, 2016

Contributor

Again, the logic of this function may perhaps be more easily expressed using regexes.

Contributor

anntzer commented Dec 26, 2016

@jkseppan Not sure why you asked me to review the PR (I don't know that much about that part of the codebase) but I had a quick look for now.

Member

jkseppan commented Dec 27, 2016

@anntzer Many thanks for the review! I thought this had been waiting for quite a while and I noticed your name in the git history for dviread.py, so I figured you have at least taken a look at it recently.

Contributor

anntzer commented Dec 27, 2016

No worries.

Member

jkseppan commented Dec 29, 2016

I changed a bunch of the docstrings to the numpydoc format, and rebased because I couldn't get the documentation build to work otherwise.

lib/matplotlib/dviread.py
- filename = word
+ # input must be bytestrings (the file format is ASCII)
+ for word in words:
+ assert isinstance(word, bytes)
@tacaswell

tacaswell Dec 29, 2016

Owner

A while ago where was an effort to remove assert from all of the main code base and replace with if not ...: raise blocks. Are these here for testing reasons or run-time checks?

@jkseppan

jkseppan Dec 29, 2016

Member

This is a private function and should only be called from the same class, so if this triggers it's an internal error in the code. While writing this code I felt this was more obvious than the later errors you get from mixing bytestrings and strings, but perhaps this is not really needed.

Owner

tacaswell commented Dec 29, 2016

The test failures look real

lib/matplotlib/dviread.py
@@ -927,26 +981,27 @@ def _parse(self, file):
state = 0
for line in file:
- comment_start = line.find('%')
+ line = six.b(line)
+ comment_start = line.find(b'%')
if comment_start > -1:
line = line[:comment_start]
line = line.strip()
if state == 0:
@anntzer

anntzer Dec 30, 2016

Contributor

I tried googling without much success the format of enc files but it looks like this is looking for patterns for the form

/FooEncoding [ /abc/def/ghi ] def

and returning the slash-separated words within the brackets. At least the searching for brackets-surrounded fragments can clearly be rewritten using re.findall as above.

Member

jkseppan commented Dec 31, 2016

That seems to have helped with Python 3 failures, but now we get test failures on Python 2.7.

Member

jkseppan commented Jan 1, 2017

Now the Travis build passed after I restarted one of the jobs.

codecov-io commented Jan 1, 2017 edited

Current coverage is 62.10% (diff: 91.89%)

Merging #6977 into master will decrease coverage by 4.47%

@@             master      #6977   diff @@
==========================================
  Files           109        174     +65   
  Lines         46648      56012   +9364   
  Methods           0          0           
  Messages          0          0           
  Branches          0          0           
==========================================
+ Hits          31060      34786   +3726   
- Misses        15588      21226   +5638   
  Partials          0          0           

Powered by Codecov. Last update b12b6a7...08c69f5

lib/matplotlib/dviread.py
+ Used for verifying against the dvi file.
+ design_size : int
+ Design size of the font (unknown units)
+ width : dict
@QuLogic

QuLogic Jan 2, 2017

Member

Assuming they are all of similar description and the original just eschewed repeating all that, this can be combined with the next two as:

width, height, depth : dict
    Dimensions of each character, ...
lib/matplotlib/dviread.py
with open(filename, 'rt') as file:
self._parse(file)
def __getitem__(self, texname):
+ assert(isinstance(texname, bytes))
@QuLogic

QuLogic Jan 2, 2017

Member

Parentheses removal, from @anntzer.

lib/matplotlib/dviread.py
try:
result = self._font[texname]
except KeyError:
- result = self._font[texname.decode('ascii')]
+ matplotlib.verbose.report(textwrap.fill
+ ('A PostScript file for the font whose TeX name is "%s" '
@QuLogic

QuLogic Jan 2, 2017

Member

It seems weird to me to wrap between a function call and its opening parenthesis.

lib/matplotlib/dviread.py
+ 'package manager.' % (texname.decode('ascii'),
+ self._filename),
+ break_on_hyphens=False, break_long_words=False),
+ 'helpful')
@QuLogic

QuLogic Jan 2, 2017

Member

Especially because this line is in a different function call than the ones above it.

lib/matplotlib/dviread.py
for line in file:
+ line = six.b(line)
@QuLogic

QuLogic Jan 2, 2017

Member

Why not open the file in binary mode in __init__ instead of decoding and re-encoding here? Also, six.b is supposed to be applied on string literals exclusively; this should use an explicit encode if you really want to do this.

lib/matplotlib/dviread.py
+ font_files = [word.lstrip(b'<')
+ for word in words
+ if word.startswith(b'<')
+ and not re.match(encoding_re, word)]
@QuLogic

QuLogic Jan 2, 2017

Member

Not sure that 3 list comprehensions over words was the best change, efficiency-wise.

lib/matplotlib/dviread.py
- if comment_start > -1:
- line = line[:comment_start]
- line = line.strip()
+ lines = (line[:line.find(b'%')] if b'%' in line else line.strip()
@QuLogic

QuLogic Jan 2, 2017

Member

The previous code ran strip after removing the comment as well; this can be shortened a bit: line.split(b'%')[0].strip().

@anntzer

anntzer Jan 2, 2017

Contributor

line.split(b'%', 1)[0].strip() is a bit truer in spirit.

lib/matplotlib/dviread.py
+ lines = (line[:line.find(b'%')] if b'%' in line else line.strip()
+ for line in file)
+ data = b''.join(lines)
+ match = re.search(six.b(r'\['), data)
@QuLogic

QuLogic Jan 2, 2017

Member

six.b is for 2.5 support; it can be dropped and the proper prefix used.

@anntzer

anntzer Jan 2, 2017 edited

Contributor

Shouldn't this just be data.find('[') (update below accordingly)?

lib/matplotlib/dviread.py
+ if not match:
+ raise ValueError("Cannot locate beginning of encoding in {}"
+ .format(file))
+ data = data[match.span()[1]:]
@QuLogic

QuLogic Jan 2, 2017

Member

match.end()?

lib/matplotlib/dviread.py
+ raise ValueError("Cannot locate beginning of encoding in {}"
+ .format(file))
+ data = data[match.span()[1]:]
+ match = re.search(six.b(r'\]'), data)
@QuLogic

QuLogic Jan 2, 2017

Member

Remove six.b here as well.

lib/matplotlib/dviread.py
+ if not match:
+ raise ValueError("Cannot locate end of encoding in {}"
+ .format(file))
+ data = data[:match.span()[0]]
@QuLogic

QuLogic Jan 2, 2017

Member

match.start()?

lib/matplotlib/dviread.py
- raise ValueError("Broken name in encoding file: " + w)
-
- return result
+ return re.findall(six.b(r'/([^][{}<>\s]+)'), data)
@QuLogic

QuLogic Jan 2, 2017

Member

More six.b removal.

lib/matplotlib/tests/test_dviread.py
@skip_if_command_unavailable(["kpsewhich", "-version"])
def test_dviread():
dir = os.path.join(os.path.dirname(__file__), 'baseline_images', 'dviread')
with open(os.path.join(dir, 'test.json')) as f:
correct = json.load(f)
+ for entry in correct:
+ entry['text'] = [[a, b, c, six.b(d), e]
@QuLogic

QuLogic Jan 2, 2017

Member

Replace six.b with explicit encode.

Member

jkseppan commented Jan 3, 2017

The Travis failures are in test_scales.py (fix in #7726), and for some reason the OSX builder is missing LaTeX and Ghostscript.

Owner

jenshnielsen commented Jan 3, 2017

The osx build does not install latex because it takes way to long. IMHO the tests should be robust to this and skip when latex is not available

Member

jkseppan commented Jan 3, 2017

I was wrong: it seems that the actual failure in the OSX case is from test_scales.py too, even though there are error messages about missing latex.

jkseppan added some commits Aug 25, 2016

@jkseppan jkseppan Handle dvi font names as ASCII bytestrings
Dvi is a binary format that includes some ASCII strings such as
TeX names of some fonts. The associated files such as psfonts.map
need to be ASCII too. This patch changes their handling to keep
them as binary strings all the time.

This avoids the ugly workaround

        try:
            result = some_mapping[texname]
        except KeyError:
            result = some_mapping[texname.decode('ascii')]

which is essentially saying that texname is sometimes a string,
sometimes a bytestring. The workaround masks real KeyErrors,
leading to incomprehensible error messages such as in #6516.
705b021
@jkseppan jkseppan Test that the KeyError is raised when the font is missing dbc8b9e
@jkseppan jkseppan Mention bytestrings in docstring 93fad55
@jkseppan jkseppan Add a helpful note when raising KeyError from dviread.PsFonts
So if you follow the troubleshooting instructions and rerun
with --verbose-helpful you get a hint about the usual reason
for #6516.
4874e4e
@jkseppan jkseppan Attempted fix for Python 3.4 compatibility a130ba7
@jkseppan jkseppan More python 3.4 compatibility 0f0e41a
@jkseppan jkseppan Use numpydoc format for several dviread docstrings a7b5772
@jkseppan jkseppan Remove useless docstring 803a96e
@jkseppan jkseppan Raise a more useful exception ec5d80e
@jkseppan jkseppan Remove misleading parentheses from assert fe52808
@jkseppan jkseppan Simplify parsing with regular expressions aa8c4f6
@jkseppan jkseppan Perhaps simplify further with regular expressions 9de07aa
@jkseppan jkseppan Remove useless assert
If the class has an internal error triggering this assertion,
we will get some errors later from mixing strings and bytestrings.
c87b653
@jkseppan jkseppan Fix dvi font name handling in pdf backend
These are now ASCII bytestrings so we should not assume they
are strings.
2e19a61
@jkseppan jkseppan Separate the handling of dvi fonts in the pdf backend
Don't mix filenames and dvi font names as keys of the same dict.
119934a
@jkseppan jkseppan Simplify enc file parsing
Use re.findall, and open the file as binary.
8fa303f
@jkseppan jkseppan Small changes in response to code review
Improve a docstring, remove unneeded parens from an assert,
open a file as binary instead of encoding each line read from it,
don't call six.b on variable strings, simplify string handling,
improve the formatting of a matplotlib.verbose.report call.
94587b1
@jkseppan jkseppan Simplify psfonts.map parsing further
Combine the word splitting and classification in one regex so we
only have to scan each line once. Add some quotation marks in the
test case to check that we are handling quoted words correctly
(the behavior should always have matched this test case).
254e3df
Member

jkseppan commented Jan 29, 2017

Rebased because of conflicts with recently merged PRs

@jkseppan jkseppan Try to fix the KeyError test
a8674b3
@@ -698,9 +736,9 @@ def _write_afm_font(self, filename):
self.writeObject(fontdictObject, fontdict)
return fontdictObject
- def embedTeXFont(self, texname, fontinfo):
@tacaswell

tacaswell Feb 11, 2017

Owner

Do we care about this API change?

@jkseppan

jkseppan Feb 12, 2017

Member

I don't think that's something that an external user would ever have called. Let's mark the function private with an underscore and document the change.

@@ -1596,12 +1633,6 @@ def check_gc(self, gc, fillcolor=None):
gc._fillcolor = orig_fill
gc._effective_alphas = orig_alphas
- def tex_font_mapping(self, texfont):
@tacaswell

tacaswell Feb 11, 2017

Owner

do we care about this API change?

@jkseppan

jkseppan Feb 12, 2017

Member

Likewise.

lib/matplotlib/dviread.py
- with open(filename, 'rt') as file:
+ self._filename = filename
+ if six.PY3 and isinstance(filename, bytes):
+ self._filename = filename.decode('ascii', errors='replace')
@tacaswell

tacaswell Feb 11, 2017

Owner

Why ascii instead of utf-8 or the system encoding?

@jkseppan

jkseppan Feb 12, 2017

Member

I suppose the system encoding is more correct, but conversions like that make me somewhat wary. It's not really enough to specify UTF-8, you have to know which representation to choose for characters where you have a choice. (For example, the Wikipedia page on HFS+: "File and folder names in HFS Plus are [...] normalized to a form very nearly the same as Unicode Normalization Form D (NFD)". At least at one time the Linux HFS+ implementation didn't follow this.)

@QuLogic

QuLogic Feb 12, 2017

Member

The correct encoding depends on where the bytestring originates. If it's out of a TeX file, I wouldn't be surprised if ASCII were good enough considering the esoteric requirements like fitting in 8 characters.

If it's something the user supplies, then there's really no good default and they really should have done it themselves. If the "user" is us, then we really need to fix that end instead.

@jkseppan

jkseppan Feb 13, 2017

Member

I think the only current users in our code are the PDF backend and the text2path code, both of which just pass in the location of "pdftex.map". I initially thought this might need to be made customizable but I've never seen that as a feature request.

- word = word.lstrip('<')
- if word.startswith('[') or word.endswith('.enc'):
+ empty_re = re.compile(br'%|\s*$')
+ word_re = re.compile(
@tacaswell

tacaswell Feb 11, 2017

Owner

I learned quite a bit about regular expressions understanding this pattern.

For example, you can make them readable and name groups!

lib/matplotlib/dviread.py
+ psname = w.group('eff2') or w.group('eff1')
+
+ for w in words:
+ eff = w.group('eff1') or w.group('eff2')
@tacaswell

tacaswell Feb 11, 2017

Owner

Everywhere else these are listed in reverse order to how they are in the expression, does that matter?

@jkseppan

jkseppan Feb 12, 2017

Member

The named groups are mutually exclusive so the order doesn't matter for correctness, but there may be a very slight performance difference. I think I was listing the groups in an approximate order of probability of occurrence, e.g. effects are almost certainly quoted because they include arguments, but file and font names are almost certainly not quoted. I can see how this would be confusing; I'll add a comment.

+ raise ValueError("Cannot locate beginning of encoding in {}"
+ .format(file))
+ data = data[beginning:]
+ end = data.find(b']')
@tacaswell

tacaswell Feb 11, 2017

Owner

should this be an rfind?

@tacaswell

tacaswell Feb 11, 2017

Owner

nevermind 🐑

Owner

tacaswell commented Feb 11, 2017

That was fun to review!

I left 1 question about order of an or that should be addressed. The 2 API change questions I think just need to be documented or shimmed back to what they were.

I am overall 👍 on this.

Owner

tacaswell commented Feb 11, 2017

Also re-milestoned for 2.1. It isn't really fixing a regression (these issues are very long standing) and makes some pretty substantial changes to subtle code so seems higher-risk.

I am open to arguments that this should be backported though.

@tacaswell tacaswell ENH: make texFontMap a property
Only used once in the code, but makes the lazy parsing more standard.
25a8fed
Owner

tacaswell commented Feb 11, 2017

Also jkseppan#6

jkseppan added some commits Feb 12, 2017

@jkseppan jkseppan Merge pull request #6 from tacaswell/dvi-ascii
ENH: make texFontMap a property
92e2c52
@jkseppan jkseppan Use file system encoding for the psfonts file name
5ba21b0
@jkseppan jkseppan Document minor API changes
And add an underscore in the beginning of the method whose signature
changes.
10135bf
@jkseppan jkseppan Explain named group ordering
6de9813
Member

jkseppan commented Feb 12, 2017

I think the issue dates back to when Matplotlib started supporting Python 3. I think this started out as a smaller fix, but the earlier round of review suggested some improvements that I agreed with. I think the 2.1 milestone is appropriate.

@QuLogic QuLogic merged commit 4a2c850 into matplotlib:master Feb 26, 2017

4 checks passed

codecov/patch 95.14% of diff hit (target 62.13%)
Details
codecov/project 68.62% (+6.49%) compared to 939747e
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

QuLogic removed the needs_review label Mar 23, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment