Decode subprocess output to utf-8 or regex will fail #1879

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
3 participants
Contributor

tomspur commented Apr 3, 2013

This is an alternative fix for issue #1767, which changes the regular expression to bytes instead. But that will fail later on in if os.path.splitext(file)[1][1:] in fontexts, as fontexts will then be a set of strings and the former will be the current extension, but in bytes.

After this patch both will be strings and anything works here for me so far.

The RH bug for this is at:
https://bugzilla.redhat.com/show_bug.cgi?id=928326

Owner

mdboom commented Apr 3, 2013

This will only work if the filesystem encoding happens to be utf-8, though, right? (That's usually the case, but not always). I think instead this needs to use sys.getfilesystemencoding().

Contributor

tomspur commented Apr 5, 2013

Three variables must be considered:

  • Regular expressions compiler
  • output of subprocess, which is processed with the reqular expressions compiler. By default returns bytes
  • fontexts synonyms, which are plain strings at the moment.

Things to consider:

  • The reqular expressions compiler and the output of subprocess encoding must match, which is the case, when using a bytes reqular expression or decoding the bytes object to a string. utf-8 is only the default for bytes.decode and could also be omited.
  • Even if the system file encoding will be differently, subprocess returns byes, which can be decoded to whatever you like. So utf-8, what is the default for bytes.decode anyway looks fine.

The only reason for sys.getfilesystemencoding() would be, that matplotlib uses the same encoding internally than the file system. But even there, I think it would be more sane to directly use utf-8 instead and would favour that in any case...

Owner

mdboom commented Apr 8, 2013

I think you've raised a valid issue with the fix in #1767, however I think this still is not correct.

If the file system encoding were, for example latin-1, it's possible, and actually quite easy, to have filenames that are not valid utf-8. (Note: I'm not saying not valid Unicode -- utf-8 is only an encoding, and not all bytestrings are valid utf-8). I looked deeper into fc-match -- the manpage doesn't really address encodings, but it does look from the code like the output is basically utf-8, though the filename part is passed in directly through, so it can be in the filesystem encoding. So I think the correct fix is:

--- a/lib/matplotlib/font_manager.py
+++ b/lib/matplotlib/font_manager.py
@@ -1289,11 +1289,12 @@ if USE_FONTCONFIG and sys.platform != 'win32':
         if pipe.returncode == 0:
             for match in _fc_match_regex.finditer(output):
                 file = match.group(1)
+                file = file.decode(sys.getfilesystemencoding())
                 if os.path.splitext(file)[1][1:] in fontexts:
                     return file
         return None

-    _fc_match_regex = re.compile(r'\sfile:\s+"([^"]*)"')
+    _fc_match_regex = re.compile(rb'\sfile:\s+"([^"]*)"')
     _fc_match_cache = {}

     def findfont(prop, fontext='ttf'):
Member

pelson commented Apr 18, 2013

I've marked this a blocker for v1.3 as it sounds pretty nasty. @mdboom & @tomspur - do we have a consensus on where this PR should go?

Contributor

tomspur commented Apr 18, 2013

@pelson I think @mdboom's appoach is more reasonable. Unfortunately, I currently have other problems with my former reproducer from above (unrelated to matplotlib) and cannot verify his solution currently...

mdboom closed this in 2415c62 Apr 18, 2013

tomspur deleted the tomspur:font_manager_utf8regex branch Mar 8, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment