Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Type-1 font parsing #20715

Merged
merged 2 commits into from Oct 4, 2021
Merged

Conversation

jkseppan
Copy link
Member

@jkseppan jkseppan commented Jul 22, 2021

PR Summary

Parse font properties also from the encrypted part of the file, and reimplement the parsing so it understands more of PostScript's syntax. This fixes a bug where Type1Font.transform would not remove the UniqueID key but break some PostScript code referring to UniqueID instead.

Incidentally, fix the bug where every font had a weight property with value 'Normal' - the correct property is spelled Weight with a capital letter.

This is a prerequisite for subsetting Type-1 fonts (#127).

PR Checklist

  • Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (run flake8 on changed files to check).
  • New features are documented, with examples if plot related.
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
  • New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

@jkseppan jkseppan force-pushed the type1-improved-parsing branch 4 times, most recently from 981ef67 to 90b5889 Compare July 22, 2021 12:46
jkseppan added a commit to jkseppan/matplotlib that referenced this pull request Jul 22, 2021
With this I can produce smaller pdf files with usetex in some small
tests, but this obviously needs more extensive testing, thus marking
as draft.

On top of matplotlib#20634 and matplotlib#20715. Closes matplotlib#127.
@jkseppan jkseppan mentioned this pull request Jul 22, 2021
7 tasks
jkseppan added a commit to jkseppan/matplotlib that referenced this pull request Jul 22, 2021
With this I can produce smaller pdf files with usetex in some small
tests, but this obviously needs more extensive testing, thus marking
as draft.

On top of matplotlib#20634 and matplotlib#20715. Closes matplotlib#127.
@jkseppan jkseppan force-pushed the type1-improved-parsing branch 2 times, most recently from e35728b to 9418b35 Compare July 22, 2021 15:59
jkseppan added a commit to jkseppan/matplotlib that referenced this pull request Jul 22, 2021
With this I can produce smaller pdf files with usetex in some small
tests, but this obviously needs more extensive testing, thus marking
as draft.

On top of matplotlib#20715. Closes matplotlib#127.
@jkseppan jkseppan force-pushed the type1-improved-parsing branch 2 times, most recently from 19119d8 to 25613b9 Compare August 28, 2021 18:15
@jkseppan
Copy link
Member Author

Thanks for the review @anntzer!

Move Type1Font._tokens into a top-level function _tokenize that is a
coroutine. The parsing stage consuming the tokens can instruct the
tokenizer to return a binary token - this is necessary when decrypting
the CharStrings and Subrs arrays, since the preceding context determines
which parts of the data need to be decrypted.

The function now also parses the encrypted portion of the font file.

To support usage as a coroutine, move the whitespace filtering into the
function, since passing the information about binary tokens would not
easily work through a filter.

The function now returns tokens as subclasses of a new _Token class,
which carry the position and value of the token and can have
token-specific helper methods. The position data will be needed when
modifying the file, as the font is transformed or subsetted.

A new helper function _expression can be used to consume tokens that
form a balanced subexpression delimited by [] or {}. This helps fix a
bug in UniqueID removal: if the font includes PostScript code that
checks if the UniqueID is set in the current dictionary, the previous
code broke that code instead of removing the UniqueID definition. Fonts
can include UniqueID in the encrypted portion as well as the cleartext
one, and removal is now done in both portions.

Fix a bug related to font weight: the key is title-cased and not
lower-cased, so font.prop['weight'] should not exist.
Type-1 fonts are required to have subroutines with specific contents
but their names may vary. They are usually ND, NP and RD but names
like | and |- appear too.
@jkseppan
Copy link
Member Author

I added some tests and realized that the string escaping was slightly wrong. Now the code also parses string values, although it is unlikely that font properties would include escaped whitespace characters.

Copy link
Contributor

@anntzer anntzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit I still haven't followed all the logic, but we can always check again later :)

jkseppan added a commit to jkseppan/matplotlib that referenced this pull request Aug 29, 2021
With this I can produce smaller pdf files with usetex in some small
tests, but this obviously needs more extensive testing, thus marking
as draft.

Give dviread.DviFont a fake filename attribute for character tracking.

On top of matplotlib#20715. Closes matplotlib#127.
jkseppan added a commit to jkseppan/matplotlib that referenced this pull request Aug 30, 2021
With this I can produce smaller pdf files with usetex in some small
tests, but this obviously needs more extensive testing, thus marking
as draft.

Give dviread.DviFont a fake filename attribute for character tracking.

On top of matplotlib#20715. Closes matplotlib#127.
depth += 1
elif match.group() == ')':
depth -= 1
else: # a backslash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that here you don't really care about handling the backslash escapes, and all you want to do is simply to match the (unescaped) parentheses, so you could perhaps just replace instring_re by something like (?<!\\)[()] (parentheses not preceded by a backslash)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I gave this a try but it wouldn't work because the parenthesis can be preceded by backslashes if the backslash is itself escaped (i.e. one would really need to search for "parenthesis not preceded by an even number of backslashes"). So let's forget about this for now.

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge based on @anntzer review. This should not go into 3.5 so that it has time to be used on master...

@jklymak jklymak added this to the v3.6.0 milestone Oct 4, 2021
@jklymak jklymak merged commit e9bd017 into matplotlib:master Oct 4, 2021
tacaswell pushed a commit to tacaswell/matplotlib that referenced this pull request Oct 12, 2021
tacaswell pushed a commit that referenced this pull request Oct 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants