[api-minor] Include and use the 14 standard font files. #12726

Merged
merged 1 commit into from Jun 8, 2021

Conversation

brendandahl
Contributor

Fixes #4244 (at least for the standard font issue).

I need to do some more testing to see if this impacts performance for the browser use case. I'm also considering disabling this feature for browsers so we use the builtin fonts. Though, we may want to use the various symbol fonts since those seem to vary per platform.

@brendandahl
Contributor Author

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @brendandahl received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5b052292e34e465/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @brendandahl received. Current queue size: 0

Live output at: http://3.101.106.178:8877/9cd5c996cf5da70/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/5b052292e34e465/output.txt

Total script time: 26.47 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/5b052292e34e465/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://3.101.106.178:8877/9cd5c996cf5da70/output.txt

Total script time: 28.68 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/9cd5c996cf5da70/reftest-analyzer.html#web=eq.log

@brendandahl brendandahl changed the title from "Include and use the 14 standard fonts files." to "Include and use the 14 standard font files." Dec 11, 2020
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Overall this seems like a very reasonable approach, and the re-use between the CMap/StandardFont factories is really nice.
Besides a few inline comments, based on a quick look at the code, I've got one "bigger picture" suggestion below.

I'm also considering disabling this feature for browsers so we use the builtin fonts. Though, we may want to use the various symbol fonts since those seem to vary per platform.

My general train of thought here (ignoring the symbol fonts for now), is that we'd only fetch/use the standard font files when actually necessary. In practice, that'd mean:


Edit: Thinking about my suggestion a bit more, it seems that conditional loading will probably be somewhat difficult to implement well given the existing font-parsing code and the multitude of cases to consider. However, with caching implemented in the worker (note the inline comment), I'm not sure how much of a problem always loading these font files would really be!?

(Back when I added worker-thread caching of CMap-files, overall performance improved significantly for affected documents.)

src/core/evaluator.js
let file = null;
if (standardFontName) {
const data = await this.fetchStandardFontData(standardFontName);
file = new Stream(data);
Collaborator

Please let fetchStandardFontData return the data as a Stream, rather than doing that manually at the call-sites.

@@ -243,6 +243,12 @@ function getFontType(type, subtype) {
}
}

function getStandardFontName(name) {
const fontName = name.replace(/[,_]/g, "-").replace(/\s/g, "");
Collaborator

Since you're using the same regular expression at

let fontName = name.replace(/[,_]/g, "-").replace(/\s/g, "");

can you please extract the regexp-part to a new constant, used by both call-sites, to avoid these getting out of sync?
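One way this could look, as a minimal sketch: both patterns are hoisted into module-level constants and both call-sites go through a single helper. The names `FONT_NAME_SEPARATORS`, `WHITESPACE`, and `normalizeFontName`, as well as the `knownFonts` parameter, are hypothetical and not taken from the patch.

```javascript
// Hypothetical shared constants; the actual patch may name these differently.
const FONT_NAME_SEPARATORS = /[,_]/g;
const WHITESPACE = /\s/g;

// Single place where a raw font name is normalized, so the two
// call-sites cannot drift apart.
function normalizeFontName(name) {
  return name.replace(FONT_NAME_SEPARATORS, "-").replace(WHITESPACE, "");
}

// Illustrative call-site: `knownFonts` is an assumed Set of standard names.
function getStandardFontName(name, knownFonts) {
  const fontName = normalizeFontName(name);
  return knownFonts.has(fontName) ? fontName : null;
}
```

Since `String.prototype.replace` resets `lastIndex` for global regular expressions, sharing the constants between call-sites should not introduce state between calls.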

src/display/api.js
return { cMapData, compressionType };
});
}
return fetchData(url, this.isCompressed).then(data => {
Collaborator

Nit:

Suggested change
- return fetchData(url, this.isCompressed).then(data => {
+ return fetchData(url, /* asTypedArray = */ this.isCompressed).then(data => {

});
}
}

class DOMStandardFontDataFactory extends BaseStandardFontDataFactory {
_fetchData(url) {
return fetchData(url, true);
Collaborator

Nit:

Suggested change
- return fetchData(url, true);
+ return fetchData(url, /* asTypedArray = */ true);

@@ -178,6 +178,14 @@ const defaultOptions = {
: "../web/cmaps/",
kind: OptionKind.API,
},
standardFontDataUrl: {
Collaborator

Please move this further down, since the parameters for each OptionKind should be sorted alphabetically.

Comment on lines 392 to 407
const standardFontNameToFileName = {
Courier: "FoxitFixed",
"Courier-Bold": "FoxitFixedBold",
"Courier-BoldOblique": "FoxitFixedBoldItalic",
"Courier-Oblique": "FoxitFixedItalic",
Helvetica: "FoxitSans",
"Helvetica-Bold": "FoxitSansBold",
"Helvetica-BoldOblique": "FoxitSansBoldItalic",
"Helvetica-Oblique": "FoxitSansItalic",
"Times-Roman": "FoxitSerif",
"Times-Bold": "FoxitSerifBold",
"Times-BoldItalic": "FoxitSerifBoldItalic",
"Times-Italic": "FoxitSerifItalic",
Symbol: "FoxitSymbol",
ZapfDingbats: "FoxitDingbats",
};
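As a rough illustration of how such a table might be consumed, the sketch below resolves a standard font name to a file URL. The `resolveStandardFontUrl` helper, the `.pfb` extension, and the base URL are assumptions made for the example, and only a few entries of the table are repeated.

```javascript
// Subset of the mapping above, repeated so the example is self-contained.
const standardFontNameToFileName = {
  Courier: "FoxitFixed",
  Helvetica: "FoxitSans",
  "Times-Roman": "FoxitSerif",
  Symbol: "FoxitSymbol",
  ZapfDingbats: "FoxitDingbats",
};

// Hypothetical helper: returns the URL of the font file, or null when the
// name is not one of the standard fonts. The ".pfb" extension is an assumption.
function resolveStandardFontUrl(fontName, baseUrl) {
  // Guard with hasOwnProperty so inherited keys such as "toString" never match.
  if (!Object.prototype.hasOwnProperty.call(standardFontNameToFileName, fontName)) {
    return null;
  }
  return `${baseUrl}${standardFontNameToFileName[fontName]}.pfb`;
}
```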
Collaborator

@Snuffleupagus
Collaborator

Given the sheer number of long-standing issues/bugs that this PR would solve, this seems like something that would be really nice to try and fix-up and land sooner rather than later :-)
@brendandahl Do you have time to finish this, such that we can land the PR?


Also, we're probably overdue for a new PDF.js release at this point; however I'd suggest that we hold off on that a little while longer since this PR would be really nice to include in the next release. /cc @timvandermeij

@timvandermeij
Contributor

I agree, this would be a really nice one to include.

@brendandahl
Contributor Author

Still waiting on some Firefox talos runs for Windows, but locally this patch is much slower at loading the first page of unix01.pdf in the pdfpaint benchmark. Median over 20 runs: 238.43 ms without the patch vs 332.51 ms with it.

@Snuffleupagus
Collaborator

Windows talos runs also show a slow down:
base: 257
new: 313
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=2f45f4bcccb0fc60c48180aaaa432e5f1d96424d&newProject=try&newRevision=5cac140a3074ed804b1d72413384f2f065ed83d7&framework=1&showOnlyImportant=1
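For scale, the relative slowdown implied by both sets of numbers can be worked out with a few lines. This is only a back-of-the-envelope sketch: `slowdownPercent` is a hypothetical helper, and it assumes that 238.43 ms is the baseline and 332.51 ms the patched build in the local pdfpaint run.

```javascript
// Relative slowdown of `patched` compared to `base`, in percent.
function slowdownPercent(base, patched) {
  return ((patched - base) / base) * 100;
}

// Local pdfpaint medians (ms) and Windows talos results quoted above.
const localSlowdown = slowdownPercent(238.43, 332.51); // roughly 39.5
const talosSlowdown = slowdownPercent(257, 313); // roughly 21.8

console.log(`local: ${localSlowdown.toFixed(1)}%, talos: ${talosSlowdown.toFixed(1)}%`);
```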

I suppose that it's not entirely surprising if some slowdown is observed, since we're now having to actually parse font data that we previously didn't have to. The question is obviously how large of a regression we can accept!?

Looking briefly at the tested patch, it appears that it doesn't actually implement worker-thread caching of the standard font data (something that I mentioned as a potential problem above). If that's indeed the case, I'd not be entirely surprised if that at least in part contributes to the observed regression.

What would be very interesting to know, is more precisely where the regression comes from:

  • If it's only caused by src/core/cff_parser.js being used a lot more, then a regression would seem entirely acceptable since we obviously need to parse more font data with this patch.
  • If, on the other hand, a significant part of the regression is related to the additional data loading itself then I'd suggest that worker-thread caching might help. A related idea could perhaps be to pre-load the font data early during document loading, to cut down on the waiting time during the actual font-parsing!?
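A minimal sketch of the caching and pre-loading idea from the second bullet, assuming a hypothetical `fetchFontBytes` callback that returns a promise for the font bytes. The class name and its methods are made up for illustration and are not part of the tested patch.

```javascript
// Hypothetical worker-side cache for standard font data.
class StandardFontDataCache {
  constructor(fetchFontBytes) {
    this._fetch = fetchFontBytes; // assumed: (name) => Promise<Uint8Array>
    this._cache = new Map();
  }

  // Returns a promise for the font data, fetching at most once per name.
  get(name) {
    let promise = this._cache.get(name);
    if (!promise) {
      // Cache the promise itself so concurrent requests share one fetch.
      promise = Promise.resolve(this._fetch(name));
      this._cache.set(name, promise);
      // Evict failed fetches so that a later call can retry.
      promise.catch(() => {
        this._cache.delete(name);
      });
    }
    return promise;
  }

  // Warms the cache early, e.g. during document loading.
  preload(names) {
    return Promise.all(names.map(name => this.get(name)));
  }
}
```

Calling `preload` with the fonts a document is likely to need would start the fetches before font parsing asks for them; whether that actually reduces the waiting time would still have to be measured.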

All-in-all, I suppose that it's going to take more work to get this landed than I'd initially hoped.

@Snuffleupagus
Collaborator

Snuffleupagus commented Jan 23, 2021

Having played around with this patch locally, focusing on the actual standard font data loading (using console.time/timeEnd), it seems that a very significant part of the regression comes from just waiting for the font data to become available in the worker.

Based on that observation I have one idea, however it's not a great one: What if we simply bundled the standard font data into the pdf.worker.js file, such that it's synchronously available?
Even suggesting this feels all kinds of wrong, but I'm still wondering if this would be an acceptable trade-off here?

It's difficult to not compare with the CMaps, but I do believe that this situation is slightly different here:

  • A very large number of existing PDF files use non-embedded standard fonts, whereas CMaps are much less common in practice (e.g. mostly non-Latin documents). Hence it's going to be very common to have to access the standard font data.
  • The standard font data files included in this patch take up 266 616 bytes, whereas the CMap files are 1 167 756 bytes in total. Obviously the pdf.worker.js is already large, but that kind of size increase doesn't seem completely unacceptable in this case.
  • We've seen again and again how people report bugs, in custom PDF.js implementations, with broken font rendering where the culprit is that the CMap-parameters weren't provided when calling getDocument[1]; bundling the standard font data into pdf.worker.js would prevent similar issues there.
  • Given how many duplicate issues/bugs we've seen regarding missing standard fonts in PDF.js, it seems very unlikely that users would not want to use this feature anyway.

[1] Many users seem to just ignore the errors/warnings, mentioning the problem, printed in the console.

@brendandahl
Contributor Author

I also tried fetching the font data from the worker. It was a bit faster but not much.

I'd be curious to try directly embedding the files in the worker, but 266,616 bytes seems like a big increase for something that is not really needed in the common case.

Looking briefly at the tested patch, it appears that it doesn't actually implement worker-thread caching of the standard font data (something that I mentioned as a potential problem above). If that's indeed the case, I'd not be entirely surprised if that at least in part contributes to the observed regression.

For the talos test, the web page is reloaded so the cache wouldn't help there. Also, I only see the font data being loaded once right now since we cache fonts, so I'm not sure this would help in the re-render case either.

I'm planning to put this patch on the back burner though as we have some higher priority stuff for firefox (XFA and tagged PDFs). I'll push up my changes with comments addressed, then we could add a pref to have it disabled by default for the time being.

@Snuffleupagus
Collaborator

Snuffleupagus commented Jan 28, 2021

I'd be curious to try directly embedding the files in the worker, but 266,616 bytes seems like a big increase for something that is not really needed in the common case.

One more idea here: How about generating a separate bundle with the standard font data, and then using importScripts to load it into the worker-thread when needed?

Edit: Or how about simply generating one large font data file, similar to but a lot simpler than the bcmap ones, such that we only need to fetch a single (larger) file once rather than do multiple requests for the individual standard font data files?
Based on some very quick testing with pdfBug=Stats, it seems that loading only one file would help reduce the overhead.
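To make the single-file idea concrete, here is a sketch of one possible bundle layout with a matching reader. The layout (one name-length byte, the name, a big-endian 32-bit data length, then the data) and the function names are invented for illustration; they are not the format used in any actual patch.

```javascript
// Packs a Map<string, Uint8Array> of font files into one Uint8Array.
function packFontBundle(fonts) {
  const encoder = new TextEncoder();
  const entries = [];
  let total = 0;
  for (const [name, data] of fonts) {
    const nameBytes = encoder.encode(name);
    if (nameBytes.length > 255) {
      throw new Error(`Font name too long: ${name}`);
    }
    entries.push({ nameBytes, data });
    total += 1 + nameBytes.length + 4 + data.length;
  }
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let pos = 0;
  for (const { nameBytes, data } of entries) {
    out[pos++] = nameBytes.length;
    out.set(nameBytes, pos);
    pos += nameBytes.length;
    view.setUint32(pos, data.length); // big-endian by default
    pos += 4;
    out.set(data, pos);
    pos += data.length;
  }
  return out;
}

// Splits a bundle back into a Map of name -> Uint8Array views (no copying).
function unpackFontBundle(bundle) {
  const decoder = new TextDecoder();
  const view = new DataView(bundle.buffer, bundle.byteOffset, bundle.byteLength);
  const fonts = new Map();
  let pos = 0;
  while (pos < bundle.length) {
    const nameLength = bundle[pos++];
    const name = decoder.decode(bundle.subarray(pos, pos + nameLength));
    pos += nameLength;
    const dataLength = view.getUint32(pos);
    pos += 4;
    fonts.set(name, bundle.subarray(pos, pos + dataLength));
    pos += dataLength;
  }
  return fonts;
}
```

With a layout like this, the worker would fetch one file and slice out individual fonts on demand, trading a larger initial download for fewer requests.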

@Snuffleupagus
Collaborator

Snuffleupagus commented Jan 31, 2021

I've tried implementing the second idea in #12726 (comment), i.e. creating a bundle with all of the standard font data such that we only need to fetch a single file.
Despite the bundle being fairly large (~260 kb), this is a significant performance improvement for the unix01.pdf document (~33 percent faster than this PR). Obviously it's still not as fast as the master branch, but the remaining slowdown is unavoidable since there's more parsing happening in src/core/fonts.js and src/core/cff_parser.js.

Please find my patch at master...Snuffleupagus:standard-fonts

@brendandahl This is probably the best that we can do, without inlining the standard font data in the pdf.worker.js file; could we perhaps consider moving forward with our combined patches here?

@Snuffleupagus
Collaborator

[...] as we have some higher priority stuff for firefox (XFA and tagged PDFs).

I obviously don't question the value of either of those features.
However, I'm willing to bet that including standard font data in PDF.js will fix a lot more things than the above features combined :-)

@Snuffleupagus
Collaborator

@brendandahl Ping, how about my idea/implementation mentioned in #12726 (comment)?

Given that it's going to be impossible to add this functionality without any sort of performance overhead, since we now need to parse more font data in affected cases, the only thing that we can realistically do here is to limit the necessary data fetching/loading by only having one (larger) font data bundle.

@brendandahl
Contributor Author

I've tried implementing the second idea in #12726 (comment), i.e. creating a bundle with all of the standard font data such that we only need to fetch a single file.
Despite the bundle being fairly large (~260 kb), this is a significant performance improvement for the unix01.pdf document (~33 percent faster than this PR). Obviously it's still not as fast as the master branch, but the remaining slowdown is unavoidable since there's more parsing happening in src/core/fonts.js and src/core/cff_parser.js.

Please find my patch at master...Snuffleupagus:standard-fonts

@brendandahl This is probably the best that we can do, without inlining the standard font data in the pdf.worker.js file; could we perhaps consider moving forward with our combined patches here?

I imagine that will slow down the case where there's only one font used (especially for the generic version of the viewer)?

Another thought: we could try to translate the fonts so they don't need to go through the full font parser. I imagine this would be quite a bit of work though. It'd probably be good to profile some more, and see if the font translation is slowing it down or if just loading the font in the browser is the slow part.

@Snuffleupagus
Collaborator

Snuffleupagus commented Feb 25, 2021

I imagine that will slow down the case where there's only one font used (especially for the generic version of the viewer)?

At least locally, the difference in loading time is quite small between loading one font file and loading the (larger) bundle. It seems to me that the overhead related to simply opening the connection, fetching the data, and finally transferring it to the worker-thread is what dominates overall.
However, it would be quite simple to use the bundle by default only in MOZCENTRAL builds, and to keep using the individual files for the GENERIC build. I purposely modified the relevant code such that both "file formats" would still be supported :-)

Another thought: we could try to translate the fonts so they don't need to go through the full font parser. I imagine this would be quite a bit of work though.

I've also had that idea, but thinking about the practical implications of that made me quickly abandon it. Besides the overall complexity, we'd also need to add a way to automatically re-build the translated fonts for any code-changes.

[...] and see if the font translation is slowing it down or just loading the font in the browser is the slow part.

With the bundle approach, it's mostly the actual font conversion code that contributes to the "regression" here.

However, I do believe that the comparison is perhaps not entirely fair here:
In an ideal world every PDF document would simply embed all of its fonts, and we'd thus always have to run our font conversion code.
Hence it may be slightly more fair to compare the performance for a PDF document which has embedded standard fonts, but instead forcibly load "our" standard font data!?

@Snuffleupagus
Collaborator

Snuffleupagus commented Mar 2, 2021

Another thing that I think this patch would also improve is rendering of standard fonts on Linux, since we've seen a few bugs/issues reported where non-embedded standard fonts render badly on Linux because of poor font substitution.

Some examples include https://bugzilla.mozilla.org/show_bug.cgi?id=1695727 and issue #11840

@Snuffleupagus
Collaborator

Snuffleupagus commented Mar 19, 2021

I'd be curious to try directly embedding the files in the worker, but 266,616 bytes seems like a big increase for something that is not really needed in the common case.

Well, with recent development I don't really think that argument holds any more :-)

The XFA functionality, which just landed, increased the size of the built pdf.worker.js file alone by 214 903 bytes in the MOZCENTRAL build. In my experience, XFA is quite uncommon whereas every other PDF file would benefit from us bundling standard fonts.
Hence I think that if we're willing to accept such a large size increase for a niche feature such as XFA, it really ought to be acceptable for non-embedded standard fonts given just how common they are.

@rtestard

The XFA functionality, which just landed, increased the size of the built pdf.worker.js file alone by 214 903 bytes in the MOZCENTRAL build. In my experience, XFA is quite uncommon whereas every other PDF file would benefit from us bundling standard fonts.
Hence I think that if we're willing to accept such a large size increase for a niche feature such as XFA, it really ought to be acceptable for non-embedded standard fonts given just how common they are.

XFA is a lot less frequent than Acroform but we see about a million unique monthly active users encountering them in our desktop telemetry. I know at least of the Canadian Gov website using them. This is a significant number of users who will benefit from this change.

@Snuffleupagus
Collaborator

XFA is a lot less frequent than Acroform but we see about a million unique monthly active users encountering them in our desktop telemetry.

In all fairness, then we should also measure how often users encounter non-embedded standard fonts in PDF documents; I'd be extremely surprised if that number isn't at least one order of magnitude larger :-)

@marco-c
Contributor

marco-c commented Mar 25, 2021

I know at least of the Canadian Gov website using them

There are some XFA forms on the UK and USA gov websites too.

@pdfjsbot

pdfjsbot commented Jun 7, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @brendandahl received. Current queue size: 0

Live output at: http://54.67.70.0:8877/7e9df84dd3886ac/output.txt

@pdfjsbot

pdfjsbot commented Jun 7, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @brendandahl received. Current queue size: 0

Live output at: http://3.101.106.178:8877/bc88cc4cd488960/output.txt

@pdfjsbot

pdfjsbot commented Jun 7, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/7e9df84dd3886ac/output.txt

Total script time: 26.19 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/7e9df84dd3886ac/reftest-analyzer.html#web=eq.log

@pdfjsbot

pdfjsbot commented Jun 7, 2021

From: Bot.io (Windows)


Failed

Full output at http://3.101.106.178:8877/bc88cc4cd488960/output.txt

Total script time: 29.65 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/bc88cc4cd488960/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus merged commit e7dc822 into mozilla:master Jun 8, 2021
@Snuffleupagus
Collaborator

/botio makeref

@pdfjsbot

pdfjsbot commented Jun 8, 2021

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/cdfe0a1ad516c1b/output.txt

@pdfjsbot

pdfjsbot commented Jun 8, 2021

From: Bot.io (Windows)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/029042f654db4da/output.txt

@pdfjsbot

pdfjsbot commented Jun 8, 2021

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/cdfe0a1ad516c1b/output.txt

Total script time: 22.76 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot

pdfjsbot commented Jun 8, 2021

From: Bot.io (Windows)


Success

Full output at http://3.101.106.178:8877/029042f654db4da/output.txt

Total script time: 26.50 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

calixteman added a commit to calixteman/pdf.js that referenced this pull request Jun 8, 2021
  - a lot of xfa files are using Myriad pro or Arial fonts without embedding them and some containers have some dimensions based on those font metrics. So not having the exact same font leads to a wrong display.
  - since it's pretty hard to find a replacement font with the exact same metrics, this patch gives the possibility to read glyf table, rescale each glyph and then write a new table.
  - so once PR mozilla#12726 is merged we could rescale for example Helvetica to replace Myriad Pro.
@kmcclellan

Could we get a new build of pdfjs-dist to npm please?

@Snuffleupagus
Collaborator

Could we get a new build of pdfjs-dist to npm please?

Not immediately, since this just landed and we (obviously) need some time to handle any possible regressions and there's also various follow-up work that should happen first. (Also, the last release was less than two weeks ago.)

@Snipeye

Snipeye commented Jun 9, 2021

I don't github (verb) very often, so I don't know if this is correct/appropriate (or if I'm about to ping a million people, in which case I am deeply sorry) but I just read through this and wanted to add my "thank you 1e6" for the effort to get this in; at least in my case it dramatically increases the usability.

calixteman added a commit to calixteman/pdf.js that referenced this pull request Jun 9, 2021
  - a lot of xfa files are using Myriad pro or Arial fonts without embedding them and some containers have some dimensions based on those font metrics. So not having the exact same font leads to a wrong display.
  - since it's pretty hard to find a replacement font with the exact same metrics, this patch gives the possibility to read glyf table, rescale each glyph and then write a new table.
  - so once PR mozilla#12726 is merged we could rescale for example Helvetica to replace Myriad Pro.
@ceztko

ceztko commented Nov 19, 2021

Sorry for the late comment. I just wanted to say that the fonts taken from PDFium are possibly misnamed, or at least their format is not immediately clear. The pfb extension is mostly understood to mean Printer Font Binary, which is basically the encrypted part of a pfa font. A pfb file can't be used alone but must be coupled with a pfm file that just contains the metrics. The fonts taken from PDFium are neither canonical pfb nor pfa; they seem to be a raw OpenType-compatible CFF representation, and I can confirm they are opened by the CFF driver in FreeType. I don't know if there's a better extension for these. At least their actual format could be better documented.

bh213 pushed a commit to bh213/pdf.js that referenced this pull request Jun 3, 2022
  - a lot of xfa files are using Myriad pro or Arial fonts without embedding them and some containers have some dimensions based on those font metrics. So not having the exact same font leads to a wrong display.
  - since it's pretty hard to find a replacement font with the exact same metrics, this patch gives the possibility to read glyf table, rescale each glyph and then write a new table.
  - so once PR mozilla#12726 is merged we could rescale for example Helvetica to replace Myriad Pro.
Development

Successfully merging this pull request may close these issues.

Error when loading PDF that uses system fonts
9 participants