Support font subsetting to reduce size of pdf #103

Open
Yang-Xijie opened this issue May 27, 2022 · 14 comments
@Yang-Xijie

Yang-Xijie commented May 27, 2022

Describe the bug

I want to add Chinese and Japanese text to a PDF. I rendered Chinese and Japanese characters (は哈) successfully, but the resulting output.pdf is too large (14MB).

I read the example doc and found chapter 8.6.2, Composite fonts. I just want to render each character separately, that is, extract the font data for each individual character and then package only those characters in the PDF file. How can I achieve this using borb? Is there a configuration option for this in borb?

To Reproduce

Steps to reproduce the behaviour:

Download Microsoft Yahei.ttf at https://github.com/dolbydu/font/blob/master/unicode/Microsoft%20Yahei.ttf

import time
from pathlib import Path

from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF


def print_current_time():
    # Print a timestamp so the duration of each step can be read off the output.
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))


if __name__ == "__main__":

    print_current_time()

    # Step 1: load the complete TrueType font from disk.
    font_path = Path(__file__).parent / "font" / "Microsoft Yahei.ttf"
    custom_font = TrueTypeFont.true_type_font_from_file(font_path)

    print_current_time()

    # Step 2: build a one-page document with a single CJK paragraph.
    doc = Document()
    page = Page()
    doc.append_page(page)
    layout = SingleColumnLayout(page)
    layout.add(Paragraph("はははは哈哈", font=custom_font))

    print_current_time()

    # Step 3: write the PDF to pdf/<timestamp>.pdf (the pdf directory must already exist).
    timestamp = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime())
    pdf_name = timestamp + ".pdf"
    pdf_path = Path(__file__).parent / "pdf" / pdf_name
    with open(pdf_path, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

    print_current_time()

Output:

2022-05-27 21:19:11
2022-05-27 21:19:26
2022-05-27 21:19:27
2022-05-27 21:20:02

Directory listing:

[ 288]  .
├── [  97]  README.md
├── [ 128]  font
│   ├── [ 21M]  Microsoft Yahei.ttf
│   └── [ 74M]  PingFang.ttc
├── [1.3K]  main.py
└── [  96]  pdf
    └── [ 14M]  2022_05_27_20_49_11.pdf

Expected behaviour

The size of PDF file should be less than 1MB.

Desktop (please complete the following information):

  • OS: macOS 12.3
  • borb version 2.0.26
  • Python 3.9.5
@jorisschellekens
Owner

In order to reduce the size of the pdf, borb would need to perform font subsetting.

This is when a pdf contains a special "made up" font that contains only those characters that are actually used in the document.

So for instance, if you created a pdf containing only the text "Hello World" you would find a font inside the pdf that only contains the characters H, e, l, o, W, r and d.

Font subsetting is currently not supported in borb.
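
To make the "Hello World" example concrete, here is a minimal sketch of the first step of subsetting: working out which characters a document actually uses. It is not borb code, and the function name `characters_to_keep` is an assumption made up for illustration. A real subsetter would usually also keep a few extra glyphs, such as the space and the "missing glyph" placeholder.

```python
def characters_to_keep(text: str) -> list[str]:
    """Return the distinct non-whitespace characters of text, in first-use order."""
    seen: dict[str, None] = {}
    for ch in text:
        if not ch.isspace():
            # dict preserves insertion order, so the first use of each character wins
            seen.setdefault(ch, None)
    return list(seen)


print(characters_to_keep("Hello World"))   # ['H', 'e', 'l', 'o', 'W', 'r', 'd']
print(characters_to_keep("はははは哈哈"))    # ['は', '哈']
```

For the paragraph in the reproduction script, only two glyphs out of the whole font would be needed, which is why subsetting matters so much for CJK fonts.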

Kind regards,
Joris Schellekens

@Yang-Xijie
Author

Thanks for your reply!

Font subsetting is such an important feature for languages with large character sets. Hope that borb will support it soon.

Yang-Xijie changed the title from "How to use composite fonts" to "Support font subsetting" on May 27, 2022
Yang-Xijie changed the title from "Support font subsetting" to "Support font subsetting to reduce size of pdf" on May 27, 2022
@orklann

orklann commented Jul 9, 2022

@jorisschellekens Since you already use fonttools, subsetting TrueType fonts with it is simple. See this example:

https://github.com/orklann/caprice/blob/main/caprice/font/truetype/font.py#L89

For non-Latin TrueType fonts, subsetting is an important feature, since fonts in this category are always large in size.
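
For readers who do not want to follow the link, a minimal sketch of subsetting with the `fontTools.subset` module might look like the following. This is neither the linked caprice code nor what borb does internally; the helper names `unicodes_for` and `subset_font_file` are assumptions made up for illustration, and default subsetting options are used throughout.

```python
def unicodes_for(text: str) -> list[int]:
    """Distinct Unicode code points used by text, sorted for stable output."""
    return sorted({ord(ch) for ch in text})


def subset_font_file(src_path: str, dst_path: str, text: str) -> None:
    """Write a copy of the font at src_path keeping only the glyphs needed for text."""
    # Imported lazily so that unicodes_for() can be used without fontTools installed.
    from fontTools import subset

    options = subset.Options()                   # default subsetting options
    font = subset.load_font(src_path, options)   # open the full font
    subsetter = subset.Subsetter(options=options)
    subsetter.populate(unicodes=unicodes_for(text))  # choose what to keep
    subsetter.subset(font)                       # drop everything else
    subset.save_font(font, dst_path, options)    # write the reduced font
    font.close()
```

A call such as `subset_font_file("font/Microsoft Yahei.ttf", "font/subset.ttf", "はははは哈哈")` (the output path is hypothetical) would then produce a much smaller font file. Embedding that file correctly in a PDF is a separate step that this sketch does not cover.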

@jorisschellekens
Owner

I think I may have found a way to do this.

Both of these files were created with borb, one of them contains a subset Font, and the other does not.
It's going to need more tests, and running all the existing tests. But I think this may just work :-)

output_without_subsetting.pdf
output_with_subsetting.pdf

@jorisschellekens
Owner

✔️ According to the PDF validator I use (veraPDF), my output is a valid PDF.
✔️ The code has been documented.
✔️ A test has been added to verify both the subset and the non-subset document.

Next I want to try it with your particular font and code, and see whether the results still hold.
If that turns out to be the case, this feature will be included in the next release.

Kind regards,
Joris Schellekens

@jorisschellekens
Owner

Turns out I already had a test using Simhei.ttf.
Same results.

  • The font file is roughly 10 MB.
  • Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
  • With font subsetting, the PDF is 3.2 KB.

I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself.
output_001.pdf

That means this feature will be included in the next release 📣

Kind regards,
Joris Schellekens

@Yang-Xijie
Author

Yang-Xijie commented Jul 10, 2022

> I think I may have found a way to do this.
>
> Both of these files were created with borb, one of them contains a subset Font, and the other does not. It's going to need more tests, and running all the existing tests. But I think this may just work :-)
>
> output_without_subsetting.pdf output_with_subsetting.pdf

These two PDFs look different in Preview (the default PDF viewer) on macOS 12.4.

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

It might not be the expected behaviour.

@Yang-Xijie
Author

> Turns out I already had a test using Simhei.ttf. Same results.
>
>   • The font file is roughly 10 MB.
>   • Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
>   • With font subsetting, the PDF is 3.2 KB.
>
> I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf
>
> That means this feature will be included in the next release 📣
>
> Kind regards, Joris Schellekens

The attached PDF appears blank when opened in Preview (the default PDF viewer) on macOS 12.4.

image

However, you said that you added "你好世界" in this PDF. It might not be the expected behavior.

@jorisschellekens
Owner

That is definitely not the expected behaviour.

It's using a substitute font (so it's claiming that it can't find the font file inside the PDF)

Can you open it in Adobe?

@Yang-Xijie
Author

Chrome 103.0.5060.114 (Official Build) (x86_64) on macOS 12.4

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

image

@Yang-Xijie
Author

It seems that certain standards of PDF are not satisfied.

@Yang-Xijie
Author

Yang-Xijie commented Jul 10, 2022

Adobe Acrobat Reader DC Version 2022.001.20142 on macOS 12.4

Architecture: x86_64
Processor: Intel
Build: 22.1.20142.0
AGM: 4.30.117
CoolType: 6.2.1
JP2K: 2.0.6.50420

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

blank

@Yang-Xijie
Author

Yang-Xijie commented Jul 19, 2022

It is weird that I received your comment by email, but I cannot find that comment on GitHub.

image

macOS 12.4 Preview.app & Chrome.app & Safari.app

image

@jorisschellekens
Owner

After having discussed this issue with another PDF expert, it seems like the actual subsetting of the font (rather than the dictionaries in the PDF) is going awry.

Sadly, that makes this problem a bit trickier. Currently I use fonttools to do the subsetting. And I'd prefer to keep most of that functionality delegated to an external library.
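
One way to narrow down whether the subsetting step itself loses glyphs is to inspect the subset font before it is embedded. The sketch below assumes the subset font has been saved to its own file; the helper names `missing_characters` and `load_cmap` are made up for illustration and are not part of borb. It only checks the character map, so a passing result does not prove that the glyph outlines or the PDF font dictionaries are correct.

```python
def missing_characters(cmap: dict[int, str], text: str) -> list[str]:
    """Characters of text that have no entry in a code point -> glyph name map."""
    # dict.fromkeys() removes duplicates while keeping first-use order
    return [ch for ch in dict.fromkeys(text) if ord(ch) not in cmap]


def load_cmap(font_path: str) -> dict[int, str]:
    """Read the preferred Unicode character map of a font file using fontTools."""
    # Imported lazily so that missing_characters() works without fontTools installed.
    from fontTools.ttLib import TTFont

    with TTFont(font_path) as font:
        return font.getBestCmap() or {}
```

For example, `missing_characters(load_cmap("subset.ttf"), "你好世界")` (the file name is hypothetical) should return an empty list if all four characters survived subsetting.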
