Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fonts with 'space' character id = 0 display incorrectly #44

Closed
oldmanofthemountain opened this issue Oct 31, 2021 · 15 comments
Closed

Fonts with 'space' character id = 0 display incorrectly #44

oldmanofthemountain opened this issue Oct 31, 2021 · 15 comments

Comments

@oldmanofthemountain
Copy link

I have been modifying borb to have paragraphs that will split across pages. That works. But while experimenting, I tried some different fonts that I downloaded. When viewing the results, the lines were too long. After some digging, I discovered that the fonts assigned the 'space' character to character id 0. When the GlyphLine class encountered this as a two byte character (0x00??) and retrieved the following character but discarded the space (glyph_line.py:line 66).

I patched this by changing
if i + 1 < len(text_bytes):
to
if i + 1 < len(text_bytes) and text_bytes[i]:

I don't know if this is generally applicable (i.e. everything I know about fonts I learned while trying to figure this out),
but it works for me. It also seems reasonable since a two-byte value starting with 0x00 returns the same value as a one-byte value.

@jorisschellekens
Copy link
Owner

Do you have a font with this particular character mapping? Then I can attempt to reproduce the problem.

I don't think I fully understand why your proposed would work, or whether it is the best way to tackle the problem.

But I want to be able to reproduce it at least.

Kind regards,
Joris Schellekens

@oldmanofthemountain
Copy link
Author

This is the font I was working with:
apple_garamond.zip

I don't know if this is the best way to handle the issue, but this is what happens.

sample text: <29><00><30><7c><79><8d><53> (AT Without)
The current code interprets 'AT' then looks up the next two bytes (<00><30>), and since this equates to <30>, it returns 'W' causing the space (<00>) to be discarded. The modification only does the lookup if the first byte is not zero. The <00> can then be looked up as a single byte, and returns a 'space'.

Thanks

@edd34
Copy link

edd34 commented Nov 6, 2021

Hi @oldmanofthemountain, I am trying to reproducte the bug. Given the following folder structure :
image

What is missing from my code to reproduce the error ?
content of main script (the file named "helloworld.py")

from pathlib import Path

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont  
from borb.pdf.canvas.font.font import Font

# create an empty Document
pdf = Document()
# add an empty Page
page = Page()
pdf.append_page(page)

# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)


# construct the Font object
font_path: Path = Path(__file__).parent / "apple_garamond/AppleGaramond.ttf"
font: Font = TrueTypeFont.true_type_font_from_file(font_path)

# add a Paragraph object
layout.add(Paragraph("AT Without", font=font))

# store the PDF
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

@oldmanofthemountain
Copy link
Author

The issue only shows up if the text is longer than the column width. Short text looks fine. Try the following:

    text = """
    Without, the night was cold and wet, but in the small parlour of Laburnam
    Villa the blinds were drawn and the fire burned brightly.  Father and son
    """
    layout.add(Paragraph(text, font=font)
    layout.add(Paragraph(text, font='TimesRoman'))

The comparison with TimesRoman shows the overrun clearly. At a font size of 10 it is even more obvious.

@edd34
Copy link

edd34 commented Nov 7, 2021

The issue only shows up if the text is longer than the column width. Short text looks fine. Try the following:

    text = """
    Without, the night was cold and wet, but in the small parlour of Laburnam
    Villa the blinds were drawn and the fire burned brightly.  Father and son
    """
    layout.add(Paragraph(text, font=font)
    layout.add(Paragraph(text, font='TimesRoman'))

The comparison with TimesRoman shows the overrun clearly. At a font size of 10 it is even more obvious.

I can reproduce it, however this behaviour isn't it the way it should be ? I mean at least it save some char in the final document.
In the PDF spec at chapter 9.4 you may find the information, but I can't understand it well.
Maybe @jorisschellekens has a clue about it ?

@jorisschellekens
Copy link
Owner

Sorry for my lack of activity on this one.
It's been a busy week, loads of little fixes to be included to borb.

I'd like to already thank you for everything you've done. It's great to see a little community come together around borb.

You guys are awesome.

Kind regards,
Joris Schellekens

@jorisschellekens
Copy link
Owner

I found (and fixed) the bug.

details:

  • a Font can specify information about its glyphs using a cmap
  • sometimes different renditions of characters map to the same unicode, in order to avoid a mess, a font identifies characters by a character id
  • character id's are typically represented as hexadecimal numbers, of either length 2 or 4, in the cmap
  • the PDF spec says (when decoding this information) you ought to give precedence to longer byte-runs, rather than shorter

e.g.

when you encounter "030F" you should first lookup the entire hex number to see if any character-id matches, and if nothing matches, you should look up "03".

That's where the bug got in.

space being defined as "00" means it will match (once converted to ints) as the prefix of any other byte sequence.
So it will never be the longest byte-sequence.

This messes up a lot of things, including calculating the width of a piece of text.

The solution:

I noticed that the translation between text and bytes actually happens twice.
borb used the font to look up how to encode the characters to bytes (character ids), and then passed those to a GlyphLine object, which does the lookup.

In other words, even though borb knew the boundaries of each character, it still attempted to split the byte-array into chunks using the aforementioned logic.

I added 2 new static methods:

  • GlyphLine.from_str: which creates a GlyphLine from text
  • GlyphLine.from_bytes: which creates a GlyphLine from raw bytes

I changed all the callers.

Results:

I am now running tests.

I first created a small test that simply adds the text "AB" to a document and draws a box around the text, using the calculated width. As expected, before the fix, this box is 1 character short (missing space).

After the fix, the box is fitted nicely around the text.

@jorisschellekens
Copy link
Owner

Update:

All tests are back to green. There was a tiny amount of refactoring I'd forgotten. But it seems to be fixed now.
The performance of the text-extraction test has dropped though. I want to see how this fix has impacted that. Perhaps I need to make a few more modifications.

Kind regards,
Joris Schellekens

@jorisschellekens
Copy link
Owner

jorisschellekens commented Nov 14, 2021

Update:

Everything works as expected again.
This will be available in the next release.

Kind regards,
Joris Schellekens

@oldmanofthemountain
Copy link
Author

Glad you found a less 'ad hoc' solution. Thanks for all of your work.

I have found other font issues though. For instance, one font has cmap entries of the form u.015 that all end up resolving to 'u' causing the width of 'u' to be too narrow. I have another 'ad hoc' solution which is ugly. And yet another (type 0) has no space glyph defined, only non-breaking space. So any text with a space fails. I've found no workaround for that.

Fonts give me a headache.

Best to you.

@jorisschellekens
Copy link
Owner

If you have other issues, feel free to create a ticket for them.
The guidelines for what constitutes a good ticket can be found here.

Keep in mind that I will not build in fixes for broken fonts.
If the font does not have a definition for the 'space' character, then (from a PDF spec viewpoint) the user is not allowed to use that character (even implicitly).

Kind regards,
Joris Schellekens

@oldmanofthemountain
Copy link
Author

I have tried the new release, and it works well. Thanks.

I do have a couple of additional comments/questions though.

First, the way GlyphLine was called in the original did not actually pass a byte string. It passed an array of integers. Therefore, adjacent list items would not represent components of a multi-byte character id.
Example:
(font.py:main):Line 71--> Code:
text = ['W', 'i','t','h', 'Ά']
w = GlyphLine.from_bytes(text, font1, 12)
(glyph_line.py:from_bytes):Line 76--> Received ['W', 'i', 't', 'h', 'Ά'] type of text <class 'list'>
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=W text[i+1]=i
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=W
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=i text[i+1]=t
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=i
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=t text[i+1]=h
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=t
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=h text[i+1]=Ά
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=h
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=Ά
(font.py:main):Line 73--> Code:
encoded_bytes: bytes = [font1.unicode_to_character_identifier(i) for i in text]
w = GlyphLine.from_bytes(encoded_bytes, font1, 12)
(glyph_line.py:from_bytes):Line 76--> Received [58, 76, 87, 75, 342] type of text <class 'list'>
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=58 text[i+1]=76
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=58
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=76 text[i+1]=87
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=76
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=87 text[i+1]=75
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=87
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=75 text[i+1]=342
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=75
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=342

Calling GlyphLine.from_bytes would still act the same way if called the same way.

Second, the font that I was referring to does not have a glyph for 'space', but it does have a cmap entry for 'space' that links to uni00A0. Should the Type0Font incorporate the font's cmap when doing encoding? By the way, the I'm referring to is LiberationSerif-Regular released by Red Hat and part of the standard fonts on my Linux distribution.

I listened to, and enjoyed, your interview on 'Real Python'.

Thanks, and regards

@oldmanofthemountain
Copy link
Author

Sorry, the first part of the previous code was wrong. The format should have been 'text = b"WithΆ', which would have failed because b'char' only accepts ASCII. Anyway, I don't see how 'from_bytes' would see two adjacent list elements (bytes) as a unicode character unless 'text' is hand crafted to break up two-byte characters into separate components. What am I missing?

Regards

@jorisschellekens
Copy link
Owner

I didnt know b"something" is ASCII only.

Anyway, I'd prefer not to work on fonts for a while. They are such a pain to deal with.

I want to focus on new features in the new release.

Kind regards,
Joris Schellekens

@oldmanofthemountain
Copy link
Author

I concur. Time for a break from fonts. Back to why I first looked at borb. I want to use it to print books from Project Gutenberg text. I have added the ability to break paragraphs across pages and print 'folded signatures' (i.e. four pages per sheet of paper). Needs further work though.

Case closed for now on fonts.

Regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants