Fonts with 'space' character id = 0 display incorrectly #44

oldmanofthemountain · 2021-10-31T18:57:43Z

I have been modifying borb to have paragraphs that will split across pages. That works. But while experimenting, I tried some different fonts that I downloaded. When viewing the results, the lines were too long. After some digging, I discovered that the fonts assigned the 'space' character to character id 0. When the GlyphLine class encountered this as a two byte character (0x00??) and retrieved the following character but discarded the space (glyph_line.py:line 66).

I patched this by changing
if i + 1 < len(text_bytes):
to
if i + 1 < len(text_bytes) and text_bytes[i]:

I don't know if this is generally applicable (i.e. everything I know about fonts I learned while trying to figure this out),
but it works for me. It also seems reasonable since a two-byte value starting with 0x00 returns the same value as a one-byte value.

jorisschellekens · 2021-10-31T22:08:17Z

Do you have a font with this particular character mapping? Then I can attempt to reproduce the problem.

I don't think I fully understand why your proposed would work, or whether it is the best way to tackle the problem.

But I want to be able to reproduce it at least.

Kind regards,
Joris Schellekens

oldmanofthemountain · 2021-11-01T17:44:26Z

This is the font I was working with:
apple_garamond.zip

I don't know if this is the best way to handle the issue, but this is what happens.

sample text: <29><00><30><7c><79><8d><53> (AT Without)
The current code interprets 'AT' then looks up the next two bytes (<00><30>), and since this equates to <30>, it returns 'W' causing the space (<00>) to be discarded. The modification only does the lookup if the first byte is not zero. The <00> can then be looked up as a single byte, and returns a 'space'.

Thanks

edd34 · 2021-11-06T14:29:29Z

Hi @oldmanofthemountain, I am trying to reproducte the bug. Given the following folder structure :

What is missing from my code to reproduce the error ?
content of main script (the file named "helloworld.py")

from pathlib import Path

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont  
from borb.pdf.canvas.font.font import Font

# create an empty Document
pdf = Document()
# add an empty Page
page = Page()
pdf.append_page(page)

# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)


# construct the Font object
font_path: Path = Path(__file__).parent / "apple_garamond/AppleGaramond.ttf"
font: Font = TrueTypeFont.true_type_font_from_file(font_path)

# add a Paragraph object
layout.add(Paragraph("AT Without", font=font))

# store the PDF
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

oldmanofthemountain · 2021-11-06T21:43:57Z

The issue only shows up if the text is longer than the column width. Short text looks fine. Try the following:

    text = """
    Without, the night was cold and wet, but in the small parlour of Laburnam
    Villa the blinds were drawn and the fire burned brightly.  Father and son
    """
    layout.add(Paragraph(text, font=font)
    layout.add(Paragraph(text, font='TimesRoman'))

The comparison with TimesRoman shows the overrun clearly. At a font size of 10 it is even more obvious.

edd34 · 2021-11-07T06:11:43Z

The issue only shows up if the text is longer than the column width. Short text looks fine. Try the following:
    text = """
    Without, the night was cold and wet, but in the small parlour of Laburnam
    Villa the blinds were drawn and the fire burned brightly.  Father and son
    """
    layout.add(Paragraph(text, font=font)
    layout.add(Paragraph(text, font='TimesRoman'))
The comparison with TimesRoman shows the overrun clearly. At a font size of 10 it is even more obvious.

I can reproduce it, however this behaviour isn't it the way it should be ? I mean at least it save some char in the final document.
In the PDF spec at chapter 9.4 you may find the information, but I can't understand it well.
Maybe @jorisschellekens has a clue about it ?

jorisschellekens · 2021-11-07T13:43:30Z

Sorry for my lack of activity on this one.
It's been a busy week, loads of little fixes to be included to borb.

I'd like to already thank you for everything you've done. It's great to see a little community come together around borb.

You guys are awesome.

Kind regards,
Joris Schellekens

jorisschellekens · 2021-11-14T12:19:48Z

I found (and fixed) the bug.

details:

a Font can specify information about its glyphs using a cmap
sometimes different renditions of characters map to the same unicode, in order to avoid a mess, a font identifies characters by a character id
character id's are typically represented as hexadecimal numbers, of either length 2 or 4, in the cmap
the PDF spec says (when decoding this information) you ought to give precedence to longer byte-runs, rather than shorter

e.g.

when you encounter "030F" you should first lookup the entire hex number to see if any character-id matches, and if nothing matches, you should look up "03".

That's where the bug got in.

space being defined as "00" means it will match (once converted to ints) as the prefix of any other byte sequence.
So it will never be the longest byte-sequence.

This messes up a lot of things, including calculating the width of a piece of text.

The solution:

I noticed that the translation between text and bytes actually happens twice.
borb used the font to look up how to encode the characters to bytes (character ids), and then passed those to a GlyphLine object, which does the lookup.

In other words, even though borb knew the boundaries of each character, it still attempted to split the byte-array into chunks using the aforementioned logic.

I added 2 new static methods:

GlyphLine.from_str: which creates a GlyphLine from text
GlyphLine.from_bytes: which creates a GlyphLine from raw bytes

I changed all the callers.

Results:

I am now running tests.

I first created a small test that simply adds the text "AB" to a document and draws a box around the text, using the calculated width. As expected, before the fix, this box is 1 character short (missing space).

After the fix, the box is fitted nicely around the text.

jorisschellekens · 2021-11-14T16:23:43Z

Update:

All tests are back to green. There was a tiny amount of refactoring I'd forgotten. But it seems to be fixed now.
The performance of the text-extraction test has dropped though. I want to see how this fix has impacted that. Perhaps I need to make a few more modifications.

Kind regards,
Joris Schellekens

jorisschellekens · 2021-11-14T22:21:31Z

Update:

Everything works as expected again.
This will be available in the next release.

Kind regards,
Joris Schellekens

oldmanofthemountain · 2021-11-15T17:19:55Z

Glad you found a less 'ad hoc' solution. Thanks for all of your work.

I have found other font issues though. For instance, one font has cmap entries of the form u.015 that all end up resolving to 'u' causing the width of 'u' to be too narrow. I have another 'ad hoc' solution which is ugly. And yet another (type 0) has no space glyph defined, only non-breaking space. So any text with a space fails. I've found no workaround for that.

Fonts give me a headache.

Best to you.

jorisschellekens · 2021-11-15T17:45:44Z

If you have other issues, feel free to create a ticket for them.
The guidelines for what constitutes a good ticket can be found here.

Keep in mind that I will not build in fixes for broken fonts.
If the font does not have a definition for the 'space' character, then (from a PDF spec viewpoint) the user is not allowed to use that character (even implicitly).

Kind regards,
Joris Schellekens

oldmanofthemountain · 2021-11-19T17:06:48Z

I have tried the new release, and it works well. Thanks.

I do have a couple of additional comments/questions though.

First, the way GlyphLine was called in the original did not actually pass a byte string. It passed an array of integers. Therefore, adjacent list items would not represent components of a multi-byte character id.
Example:
(font.py:main):Line 71--> Code:
text = ['W', 'i','t','h', 'Ά']
w = GlyphLine.from_bytes(text, font1, 12)
(glyph_line.py:from_bytes):Line 76--> Received ['W', 'i', 't', 'h', 'Ά'] type of text <class 'list'>
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=W text[i+1]=i
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=W
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=i text[i+1]=t
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=i
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=t text[i+1]=h
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=t
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=h text[i+1]=Ά
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=h
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=Ά
(font.py:main):Line 73--> Code:
encoded_bytes: bytes = [font1.unicode_to_character_identifier(i) for i in text]
w = GlyphLine.from_bytes(encoded_bytes, font1, 12)
(glyph_line.py:from_bytes):Line 76--> Received [58, 76, 87, 75, 342] type of text <class 'list'>
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=58 text[i+1]=76
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=58
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=76 text[i+1]=87
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=76
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=87 text[i+1]=75
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=87
(glyph_line.py:from_bytes):Line 81--> (i+1 < len(text)) text[i]=75 text[i+1]=342
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=75
(glyph_line.py:from_bytes):Line 98--> (i < len(text)) text[i]=342

Calling GlyphLine.from_bytes would still act the same way if called the same way.

Second, the font that I was referring to does not have a glyph for 'space', but it does have a cmap entry for 'space' that links to uni00A0. Should the Type0Font incorporate the font's cmap when doing encoding? By the way, the I'm referring to is LiberationSerif-Regular released by Red Hat and part of the standard fonts on my Linux distribution.

I listened to, and enjoyed, your interview on 'Real Python'.

Thanks, and regards

oldmanofthemountain · 2021-11-19T17:52:52Z

Sorry, the first part of the previous code was wrong. The format should have been 'text = b"WithΆ', which would have failed because b'char' only accepts ASCII. Anyway, I don't see how 'from_bytes' would see two adjacent list elements (bytes) as a unicode character unless 'text' is hand crafted to break up two-byte characters into separate components. What am I missing?

Regards

jorisschellekens · 2021-11-19T17:58:40Z

I didnt know b"something" is ASCII only.

Anyway, I'd prefer not to work on fonts for a while. They are such a pain to deal with.

I want to focus on new features in the new release.

Kind regards,
Joris Schellekens

oldmanofthemountain · 2021-11-20T13:38:38Z

I concur. Time for a break from fonts. Back to why I first looked at borb. I want to use it to print books from Project Gutenberg text. I have added the ability to break paragraphs across pages and print 'folded signatures' (i.e. four pages per sheet of paper). Needs further work though.

Case closed for now on fonts.

Regards

jorisschellekens closed this as completed Nov 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fonts with 'space' character id = 0 display incorrectly #44

Fonts with 'space' character id = 0 display incorrectly #44

oldmanofthemountain commented Oct 31, 2021

jorisschellekens commented Oct 31, 2021

oldmanofthemountain commented Nov 1, 2021

edd34 commented Nov 6, 2021

oldmanofthemountain commented Nov 6, 2021

edd34 commented Nov 7, 2021 •

edited

jorisschellekens commented Nov 7, 2021

jorisschellekens commented Nov 14, 2021

jorisschellekens commented Nov 14, 2021

jorisschellekens commented Nov 14, 2021 •

edited

oldmanofthemountain commented Nov 15, 2021

jorisschellekens commented Nov 15, 2021

oldmanofthemountain commented Nov 19, 2021

oldmanofthemountain commented Nov 19, 2021

jorisschellekens commented Nov 19, 2021

oldmanofthemountain commented Nov 20, 2021

Fonts with 'space' character id = 0 display incorrectly #44

Fonts with 'space' character id = 0 display incorrectly #44

Comments

oldmanofthemountain commented Oct 31, 2021

jorisschellekens commented Oct 31, 2021

oldmanofthemountain commented Nov 1, 2021

edd34 commented Nov 6, 2021

oldmanofthemountain commented Nov 6, 2021

edd34 commented Nov 7, 2021 • edited

jorisschellekens commented Nov 7, 2021

jorisschellekens commented Nov 14, 2021

jorisschellekens commented Nov 14, 2021

jorisschellekens commented Nov 14, 2021 • edited

oldmanofthemountain commented Nov 15, 2021

jorisschellekens commented Nov 15, 2021

oldmanofthemountain commented Nov 19, 2021

oldmanofthemountain commented Nov 19, 2021

jorisschellekens commented Nov 19, 2021

oldmanofthemountain commented Nov 20, 2021

edd34 commented Nov 7, 2021 •

edited

jorisschellekens commented Nov 14, 2021 •

edited