Fonts with 'space' character id = 0 display incorrectly #44
Comments
Do you have a font with this particular character mapping? Then I can attempt to reproduce the problem. I don't think I fully understand why your proposed fix would work, or whether it is the best way to tackle the problem. But I want to be able to reproduce it at least. Kind regards, |
This is the font I was working with: I don't know if this is the best way to handle the issue, but this is what happens. sample text: <29><00><30><7c><79><8d><53> (AT Without) Thanks |
Hi @oldmanofthemountain, I am trying to reproduce the bug. Given the following folder structure : What is missing from my code to reproduce the error?
from pathlib import Path
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
from borb.pdf.canvas.font.font import Font
# create an empty Document
pdf = Document()
# add an empty Page
page = Page()
pdf.append_page(page)
# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)
# construct the Font object
font_path: Path = Path(__file__).parent / "apple_garamond/AppleGaramond.ttf"
font: Font = TrueTypeFont.true_type_font_from_file(font_path)
# add a Paragraph object
layout.add(Paragraph("AT Without", font=font))
# store the PDF
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)
The issue only shows up if the text is longer than the column width. Short text looks fine. Try the following:
The comparison with TimesRoman shows the overrun clearly. At a font size of 10 it is even more obvious. |
I can reproduce it. However, isn't this behaviour the way it should be? I mean, at least it saves some characters in the final document. |
Sorry for my lack of activity on this one. I'd like to already thank you for everything you've done. It's great to see a little community come together around borb. You guys are awesome. Kind regards, |
I found (and fixed) the bug. details:
E.g. when you encounter "030F" you should first look up the entire hex number to see if any character id matches, and if nothing matches, you should look up "03". That's where the bug got in. Space being defined as "00" means it will match (once converted to ints) as the prefix of any other byte sequence. This messes up a lot of things, including calculating the width of a piece of text. The solution: I noticed that the translation between text and bytes actually happens twice. In other words, even though borb knew the boundaries of each character, it still attempted to split the byte array into chunks using the aforementioned logic. I added 2 new static methods:
I changed all the callers. Results: I am now running tests. I first created a small test that simply adds the text "AB" to a document and draws a box around the text, using the calculated width. As expected, before the fix, this box is 1 character short (missing space). After the fix, the box is fitted nicely around the text. |
Update: All tests are back to green. There was a tiny amount of refactoring I'd forgotten. But it seems to be fixed now. Kind regards, |
Update: Everything works as expected again. Kind regards, |
Glad you found a less 'ad hoc' solution. Thanks for all of your work. I have found other font issues though. For instance, one font has cmap entries of the form u.015 that all end up resolving to 'u' causing the width of 'u' to be too narrow. I have another 'ad hoc' solution which is ugly. And yet another (type 0) has no space glyph defined, only non-breaking space. So any text with a space fails. I've found no workaround for that. Fonts give me a headache. Best to you. |
If you have other issues, feel free to create a ticket for them. Keep in mind that I will not build in fixes for broken fonts. Kind regards, |
I have tried the new release, and it works well. Thanks. I do have a couple of additional comments/questions though. First, the way GlyphLine was called in the original did not actually pass a byte string. It passed an array of integers. Therefore, adjacent list items would not represent components of a multi-byte character id. Calling GlyphLine.from_bytes would still act the same way if called the same way. Second, the font that I was referring to does not have a glyph for 'space', but it does have a cmap entry for 'space' that links to uni00A0. Should the Type0Font incorporate the font's cmap when doing encoding? By the way, the font I'm referring to is LiberationSerif-Regular, released by Red Hat and part of the standard fonts on my Linux distribution. I listened to, and enjoyed, your interview on 'Real Python'. Thanks, and regards |
Sorry, the first part of the previous code was wrong. The format should have been 'text = b"WithΆ', which would have failed because b'char' only accepts ASCII. Anyway, I don't see how 'from_bytes' would see two adjacent list elements (bytes) as a unicode character unless 'text' is hand crafted to break up two-byte characters into separate components. What am I missing? Regards |
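The point about byte literals can be checked with plain Python, independent of borb. The sketch below is only an illustration: the variable names are made up for this example, and UTF-16 big-endian is used purely to show what a two-byte encoding of the same text looks like, not because borb is known to encode text this way.

```python
# A bytes literal containing a non-ASCII character is rejected at compile time.
# "\u0386" is GREEK CAPITAL LETTER ALPHA WITH TONOS, the character from the comment above.
source = 'b"With\u0386"'
try:
    compile(source, "<example>", "eval")
    literal_accepted = True
except SyntaxError:
    literal_accepted = False

# Encoding the str explicitly does work. With a two-byte encoding, every
# character becomes a pair of bytes, and ASCII characters get a 0x00 high byte.
encoded = "With\u0386".encode("utf-16-be")
pairs = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]

print(literal_accepted)
print(pairs)
```

Iterating over a bytes object yields one integer per byte, so a consumer only sees two adjacent elements as a single character if it deliberately pairs them up.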
I didn't know b"something" is ASCII-only. Anyway, I'd prefer not to work on fonts for a while. They are such a pain to deal with. I want to focus on new features in the new release. Kind regards, |
I concur. Time for a break from fonts. Back to why I first looked at borb. I want to use it to print books from Project Gutenberg text. I have added the ability to break paragraphs across pages and print 'folded signatures' (i.e. four pages per sheet of paper). Needs further work though. Case closed for now on fonts. Regards |
I have been modifying borb to have paragraphs that will split across pages. That works. But while experimenting, I tried some different fonts that I downloaded. When viewing the results, the lines were too long. After some digging, I discovered that the fonts assigned the 'space' character to character id 0. The GlyphLine class treated this as the first byte of a two-byte character (0x00??), retrieved the following character, and discarded the space (glyph_line.py, line 66).
I patched this by changing
if i + 1 < len(text_bytes):
to
if i + 1 < len(text_bytes) and text_bytes[i]:
I don't know if this is generally applicable (i.e. everything I know about fonts I learned while trying to figure this out),
but it works for me. It also seems reasonable since a two-byte value starting with 0x00 returns the same value as a one-byte value.
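The effect can be reproduced in isolation. The following is a minimal sketch and not borb's actual code: the function `split_character_ids`, its `patched` flag, and the character ids in `KNOWN_IDS` are hypothetical, chosen only so that space has character id 0. It assumes a greedy lookup that tries a two-byte id first and falls back to a single byte.

```python
from typing import Dict, List

# Hypothetical character-id table (not taken from any real font).
# Space has character id 0, as in the fonts described in this issue.
KNOWN_IDS: Dict[int, str] = {0x00: " ", 0x29: "A", 0x30: "T"}


def split_character_ids(text_bytes: bytes, patched: bool) -> List[int]:
    """Split a byte sequence into character ids, trying two bytes before one."""
    ids: List[int] = []
    i = 0
    while i < len(text_bytes):
        try_two_bytes = i + 1 < len(text_bytes)
        if patched:
            # The proposed patch: a leading 0x00 never starts a two-byte id.
            try_two_bytes = try_two_bytes and bool(text_bytes[i])
        if try_two_bytes:
            candidate = text_bytes[i] * 256 + text_bytes[i + 1]
            if candidate in KNOWN_IDS:
                ids.append(candidate)
                i += 2
                continue
        # Fall back to a one-byte character id.
        ids.append(text_bytes[i])
        i += 1
    return ids


sample = bytes([0x29, 0x00, 0x30])  # "A", space, "T" in the table above
unpatched_ids = split_character_ids(sample, patched=False)
patched_ids = split_character_ids(sample, patched=True)

print([KNOWN_IDS[c] for c in unpatched_ids])
print([KNOWN_IDS[c] for c in patched_ids])
```

Without the extra check, the pair 0x00 0x30 is read as the two-byte value 0x0030, which equals the one-byte id 0x30, so the space is consumed without being emitted. With the check, the 0x00 byte is kept as its own character.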