Word wrapping for Thai text (zero-width space) may not work as intended // wrapmode==WORD should also use zero-width space? #1190

Closed
carlhiggs opened this issue Jun 3, 2024 · 2 comments · Fixed by #1191

Comments

@carlhiggs

When attempting to produce PDF reports in Thai, in which spaces are used as punctuation and not as word separators (word boundaries are not visible; they can be represented by zero-width spaces), line breaks do not appear to function as intended. Instead of occurring between Thai words, they often occur at a space belonging to an incidental English word or phrase.

So, while Thai text should wrap using the default wrapmode of WORD, it's not happening as intended, which results in incorrectly broken text:
[image: screenshot of the incorrectly broken text]

Error details
I think this occurs because the line_break.py code specifically looks for spaces (' '),

SPACE = " "

when defining space_break_hints

fpdf2/fpdf/line_break.py

Lines 452 to 460 in f0bd468

if character == SPACE:
    self.space_break_hint = SpaceHint(
        original_fragment_index,
        original_character_index,
        len(self.fragments),
        len(active_fragment.characters),
        self.width,
        self.number_of_spaces,
    )

and consequently, when lines are checked for appropriate terminating characters, the first check is against a space. This tests for ' ' only, so no wrapping happens at zero-width spaces:

if character == SPACE: # must come first, always drop a current space.

Perhaps a simple solution would be to test whether the character is in a list containing both the space and the zero-width space (with ZWS standing for U+200B):

if character in [SPACE, ZWS]:

Then I think either scenario should result in a wrap?
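To illustrate the idea outside of fpdf2, here is a minimal sketch of a greedy word wrapper that accepts a configurable set of break characters. It is not the fpdf2 implementation: the `wrap` function, its `break_chars` parameter, and the `ZWS` constant are hypothetical names for this example, and width is approximated by counting code points (with the zero-width space counting as 0) rather than by measuring glyphs.

```python
import re

SPACE = " "
ZWS = "\u200b"  # zero-width space (hypothetical constant name for this sketch)


def visible_len(text):
    """Approximate width: number of code points, ignoring zero-width spaces."""
    return len(text.replace(ZWS, ""))


def wrap(text, max_width, break_chars=SPACE + ZWS):
    """Greedily wrap text, breaking only at characters in break_chars.

    The break character at the end of a line is dropped, mirroring the
    "always drop a current space" behaviour quoted above.
    """
    cls = re.escape(break_chars)
    # Each token is a run of non-break characters plus an optional
    # trailing break character (or a lone break character).
    tokens = re.findall(f"[^{cls}]+[{cls}]?|[{cls}]", text)

    lines, current = [], ""
    for token in tokens:
        word = token.rstrip(break_chars)
        if current and visible_len(current) + visible_len(word) > max_width:
            lines.append(current.rstrip(break_chars))
            current = ""
        current += token
    if current:
        lines.append(current.rstrip(break_chars))
    return lines


sample = "ab\u200bcd\u200bef gh"
# Space only: the ZWS-separated run cannot be split and overflows the width.
space_only = wrap(sample, 4, break_chars=SPACE)
# Space or ZWS: the run can be broken at the zero-width space.
space_or_zws = wrap(sample, 4)
```

With space as the only break character, the first line comes out as the whole ZWS-separated run, which is the kind of overflow described in this issue; allowing both characters lets the wrapper break inside that run.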

Minimal code

from fpdf import FPDF
pdf = FPDF()
# using font from FPDF font pack https://github.com/reingart/pyfpdf/releases/download/binary/fpdf_unicode_font_pack.zip
font_path = 'configuration/fonts/fpdf_unicode_font_pack/Waree.ttf'
pdf.add_font(fname=font_path)
pdf.set_font('Waree', size=12)
pdf.add_page()
pdf.write(8, u"Thai (ideally wouldn't wrap after the space after 1000'): นโยบาย​สาธารณะ​มี​ความ​สำคัญ​ต่อ​การ​สนับสนุน​การ​ออก​แบบ​และ​การ​สร้าง​ชุมชน​และ​เมือง​สุขภาพ​ดี​และ​ยั่งยืน รายการ​ตรวจ​สอบนโยบาย​ความ​ท้าทาย 1,000 เมือง​สำหรับ​ใช้​เพื่อ​ประเมิน​การ​มี​อยู่​และ​คุณภาพ​ของ​นโยบาย​ที่​สอด​คล้อง​กับ​หลัก​ฐาน​และ​หลัก​การ​สำหรับ​เมือง​ที่​มี​สุขภาพ​ดี​และ​ยั่งยืน")
pdf.output("unicode.pdf")

Here is the example output generated from the above code that illustrates the issue:
[image: example output generated from the above code]

Just so it's clearer, the Thai text there does contain word wrap indicators in the form of zero-width spaces (U+200B). You can view the above text with hyphens instead to see that there would be other wrapping opportunities:

นโยบาย-สาธารณะ-มี-ความ-สำคัญ-ต่อ-การ-สนับสนุน-การ-ออก-แบบ-และ-การ-สร้าง-ชุมชน-และ-เมือง-สุขภาพ-ดี-และ-ยั่งยืน รายการ-ตรวจ-สอบนโยบาย-ความ-ท้าทาย 1,000 เมือง-สำหรับ-ใช้-เพื่อ-ประเมิน-การ-มี-อยู่-และ-คุณภาพ-ของ-นโยบาย-ที่-สอด-คล้อง-กับ-หลัก-ฐาน-และ-หลัก-การ-สำหรับ-เมือง-ที่-มี-สุขภาพ-ดี-และ-ยั่งยืน
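For anyone wanting to check their own strings, a small sketch along these lines makes the zero-width spaces visible and counts the available break opportunities. The helper names `reveal_zws` and `count_break_opportunities` are made up for this example, and the sample string is just the first few words of the text above with the zero-width spaces written as explicit escapes.

```python
ZWS = "\u200b"  # zero-width space


def reveal_zws(text, marker="-"):
    """Replace each zero-width space with a visible marker."""
    return text.replace(ZWS, marker)


def count_break_opportunities(text):
    """Count ordinary spaces and zero-width spaces separately."""
    return {"spaces": text.count(" "), "zero_width_spaces": text.count(ZWS)}


sample = "นโยบาย\u200bสาธารณะ\u200bมี\u200bความ\u200bสำคัญ 1,000 เมือง"
revealed = reveal_zws(sample)
counts = count_break_opportunities(sample)
print(revealed)
print(counts)
```

A string with many zero-width spaces but few ordinary spaces is one where a space-only wrapper has very few places to break.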

Ideally the fpdf2 output would be more like what is displayed here in the browser:

นโยบาย​สาธารณะ​มี​ความ​สำคัญ​ต่อ​การ​สนับสนุน​การ​ออก​แบบ​และ​การ​สร้าง​ชุมชน​และ​เมือง​สุขภาพ​ดี​และ​ยั่งยืน รายการ​ตรวจ​สอบนโยบาย​ความ​ท้าทาย 1,000 เมือง​สำหรับ​ใช้​เพื่อ​ประเมิน​การ​มี​อยู่​และ​คุณภาพ​ของ​นโยบาย​ที่​สอด​คล้อง​กับ​หลัก​ฐาน​และ​หลัก​การ​สำหรับ​เมือง​ที่​มี​สุขภาพ​ดี​และ​ยั่งยืน

Caveat
I cannot read Thai myself, but am working with others who do and who advised me of this problem. I understand that others have used fpdf2 with Thai (as indicated in issues I've searched, and the documentation). Perhaps there are other ways to get correct word wrapping working for Thai? I couldn't figure it out or find a solution in the documentation or issues, so thought I'd check in first. If others think it's worth pursuing, I could attempt to make a code edit for this.

@carlhiggs carlhiggs added the bug label Jun 3, 2024
@carlhiggs carlhiggs changed the title Word wrapping for Thai text (zero-width space) may not work as intended // wrapmode==WORD should also use zero-width space Word wrapping for Thai text (zero-width space) may not work as intended // wrapmode==WORD should also use zero-width space? Jun 3, 2024
@andersonhc
Collaborator

Ideally we should implement the Unicode line breaking algorithm in fpdf2 to produce results similar to document editors and browsers.

@carlhiggs
Author

> Ideally we should implement the Unicode line breaking algorithm in fpdf2 to produce results similar to document editors and browsers.

Sounds great @andersonhc; that's probably beyond what I'd have capacity to assist with right now, but after writing the above I made a quick sketch and tested whether my suggestion would work for now; I believe it does. I went ahead and made a pull request. The Unicode algorithm sounds like an ideal solution, but if this pull request could work for now, at least for my purposes, I think it would address the issue.

Hope the pull request is useful, at least as a short/medium-term solution.
