Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take a long time after choose an unicode font #907

Open
diego-insaurralde opened this issue Aug 28, 2023 · 23 comments
Open

Take a long time after choose an unicode font #907

diego-insaurralde opened this issue Aug 28, 2023 · 23 comments

Comments

@diego-insaurralde
Copy link

Hi Everyone,
First of all, I'm sorry if someone asked help in the same subject and for my terrible english xD.

I have to convert two txt files to one pdf. Each one has a different codification, one ISO-8859-1, other UTF-8.
When I use only one file with ISO-8859-1, I generated a pdf fastly and ok. When I try from two, I have a trouble, because I need to use Unicode font, so, I'm using DejaVu font, and now, when I try to generate a new pdf, they take a long time, and not finish.
I try to use font from especific directory, no one errror is displayed, i need to force pause, because it never ends.
In the begin, I was using FPDF version 1, and a warning is displayed: "UserWarning: cmap value too big/small", and it take a long time too. I checked this version is deprecated, so now i'm here with version 2, with the same problem, (but the userwarning is not displayed)

So that's it, I'm waiting for an answer.

Thanks a lot for all contributtor

@Lucas-C
Copy link
Member

Lucas-C commented Aug 28, 2023

Hi @diego-insaurralde and welcome! 😊

I'd be happy to try to help you,
but could you please provide some minimal Python code as a starting point,
so that I can reproduce what you are trying to do?

If you don't know how to craft some minimal code reproducing your issue,
you will find guidelines there: https://stackoverflow.com/help/minimal-reproducible-example

@diego-insaurralde
Copy link
Author

diego-insaurralde commented Aug 29, 2023

Hello @Lucas-C , I appreciate you to give me attention.

Sorry for let you waiting, but let me explain now.

I rewrote all the code just to show you, and I will use a example txt where it only contains letter "Ó".
Please, desconsider all redundancies in the code, I tried to reproduce the same behavior that I need.

IIn PdfGeneratorClass line 24, I add an unicode font, I chose DejaVu, and I put the all fonts in the fonts folder , in the same directory (just to test).
In the first test, I will read a txt file with UTF-8 codification, but I will encode using ISO-8859-1, In line 27 and 55, I set a builtin font "Times".
In the second file, I have some functions, to read, prepare and generate pdf. In the line 66, I'm reading the file with ISO-8859-1 encoding (to the first test).
In this test, I have generated a PDF file in approximately 5 seconds, but I generated a invalid file, because i can't convert all characteres properly.

To te second test, I set an unicode font, DejaVu, and line 66, I change to encode using UTF-8.
It take a more time, more than 2 minutes, and I generated the expected file. But I need to handle large files.

How Can I handle it.
I hopefully have been clearly. If any question, let me know.
I will attach a txt file, it contents, doesn't matter.

First it's my class PDF Generator, that use FPDF:

from fpdf import FPDF
from typing import List


class PdfGenerator(FPDF):
    def __init__(
        self,
        header_btb: List[str],
        registers: List[str],
        totals: dict,
        initial_page: int,
        initial_seq_number: int,
    ):
        self.header_btb = header_btb
        self.registers = registers
        self.totals = totals
        self.initial_page = initial_page
        self.initial_seq_number = initial_seq_number

        self.line_separator = 1
        self.line_height = 3

        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')

    def header(self) -> None:
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        self.ln(2)
        self.header_btb[2] = (
            self.header_btb[2][:163] + f"{self.initial_page + self.page_no() - 1:>7d}"
        )

        for line in self.header_btb:
            if "------" in line:
                break
            self.cell(0, self.line_height, line, 0, self.line_separator)

        self.set_text_color(128, 128, 128)
        self.cell(
            0,
            self.line_height,
            f"{' ' * 5}{'_' * 141}{' ' * 5}",
            0,
            self.line_separator,
        )
        self.ln(1)
        self.set_text_color(19, 16, 196)

    def generate_report(self):
        sep_total = 128 * " " + "-" * 42
        self.set_margins(left=0.5, right=0.5, top=0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        for index, header in enumerate(self.registers):
            for item in header["register_contents"]:
                self.cell(0, self.line_height, item, 0, self.line_separator)
            self.set_text_color(128, 128, 128)
            self.cell(0, self.line_height, sep_total, 0, self.line_separator)
            # self.line(142, self.get_y(), 170, self.get_y())
            self.set_text_color(19, 16, 196)
            self.cell(0, self.line_height, header["total_line"], 0, self.line_separator)
            self.set_text_color(19, 16, 196)

        self.cell(0, self.line_height, self.totals["total"], 0, self.line_separator)

Second File to handle txt file and generate pdf.:

from pdf_gen import PdfGenerator
from werkzeug.datastructures import FileStorage
import io 

def separate_header_content(file: io.TextIOWrapper):
    j = 0
    header = []
    contents = []
    isHeaderBuilding = False
    isNewContent = True 
    numericNew = 0 
    registers = []

    for line in file:
        if "Cabe" in line:
            j = 1
            if not header:
                isHeaderBuilding = True 
                header.append(line)
            else:
                isHeaderBuilding = False
        
        elif j < 8:
            if isHeaderBuilding:
                header.append(line)
            j+=1 

        elif "---" in line:
            continue

        elif "Total" in line:
            total_line = line
       
        else:
            numeric = line[1:8].strip()
            if numeric.isnumeric():
                if isNewContent:
                    isNewContent = False
                    numericNew = numeric
                    contents.append(line)

                else:
                    if numeric == numericNew:
                        contents.append(line)
                    else:
                        old_contents = contents.copy()
                        registers.append({
                            "register_contents": old_contents,
                            "total_line": total_line 
                            })

                        contents.clear()
                        isNewContent = True

                        

    return header,registers 




def test_pdf(file_utf8, file_latin1=""):
    input("PRESS ANY BUTTON TO START")
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="ISO-8859-1") #encoding="utf-8-sig")
        header, registers = separate_header_content(f1_buff)  
        
        for line in header:
            print(line)
        if file_latin1:
            f2 = open(file_latin1, "rb")
            f2 = FileStorage(f2)
            f2_buff = io.TextIOWrapper(f2, encoding="ISO-8859-1")
            header2, registers2 = separate_header_content(f2_buff)  
            header += header2
            registers += registers2

        pdf = PdfGenerator(header, registers,{"total": "0,0", "accumulated": "0,0"} , 1, 1)
        pdf.generate_report()
        pdf.output(name="teste.pdf")


        if file_latin1:
            f2.close()


if __name__ == "__main__":
    file_utf8 = "utf8.txt"            
    file_latin1 = ""    
    test_pdf(file_utf8)

utf8.txt

@gmischler
Copy link
Collaborator

@diego-insaurralde, have you tried profiling your code?

import cProfile
cProfile.run('test_pdf(file_utf8)')

This would give you (and us) a first idea of where your program is spending all that time.

@andersonhc
Copy link
Collaborator

@diego-insaurralde could you please execute the command pip freeze and send us the output?

@diego-insaurralde
Copy link
Author

diego-insaurralde commented Aug 29, 2023

@diego-insaurralde, have you tried profiling your code?

import cProfile
cProfile.run('test_pdf(file_utf8)')

This would give you (and us) a first idea of where your program is spending all that time.

I did it now, and I get this result. The first one I organize the data in pandas and sort by cumtime and get head 50.
The second, I sorted by tottime and get head 50

     243122522 function calls (243101129 primitive calls) in 177.024 seconds
ncalls          | tottime   | percall   | cumtime   | percall2  | filename:lineno(function)
---------------------------------------------------------------------------------------------------------------------
1               | 0.000     | 0.000     | 177.063   | 177.063   | <string>:1(<module>)
1               | 0.000     | 0.000     | 177.063   | 177.063   | main.py:62(test_pdf)
12/1            | 0.000     | 0.000     | 177.063   | 177.063   | {built-inmethodbuiltins.exec}
1               | 0.111     | 0.111     | 175.562   | 175.562   | pdf_gen.py:38(generate_report)
44638/39369     | 0.207     | 0.000     | 175.27    | 0.004     | fpdf.py:213(wrapper)
43678/39367     | 0.951     | 0.000     | 175.08    | 0.004     | fpdf.py:2660(cell)
43678/39367     | 2.204     | 0.000     | 170.485   | 0.004     | fpdf.py:2780(_render_styled_text_line)
43678           | 0.096     | 0.000     | 157.456   | 0.004     | line_break.py:204(render_pdf_text)
43678           | 12.979    | 0.000     | 157.282   | 0.004     | line_break.py:213(render_pdf_text_ttf)
7088728         | 6.944     | 0.000     | 139.757   | 0.000     | fonts.py:395(pick)
7088746         | 20.137    | 0.000     | 93.908    | 0.000     | fonts.py:414(get_glyph)
7054707         | 8.163     | 0.000     | 63.311    | 0.000     | ttFont.py:806(getBestCmap)
7060696/7060686 | 8.310     | 0.000     | 40.54     | 0.000     | ttFont.py:444(__getitem__)
7088728         | 10.464    | 0.000     | 38.905    | 0.000     | fonts.py:401(pick_glyph)
14150556        | 13.178    | 0.000     | 36.431    | 0.000     | {method'get'of'dict'objects}
7054707         | 5.984     | 0.000     | 14.736    | 0.000     | _c_m_a_p.py:78(getBestCmap)
7060941         | 5.555     | 0.000     | 12.305    | 0.000     | textTools.py:19(__new__)
14097404        | 11.587    | 0.000     | 11.587    | 0.000     | <string>:2(__eq__)
14097894        | 7.768     | 0.000     | 10.581    | 0.000     | <string>:2(__hash__)
43678/39367     | 0.051     | 0.000     | 9.5       | 0.000     | fpdf.py:3277(_perform_page_break_if_need_be)
7063336         | 5.825     | 0.000     | 9.476     | 0.000     | textTools.py:25(__eq__)
480             | 0.016     | 0.000     | 9.421     | 0.020     | fpdf.py:817(add_page)
479             | 0.002     | 0.000     | 9.4       | 0.020     | fpdf.py:3290(_perform_page_break)
480             | 0.016     | 0.000     | 9.362     | 0.020     | pdf_gen.py:21(header)
7054707         | 8.752     | 0.000     | 8.752     | 0.000     | _c_m_a_p.py:59(getcmap)
7048789         | 8.678     | 0.000     | 8.678     | 0.000     | <string>:2(__init__)
87356           | 0.649     | 0.000     | 8.35      | 0.000     | line_break.py:159(get_width)
14124277        | 4.803     | 0.000     | 7.655     | 0.000     | textTools.py:13(transcode)
87356           | 0.284     | 0.000     | 7.337     | 0.000     | fonts.py:197(get_text_width)
87394           | 1.653     | 0.000     | 7.026     | 0.000     | {built-inmethodbuiltins.sum}
14264812        | 4.322     | 0.000     | 5.373     | 0.000     | fonts.py:200(<genexpr>)
23100135        | 4.085     | 0.000     | 4.12      | 0.000     | {built-inmethodbuiltins.isinstance}
7060998         | 3.449     | 0.000     | 3.449     | 0.000     | textTools.py:28(__hash__)
14097894        | 2.813     | 0.000     | 2.813     | 0.000     | {built-inmethodbuiltins.hash}
7114366         | 2.781     | 0.000     | 2.781     | 0.000     | {built-inmethod__new__oftypeobjectat0x7fb22b487a00}
43678           | 0.623     | 0.000     | 2.707     | 0.000     | deprecation.py:44(get_stack_level)
21266217        | 2.013     | 0.000     | 2.013     | 0.000     | {built-inmethodbuiltins.ord}
7394474         | 1.912     | 0.000     | 1.912     | 0.000     | line_break.py:49(font)
174712          | 0.654     | 0.000     | 1.35      | 0.000     | inspect.py:896(getfile)
7048948         | 1.067     | 0.000     | 1.067     | 0.000     | {built-inmethodbuiltins.chr}
7088770         | 0.937     | 0.000     | 0.937     | 0.000     | {method'keys'of'dict'objects}
43678           | 0.140     | 0.000     | 0.773     | 0.000     | drawing.py:211(serialize)
141917/138991   | 0.217     | 0.000     | 0.73      | 0.000     | {method'join'of'str'objects}
1               | 0.560     | 0.560     | 0.56      | 0.560     | {built-inmethodbuiltins.input}
87356           | 0.333     | 0.000     | 0.542     | 0.000     | <frozenposixpath>:150(dirname)
43678           | 0.048     | 0.000     | 0.521     | 0.000     | fpdf.py:3090(_preload_font_styles)
1               | 0.000     | 0.000     | 0.513     | 0.513     | fpdf.py:4746(output)
174712          | 0.107     | 0.000     | 0.504     | 0.000     | drawing.py:212(<genexpr>)
1               | 0.008     | 0.008     | 0.474     | 0.474     | output.py:344(bufferize)
43678           | 0.182     | 0.000     | 0.473     | 0.000     | fpdf.py:3122(_parse_chars)
ncalls          | tottime   | percall   | cumtime   | percall2  | filename:lineno(function)
---------------------------------------------------------------------------------------------------------------------
7088746         | 20.137    | 0.000     | 93.908    | 0.000     | fonts.py:414(get_glyph)
14150556        | 13.178    | 0.000     | 36.431    | 0.000     | {method'get'of'dict'objects}
43678           | 12.979    | 0.000     | 157.282   | 0.004     | line_break.py:213(render_pdf_text_ttf)
14097404        | 11.587    | 0.000     | 11.587    | 0.000     | <string>:2(__eq__)
7088728         | 10.464    | 0.000     | 38.905    | 0.000     | fonts.py:401(pick_glyph)
7054707         | 8.752     | 0.000     | 8.752     | 0.000     | _c_m_a_p.py:59(getcmap)
7048789         | 8.678     | 0.000     | 8.678     | 0.000     | <string>:2(__init__)
7060696/7060686 | 8.31      | 0.000     | 40.54     | 0.000     | ttFont.py:444(__getitem__)
7054707         | 8.163     | 0.000     | 63.311    | 0.000     | ttFont.py:806(getBestCmap)
14097894        | 7.768     | 0.000     | 10.581    | 0.000     | <string>:2(__hash__)
7088728         | 6.944     | 0.000     | 139.757   | 0.000     | fonts.py:395(pick)
7054707         | 5.984     | 0.000     | 14.736    | 0.000     | _c_m_a_p.py:78(getBestCmap)
7063336         | 5.825     | 0.000     | 9.476     | 0.000     | textTools.py:25(__eq__)
7060941         | 5.555     | 0.000     | 12.305    | 0.000     | textTools.py:19(__new__)
14124277        | 4.803     | 0.000     | 7.655     | 0.000     | textTools.py:13(transcode)
14264812        | 4.322     | 0.000     | 5.373     | 0.000     | fonts.py:200(<genexpr>)
23100135        | 4.085     | 0.000     | 4.12      | 0.000     | {built-inmethodbuiltins.isinstance}
7060998         | 3.449     | 0.000     | 3.449     | 0.000     | textTools.py:28(__hash__)
14097894        | 2.813     | 0.000     | 2.813     | 0.000     | {built-inmethodbuiltins.hash}
7114366         | 2.781     | 0.000     | 2.781     | 0.000     | {built-inmethod__new__oftypeobjectat0x7fb22b487a00}
43678/39367     | 2.204     | 0.000     | 170.485   | 0.004     | fpdf.py:2780(_render_styled_text_line)
21266217        | 2.013     | 0.000     | 2.013     | 0.000     | {built-inmethodbuiltins.ord}
7394474         | 1.912     | 0.000     | 1.912     | 0.000     | line_break.py:49(font)
87394           | 1.653     | 0.000     | 7.026     | 0.000     | {built-inmethodbuiltins.sum}
7048948         | 1.067     | 0.000     | 1.067     | 0.000     | {built-inmethodbuiltins.chr}
43678/39367     | 0.951     | 0.000     | 175.08    | 0.004     | fpdf.py:2660(cell)
7088770         | 0.937     | 0.000     | 0.937     | 0.000     | {method'keys'of'dict'objects}
174712          | 0.654     | 0.000     | 1.35      | 0.000     | inspect.py:896(getfile)
87356           | 0.649     | 0.000     | 8.35      | 0.000     | line_break.py:159(get_width)
43678           | 0.623     | 0.000     | 2.707     | 0.000     | deprecation.py:44(get_stack_level)
1               | 0.56      | 0.560     | 0.56      | 0.560     | {built-inmethodbuiltins.input}
131034          | 0.353     | 0.000     | 0.397     | 0.000     | drawing.py:90(number_to_str)
90784           | 0.338     | 0.000     | 0.341     | 0.000     | {method'encode'of'str'objects}
87356           | 0.333     | 0.000     | 0.542     | 0.000     | <frozenposixpath>:150(dirname)
45118           | 0.293     | 0.000     | 0.347     | 0.000     | fpdf.py:4316(_out)
87356           | 0.284     | 0.000     | 7.337     | 0.000     | fonts.py:197(get_text_width)
43678           | 0.283     | 0.000     | 0.283     | 0.000     | {built-inmethod_warnings.warn}
87356           | 0.241     | 0.000     | 0.241     | 0.000     | line_break.py:69(font_size_pt)
141917/138991   | 0.217     | 0.000     | 0.73      | 0.000     | {method'join'of'str'objects}
44638/39369     | 0.207     | 0.000     | 175.27    | 0.004     | fpdf.py:213(wrapper)
482             | 0.194     | 0.000     | 0.194     | 0.000     | {built-inmethodzlib.compress}
43678           | 0.182     | 0.000     | 0.473     | 0.000     | fpdf.py:3122(_parse_chars)
1               | 0.173     | 0.173     | 0.219     | 0.219     | main.py:5(separate_header_content)
43678           | 0.14      | 0.000     | 0.773     | 0.000     | drawing.py:211(serialize)
43678           | 0.138     | 0.000     | 0.145     | 0.000     | line_break.py:27(__init__)
262548          | 0.123     | 0.000     | 0.123     | 0.000     | graphics_state.py:141(current_font)
43678           | 0.112     | 0.000     | 0.215     | 0.000     | util.py:18(escape_parens)
1               | 0.111     | 0.111     | 175.562   | 175.562   | pdf_gen.py:38(generate_report)
87356           | 0.11      | 0.000     | 0.172     | 0.000     | fpdf.py:414(is_ttf_font)
174712          | 0.107     | 0.000     | 0.504     | 0.000     | drawing.py:212(<genexpr>)

@diego-insaurralde
Copy link
Author

@diego-insaurralde could you please execute the command pip freeze and send us the output?

Sure! But the main libraries is FLASK and FPDF2.

blinker==1.6.2
click==8.1.7
defusedxml==0.7.1
Flask==2.3.3
fonttools==4.42.1
fpdf2==2.7.5
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.3
numpy==1.25.2
pandas==2.0.3
Pillow==10.0.0
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tabulate==0.9.0
tzdata==2023.3
Werkzeug==2.3.7

@Lucas-C
Copy link
Member

Lucas-C commented Aug 30, 2023

Hi @diego-insaurralde

Thank you for taking the time to provide some code reproducing your problem.

First, I want to mention that the script you provided is not MINIMAL:
a minimal reproducing program would be less than 20 lines long, yours is more than 160 lines long.
Taking the time to narrow down the problem to the shortest code possible is very important in order to understand it 😊

Second, when strictly following your instructions, I got an IndexError raised.
You wrote:

I rewrote all the code just to show you, and I will use a example txt where it only contains letter "Ó".

But when I executed those commands in my Linux shell:

echo 'Ó' > utf8.txt
time python issue_907.py

It seems that your PdfGenerator.header() method does not handle well single-letter files.

Nevertheless I tested your script by feeding it itself (__file__ instead of "utf8.txt"),
and I'm sorry but I do not reproduce the situation you described with an execution time of 2 minutes 😞
On my computer, the script execution takes less than 2 seconds.

There is the single-file Python script I used, based on the code snippets you shared:
issue_907.py.txt

@diego-insaurralde
Copy link
Author

diego-insaurralde commented Aug 31, 2023

Hi @diego-insaurralde

Thank you for taking the time to provide some code reproducing your problem.

First, I want to mention that the script you provided is not MINIMAL: a minimal reproducing program would be less than 20 lines long, yours is more than 160 lines long. Taking the time to narrow down the problem to the shortest code possible is very important in order to understand it 😊

Second, when strictly following your instructions, I got an IndexError raised. You wrote:

I rewrote all the code just to show you, and I will use a example txt where it only contains letter "Ó".

But when I executed those commands in my Linux shell:

echo 'Ó' > utf8.txt
time python issue_907.py

It seems that your PdfGenerator.header() method does not handle well single-letter files.

Nevertheless I tested your script by feeding it itself (__file__ instead of "utf8.txt"), and I'm sorry but I do not reproduce the situation you described with an execution time of 2 minutes 😞 On my computer, the script execution takes less than 2 seconds.

There is the single-file Python script I used, based on the code snippets you shared: issue_907.py.txt

Hello Lucas, thanks for your answer.
Sorry for my long code, I reproduced other with less lines xD.
You don't have a problem with the execution because you don't change the font to unicode font (DejaVu) and you read the file utf8.txt with encoding ISO-8859-1. The code below is right and ready to execution.
ps: I tried to reduce less 20 lines xD.

from fpdf import FPDF
from typing import List 
from werkzeug.datastructures import FileStorage
import io 

class PdfGenerator(FPDF):
    def __init__(self, registers: List[str]):
        self.registers = registers
        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')
 

    def generate_report(self):
        self.set_margins(left = 0.5, right=0.5, top = 0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font('DejaVuSans', '', 5)
        self.set_text_color(19,16,196)
        for  register in self.registers:
            self.cell(0, 3, register, 0, 1)


def test_pdf(file_utf8):
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="utf-8")
        lines = f1_buff.readlines()

        pdf = PdfGenerator(lines )
        pdf.generate_report()
        pdf.output(name="teste.pdf")


if __name__ == "__main__":
    file_utf8 = "utf8.txt"  
    test_pdf(file_utf8)

@gmischler
Copy link
Collaborator

gmischler commented Aug 31, 2023

A patch by @andersonhc has just been merged which should reduce the time required for this specific example by roughly 30%.
It is possible that we may eventually find more potential optimizations, but you really shouldn't expect any wonders.

When using the built-in fonts, then the text pretty much gets written directly to the PDF file, with hardly any processing overhead.
When using TTF fonts, there's quite some work to be done behind the scenes, from actually reading the font file, embedding the font data into the PDF file, and converting Unicode code points to index positions relative to the embedded data. The amount of data that needs to be output will also increase by a factor of up to 6 or 7. There's nothing we can do about this part, that's just how the PDF format works.

Since we're talking about an 11+ MB file with ~100k lines of text and ~7 million characters here, any pure Python solution will eventually reach its limits. Maybe you can get it to work somehow, but it may also be that you'll have to find a package written in a compiled language that does things a lot faster.

As to your original description of the production job running "forever", have you tried to instrument your code so it will give you some progress feedback? If you eg. write some console output for each new page, then this would give you a better idea about how close you are to a working solution, or how far away from it.

@diego-insaurralde
Copy link
Author

diego-insaurralde commented Sep 1, 2023

[gmischler: removed pointless fullquote]

I understand it. Is it not possible to add a Unicode font as a native option, similar to fonts in the ISO-8859-1 encoding? From my perspective, as someone external to the situation, it seems unusual that this library doesn't handle UTF-8 encodings very well, once this encoding is widely used.

@gmischler
Copy link
Collaborator

gmischler commented Sep 2, 2023

Is it not possible to add a Unicode font as a native option, similar to fonts in the ISO-8859-1 encoding?

Short answer: No.
I've outlined some of the practical reasons above. A latin-1 font can contain at most 256 characters via 8-bit addressing. A Unicode font can potentially contain up to 4294967296 characters via 32 bit addressing.
Dealing with the latter format simply takes a lot more processing. There's no way around that.

From my perspective, as someone external to the situation, it seems unusual that this library doesn't handle UTF-8 encodings very well, once this encoding is widely used.

That statement makes no sense. Unicode file encodings like UTF-8 are completely irrelevant in this context and fpdf2 never even sees them in normal operation. At the same time, no other Python based PDF library can handle Unicode text with TTF fonts as completely and correctly as fpdf2, and none of them is significantly faster.

Maybe you should try to get at least some basic understanding of the relevant topics (Unicode, TTF fonts, and the PDF specification) before criticizing the work of other people.

In summary, I repeat: it is likely impossible for any Python library to deliver the performance you wish for.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 4, 2023

I agree overall with @gmischler answer.
However, I think that @diego-insaurralde report is very valuable to know the current performance limitations of fpdf2, based on real-world usages by users.
For that, thank you @diego-insaurralde for reaching out on this subject and opening this issue 😊👍

Thank you also for providing a 2nd, shorter code snippet.
With this code snippet I was able to reproduce on my computer the performance issue you described.

I took the time to investigate a little to try to figure out where is the bottlneck.
This is the performance report I produced:

$ time ./issue_907.py
Font MPDFAA+DejaVuSansBook is missing the following glyphs:

Stage                               | Total exec time (ms) | #execs  | Average single exec duration (ms)
------------------------------------|----------------------|---------|----------------------------------
         _render_styled_text_line() |                77839 |  103555 | 0.75
         frag.render_pdf_text_ttf() |                71336 |  103555 | 0.69
   self.font.subset.pick(ord(char)) |                51185 | 7795892 | 0.01

real    1m21,678s

The last line in the table above is a specific code line that is evaluated almost 8 million times:
https://github.com/py-pdf/fpdf2/blob/2.7.5/fpdf/line_break.py#L217

With an average of 0.01ms per execution, this code is already very fast.
But I wonder if things couldn't be improved a little bit...
For example I see repeated calls to ttfont.getBestCmap() in the code.
Maybe that could be optimized / cached?

And finally, I think it may be interesting to translate this usage scenario into a non-regression performance test,
similar to the ones we already have that uses @ensure_exec_time_below

@Lucas-C
Copy link
Member

Lucas-C commented Sep 4, 2023

A patch by @andersonhc has just been merged which should reduce the time required for this specific example by roughly 30%.

Oh, I missed that the repeated calls to ttfont.getBestCmap() have already been removed by @andersonhc in PR #908 / commit b74a590 😅
Well done 👍

Anyway I opened #911 and I'd be happy to have your reviews on it,
and maybe more suggestions of performance improvements?

@Lucas-C
Copy link
Member

Lucas-C commented Sep 4, 2023

I also compared the perfomances of the script provided by @diego-insaurralde (this one: issue_907.py) between fpdf2 v2.7.4 and fpdf2 v2.7.5.
On my machine, I get those results:

  • with fpdf2 v2.7.4:
$ time ./issue_907.py
real    0m10,501s
  • with fpdf2 v2.7.5 (more precisely the current state of this git repo master branch including the recent patch by @andersonhc):
$ time ./issue_907.py
real    0m30,115s

So I think we got an actual speed regression in fpdf2 v2.7.5 😢

I'm sorry @gmischler, but this shows that your previous statements were maybe too categorical:

Dealing with the latter format simply takes a lot more processing. There's no way around that.
[...]
In summary, I repeat: it is likely impossible for any Python library to deliver the performance you wish for.

Given that, in this specific usage scenario, fpdf2 was faster in v2.7.4, there may be still room for improvement 😊
And having performance non-regression tests would definitively be helpful to avoid that in the future.

@diego-insaurralde
Copy link
Author

diego-insaurralde commented Sep 4, 2023

First of all, I would like to apologize, I didn't mean to criticize anything, but just to provoke you, to get a better undestand. After your explanation @gmischler , now I'm aware how the solution can be delicated, I have a trouble and I looked for a solution, That's why, I'm here.

@Lucas-C Thank you so much for your clarifications, Now I can hope a solution xD.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 4, 2023

I profiled issue_907.py using cProfile:

python -m cProfile -o issue_907.pstats issue_907.py

There is the result as a SVG flame graph (produced using flameprof):
issue_907 pstats

It makes very obvious the places were optimizations would be worthwile 🙂

I also opened #913 to try to make test_cell_speed_with_long_text() faster

@gmischler
Copy link
Collaborator

There is the result as a SVG flame graph (produced using flameprof):

It might be interesting to see the same thing for 2.7.4 for comparison (and after #913, of course).

@Lucas-C
Copy link
Member

Lucas-C commented Sep 4, 2023

It might be interesting to see the same thing for 2.7.4 for comparison

Sure, good idea, there is the flamegraph for issue_907.py being executed with fpdf2 in version 2.7.4:
issue_907-v2 7 4 pstats

@gmischler
Copy link
Collaborator

Comparing the two graphs, a few things jump out at me:
In 2.7.4, get_width() appears to have been the largest single bottleneck (not entirely unexpected).
In 2.7.5, this part has been squished into the smaller left tower, while actually picking glyphs to render takes a lot more time.
It will be interesting to see in which ways and by how much #913 will change that.

I suspect though, that really significant gains can better be reached on an architectural level than by individual local optimizations.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 6, 2023

I suspect though, that really significant gains can better be reached on an architectural level than by individual local optimizations.

I totally agree, but right now I do not see clearly what architectural changes we should make to improve things...

There is the performance flame graph for issue_907.py on develop after merging PR #913:
issue_907-post-PR-913 pstats

Not much has changed in the shape of the stacks...

@Lucas-C
Copy link
Member

Lucas-C commented Sep 6, 2023

FYY I added a section in our docs on how to investigate performance issues:
https://py-pdf.github.io/fpdf2/Development.html#testing-performances

@gmischler
Copy link
Collaborator

I've just been looking through #913 again and it seems that the good idea of caching glyph IDs could be expanded on further.

At the moment, every time a specific glyph needs to be rendered, a new Glyph() instance is created. As far as I can tell, all of the instances for a specific glyph will be identical, which means that they are essentially redundant, and only one of each type is needed. So instead of caching glyph IDs, we should probably rather be caching Glyph() instances. In our current example, that would eliminate the instantiation of around 7 million redundant objects.

@andersonhc, am I missing anything obvious here? Is there any situation where a Glyph() instance for the same sequence of Unicode code points would have different contents than the others, or where one would be modified after creation?
If not, then anyone is free to run with the idea. Unfortunately I don't have time myself to follow up on this since I'm still focusing on getting #897 production ready.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 8, 2023

I've just been looking through #913 again and it seems that the good idea of caching glyph IDs could be expanded on further. [...] So instead of caching glyph IDs, we should probably rather be caching Glyph() instances. In our current example, that would eliminate the instantiation of around 7 million redundant objects.

Sounds great!! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants