New feature: FPDF.table() #701

Closed
Lucas-C opened this issue Feb 20, 2023 · 7 comments


@Lucas-C
Member

Lucas-C commented Feb 20, 2023

Current situation
fpdf2 currently lets users employ the cell() & multi_cell() methods to build tables, as demonstrated in part 5 of our tutorial: https://pyfpdf.github.io/fpdf2/Tutorial.html#tuto-5-creating-tables
We also have some recipes regarding building tables in our documentation: https://pyfpdf.github.io/fpdf2/Tables.html

Based on the feedback in several table-related issues & discussions opened on this GitHub project, it seems to me that a FPDF.table() method would be very handy for our users.

Features
Ideally, the final implementation would provide the following set of features:

  • support cells with content wrapping over several lines
  • control over column & row sizes, or by default let them be automatically computed
  • control over text alignment in cells, with rules by column or row
  • allow setting table headings, styled differently, but make this optional
  • control table width
  • honor the initial X / Y current position to render the table, and make it easy to center it on the page
  • handle splitting a table over page breaks, with headings repeated
  • allow embedding images in cells
  • control over borders: color, width & where they are drawn (e.g. allow not drawing the surrounding square, allow drawing only the horizontal line above the headings, etc.). Also: control the thickness of the border below the headings
  • control over cell background, through a callback function to allow maximum customization
  • (bonus) allow for several cells to be merged horizontally (aka colspan)
  • (bonus) replace the table-building logic in fpdf/html.py by a call to this new FPDF.table() method
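
To make the "splitting a table over page breaks, with headings repeated" requirement concrete, here is a minimal sketch in plain Python. It is not fpdf2 code: the `paginate_rows` function and its parameters are hypothetical, and it assumes every row has the same fixed height, whereas a real implementation would have to account for rows whose height varies with wrapped content.

```python
from typing import List, Sequence


def paginate_rows(
    headings: Sequence[str],
    rows: Sequence[Sequence[str]],
    row_height: float,
    page_height: float,
) -> List[List[Sequence[str]]]:
    """Split rows into pages, repeating the headings row at the top of each page.

    Hypothetical helper: assumes a fixed row height for every row.
    """
    # One slot per page is reserved for the repeated headings row.
    rows_per_page = int(page_height // row_height) - 1
    if rows_per_page < 1:
        raise ValueError("page too small to hold the headings plus one row")
    pages = []
    for start in range(0, len(rows), rows_per_page):
        pages.append([headings, *rows[start:start + rows_per_page]])
    return pages


headings = ("First name", "Last name")
rows = [(f"first{i}", f"last{i}") for i in range(5)]
# With these made-up sizes, each page holds the headings plus 2 data rows.
pages = paginate_rows(headings, rows, row_height=10, page_height=30)
```

Each element of `pages` starts with the headings row, so a renderer could draw every page independently.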

Method design
In issue #680 I pitched the following API for this feature:

from fpdf import FPDF

pdf = FPDF()
with pdf.table() as table:
    table.col_widths = ...  # optional
    with table.row() as row:
        row.cell(...)  # or row.image(...)

Regarding this, feedback and alternative suggestions are very welcome! 😊
Here is what I like about this one:

  • it defers the actual table building & rendering to the end of the table() context, which means that we'll be able to perform some calculations on the row heights / column widths based on all the table content provided
  • it gives more flexibility to the user than having a huge data object provided in one go to a table() method, while still making it easy to build a table based on such a big data dictionary / sequence
  • requiring several method calls will allow us to "split" control parameters between those methods, and limit the number of parameters passed to table(). The image() method for example, with its 11 parameters, is becoming a bit difficult to grasp.
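
The first point, deferring rendering to the end of the table() context, can be illustrated with a small standalone sketch. This is not the fpdf2 implementation: the `Table`, `Row` and `table()` names below are hypothetical stand-ins that produce plain text instead of PDF output, and column widths are measured in characters purely for illustration.

```python
from contextlib import contextmanager


class Row:
    """Hypothetical row object: it only records cell contents."""

    def __init__(self):
        self.cells = []

    def cell(self, text):
        self.cells.append(str(text))


class Table:
    """Hypothetical table object: collects rows, renders only on demand."""

    def __init__(self):
        self.col_widths = None  # optional: computed from content if left unset
        self.rows = []
        self.rendered = []

    @contextmanager
    def row(self):
        row = Row()
        yield row
        self.rows.append(row)

    def render(self):
        n_cols = max((len(r.cells) for r in self.rows), default=0)
        # All rows are known at this point, so widths can depend on every cell.
        widths = self.col_widths or [
            max((len(r.cells[i]) for r in self.rows if i < len(r.cells)), default=0)
            for i in range(n_cols)
        ]
        self.rendered = [
            " | ".join(text.ljust(w) for text, w in zip(r.cells, widths))
            for r in self.rows
        ]


@contextmanager
def table():
    t = Table()
    yield t
    t.render()  # deferred: runs once the "with" block has finished


data = [("Name", "City"), ("Jules", "San Juan"), ("Mary", "Orlando")]
with table() as t:
    for data_row in data:
        with t.row() as row:
            for datum in data_row:
                row.cell(datum)
```

Nothing is laid out while rows are being added; `t.rendered` is only filled when the outer `with` block exits, which is what makes content-based column widths possible. The loop also shows that a table can still be built from one big data sequence.
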
@Lucas-C
Member Author

Lucas-C commented Feb 28, 2023

The PR is almost ready: #703

Lucas-C added a commit that referenced this issue Feb 28, 2023
Lucas-C added a commit that referenced this issue Feb 28, 2023
Lucas-C added a commit that referenced this issue Mar 16, 2023
Lucas-C added a commit that referenced this issue Mar 17, 2023
@MartinThoma
Member

Hey! I'm Martin, the maintainer of pypdf and PyPDF2 👋

Do you think the table-feature could be added in a way that it's possible to read the table structure from the PDF (programmatically, without heuristics)?

@MartinThoma
Member

I was thinking about "14.6 Marked Content", see https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags

@Lucas-C
Member Author

Lucas-C commented Mar 26, 2023

Thank you for reaching out @MartinThoma!

Yes, this is a really good suggestion.
It shouldn't be difficult to add, as we already have the necessary building block: https://github.com/PyFPDF/fpdf2/blob/2.6.1/fpdf/fpdf.py#L3799

However, I am not sure how best to test that we implement this right...
Would you recommend any tool I could use to check that table content can be properly extracted based on marked content?
I only know https://github.com/camelot-dev/camelot, but it is not based on marked content tags.

@MartinThoma
Member

MartinThoma commented Mar 26, 2023

Good question! I want to give those capabilities to pypdf in the long run, but right now we are not there yet.

Looking at some libraries:

I actually asked this several years ago and never received an answer: How can I extract all PDF Tags related to content with Python?

Lucas-C added a commit that referenced this issue Mar 27, 2023
Lucas-C added a commit that referenced this issue Mar 27, 2023
@Lucas-C
Member Author

Lucas-C commented Mar 27, 2023

Thank you for the detailed answer @MartinThoma!
I have also found this screenshot that illustrates tagged table elements:

I have just added a commit to PR #703 related to this: 46bc617 (#703). It contains:

I was not able to find examples of using pdfminer to extract tables from PDF docs.
Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction.
For tika-python, I am going to wait for the answer to the question you asked.

Given that none of the tools dedicated to PDF table extraction uses PDF tags / annotations to do its job, I am not sure that adding PDF tags is really worthwhile...
At least not in a systematic way.
An optional tag=True argument could later be added to FPDF.table(), but I don't think it's necessary in the initial version.

What do you think about this @MartinThoma?

Lucas-C added a commit that referenced this issue Mar 27, 2023
@MartinThoma
Member

Wow, you're amazing 😍

> Regarding PyMuPDF, the GitHub issue you pointed seems to indicate that it does NOT support table data extraction.

Oops, my bad, I mistyped 🙈

> Given that, among tools dedicated to PDF-tables extraction, none of them uses PDF tags / annotations in the process of doing their job, I am not sure that adding PDF tags is really worthwhile

Yes, I understand. It's a bit of a chicken-and-egg problem. Please don't forget that screen readers / accessibility solutions might use the tags as well. I think the tags were originally designed for them, but I have no expertise in that area.

> I don't think it's necessary in the initial version

I agree 👍
