Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds pandas-friendly formatting for multi-column and multi-index tables #986

Closed
wants to merge 0 commits into from
Closed

Conversation

afriedman412
Copy link

(partially) Fixes #970
• Tables can now handle multi-index formatting similar to how multi-column formatted is currently implemented.
• Code added to format multi-index and multi-header to match Pandas (repeat labels are now omitted but spaced correctly

Checklist:

  • The GitHub pipeline is OK (green),
    meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.

  • A unit test is covering the code added / modified by this PR

  • This PR is ready to be merged

  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder

  • A mention of the change is present in CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

@afriedman412
Copy link
Author

ok what's the best way to include a pandas df for testing (assuming we don't want to require pandas for testing)

@andersonhc
Copy link
Collaborator

andersonhc commented Oct 27, 2023

ok what's the best way to include a pandas df for testing (assuming we don't want to require pandas for testing)

We have camelot.py into test\requirements.txt, and camelot depends on pandas, so pandas is already installed with our test requirements.

@codecov-commenter
Copy link

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (e23fdcc) 93.70% compared to head (a434fe1) 93.40%.
Report is 1 commits behind head on master.

❗ Current head a434fe1 differs from pull request most recent head 955937b. Consider uploading reports for the commit 955937b to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   93.70%   93.40%   -0.30%     
==========================================
  Files          28       28              
  Lines        8276     8309      +33     
  Branches     1514     1522       +8     
==========================================
+ Hits         7755     7761       +6     
- Misses        324      350      +26     
- Partials      197      198       +1     
Files Coverage Δ
fpdf/table.py 87.89% <32.50%> (-6.63%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@afriedman412
Copy link
Author

I added the code to create a table from a dataframe directly to the pdf object because it needs to get the number of index rows and columns from the dataframe ... and it seemed redundant to have to pass all the other kwargs to the initial function for building the table.

also added a test!

Copy link
Collaborator

@gmischler gmischler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afriedman412, I appreciate your enthusiasm and can support the basic purpose of this PR. It is certainly desirable to make the use of data generated by other software packages as easily useable as possible. This applies to almost any data producing software, which happens to include pandas.

BUT.

The way you implemented this is not how it should be done.
There is absolutely no reason why the core classes of fpdf2 should have any knowledge about internal data structures of any one of those data generating packages. Sorry if this sounds unduly harsh, but from a software architecture perspective, this approach is just utterly and horribly wrong. If we allow this to happen and then continue on the the same path, fpdf2 would quickly turn into bloatware and become unmaintainable.

I've outlined the basic reasons for my objection and at least one possible solution in #970, just before and again after you joined the discussion. You seemed to have either not read, not understood, or just ignored this input. Is there a particular reason for that?

@afriedman412
Copy link
Author

I've outlined the basic reasons for my objection and at least one possible solution in #970, just before and again after you joined the discussion. You seemed to have either not read, not understood, or just ignored this input. Is there a particular reason for that?

Apologies for not seeing that you already went through this above. I kind of misread the conversation, as what stuck out to me immediately was the mis-framing of the issue.

While I agree that it's not ideal that Pandas doesn't do this formatting itself, multi-level dataframes are common enough that their formatting is worth accommodating. Automatically grabbing the number of levels is more intuitive than having to pass them as separate variables. It might not be elegant from a software engineer standpoint, but it's a small UX improvement that will save data people (such as myself) countless small headaches. But it's your package and if you don't want it that way, that's up to you!

To the larger point -- and why I missed the earlier conversation -- my goal here was to build out FPDF's ability to understand what index and column labels are, if for no other reason than to make their independent styling less janky!

@gmischler
Copy link
Collaborator

gmischler commented Oct 30, 2023

Apologies for not seeing that you already went through this above. I kind of misread the conversation, as what stuck out to me immediately was the mis-framing of the issue.

Yes, #970 is approaching the topic from a rather particular angle. No worries about that.

my goal here was to build out FPDF's ability to understand what index and column labels are, if for no other reason than to make their independent styling less janky!

That part is very much appreciated! The dilemma is simply about the optimal answer to "what goes where".

What the table module definitively needs internally are a good way to handle row labels (are those called "indexes" outside the pandas ecosystem as well?) and row spans. Adding those two features would be a good foundation for any of the more ambitious ideas (and the latter also helpful for write_html()).

On top of that foundation, a good concept for what you call "multi-level dataframes" will need some careful thought. The trick here is not focus your tunnel vision on pandas. Consider that there are likely other packages out there that may produce similar data, but will deliver them in a slightly different form, using differently named and organized data structures. Ultimately we will want to support all of them (or at least make that possible), so we need some kind of translation layer in between, separate from the core fpdf2 modules.

Whether this translation layer should become a part of the library (eg. in a new subdirectory "adapters"), would be better hosted as its own project, or some other approach entirely, that is up to discussion. But in any case, there needs to be a cleanly structured system with defined levels of abstraction. Data translation and conversion between different systems is a tricky topic, and doesn't get easier by just stirring everything together in one bowl.

If you're still interested in doing this (assuming I haven't chased you away with my criticism 😉), I suggest to restart from a clean slate. You can probably reuse a good chunk of the code already written, but it needs to be organized quite differently, so trying to salvage this PR might more hinder your progress than helping it.

fpdf/fpdf.py Outdated
@@ -4882,6 +4882,23 @@ def table(self, *args, **kwargs):
yield table
table.render()

def create_table_from_dataframe(self, df, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such function should probably live in its own fpdf/pandas.py module.

@gmischler already provided a detailed explanation on why this should not be there, in fpdf2 core source file 😊

@Lucas-C
Copy link
Member

Lucas-C commented Oct 30, 2023

my goal here was to build out FPDF's ability to understand what index and column labels are, if for no other reason than to make their independent styling less janky!

That part is very much appreciated! The dilemma is simply about the optimal answer to "what goes where".

I fully second @gmischler there 😊

I totally agree that this PR cannot be merged "as it is" now,
but I hope you will consider making some refactorings @afriedman412,
because this contribution to fpdf2 would be very welcome!

We sure can find room in fpdf2 for a "small UX improvement that will save data people countless small headaches"!

A good start could be to refactor your code in order minimizes the changes to the existing classes,
and initiate a new fpdf/pandas.py module!

@afriedman412
Copy link
Author

appreciate the feedback!

do you already have adapters set up or would this be the first?

@gmischler
Copy link
Collaborator

do you already have adapters set up or would this be the first?

So far they all live in the "Mixing with other libs" section of the documentation...
Until very recently, fpdf2 didn't have complex higher level APIs like now for tables and text regions, so the question didn't come up in the same way.

...and initiate a new fpdf/pandas.py module!

I'd hope to keep it as generic as possible.

fpdf/adapters/table_base.py
fpdf/adapters/table_pandas.py
fpdf/adapters/table_xyzpackage.py

There may well be other types of adapters in the future covering different aspects of working with fpdf2. I can't think of a good example right now, but users keep surprising me with their ideas... 😉

@afriedman412
Copy link
Author

ok rad, will work on this

@afriedman412
Copy link
Author

What the table module definitively needs internally are a good way to handle row labels (are those called "indexes" outside the pandas ecosystem as well?) and row spans. Adding those two features would be a good foundation for any of the more ambitious ideas (and the latter also helpful for write_html()).

can you clarify what you want the write_html() function to do?

I got down the rabbit hole of heading/index functionality and realized I didn't have a destination!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add automatically merging cells in tables
5 participants