Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support writing/reading notes to/from excel files #58831

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

diogomsmiranda
Copy link

@diogomsmiranda diogomsmiranda commented May 26, 2024

Me and my colleague @Dacops developed this feature with the use case issued in mind.
We believe the solution can be better understood when looking at the examples we provide in the documentation but also in the tests we made.

Unfortunately, not every engine supported by pandas had the capability to handle notes (calamine, odfreader, pyxlsb, odswriter), on the remaining ones we took the liberty to not only add writing capability as requested in #58070 but also add the complementary operation of reading notes.

In summary, now you can write notes to an excel (using either openpyxl or xlswriter) by doing:

df = pandas.DataFrame(...) # DataFrame holding cell content
df_notes = pandas.DataFrame(...)  # DataFrame holding notes
df.style.set_tooltips(df_notes).to_excel(excel_file.xlsx)

The complementary of this (reading using either xlrd or openpyxl) would look something like:

df_notes = pandas.DataFrame() # empty DataFrame
pandas.read_excel(path, engine="xlrd", notes=df_notes) # df_notes now hold the notes as they were in the excel

We will be available and ready to answer any doubts related to the pr.

Hope everything is to your liking.

@attack68
Copy link
Contributor

I think this looks like a good PR overall but this is not actually what I originally suggested. The ExcelFormatter class in pandas is capable of detecting a Styler object and what I suggested was that that branching could be extended to handle an attached Styler tooltips object. I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

However, what you have built is the generic ability for the to_excel method to handle comments in the forms of notes which is an additional keyword argument to the underlying function. Additionally when looking at first glance you seem to have added the ability to read notes as well. This may well all be quite useful functionality.

So this is a comprehensive PR. It may have been best to segregate these different components into different successive PRs, to make it easier to properly review and consider all the edge cases.

One case that springs to mind is that the way tooltips works for Styler is to reindex_like so that its columns and index align with the underlying dataframe. This does not appear to be the case for notes, so some kind of consideration for alignment, or properly documenting the difference would probably be needed.

Before doing a full review it will be interesting to hear from other core devs. Also in regards to potential spilt of the PR.

@WillAyd
Copy link
Member

WillAyd commented May 28, 2024

I am -.5 on adding this as a keyword argument to the excel I/O functions; the fact that so few of the underlying libraries supports this is a red flag. Generally we have a really high bar to expand our API, and adding this without supporting all underlying libraries is confusing for end users

@diogomsmiranda
Copy link
Author

diogomsmiranda commented May 28, 2024

I think this looks like a good PR overall but this is not actually what I originally suggested. The ExcelFormatter class in pandas is capable of detecting a Styler object and what I suggested was that that branching could be extended to handle an attached Styler tooltips object. I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

@attack68 Thank you for the comment, and sorry for our miss interpretation, I've adjusted the description of the PR accordingly.

One case that springs to mind is that the way tooltips works for Styler is to reindex_like so that its columns and index align with the underlying dataframe. This does not appear to be the case for notes, so some kind of consideration for alignment, or properly documenting the difference would probably be needed.

That's indeed a point we noticed once we were testing the code, however, we didn't know what would be the right way to approach this, we'll await further instructions from core developers as you suggested.

@diogomsmiranda
Copy link
Author

diogomsmiranda commented May 28, 2024

I am -.5 on adding this as a keyword argument to the excel I/O functions; the fact that so few of the underlying libraries supports this is a red flag. Generally we have a really high bar to expand our API, and adding this without supporting all underlying libraries is confusing for end users

@WillAyd Thank you for having a look at our pull request.

We acknowledge that introducing a feature without support for all underlying libraries may not appear ideal. Initially, we did not foresee this would be a problem. However, after a detailed examination of the source code from the relevant libraries, we discovered that many are no longer actively developed and/or have limited functionality, python-calamine, however, is exception to this as it's relatively new still. This likely contributes to the current lack of support for notes.

Despite these limitations, implementing this feature in pandas will enable support for notes in Excel formats such as .xls, .xlsx, .xlsm, .xltx, and .xltm. While this support is somewhat restricted, we believe it will still significantly improve user workflows.

We leave links to the libraries so you can see for yourself.

@rhshadrach
Copy link
Member

I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

@attack68 Thank you for the comment, and sorry for our miss interpretation, I've adjusted the description of the PR accordingly.

But this PR is still modifying the core DataFrame.to_excel method?

However, after a detailed examination of the source code from the relevant libraries, we discovered that many are no longer actively developed and/or have limited functionality, python-calamine, however, is exception to this as it's relatively new still. This likely contributes to the current lack of support for notes.

While python-calamine is new, the Rust library it uses, calamine has been around since 2016. And in the ReadMe, they state

Many (most) part of the specifications are not implemented, the focus has been put on reading cell values and vba code.

From this, it seems unlikely for calamine to gain the ability to read notes. If I'm understanding this PR correctly, only openpyxl and xlrd support reading comments. Coupled with the fact that the API is, in my opinion, somewhat awkward (passing in a DataFrame to store the result), I am rather negative on supporting reading notes as well.

Somewhat of a separate question - are there engines which are no longer maintained and they provide no additional functionality from libraries that are maintained? These could be deprecated - xlrd comes to mind, but I'm not sure if calamine completely replaces it.

@Dacops
Copy link

Dacops commented Jun 3, 2024

But this PR is still modifying the core DataFrame.to_excel method?

No, to_excel is not modified, the notes are kept as an attribute of ExcelFormatter (defaulted to None) that is then received by the writers.

From this, it seems unlikely for calamine to gain the ability to read notes. If I'm understanding this PR correctly, only openpyxl and xlrd support reading comments. Coupled with the fact that the API is, in my opinion, somewhat awkward (passing in a DataFrame to store the result), I am rather negative on supporting reading notes as well.

Reading notes was initially added for testing purposes as an afterthought. It's a more intrusive method and is supported by fewer libraries. Now that writing notes has demonstrated its intended functionality, this feature could be removed.

Somewhat of a separate question - are there engines which are no longer maintained and they provide no additional functionality from libraries that are maintained? These could be deprecated - xlrd comes to mind, but I'm not sure if calamine completely replaces it.

Based on our research, the writers xlswriter and openpyxl both write to .xlsx files (I believe Pandas uses both in parallel due to some compatibility issues). The odswriter writes to .ods files.
For the readers, xlrd reads from old-style Excel formats (.xls), while openpyxl reads from .xlsx, .xlsm, .xltx, and .xltm files. Odfreader handles OpenDocument type files (.odf, .ods, .odt), and pyxlsb reads binary (.xlsb) files.
Calamine is the most interesting package as it reads from both Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) formats. It is also a newer package. I believe some readers could be replaced by Calamine; it’s just a matter of checking for any compatibility issues.

@Dacops
Copy link

Dacops commented Jun 6, 2024

Rebased to address a conflict with this commit

@rhshadrach
Copy link
Member

Reading notes was initially added for testing purposes as an afterthought. It's a more intrusive method and is supported by fewer libraries. Now that writing notes has demonstrated its intended functionality, this feature could be removed.

@attack68 - you're good with just writing notes instead of having reading as well?

@attack68
Copy link
Contributor

attack68 commented Jun 7, 2024

Yes that sounds good. I think the writing mode is a good feature and relatively speaking it is a fairly neglible extension to the overall API. I think this should be accepted to merge.

I understand that the reading was added probably as a 'round trip' test whcih is sensible but reading is a separate process and this could (and should) be added in a separate, atomic PR, albeit that seems like lesser chance of acceptance.

@diogomsmiranda
Copy link
Author

Yes that sounds good. I think the writing mode is a good feature and relatively speaking it is a fairly neglible extension to the overall API. I think this should be accepted to merge.

I understand that the reading was added probably as a 'round trip' test whcih is sensible but reading is a separate process and this could (and should) be added in a separate, atomic PR, albeit that seems like lesser chance of acceptance.

We are also ok with this. @rhshadrach should we now delete the reading from this pr and open a separate one?

@rhshadrach
Copy link
Member

@diogomsmiranda - modifying this PR would be preferable.

@Dacops Dacops force-pushed the excel-notes branch 4 times, most recently from 2913e11 to 7ed8fb5 Compare June 9, 2024 01:10
@Dacops
Copy link

Dacops commented Jun 9, 2024

@rhshadrach We've modified the PR, however we are having a weird issue with the docstrings validation, any idea what might be causing this problem?

pandas/io/formats/style.py Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put remarks on one of the writers - I think this applies to all writers. This also needs tests. While pandas does not have the ability to read the notes, you can use one of the Excel engines that does to read them directly.

What happens in the case of merged cells?

pandas/io/excel/_openpyxl.py Outdated Show resolved Hide resolved
pandas/io/excel/_openpyxl.py Outdated Show resolved Hide resolved
pandas/io/excel/_openpyxl.py Show resolved Hide resolved
@Dacops Dacops force-pushed the excel-notes branch 2 times, most recently from 83896cf to f57b587 Compare June 16, 2024 22:33
@Dacops
Copy link

Dacops commented Jun 16, 2024

@rhshadrach

I put remarks on one of the writers - I think this applies to all writers.

I've updated this in all writers, except the first example, which I didn't understand, could you better explain how do you want us to change that logic?

This also needs tests. While pandas does not have the ability to read the notes, you can use one of the Excel engines that does to read them directly.

I've brought all the tests from the original issue except now the reading is handled directly by the library openpyxl in the testing file.

What happens in the case of merged cells?

The comment writing occurs after all the content is written, so if the notes are passed in mind with cell merging that's how the writing will proceed.


Also some datetime tests started failing on the 32-bits test suites due to overflows, is this a general problem?

raise NotImplementedError(
"""
Notes are not supported by the odswriter engine,
see https://github.com/mmulqueen/odswriter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the reference to the linked repo. pandas uses https://github.com/eea/odfpy for writing ODS files.

Copy link
Author

@diogomsmiranda diogomsmiranda Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I might be wrong but I believe we went to pandas documentation when searching for all the engines pandas currently uses and found this for ExcelWriters, in here explicitly links mmulqueen/odswriter as the engine to write ods files.

https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html,

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, odfpy is indeed used but for reading Open-Document type files (https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html#pandas-read-excel). The writing operation is taken care by that odswriter as my colleague referred.

Copy link
Contributor

@asishm asishm Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @diogomsmiranda @Dacops. I can see why now. The docstring in pd.ExcelWriter is incorrect (looks like was added in #47568).

  1. If you check the engine_kwargs section, it mentions odf.opendocument.OpenDocumentSpreadsheet(**engine_kwargs). None of this construct is available in the linked odswriter library, but is available in the odfpy library instead https://github.com/eea/odfpy/blob/574f0fafad73a15a5b11b115d94821623274b4b0/odf/opendocument.py#L842.

  2. If you check what extra libraries pandas installs when you run pip install pandas[excel], you don't see odswriter listed there.

    excel = ['odfpy>=1.4.1', 'openpyxl>=3.1.0', 'python-calamine>=0.1.7', 'pyxlsb>=1.0.10', 'xlrd>=2.0.1', 'xlsxwriter>=3.0.5']

  3. grepping the codebase for import odswriter gives 0 hits

cc @rhshadrach

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I'm glad it could be useful to find the mistake. We'll change the link to the correct one, or even better check if odfpy has compatibility with notes, and adjust accordingly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking through the codebase for mentions of comments or notes in cells I didn't find anything, although I did find a couple discussion about the current state of the project, mentioned as "dead".

In the mean time I've fixed the exception to link the right library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current state of the project, mentioned as "dead".

I think that's fair. The last commit on that repo seems to be 3 years ago.

@attack68
Copy link
Contributor

This looks a lot cleaner now, more targeted and the tests are written to detect the notes. Is there much outstanding other than fixing any failing tests?

@Dacops
Copy link

Dacops commented Jul 12, 2024

This looks a lot cleaner now, more targeted and the tests are written to detect the notes. Is there much outstanding other than fixing any failing tests?

Nope, I believe this is done, the failing tests were something that was failing globally a while back (32bit test suites were suffering from overflows) but I believe this is fixed now. I'll rebase the PR to address the conflict at the whatsnew file and it should pass everything.

…#58070)

Co-Authored-By: diogomsmiranda <diogomsmiranda@tecnico.ulisboa.pt>
@diogomsmiranda
Copy link
Author

Hey, @rhshadrach for some reason Github is still showing that you requested changes, but I believe we already addressed them. Can you confirm everything is ok?

@rhshadrach
Copy link
Member

Can you confirm everything is ok?

I should have some time in the next few days. As a side note, this would be easier on my end if there wasn't force-pushes involved, as then I could just view the diff from the last time I reviewed this PR. However, because there are force-pushes, I need to review the entire PR again.

@diogomsmiranda
Copy link
Author

As a side note, this would be easier on my end if there wasn't force-pushes involved, as then I could just view the diff from the last time I reviewed this PR. However, because there are force-pushes, I need to review the entire PR again.

Thanks for the feedback and sorry for the inconvenience, it's noted for any future contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: styler.to_excel to support excel notes as implementation of tooltips
6 participants