Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: read_excel (xlrd engine) add parameter for ignore_workbook_corruption #40274

Closed
aster94 opened this issue Mar 6, 2021 · 9 comments · Fixed by #52214
Closed

ENH: read_excel (xlrd engine) add parameter for ignore_workbook_corruption #40274

aster94 opened this issue Mar 6, 2021 · 9 comments · Fixed by #52214
Assignees
Labels
Enhancement IO Excel read_excel, to_excel

Comments

@aster94
Copy link

aster94 commented Mar 6, 2021

Is your feature request related to a problem?

yes, sometimes, expecially when created by 3rd party application old excel file can be corrupted:

https://www.codeforests.com/2020/05/28/fix-the-compdocerror-for-xlrd/
https://stackoverflow.com/questions/12705527/reading-excel-files-with-xlrd
https://stackoverflow.com/questions/34550624/python-xlrd-read-excel-file-error/
https://titanwolf.org/Network/Articles/Article?AID=f1370e16-c9ef-4892-a00e-b1bafee2f6a8#gsc.tab=0

Describe the solution you'd like

xlrd's team recently in version 2.0 (december 2020) solved this with a flag: ignore_workbook_corruption, please see https://xlrd.readthedocs.io/en/latest/changes.html
So my desire would be to just have this flag inside the pandas's function read_excel

API breaking implications

any, I guess

Describe alternatives you've considered

Open the file directly with xlrd and pass it to pandas
I may be unaware of other solutions

@aster94 aster94 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 6, 2021
@ahawryluk
Copy link
Contributor

Since it's an issue unique to XLS files, would ignore_xls_corruption be an appropriate name?

@aster94
Copy link
Author

aster94 commented Mar 7, 2021

Yes, as you know xlrd dropped support to all formats, except xls. So in my opinion it's an appropriate name

Edit: just thinking, maybe also openpyxl and odf have a flag to ignore corrupted files? If yes maybe a more general name could be used

Edit 2: openpyxl seems to not support such a flag, I am not sure but I didn't found anything similar also in odfpy. So yes, ignore_xls_corruption is a nice name

@jbrockmendel jbrockmendel added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 24, 2021
@rmhowe425
Copy link
Contributor

take

@rmhowe425
Copy link
Contributor

@rhshadrach Is this issue okay to pick up and implement?

@rhshadrach
Copy link
Member

rhshadrach commented Mar 14, 2023

I'm generally negative on adding engine-specific keywords.

pandas calls open_workbook of xlrd. I think we should add engine_kwargs to read_excel. This is an argument that many other pandas methods enjoy. This would enable the user to pass arguments to engines as they see fit without us having to modify the signature of read_excel for each engine.

This would require:

  • Enabling engine_kwargs for all readers
  • Documenting which method of each engine pandas uses to read an Excel file so that users can find which arguments can be passed through to the engine.

@rmhowe425
Copy link
Contributor

@rhshadrach Okay! I can do that if that's the official route that we want to take for this issue!

@rhshadrach
Copy link
Member

cc @phofl for any thoughts

@phofl
Copy link
Member

phofl commented Mar 14, 2023

agree with @rhshadrach

An engine specific kwarg is not a great idea, engine_kwargs makes more sense

@rhshadrach
Copy link
Member

@rmhowe425 - I think we have a good way forward. Please go ahead with this if you're still interested!

mroeschke added a commit that referenced this issue Apr 12, 2023
* Fixing merge conflicts

* Fixing merge conflict

* Fixing documentation issues

* standardized usage of engine_kwargs, fixed unit tests & doc strings

* Fixing documentation issues

* Fixing implementation logic and unit tests

* Fixing implementation logic

* Fixing formatting issues

* Fixing error for test Docstring validation, typing, and other manual pre-commit hooks

* Fixing documentation error

* Standardizing engine_kwarg types

* Fixing minor issues with unit tests and documentation

* Fixing documentation issue

* Fixing a formatting / documentation error

* Fixing documentation errors

* Fixing documentation errors

* Fixing documentation errors

* Fixing documentation errors

* Fixing documentation errors

* Adding an extra blank line to troubleshoot documentation error

* Adding an extra blank line to troubleshoot documentation error

* Fixing documentation issues

* Fixing formatting errors

* Fixing formatting errors

* Fixing formatting errors

* Fixing logic and formatting issues in unit tests

* Fixing issues with merge conflict

* Fixing formatting issue

* Update pandas/io/excel/_base.py

---------

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants