Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for openpyxl when reading XLSX (Excel 2010) files #11499

Closed
Themanwithoutaplan opened this issue Nov 1, 2015 · 13 comments · Fixed by #25092
Closed

Support for openpyxl when reading XLSX (Excel 2010) files #11499

Themanwithoutaplan opened this issue Nov 1, 2015 · 13 comments · Fixed by #25092
Labels
Enhancement IO Excel read_excel, to_excel
Milestone

Comments

@Themanwithoutaplan
Copy link
Contributor

I took a brief look at the existing code for importing files with xlrd but it looks tightly coupled with that library and not particularly straightforward to adapt.

I think that the read-only mode of openpyxl would be a good fit for Pandas and would like to work with you to provide the necessary API to go from rows to a dataframe. Data is stored by row so this is the only sensible approach. We will be adding read-support for NumPy types, ie. when passed them, but probably not when reading Excel files.

@jbrockmendel jbrockmendel added the IO Excel read_excel, to_excel label Jul 25, 2018
@gfyoung
Copy link
Member

gfyoung commented Nov 13, 2018

@Themanwithoutaplan : Sorry that this went unnoticed for AWHILE ! 😮

Is this something you would still be interested in pursuing?

@WillAyd WillAyd added Duplicate Report Duplicate issue or pull request and removed Duplicate Report Duplicate issue or pull request labels Nov 14, 2018
@Themanwithoutaplan
Copy link
Contributor Author

@gfyoung Sure. We've got a sprint coming up this weekend where I'll be working on this. I'd basically like to see the packages work together using explicit APIs rathe than the current, largely unavoidable, spaghetti code.

Seems to me that a telco would make sense.

@gfyoung
Copy link
Member

gfyoung commented Nov 14, 2018

I'd basically like to see the packages work together using explicit APIs rathe than the current, largely unavoidable, spaghetti code.

That makes a lot of sense to me. I don't think any of us would object to making the Excel reading process a lot nicer to use and understand.

@Themanwithoutaplan
Copy link
Contributor Author

I suspect that the XLS code will have to stay as it is because xlrd is now effectively abandonware. Personally, I'd think about deprecating it as a format… but anyway.

2.6 will add the values_only parameter to iter_rows. My current tests suggest similar performance when reading worksheets to xlrd but with the added advantage of reading only the sheets you're interested in. Unsurprising really because the underlying code in both cases is ElementTree plus string to types. In addition, Pandas seems to do some heuristics to try and find relevant cell areas for those worksheets which aren't a row of headers followed by rows of data.

From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

If you're around later today (I'm +0100) then we could perhaps discuss this over Hangouts or some such.

@gfyoung
Copy link
Member

gfyoung commented Nov 15, 2018

From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

I think I see what you're saying, but I suggest opening an issue for this separately so as to explain more what you mean.

@Themanwithoutaplan
Copy link
Contributor Author

FWIW I've now released an alpha of 2.6 and backported the values_only parameter. Some performance numbers using an existing file are available. In particular for anything containing datetimes openpyxl is easier to work with as well as being faster.

I'm not sure whether being able to run in parallel is useful for Pandas, nor whether this would be best using threads or processes: there is a lot of I/O but also quite a lot of CPU work.

I'll also create a separate ticket for what I'd like to see in the Pandas API.

@WillAyd
Copy link
Member

WillAyd commented Jan 29, 2019

@Themanwithoutaplan I've been working to decouple xlrd in this code and have implemented a base class for excel reading which you can see below:

class _BaseExcelReader(object):

Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

@Themanwithoutaplan
Copy link
Contributor Author

Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

Looks a lot better (especially the writer code that doesn't have to worry about the different ways we've handled styles). The reader code is definitely less tightly coupled to xlrd than it was. You will probably want to use read-only mode with openpyxl.

I notice you've got your own code for converting Excel coords to numerical indices but I think all the libraries their own robust functions you could probably use (ok, openpyxl uses 1-indexing) and your own evil code for handling headers with multiple layers. I'd really like to see this in the dataframe or index API, or anything that openpyl can use directly.

@tdamsma
Copy link
Contributor

tdamsma commented Jan 31, 2019

Well if no one else has started this I'll see if I have some time this weekend to make an openpyxl reader. Hopefully I can reuse all the same tests, just swap out the engine.

@jreback jreback added this to the Contributions Welcome milestone May 19, 2019
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Jun 28, 2019
@Themanwithoutaplan
Copy link
Contributor Author

Congratulations!

@selik
Copy link
Contributor

selik commented Aug 14, 2019

@WillAyd @tdamsma Looks like the docs should be updated to reflect this change. The docstring for the engine parameter says the only acceptable values are None or xlrd.

https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.read_excel.html#pandas.read_excel

@TomAugspurger
Copy link
Contributor

@selik can you open a PR or new issue for that? Otherwise it's likely to not get done.

@selik
Copy link
Contributor

selik commented Nov 22, 2019

@TomAugspurger Looks like someone took care of it, but it's not reflected in v0.25.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants