Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for openpyxl when reading XLSX (Excel 2010) files #11499

Closed
Themanwithoutaplan opened this issue Nov 1, 2015 · 12 comments

Comments

@Themanwithoutaplan
Copy link
Contributor

commented Nov 1, 2015

I took a brief look at the existing code for importing files with xlrd but it looks tightly coupled with that library and not particularly straightforward to adapt.

I think that the read-only mode of openpyxl would be a good fit for Pandas and would like to work with you to provide the necessary API to go from rows to a dataframe. Data is stored by row so this is the only sensible approach. We will be adding read-support for NumPy types, ie. when passed them, but probably not when reading Excel files.

@gfyoung

This comment has been minimized.

Copy link
Member

commented Nov 13, 2018

@Themanwithoutaplan : Sorry that this went unnoticed for AWHILE ! 😮

Is this something you would still be interested in pursuing?

@Themanwithoutaplan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 14, 2018

@gfyoung Sure. We've got a sprint coming up this weekend where I'll be working on this. I'd basically like to see the packages work together using explicit APIs rathe than the current, largely unavoidable, spaghetti code.

Seems to me that a telco would make sense.

@gfyoung

This comment has been minimized.

Copy link
Member

commented Nov 14, 2018

I'd basically like to see the packages work together using explicit APIs rathe than the current, largely unavoidable, spaghetti code.

That makes a lot of sense to me. I don't think any of us would object to making the Excel reading process a lot nicer to use and understand.

@Themanwithoutaplan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 15, 2018

I suspect that the XLS code will have to stay as it is because xlrd is now effectively abandonware. Personally, I'd think about deprecating it as a format… but anyway.

2.6 will add the values_only parameter to iter_rows. My current tests suggest similar performance when reading worksheets to xlrd but with the added advantage of reading only the sheets you're interested in. Unsurprising really because the underlying code in both cases is ElementTree plus string to types. In addition, Pandas seems to do some heuristics to try and find relevant cell areas for those worksheets which aren't a row of headers followed by rows of data.

From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

If you're around later today (I'm +0100) then we could perhaps discuss this over Hangouts or some such.

@gfyoung

This comment has been minimized.

Copy link
Member

commented Nov 15, 2018

From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

I think I see what you're saying, but I suggest opening an issue for this separately so as to explain more what you mean.

@Themanwithoutaplan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 23, 2018

FWIW I've now released an alpha of 2.6 and backported the values_only parameter. Some performance numbers using an existing file are available. In particular for anything containing datetimes openpyxl is easier to work with as well as being faster.

I'm not sure whether being able to run in parallel is useful for Pandas, nor whether this would be best using threads or processes: there is a lot of I/O but also quite a lot of CPU work.

I'll also create a separate ticket for what I'd like to see in the Pandas API.

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jan 29, 2019

@Themanwithoutaplan I've been working to decouple xlrd in this code and have implemented a base class for excel reading which you can see below:

class _BaseExcelReader(object):

Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

@Themanwithoutaplan

This comment has been minimized.

Copy link
Contributor Author

commented Jan 29, 2019

Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

Looks a lot better (especially the writer code that doesn't have to worry about the different ways we've handled styles). The reader code is definitely less tightly coupled to xlrd than it was. You will probably want to use read-only mode with openpyxl.

I notice you've got your own code for converting Excel coords to numerical indices but I think all the libraries their own robust functions you could probably use (ok, openpyxl uses 1-indexing) and your own evil code for handling headers with multiple layers. I'd really like to see this in the dataframe or index API, or anything that openpyl can use directly.

@tdamsma

This comment has been minimized.

Copy link
Contributor

commented Jan 31, 2019

Well if no one else has started this I'll see if I have some time this weekend to make an openpyxl reader. Hopefully I can reuse all the same tests, just swap out the engine.

@Themanwithoutaplan

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

Congratulations!

@selik

This comment has been minimized.

Copy link
Contributor

commented Aug 14, 2019

@WillAyd @tdamsma Looks like the docs should be updated to reflect this change. The docstring for the engine parameter says the only acceptable values are None or xlrd.

https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.read_excel.html#pandas.read_excel

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Aug 14, 2019

@selik can you open a PR or new issue for that? Otherwise it's likely to not get done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.