Support for openpyxl when reading XLSX (Excel 2010) files #11499

Themanwithoutaplan · 2015-11-01T14:58:05Z

I took a brief look at the existing code for importing files with xlrd but it looks tightly coupled with that library and not particularly straightforward to adapt.

I think that the read-only mode of openpyxl would be a good fit for Pandas and would like to work with you to provide the necessary API to go from rows to a dataframe. Data is stored by row so this is the only sensible approach. We will be adding support for NumPy types, i.e. accepting them when they are passed in, but probably not producing them when reading Excel files.

gfyoung · 2018-11-13T22:47:41Z

@Themanwithoutaplan: Sorry that this went unnoticed for a while! 😮

Is this something you would still be interested in pursuing?

Themanwithoutaplan · 2018-11-14T13:28:06Z

@gfyoung Sure. We've got a sprint coming up this weekend where I'll be working on this. I'd basically like to see the packages work together using explicit APIs rather than the current, largely unavoidable, spaghetti code.

Seems to me that a telco would make sense.

gfyoung · 2018-11-14T19:17:58Z

> I'd basically like to see the packages work together using explicit APIs rather than the current, largely unavoidable, spaghetti code.

That makes a lot of sense to me. I don't think any of us would object to making the Excel reading process a lot nicer to use and understand.

Themanwithoutaplan · 2018-11-15T16:44:44Z

I suspect that the XLS code will have to stay as it is because xlrd is now effectively abandonware. Personally, I'd think about deprecating it as a format… but anyway.

openpyxl 2.6 will add the values_only parameter to iter_rows. My current tests suggest similar performance to xlrd when reading worksheets, but with the added advantage of reading only the sheets you're interested in. Unsurprising really, because the underlying code in both cases is ElementTree plus string-to-type conversion. In addition, Pandas seems to do some heuristics to try and find relevant cell areas for those worksheets which aren't a row of headers followed by rows of data.
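A minimal sketch of what that row-based path might look like, assuming openpyxl 2.6 or later and pandas are installed. The workbook contents and sheet names are made up for illustration, and this is not how pandas itself reads the file.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a small workbook in memory so the example needs no file on disk.
wb = Workbook()
ws = wb.active
ws.title = "data"
ws.append(["name", "qty"])
ws.append(["apple", 3])
ws.append(["pear", 5])
wb.create_sheet("ignored")  # a second sheet that is never iterated below
buf = BytesIO()
wb.save(buf)
buf.seek(0)

# Read-only mode streams rows instead of building the full cell model.
wb_ro = load_workbook(buf, read_only=True)
rows = wb_ro["data"].iter_rows(values_only=True)  # plain tuples of values
header = next(rows)  # assumes the first row holds the column names
df = pd.DataFrame(list(rows), columns=list(header))
wb_ro.close()
```

The first row is treated as the header here, which only holds for sheets laid out as a row of headers followed by rows of data.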

From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

If you're around later today (I'm +0100) then we could perhaps discuss this over Hangouts or some such.

gfyoung · 2018-11-15T18:04:29Z

> From our side: it would be nice if DataFrame.columns could give richer information such as how many columns a heading is valid for. This would make writing dataframes to a stream a lot easier.

I think I see what you're saying, but I suggest opening an issue for this separately so as to explain more what you mean.

Themanwithoutaplan · 2018-11-23T15:50:21Z

FWIW I've now released an alpha of 2.6 and backported the values_only parameter. Some performance numbers using an existing file are available. In particular for anything containing datetimes openpyxl is easier to work with as well as being faster.

I'm not sure whether being able to run in parallel is useful for Pandas, nor whether this would be best using threads or processes: there is a lot of I/O but also quite a lot of CPU work.

I'll also create a separate ticket for what I'd like to see in the Pandas API.

WillAyd · 2019-01-29T04:31:08Z

@Themanwithoutaplan I've been working to decouple xlrd in this code and have implemented a base class for excel reading which you can see below:

pandas/pandas/io/excel.py

Line 379 in 3fd47fe

class _BaseExcelReader(object):

Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

Themanwithoutaplan · 2019-01-29T18:00:38Z

> Still planning to shuffle things around but I'm hoping this represents an improvement over when you first started looking at this. PRs are always welcome and if there's anything I can do to assist feel free to reach out

Looks a lot better (especially the writer code that doesn't have to worry about the different ways we've handled styles). The reader code is definitely less tightly coupled to xlrd than it was. You will probably want to use read-only mode with openpyxl.

I notice you've got your own code for converting Excel coords to numerical indices, but I think all the libraries have their own robust functions you could probably use (OK, openpyxl uses 1-indexing). You also have your own evil code for handling headers with multiple layers. I'd really like to see this in the DataFrame or Index API, or anything that openpyxl can use directly.
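To make the indexing point concrete, here is a rough standalone sketch of the kind of conversion in question. The function name excel_col_to_index is hypothetical and is not taken from pandas or openpyxl; openpyxl's own helper, openpyxl.utils.column_index_from_string, returns the 1-based equivalent.

```python
def excel_col_to_index(col: str) -> int:
    """Convert an Excel column label such as 'A' or 'AA' to a 0-based index."""
    if not col:
        raise ValueError("empty column label")
    index = 0
    for ch in col.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {col!r}")
        # Column labels are bijective base 26: A=1 ... Z=26, AA=27, ...
        index = index * 26 + (ord(ch) - ord("A") + 1)
    # Subtract 1 to move from Excel's 1-based numbering to 0-based.
    return index - 1
```

Under this scheme "A" maps to 0 and "AA" maps to 26, so a 1-based result from a library helper would need the same final subtraction.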

tdamsma · 2019-01-31T21:10:04Z

Well if no one else has started this I'll see if I have some time this weekend to make an openpyxl reader. Hopefully I can reuse all the same tests, just swap out the engine.

Themanwithoutaplan · 2019-06-28T16:55:17Z

Congratulations!

selik · 2019-08-14T20:40:47Z

@WillAyd @tdamsma Looks like the docs should be updated to reflect this change. The docstring for the engine parameter says the only acceptable values are None or xlrd.

https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.read_excel.html#pandas.read_excel
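For reference, a minimal sketch of selecting the new reader through the engine parameter, assuming pandas 0.25 or later with openpyxl installed. The in-memory workbook and its contents are made up for illustration.

```python
from io import BytesIO

import pandas as pd

# Write a small frame to an in-memory .xlsx so no file on disk is needed.
source = pd.DataFrame({"name": ["apple", "pear"], "qty": [3, 5]})
buf = BytesIO()
source.to_excel(buf, index=False, engine="openpyxl")
buf.seek(0)

# Ask read_excel for the openpyxl reader explicitly instead of xlrd.
result = pd.read_excel(buf, engine="openpyxl")
```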

TomAugspurger · 2019-08-14T21:23:35Z

@selik can you open a PR or new issue for that? Otherwise it's likely to not get done.

selik · 2019-11-22T19:38:43Z

@TomAugspurger Looks like someone took care of it, but it's not reflected in v0.25.3

jbrockmendel added the IO Excel label Jul 25, 2018

gfyoung added the Enhancement label Nov 13, 2018

WillAyd added and removed the Duplicate Report label Nov 14, 2018

WillAyd mentioned this issue Nov 14, 2018

Why we cannot use openpyxl to read excel files in pandas? #21099

Closed

WillAyd mentioned this issue Dec 25, 2018

Decouple xlrd reading from ExcelFile class #24423

Merged

WillAyd mentioned this issue Jan 31, 2019

Output excel table objects with to_xlsx() #24862

Closed

tdamsma mentioned this issue Feb 2, 2019

Openpyxl engine for reading excel files #25092

Merged

jreback added this to the Contributions Welcome milestone May 19, 2019

jorisvandenbossche mentioned this issue May 22, 2019

Please stop relying on xlrd #26487

Closed

cjw296 mentioned this issue Jun 13, 2019

BadZipFile error when using read_excel on .xlsx #26813

Closed

jreback modified the milestones: Contributions Welcome, 0.25.0 Jun 28, 2019

WillAyd closed this as completed in #25092 Jun 28, 2019
