Read XLSX and XLSB files natively #11181

SaelKimberly · 2023-09-18T13:48:34Z

Description

Hello!

This project is absolutely amazing and I really appreciate when I get notifications about updates :)
So, I created a small python module to read XLSX and XLSB initially, in pure python, into PyArrow arrays, and then into dataframe, Polars or Pandas.
This is just a module concept, but it already works on my machine. It is very fast and consumes very little RAM
I started working on this when Polars didn't have built-in Excel support - only via xlsx2csv.
Firstly, I've also worked with openpyxl and it just works, but I've found that it's very slow for me and sometimes very inefficient.
I would like to participate in your project by investing a little with my code or idea.
The module is codenamed RXLS (Read XLS[X|B]), and although I have left almost no comments there, I am ready to answer any questions. There may be some issues, and I will resolve them all...
I also worked with Rust at one time, but that was quite a long time ago.

https://github.com/SaelKimberly/rxls

Thank you very much, I really appreciate your work 😃

alexander-beedie · 2023-09-18T14:00:41Z

Can you provide some more details? For example, benchmarks comparing some reasonable sized loads using your code vs xlsx2csv & openpyxl would be useful.

Assuming you are planning to put your module up on pypi when it is finished, we could perhaps add it as one of the optional engines that our existing read_excel method knows about (currently the two above, alongside a new read_ods function for dealing with OpenOffice spreadsheets). Would be good to see it get some traction or have a compelling speed advantage first though ;)

avimallu · 2023-09-18T16:57:11Z

@SaelKimberly, if your package can read XLSB rather fast, you should seriously consider making it available on pypi. There's no package that isn't painfully slow in reading XLSB, and it would be a game changer, even outside Polars and high performance DataFrame communities.

SaelKimberly · 2023-09-18T19:08:39Z

Sure. I will provide some benchmarks, but it will take a couple of days.

At this point, on my pc, I scanned ~30 files 30x12400 with average rate about 3200 rows per second (just used tqdm). Sometimes about 4000 rows/sec.

Strange but true - I have not any benchmark! Also interested now :)

The main idea of my project - to read data from excel files to columnar format directly, there is no analogue in the web (I haven't found them).

Cells are stored in pyarrow.StringArrays, and their original type just increase type counter in it's column. Empty space between cells fills with nulls. After filling stage, type of column will be inferred, using most frequent datatype - just convert whole column then...

I've never had time to compare speed and memory usage of my solution with others, but I will solve it as soon as possible :)

SaelKimberly · 2023-09-19T07:26:01Z

I've create simple benchmark. Results are optimistic 😏

https://github.com/SaelKimberly/rxls/blob/main/benchmarks.ipynb

avimallu · 2023-09-19T14:09:04Z

I really like the benchmark tests you've written - they seem rather comprehensive. Can you perhaps change the file you're testing with to be larger than ~10,000 rows? I don't use Excel as frequently now, but when I did, I do remember reaching around 100,000 to 200,000 rows (not great use of Excel, I'm aware, but I'm the person who needed to process those files, not make them) - a speed advantage at that scale might actually demonstrate the benchmarks that @alexander-beedie is looking for.

SaelKimberly · 2023-09-19T15:45:03Z

Yes, I will try to find better datasource. OWID, maybe 🤔

SaelKimberly · 2023-09-21T07:06:45Z

So, I've rewrited benchmarks for a much larger dataset from OWID (COVID-19 cases for USA, GBR, CAN, DEU, ITA, IND).

341,376 rows, 67 columns

Maybe too large ;)

My RXLS take 2nd place by speed of XLSB parsing, in comparison with PyXLSB 🤣

But, I think, my solution is much more stronger, because of dates correct recognition and relatively small memory footprint.
Also, there is many optimisation points, I think.

In the future, I would like to rewrite RXLS to Rust, maybe it's not difficult. XLSX - with zip-rs and some kind of fast streaming xml parser, XLSB - with zip-rs and simple BIFF12 parser in Rust, same as my BIFF12 parser 😃.

Maybe, Polars openpyxl adapter needs some optimisations - see benchmark results 🤔

alexander-beedie · 2023-09-21T20:17:57Z

Maybe, Polars openpyxl adapter needs some optimisations

Well, their core speed is up to them, heh. We always open workbooks with openpyxl using the "data_only=True" mode, but the "read_only" mode (strangely) disables several other features that we actually make use of.

As it is available (and seems to scale reasonably well) I'll probably integrate support for the pyxlsb engine for now; if you continue to speed up your engine (and add it to pypi when you're happy with it) then we can consider adding yours too (parsing datetimes/dates properly would be a really nice win!) 👍

SaelKimberly · 2024-01-20T12:22:26Z

Hello!

There is some news about my RXLS engine :)

Finally, I've upload it on PyPi, so it's ready for testing 😃

Also, I've update benchmarks, and upload original datasets...

SaelKimberly added the enhancement New feature or an improvement of an existing feature label Sep 18, 2023

alexander-beedie added the accepted Ready for implementation label Sep 22, 2023

alexander-beedie self-assigned this Sep 22, 2023

alexander-beedie mentioned this issue Sep 22, 2023

feat(python): add "pyxlsb" engine support to read_excel (for excel binary workbook files) #11248

Merged

stinodego closed this as completed in #11248 Sep 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read XLSX and XLSB files natively #11181

Read XLSX and XLSB files natively #11181

SaelKimberly commented Sep 18, 2023

alexander-beedie commented Sep 18, 2023 •

edited

avimallu commented Sep 18, 2023

SaelKimberly commented Sep 18, 2023

SaelKimberly commented Sep 19, 2023

avimallu commented Sep 19, 2023

SaelKimberly commented Sep 19, 2023

SaelKimberly commented Sep 21, 2023

alexander-beedie commented Sep 21, 2023 •

edited

SaelKimberly commented Jan 20, 2024

Read XLSX and XLSB files natively #11181

Read XLSX and XLSB files natively #11181

Comments

SaelKimberly commented Sep 18, 2023

Description

alexander-beedie commented Sep 18, 2023 • edited

avimallu commented Sep 18, 2023

SaelKimberly commented Sep 18, 2023

SaelKimberly commented Sep 19, 2023

avimallu commented Sep 19, 2023

SaelKimberly commented Sep 19, 2023

SaelKimberly commented Sep 21, 2023

alexander-beedie commented Sep 21, 2023 • edited

SaelKimberly commented Jan 20, 2024

alexander-beedie commented Sep 18, 2023 •

edited

alexander-beedie commented Sep 21, 2023 •

edited