Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read XLSX and XLSB files natively #11181

Closed
SaelKimberly opened this issue Sep 18, 2023 · 9 comments · Fixed by #11248
Closed

Read XLSX and XLSB files natively #11181

SaelKimberly opened this issue Sep 18, 2023 · 9 comments · Fixed by #11248
Assignees
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@SaelKimberly
Copy link

Description

Hello!

This project is absolutely amazing and I really appreciate when I get notifications about updates :)
So, I created a small python module to read XLSX and XLSB initially, in pure python, into PyArrow arrays, and then into dataframe, Polars or Pandas.
This is just a module concept, but it already works on my machine. It is very fast and consumes very little RAM
I started working on this when Polars didn't have built-in Excel support - only via xlsx2csv.
Firstly, I've also worked with openpyxl and it just works, but I've found that it's very slow for me and sometimes very inefficient.
I would like to participate in your project by investing a little with my code or idea.
The module is codenamed RXLS (Read XLS[X|B]), and although I have left almost no comments there, I am ready to answer any questions. There may be some issues, and I will resolve them all...
I also worked with Rust at one time, but that was quite a long time ago.

https://github.com/SaelKimberly/rxls

Thank you very much, I really appreciate your work 😃

@SaelKimberly SaelKimberly added the enhancement New feature or an improvement of an existing feature label Sep 18, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 18, 2023

Can you provide some more details? For example, benchmarks comparing some reasonable sized loads using your code vs xlsx2csv & openpyxl would be useful.

Assuming you are planning to put your module up on pypi when it is finished, we could perhaps add it as one of the optional engines that our existing read_excel method knows about (currently the two above, alongside a new read_ods function for dealing with OpenOffice spreadsheets). Would be good to see it get some traction or have a compelling speed advantage first though ;)

@avimallu
Copy link
Contributor

@SaelKimberly, if your package can read XLSB rather fast, you should seriously consider making it available on pypi. There's no package that isn't painfully slow in reading XLSB, and it would be a game changer, even outside Polars and high performance DataFrame communities.

@SaelKimberly
Copy link
Author

Sure. I will provide some benchmarks, but it will take a couple of days.

At this point, on my pc, I scanned ~30 files 30x12400 with average rate about 3200 rows per second (just used tqdm). Sometimes about 4000 rows/sec.

Strange but true - I have not any benchmark! Also interested now :)

The main idea of my project - to read data from excel files to columnar format directly, there is no analogue in the web (I haven't found them).

Cells are stored in pyarrow.StringArrays, and their original type just increase type counter in it's column. Empty space between cells fills with nulls. After filling stage, type of column will be inferred, using most frequent datatype - just convert whole column then...

I've never had time to compare speed and memory usage of my solution with others, but I will solve it as soon as possible :)

@SaelKimberly
Copy link
Author

I've create simple benchmark. Results are optimistic 😏

https://github.com/SaelKimberly/rxls/blob/main/benchmarks.ipynb

@avimallu
Copy link
Contributor

I really like the benchmark tests you've written - they seem rather comprehensive. Can you perhaps change the file you're testing with to be larger than ~10,000 rows? I don't use Excel as frequently now, but when I did, I do remember reaching around 100,000 to 200,000 rows (not great use of Excel, I'm aware, but I'm the person who needed to process those files, not make them) - a speed advantage at that scale might actually demonstrate the benchmarks that @alexander-beedie is looking for.

@SaelKimberly
Copy link
Author

Yes, I will try to find better datasource. OWID, maybe 🤔

@SaelKimberly
Copy link
Author

So, I've rewrited benchmarks for a much larger dataset from OWID (COVID-19 cases for USA, GBR, CAN, DEU, ITA, IND).

341,376 rows, 67 columns

Maybe too large ;)

My RXLS take 2nd place by speed of XLSB parsing, in comparison with PyXLSB 🤣

But, I think, my solution is much more stronger, because of dates correct recognition and relatively small memory footprint.
Also, there is many optimisation points, I think.

In the future, I would like to rewrite RXLS to Rust, maybe it's not difficult. XLSX - with zip-rs and some kind of fast streaming xml parser, XLSB - with zip-rs and simple BIFF12 parser in Rust, same as my BIFF12 parser 😃.

Maybe, Polars openpyxl adapter needs some optimisations - see benchmark results 🤔

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 21, 2023

Maybe, Polars openpyxl adapter needs some optimisations

Well, their core speed is up to them, heh. We always open workbooks with openpyxl using the "data_only=True" mode, but the "read_only" mode (strangely) disables several other features that we actually make use of.

As it is available (and seems to scale reasonably well) I'll probably integrate support for the pyxlsb engine for now; if you continue to speed up your engine (and add it to pypi when you're happy with it) then we can consider adding yours too (parsing datetimes/dates properly would be a really nice win!) 👍

@SaelKimberly
Copy link
Author

Hello!

There is some news about my RXLS engine :)

Finally, I've upload it on PyPi, so it's ready for testing 😃

Also, I've update benchmarks, and upload original datasets...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
3 participants