New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read XLSX and XLSB files natively #11181
Comments
Can you provide some more details? For example, benchmarks comparing some reasonable sized loads using your code vs Assuming you are planning to put your module up on |
@SaelKimberly, if your package can read XLSB rather fast, you should seriously consider making it available on |
Sure. I will provide some benchmarks, but it will take a couple of days. At this point, on my pc, I scanned ~30 files 30x12400 with average rate about 3200 rows per second (just used tqdm). Sometimes about 4000 rows/sec. Strange but true - I have not any benchmark! Also interested now :) The main idea of my project - to read data from excel files to columnar format directly, there is no analogue in the web (I haven't found them). Cells are stored in pyarrow.StringArrays, and their original type just increase type counter in it's column. Empty space between cells fills with nulls. After filling stage, type of column will be inferred, using most frequent datatype - just convert whole column then... I've never had time to compare speed and memory usage of my solution with others, but I will solve it as soon as possible :) |
I've create simple benchmark. Results are optimistic 😏 https://github.com/SaelKimberly/rxls/blob/main/benchmarks.ipynb |
I really like the benchmark tests you've written - they seem rather comprehensive. Can you perhaps change the file you're testing with to be larger than ~10,000 rows? I don't use Excel as frequently now, but when I did, I do remember reaching around 100,000 to 200,000 rows (not great use of Excel, I'm aware, but I'm the person who needed to process those files, not make them) - a speed advantage at that scale might actually demonstrate the benchmarks that @alexander-beedie is looking for. |
Yes, I will try to find better datasource. OWID, maybe 🤔 |
So, I've rewrited benchmarks for a much larger dataset from OWID (COVID-19 cases for USA, GBR, CAN, DEU, ITA, IND). 341,376 rows, 67 columns Maybe too large ;) My RXLS take 2nd place by speed of XLSB parsing, in comparison with PyXLSB 🤣 But, I think, my solution is much more stronger, because of dates correct recognition and relatively small memory footprint. In the future, I would like to rewrite RXLS to Rust, maybe it's not difficult. XLSX - with zip-rs and some kind of fast streaming xml parser, XLSB - with zip-rs and simple BIFF12 parser in Rust, same as my BIFF12 parser 😃. Maybe, Polars openpyxl adapter needs some optimisations - see benchmark results 🤔 |
Well, their core speed is up to them, heh. We always open workbooks with As it is available (and seems to scale reasonably well) I'll probably integrate support for the |
Hello! There is some news about my RXLS engine :) Finally, I've upload it on PyPi, so it's ready for testing 😃 Also, I've update benchmarks, and upload original datasets... |
Description
Hello!
This project is absolutely amazing and I really appreciate when I get notifications about updates :)
So, I created a small python module to read XLSX and XLSB initially, in pure python, into PyArrow arrays, and then into dataframe, Polars or Pandas.
This is just a module concept, but it already works on my machine. It is very fast and consumes very little RAM
I started working on this when Polars didn't have built-in Excel support - only via xlsx2csv.
Firstly, I've also worked with openpyxl and it just works, but I've found that it's very slow for me and sometimes very inefficient.
I would like to participate in your project by investing a little with my code or idea.
The module is codenamed RXLS (Read XLS[X|B]), and although I have left almost no comments there, I am ready to answer any questions. There may be some issues, and I will resolve them all...
I also worked with Rust at one time, but that was quite a long time ago.
https://github.com/SaelKimberly/rxls
Thank you very much, I really appreciate your work 😃
The text was updated successfully, but these errors were encountered: