performance when serializing pandas DataFrames #107
Can you link to some discussions showing that this is something pandas users are hitting? I would like to see use cases where users are directly exporting data (i.e. not touching the pandas DataFrame) rather than modifying the data frame before exporting. I have seen this come up before, so it makes me wonder if it's worth having a "direct to xlsx" function that encourages not touching the data. Superficially the feature sounds reasonable, though, and a nice pinhole optimization.

We have something similar where a bulk import of data stores the data as a dense table, and future edits are then based off this initial table. I could see the feature here being: if the passed data table is typed, then we store the column types and null them out if cells are edited with a different type, then use this as a lookup table. I don't think we would be able to leverage apply/applymap, though; that is too specific and would dramatically affect how PyExcelerate stores data.
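The dense-table-with-type-hints idea described above could be sketched roughly like this. Note this is a hypothetical illustration, not PyExcelerate's actual internals; the `DenseTable` class and its behavior are invented for the example:

```python
# Hypothetical sketch: a dense table that caches one type hint per column
# and nulls it out when an edit introduces a value of a different type.

class DenseTable:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        # One cached type per column, or None if the column is mixed.
        self.col_types = [
            type(col[0]) if all(isinstance(v, type(col[0])) for v in col) else None
            for col in zip(*rows)
        ]

    def set_cell(self, r, c, value):
        self.rows[r][c] = value
        # Invalidate the hint if the new value breaks type homogeneity.
        if self.col_types[c] is not None and not isinstance(value, self.col_types[c]):
            self.col_types[c] = None

t = DenseTable([[1, "a"], [2, "b"]])
print(t.col_types)        # one hint per column for this input
t.set_cell(0, 0, "oops")  # an edit of a different type nulls the first hint
print(t.col_types)
```

A serializer could then dispatch once per column on `col_types` and only fall back to per-cell inspection for columns whose hint is `None`.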
What do you mean by "modifying the data frame before exporting"?
If what you mean is that the user does not export a data frame directly to Excel, but instead writes it to Excel and then modifies the file before saving, that is indeed not the case here. The user simply calls the DataFrame.to_excel function, which writes the Excel file directly, and then opens it in Excel.
For an approach based on a dense data table plus a layer to capture the changes (which is what you do currently, if I read correctly between your lines), it would be great to have it also use a data frame as the "dense table". Yet for the specific case of offering a faster alternative to the to_excel function, that is not required from day one.
Some discussion on the topic of excel/python performance tfussell/xlnt#184 |
My understanding is that pandas doesn't support PyExcelerate as a writing engine?
Indeed, pandas does not support PyExcelerate as an engine.
I am not too keen on adding pandas-specific functionality because PyExcelerate isn't an officially supported engine. Ultimately PyExcelerate is used a lot outside pandas as well, and pandas is a very heavy dependency to integrate for a use case that they do not even directly support. I think adding type hints for columns is appropriate, and I'll look into a way to add those soon. But given that pandas does not support PyExcelerate, going in that direction would pigeonhole us too much into a specific use case.
Ok.
A quick test (not 100% bulletproof) shows a 7.5x speedup (reality when finished may be a bit lower) when taking a columnar approach and leveraging pandas.apply. Very promising, and a game changer for pandas Excel exports!
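For reference, the gist of such a comparison can be reproduced with a rough stdlib-only microbenchmark (illustrative only; the 7.5x figure above came from the commenter's own pandas-based test, and both functions here are invented for the example):

```python
import timeit

# Homogeneous float columns, simulating a typical numeric DataFrame.
data = [[float(i)] * 20 for i in range(2000)]

def per_cell(rows):
    # Cell-by-cell: the type check is repeated for every single cell.
    return [[str(v) if isinstance(v, float) else v for v in row] for row in rows]

def per_column(rows):
    # Columnar: pick the converter once per column, then map it in bulk.
    cols = list(zip(*rows))
    convs = [str if isinstance(c[0], float) else (lambda v: v) for c in cols]
    new_cols = [list(map(f, c)) for f, c in zip(convs, cols)]
    return [list(r) for r in zip(*new_cols)]

# Both paths must produce identical output.
assert per_cell(data) == per_column(data)

t_cell = timeit.timeit(lambda: per_cell(data), number=20)
t_col = timeit.timeit(lambda: per_column(data), number=20)
print(f"per-cell: {t_cell:.3f}s  per-column: {t_col:.3f}s")
```

The exact ratio depends on the data and the converters; the point is that the per-type dispatch moves from O(cells) to O(columns).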
Right, I understand, but PyExcelerate is not a pandas-to-Excel library. Additionally, the pandas team has indicated they don't wish to add PyExcelerate support. I am not particularly inclined to couple PyExcelerate to pandas for a very specific use case, especially without any official pandas integration.

As I mentioned above, we can add column type hints, but as I think about this, I think it will be less useful than it seems because column headers will ruin the type consistency of the data table. Additionally, the type hint has to be computed at data insertion time anyway, so it wouldn't yield much speedup. I'm more than happy to work with the pandas team to add integration, but without their support this isn't a good idea for the library, despite the promising speedup.
The function __get_cell_data (https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Worksheet.py#L227) operates on each cell individually.

When serializing a pandas.DataFrame, most of the time the columns are of a single type (dtype) and could benefit from a "columnar" approach (instead of the row-by-row, cell-by-cell approach) to speed things up.

Have you already thought about ways to improve this by keeping the "columnar" info further down the pipe (vs. transforming everything to cells) for DataFrames? It is quite specific, yet it is a case many pandas users are hitting (slowness in exporting to Excel).
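To make the "columnar info further down the pipe" idea concrete, here is a minimal stdlib-only sketch. Everything in it (`column_tag`, `serialize`, the per-type branches) is hypothetical, not PyExcelerate's or pandas' API; it just shows dispatching once per homogeneous column instead of once per cell:

```python
# Hypothetical columnar fast path: tag each homogeneous column once and
# reuse that tag for all of its cells, instead of inspecting every cell.

from datetime import date

def column_tag(col):
    """Return a single type tag if the column is homogeneous, else None."""
    first = type(col[0])
    return first if all(type(v) is first for v in col) else None

def serialize(rows):
    cols = list(zip(*rows))
    out_cols = []
    for col in cols:
        tag = column_tag(col)
        if tag in (int, float):
            out_cols.append([repr(v) for v in col])        # numeric path
        elif tag is date:
            out_cols.append([v.isoformat() for v in col])  # date path
        else:
            out_cols.append([str(v) for v in col])         # per-cell fallback
    # Transpose back to row-major order for writing.
    return [list(r) for r in zip(*out_cols)]

rows = [[1, date(2020, 1, 2)], [2, date(2020, 1, 3)]]
print(serialize(rows))
```

With a DataFrame as input, the tags would come for free from the column dtypes rather than from scanning the values.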