
performance when serializing pandas DataFrames #107

Open
sdementen opened this issue Mar 11, 2021 · 11 comments
@sdementen
Contributor

The function __get_cell_data (https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Worksheet.py#L227) operates on each cell individually.
When serializing a pandas.DataFrame, most of the time each column has a single type (dtype), so the code could benefit from a "columnar" approach (instead of the row-by-row, cell-by-cell approach) to speed things up:

  • the `if` statements could be evaluated only once per column
  • the conversion to string/XML could leverage pandas' "apply/applymap"
  • ...

Have you already thought about ways to improve this by keeping the "columnar" information further down the pipeline (instead of transforming everything to cells) for DataFrames? It is quite specific, yet it is a case a lot of pandas users are hitting (slowness when exporting to Excel).
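To make the proposal concrete, here is a minimal sketch (not PyExcelerate's actual code; `serialize_column` is a hypothetical name): the type check runs once per column and the whole column is converted in one pass, instead of re-running the `if` chain for every cell.

```python
# Sketch: per-column type dispatch instead of per-cell dispatch.
import pandas as pd

def serialize_column(series: pd.Series) -> list:
    # One dtype check per column instead of one per cell.
    if pd.api.types.is_integer_dtype(series):
        return [str(v) for v in series]
    if pd.api.types.is_float_dtype(series):
        return [repr(float(v)) for v in series]
    # Fallback: treat everything else as text.
    return series.astype(str).tolist()

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]})
columns = {name: serialize_column(df[name]) for name in df.columns}
```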
@kevmo314
Collaborator

kevmo314 commented Mar 11, 2021

Can you link to some discussions showing that this is something pandas users are hitting? I would like to see use cases where users export the data directly (i.e. without touching the pandas DataFrame) rather than modifying the data frame before exporting. I have seen this come up before, so it makes me wonder whether it's worth having a "direct to xlsx" function that encourages not touching the data.

Superficially, though, the feature sounds reasonable and would be a nice pinhole optimization. We have something similar already: a bulk import stores the data as a dense table, and future edits are based off this initial table. I could see the feature here being: if the passed data table is typed, we store the column types, null them out if cells are later edited to a different type, and use this as a lookup table.
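A minimal sketch of that lookup-table idea (all names hypothetical, not PyExcelerate internals): a type is cached per column at bulk-import time, and the cache entry is invalidated when a later edit writes a value of a different type.

```python
# Hypothetical dense table with per-column type hints.
class DenseTable:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        # Column type hint: set only if every cell in the column agrees.
        self.col_types = []
        for col in zip(*self.rows):
            types = {type(v) for v in col}
            self.col_types.append(types.pop() if len(types) == 1 else None)

    def set(self, r, c, value):
        self.rows[r][c] = value
        # Null out the hint if the edit breaks type consistency.
        if self.col_types[c] is not None and type(value) is not self.col_types[c]:
            self.col_types[c] = None

table = DenseTable([[1, "a"], [2, "b"]])   # col 0: int, col 1: str
table.set(0, 0, "oops")                    # invalidates the int hint for col 0
```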

I don't think we could leverage apply/applymap, though; that is too pandas-specific and would dramatically affect how PyExcelerate stores data.

@sdementen
Contributor Author

You can Google something like https://www.google.com/search?safe=active&client=ms-android-samsung-ss&sxsrf=ALeKk0148RZ-MDnFz2BlbdOKsijf-Lnbbg%3A1615445522653&ei=Er5JYOqaJ5D7kwXBs6ToAw&q=pandas+to_excel+performance&oq=pandas+to_excel+performance

What do you mean by "modifying the data frame before exporting"?
Typically, a user (like me) works with data frames and wants to save the final result to Excel so that they or other users can simply open Excel to see the export (or offer a "download Excel" button in a web app/dashboard).

@sdementen
Contributor Author

If what you mean is that the user does not export a data frame directly to Excel, but instead writes it to Excel and then modifies the Excel file before saving, that is indeed not the case here. The user simply calls the DataFrame.to_excel function, which writes the Excel file directly, and then opens it in Excel.

@sdementen
Contributor Author

For an approach based on a dense data table plus a layer that captures changes (which is what you do currently, if I read correctly between your lines), it would be great to also allow a data frame as the "dense table"; but for the specific case of offering a faster alternative to the to_excel function, that is not required from day one.
I like your thinking in terms of fastest/faster/fast/slow, with a graceful degradation of performance depending on which features the user needs. Here it would look more like an ultra-fast tier (works only on data frames) that falls back to the fastest track if the user needs more than the standard Excel export feature for DataFrames. In that case, the data frames could simply be converted to the current dense table and use the current system.

@sdementen
Contributor Author

Some discussion on the topic of Excel/Python performance: tfussell/xlnt#184

@kevmo314
Collaborator

If what you mean is that the user does not export a data frame directly to Excel, but instead writes it to Excel and then modifies the Excel file before saving, that is indeed not the case here. The user simply calls the DataFrame.to_excel function, which writes the Excel file directly, and then opens it in Excel.

My understanding is that pandas doesn't support PyExcelerate as a writing engine?

@sdementen
Contributor Author

Indeed, pandas does not support PyExcelerate as an engine.
I have written a simple to_excel-like function based on PyExcelerate that is more than twice as fast as pandas' best engine, and I would hope that with some tweaking of PyExcelerate we could speed this up by another factor of 2. Preparing an Excel file in 2 seconds instead of 8 makes a huge difference in user experience (for instance when the user clicks a "download Excel" button in an app or web app).
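The to_excel-like function described here isn't shown in the thread; a minimal sketch of what such a wrapper might look like follows (`dataframe_to_rows` is an assumed helper name; `Workbook`/`new_sheet`/`save` are PyExcelerate's public API). Only the pandas-to-rows step is executed below; the PyExcelerate call is shown as a comment since the library may not be installed.

```python
# Sketch: flatten a DataFrame into the row-of-lists shape PyExcelerate accepts.
import pandas as pd

def dataframe_to_rows(df: pd.DataFrame) -> list:
    # Header row first, then the data rows as plain Python lists.
    rows = [list(df.columns)]
    rows.extend(list(t) for t in df.itertuples(index=False, name=None))
    return rows

# Writing the rows would then look like:
#   from pyexcelerate import Workbook
#   wb = Workbook()
#   wb.new_sheet("Sheet1", data=dataframe_to_rows(df))
#   wb.save("out.xlsx")
```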

@kevmo314
Collaborator

I am not too keen on adding pandas-specific functionality because PyExcelerate isn't an officially supported engine. Ultimately PyExcelerate is used a lot outside pandas as well, and pandas is a very heavy dependency to integrate for a use case that they do not even directly support.

I think adding type hints for columns is appropriate; I'll look into a way to add those soon. But given that pandas does not support PyExcelerate, going in that direction would pigeonhole us too much into a specific use case.

@sdementen
Contributor Author

OK.
In parallel, I will try a proof of concept with pandas to see if it brings the speed benefits I expect, and I will keep you updated.

@sdementen
Contributor Author

A quick test (not 100% bulletproof) shows a 7.5x speedup (the final number may be a bit lower) when taking a columnar approach and leveraging pandas.apply. Very promising, and a game changer for pandas Excel exports!
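The test itself isn't included in the thread; a rough illustration of this kind of measurement follows (the 7.5x figure is the commenter's and is not reproduced here): the same string conversion done cell by cell versus once per column via Series.apply.

```python
# Sketch: timing cell-by-cell access against a columnar Series.apply pass.
import timeit
import pandas as pd

df = pd.DataFrame({"x": range(10_000)})

def per_cell():
    # Touch every cell individually, as a per-cell serializer would.
    return [[str(df.iat[i, j]) for j in range(df.shape[1])]
            for i in range(df.shape[0])]

def per_column():
    # Convert the whole column in one vectorized-ish pass.
    return df["x"].apply(str).tolist()

cell_time = timeit.timeit(per_cell, number=3)
col_time = timeit.timeit(per_column, number=3)
# The columnar pass is typically several times faster; the exact ratio
# depends on the machine, the dtypes, and the conversion being applied.
```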

@kevmo314
Collaborator

kevmo314 commented Mar 13, 2021

Right, I understand, but PyExcelerate is not a pandas to Excel library. Additionally, the pandas team has indicated they don't wish to add PyExcelerate support. I am not particularly inclined to couple PyExcelerate to pandas for a very specific use case especially without any official pandas integration.

As I mentioned above, we can add column type hints, but the more I think about it, the less useful it seems: column headers will break the type consistency of the data table. Additionally, the type hint has to be computed at data-insertion time anyway, so it wouldn't yield much speedup.

I'm more than happy to work with the pandas team to add integration, but without their support this isn't a good idea for the library despite the promising speedup.
