New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some benchmark optimisations #13
Conversation
Also, in the This would only be slightly slower but would reflect that fact that in the existing benchmark the data array is created outside the loop and that it only contains references rather than values so it's creation time is very small. In the comparable benchmarks the data is create inside the loops. Or perhaps, this is a valid usecase if the user has preprepared data. Your call. |
With regards to the I've found the source of the slowdown -- the range object itself is highly optimized and isn't the issue, rather it's the compression of styles. A primary difficulty is that identifying duplicate styles is comparably expensive. |
Either way creating 65,000 unnecessary objects in an inner loop isn't a realistic benchmark. |
Ok. Fixed and pushed. |
That's an O(2^n) solution. Are you sure you'd rather use that? O(2^n) will grow much faster than O(r*c). |
In this case n is small and bounded so it should be fine. Do you have any benchmarks to share from the extended benchmark tests. |
Small, yes, but bounded? Most definitely not. The number of potential style combinations is infinite, consider font sizes, font families, and text rotation, on top of 256^3 background colors. I could easily add another sixteen styles and 2^n would exceed r*c by a rather large amount, and sixteen isn't even really all that much. The Excel default styles consist of 47 styles already, six of which are non-overlapping. Background color, 4 borders, font color, bold, italic, underline, text rotation, font size, font family, and we're already at 2^12 style combinations minimum. Not yet, I'm still working on a couple of speed improvements with the style compressor. |
Theoretically yes, but Excel only allows 64, 000 unique formats. There are some other limitations as well on the number of fills and number formats.
Good stuff. The Pandas guys are looking to add some additional Excel writer engines as output options and and want to integrate PyExcelerator for speed. |
Yes, but in a scope of infinite styles, if I randomly picked 64000 with no known prior distribution and applied them to n > 64k cells, you wouldn't know each style until it was applied. Surely, you can't expect the user to know that they should generate the 64k styles before applying them -- that would be offloading the work of style compression to the user, when style compression itself is a functionality of the library, not to mention the difficulty of actually managing the 64k styles because now that has to be done on your own. Of course, if they're seasoned programmers, they'll think of that, but especially with Pandas, you're looking at mostly intro programmers who are experts in other fields that think in a much different way. I'd guess that a very large chunk of non-cs researchers don't know what a bitmask is, let alone how to apply it in this context. The library should really be handling all of this to make the code as intuitive as possible. But in any case, we can do that too, I just wouldn't consider it a valid solution as it's a ludicrous expectation of the user:
|
Ok. Thanks. |
The
ws[row][col]
changes in therun_pyexcelerate_optimization
formatting benchmark slow it down significantly. Probably due to the fact that it is creating a Range object for each cell. Using theset_cell_value()
method improves the performance.The majority of the rest of the slowdown could probably be avoided by using similar methods for setting the cell styles. (I see that there are some methods available but I didn't try them).
Also, the
run_xlsxwriter_optimization()
benchmark is sub-optimal since it creates aformat
object each time through the inner loop. I've modified that benchmark to reflect how it would be done in practice by moving the format creation out of the loop.