Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some benchmark optimisations #13

Closed
wants to merge 1 commit into from
Closed

Some benchmark optimisations #13

wants to merge 1 commit into from

Conversation

jmcnamara
Copy link
Contributor

The ws[row][col] changes in the run_pyexcelerate_optimization formatting benchmark slow it down significantly. Probably due to the fact that it is creating a Range object for each cell. Using the set_cell_value() method improves the performance.

The majority of the rest of the slowdown could probably be avoided by using similar methods for setting the cell styles. (I see that there are some methods available but I didn't try them).

Also, the run_xlsxwriter_optimization() benchmark is sub-optimal since it creates a format object each time through the inner loop. I've modified that benchmark to reflect how it would be done in practice by moving the format creation out of the loop.

@jmcnamara
Copy link
Contributor Author

Also, in the run_pyexcelerate() benchmark I think it would be fairer to use row/column loops and set the value using set_cell_value() like in run_pyexcelerate_optimization .

This would only be slightly slower but would reflect that fact that in the existing benchmark the data array is created outside the loop and that it only contains references rather than values so it's creation time is very small. In the comparable benchmarks the data is create inside the loops.

Or perhaps, this is a valid usecase if the user has preprepared data. Your call.

@kevmo314
Copy link
Collaborator

With regards to the *_optimization benchmarks, they're intended to test how effectively the packages can optimize poorly supplied code, as not every user (and probably lesser so outside of programmers) will realize that moving the formatting outside would be faster.

I've found the source of the slowdown -- the range object itself is highly optimized and isn't the issue, rather it's the compression of styles. A primary difficulty is that identifying duplicate styles is comparably expensive.

@jmcnamara
Copy link
Contributor Author

not every user (and probably lesser so outside of programmers) will realize that moving the formatting outside would be faster

Either way creating 65,000 unnecessary objects in an inner loop isn't a realistic benchmark.

@kevmo314
Copy link
Collaborator

Also, your pull request does not produce the same output as the original code. Consider a bitmask of BOLD | ITALIC | UNDERLINE, your cell_format will equal underline, overwriting bold and italic.

Images attached of the output files.

pyexcelerate
xlsxwriter

@jmcnamara
Copy link
Contributor Author

your pull request does not produce the same output as the original code. Consider a bitmask of BOLD | ITALIC | UNDERLINE, your cell_format will equal underline, overwriting bold and italic

Ok. Fixed and pushed.

@kevmo314
Copy link
Collaborator

That's an O(2^n) solution. Are you sure you'd rather use that? O(2^n) will grow much faster than O(r*c).

@jmcnamara
Copy link
Contributor Author

That's an O(2^n) solution. Are you sure you'd rather use that? O(2^n) will grow much faster than O(r*c).

In this case n is small and bounded so it should be fine.

Do you have any benchmarks to share from the extended benchmark tests.

@kevmo314
Copy link
Collaborator

Small, yes, but bounded? Most definitely not. The number of potential style combinations is infinite, consider font sizes, font families, and text rotation, on top of 256^3 background colors.

I could easily add another sixteen styles and 2^n would exceed r*c by a rather large amount, and sixteen isn't even really all that much. The Excel default styles consist of 47 styles already, six of which are non-overlapping. Background color, 4 borders, font color, bold, italic, underline, text rotation, font size, font family, and we're already at 2^12 style combinations minimum.

Not yet, I'm still working on a couple of speed improvements with the style compressor.

@jmcnamara
Copy link
Contributor Author

Small, yes, but bounded? Most definitely not. The number of potential style combinations is infinite, consider font sizes, font families, and text rotation, on top of 256^3 background colors.

Theoretically yes, but Excel only allows 64, 000 unique formats. There are some other limitations as well on the number of fills and number formats.

Not yet, I'm still working on a couple of speed improvements with the style compressor.

Good stuff. The Pandas guys are looking to add some additional Excel writer engines as output options and and want to integrate PyExcelerator for speed.

@kevmo314
Copy link
Collaborator

Theoretically yes, but Excel only allows 64, 000 unique formats. There are some other limitations as well on the number of fills and number formats.

Yes, but in a scope of infinite styles, if I randomly picked 64000 with no known prior distribution and applied them to n > 64k cells, you wouldn't know each style until it was applied. Surely, you can't expect the user to know that they should generate the 64k styles before applying them -- that would be offloading the work of style compression to the user, when style compression itself is a functionality of the library, not to mention the difficulty of actually managing the 64k styles because now that has to be done on your own.

Of course, if they're seasoned programmers, they'll think of that, but especially with Pandas, you're looking at mostly intro programmers who are experts in other fields that think in a much different way. I'd guess that a very large chunk of non-cs researchers don't know what a bitmask is, let alone how to apply it in this context. The library should really be handling all of this to make the code as intuitive as possible.

But in any case, we can do that too, I just wouldn't consider it a valid solution as it's a ludicrous expectation of the user:

pyexcelerate value fastest, 650, 100, 0.6847564425087547
pyexcelerate value faster, 650, 100, 0.7485500950539803
pyexcelerate value fast, 650, 100, 2.153594451250089
xlsxwriter, 650, 100, 1.2844041630989418
pyexcelerate style cheating, 650, 100, 1.3979011997334858
pyexcelerate style fastest, 650, 100, 2.9789260227207652
pyexcelerate style faster, 650, 100, 3.5984172569418735
pyexcelerate style fast, 650, 100, 8.74910337766392
xlsxwriter style cheating, 650, 100, 1.6390989636951083
xlsxwriter style, 650, 100, 7.596218653488084
.
----------------------------------------------------------------------
Ran 1 test in 31.239s

OK

@jmcnamara
Copy link
Contributor Author

Ok. Thanks.

@jmcnamara jmcnamara closed this Oct 11, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants