Some benchmark optimisations #13

jmcnamara · 2013-10-10T18:20:59Z

The ws[row][col] changes in the run_pyexcelerate_optimization formatting benchmark slow it down significantly. Probably due to the fact that it is creating a Range object for each cell. Using the set_cell_value() method improves the performance.

The majority of the rest of the slowdown could probably be avoided by using similar methods for setting the cell styles. (I see that there are some methods available but I didn't try them).

Also, the run_xlsxwriter_optimization() benchmark is sub-optimal since it creates a format object each time through the inner loop. I've modified that benchmark to reflect how it would be done in practice by moving the format creation out of the loop.

jmcnamara · 2013-10-10T18:28:07Z

Also, in the run_pyexcelerate() benchmark I think it would be fairer to use row/column loops and set the value using set_cell_value() like in run_pyexcelerate_optimization .

This would only be slightly slower but would reflect that fact that in the existing benchmark the data array is created outside the loop and that it only contains references rather than values so it's creation time is very small. In the comparable benchmarks the data is create inside the loops.

Or perhaps, this is a valid usecase if the user has preprepared data. Your call.

kevmo314 · 2013-10-10T18:39:43Z

With regards to the *_optimization benchmarks, they're intended to test how effectively the packages can optimize poorly supplied code, as not every user (and probably lesser so outside of programmers) will realize that moving the formatting outside would be faster.

I've found the source of the slowdown -- the range object itself is highly optimized and isn't the issue, rather it's the compression of styles. A primary difficulty is that identifying duplicate styles is comparably expensive.

jmcnamara · 2013-10-10T18:50:56Z

not every user (and probably lesser so outside of programmers) will realize that moving the formatting outside would be faster

Either way creating 65,000 unnecessary objects in an inner loop isn't a realistic benchmark.

kevmo314 · 2013-10-10T19:37:36Z

Also, your pull request does not produce the same output as the original code. Consider a bitmask of BOLD | ITALIC | UNDERLINE, your cell_format will equal underline, overwriting bold and italic.

Images attached of the output files.

jmcnamara · 2013-10-10T23:05:48Z

your pull request does not produce the same output as the original code. Consider a bitmask of BOLD | ITALIC | UNDERLINE, your cell_format will equal underline, overwriting bold and italic

Ok. Fixed and pushed.

kevmo314 · 2013-10-11T00:07:09Z

That's an O(2^n) solution. Are you sure you'd rather use that? O(2^n) will grow much faster than O(r*c).

jmcnamara · 2013-10-11T09:37:07Z

That's an O(2^n) solution. Are you sure you'd rather use that? O(2^n) will grow much faster than O(r*c).

In this case n is small and bounded so it should be fine.

Do you have any benchmarks to share from the extended benchmark tests.

kevmo314 · 2013-10-11T14:10:15Z

Small, yes, but bounded? Most definitely not. The number of potential style combinations is infinite, consider font sizes, font families, and text rotation, on top of 256^3 background colors.

I could easily add another sixteen styles and 2^n would exceed r*c by a rather large amount, and sixteen isn't even really all that much. The Excel default styles consist of 47 styles already, six of which are non-overlapping. Background color, 4 borders, font color, bold, italic, underline, text rotation, font size, font family, and we're already at 2^12 style combinations minimum.

Not yet, I'm still working on a couple of speed improvements with the style compressor.

jmcnamara · 2013-10-11T14:27:33Z

Small, yes, but bounded? Most definitely not. The number of potential style combinations is infinite, consider font sizes, font families, and text rotation, on top of 256^3 background colors.

Theoretically yes, but Excel only allows 64, 000 unique formats. There are some other limitations as well on the number of fills and number formats.

Not yet, I'm still working on a couple of speed improvements with the style compressor.

Good stuff. The Pandas guys are looking to add some additional Excel writer engines as output options and and want to integrate PyExcelerator for speed.

kevmo314 · 2013-10-11T16:04:50Z

Theoretically yes, but Excel only allows 64, 000 unique formats. There are some other limitations as well on the number of fills and number formats.

Yes, but in a scope of infinite styles, if I randomly picked 64000 with no known prior distribution and applied them to n > 64k cells, you wouldn't know each style until it was applied. Surely, you can't expect the user to know that they should generate the 64k styles before applying them -- that would be offloading the work of style compression to the user, when style compression itself is a functionality of the library, not to mention the difficulty of actually managing the 64k styles because now that has to be done on your own.

Of course, if they're seasoned programmers, they'll think of that, but especially with Pandas, you're looking at mostly intro programmers who are experts in other fields that think in a much different way. I'd guess that a very large chunk of non-cs researchers don't know what a bitmask is, let alone how to apply it in this context. The library should really be handling all of this to make the code as intuitive as possible.

But in any case, we can do that too, I just wouldn't consider it a valid solution as it's a ludicrous expectation of the user:

pyexcelerate value fastest, 650, 100, 0.6847564425087547
pyexcelerate value faster, 650, 100, 0.7485500950539803
pyexcelerate value fast, 650, 100, 2.153594451250089
xlsxwriter, 650, 100, 1.2844041630989418
pyexcelerate style cheating, 650, 100, 1.3979011997334858
pyexcelerate style fastest, 650, 100, 2.9789260227207652
pyexcelerate style faster, 650, 100, 3.5984172569418735
pyexcelerate style fast, 650, 100, 8.74910337766392
xlsxwriter style cheating, 650, 100, 1.6390989636951083
xlsxwriter style, 650, 100, 7.596218653488084
.
----------------------------------------------------------------------
Ran 1 test in 31.239s

OK

jmcnamara · 2013-10-11T16:18:03Z

Ok. Thanks.

Some benchmark optimisations.

b1528cd

jmcnamara closed this Oct 11, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some benchmark optimisations #13

Some benchmark optimisations #13

jmcnamara commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013

Some benchmark optimisations #13

Some benchmark optimisations #13

Conversation

jmcnamara commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 10, 2013

jmcnamara commented Oct 10, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013

kevmo314 commented Oct 11, 2013

jmcnamara commented Oct 11, 2013