PERF: to speed up rendering of styler #34863

jihwans · 2020-06-18T18:59:20Z

see #19917 (comment)

closes Styler extremely slow #19917
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

TomAugspurger · 2020-06-18T19:31:11Z

Seems to be failing some tests. Is this supposed to produce an equivalent output?

jihwans · 2020-06-18T20:09:39Z

Yes it does.
The test cases seem to expect empty values because of the original code's 'misbehavior' that caused the slowdown.

For example:

pandas/pandas/tests/io/formats/test_style.py

Lines 986 to 1013 in 3c959fc

    
    def test_bar_align_zero_nans(self):
        df = pd.DataFrame({"A": [1, None], "B": [-1, 2]})
        result = df.style.bar(align="zero", axis=None)._compute().ctx
        expected = {
            (0, 0): [
                "width: 10em",
                " height: 80%",
                "background: linear-gradient(90deg, "
                "transparent 50.0%, #d65f5f 50.0%, "
                "#d65f5f 75.0%, transparent 75.0%)",
            ],
            (1, 0): [""],
            (0, 1): [
                "width: 10em",
                " height: 80%",
                "background: linear-gradient(90deg, "
                "transparent 25.0%, #d65f5f 25.0%, "
                "#d65f5f 50.0%, transparent 50.0%)",
            ],
            (1, 1): [
                "width: 10em",
                " height: 80%",
                "background: linear-gradient(90deg, "
                "transparent 50.0%, #d65f5f 50.0%, "
                "#d65f5f 100.0%, transparent 100.0%)",
            ],
        }
        assert result == expected

This entry can be removed from the expected test result in the code above:

        (1, 0): [""],
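The stray empty-string entry most likely comes from how Python splits an empty string, which is a property of `str.split` rather than of pandas. A minimal sketch, assuming the NaN cell carries an empty CSS string (the variable names are illustrative and not taken from the patch):

```python
from collections import defaultdict

# Splitting an empty CSS string still yields one (empty) declaration.
empty_css = ""
pairs = empty_css.rstrip(";").split(";")

# Hypothetical stand-in for the Styler's ctx mapping of (row, col) -> declarations.
ctx_old = defaultdict(list)
for pair in pairs:
    ctx_old[(1, 0)].append(pair)

# Skipping empty strings up front leaves the cell out of the mapping entirely.
ctx_new = defaultdict(list)
if empty_css:
    for pair in empty_css.rstrip(";").split(";"):
        ctx_new[(1, 0)].append(pair)
```

Under that assumption the old path records a list holding one empty string for the cell, while the new path records nothing for it.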

pandas/io/formats/style.py

WillAyd · 2020-06-19T01:43:21Z

Typically, the best way is to write benchmarks and run them in the performance suite.

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

TomAugspurger · 2020-06-19T11:34:24Z

As long as the new implementation produces roughly equivalent HTML then we can consider updating intermediate representation expected in the tests.

jihwans · 2020-06-19T13:28:19Z

It does produce exactly the same HTML. As you can see, the only things missing in the intermediate steps are the empty strings.
Adding empty strings did not change the HTML. The only way the output could differ (I have not actually seen anything like this) is if the original code produced style="", which has no effect even when omitted.

TomAugspurger · 2020-06-19T14:47:49Z

OK, thanks. Can you update the tests then and add an ASV?

jihwans · 2020-06-19T18:03:03Z

I can update tests but what is an ASV?

gfyoung · 2020-06-20T05:54:35Z

@jihwans : ASV refers to https://github.com/airspeed-velocity/asv

This directory is where you add such a test: https://github.com/pandas-dev/pandas/tree/master/asv_bench/benchmarks
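For readers new to asv, a benchmark is a plain class whose `setup` method prepares data and whose `time_*` methods are timed, once per parameter combination. The following is a rough sketch of what a Styler render benchmark could look like, not the file that was added in this PR; the class name, parameter values, and styling function are all assumptions.

```python
class RenderApplySketch:
    # asv runs every benchmark once per combination of these parameters.
    params = [[12, 24], [12, 120]]
    param_names = ["cols", "rows"]

    def setup(self, cols, rows):
        # Imported lazily so this sketch can be loaded without pandas installed.
        import numpy as np
        import pandas as pd

        df = pd.DataFrame(
            np.random.randn(rows, cols),
            columns=[f"float_{i + 1}" for i in range(cols)],
            index=[f"row_{i + 1}" for i in range(rows)],
        )

        # Hypothetical styling function: highlight one row, leave the rest unstyled.
        def highlight_first_row(s):
            return ["background-color: lightcyan" if s.name == "row_1" else "" for _ in s]

        self.st = df.style.apply(highlight_first_row, axis=1)

    def time_render(self, cols, rows):
        # Styler.render() was the API at the time of this thread; newer pandas
        # releases expose Styler.to_html() instead, so prefer it when present.
        render = getattr(self.st, "to_html", None) or self.st.render
        render()
```

The contributing guide linked earlier in the thread describes how to run such a class through the performance suite.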

jihwans · 2020-06-20T12:13:32Z

Thank you all.
I guess it's going to take some time to figure these out.
I will make changes to test cases first.

WillAyd · 2020-06-23T20:21:25Z

/azp run

azure-pipelines · 2020-06-23T20:21:34Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2020-06-23T23:01:41Z

Commenter does not have sufficient privileges for PR 34863 in repo pandas-dev/pandas

jihwans · 2020-06-24T01:20:36Z

The ASV file was added.

TomAugspurger · 2020-06-24T12:53:38Z

Run black on that file

would reformat /home/runner/work/pandas/pandas/asv_bench/benchmarks/io/style.py

jihwans · 2020-06-24T13:55:44Z

Thank you Tom, Will and Young, for your help - I learned a lot along the way.
Hopefully this commit will help a few people who may have been in the same situation as me.

asv_bench/benchmarks/io/style.py

jihwans · 2020-06-24T14:43:14Z

jreback - I submitted the code again with that test removed. It's really not necessary, since it has no impact time-wise. It does have a memory impact.
The time-wise impact is only with render, not with apply.

Thanks for your valuable comment.
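One plausible reading of why the time impact shows up only in render: the test quoted earlier calls `_compute()` before reading `ctx`, which suggests that styling functions are queued by `apply` and executed later. A toy model of that deferral follows; every name in it is hypothetical and it is not pandas's actual implementation.

```python
class LazyStylerSketch:
    """Toy model: apply() only queues work, render() performs it."""

    def __init__(self, data):
        self.data = data
        self._todo = []  # queued styling functions
        self.ctx = {}    # (row, col) -> CSS string, filled during render

    def apply(self, func):
        # Cheap: remember the function and return self for chaining.
        self._todo.append(func)
        return self

    def render(self):
        # Expensive: run each queued function over every cell.
        for func in self._todo:
            for i, row in enumerate(self.data):
                for j, value in enumerate(row):
                    css = func(value)
                    if css:
                        self.ctx[(i, j)] = css
        self._todo.clear()
        return self.ctx


st = LazyStylerSketch([[1, -2], [-3, 4]]).apply(
    lambda v: "color: red" if v < 0 else ""
)
ctx_before_render = dict(st.ctx)  # nothing has been computed yet
result = st.render()
```

In this model any per-cell cost is paid inside `render`, which matches the observation that `apply` itself shows no timing difference.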

WillAyd

lgtm

asv_bench/benchmarks/io/style.py

jreback

also pls add a whatsnew note in performance section in 1.1

jihwans · 2020-06-25T00:02:45Z

whatsnew added in 1.1 performance

see pandas-dev#19917

jreback · 2020-06-25T14:30:00Z

pandas/io/formats/style.py

-                i = self.index.get_indexer([row_label])[0]
-                j = self.columns.get_indexer([col_label])[0]
-                for pair in col.rstrip(";").split(";"):
+        rows = [(row_label, v) for row_label, v in attrs.iterrows()]


you should use itertuples here (its actually much faster)

@jeff - that's a good thing to know, and I tried it but could not figure out how to do the same thing with itertuples.
To get the col_label within the inner loop, I need to use ._fields, getattr(), list slicing, etc. to separate the index ... basically many extra steps. I am not sure how much we can save here.

However, it seems that .get_indexer is what caused most of the delay. So the real solution should be something that eliminates get_indexer entirely, or some effort to speed up get_indexer itself.

I can think of one way to avoid get_indexer: simply take the index and columns as lists and use them to get the integer position of a given label. However, I was not sure I could do that safely, because I am not sure all the labels given in attrs always match those of self.index and self.columns. Probably not.
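On the itertuples difficulty described above: one way to keep the column labels available without `_fields` or `getattr` is to request plain tuples with `name=None` and pair the values with `attrs.columns`. A small sketch, assuming a DataFrame of CSS strings (the sample data is made up):

```python
import pandas as pd

# Hypothetical attrs frame: one CSS string per cell, empty where unstyled.
attrs = pd.DataFrame(
    {"A": ["color: red", ""], "B": ["", "color: blue"]},
    index=["x", "y"],
)

cells = []
# name=None yields plain tuples: (index_label, value_col_0, value_col_1, ...)
for row_label, *values in attrs.itertuples(name=None):
    for col_label, value in zip(attrs.columns, values):
        if value:  # skip unstyled cells
            cells.append((row_label, col_label, value))
```

Plain tuples also sidestep the renaming that namedtuple-based rows apply to column names that are not valid Python identifiers.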

ahh i see, you are doing get indexer once for the rows, you can do the same once for the columns. you can throw this in a dict {label -> int}. This will vastly speed up things.

So you mean each row will have the same columns?
How about a 4x4 table that has some attr assignments on columns A, B in rows 1, 2 but on columns C, D only in rows 3, 4?
It would be great if the original author of this function could answer this. Or are you, maybe?
I admit that I wrote the patch relying largely on guesses based on common sense (or my version of common sense :) )

If it is certain that we really do not have to use the get_indexer method, something like this should probably work, outside the loops:
    rowmap = {label: i for i, label in enumerate(self.index)}
    colmap = {label: i for i, label in enumerate(self.columns)}
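The safety concern raised here can be made concrete with plain lists. The sketch below uses made-up labels and shows two behaviours of a label-to-position dict worth keeping in mind: a missing label raises `KeyError` (whereas `Index.get_indexer` reports it as -1), and duplicate labels collapse to the last position.

```python
index = ["r1", "r2", "r3"]
columns = ["A", "B"]

# Build each label -> position map once, outside any loop.
rowmap = {label: i for i, label in enumerate(index)}
colmap = {label: i for i, label in enumerate(columns)}

pos = (rowmap["r2"], colmap["B"])

# A label that is absent from the map raises KeyError.
try:
    rowmap["missing"]
    missing_found = True
except KeyError:
    missing_found = False

# With duplicate labels, the dict keeps only the last position.
dupmap = {label: i for i, label in enumerate(["x", "y", "x"])}
```

So the dict approach is equivalent only when the labels in attrs are all present in the Styler's index and columns and those labels are unique.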

This is the exact code that worked for my app:

    def _update_ctx(self, attrs: DataFrame) -> None:
        coli = {k: i for i, k in enumerate(self.columns)}
        rowi = {k: i for i, k in enumerate(self.index)}
        for jj in range(len(attrs.columns)):
            cn = attrs.columns[jj]
            j = coli[cn]
            for rn, c in attrs[[cn]].itertuples():
                if not c:
                    continue
                c = c.rstrip(";")
                if not c:
                    continue
                i = rowi[rn]
                for pair in c.split(";"):
                    self.ctx[(i, j)].append(pair)

However, it outperforms the current patch only slightly in the benchmark.
For the 1200 case, it took 3.02s versus 3.13s with the current patch.
Plus, we're not sure this is okay in all cases.

can you avoid the append? I think it would be much better if this were a comprehension (or at least the last append)

I don't see how it can be better.
Let's say we have a defaultdict built by a comprehension, holding the new pairs to be added to the existing ctx. What happens afterward would be basically the same as this, I would assume.
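One way the "avoid the append" suggestion could be read is to replace the innermost loop with a single `list.extend` call. This is only a sketch of that idea on a plain `defaultdict` with made-up data, not the code that was merged, and the thread does not establish whether it is measurably faster.

```python
from collections import defaultdict

ctx = defaultdict(list)
ctx[(0, 0)].append("width: 10em")  # declarations already present

# Hypothetical new CSS strings keyed by integer cell position.
new_css = {(0, 0): "color: red; font-weight: bold;", (1, 1): ""}

for key, css in new_css.items():
    css = css.rstrip(";")
    if not css:
        continue  # nothing to record for unstyled cells
    # One extend() call replaces the inner `for pair in ...: append(pair)` loop.
    ctx[key].extend(css.split(";"))
```

Either way the existing entries in `ctx` still have to be merged with the new ones, which is the point made in the reply above.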

ok i actually like your code above a little better, its very idiomatic and easy to understand. push it up and ping on green.

I really don't feel safe with this code. It might break someone's code.
What if the columns of attrs are specified in some way other than plain column labels?

jihwans · 2020-06-25T23:19:49Z

Hmm... are we sure this won’t break someone’s existing application?


jreback · 2020-06-27T01:38:03Z

@jihwans we have tests ; if we get a report then we can further update but ok pushing this change

jreback

if you want to update to the simpler version would be fine with that

jihwans · 2020-06-27T10:28:02Z

@jeff I tried to come up with code that may break with the new patch we discussed but I could not. I guess I will update it with the new one.

- experimental, 10% further improvement by eliminating get_indexer call see pandas-dev#19917

jreback · 2020-07-09T23:36:04Z

thanks @jihwans very nice

WillAyd reviewed Jun 18, 2020

gfyoung added the Performance (memory or execution speed performance) and Visualization (plotting) labels Jun 20, 2020

jihwans force-pushed the jihwans-patch-1 branch from e29bd98 to 697ac5b Compare June 23, 2020 22:29

jihwans force-pushed the jihwans-patch-1 branch 2 times, most recently from 0e68954 to e36e03b Compare June 24, 2020 01:18

jihwans requested a review from WillAyd June 24, 2020 01:22

jihwans marked this pull request as draft June 24, 2020 01:30

jihwans force-pushed the jihwans-patch-1 branch 2 times, most recently from 8b1825a to 69c2bc2 Compare June 24, 2020 11:25

jihwans marked this pull request as ready for review June 24, 2020 13:59

jreback requested changes Jun 24, 2020

jihwans force-pushed the jihwans-patch-1 branch from 776c3ea to 7698070 Compare June 24, 2020 14:39

jihwans requested a review from jreback June 24, 2020 14:41

WillAyd approved these changes Jun 24, 2020

jreback requested changes Jun 24, 2020

jihwans mentioned this pull request Jun 24, 2020

Styler extremely slow #19917

Closed

jreback requested changes Jun 24, 2020

jreback changed the title from "to speed up rendering of styler" to "PERF: to speed up rendering of styler" Jun 24, 2020

jreback added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) and Styler (conditional formatting using DataFrame.style) labels and removed the Visualization (plotting) label Jun 24, 2020

jreback added this to the 1.1 milestone Jun 24, 2020

jihwans force-pushed the jihwans-patch-1 branch from 7698070 to 390b44b Compare June 25, 2020 00:01

jihwans force-pushed the jihwans-patch-1 branch from 390b44b to 5d58235 Compare June 25, 2020 00:43

PERF: speed up rendering of styler (pandas-dev#19917)

89b2d70

see pandas-dev#19917

jihwans force-pushed the jihwans-patch-1 branch from 5d58235 to 89b2d70 Compare June 25, 2020 11:53

jreback requested changes Jun 25, 2020

jreback reviewed Jun 27, 2020

PERF: speed up rendering of styler (pandas-dev#19917)

e4d4370

- experimental, 10% further improvement by eliminating get_indexer call see pandas-dev#19917

jihwans force-pushed the jihwans-patch-1 branch from adc0c00 to e4d4370 Compare June 27, 2020 12:35

jreback approved these changes Jul 9, 2020

jreback merged commit 4274b11 into pandas-dev:master Jul 9, 2020
