PERF: to speed up rendering of styler #34863

Merged
@jreback merged 2 commits into pandas-dev:master from the jihwans-patch-1 branch on Jul 9, 2020

Conversation

jihwans
Contributor

@jihwans jihwans commented Jun 18, 2020

see #19917 (comment)

  • closes Styler extremely slow #19917
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@TomAugspurger
Contributor

Seems to be failing some tests. Is this supposed to produce an equivalent output?

@jihwans
Contributor Author

jihwans commented Jun 18, 2020

Yes, it does.
The test cases seem to expect empty values because of the original code's 'misbehavior', which is what caused the slowdown.

For example:

def test_bar_align_zero_nans(self):
    df = pd.DataFrame({"A": [1, None], "B": [-1, 2]})
    result = df.style.bar(align="zero", axis=None)._compute().ctx
    expected = {
        (0, 0): [
            "width: 10em",
            " height: 80%",
            "background: linear-gradient(90deg, "
            "transparent 50.0%, #d65f5f 50.0%, "
            "#d65f5f 75.0%, transparent 75.0%)",
        ],
        (1, 0): [""],
        (0, 1): [
            "width: 10em",
            " height: 80%",
            "background: linear-gradient(90deg, "
            "transparent 25.0%, #d65f5f 25.0%, "
            "#d65f5f 50.0%, transparent 50.0%)",
        ],
        (1, 1): [
            "width: 10em",
            " height: 80%",
            "background: linear-gradient(90deg, "
            "transparent 50.0%, #d65f5f 50.0%, "
            "#d65f5f 100.0%, transparent 100.0%)",
        ],
    }
    assert result == expected

This entry can be removed from the expected result in the code above:

        (1, 0): [""],
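
To illustrate why dropping that entry is harmless, here is a minimal sketch that uses a plain dict as a stand-in for the Styler's `ctx`. The helper `drop_empty` is hypothetical and not part of pandas; it only shows that filtering out empty declarations leaves every non-empty cell untouched.

```python
def drop_empty(ctx):
    """Return a copy of a ctx-like mapping without empty style declarations.

    `ctx` maps (row, col) positions to lists of CSS declaration strings.
    This is a hypothetical helper for illustration, not a pandas API.
    """
    cleaned = {}
    for cell, declarations in ctx.items():
        kept = [d for d in declarations if d.strip()]
        if kept:  # skip cells that carry no actual CSS
            cleaned[cell] = kept
    return cleaned


old_style_ctx = {
    (0, 0): ["width: 10em", " height: 80%"],
    (1, 0): [""],  # the entry the old code produced for the NaN cell
}
new_style_ctx = {
    (0, 0): ["width: 10em", " height: 80%"],
}

print(drop_empty(old_style_ctx) == new_style_ctx)
```

Under these assumptions, the old and new intermediate results differ only in the cells that carry no CSS at all.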

@WillAyd
Member

WillAyd commented Jun 19, 2020

Typically the best way is to write benchmarks and run them in the performance test suite.

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

@TomAugspurger
Contributor

As long as the new implementation produces roughly equivalent HTML, we can consider updating the intermediate representation expected in the tests.

@jihwans
Contributor Author

jihwans commented Jun 19, 2020

It does produce exactly the same HTML. As you can see, the only things missing from the intermediate steps are the empty strings.
Adding the empty strings back did not change the HTML. The only case where the output could differ (I have not actually seen this happen) is if the original code produced style="", and omitting that attribute has no effect.
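
The claim can be modelled with a small sketch. The function `cell_html` is a hypothetical, simplified renderer written for this illustration; pandas' real template is more involved and may attach styles differently. The point is only that a cell with an empty declaration and a cell with no entry at all can render to the same markup.

```python
def cell_html(value, declarations):
    """Render one table cell; hypothetical helper, not pandas' template."""
    css = "; ".join(d.strip() for d in declarations if d.strip())
    if css:
        return f'<td style="{css}">{value}</td>'
    return f"<td>{value}</td>"  # no style attribute when nothing to apply


ctx_old = {(1, 0): [""]}  # old code: an empty declaration is recorded
ctx_new = {}              # new code: no entry for the cell at all

html_old = cell_html("nan", ctx_old.get((1, 0), []))
html_new = cell_html("nan", ctx_new.get((1, 0), []))
print(html_old == html_new)
```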

@TomAugspurger
Copy link
Contributor

OK, thanks. Can you update the tests then and add an ASV?

@jihwans
Contributor Author

jihwans commented Jun 19, 2020

I can update tests but what is an ASV?

@gfyoung gfyoung added the Performance (Memory or execution speed performance) and Visualization (plotting) labels Jun 20, 2020
@gfyoung
Member

gfyoung commented Jun 20, 2020

@jihwans : ASV refers to https://github.com/airspeed-velocity/asv

This directory is where you add such a test: https://github.com/pandas-dev/pandas/tree/master/asv_bench/benchmarks
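
For readers new to ASV: a benchmark is typically a class with a `setup` method plus methods whose names start with `time_` (timing) or `peakmem_` (peak memory), optionally parametrized through `params` and `param_names`. The sketch below follows that layout but measures a pure-Python stand-in, `build_ctx`, rather than a real Styler render, so that it runs without pandas. The class name, the stand-in function, and the parameter values are all assumptions made for illustration.

```python
from collections import defaultdict


def build_ctx(styles):
    """Hypothetical stand-in for the work a Styler does before rendering."""
    ctx = defaultdict(list)
    for i, row in enumerate(styles):
        for j, css in enumerate(row):
            if css:
                ctx[(i, j)].extend(css.rstrip(";").split(";"))
    return ctx


class RenderStandIn:
    # ASV runs every benchmark once per combination of these values.
    params = [[12, 24], [12, 120]]
    param_names = ["cols", "rows"]

    def setup(self, cols, rows):
        # Called before each benchmark; not included in the measurement.
        self.styles = [["color: red;"] * cols for _ in range(rows)]

    def time_build_ctx(self, cols, rows):
        # Methods prefixed with `time_` are timed by ASV.
        build_ctx(self.styles)

    def peakmem_build_ctx(self, cols, rows):
        # Methods prefixed with `peakmem_` report peak memory use.
        build_ctx(self.styles)
```

A real benchmark for this PR would build a DataFrame in `setup` and render its Styler inside the `time_` method.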

@jihwans
Contributor Author

jihwans commented Jun 20, 2020

Thank you all.
I guess it's going to take some time to figure these out.
I will make changes to test cases first.

@WillAyd
Member

WillAyd commented Jun 23, 2020

/azp run

@azure-pipelines
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines
Contributor

Commenter does not have sufficient privileges for PR 34863 in repo pandas-dev/pandas

@jihwans jihwans force-pushed the jihwans-patch-1 branch 2 times, most recently from 0e68954 to e36e03b Compare June 24, 2020 01:18
@jihwans
Contributor Author

jihwans commented Jun 24, 2020

The ASV file was added.

@jihwans jihwans requested a review from WillAyd June 24, 2020 01:22
@jihwans jihwans marked this pull request as draft June 24, 2020 01:30
@jihwans jihwans force-pushed the jihwans-patch-1 branch 2 times, most recently from 8b1825a to 69c2bc2 Compare June 24, 2020 11:25
@TomAugspurger
Contributor

TomAugspurger commented Jun 24, 2020

Run black on that file

would reformat /home/runner/work/pandas/pandas/asv_bench/benchmarks/io/style.py

@jihwans
Contributor Author

jihwans commented Jun 24, 2020

Thank you Tom, Will and Young for your help. I learned a lot along the way.
Hopefully this commit will help a few people who are in the same situation I was.

@jihwans jihwans marked this pull request as ready for review June 24, 2020 13:59
@jihwans
Contributor Author

jihwans commented Jun 24, 2020

jreback - I submitted the code again with that test removed. It's really not necessary, since there is no time impact there; it does have a memory impact.
The time impact is only on render, not on apply.

Thanks for your valuable comment.

Member

@WillAyd WillAyd left a comment

lgtm

@jihwans jihwans mentioned this pull request Jun 24, 2020
Contributor

@jreback jreback left a comment

also pls add a whatsnew note in performance section in 1.1

@jreback jreback changed the title from "to speed up rendering of styler" to "PERF: to speed up rendering of styler" Jun 24, 2020
@jreback jreback added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) and Styler (conditional formatting using DataFrame.style) labels and removed the Visualization (plotting) label Jun 24, 2020
@jreback jreback added this to the 1.1 milestone Jun 24, 2020
@jihwans
Contributor Author

jihwans commented Jun 25, 2020

whatsnew added in 1.1 performance

i = self.index.get_indexer([row_label])[0]
j = self.columns.get_indexer([col_label])[0]
for pair in col.rstrip(";").split(";"):
rows = [(row_label, v) for row_label, v in attrs.iterrows()]
Contributor

you should use itertuples here (it's actually much faster)

Contributor Author

@jeff - that's good to know. I tried it but could not figure out how to do the same thing with itertuples.
To get the col_label within the inner loop, I would need ._fields, getattr(), list slicing, etc. to separate out the index: basically many extra steps. I am not sure how much we would save here.

However, it seems that .get_indexer is what causes most of the delay. So the real solution should be something that eliminates get_indexer entirely, or some effort to speed up get_indexer itself.

I can think of one way to avoid get_indexer: simply take the index and columns as lists and use them to look up the integer position of a given label. However, I was not sure I could do that safely, because I am not sure that all the labels given in attrs always match those of self.index and self.columns. Probably not.

Contributor

ahh i see, you are doing get indexer once for the rows, you can do the same once for the columns. you can throw this in a dict {label -> int}. This will vastly speed up things.
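
A minimal sketch of the `{label -> int}` idea, using plain lists as stand-ins for `self.index` and `self.columns`; the names `index_labels`, `column_labels`, and `position_of` are hypothetical. One difference worth keeping in mind: `Index.get_indexer` returns -1 for a label it cannot find, while a dict lookup raises `KeyError`, so the two behave the same only when every label is present and the labels are unique.

```python
index_labels = ["r1", "r2", "r3"]
column_labels = ["A", "B"]

# Build each mapping once, outside any loop over cells.
row_pos = {label: i for i, label in enumerate(index_labels)}
col_pos = {label: j for j, label in enumerate(column_labels)}


def position_of(row_label, col_label):
    """Translate a (row label, column label) pair to integer positions."""
    return row_pos[row_label], col_pos[col_label]


print(position_of("r3", "B"))  # (2, 1)

# A label that is not in the mapping raises KeyError instead of yielding -1.
try:
    position_of("missing", "A")
except KeyError as err:
    print("unknown label:", err)
```

Each lookup is then a constant-time dict access instead of a separate indexer call per cell.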

Contributor Author

So you mean each row will have the same columns?
What about a 4x4 table that has some attr assignments on columns A and B in rows 1 and 2, but on columns C and D only in rows 3 and 4?
It would be great if the original author of this function could answer this. Or is that you, maybe?
I admit that the patch relies heavily on guesses based on common sense (or my version of common sense :) ).
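
The 4x4 scenario in this question can be checked with a small pure-Python model. Nested dicts stand in for the `attrs` DataFrame (column label to `{row label: CSS string}`), and `update_ctx` is a hypothetical simplification, not the pandas method. Because the maps are keyed by label and built from the full index and columns, it should not matter that different rows style different columns.

```python
from collections import defaultdict


def update_ctx(ctx, attrs, index_labels, column_labels):
    """Append the CSS declarations found in `attrs` to `ctx`.

    `attrs` is a dict of {column label: {row label: css string}},
    a hypothetical stand-in for the DataFrame passed to Styler._update_ctx.
    """
    row_pos = {label: i for i, label in enumerate(index_labels)}
    col_pos = {label: j for j, label in enumerate(column_labels)}
    for col_label, column in attrs.items():
        j = col_pos[col_label]
        for row_label, css in column.items():
            css = css.rstrip(";")
            if not css:
                continue  # nothing to record for this cell
            i = row_pos[row_label]
            ctx[(i, j)].extend(css.split(";"))


rows = [1, 2, 3, 4]
cols = ["A", "B", "C", "D"]
attrs = {
    "A": {1: "color: red;", 2: "color: red;", 3: "", 4: ""},
    "B": {1: "color: red;", 2: "color: red;", 3: "", 4: ""},
    "C": {1: "", 2: "", 3: "color: blue;", 4: "color: blue;"},
    "D": {1: "", 2: "", 3: "color: blue;", 4: "color: blue;"},
}

ctx = defaultdict(list)
update_ctx(ctx, attrs, rows, cols)
print(sorted(ctx))
```

In this model only the eight styled cells end up in `ctx`, each at the position given by its labels.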

Contributor Author

If it is certain that we really do not have to use the get_indexer method, something like this should probably work, outside the loops:

    rowmap = {label: i for i, label in enumerate(self.index)}
    colmap = {label: i for i, label in enumerate(self.columns)}

Contributor Author

@jihwans jihwans Jun 25, 2020

This is the exact code that worked for my app:

    def _update_ctx(self, attrs: DataFrame) -> None:
        coli = {k: i for i, k in enumerate(self.columns)}
        rowi = {k: i for i, k in enumerate(self.index)}
        for jj in range(len(attrs.columns)):
            cn = attrs.columns[jj]
            j = coli[cn]
            for rn, c in attrs[[cn]].itertuples():
                if not c: continue
                c = c.rstrip(";")
                if not c: continue
                i = rowi[rn]
                for pair in c.split(";"):
                    self.ctx[(i, j)].append(pair)

However, it outperforms the current patch only slightly in the benchmark.
For the 1200 case, it took 3.02s, versus 3.13s with the current patch.
Plus, we're not sure this is OK in all situations.

Contributor

can you avoid the append? I think a comprehension (or at least replacing the last append) would be much better

Contributor Author

I don't see how it can be better.
Let's say a comprehension gives us a defaultdict of new pairs to be added to the existing ctx. What has to happen afterward is basically the same thing as here, I would assume.
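
One way to drop the innermost `append` loop without changing the result is `list.extend`. This is an assumption about what the suggestion could look like, not the code that was merged. Either way, the new declarations still have to be merged into the existing `ctx`, which is the point made in this reply.

```python
from collections import defaultdict

css = "color: red; font-weight: bold;"

# Variant 1: append each declaration in an explicit loop.
ctx_loop = defaultdict(list)
for pair in css.rstrip(";").split(";"):
    ctx_loop[(0, 0)].append(pair)

# Variant 2: hand the whole list to extend(), with no inner loop.
ctx_extend = defaultdict(list)
ctx_extend[(0, 0)].extend(css.rstrip(";").split(";"))

print(ctx_loop == ctx_extend)
```

Both variants store the same list of declarations, so any speed difference would have to be measured with a benchmark.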

Contributor

ok i actually like your code above a little better, it's very idiomatic and easy to understand. push it up and ping on green.

Contributor Author

I really don't feel safe with this code. It might break someone's code.
What if a column in attrs is specified in some way other than a plain column label?

@jihwans
Contributor Author

jihwans commented Jun 25, 2020 via email

@jreback
Contributor

jreback commented Jun 27, 2020

@jihwans we have tests ; if we get a report then we can further update but ok pushing this change

Contributor

@jreback jreback left a comment

if you want to update to the simpler version would be fine with that

@jihwans
Contributor Author

jihwans commented Jun 27, 2020

@jeff I tried to come up with code that may break with the new patch we discussed but I could not. I guess I will update it with the new one.

- experimental, 10% further improvement by eliminating get_indexer call

see pandas-dev#19917
@jreback jreback merged commit 4274b11 into pandas-dev:master Jul 9, 2020
@jreback
Contributor

jreback commented Jul 9, 2020

thanks @jihwans very nice

Labels
IO HTML (read_html, to_html, Styler.apply, Styler.applymap), Performance (Memory or execution speed performance), Styler (conditional formatting using DataFrame.style)
Development

Successfully merging this pull request may close these issues.

Styler extremely slow
6 participants