-
-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: to speed up rendering of styler #34863
Conversation
Seems to be failing some tests. Is this supposed to produce an equivalent output? |
Yes it does. For example: pandas/pandas/tests/io/formats/test_style.py Lines 986 to 1013 in 3c959fc
This entry can be removed from the expected test result in above code:
|
The best way is to typically write benchmarks and run them in the performance suite. |
As long as the new implementation produces roughly equivalent HTML then we can consider updating intermediate representation expected in the tests. |
It does produce exactly same HTML. As you can see, the only things missed in intermediate steps are the empty strings. |
OK, thanks. Can you update the tests then and add an ASV? |
I can update tests but what is an ASV? |
@jihwans : ASV refers to https://github.com/airspeed-velocity/asv This directory is where you add such a test: https://github.com/pandas-dev/pandas/tree/master/asv_bench/benchmarks |
Thank you all. |
This comment has been minimized.
This comment has been minimized.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
This comment has been minimized.
This comment has been minimized.
Commenter does not have sufficient privileges for PR 34863 in repo pandas-dev/pandas |
0e68954
to
e36e03b
Compare
ASV files was added |
This comment has been minimized.
This comment has been minimized.
8b1825a
to
69c2bc2
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Run black on that file
|
This comment has been minimized.
This comment has been minimized.
Thank you Tom, Will and Young, for your helps - I learned a lot along the way. |
jreback - I submitted the code again with that test removed -- It's really not necessary since there really is no impact time wise. It does have memory impact. Thanks for your valuable comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also pls add a whatsnew note in performance section in 1.1
whatsnew added in 1.1 performance |
pandas/io/formats/style.py
Outdated
i = self.index.get_indexer([row_label])[0] | ||
j = self.columns.get_indexer([col_label])[0] | ||
for pair in col.rstrip(";").split(";"): | ||
rows = [(row_label, v) for row_label, v in attrs.iterrows()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should use itertuples here (its actually much faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeff - that's a good thing to know and I tried it but could not figure out doing the same thing with itertuples.
To get the col_label within the inner loop, I need to use ._fields(), getattr(), list slicing, etc to separate index, ... basically many extra steps. I am not sure how much we can save here.
However, it seems that .get_indexer is the one that caused much delay. So real solution should be something that will eliminate get_indexer entirely or some acceleration effort done on get_indexer.
I can think of one way to avoid get_indexer -- simply taking index & columns as list and use it to get integer index # of given label. However, I was not sure if I could do that safely because I am not sure all the labels given in attrs always matches that of self.index and self.columns. probably not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh i see, you are doing get indexer once for the rows, you can do the same once for the columns. you can throw this in a dict {label -> int}. This will vastly speed up things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you mean, each row will have same columns ..?
how about 4x4 table that has some attr assignments on column A, B on row 1, 2 but on column C, D only on row 3, 4 ..?
It would be great if the original author of this function could answer this -- or, are you may be?
I admit that I did the patch much relying on guesses based on common sense ( or my version of common sense :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is certain that we really do not have to use get_indexer method, probably something this should work, outside the loops:
rowmap = { label: i for i, label in enumerate(self.index) }
colmap = { label: i for i, label in enumerate(self.columns) }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the exact code that worked for my app:
def _update_ctx(self, attrs: DataFrame) -> None:
coli = {k: i for i, k in enumerate(self.columns)}
rowi = {k: i for i, k in enumerate(self.index)}
for jj in range(len(attrs.columns)):
cn = attrs.columns[jj]
j = coli[cn]
for rn, c in attrs[[cn]].itertuples():
if not c: continue
c = c.rstrip(";")
if not c: continue
i = rowi[rn]
for pair in c.split(";"):
self.ctx[(i, j)].append(pair)
However, it outperform the current patch only slightly with benchmark.
for 1200 case, it took 3.02s where it took 3.13 with the current patch.
Plus we're not sure if this is okay with all occasions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you avoid the append? I think if this was a comprehension (or at least the last append) would be much better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how how it can be better.
Let's say we have a defaultdict out of comprehension with new pairs to be added to existing ctx. What should happen afterward is basically same things as this one, I would assume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i actually like your code above a little better, its very idiomatic and easy to understand. push it up and ping on green.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't feel safe with this code. It might break someone's code.
What if the attrs's column is specified different way other than plain column label?
Hmm... are we sure this won’t break someone’s existing application?
…On Thu, Jun 25, 2020 at 7:03 PM Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/io/formats/style.py
<#34863 (comment)>:
> @@ -561,11 +561,15 @@ def _update_ctx(self, attrs: DataFrame) -> None:
Whitespace shouldn't matter and the final trailing ';' shouldn't
matter.
"""
- for row_label, v in attrs.iterrows():
- for col_label, col in v.items():
- i = self.index.get_indexer([row_label])[0]
- j = self.columns.get_indexer([col_label])[0]
- for pair in col.rstrip(";").split(";"):
+ rows = [(row_label, v) for row_label, v in attrs.iterrows()]
ok i actually like your code above a little better, its very idiomatic and
easy to understand. push it up and ping on green.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#34863 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABXKQVXJRCGJSMII3YHZQCTRYPJU5ANCNFSM4OCB3SGQ>
.
|
@jihwans we have tests ; if we get a report then we can further update but ok pushing this change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want to update to the simpler version would be fine with that
@jeff I tried to come up with code that may break with the new patch we discussed but I could not. I guess I will update it with the new one. |
- experimental, 10% further improvement by eliminating get_indexer call see pandas-dev#19917
thanks @jihwans very nice |
see #19917 (comment)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff