Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: make Styler compatible with non-unique indexes #41269

Merged
merged 22 commits into from
May 6, 2021

Conversation

attack68
Copy link
Contributor

@attack68 attack68 commented May 2, 2021

The PR aims to make Styler compatible with non-unique indexes/columns, for the purpose of rendering all DataFrame types (even if no styling is applied)

  • Styler.format: made FULLY compatible with some modifications to the loops, inc TESTS.
  • Styler.apply and Styler.applymap: made PARTIALLY compatible:
    • if subsets are unique then compatible, inc TESTS.
    • if subsets are non-unique slices will raise a not compatible KeyError, inc. TESTS
  • Styler.apply and applymap are NOT compatible. Raises KeyError.
  • Styler.set_table_styles: made FULLY compatible and will style multiple rows/columns from a non-unique key, inc TESTS.
  • Styler.set_td_classes uses reindex so is PARTIALLY compatible where classes has unique index/columns: now returns a KeyError in non-unique case, inc TESTS.
  • Styler.set_tooltips uses reindex so is PARTIALLY compatible where ttips has unique index/columns: now returns a KeyError in non-unique case, inc TESTS.
  • Styler.hide_index and .hide_columns are already FULLY compatible through existing code (inc TESTS)
  • all the built-in styling functions use some version of apply or applymap so are captured by the above cases.

I believe this is all relevant functionality reviewed.

@attack68 attack68 changed the title ENH: make Styler.format compatible with non-unique indexes ENH: make Styler compatible with non-unique indexes May 2, 2021
if not c:
continue
css_list = maybe_convert_css_to_tuples(c)
i, j = self.index.get_loc(rn), self.columns.get_loc(cn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can instead we just not use the indices to look up locations? and instead just use indexers (e.g. iterate over the number of columns and use iloc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because attrs may be a subset of the main self.data, so to maintain performance and ensure the mapped css is placed in the right integer locations it needs to lookup - and lookup doesn't work (without ambiguity) for non-unique.

@jreback jreback added the Styler conditional formatting using DataFrame.style label May 3, 2021
@attack68 attack68 marked this pull request as ready for review May 3, 2021 06:57
pandas/io/formats/style.py Show resolved Hide resolved
Comment on lines 481 to 483
try:
for rn, c in attrs[[cn]].itertuples():
if not c:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you limit the try/except not to the entire loop here, e.g. just put it around the .get_loc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be even better to simply refuse to render non-uniques entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i removed the separate cases, as suggested.

css_list = maybe_convert_css_to_tuples(c)
i, j = self.index.get_loc(rn), self.columns.get_loc(cn)
self.ctx[(i, j)].extend(css_list)
except ValueError as ve:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to be much more fine grained

self._display_funcs[(i, j)] = format_func
for row in data[[col]].itertuples():
i_ = self.index.get_indexer_for([row[0]]) # handle duplicate keys in
j_ = self.columns.get_indexer_for([col]) # non-unique indexes
Copy link
Contributor

@jreback jreback May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do this outside of the loop right? (as col doesn't change), for j_

does this change perf at all? (I don't think so, but checking).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.. the multiple loops really killed it for the unique case (which is benchmarked)..

       before           after         ratio
     [4af3eed5]       [3a8f11e5]
     <styler_non_unique~1^2>       <styler_non_unique>
+        57.8±2ms          163±4ms     2.82  io.style.Render.time_format_render(24, 120)
+        87.1±5ms          242±4ms     2.78  io.style.Render.time_format_render(36, 120)
+      30.5±0.9ms       82.2±0.4ms     2.69  io.style.Render.time_format_render(12, 120)
+     8.53±0.08ms       14.7±0.2ms     1.72  io.style.Render.time_format_render(12, 12)
+      16.2±0.3ms       27.7±0.7ms     1.71  io.style.Render.time_format_render(24, 12)
+        25.3±1ms       40.9±0.7ms     1.62  io.style.Render.time_format_render(36, 12)
+      16.6±0.2ms       18.5±0.2ms     1.11  io.style.Render.time_classes_render(36, 12)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

So I had to separate out the non-unique and unique cases with a conditional, then performance was the same...

BENCHMARKS NOT SIGNIFICANTLY CHANGED.


j = self.columns.get_loc(col) # single value
for row, value in data[[col]].itertuples():
i = self.index.get_loc(row) # single value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you pull the col indexer out does this still have a perf hit? e.g. get_indexer_for calls get_loc if its unique (which is cached) anyhow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good reviewing! Third time's a charm, the code is simpler, works for non-unique and unique and performance improvement:

       before           after         ratio
     [4af3eed5]       [c3b7af82]
     <text_gradient^2>       <styler_non_unique>
-      33.1±0.4ms       25.9±0.5ms     0.78  io.style.Render.time_format_render(12, 120)
-      86.4±0.4ms       66.8±0.5ms     0.77  io.style.Render.time_format_render(36, 120)
-        62.3±2ms       46.4±0.4ms     0.75  io.style.Render.time_format_render(24, 120)
-      10.1±0.3ms       4.16±0.2ms     0.41  io.style.Render.time_format_render(12, 12)
-      24.7±0.2ms       9.53±0.2ms     0.39  io.style.Render.time_format_render(36, 12)
-      17.7±0.2ms       6.76±0.1ms     0.38  io.style.Render.time_format_render(24, 12)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jreback jreback added this to the 1.3 milestone May 6, 2021
@jreback jreback merged commit ebf3b98 into pandas-dev:master May 6, 2021
@jreback
Copy link
Contributor

jreback commented May 6, 2021

ok!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Styler conditional formatting using DataFrame.style
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Make Styler work with non-unique indexes
2 participants