Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query/ggplot] Updates faceting to respect legend grouping #12497

Merged
merged 1 commit into from
Dec 1, 2022

Conversation

iris-garden
Copy link
Collaborator

@iris-garden iris-garden commented Nov 23, 2022

This change brings the facet_wrap interface more in line with the ggplot2 implementation by defaulting to show axes (with tick marks) for all facets, as well as providing scales, nrow and ncol arguments that the user can use to specify whether the scales should be the same across facets and how many rows or columns to use.

It also updates the faceting code to avoid duplicating legend entries when grouped legends are used, and to disable interactivity accordingly.

With this update, running this code:

import hail as hl
from hail.ggplot import aes, facet_wrap, geom_point, ggplot, vars
ht = hl.utils.range_table(10)
ht = ht.annotate(squared=ht.idx ** 2)
ht = ht.annotate(even=hl.if_else(ht.idx % 2 == 0, "yes", "no"))
ht = ht.annotate(threeven=hl.if_else(ht.idx % 3 == 0, "good", "bad"))
ht = ht.annotate(fourven=hl.if_else(ht.idx % 4 == 0, "minor", "major"))
fig = (
    ggplot(ht, aes(x=ht.idx, y=ht.squared))
    + geom_point(aes(shape=ht.even, color=ht.threeven))
    + facet_wrap(vars(ht.threeven, ht.fourven), nrow=1)
)
fig.show()

Produces the following plot:

Screen Shot 2022-11-23 at 12 23 21

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, a few thoughts on tightening the code up


Returns
-------
:class:`FigureAttribute`
The faceter.

"""
return FacetWrap(facets)
return FacetWrap(facets, nrow, ncol, scales)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer that we error if both nrow and ncol are non-missing. I think its better to make our users be explicit rather than implicitly work but potential confuse them down the road.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an error!

@@ -49,14 +55,43 @@ def get_expr_to_group_by(self):

class FacetWrap(Faceter):

def __init__(self, facets):
def __init__(self, facets, nrow, ncol, scales):
self.nrow = nrow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pursuant to the above, I'd probably assert mutual exclusivity of nrow and ncol here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an error!

ncol = int(math.ceil(math.sqrt(num_facet_values)))
nrow = int(math.ceil(num_facet_values / ncol))
def _calc(self, num_facet_values, n):
return int(math.ceil(num_facet_values / n))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use integer math for this:

(num_facet_values + n - 1) // n

I prefer a more descriptive name than _calc. Maybe _n_partitions(self, items, n_splits)? This could be in hailtop/utils.py, it's a general purpose thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced!

return (self.nrow, self._calc(num_facet_values, self.nrow))
else:
ncol = int(math.ceil(math.sqrt(num_facet_values)))
return (self._calc(num_facet_values, ncol), ncol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This biases towards more columns rather than more rows. Seems fine. I'm not sure if I should prefer wide rectangles or long rectangles.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's kinda what I figured haha 🤷‍♀️

"shared_xaxes": False,
"shared_yaxes": False,
}
}).get(self.scales, base)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should error if you give an invalid scales parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an error!

for aes_name, legend_group in legends.items():
if len(legend_group) > 1:
number_of_displayed_legends += 1
traces = [*traces, [fig_so_far, df, facet_row, facet_col, values, trace_category]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about trace_category. Can you explain to me a bit more about what the goal is here? In particular, I'm generally suspicious of iterating through a group of values and arbitrarily taking the last one that is not None (what if there are multiple not None categorys?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace_category is what will be used to label this group on the legend iff we are making an interactive plot, so in my example above, if we only had the color aesthetic and not the shape one, trace_category would end up being "bad" in one iteration, and "good" in the other. Because there's only one shape category and it was created by default instead of using a category from a user-provided aesthetic mapping, the value of category when we're looking at the shape aesthetic in the iteration is None. Since trace_category will only actually be used if we are making an interactive plot, and since we only support interactivity when there's one or fewer aesthetic mappings that will appear in the legend with multiple values, we can assume that if we have a category that is None, we have exactly one category that is None.

Agreed that this is suspicious haha, but hopefully we can ditch plotly and the way we currently do grouped legends soonish 😸

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so, check that I understand this correctly:

grouped_data is List[DataFrame] where the DataFrame contains one column for each Hail field needed to render this geometry. I think each DataFrame represents a different "grouping" of the data. In particular, I think this comes from Stat.listify. We should add a type to that method. I think it returns List[DataFrame]. I also think there's a bug in StatNone. It should return [pd.DataFrame({})].

IIUC, there's a "grouping" for each "visually manifested" scale. "visually manifested" means it's a discrete scale and not part of the tooltip. I think this should be called data_by_visible_scale or something like that.

Alright, so, IIUC, we're trying to collect all the possible values of these visually manifested scales. From that information, we can determine if we need a legend and, furthermore, if we need a static legend.

I think it's also the case that color_legend is the description of the legend itself for colors. Same for shape_legend and shape.

When we're looping through the data_by_visible_scale, it's not obvious that we don't get conflicting values for category across different data frames. Let's be concrete. In your example above there are four facets ({good, bad} cross {minor, major}) with two scales: color & shape. Color is perfectly correlated with {good, bad}. data_by_visible_scale will have a data frame for each combination of color and shape that appears in that particular facet. OK, so after the first iteration of this loop, legends should be this:

{
    "shape": {"shape": ["circle"]},
    "color": {"color": ["blue"]}
}

Then the next iteration, unless I'm totally misunderstanding here, would blow away these values and replace them with:

{
    "shape": {"shape": ["circle", "square"]},
    "color": {"color": ["red"]}
}

So clearly I'm misunderstanding something. I think it must be my understanding of color_legend / category, what is that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, added return type annotations for listify and fixed the return value for StatNone!

for the example in the PR description, this is what legends looks like:

### `apply_to_fig` call ###
#### iteration of `grouped_data` loop ####
{'color': {'bad': 'rgb(0, 219, 207)'}, 'shape': {'no': 'square'}}
#### iteration of `grouped_data` loop ####
{'color': {'bad': 'rgb(0, 219, 207)'}, 'shape': {'no': 'square', 'yes': 'circle'}}
### `apply_to_fig` call ###
#### iteration of `grouped_data` loop ####
{'color': {'bad': 'rgb(0, 219, 207)'}, 'shape': {'yes': 'circle'}}
### `apply_to_fig` call ###
#### iteration of `grouped_data` loop ####
{'color': {'good': 'rgb(255, 0, 0)'}, 'shape': {'no': 'square'}}
#### iteration of `grouped_data` loop ####
{'color': {'good': 'rgb(255, 0, 0)'}, 'shape': {'no': 'square', 'yes': 'circle'}}
### `apply_to_fig` call ###
#### iteration of `grouped_data` loop ####
{'color': {'good': 'rgb(255, 0, 0)'}, 'shape': {'yes': 'circle'}}

so each aesthetic (shape, color) has a key in legends, which is mapped to a dict mapping the possible values for that aesthetic (the "categories" extracted from the corresponding column of the hail Table or Expression) to their value for that aesthetic.

in the None case i mentioned above, we get something like this:

{'color': {None: 'black'}, 'shape': {'no': 'circle', 'yes': 'square'}}

due to .... some other code somewhere 😭 (maybe somewhere in ScaleDiscrete?) ... the value for the category name when we don't actually have more than one category is None

does that actually address your question? sorry, i got a bit lost along the way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I kinda see what's happening now. I have a suspicion we could find a slightly simpler, if less efficient, solution by passing over the data twice: once to compute the legend and once to add the trace. No need to act on that now. Let's keep it in mind as we look to Altair though.

count = 0
for legend_group in legends.values():
count += 1 if len(legend_group) > 1 else 0
requires_static = count > 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nit but I think this whole bit is easier to read as:

has_displayed_legend = any(
    len(legend_group) > 1 for legend_group in legends.values()
)
requries_static = is_faceted or has_displayed_legend

Copy link
Collaborator Author

@iris-garden iris-garden Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code is actually checking that there are at least two legend_groups with len > 1 (if we only have one, we can still support interactivity), so any wouldn't work here, as far as I can tell?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh. This kind of makes the argument for me that this is a bit hard to follow. How about this:

non_empty_legend_groups = (g for g in legends.values() if len(g) > 1)
has_at_least_two_legend_groups = len(non_empty_legend_groups) >= 2
requries_static = is_faceted or has_at_least_two_legend_groups

The first line is a generator for the kinds of groups you describe. The second line counts the number of them.

Let me know if I'm picking nits here! Maybe I'm being obtuse.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair! updated based on your suggestion 😸

for aes_name, legend_group in legends.items():
if len(legend_group) > 1:
number_of_displayed_legends += 1
traces = [*traces, [fig_so_far, df, facet_row, facet_col, values, trace_category]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use traces.append(...).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced!


if requires_static:
for trace in traces:
self._add_trace(*trace[:-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit too clever 😉 . What's the harm in sending the trace_category into _add_trace in the static case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I'm using whether it's passed in to _add_trace to decide whether the trace should have an interactive legend entry. Should I pass a flag to it instead, or do something else?

yaxis=dict(linecolor="black", showticklabels=True),
**{
f"{var}axis{idx}": {"linecolor": "black", "showticklabels": True}
for idx in range(2, n_facet_rows + n_facet_cols + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably worthy of a short comment explaining the sillyness of the axis keyword argument names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

# xaxis stands in for xaxis1; xaxis2 and beyond are as expected

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment!

@iris-garden iris-garden force-pushed the query/ggplot/facet-updates branch 2 times, most recently from b5af6a5 to 8352869 Compare November 23, 2022 20:43
@iris-garden iris-garden force-pushed the query/ggplot/facet-updates branch 3 times, most recently from 7656e08 to 9dc7a11 Compare November 28, 2022 17:15
@@ -23,40 +27,85 @@ def vars(*args):
return hl.struct(**{f"var_{i}": arg for i, arg in enumerate(args)})


def facet_wrap(facets):
def facet_wrap(facets: StructExpression, nrow: int = None, ncol: int = None, scales: str = "fixed") -> "FacetWrap":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, one more thing, let's make nrow, ncol, and scales keyword only. We generally prefer keyword only arguments unless there's a strong usability argument. In this case, I'd argue it's actually more confusing that facet_wrap(..., 3) is allowed, there's no obvious reason that rows would come before cols.

Suggested change
def facet_wrap(facets: StructExpression, nrow: int = None, ncol: int = None, scales: str = "fixed") -> "FacetWrap":
def facet_wrap(facets: StructExpression, *, nrow: int = None, ncol: int = None, scales: str = "fixed") -> "FacetWrap":

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

count = 0
for legend_group in legends.values():
count += 1 if len(legend_group) > 1 else 0
requires_static = count > 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh. This kind of makes the argument for me that this is a bit hard to follow. How about this:

non_empty_legend_groups = (g for g in legends.values() if len(g) > 1)
has_at_least_two_legend_groups = len(non_empty_legend_groups) >= 2
requries_static = is_faceted or has_at_least_two_legend_groups

The first line is a generator for the kinds of groups you describe. The second line counts the number of them.

Let me know if I'm picking nits here! Maybe I'm being obtuse.

for aes_name, legend_group in legends.items():
if len(legend_group) > 1:
number_of_displayed_legends += 1
traces = [*traces, [fig_so_far, df, facet_row, facet_col, values, trace_category]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so, check that I understand this correctly:

grouped_data is List[DataFrame] where the DataFrame contains one column for each Hail field needed to render this geometry. I think each DataFrame represents a different "grouping" of the data. In particular, I think this comes from Stat.listify. We should add a type to that method. I think it returns List[DataFrame]. I also think there's a bug in StatNone. It should return [pd.DataFrame({})].

IIUC, there's a "grouping" for each "visually manifested" scale. "visually manifested" means it's a discrete scale and not part of the tooltip. I think this should be called data_by_visible_scale or something like that.

Alright, so, IIUC, we're trying to collect all the possible values of these visually manifested scales. From that information, we can determine if we need a legend and, furthermore, if we need a static legend.

I think it's also the case that color_legend is the description of the legend itself for colors. Same for shape_legend and shape.

When we're looping through the data_by_visible_scale, it's not obvious that we don't get conflicting values for category across different data frames. Let's be concrete. In your example above there are four facets ({good, bad} cross {minor, major}) with two scales: color & shape. Color is perfectly correlated with {good, bad}. data_by_visible_scale will have a data frame for each combination of color and shape that appears in that particular facet. OK, so after the first iteration of this loop, legends should be this:

{
    "shape": {"shape": ["circle"]},
    "color": {"color": ["blue"]}
}

Then the next iteration, unless I'm totally misunderstanding here, would blow away these values and replace them with:

{
    "shape": {"shape": ["circle", "square"]},
    "color": {"color": ["red"]}
}

So clearly I'm misunderstanding something. I think it must be my understanding of color_legend / category, what is that?

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@@ -44,7 +44,7 @@ def _update_legend_trace_args(self, trace_args, legend_cache):
trace_args["showlegend"] = False
else:
trace_args["showlegend"] = True
legend_cache.add(trace_args["name"])
legend_cache[trace_args["name"]] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 197, you **prev where prev is a value from the legend_cache. **True will error. Should this be assigning {} instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't cause problems on 197 because GeomPoint.apply_to_fig never calls _update_legend_trace_args. i changed legend_cache to a dict instead of a set because i needed to put the whole legend_group in there rather than just the aes_name, but since all the other geoms still function the same as before, this was just the easiest way i could think of to have it basically still be a set (just the set of the dict's keys, while ignoring the values)

that said, i think it's still a good idea to update it, so that the type of legend_cache is more consistent!

def listify(self, agg_result):
return pd.DataFrame({})
def listify(self, agg_result) -> List[DataFrame]:
return [pd.DataFrame({})]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

CHANGELOG: Added support for `scales`, `nrow`, and `ncol` arguments, as
well as grouped legends, to `hail.ggplot.facet_wrap`.

This change brings the `facet_wrap` interface more in line with
[the `ggplot2` implementation](https://ggplot2-book.org/facet.html#facet-wrap)
by defaulting to show axes (with tick marks) for all facets, as well as
providing `scales`, `nrow` and `ncol` arguments that the user can use to specify
whether the scales should be the same across facets and how many rows or columns
to use.

It also updates the faceting code to avoid duplicating legend entries when
[grouped legends](hail-is#12254) are used, and to
disable interactivity accordingly.
@danking danking merged commit 5bd0f8f into hail-is:main Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants