Skip to content

[query/ggplot] fixes legend behavior for single-category columns #12913

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 24, 2023

Conversation

iris-garden
Copy link
Contributor

CHANGELOG: Fixed bug in hail.ggplot where all legend entries would have the same text if one column had exactly one value for all rows and was mapped to either the shape or the color aesthetic for geom_point.

This change fixes an edge case that was encountered where, if the user creates a scatter plot and assigns one column with multiple values in it to the color aesthetic and one column with exactly one value in it to the shape aesthetic (or vice versa), the plot has a legend which contains the correct visual entries, but they are all labelled with the value from the single-value column, instead of the labels from the other column.

For a concrete example, this code:

import hail as hl
from hail.ggplot import ggplot, aes, geom_point

ht = hl.utils.range_table(10).annotate(x = hl.rand_norm(0, 1), y = hl.rand_norm(0, 1))

ht = ht.annotate(proj_title = 'constant')
ht = ht.annotate(subpop = hl.str(ht.idx % 3))

p = (
    hl.ggplot.ggplot(ht, hl.ggplot.aes(x=ht.x, y=ht.y, color=ht.subpop, shape=ht.proj_title))
    + hl.ggplot.geom_point()
)
p.show()

Produces a graph like this one:

Screen-Shot-2023-03-07-at-13 58 48

When it should instead produce one like this:

newplot

Further complicating matters, the first graph cannot be reproduced consistently due to the nature of the bug. See my comment on the code below for more information.

for aes_name in self.aes_legend_groups:
category = self._get_aes_value(df, f"{aes_name}_legend")
trace_category = category if category is not None else trace_category
Copy link
Contributor Author

@iris-garden iris-garden Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried to reproduce this issue locally, I was initially confused because the code example above (distilled from one of Dan's examples in the above-linked Zulip thread) produced the correct graph on my machine. However, another example that had worked fine for Dan evinced the issue for me. The example in question switched shape to be the aesthetic that had only one category ("constant").

The reason for this is that this for loop is iterating through a set, not a list, so it does not have a deterministic order, and the root cause of the bug is that, in this line, if the case is encountered where there are two different aesthetics that each have categories, whichever one was seen most recently is the category used for the legend. This went unnoticed in my testing because it only shows up when there's a column with exactly one value in it mapped to either the color or shape aesthetic, not when one of the two is unmapped or both are mapped to columns with multiple values in them.

My change addresses the issue by storing all categories that are seen in this loop, and waiting until we have checked which column has multiple values to flatten them down to a single relevant category. In the case where multiple columns have multiple values, the category label we generate by this process is ignored in favor of the dummy legend introduced in #12254.

CHANGELOG: Fixed bug in `hail.ggplot` where all legend entries would
have the same text if one column had exactly one value for all rows and
was mapped to either the `shape` or the `color` aesthetic for
`geom_point`.

[An edge case was encountered](https://hail.zulipchat.com/#narrow/stream/123000-general/topic/plotting.20legend.20bug/near/340198585)
where, if the user creates a scatter plot
and assigns one column with multiple values in it to the `color`
aesthetic and one column with exactly one value in it to the `shape`
aesthetic (or vice versa), the plot has a legend which contains the
correct visual entries, but they are all labelled with the value from
the single-value column, instead of the labels from the other column.

For a concrete example, this code:

```python3
import hail as hl
from hail.ggplot import ggplot, aes, geom_point
ht = hl.utils.range_table(10).annotate(x = hl.rand_norm(0, 1), y = hl.rand_norm(0, 1))

ht = ht.annotate(proj_title = 'constant')
ht = ht.annotate(subpop = hl.str(ht.idx % 3))

p = (
    hl.ggplot.ggplot(ht, hl.ggplot.aes(x=ht.x, y=ht.y, color=ht.subpop, shape=ht.proj_title))
    + hl.ggplot.geom_point()
)
p.show()
```

Produces a graph like this one:

<graph here>

When it should instead produce this graph:

<graph here>

Further complicating matters, the first graph cannot be reproduced
consistently due to the nature of the bug. See my comment on the code
below for more information.

This change fixes the bug.
@iris-garden iris-garden force-pushed the query/ggplot/legends-again branch from 5817940 to 172a135 Compare April 20, 2023 16:14
@danking
Copy link
Contributor

danking commented Apr 21, 2023

If ready for review, "assign" someone and it'll show up in their CI queue :)

@danking danking merged commit 7d02f56 into hail-is:main Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants