Create a datamation for binary outcomes #98

Closed
jhofman opened this issue Sep 29, 2021 · 44 comments · Fixed by #112
Labels
brainstorming gemini priority next action visualizing-verbs has to do with how a data analysis verb is represented visually

Comments

@jhofman
Contributor

jhofman commented Sep 29, 2021

This is a followup to #97, which is complicated because it deals with low base rates.

So to simplify things we'll start w/ visualizing Simpson's Paradox in batting averages instead, as explained by this example on Wikipedia. This example compares two players and shows that while one has a higher batting average than the other within each year, the trend reverses if you look across both years. This happens because of the uneven number of at-bats that each player has in each year.
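As a quick check of the arithmetic behind the reversal, here is a minimal JavaScript sketch using the hits and at-bats from the Wikipedia table. The `counts` object and the `battingAverage` and `combined` helpers are illustrative names only and are not part of datamations.

```javascript
// Hits and at-bats per player and year, from the Wikipedia table.
const counts = {
  "Derek Jeter":   { 1995: { hits: 12,  atBats: 48  }, 1996: { hits: 183, atBats: 582 } },
  "David Justice": { 1995: { hits: 104, atBats: 411 }, 1996: { hits: 45,  atBats: 140 } },
};

// Batting average = hits / at-bats (illustrative helper).
function battingAverage({ hits, atBats }) {
  return hits / atBats;
}

// Pool a player's seasons by summing hits and at-bats before dividing.
function combined(seasons) {
  const totals = Object.values(seasons).reduce(
    (acc, s) => ({ hits: acc.hits + s.hits, atBats: acc.atBats + s.atBats }),
    { hits: 0, atBats: 0 }
  );
  return battingAverage(totals);
}

for (const [player, seasons] of Object.entries(counts)) {
  const perYear = Object.entries(seasons)
    .map(([year, s]) => `${year}: ${battingAverage(s).toFixed(3)}`)
    .join(", ");
  console.log(`${player} -> ${perYear}, combined: ${combined(seasons).toFixed(3)}`);
}
```

Justice is ahead within each year (.253 vs .250 and .321 vs .314), but Jeter is ahead once the years are pooled (.310 vs .270), because most of Jeter's at-bats come from 1996 while most of Justice's come from 1995.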

Below is some R code to make the final plot versions, and the task here is to brainstorm what the datamation will look like.

Right now here's what we're thinking for the overall datamation (across both years):

  • Start with one grid per player, where each dot represents one at-bat
  • Then shade each dot based on whether it represents a hit (filled in solid) or not (empty, with only a stroked outline)
  • Next, collapse the points to their averages and error bars

The datamation that breaks out each year will be similar, but the grid will be a 2-by-2 (player + year).

library(tidyverse)

theme_set(theme_minimal())

# data from https://en.wikipedia.org/wiki/Simpson's_paradox#Batting_averages
# Batter           1995             1996             Combined
# Derek Jeter      12/48    .250    183/582  .314    195/630  .310
# David Justice    104/411  .253    45/140   .321    149/551  .270

jeter_1995 <- data.frame(
  player = "Derek Jeter",
  year = 1995,
  is_hit = c(rep(1, 12), rep(0, 48-12))
)

jeter_1996 <- data.frame(
  player = "Derek Jeter",
  year = 1996,
  is_hit = c(rep(1, 183), rep(0, 582-183))
)

justice_1995 <- data.frame(
  player = "David Justice",
  year = 1995,
  is_hit = c(rep(1, 104), rep(0, 411-104))
)

justice_1996 <- data.frame(
  player = "David Justice",
  year = 1996,
  is_hit = c(rep(1, 45), rep(0, 140-45))
)

df <- bind_rows(jeter_1995,
                jeter_1996,
                justice_1995,
                justice_1996)

# datamation #1:
# jeter has a higher batting average than justice overall
df %>%
  group_by(player) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()))  %>%
  ggplot(aes(x = player, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se)) +
  labs(x = "",
       y = "Batting average")
  #geom_bar(stat = "identity")

# datamation #2:
# but justice has a higher batting average than jeter within each year
df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()) ) %>%
  ggplot(aes(x = as.factor(year), y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se),
    position = position_dodge(width = 0.25)) +
  labs(x = "",
       y = "Batting average")
  #geom_bar(stat = "identity", position = "dodge")


@jhofman jhofman added visualizing-verbs has to do with how a data analysis verb is represented visually priority next action gemini brainstorming labels Sep 29, 2021
@giorgi-ghviniashvili
Collaborator

@jhofman , I created this datamation.

Its frames are:

  • starts with groups by year and player
  • then shows whether each at-bat is a hit (is_hit). (I could not find a way to show some points as filled circles and others as stroked circles, because the mark has to be either point or circle.)
  • then shows batting averages within each year
  • and finally shows the overall averages, which reveals the paradox.

I think we don't need two datamations, only one, like this. What do you think?

simpson.mov

@jhofman
Contributor Author

jhofman commented Sep 30, 2021

Very cool on the quick turnaround for this @giorgi-ghviniashvili!

I think it's a very good start.

Small details:

  • I would switch the year to be the x axis and player to be the y axis
  • "Betting" should be spelled "Batting"
  • It would be great to find an open circle---it must be possible, right?

As for the two vs. one datamation, I see what you mean. At the same time, I think it's nice to have them separately as well so you can compare them. That's what we did w/ the salary data and I think it was effective. Can you generate each separately so we could see them side-by-side?

@giorgi-ghviniashvili
Collaborator

Hi @jhofman,

  • switched year to be x axis and player to be y axis
  • fixed typo "Betting"
  • open vs filled circle, I made this work.

About the 2 datamations side by side: to make this work, I must wrap the whole app.js in a closure function (or class), otherwise it is not possible to have two instances at the same time. When init is called a second time, it overwrites the old values because they are in the global scope.

So this:
image

Must be changed to this (notice the function App() {} declaration, which encloses the code):

image
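To illustrate why the wrapper matters, here is a minimal sketch (not the actual app.js; `createApp`, `init`, `next`, and the `state` fields are hypothetical names). Each call creates its own private state, so a second `init` no longer overwrites the first.

```javascript
// Hypothetical factory: every call gets its own closure-scoped state,
// instead of sharing variables declared at the top level of the script.
function createApp(containerId) {
  const state = { containerId, specs: [], frameIndex: 0 };

  function init(specs) {
    state.specs = specs.slice(); // copy so callers cannot mutate our state
    state.frameIndex = 0;
  }

  // Advance to the next frame, stopping at the last one.
  function next() {
    if (state.frameIndex < state.specs.length - 1) state.frameIndex += 1;
    return state.specs[state.frameIndex];
  }

  return { init, next, current: () => state.specs[state.frameIndex] };
}

// Two independent instances can now live on the same page.
const left = createApp("app-left");
const right = createApp("app-right");
left.init(["grid", "is_hit", "mean"]);
right.init(["grid-by-year", "is_hit-by-year", "mean-by-year"]);
left.next();
console.log(left.current(), right.current());
```

Advancing `left` does not move `right`, which is the property needed to show two datamations side by side.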

@jhofman
Contributor Author

jhofman commented Sep 30, 2021 via email

@giorgi-ghviniashvili
Collaborator

I just modified it to show the two side by side. @jhofman let me know what you think.

side-by-side.mov

@giorgi-ghviniashvili
Collaborator

@jhofman, wrapping it in a function, in a private scope, is best practice. I hope @sharlagelfand can modify htmlwidgets to support this. Otherwise I can revert that change. Right now it does not create any problems, because it is in a separate branch.

@jhofman
Contributor Author

jhofman commented Oct 1, 2021

Gotcha. Is there a way to get better spacing on the circles so they don't overlap?

@giorgi-ghviniashvili
Collaborator

@jhofman good point. I reduced circle radius:

image

@jhofman
Contributor Author

jhofman commented Oct 4, 2021

Smaller radius looks much better.

Can you update colors to be consistent as well, so that Derek Jeter's points are all orange from the first frame to the last?

Also, do you think it's worth changing the orientation of the initial frames so the players are side-by-side (1 row, 2 columns) instead of stacked on top of each other (1 column, 2 rows)? That should transition more naturally to the final frame where players' names are on the x axis, right?

Once those are set can you render the full animation to see what it looks like? (I'm seeing just the key frames at the moment.)

@giorgi-ghviniashvili
Collaborator

@jhofman

I can't make the colors consistent in the second frame, because I am setting the fill and stroke colors based on hit. So if hit === yes, then fill blue, otherwise #fff.

I am not able to add an expression like this: datum['player'] == 'Derek' && datum['hit'] === 'hit' ? 'orange' : '#fff'. The only thing I can do is map [yes, no] to ['blue', 'white'].

About placing the players side by side: I chose stacked because of space. But now I did this:

datamations.mov

@jhofman
Contributor Author

jhofman commented Oct 5, 2021

This is a nice update!

On the colors, what can we do as a workaround for this? I think it will be generally important to have this kind of functionality.

Is it a limitation due to Gemini or Vega, or something about the stack you've built on top of them?

@giorgi-ghviniashvili
Collaborator

I could not figure it out in Vega. I can still try some workarounds.

@jhofman
Contributor Author

jhofman commented Oct 7, 2021

Ah that's too bad. I didn't realize this would be so difficult. Would using different shapes (instead of filled vs. empty circles) be any easier?

@giorgi-ghviniashvili
Collaborator

Yes, using different shapes will be easier. Can I try triangle and circle?

@jhofman
Contributor Author

jhofman commented Oct 7, 2021 via email

@giorgi-ghviniashvili
Collaborator

Ah yes, square and circle will be ok. I will try tomorrow.

@giorgi-ghviniashvili
Collaborator

@jhofman I easily made it with shapes:

image

image

giorgi-ghviniashvili added a commit that referenced this issue Oct 9, 2021
@sharlagelfand
Collaborator

Hi @jhofman @giorgi-ghviniashvili! Just catching up on this thread 👋🏻 I think we need to back up a bit and consider how we can generate these visualizations within the existing datamations framework and API - right now we aren't even able to handle non-numeric (i.e. binary or categorical) response variables, even though it was technically possible to generate these visualizations via custom specs as Giorgi has done. I'm going to put together some thoughts / adjustments on how we can start to handle this, hopefully over the next day or two. Just wanted to say my eyes are back on! 👀

@sharlagelfand
Collaborator

I'll have to detect it on my end whether the response variable is categorical (e.g. character / factor in R) or binary (0/1 or TRUE/FALSE) and generate another info grid at the "display the response variable" step, instead of the jittered scatter plot that we're currently generating at that step (which would still be used when the variable is numeric).

If we want to map shape to categorical / binary variables, then the grid generation function needs to be able to take shape - @giorgi-ghviniashvili, would you be able to update the generateGrid() function so that shape could be passed in, like color is? I will pass it the same way, e.g. spec.encoding.shape like how spec.encoding.color is passed.

I was thinking too, re: the comment about the spacing on the circles, that the number of rows in the grid generation could be dynamic - right now it is fixed to 10, but we could base it on how much data there is instead. e.g. there are 551 points for Justice, 630 for Jeter, so with only 10 rows that means about 55 and 63 columns respectively.

If we base it on the number of points to try to get a "square" grid for the biggest group, e.g. sqrt(630) = 25 approximately, then that would mean 22 columns for Justice and 25 for Jeter, which would look a lot more balanced.

Thanks!

@jhofman
Contributor Author

jhofman commented Oct 19, 2021

@giorgi-ghviniashvili will work on hacking things to get filled vs. empty circles, and adjust grid to fill from top to bottom and left to right.

+1 to the idea of making an adaptive spacing square grid that @sharlagelfand suggested. when not a perfect square let's err on the side of more columns than rows for a "wider" grid, so col = ceiling(sqrt(N))
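A minimal sketch of that sizing rule, assuming the number of columns is set from the largest group as `ceiling(sqrt(N))`, all groups share the same number of rows, and points fill each column top to bottom before moving left to right. `gridDimensions` and `gridPositions` are illustrative names, not the actual `generateGrid()` implementation.

```javascript
// Size the grid from the largest group: cols = ceil(sqrt(N)), then just
// enough rows to hold N points. All groups share the same number of rows.
function gridDimensions(groupSizes) {
  const maxN = Math.max(...groupSizes);
  const cols = Math.ceil(Math.sqrt(maxN));
  const rows = Math.ceil(maxN / cols);
  return { rows, cols };
}

// Place n points column by column: top to bottom, then left to right.
function gridPositions(n, rows) {
  return Array.from({ length: n }, (_, i) => ({
    col: Math.floor(i / rows),
    row: i % rows,
  }));
}

const sizes = { "Derek Jeter": 630, "David Justice": 551 };
const { rows, cols } = gridDimensions(Object.values(sizes));
console.log({ rows, cols });

for (const [player, n] of Object.entries(sizes)) {
  const usedCols = Math.ceil(n / rows);
  console.log(`${player}: ${n} points in ${usedCols} columns of up to ${rows} rows`);
}
```

With the full data this gives 26 columns by 25 rows for Jeter, and Justice's 551 points occupy 23 columns, the last of which is only partly filled. Because the rows are computed from `ceil(sqrt(N))` columns, the grid is never taller than it is wide.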

wrt @sharlagelfand's question about passing shape to generateGrid(), it sounds like @giorgi-ghviniashvili took care of this in a different branch (see here). let's test that and see if it can be merged into main?

@sharlagelfand will work on the first paragraph, detecting whether we have a binary outcome. this could include:

  • 0/1 or T/F
  • 2 level factor
  • character that converts to a 2 level factor

and then make the updates for the subsequent generateGrid() calls.
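The check described above could look roughly like the following. This is a JavaScript illustration only: the real detection would happen on the R side, and `isBinaryOutcome` is a hypothetical helper. It treats a variable as binary when its non-missing values are all logical, all 0/1, or strings with exactly two distinct values (the analogue of a 2-level factor, or a character vector that converts to one).

```javascript
// Hypothetical helper: decide whether a response variable is binary.
function isBinaryOutcome(values) {
  // Ignore missing values, much as NA would be ignored when finding levels.
  const present = values.filter((v) => v !== null && v !== undefined);
  if (present.length === 0) return false;

  const distinct = new Set(present);

  // Case 1: logical values (TRUE/FALSE).
  if (present.every((v) => typeof v === "boolean")) return true;

  // Case 2: numeric values restricted to 0 and 1.
  if (present.every((v) => typeof v === "number")) {
    return [...distinct].every((v) => v === 0 || v === 1);
  }

  // Case 3: strings with exactly two distinct values (a 2-level factor).
  if (present.every((v) => typeof v === "string")) {
    return distinct.size === 2;
  }

  return false;
}

console.log(isBinaryOutcome([1, 0, 0, 1]));          // true
console.log(isBinaryOutcome([true, false, null]));   // true
console.log(isBinaryOutcome(["hit", "out", "hit"])); // true
console.log(isBinaryOutcome([0.25, 0.31]));          // false
console.log(isBinaryOutcome(["a", "b", "c"]));       // false
```

Numeric variables that take two distinct values other than 0 and 1 are not treated as binary in this sketch; whether they should be is a design decision left open here.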

@jhofman
Contributor Author

jhofman commented Oct 26, 2021

@giorgi-ghviniashvili, just for reference, here's some ggplot2 code that creates the desired player + hit/no-hit split:

ggplot(data.frame(x = 1:4, y = rep(1, 4), player = c("A","A","B","B"), hit = c(T,F,T,F)), aes(x = x, y = y, shape = hit, color = player)) + geom_point(size = 5) + scale_shape_manual(values = c(1,19))

it looks like vegalite doesn't have a fillOpacity legend, unclear if vega does. vega spec below to play with, doesn't behave as expected.

some related issues:

vega/vega-lite#4982
vega/vega-lite#4495
vega/vega#1513
vega/vega-lite#5030
vega/vega#751

{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "height": 400,
  "padding": 5,

  "signals": [
    { "name": "chartWidth", "value": 300 },
    { "name": "chartPad", "value": 20 },
    { "name": "width", "update": "2 * chartWidth + chartPad" },
    { "name": "year", "value": 2000,
      "bind": {"input": "range", "min": 1850, "max": 2000, "step": 10} }
  ],

  "data": [
    {
      "name": "population",
      "url": "data/population.json"
    },
    {
      "name": "popYear",
      "source": "population",
      "transform": [
        {"type": "filter", "expr": "datum.year == year"}
      ]
    },
    {
      "name": "males",
      "source": "popYear",
      "transform": [
        {"type": "filter", "expr": "datum.sex == 1"}
      ]
    },
    {
      "name": "females",
      "source": "popYear",
      "transform": [
        {"type": "filter", "expr": "datum.sex == 2"}
      ]
    },
    {
      "name": "ageGroups",
      "source": "population",
      "transform": [
        { "type": "aggregate", "groupby": ["age"] }
      ]
    }
  ],

  "scales": [
    {
      "name": "y",
      "type": "band",
      "range": [{"signal": "height"}, 0],
      "round": true,
      "domain": {"data": "ageGroups", "field": "age"}
    },
    {
      "name": "c",
      "type": "ordinal",
      "domain": [1, 2],
      "range": ["#1ba66e", "#d7603c"]
    },
    {
      "name": "ctext",
      "type": "ordinal",
      "domain": ["Male", "Female"],
      "range": ["#1ba66e", "#d7603c"]
    },
    {
      "name": "f",
      "type": "ordinal",
      "domain": ["Male", "Female"],
      "range": [0,1]
    }
  ],

  "marks": [
    {
      "type": "text",
      "interactive": false,
      "from": {"data": "ageGroups"},
      "encode": {
        "enter": {
          "x": {"signal": "chartWidth + chartPad / 2"},
          "y": {"scale": "y", "field": "age", "band": 0.5},
          "text": {"field": "age"},
          "baseline": {"value": "middle"},
          "align": {"value": "center"},
          "fill": {"value": "#000"}
        }
      }
    },
    {
      "type": "group",

      "encode": {
        "update": {
          "x": {"value": 0},
          "height": {"signal": "height"}
        }
      },

      "scales": [
        {
          "name": "x",
          "type": "linear",
          "range": [{"signal": "chartWidth"}, 0],
          "nice": true, "zero": true,
          "domain": {"data": "population", "field": "people"}
        }
      ],

      "axes": [
        {"orient": "bottom", "scale": "x", "format": "s"}
      ],

      "marks": [
        {
          "type": "rect",
          "from": {"data": "females"},
          "encode": {
            "enter": {
              "x": {"scale": "x", "field": "people"},
              "x2": {"scale": "x", "value": 0},
              "y": {"scale": "y", "field": "age"},
              "height": {"scale": "y", "band": 1, "offset": -1},
              "fill": {"scale": "c", "field": "sex"},
              "fillOpacity": {"scale": "f", "field": "sex"}
            }
          }
        }
      ],

      "legends": [
          {
              "fill": "ctext",
              "title": "Sex",
              "encode": {"symbols": {"enter": {"fillOpacity": {"value": 0.6}}}}
          }
      ]
    },
    {
      "type": "group",

      "encode": {
        "update": {
          "x": {"signal": "chartWidth + chartPad"},
          "height": {"signal": "height"}
        }
      },

      "scales": [
        {
          "name": "x",
          "type": "linear",
          "range": [0, {"signal": "chartWidth"}],
          "nice": true, "zero": true,
          "domain": {"data": "population", "field": "people"}
        }
      ],

      "axes": [
        {"orient": "bottom", "scale": "x", "format": "s"}
      ],

      "marks": [
        {
          "type": "rect",
          "from": {"data": "males"},
          "encode": {
            "enter": {
              "x": {"scale": "x", "field": "people"},
              "x2": {"scale": "x", "value": 0},
              "y": {"scale": "y", "field": "age"},
              "height": {"scale": "y", "band": 1, "offset": -1},
              "fill": {"scale": "c", "field": "sex"},
              "fillOpacity": {"value": 0.6}
            }
          }
        }
      ]
    }
  ]
}

@giorgi-ghviniashvili
Collaborator

giorgi-ghviniashvili commented Nov 2, 2021

@jhofman I was playing with the batting averages datamation and I think that if we want to compare players within the year like this:
image

Then we need to have years as column facets in the previous frames as well, otherwise the animation looks ugly.

ugly-datamation.mov

This is better I think:

adjustment.mov

Lmk your thoughts.

@giorgi-ghviniashvili
Collaborator

giorgi-ghviniashvili commented Nov 2, 2021

@jhofman hit yes/no legend achieved using shape encoding and legend symbolFillColor 😎 :

image

--

image

--

hit.legend.mov

P.S. code pushed to fill-vs-stroked-legend branch.

@giorgi-ghviniashvili
Collaborator

There is a gist for the solution spec.

@sharlagelfand
Collaborator

@giorgi-ghviniashvili I think this solution only works when there is a variable mapped to color:

Here is your example pared down:

Screen Shot 2021-11-02 at 11 53 53 AM

And how it looks with just the colour mapping removed:

Screen Shot 2021-11-02 at 11 54 13 AM

So unfortunately I'm not sure if this solution is generalizable to specs that don't have a variable mapped to color - do you have thoughts?

@giorgi-ghviniashvili
Collaborator

@sharlagelfand yes, good catch. It seems that vega-lite gives the shape encoding a default color (blue) and ignores all legend parameters. As long as we are sure to have color, I think this solution will work. We just need to make sure that no shape is passed when color is missing.

@sharlagelfand
Collaborator

I don't think we can guarantee that color will be present @giorgi-ghviniashvili so we will need to find some solution for when it is missing

@giorgi-ghviniashvili
Collaborator

giorgi-ghviniashvili commented Nov 2, 2021

@sharlagelfand if you don't have color, then use this approach with fill and stroke in combination.

If you do have color, then use shape, fillOpacity, and stroke with color. Will that work?

The problem we had with filled vs. non-filled was color. If there is no color, we can achieve this with fill and stroke. That was my first solution.

@sharlagelfand
Collaborator

Thanks @giorgi-ghviniashvili!

Just want to share where things are at with the Simpson's Paradox example now since I have made pretty good progress. Just a note that I have sampled the data (~30%) since we cannot support that many points (#51), so the actual numbers might not look as you expect @jhofman

grouping by player only

Specs

player.only.mov

grouping by player and year

Specs

group.by.player.and.year.mov

Things are a bit off here - in the frame with is_hit, the legend for it seems to appear twice - once overlapping the colour player legend

Screen Shot 2021-11-03 at 4 55 23 PM

Also, the placement of the mean and errorbar are off in the mean / errorbar frames - definitely off in the X values, but I think off in Y too when you compare the second last and last frames - and the y axis values are not even showing up! @giorgi-ghviniashvili could you please help me figure out why? thanks!

Screen Shot 2021-11-03 at 4 55 28 PM

Screen Shot 2021-11-03 at 4 55 29 PM

@jhofman
Contributor Author

jhofman commented Nov 5, 2021

@giorgi-ghviniashvili is going to hide the faked legend, which should take care of that problem.

sounded like there was a css fix for the x axis misalignment, and something that needed to be added to the spec to fix the y-axis annotations?

@giorgi-ghviniashvili
Collaborator

To hide a faked legend, please include css:

.vega-vis-wrapper .vega-for-axis .role-legend {
    display: none;
}

About the second issue of the positions being off, @sharlagelfand: it only happens when there is a y axis title but no scale.domain. I included scale.domain = [0, 0.4] in the y encoding and the issue is solved. Error bars are working great as well. I think the general solution should be to always include scale.domain if there is a title for the y axis. It is also needed for the hacked faceted view to have a real domain, otherwise the circle positions are not correct, along with the error bars.

image
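A minimal sketch of that rule as a spec post-processing step, assuming a Vega-Lite spec object whose y title may live in `encoding.y.title` or `encoding.y.axis.title`. `ensureYDomain` is a hypothetical helper, and `[0, 0.4]` is simply the domain used for this example.

```javascript
// Hypothetical helper: if the y encoding has a title but no explicit
// scale.domain, add one. Returns a new spec and leaves the input untouched.
function ensureYDomain(spec, domain) {
  const y = spec.encoding && spec.encoding.y;
  if (!y) return spec;

  const hasTitle = Boolean(y.title || (y.axis && y.axis.title));
  const hasDomain = Boolean(y.scale && y.scale.domain);
  if (!hasTitle || hasDomain) return spec;

  return {
    ...spec,
    encoding: {
      ...spec.encoding,
      y: { ...y, scale: { ...(y.scale || {}), domain } },
    },
  };
}

const spec = {
  mark: "point",
  encoding: {
    x: { field: "player", type: "nominal" },
    y: { field: "is_hit", type: "quantitative", title: "mean(is_hit)" },
  },
};

const fixed = ensureYDomain(spec, [0, 0.4]);
console.log(JSON.stringify(fixed.encoding.y.scale)); // {"domain":[0,0.4]}
```

Specs that already carry a domain, or that have no y title, pass through unchanged.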

Here is how it looks fixed:

fixed-errorbars.mov

@sharlagelfand
Collaborator

thanks @giorgi-ghviniashvili, that works great now! Just want to confirm that we will never need to see the faked legend, and that we only use real ones? so using

.vega-vis-wrapper .vega-for-axis .role-legend {
    display: none;
}

will never hide something that we actually need to see.

Here is how the datamations look now (cc @jhofman) - I think they look pretty good!!

group by player

specs

one.mov

There is one slight issue with timing here where the y-axis values show up at the end of the animation between the "is_hit" frame and "mean is_hit" frames, if that's something that can be fixed. There doesn't seem to be the same issue in the second animation!

group by player and year

specs

two.mov

@sharlagelfand
Collaborator

And just wanted to share how categorical values look, with shape!

library(palmerpenguins)

"penguins %>%
  group_by(island) %>%
  summarise(n = n_distinct(species))" %>%
  datamation_sanddance()
categorical.mov

@giorgi-ghviniashvili
Collaborator

giorgi-ghviniashvili commented Nov 9, 2021

@sharlagelfand yes, I confirm that the css only hides faked legend.

About the axis issue, it seems to be a gemini issue: animating from A to B, where A does not have a y axis and B does, causes it. It seems that when we have facets the issue is gone, because the axis is drawn via the faked axis layer and not the actual axis.

I tested the gemini recommendations and all of them are missing the y axis.

image

@sharlagelfand
Collaborator

@giorgi-ghviniashvili it looks like there's an issue with the grid generation - e.g. this data set:

# A tibble: 5 × 2
  player  is_hit
  <chr>    <dbl>
1 Justice      1
2 Justice      0
3 Jeter        1
4 Jeter        0
5 Jeter        0

I send this spec for showing the values of is_hit

{
  "height": 300,
  "width": 300,
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": {
    "parse": "grid",
    "axes": false,
    "description": "Plot is_hit within each group",
    "splitField": "player",
    "xAxisLabels": ["Jeter", "Justice"]
  },
  "data": {
    "values": [
      {
        "player": "Jeter",
        "is_hit": 0,
        "n": 2,
        "datamations_y_tooltip": 0
      },
      {
        "player": "Jeter",
        "is_hit": 1,
        "n": 1,
        "datamations_y_tooltip": 1
      },
      {
        "player": "Justice",
        "is_hit": 0,
        "n": 1,
        "datamations_y_tooltip": 0
      },
      {
        "player": "Justice",
        "is_hit": 1,
        "n": 1,
        "datamations_y_tooltip": 1
      }
    ]
  },
  "mark": {
    "type": "point",
    "filled": true
  },
  "encoding": {
    "x": {
      "field": "datamations_x",
      "type": "quantitative",
      "axis": null
    },
    "y": {
      "field": "datamations_y",
      "type": "quantitative",
      "axis": null
    },
    "fill": {
      "field": "is_hit",
      "scale": {
        "domain": [1, 0],
        "range": ["#4c78a8", "#ffffff"]
      }
    },
    "stroke": {
      "field": "is_hit",
      "scale": {
        "domain": [1, 0],
        "range": ["#4c78a8", "#4c78a8"]
      }
    }
  }
}

but the real spec that the JS code produces has no values with is_hit = 1, they are all 0:

Screen Shot 2021-11-09 at 2 41 33 PM
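For reference, the expected behavior can be sketched as follows, assuming the grid parser expands each record into `n` individual points that keep the record's `is_hit` value. `expandCounts` is a hypothetical helper, not the actual JS code.

```javascript
// Hypothetical helper: turn count records into one record per point,
// carrying every other field (player, is_hit, ...) along unchanged.
function expandCounts(values) {
  return values.flatMap(({ n, ...rest }) =>
    Array.from({ length: n }, () => ({ ...rest }))
  );
}

const values = [
  { player: "Jeter", is_hit: 0, n: 2 },
  { player: "Jeter", is_hit: 1, n: 1 },
  { player: "Justice", is_hit: 0, n: 1 },
  { player: "Justice", is_hit: 1, n: 1 },
];

const points = expandCounts(values);
console.log(points.length);                               // 5
console.log(points.filter((p) => p.is_hit === 1).length); // 2
```

The result should contain two points with is_hit = 1, matching the five-row tibble above, regardless of the order of the input records.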

@giorgi-ghviniashvili
Collaborator

@sharlagelfand there was a small bug with an index. Fixed it:

image

@sharlagelfand
Collaborator

Thanks @giorgi-ghviniashvili, that case is fixed. It does not seem to be generalizable though - here is a small variant, where the only change is "is_hit": 1 comes before "is_hit": 0.

{
  "height": 300,
  "width": 300,
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": {
    "parse": "grid",
    "axes": false,
    "description": "Plot is_hit within each group",
    "splitField": "player",
    "xAxisLabels": ["Jeter", "Justice"]
  },
  "data": {
    "values": [
      {
        "player": "Jeter",
        "is_hit": 0,
        "n": 2,
        "datamations_y_tooltip": 0
      },
      {
        "player": "Jeter",
        "is_hit": 1,
        "n": 1,
        "datamations_y_tooltip": 1
      },
      {
        "player": "Justice",
        "is_hit": 0,
        "n": 1,
        "datamations_y_tooltip": 0
      },
      {
        "player": "Justice",
        "is_hit": 1,
        "n": 1,
        "datamations_y_tooltip": 1
      }
    ]
  },
  "mark": {
    "type": "point",
    "filled": true
  },
  "encoding": {
    "x": {
      "field": "datamations_x",
      "type": "quantitative",
      "axis": null
    },
    "y": {
      "field": "datamations_y",
      "type": "quantitative",
      "axis": null
    },
    "fill": {
      "field": "is_hit",
      "scale": {
        "domain": [1, 0],
        "range": ["#4c78a8", "#ffffff"]
      }
    },
    "stroke": {
      "field": "is_hit",
      "scale": {
        "domain": [1, 0],
        "range": ["#4c78a8", "#4c78a8"]
      }
    }
  }
}

Here are the specs it produces - all of the values of "is_hit" are 0 when there should be 1s.

Screen Shot 2021-11-10 at 11 25 07 AM

@giorgi-ghviniashvili
Collaborator

@sharlagelfand good point, fixed.

image


3 participants