Add visualization of binary variables to existing Datamations demo #138

Closed
msftbozo opened this issue Jan 24, 2022 · 25 comments
Assignees
Labels
documentation Improvements or additions to documentation priority next action r rendering-animations has to do with rendering an animation

Comments

@msftbozo

Since we completed the visualization of binary variables, it would be great to add a dataset that has binary variables alongside the existing (penguins/salaries) datasets and datamate it.

@msftbozo changed the title from "Add a binary variable dataset of existing Datamations demo" to "Add visualization of binary variables to existing Datamations demo" on Jan 24, 2022
@jhofman
Contributor

jhofman commented Jan 28, 2022

Let's see if we can use what's in #98 to make a nice example here using the baseball data.

If it looks good we can add it to the docs or a vignette or something.

@jhofman jhofman added documentation Improvements or additions to documentation priority next action r rendering-animations has to do with rendering an animation labels Jan 28, 2022
@willdebras
Collaborator

I will have this prototyped to show Thursday.

@msftbozo
Author

msftbozo commented Feb 1, 2022

that's fantastic thanks @willdebras

@willdebras
Collaborator

Had to make some minor bugfixes to get binary variables working. Previously, across environments (the renv environment, a containerized 4.1 with fresh installs, and 4.0.5 with fresh installs), datamation_sanddance() was erroring for binary variables because of a util call that generates tooltip specs. That call expected gemini IDs even though it was being passed already-aggregated data. To calculate unique fields, the function removed every column that wasn't a grouping column, and it tried to remove gemini_ids even when that column didn't exist.

This commit removes this field only if it exists:

924ce7d
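
As a rough illustration of the guard described above (not the code from the commit), dropping a column only when it is present could look like this in base R. The helper name `drop_if_present` and the toy data frames are hypothetical; with dplyr, `select(-any_of("gemini_ids"))` has the same effect.

```r
# Hypothetical helper: remove the named columns from a data frame,
# silently skipping any that are not present.
drop_if_present <- function(data, cols) {
  data[, setdiff(names(data), cols), drop = FALSE]
}

# Already-aggregated data has no gemini_ids column ...
aggregated <- data.frame(player = c("a", "b"), mean = c(0.25, 0.31))
# ... while row-level data does.
raw <- data.frame(player = c("a", "b"), is_hit = c(1, 0), gemini_ids = 1:2)

names(drop_if_present(aggregated, "gemini_ids"))  # "player" "mean"
names(drop_if_present(raw, "gemini_ids"))         # "player" "is_hit"
```

Both calls succeed, whereas an unconditional removal would fail on the aggregated input.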

A basic datamation with this binary variable from the following call:

"df %>%
  group_by(player, year) %>%
  summarise(mean = mean(is_hit))"  %>% datamation_sanddance()
Viewer.Zoom.2022-02-03.08-38-08.mp4

I'll provide more examples of variants of this in the next day or two.

@willdebras
Collaborator

Also, @jhofman/@giorgi-ghviniashvili, do you have suggestions on best practice for showcasing/recording these? This one is pretty low res and has my cursor over top of it :)

@jhofman
Contributor

jhofman commented Feb 4, 2022

@willdebras: did you record with the native screencapture tool? that usually works pretty well for me, but @giorgi-ghviniashvili may have some other tricks up his sleeve!

@giorgi-ghviniashvili
Collaborator

@willdebras on Mac, I was using QuickTime Player, which records a .mov file. But lately GitHub had an issue embedding a player for .mov files. I don't know whether they have resolved it yet.

@willdebras
Collaborator

A couple of examples of Simpson's paradox with the baseball data for review tomorrow, additionally showcasing passing ggplot2 code to datamation_sanddance():

# datamation #1:
# jeter has a higher batting average than justice overall
'df %>%
  group_by(player) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()))  %>%
  ggplot(aes(x = player, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()

datamation_jeter_1

# datamation #2:
# but justice has a higher batting average than jeter within each year
'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()) ) %>%
  ggplot(aes(x = year, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se),
                  position = position_dodge(width = 0.25)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()
  #geom_bar(stat = "identity", position = "dodge")

datamation_jeter_2
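
For readers who want to check the paradox by hand, here is a minimal base R sketch. The hit and at-bat counts are the figures commonly cited for this Jeter/Justice example; treating them as what the packaged baseball dataset contains is an assumption, and the data frame `counts` is built here only for illustration.

```r
# Hit and at-bat counts as commonly cited for this example (an assumption
# about what the packaged dataset contains).
counts <- data.frame(
  player  = c("Derek Jeter", "Derek Jeter", "David Justice", "David Justice"),
  year    = c("1995", "1996", "1995", "1996"),
  hits    = c(12, 183, 104, 45),
  at_bats = c(48, 582, 411, 140)
)

# Within each year: hits / at-bats per player.
by_year <- transform(counts, batting_average = hits / at_bats)

# Overall: pool hits and at-bats across years before dividing.
overall <- aggregate(cbind(hits, at_bats) ~ player, data = counts, FUN = sum)
overall$batting_average <- overall$hits / overall$at_bats
# Standard error of a proportion, the same formula as se in the pipelines above.
overall$se <- sqrt(overall$batting_average * (1 - overall$batting_average) /
                     overall$at_bats)

print(by_year)
print(overall)
```

With these counts Justice is ahead in both years, yet Jeter is ahead overall (195/630 versus 149/551), because most of Jeter's at-bats fall in the higher-average year.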

@willdebras
Collaborator

Viewer.Zoom.2022-02-09.08-57-49.mp4

@jhofman
Contributor

jhofman commented Feb 9, 2022

@willdebras: looks better without the ggplot code at the end. we should probably modify things at some point so the ggplot-ed version looks better, but let's focus on this default version for now.

'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()) )' %>%
  datamation_sanddance()

@giorgi-ghviniashvili for some reason we're seeing the points reshuffle between 11 seconds and 14 seconds. would be great to eliminate that and just have the points go from all solid to only the hits being solid. can you take a look at this? seems like a sorting issue on the point ids.

let's try two versions:

  1. points are pre-sorted by is_hit at 11 seconds so that there's no visual resorting, things just hop to 14 seconds
  2. points are not pre-sorted, is_hit lights up after 11 seconds as part of "plot is_hit within each group", and then they get rearranged as they are at 14 seconds.
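
A minimal sketch of the first version, assuming point ids are assigned on the R side in row order: sort by group and then by is_hit before numbering the points, so each point keeps its slot once is_hit is introduced. The toy data and the `gemini_id` column here are hypothetical and do not reflect how the package assigns ids.

```r
# Toy at-bat level data.
at_bats <- data.frame(
  player = c("a", "a", "a", "b", "b", "b"),
  is_hit = c(0, 1, 0, 1, 0, 1)
)

# Order points by group, hits first, then assign ids in that order.
# order() keeps ties in their original order, so the result is deterministic.
presorted <- at_bats[order(at_bats$player, -at_bats$is_hit), ]
presorted$gemini_id <- seq_len(nrow(presorted))
rownames(presorted) <- NULL

print(presorted)
```

For the second version the ids would be assigned before sorting, so the later frame has to move points to their sorted positions.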

@jhofman
Contributor

jhofman commented Feb 11, 2022

@willdebras, can you continue to work on this and also play with the two different sortings mentioned in the above comment?

@willdebras
Collaborator

I've merged a PR adding the baseball dataset, the fixes to the binary variables, and some added documentation.

Working through the sorting methods mentioned.

@willdebras
Collaborator

It looks like sorting here works for a single group_by, but not for two. This is definitely something to fix on the R side. It seems that when plotting summarize calls with multiple groups for binary variables, we miss appending the gemini IDs somewhere, so the points get rearranged.

From the baseball example, you can see the values field should have a gemini_id field with an array of ids, but doesn't:

"meta": {
  "parse": "grid",
  "axes": true,
  "description": "Plot is_hit within each group",
  "splitField": "year"
},
"data": {
  "values": [
    { "player": "David Justice", "year": "1995", "is_hit": 1, "n": 104 },
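
One way to spot such steps is to scan every spec for rows that lack the field. The sketch below assumes the specs have been parsed into nested R lists with the `meta`/`data`/`values` layout shown above; the helper `has_gemini_ids` and the toy specs are hypothetical.

```r
# Hypothetical helper: for each spec, TRUE when every row in data$values
# carries a gemini_id field.
has_gemini_ids <- function(specs) {
  vapply(specs, function(spec) {
    all(vapply(spec$data$values,
               function(row) "gemini_id" %in% names(row),
               logical(1)))
  }, logical(1))
}

# Toy specs: the first has ids, the second (like the step above) does not.
specs <- list(
  list(meta = list(description = "toy step with ids"),
       data = list(values = list(list(player = "a", n = 3, gemini_id = 1:3)))),
  list(meta = list(description = "Plot is_hit within each group"),
       data = list(values = list(list(player = "a", is_hit = 1, n = 2))))
)

has_gemini_ids(specs)  # TRUE FALSE
```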

@willdebras
Collaborator

After some debugging, I have updated the R end to ensure gemini IDs are in every summarize step. The points already look correctly sorted by gemini ID in the specs I send along, though. We can take a look on the call tomorrow, but this might need more digging into the frontend to figure out whether this step is being plotted in the correct order.

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json

@jhofman
Contributor

jhofman commented Feb 16, 2022

> I've merged a PR adding the baseball dataset, the fixes to the binary variables, and some added documentation.
>
> Working through the sorting methods mentioned.

this is great. let's change the group by on the second example to be group_by(year, player) so that it's easier to see the simpson's paradox bit pop up.

@willdebras pointed out that there's still some funny re-grouping going on. for instance, look at jeter in 1995---it goes from two columns to one column for some reason. it should stay fixed in one column.

@giorgi-ghviniashvili, can you take a look when you're back?

@willdebras
Collaborator

Here are the screenshots of the funky regrouping. We can see that despite the sort on gemini ids in both steps now, we still get a rearrange. Only happens for two groupbys.

image

image

@giorgi-ghviniashvili
Collaborator

@willdebras can you send the JSON specs for these funky groupings?

@willdebras
Collaborator

Sure thing, these are saved here, @giorgi-ghviniashvili:

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json

You can see these now have been updated to have gemini ID in all steps and have appropriate sorting. Still get the funky grouping though.

@giorgi-ghviniashvili
Collaborator

@jhofman @willdebras I checked it, and the funky groupings happen because the third (rows = 35) and fourth (rows = 50) frames have a different number of rows. We did this on purpose to avoid overlaps:

image

If we comment out this code, all specs will have the same number of rows and no funky grouping will happen, but in some cases, when there are too many points, they will overlap.

@jhofman
Contributor

jhofman commented Mar 8, 2022

sounds like we should apply this rule to all specs, find the one with the most number of rows, and use that throughout to avoid this. @giorgi-ghviniashvili, can you give it a try?
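
A sketch of that rule, written in R to match the rest of the thread even though the actual change belongs in the JavaScript frontend. Storing the row count under `meta$rows` is an assumption made for illustration, and `use_max_rows` is a hypothetical helper.

```r
# Hypothetical helper: give every spec the largest row count found in any
# spec, so the grid layout stays the same from frame to frame.
use_max_rows <- function(specs) {
  max_rows <- max(vapply(specs, function(spec) spec$meta$rows, numeric(1)))
  lapply(specs, function(spec) {
    spec$meta$rows <- max_rows
    spec
  })
}

# Toy specs using the row counts reported for the third and fourth frames.
specs <- list(
  list(meta = list(description = "frame 3", rows = 35)),
  list(meta = list(description = "frame 4", rows = 50))
)

fixed <- use_max_rows(specs)
vapply(fixed, function(spec) spec$meta$rows, numeric(1))  # 50 50
```

Using the maximum rather than the minimum keeps the overlap protection for the densest frame while removing the layout change between frames.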

@giorgi-ghviniashvili
Collaborator

@jhofman fixed the funky animation as per your suggestion.

funky-grouping-fixed.mov

giorgi-ghviniashvili added a commit that referenced this issue Mar 9, 2022
@jhofman
Contributor

jhofman commented Mar 10, 2022

looks good! small thing: from 11 to 13 seconds the stroke width looks like it's changing. @giorgi-ghviniashvili can you track down where and @willdebras can you update on the R side?

let's change our default so that the initial circles are empty (indicating zeros) w/ strokewidth and then once we split into groups and introduce the outcome (hit or no hit), we color in the "1"s (hit). @willdebras, hopefully this is just an R side change?

at some point this might conflict w/ explicit ggplot2 commands, we can cross that bridge when we get to it ...

@willdebras
Collaborator

@giorgi-ghviniashvili, do you know if there are unpushed (to main) changes related to the binary variables on your end? I'm wondering because in the recording above, at ~13s, we get the plotting of the dots fully filled, then it transitions to make these hollow at ~15s.

This staggered transition isn't in place in the main repo or in the example in the docs (auto generated from main):

https://microsoft.github.io/datamations/articles/Examples.html#binary-variables

It seems to go straight to the stroke with transparent fill.

From my understanding these are different animationFrames, but not different specs, right?

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json#L174

There is only one set of specs titled "Plot is_hit within each group." Do I need to add to these specs or is this staggered effect of the plotting of the 1 v. 0 something you have in js code that isn't in main yet?

giorgi-ghviniashvili added a commit that referenced this issue Mar 14, 2022
@giorgi-ghviniashvili
Collaborator

giorgi-ghviniashvili commented Mar 14, 2022

@willdebras the main branch is up to date with regard to the binary variables. The issue is present in this example as well: the circle size increases because of the stroke property in the "is_hit" spec, but it is not visually noticeable in this example because there are only a few circles.

In general, the issue is that the previous spec does not have a "stroke" property, and that's why it looks thin. In this example, it has stroke-width: 1 but no stroke, so the stroke-width is ignored.

image

Solution A:
add a stroke encoding to the previous specs if there is a color encoding on the same field.
(but this one did not really work well).

Solution B:
reduce the circle size in the "is_hit" spec.
I used size = 22:

image

But the previous spec should be changed to:
image

hit_non_hit.mov

See this commit

@jhofman
Contributor

jhofman commented Mar 29, 2022

@willdebras has included this in the updated shiny app, related to #129

@jhofman jhofman closed this as completed Mar 29, 2022