  ## Multivariate Data Visualization

  Data visualization is an incredibly rich tool to explore and understand data. Data visualization is often the first way to see if there are extreme data values, how much variation there is in the data, and where typical values lie in the distribution. In this section of the course, we plan to explore the following related to distributions:  
  
  1. Univariate distributions
      + Shape
      + Center
      + Spread
      + Extreme Values
  2. Multivariate distributions
      + Shape
      + Center
      + Spread
      + Extreme Values
      + Comparing distributions


   ## Multivariate Distributions

   We are going to use real data about higher education institutions from the College Scorecard (https://collegescorecard.ed.gov/) to explore the types of conclusions we can draw from the data. The College Scorecard releases data on higher education institutions to make them more transparent and to provide a place where parents, students, educators, and others can get information about specific institutions from a third party (i.e., the US Department of Education).

   ### Loading R packages

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)

theme_set(theme_bw())


   ### Read in Data

   The code below reads in the data for later use. The R function to read in the data is `read_csv()`. Function arguments are passed within the parentheses, and for the `read_csv()` function the first argument is the path to the data. The data for this example are posted on GitHub in a comma-separated file. This means the data are stored in a text format and each variable (i.e., each column in the data) is separated by a comma. This is a common format for storing data.

   The data are stored in an object named `college_score`. In R (and other statistical programming languages), it is common to store results in objects so they can be used later. In this instance, we would like to read in the data once and reuse it, for example to explore the data visually and see if we can extract some trends. Assignment to an object in R is done with the `<-` assignment operator. Finally, there is one additional argument, `guess_max`, which helps to ensure that the data are read in appropriately. More on this later.

In [0]:
college_score <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/College-scorecard-4143.csv", guess_max = 10000)
head(college_score)
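
   To make the pieces described above concrete, here is a minimal sketch that reads a tiny, made-up CSV supplied as literal text instead of the real Scorecard file. The object name `toy_scores` and every value in it are hypothetical. Per the readr documentation, `guess_max` is the maximum number of rows used to guess each column's type; the sketch only passes it along and does not try to demonstrate what happens when it is too small.

```r
library(readr)

# Hypothetical CSV text (made-up values, not College Scorecard data).
# Wrapping the string in I() tells read_csv() to treat it as literal data.
toy_csv <- I("instnm,adm_rate,preddeg
College A,0.65,Bachelor
College B,0.90,Associate
College C,,Certificate
")

# Read the text and store the result in an object with `<-`
toy_scores <- read_csv(toy_csv, guess_max = 10000)

# Print the first rows; the empty adm_rate field is read as NA
head(toy_scores)
```

   Storing the result in an object means later cells can reuse `toy_scores` without reading the text again, which is the same reason `college_score` is created once at the top of this notebook.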


   ## Multivariate Distributions

   Real-world data are never as simple as exploring the distribution of a single variable, particularly when trying to understand individual variation. In most cases things interact and move in tandem, and many phenomena help to explain the variable of interest. For example, when thinking about admission rates, what important factors might explain why higher education institutions differ in their admission rates? Take a few minutes to brainstorm some ideas.


In [0]:
gf_histogram(~ adm_rate, data = college_score, bins = 30, fill = ~ preddeg) %>%
  gf_labs(x = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          fill = "Primary Deg")


   Density plots are often easier to read than histograms when there is more than one group. To plot more than one density curve, we need to specify the `color` or `fill` arguments. The figures below show a few ways to specify these arguments and what the resulting figure looks like. I prefer either color and fill together or just color with a light gray fill for the density figures (the last 2 figures).

In [0]:
gf_density(~ adm_rate, data = college_score, color = ~ preddeg) %>%
  gf_labs(x = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          color = "Primary Deg")



In [0]:
gf_density(~ adm_rate, data = college_score, fill = ~ preddeg) %>%
  gf_labs(x = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          fill = "Primary Deg")



In [0]:
gf_density(~ adm_rate, data = college_score, fill = ~ preddeg, color = ~ preddeg) %>%
  gf_labs(x = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          color = "Primary Deg",
          fill = "Primary Deg")



In [0]:
gf_density(~ adm_rate, data = college_score, color = ~ preddeg, fill = 'gray85', size = 1) %>%
  gf_labs(x = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          color = "Primary Deg")
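
   Each density curve is scaled to have an area of 1, so the figures above do not show how many institutions sit behind each curve. A quick count by group can fill that gap. The sketch below is only an illustration: `toy_scores` is a hypothetical stand-in for `college_score` with made-up values, and the column names `n_total` and `n_with_rate` are my own.

```r
library(dplyr)

# Hypothetical stand-in for `college_score`; values are made up for illustration.
toy_scores <- tibble(
  preddeg  = c("Associate", "Associate", "Bachelor", "Bachelor", "Bachelor", "Certificate"),
  adm_rate = c(0.90, NA, 0.55, 0.70, 0.65, 1.00)
)

# Count the rows in each group and how many have a non-missing admission rate
group_sizes <- toy_scores %>%
  group_by(preddeg) %>%
  summarise(
    n_total     = n(),
    n_with_rate = sum(!is.na(adm_rate))
  )

group_sizes
```

   Replacing `toy_scores` with `college_score` should give the corresponding counts for the real data, assuming the columns are named `preddeg` and `adm_rate` as in the plotting code above.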



   ## Violin Plots

   Violin plots are another way to compare distributions across groups, and they make it easier to show many groups on a single graph. A violin plot is a density plot that is mirrored so that the shape is fully enclosed. This is best explored with an example.

In [0]:
gf_violin(adm_rate ~ preddeg, data = college_score) %>%
  gf_labs(y = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          x = "Primary Deg")


   Aesthetically, these figures are a bit more pleasing to look at if they include a light fill color. This is done with the `fill = ` argument, just as in the density plots shown above.

In [0]:
gf_violin(adm_rate ~ preddeg, data = college_score, fill = 'gray85') %>%
  gf_labs(y = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          x = "Primary Deg")
          


   Adding quantiles is useful to aid comparisons across the violins. These can be added with the `draw_quantiles` argument; the code below draws lines at the 10th, 50th, and 90th percentiles.

In [0]:
gf_violin(adm_rate ~ preddeg, data = college_score, fill = 'gray85', draw_quantiles = c(.1, .5, .9)) %>%
  gf_labs(y = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by degree type',
          x = "Primary Deg")
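
   To see the kind of numbers the quantile lines represent, the same three quantiles can be computed by group directly. This is a minimal sketch on a hypothetical data frame (`toy_scores`, made-up values) and not the calculation the plotting function performs: the ggplot2 documentation describes `draw_quantiles` as quantiles of the density estimate, so the lines in the figure may differ slightly from quantiles of the raw values.

```r
library(dplyr)

# Hypothetical stand-in for `college_score`; values are made up for illustration.
toy_scores <- tibble(
  preddeg  = rep(c("Associate", "Bachelor"), each = 5),
  adm_rate = c(0.50, 0.60, 0.70, 0.80, 0.90,
               0.30, 0.40, 0.50, 0.60, 0.70)
)

# 10th, 50th, and 90th percentiles of the raw values within each group
group_quantiles <- toy_scores %>%
  group_by(preddeg) %>%
  summarise(
    q10 = quantile(adm_rate, 0.1, na.rm = TRUE, names = FALSE),
    q50 = quantile(adm_rate, 0.5, na.rm = TRUE, names = FALSE),
    q90 = quantile(adm_rate, 0.9, na.rm = TRUE, names = FALSE)
  )

group_quantiles
```

   The `na.rm = TRUE` argument matters for the real data, since `quantile()` stops with an error when missing values are present and `na.rm` is left at its default.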


   ### Violin Plots with many groups

   Many groups are more easily shown in the violin plot framework.

   With many groups, it is often helpful to move the long group labels from the x-axis to the y-axis so that they read horizontally and do not run into each other. This can be done with the `gf_refine()` function and `coord_flip()`. Flipping also matches the orientation we have used for density plots and histograms, where the attribute of interest is depicted on the x-axis. For this course, I will always flip violin plots so that the attribute of interest, in this case admission rates, is on the x-axis.

In [0]:
gf_violin(adm_rate ~ region, data = college_score, fill = 'gray80', draw_quantiles = c(.1, .5, .9)) %>%
  gf_labs(y = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by US region',
          x = "US Region") %>%
  gf_refine(coord_flip())


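   ## Faceting

   Faceting is another way to explore distributions of two or more variables. It splits a figure into panels, one for each level of a grouping variable, so that a third variable can be shown without adding more colors to a single panel.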

In [0]:
gf_violin(adm_rate ~ region, data = college_score, fill = 'gray80', draw_quantiles = c(.1, .5, .9)) %>%
  gf_labs(y = 'Admission Rate (in %)',
          title = 'Multivariate distribution of higher education admission rates by US region and degree type',
          x = "US Region") %>%
  gf_refine(coord_flip()) %>%
  gf_facet_wrap(~ preddeg)
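
   Faceting is not specific to violin plots. As a small sketch under stated assumptions, the code below applies `gf_facet_wrap()` to a density plot so that each group gets its own panel instead of overlapping curves. The data frame `toy_scores` is hypothetical, with made-up values standing in for `college_score`.

```r
library(dplyr)
library(ggformula)

# Hypothetical stand-in for `college_score`; values are made up for illustration.
toy_scores <- tibble(
  preddeg  = rep(c("Associate", "Bachelor"), each = 5),
  adm_rate = c(0.50, 0.60, 0.70, 0.80, 0.90,
               0.30, 0.40, 0.50, 0.60, 0.70)
)

# One density panel per degree type instead of overlapping curves
facet_plot <- gf_density(~ adm_rate, data = toy_scores, fill = 'gray85') %>%
  gf_facet_wrap(~ preddeg)

facet_plot
```

   Separate panels avoid overlapping curves, at the cost of making direct comparisons between groups a little harder than when the curves share one panel.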




