    ## Applying Functions to Data - Part 2

    ## Setup

    We are going to use real data about higher education institutions from the College Scorecard (https://collegescorecard.ed.gov/) to explore the types of conclusions we can draw from the data. The College Scorecard releases data on higher education institutions to make them more transparent and to provide a place where parents, students, educators, and others can get information about specific institutions from a third party (i.e., the US Department of Education).

    ### Loading R packages

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)
library(mosaic)

theme_set(theme_bw())

college_score <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/College-scorecard-4143.csv", guess_max = 10000)
head(college_score)


    ## Measures of Variation

    So far we have focused primarily on applying functions to columns of data to provide a single numeric summary of where the center of the distribution may lie. The center of the distribution is important; however, the primary goal in research and in statistics is to understand the variation in the distribution.

    One crude measure of variation that is intuitive is the range of a variable. The range is the difference between the smallest and the largest number in the data. We can compute this with the `df_stats()` function.

In [0]:
college_score %>%
  df_stats(~ adm_rate, range)


    The details of the `df_stats()` function are in the previous course notes. This computation returns two values, the minimum and the maximum in the data, which are unsurprisingly 0 and 1, respectively. The range is most useful as a data check to ensure that the variable contains only values that are theoretically possible, which is true in this case. The range is known as a biased statistic in that the sample range will almost always be smaller than the population range. Therefore, we would like a better measure of variation.
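
To make the distinction between the two values returned by `range()` and the single-number range concrete, here is a minimal base R sketch. The vector `adm` is a small made-up set of admission rates (an assumption for illustration, not the College Scorecard data), and the subsample is only meant to hint at why the sample range tends to fall short of the population range.

```r
# Hypothetical admission rates for a handful of institutions (not real data)
adm <- c(0.12, 0.35, 0.48, 0.61, 0.67, 0.74, 0.80, 0.93)

# range() returns the minimum and the maximum, not their difference
adm_range <- range(adm)
adm_range

# The single-number range is the distance between those two values
range_width <- diff(adm_range)
range_width

# A subsample can never be wider than the data it was drawn from,
# so its range is at most the range of the full vector
set.seed(1)
sub <- sample(adm, size = 4)
diff(range(sub)) <= range_width
```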

   ### Robust measure of variation
   A robust measure of variation that is often used in tandem with the median is the interquartile range (IQR). This statistic can be calculated in two ways, using either the `IQR()` or the `quantile()` function. Both are presented below.

In [0]:
college_score %>%
  df_stats(~ adm_rate, IQR, quantile(c(0.25, 0.75)), nice_names = TRUE)


   The IQR is the difference between the 75th and 25th percentiles and in this example equals 0.285, or about 28.5 percentage points. Because the IQR is a difference between percentiles, we can say that the middle 50% of the distribution is found between about 55% and 84%, so the middle 50% is spread over about 28.5 percentage points. Since the IQR is just a difference between two percentiles, the same idea extends to other percentiles that may be more directly interpretable for a given situation. For example, suppose we wanted to know how spread out the middle 80% of the distribution is. We can do this directly by computing the 90th and 10th percentiles and finding the difference between the two.

In [0]:
mid_80 <- college_score %>%
  df_stats(~ adm_rate, quantile(c(0.1, 0.9)), nice_names = TRUE)
mid_80


   As you can see, once you extend the share of the distribution covered, the distance increases, now to 0.555. That is, the middle 80% of the admission rate distribution spans about 55.5 percentage points. We can also visualize what this looks like.

In [0]:
gf_histogram(~ adm_rate, data = college_score, bins = 30, color = 'black') %>%
  gf_vline(color = 'blue', xintercept = ~ value, data = gather(mid_80), size = 1)


   We can also view the exact percentages using the empirical cumulative distribution function.

In [0]:
gf_ecdf(~ adm_rate, data = college_score) %>%
  gf_vline(color = 'blue', xintercept = ~ value, data = gather(mid_80), size = 1)
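
The percentile-based spreads and the cumulative view can also be sketched in base R. The example below is illustrative only: `adm` is a made-up vector rather than the College Scorecard data, and `middle_spread()` is a hypothetical helper name introduced here, not a function from mosaic or ggformula.

```r
# Hypothetical admission rates (not real data), used only for illustration
adm <- c(0.12, 0.35, 0.48, 0.61, 0.67, 0.74, 0.80, 0.93, 0.97, 1.00)

# Hypothetical helper: width of the central `prop` share of a distribution
middle_spread <- function(x, prop = 0.5, na.rm = TRUE) {
  tail_prob <- (1 - prop) / 2
  q <- quantile(x, probs = c(tail_prob, 1 - tail_prob),
                na.rm = na.rm, names = FALSE)
  diff(q)
}

# With prop = 0.5 this is the same computation as IQR()
middle_spread(adm, prop = 0.5)
IQR(adm)

# Widening to the middle 80% cannot shrink the distance
middle_spread(adm, prop = 0.8)

# ecdf() returns a function giving the proportion of values at or below a point
adm_ecdf <- ecdf(adm)
adm_ecdf(0.74)  # 6 of the 10 values are <= 0.74
```

Writing the spread as a function of `prop` keeps the IQR and the middle-80% distance as two settings of one idea rather than two separate statistics.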


  ### Variation by Group
  These statistics can also be calculated by different grouping variables, similar to what was done with statistics of center. Now the variable of interest is on the left-hand side of the formula and the grouping variable is on the right-hand side.

In [0]:
iqr_groups <- college_score %>%
  df_stats(adm_rate ~ region, IQR, quantile(c(0.25, 0.75)), nice_names = TRUE)
iqr_groups


  This can also be visualized to see how these statistics vary across the groups.

In [0]:
gf_histogram(~ adm_rate, data = college_score, bins = 30, color = 'black') %>%
  gf_vline(color = 'blue', xintercept = ~ value, 
     data = filter(pivot_longer(iqr_groups, IQR_adm_rate:'X75.'), name %in% c('X25.', 'X75.')), size = 1) %>%
  gf_facet_wrap(~ region)
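
For comparison, the same outcome-by-group layout can be sketched in base R. The data frame `toy` and its two regions are made up for illustration (an assumption, not the College Scorecard data); `tapply()` and `aggregate()` are standard base R functions.

```r
# Hypothetical data frame (not real data): admission rates in two made-up regions
toy <- data.frame(
  adm_rate = c(0.20, 0.40, 0.60, 0.80, 0.70, 0.75, 0.80, 0.85),
  region   = rep(c("A", "B"), each = 4)
)

# tapply() splits adm_rate by region and applies IQR() to each piece
iqr_by_region <- tapply(toy$adm_rate, toy$region, IQR)
iqr_by_region

# aggregate() accepts the same outcome ~ group formula layout
aggregate(adm_rate ~ region, data = toy, FUN = IQR)
```

In this toy example region A is more spread out than region B, so its IQR is larger even though both groups have the same number of institutions.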


  ## Other measures of variation
   There are many other variation measures that are used in statistics. We will apply a functional approach to these and try to visualize what they are trying to represent. The statistics discussed here represent deviations from the mean, either the average absolute deviation or the average squared deviation.

In [0]:
college_score %>%
  df_stats(~ adm_rate, sd, var)
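
The variance and standard deviation are built from squared deviations from the mean. As a minimal sketch on a small made-up vector (an assumption for illustration, not the College Scorecard data), the computation can be written out by hand. Note that R's `var()` and `sd()` divide by n - 1 rather than n.

```r
# Hypothetical admission rates (not real data)
x <- c(0.12, 0.35, 0.48, 0.61, 0.67, 0.74, 0.80, 0.93)

# Deviation of each value from the mean
deviations <- x - mean(x)

# Variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (length(x) - 1)

# Standard deviation: square root of the variance, back on the original scale
manual_sd <- sqrt(manual_var)

# These agree with the built-in functions
c(manual = manual_var, builtin = var(x))
c(manual = manual_sd, builtin = sd(x))
```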


 In order to compute the mean absolute error (here, the average absolute deviation from the mean), we first need to define a new function.

In [0]:
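mae <- function(x, na.rm = TRUE, ...) {
  # Center of the data; extra arguments in ... are passed on to mean()
  avg <- mean(x, na.rm = na.rm, ...)
  # Absolute distance of each value from that center
  abs_avg <- abs(x - avg)

  # Average the distances, dropping missing values when na.rm = TRUE
  mean(abs_avg, na.rm = na.rm)
}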


 We can now use this new function just like any other function.

In [0]:
college_score %>%
  df_stats(~ adm_rate, sd, var, mae)
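
Outside of `df_stats()`, a function like this has to deal with missing values itself. The sketch below is a standalone variant under stated assumptions: `mean_abs_dev()` is a hypothetical name chosen so it does not clash with `mae()`, and `x` is a made-up vector with one missing value, not the College Scorecard data.

```r
# Hypothetical standalone variant of the average absolute deviation from the mean
mean_abs_dev <- function(x, na.rm = TRUE) {
  # Drop missing values up front so both means below see the same data
  if (na.rm) x <- x[!is.na(x)]
  mean(abs(x - mean(x)))
}

# Hypothetical admission rates with one missing value (not real data)
x <- c(0.12, 0.35, 0.48, 0.61, 0.67, 0.74, 0.80, NA)

# Missing values are removed by default
mean_abs_dev(x)

# Keeping them propagates NA, as with mean()
mean_abs_dev(x, na.rm = FALSE)

# The average absolute deviation is never larger than the standard deviation
mean_abs_dev(x) <= sd(x, na.rm = TRUE)
```

Base R's `mad()` is a different statistic: it is the median absolute deviation from the median, multiplied by a scaling constant by default, so it should not be expected to match the values above.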
