 ## In Class Activity - variation

 ## Setup

In [0]:
.libPaths('../RPackages')

install.packages("lubridate")

library(tidyverse)
library(ggformula)
library(mosaic)
library(lubridate)

theme_set(theme_bw())

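# Mean absolute deviation: the average distance of the values from their mean
mae <- function(x, na.rm = TRUE, ...) {
  avg <- mean(x, na.rm = na.rm, ...)
  abs_avg <- abs(x - avg)
  
  # Pass na.rm along so missing values do not turn the result into NA
  mean(abs_avg, na.rm = na.rm)
}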

Riders <- Riders %>%
  select(date, day, highT, lowT, precip, clouds, riders, weekday) %>%
  mutate(month = month(date, label = TRUE),
         precip_two = ifelse(precip > 0, 'Rained', 'No-Rain'),
         precip_two_num = ifelse(precip > 0, 1, 0))
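
 The `mae()` helper defined in the setup computes the average absolute distance of the values from their mean. As a minimal sketch of that calculation, the steps can be written out by hand in base R. The vector `x` is made up for illustration and is not taken from the `Riders` data.

```r
# Toy values, made up for illustration (not from the Riders data)
x <- c(2, 4, 6, 8, 10)

avg <- mean(x)            # center of the values
abs_dev <- abs(x - avg)   # distance of each value from the mean
result <- mean(abs_dev)   # average distance from the mean
result
```

 Unlike the variance, this statistic stays in the original units of the variable, which can make it easier to interpret.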


 ## Riders data

 The `Riders` data, which are already available in R, record the number of users on a Massachusetts rail trail on 90 days between April 5, 2005 and November 15, 2005. The variables in the data are described below.

 - **date**: date of data collection
 - **day**: Day of the week
 - **highT**: High temperature for the day
 - **lowT**: Low temperature for the day
 - **precip**: Precipitation amount in inches
 - **clouds**: A measure of cloud cover (in oktas; higher is more cloud cover, 0 = no cloud cover, 8 = completely overcast).
 - **riders**: Number of riders counted
 - **weekday**: N = weekend or holiday; Y = non-holiday weekday
 - **month**: Month of the year
 - **precip_two**: Dichotomous variable indicating whether there was any rain that day
 - **precip_two_num**: Dichotomous variable where 1 = rained during the day; 0 = no rain
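
 The last two variables are both derived from `precip` in the setup code. The sketch below repeats that recoding on a few toy precipitation values (made up for illustration, not taken from the `Riders` data) to show how the text and numeric versions line up.

```r
# Toy precipitation amounts in inches, made up for illustration
precip <- c(0, 0.25, 0, 1.1, 0)

precip_two <- ifelse(precip > 0, 'Rained', 'No-Rain')   # text labels
precip_two_num <- ifelse(precip > 0, 1, 0)              # 0/1 indicator

# Cross-tabulate the two versions to confirm they agree
table(precip_two, precip_two_num)
```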

In [0]:
head(Riders, n = 10)


 ### Explore `riders` variable descriptively.

 Using the `df_stats()` function, explore the variation of the `riders` variable descriptively (e.g., the IQR, SD, and variance). Fill in the statistics in place of ^^ below; separate multiple statistics with commas.

In [0]:
riders_descrip <- Riders %>%
  df_stats(~ riders, ^^)
riders_descrip 
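
 For reference, the base R functions behind common measures of variation can also be called directly on a vector. The sketch below uses a toy vector `x`, made up for illustration, rather than the `riders` variable.

```r
# Toy values, made up for illustration (not from the Riders data)
x <- c(3, 7, 7, 12, 21)

sd(x)            # standard deviation
var(x)           # variance
IQR(x)           # interquartile range: 75th minus 25th percentile
diff(range(x))   # range: maximum minus minimum

# The variance is the square of the standard deviation
all.equal(var(x), sd(x)^2)
```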


 Let's visualize the distribution.

In [0]:
gf_density(~ riders, data = Riders)


 ### Questions
 1. Are these statistics a very good representation of the variation of ridership? Why or why not?
 2. Are there other variables that might help explain the variation in the number of riders?

 ## Conditional descriptive statistics
 Let's explore whether the amount of variation differs by day of the week (i.e. the variable `day`). In the code below, add the statistics of variation in place of "^^" and add the variable `day` in place of "%%".

In [0]:
riders_days <- Riders %>%
  df_stats(riders ~ %%, ^^)
riders_days
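
 The formula `outcome ~ group` asks for a statistic to be computed separately within each group. As a rough base R analogue of that idea (not the `df_stats()` call itself), `aggregate()` accepts the same kind of formula. The `toy` data frame and its column names are made up for illustration.

```r
# Toy data, made up for illustration (not from the Riders data)
toy <- data.frame(
  group = c('A', 'A', 'A', 'B', 'B', 'B'),
  count = c(10, 12, 14, 5, 20, 35)
)

# Standard deviation of count, computed separately for each group
sd_by_group <- aggregate(count ~ group, data = toy, FUN = sd)
sd_by_group
```

 Here group B is far more spread out than group A, even though both groups have three values.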


 Visualize the distributions across days.

In [0]:
gf_density(~ riders, data = Riders, color = 'black') %>%
  gf_facet_wrap(~ day)


 ### Questions
 1. Do the conditional distributions seem a better representation than the single overall distribution? Why or why not?
 2. Is there evidence of differences in variation across the different days of the week?
 3. What about measures of center? Are there differences across the days of the week?
 4. Do you think that the `weekday` variable would be sufficient here, similar to the births data?

 ## Differences by Month
 Let's now explore if there are differences by month. Compute descriptive statistics by the month (calculate statistics for both center and variation). In the code below, add the statistics of variation and center in place of "^^" and add in the variable `month` in the "%%".

In [0]:
riders_months <- Riders %>%
  df_stats(riders ~ %%, ^^)
riders_months



In [0]:
gf_violin(riders ~ month, data = Riders, fill = 'gray80', draw_quantiles = c(.25, .5, .75)) %>%
  gf_labs(y = 'Number of Riders',
          title = 'Violin plots of the number of riders in each month',
          x = "") %>%
  gf_refine(coord_flip())
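
 The `draw_quantiles = c(.25, .5, .75)` argument asks for lines at the estimated quartiles within each violin. As a rough numeric companion to the figure, the sketch below computes quartiles by group in base R. The `toy` data frame and its columns are made up for illustration and are not part of the `Riders` data.

```r
# Toy data, made up for illustration (not from the Riders data)
toy <- data.frame(
  group = rep(c('A', 'B'), each = 5),
  count = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Split count by group, then compute the three quartiles within each group
quartiles <- sapply(split(toy$count, toy$group), quantile,
                    probs = c(.25, .5, .75))
quartiles
```

 The result has one column per group and one row per quartile, so medians and spreads can be compared across groups at a glance.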


 ### Questions
 1. Is there evidence of differences in variation across the different months of the year?
 2. What about measures of center? Are there differences across the months of the year?

In [0]:
gf_violin(riders ~ month, data = Riders, fill = 'gray80', draw_quantiles = c(.25, .5, .75), scale = 'width') %>%
  gf_labs(y = 'Number of Riders',
          title = 'Violin plots of the number of riders in each month',
          x = "") %>%
  gf_refine(coord_flip()) %>%
  gf_facet_wrap(~ weekday)


 ### Questions
 1. Using the figure above, are there differences in variation and center by month and whether the day was a weekday?
 2. Identify which groups had the smallest and largest amounts of variation in the number of riders from the figure.
 3. Based on the figure, what time of year seems to be the most popular?

 ## Descriptives for Dichotomous variable
 Dichotomous variables take on exactly two values. Examples include graduated vs. did not graduate, watched a TV show vs. did not watch it, and took a medicine vs. did not take it. These variables are sometimes represented as text, but they are often represented as numbers, most commonly with one group coded as 0 and the other as 1. Two examples in the `Riders` data are *precip_two* and *precip_two_num*, where *precip_two* is a categorical version of *precip_two_num*. A table may be helpful to see the values.

In [0]:
count(Riders, precip_two, precip_two_num)


 The table shows that there were 29 days with rain and 61 days without rain, and that a value of 1 for the *precip_two_num* variable represents a day on which it rained. Let's now compute some descriptive statistics on the numeric dichotomous variable (i.e. *precip_two_num*).

In [0]:
Riders %>%
  df_stats(~ precip_two_num, sum, mean, median, sd, IQR, length)


 In the above code, we computed the sum, mean, median, standard deviation, IQR, and length of the *precip_two_num* variable. Let's think about the interpretation of these a bit more.

 ### Questions
 1. Focusing first on the output from computing the sum, what does this represent here? Hint: it may be helpful to look at the output from the `count()` function above.
 2. Now, think about the interpretation of the measures of center, i.e. mean and median. Keep in mind that the variable here indicates whether or not it rained on a given day.
 3. Finally, think about the last three measures, sd, IQR, and length. What do these statistics represent here?