## Activity 3 - Explore Multivariate Statistics and Distributions

**Due** on *Monday, October 2nd* by 11:59 pm

You will be asked to complete a short survey on ICON about the output generated below. Additional questions to consider are also sprinkled throughout the notebook. These do not need to be explicitly answered, but they can serve as a guide for thinking about and interpreting the statistical output.

## Setup

This first code cell needs to be executed ("Run") every time this notebook is opened. For example, if you stop working on this activity and come back to it later, this first code cell will need to be executed again to load the data, even though output may still show up from the prior time you worked on the activity.

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(lubridate)

theme_set(theme_bw(base_size = 16))

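# Mean absolute deviation from the mean
mae <- function(x, na.rm = TRUE, ...) {
  avg <- mean(x, na.rm = na.rm, ...)
  abs_avg <- abs(x - avg)

  # Pass na.rm along so missing values do not turn the result into NA
  mean(abs_avg, na.rm = na.rm)
}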

Riders <- Riders |>
  select(date, day, highT, lowT, precip, clouds, riders, weekday) |>
  mutate(month = month(date, label = TRUE),
         precip_two = ifelse(precip > 0, 'Rained', 'No-Rain'),
         precip_two_num = ifelse(precip > 0, 1, 0))

## Riders data

The `Riders` data, which ship with the R packages loaded above, record the number of users on a Massachusetts rail trail on 90 days between April 5, 2005 and November 15, 2005. The attributes in the data are described below.

- **date**: Date of data collection
- **day**: Day of the week
- **highT**: High temperature for the day
- **lowT**: Low temperature for the day
- **precip**: Precipitation amount in inches
- **clouds**: A measure of cloud cover (in oktas; higher is more cloud cover, 0 = no cloud cover, 8 = completely overcast)
- **riders**: Number of riders counted
- **weekday**: N = weekend or holiday; Y = non-holiday weekday
- **month**: Month of the year
- **precip_two**: Dichotomous variable indicating whether the day had any rain or not
- **precip_two_num**: Dichotomous variable where 1 = rained during the day; 0 = no rain

In [None]:
head(Riders, n = 10)

**Note:** If the code chunk right above this does not run properly, make sure to rerun the very first code chunk that reads in the data. 

### Explore `riders` attribute descriptively.

Using the `df_stats()` function, explore the variation of the `riders` attribute descriptively (e.g., with the `IQR`, `sd`, `mean`, or `median`). Fill in statistics in place of `^^` below that help to describe the `riders` attribute appropriately. *More than one statistic can be listed, separated by commas.*
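As a minimal sketch of how these statistics behave, the snippet below computes several of them on a small made-up vector using base R. The vector `toy_counts` is hypothetical and is not taken from `Riders`, and the `mae()` helper is re-declared along the lines of the setup cell only so that the snippet runs on its own.

```r
# Hypothetical toy data, NOT from the Riders data
toy_counts <- c(10, 12, 15, 18, 45)

# Mean absolute deviation from the mean, re-declared so the snippet is self-contained
mae <- function(x, na.rm = TRUE) {
  avg <- mean(x, na.rm = na.rm)
  mean(abs(x - avg), na.rm = na.rm)
}

# Collect measures of center and of variation in one named vector
toy_summary <- c(
  mean   = mean(toy_counts),
  median = median(toy_counts),
  sd     = sd(toy_counts),
  IQR    = IQR(toy_counts),
  mae    = mae(toy_counts)
)
toy_summary
```

Each function takes a numeric vector and returns a single number, which is why several of them can be requested side by side for the same attribute.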

In [None]:
riders_descrip <- Riders |>
  df_stats(~ riders, ^^)
riders_descrip 

Let's visualize the distribution.

In [None]:
gf_density(~ riders, data = Riders)

### Questions to think about 

1. Would the median or mean represent a better measure of center? Why did you pick this statistic?
2. Are these statistics a very good representation of the variation of ridership? Why or why not?
3. Would there be other variables that may help to explain the variation in number of riders?
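One way to approach the first question is to compare how the mean and the median react to a single extreme value. The sketch below uses two made-up vectors (`symmetric` and `skewed`, both hypothetical and not from `Riders`) that differ only in their largest value.

```r
# Hypothetical toy data, NOT from the Riders data
symmetric <- c(20, 25, 30, 35, 40)
skewed    <- c(20, 25, 30, 35, 200)  # same values except one extreme day

# Compare both measures of center for each vector
centers <- rbind(
  symmetric = c(mean = mean(symmetric), median = median(symmetric)),
  skewed    = c(mean = mean(skewed),    median = median(skewed))
)
centers
```

Comparing the two rows, together with the shape of the density plot above, may help when deciding which measure of center suits `riders`.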

## Conditional descriptive statistics

Let's explore whether the amount of variation differs by day of the week (i.e., the attribute `day`). In the code below, add the statistics of variation in place of `^^` (see the code chunk above or the course notes on descriptive statistics for examples of descriptive statistics that may be worth adding).

In [None]:
riders_days <- Riders |>
  df_stats(riders ~ day, ^^)
riders_days

Visualize the distributions across days.

In [None]:
gf_density(~ riders, data = Riders, color = 'black') |>
  gf_facet_wrap(~ day)

### Questions to consider

1. Do the conditional distributions seem to be a better representation than the single overall distribution? Why or why not?
2. Is there evidence of differences in variation across the different days of the week? Support your statement with evidence from the statistics or figures above.
3. What about measures of center? Are there differences across the days of the week? Support your statement with evidence from the statistics or figures above.
4. Do you think that the `weekday` variable would be sufficient here? Why or why not?
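When weighing questions 2 and 3, it can help to remember that groups may share a center and still differ in spread. The sketch below illustrates this with a made-up data frame (`toy`, hypothetical and not from `Riders`), using base R's `tapply()` as a stand-in for a grouped summary; it is not meant to show how `df_stats()` works internally.

```r
# Hypothetical toy data, NOT from the Riders data
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  value = c(10, 12, 14, 5, 12, 19)
)

# One statistic per group: split `value` by `group`, then apply the function
group_mean <- tapply(toy$value, toy$group, mean)
group_sd   <- tapply(toy$value, toy$group, sd)

group_mean
group_sd
```

Here both groups have the same mean, while the standard deviations differ, so looking at a measure of center alone would hide the difference between them.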

## Descriptive Statistics for a Dichotomous variable

Dichotomous data are attributes that take on only two values. Examples include graduated vs. did not graduate, watched a TV show vs. did not watch it, and took a medicine vs. did not take it. These attributes can be represented as text but are often represented as numbers, most commonly with one group coded as 0 and the other as 1. Two examples in the `Riders` data are *precip_two* and *precip_two_num*, where *precip_two* is a categorical version of *precip_two_num*. A table may be helpful to see the values.
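To make the link between the text and numeric versions concrete, the sketch below repeats the `ifelse()` recoding pattern from the setup cell on a made-up vector of precipitation amounts (`toy_precip`, hypothetical and not from `Riders`) and cross-tabulates the two results with base R's `table()`.

```r
# Hypothetical precipitation amounts in inches, NOT from the Riders data
toy_precip <- c(0, 0.2, 0, 1.1, 0, 0)

# Same recoding pattern as the setup cell: any amount above 0 counts as rain
rain_label <- ifelse(toy_precip > 0, "Rained", "No-Rain")
rain_num   <- ifelse(toy_precip > 0, 1, 0)

# Cross-tabulate the two versions to confirm that they line up
recode_check <- table(rain_label, rain_num)
recode_check
```

Every day lands in exactly one of two cells, "No-Rain" with 0 or "Rained" with 1, which is the same pattern to look for in the table for the real data.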

In [None]:
count(Riders, precip_two, precip_two_num)

The table shows that there are 29 days that had rain and 61 days without rain, and that a value of 1 for the *precip_two_num* attribute represents a day on which it rained. Let's now compute some descriptive statistics on the numeric dichotomous attribute (i.e., *precip_two_num*).

In [None]:
Riders |>
  df_stats(~ precip_two_num, sum, mean, median, sd, IQR, length)

In the above code, the sum, mean, median, standard deviation, IQR, and length of the *precip_two_num* attribute were computed. Let's try to think about the interpretation of these a bit more.

### Questions to consider

1. Focusing first on the output from computing the sum, what does this represent here? Hint: it may be helpful to look at the output from the `count()` function above.
2. Now, think about the interpretation of the measures of center, i.e. mean and median. Keep in mind the variable here is depicting whether the day rained or not. What may be the best interpretation of these statistics?
3. Finally, think about the last three measures, sd, IQR, and length. What do these statistics represent here? Are they useful to consider here?