 ## In-Class Activity - Classification

 ## Setup

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)
library(mosaic)
library(titanic)
library(rpart)
library(rsample)
library(rpart.plot)

theme_set(theme_bw(base_size = 14))

gss_cat_mod <- gss_cat %>%
  drop_na() %>% 
  mutate(tv_cat = ifelse(tvhours <= 2, 'Less than 2', 'More than 2'),
         tv_cat = fct_relevel(tv_cat, 'Less than 2', 'More than 2'),
         partyid2 = fct_collapse(partyid,
                                 other = c("No answer", "Don't know", "Other party"),
                                 republican = c("Strong republican", "Not str republican"),
                                 independent = c("Ind,near rep", "Independent", "Ind,near dem"),
                                 democrat = c("Not str democrat", "Strong democrat")),
         relig2 = fct_lump(relig, n = 6),
         rincome2 = fct_collapse(rincome,
                                 other = c('No answer', "Don't know", "Refused", "Not applicable")),
         rincome2 = fct_lump(rincome2, n = 2)
           ) %>%
  filter(partyid2 != 'other', marital != 'No answer') %>%
  select(-partyid, -relig, -denom, -rincome)


 ## gss_cat data

 The `gss_cat` data ship with the forcats package (loaded as part of the tidyverse) and come originally from the [General Social Survey](https://gssdataexplorer.norc.org/). Some processing has been done on these data, primarily to simplify the categories of some variables. The attributes are described below.

 - **year**: year of data collection, data collected every other year
 - **marital**: Marital status; Never Married, Separated, Divorced, Widowed, Married
 - **age**: Age of the individual
 - **race**: Race; Black, White, Other
 - **tvhours**: Number of hours spent watching tv daily
 - **tv_cat**: The tvhours variable dichotomized into 2 hours or less daily (labeled 'Less than 2') vs more than 2 hours (labeled 'More than 2')
 - **partyid2**: Political Affiliation; Republican, Independent, Democrat
 - **relig2**: Religion; Christian, Jewish, Catholic, Protestant, Other, None
 - **rincome2**: Dichotomous income variable; $25000 or more vs less than $25000

In [0]:
head(gss_cat_mod, n = 10)


In [0]:
# split into training/test data
set.seed(123)
gss_cat_split <- initial_split(gss_cat_mod, prop = .8)
gss_cat_train <- training(gss_cat_split)
gss_cat_test <- testing(gss_cat_split)

# explore variable initially
gf_bar(~ tv_cat, data = gss_cat_train)

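 To make the split above more concrete, here is a minimal base R sketch of a random 80/20 split. The `toy` data frame and its columns are hypothetical stand-ins for `gss_cat_mod`, and this illustrates the idea only; it is not how `initial_split()` is implemented internally.

```r
# Toy data frame standing in for gss_cat_mod (hypothetical values).
set.seed(123)
toy <- data.frame(id = 1:10, x = rnorm(10))

# Draw 80% of the row indices at random, without replacement.
train_rows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_rows, ]   # rows used to fit the model
toy_test  <- toy[-train_rows, ]  # held-out rows used to check predictions

nrow(toy_train)
nrow(toy_test)
```

 Every row lands in exactly one of the two pieces, so the test rows are never seen while the model is being fit.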

 ### Classification Model

 Using the `rpart()` function, fit a classification model to predict whether an individual watches more than 2 hours of tv a day. Hypothesize which predictors would be most relevant in predicting the dichotomous tv variable. In the `rpart()` call below, replace "$$" with the attribute to be predicted and replace "^^" with the predictor attributes, separated by `+`.

In [0]:
tv_class <- rpart($$ ~ ^^, 
                  method = 'class', data = gss_cat_train)

rpart.plot(tv_class, roundint = FALSE, type = 3, branch = .3)

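 If the formula syntax is unfamiliar, the following sketch shows the same pattern on a different data set. It uses `kyphosis`, a small example data set that ships with rpart, so the variable names here (`Kyphosis`, `Age`, `Number`, `Start`) have nothing to do with the GSS data and are for illustration only.

```r
library(rpart)

# kyphosis is a small example data set that ships with rpart.
# Outcome on the left of ~, predictors on the right separated by +.
demo_tree <- rpart(Kyphosis ~ Age + Number + Start,
                   method = 'class', data = kyphosis)

# Predicted class for each row, then the fitted tree printed as text.
demo_pred <- predict(demo_tree, newdata = kyphosis, type = 'class')
table(demo_pred)
print(demo_tree)
```

 The same structure applies to the cell above: one outcome attribute, then one or more predictors.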

 **Questions:**
 1. Which attribute is most important in the classification model?
 2. Exploring the probabilities in the classification tree figure (i.e. middle number in the leaves at the bottom of the classification tree), how confident is the model in predicting the class membership?
 3. Based on the probabilities and your thoughts on #2, make a prediction for the classification accuracy.
 4. Can you think of attributes not in the current data that may help in the prediction?

 Let's now compute the classification accuracy.

In [0]:
# Test accuracy
gss_cat_test <- gss_cat_test %>%
  mutate(model_predicted = predict(tv_class, newdata = gss_cat_test, type = 'class'))

gss_cat_test %>%
  mutate(same_class = ifelse(tv_cat == model_predicted, 1, 0)) %>%
  df_stats(~ same_class, mean, sum, length)


 **Questions:**
 1. How strong is this classification? (i.e. interpret the prediction accuracy above).
 2. Would you be satisfied in using this model to predict whether an individual watches more than 2 hours of tv in a day?
 3. In the output above, what do the columns named "sum_same_class" and "length_same_class" represent?

 ### Visually show accuracy
 Let's visually explore and interpret the accuracy.

In [0]:

gf_bar(~ tv_cat, fill = ~ model_predicted, data = gss_cat_test, position = 'fill') %>%
  gf_labs(y = "Proportion", x = "Observed TV Category", fill = "Model Predicted Class") %>%
  gf_refine(scale_y_continuous(breaks = seq(0, 1, .1)))

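 The filled bar chart shows, within each observed category, the share of cases the model assigned to each class. A table can carry the same information. The sketch below uses made-up observed and predicted vectors (not the GSS results) to show how the counts, the row proportions, and the overall accuracy relate.

```r
# Hypothetical observed and predicted classes for eight people.
observed  <- factor(c('Less than 2', 'Less than 2', 'Less than 2', 'Less than 2',
                      'More than 2', 'More than 2', 'More than 2', 'More than 2'))
predicted <- factor(c('Less than 2', 'Less than 2', 'Less than 2', 'More than 2',
                      'Less than 2', 'Less than 2', 'More than 2', 'More than 2'))

# Counts: rows are observed classes, columns are predicted classes.
conf_mat <- table(observed, predicted)
conf_mat

# Row proportions correspond to the bar segments in a position = 'fill' chart.
row_props <- prop.table(conf_mat, margin = 1)
row_props

# Overall accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

 With the real data, `table(gss_cat_test$tv_cat, gss_cat_test$model_predicted)` would give the corresponding counts, assuming the prediction cell above has been run.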

 ## Impact of training/test split
 In this section, we will run an experiment to explore how the proportion of data allocated to training vs test may influence the prediction accuracy. In the following cell, replace "^^" with the proportions you wish to explore. Each proportion represents the share of the data that goes into the training set. Separate the values with commas; for example, the code may look like: `proportions <- c(0.1, 0.3, 0.5, 0.7, 0.9)`

In [0]:
proportions <- c(^^)


 The cell below will take a little time to run, depending on how many values you picked for `proportions` above. For each proportion, the code creates 500 different random splits of the data, fits the classification model to each training set, and returns the classification accuracy on the corresponding test set. Finally, a violin plot summarizes the results.

 *Run the following cell, then hypothesize the results while it is running*

 1. **Which proportion of training/test split do you think will result in the best prediction accuracy?**
 2. **Do you think one of the training/test splits will result in more precise prediction accuracy estimates?**

In [0]:
# For one training proportion: split the data, fit the tree on the training
# set, and return the test-set accuracy (mean) and number correct (sum).
calc_predict_acc_split <- function(proportion) {
  gss_cat_split <- initial_split(gss_cat_mod, prop = proportion)
  gss_cat_train <- training(gss_cat_split)
  gss_cat_test <- testing(gss_cat_split)
  
  tv_class <- rpart(tv_cat ~ year + marital + age + race + rincome2 + partyid2 + relig2, 
                    method = 'class', data = gss_cat_train)
  
  # Predict the class for each held-out row
  gss_cat_test <- gss_cat_test %>%
    mutate(model_predicted = predict(tv_class, newdata = gss_cat_test, type = 'class'))
  # 1 if the prediction matches the observed class, 0 otherwise
  gss_cat_test %>%
    mutate(same_class = ifelse(tv_cat == model_predicted, 1, 0)) %>%
    df_stats(~ same_class, mean, sum)
  
}

# One list element per proportion, each holding the results of 500 splits
predict_accuracy <- vector("list", length(proportions))

for(i in seq_along(proportions)) {
  predict_accuracy[[i]] <- map(1:500, function(x) calc_predict_acc_split(proportion = proportions[i])) %>%
    bind_rows() %>%
    mutate(condition = proportions[i])
}
# Stack the results for all proportions into one data frame
predict_accuracy <- bind_rows(predict_accuracy)

gf_violin(mean_same_class ~ factor(condition), data = predict_accuracy, fill = 'gray80',
          draw_quantiles = c(0.1, 0.5, 0.9)) %>% 
  gf_refine(coord_flip()) %>%
  gf_labs(x = "Proportion in Training Data",
          y = "Prediction Accuracy") 

 **Questions:**
 1. What seems to be the overall impact of the proportion of data placed in the training set? More specifically, what key differences do you see in the violin plot?
 2. Based on this small experiment, does there appear to be an optimal strategy for splitting the data into training/test data for the classification model?
 3. Is there additional information you would want to explore to better understand how good the classification accuracy is?
 4. Are you surprised by the results shown in the violin plot above? Why or why not?
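
 One way to reason about the spread of the violins is the binomial standard error of an accuracy estimate, which depends on how many cases are in the test set. The sketch below uses assumed values for the total sample size and the true accuracy (neither is taken from the data), and it ignores the extra variability that comes from refitting the model on different training sets, so treat it as a rough guide only.

```r
# Assumed values for illustration only (not taken from the data).
n_total    <- 10000   # hypothetical number of rows
accuracy   <- 0.65    # hypothetical true accuracy
train_prop <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Rows left for the test set under each split.
n_test <- n_total * (1 - train_prop)

# Binomial standard error of an accuracy estimated from n_test cases.
se_accuracy <- sqrt(accuracy * (1 - accuracy) / n_test)

data.frame(train_prop, n_test, se_accuracy)
```

 Under these assumptions the standard error grows as the test set shrinks, which is one thing to compare against the widths in the violin plot.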