  ## Classification - Training/Test Data


  ## Setup

  In this example, we will explore data from the Titanic that comes from Kaggle (https://www.kaggle.com/c/titanic/data). You can view the attributes in the data at that link. The following code loads the packages that we will use in this section of the course: the titanic package has the data we will use, and the rpart package includes functions to fit the tree-based models we will employ.

  ### Loading R packages

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)
library(mosaic)
library(titanic)
library(rpart)
library(rsample)
library(rpart.plot)

theme_set(theme_bw())

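# titanic_test has no Survived column, so its rows are NA after bind_rows()
# and are removed by drop_na(), leaving only passengers with known outcomes.
titanic <- bind_rows(titanic_train, titanic_test) %>% 
  mutate(survived = ifelse(Survived == 1, 'Survived', 'Died')) %>% 
  drop_na(survived)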

head(titanic)


 ## Training/Test Data
 So far we have used the entire data to make our classification. This is not best practice, and we will explore this in a bit more detail. First, take a minute to hypothesize why using the entire data to make our classification prediction may not be the best approach.

 It is common to split the data, prior to fitting a classification/prediction model, into a training data set and a test data set. The model is fit to the training data, where it makes a series of predictions, learns which data attributes are the most important, etc. Then, upon successfully identifying a useful model with the training data, the model predictions are tested on data that the model has not seen before. This is particularly important because the algorithms that make the predictions are very good at understanding and exploiting small differences in the data used to fit the model. Therefore, exploring the extent to which the model does a good job on data it has not seen is a better test of the utility of the model.

 We will explore the impact of not using the training/test data split in more detail later, but first, let's refit the classification tree to the titanic data by splitting the data into 70% training and 30% test data. Why 70% training and 30% test? This is a split that is sometimes used; an 80/20 split is also common. The main idea behind making the test data smaller is that the model has more data to train on initially to understand the attributes from the data. Secondly, the test data does not need to be quite as large, but we would like it to be representative. Here, the data are not too large, about 900 passengers with available survival data; therefore, withholding more data helps to ensure the test data is representative of all of these passengers.

 ### Splitting the data into training/test
 This is done with the `rsample` package utilizing three functions, `initial_split()`, `training()`, and `testing()`. The `initial_split()` function takes the initial random sample, and the proportion of data to use for the training data is specified with the `prop` argument. The random sample is done without replacement, meaning that the rows are randomly selected but cannot show up in the sample more than once. Then, after using the `initial_split()` function, the `training()` and `testing()` functions are used on the resulting output from `initial_split()` to obtain the training and test data respectively. It is good practice to use the `set.seed()` function to save the seed that was used, as this is a random process. Without using the `set.seed()` function, the same split of data would likely not be recreated if the code were run again.

 Let's do the data splitting.

In [0]:
set.seed(2019)
titanic_split <- initial_split(titanic, prop = .7)
titanic_train <- training(titanic_split)
titanic_test <- testing(titanic_split)
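
 To make the idea of a split "without replacement" concrete, here is a minimal base R sketch on a small toy data frame. The `toy` data and the use of `sample()` are assumptions for illustration only; this mimics the idea behind the split and is not how `initial_split()` is implemented internally.

```r
# Toy data standing in for the passenger data (hypothetical values).
toy <- data.frame(id = 1:10, x = letters[1:10])

set.seed(2019)
# Draw 70% of the row indices without replacement for training.
n_train <- round(0.7 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train, replace = FALSE)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# No row appears in both sets, and together they cover every row.
intersect(toy_train$id, toy_test$id)
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Resetting the seed reproduces exactly the same split.
set.seed(2019)
same_split <- identical(train_idx,
                        sample(nrow(toy), size = n_train, replace = FALSE))
same_split
```

 The last line shows why `set.seed()` matters: with the same seed, the same rows land in the training data each time the code is run.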


 We can now fit the classification tree as before, but instead of passing the entire titanic data, we will use only the training data.

In [0]:
class_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch, 
   method = 'class', data = titanic_train)

rpart.plot(class_tree, roundint = FALSE, type = 3, branch = .3)


 Let's prune the tree using a CP rule of .02.

In [0]:
prune_class_tree <- prune(class_tree, cp = .02)

rpart.plot(prune_class_tree, roundint = FALSE, type = 3, branch = .3)
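
 One way to see where a threshold such as .02 falls relative to the splits in a tree is to inspect the complexity table with `printcp()`. The sketch below is only an illustration: it assumes the `kyphosis` data that ships with `rpart` as a stand-in for the titanic training data, so the numbers it prints do not describe the tree above.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the titanic training data here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = 'class')

# One row per candidate tree size, with its complexity parameter (CP)
# and cross-validated error (xerror).
printcp(fit)

# Pruning trims the tree back to the chosen complexity parameter.
pruned <- prune(fit, cp = 0.02)

# The pruned tree never has more nodes than the original tree.
c(original = nrow(fit$frame), pruned = nrow(pruned$frame))
```

 The same call, `printcp(class_tree)`, could be used on the titanic tree to check how many splits survive a given CP value.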


 This seems like a reasonable model. Let's check the model accuracy.

In [0]:
titanic_predict <- titanic_train %>%
  mutate(tree_predict = predict(prune_class_tree, type = 'class'))
titanic_predict %>%
  mutate(same_class = ifelse(survived == tree_predict, 1, 0)) %>%
  df_stats(~ same_class, mean, sum)


 This is actually slightly better accuracy compared to the model last time, about 84.5% compared to about 82.7% prediction accuracy. But, let's test the model out on the test data to see the prediction accuracy for the test data, the real test.

In [0]:
titanic_predict_test <- titanic_test %>%
  mutate(tree_predict = predict(prune_class_tree, newdata = titanic_test, type = 'class'))
titanic_predict_test %>%
  mutate(same_class = ifelse(survived == tree_predict, 1, 0)) %>%
  df_stats(~ same_class, mean, sum)


 For the test data, prediction accuracy was quite a bit lower, about 78.6%.
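
 The accuracy reported here is simply the proportion of passengers whose predicted class matches the actual class. As a small illustration, the base R sketch below computes that proportion and a cross-tabulation of actual versus predicted classes. The two vectors are hypothetical values made up for the example, not results from the titanic model.

```r
# Hypothetical actual and predicted classes for eight passengers.
actual    <- c('Died', 'Died', 'Survived', 'Survived',
               'Died', 'Survived', 'Died', 'Died')
predicted <- c('Died', 'Survived', 'Survived', 'Died',
               'Died', 'Survived', 'Died', 'Died')

# Overall accuracy: proportion of predictions that match the actual class.
accuracy <- mean(actual == predicted)
accuracy

# Cross-tabulation showing which class the errors fall in.
confusion <- table(actual = actual, predicted = predicted)
confusion

# The diagonal of the table counts the correct predictions.
sum(diag(confusion)) / sum(confusion)
```

 The cross-tabulation adds information that the single accuracy number hides, namely whether the model errs more often for passengers who died or for those who survived.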

 ### Introduction to resampling/bootstrap
 To explore these ideas in more detail, it will be helpful to use a statistical technique called resampling or the bootstrap. We will use these ideas a lot going forward in this course. In very simple terminology, resampling or the bootstrap can help us understand uncertainty in our estimates and also allow us to be more flexible in the statistics that we run. The main drawback of resampling and bootstrap methods is that they can be computationally heavy, therefore depending on the situation, more time is needed to come to the conclusion desired.

 Resampling and bootstrap methods use the sample data we have and perform the sampling procedure again treating the sample we have data for as the population. Generating the new samples is done with replacement (more on this later). This resampling is done many times (100, 500, 1000, etc.) with more in general being better. As an example with the titanic data, let's take the titanic data, assume this is the population of interest, and resample from this population 1000 times (with replacement) and each time we will calculate the proportion that survived the disaster in each sample. Before we write the code for this, a few questions to consider.

 1. Would you expect the proportion that survived to be the same in each new sample? Why or why not?
 2. Sampling with replacement keeps coming up, what do you think this means?
 3. Hypothesize why sampling with replacement would be a good idea?

 Let's now try the resampling with the calculation of the proportion that survived. We will then save these 1000 survival proportions and create a visualization.

In [0]:
resample_titanic <- function(...) {
    titanic %>%
        sample_n(nrow(titanic), replace = TRUE) %>%
        df_stats(~ Survived, mean)
}

survival_prop <- map(1:1000, resample_titanic) %>% 
  bind_rows()

gf_density(~ mean_Survived, data = survival_prop)


 1. How would we interpret this figure?
 2. What are some key features of this figure?
 3. Why is there variation?
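
 Returning to what sampling with replacement means in practice, the following base R sketch draws one bootstrap sample of row indices and checks how many distinct rows it contains. The sample size `n` is an arbitrary value chosen for illustration.

```r
set.seed(2019)
n <- 1000

# One bootstrap sample of row indices: same size as the data,
# drawn with replacement.
boot_idx <- sample(n, size = n, replace = TRUE)

# Some rows are drawn more than once, others are not drawn at all.
any(duplicated(boot_idx))
prop_unique <- length(unique(boot_idx)) / n
prop_unique

# Expected proportion of distinct rows in a bootstrap sample of size n.
1 - (1 - 1 / n)^n
```

 Because each resample repeats some rows and omits others, each one differs a little from the original data, which is one source of the variation seen in the figure.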

 ## Bootstrap variation in prediction accuracy
 We can apply these same methods to evaluate the prediction accuracy based on the classification model above. When using the bootstrap, we can get an estimate of how much variation there is in the classification accuracy based on the sample that we have. In addition, we can explore how the prediction accuracy differs across many samples when using all the data compared with splitting the data into training and test sets.

 ### Bootstrap full data
 Let's first explore the full data to see how much variation there is in the prediction accuracy using all of the data. Here we will again use the `sample_n()` function to sample with replacement, then fit the classification model to each of these samples, then calculate the prediction accuracy. First, I'm going to write a function to do all of these steps one time.

In [0]:
# The argument is unused; it only absorbs the index that map() passes in.
calc_predict_acc <- function(...) {
  rsamp_titanic <- titanic %>%
    sample_n(nrow(titanic), replace = TRUE)

  class_model <- rpart(survived ~ Pclass + Sex + Age + Fare + SibSp + Parch, 
        method = 'class', data = rsamp_titanic, cp = .02)

  titanic_predict <- rsamp_titanic %>%
    mutate(tree_predict = predict(class_model, type = 'class'))
  titanic_predict %>%
    mutate(same_class = ifelse(survived == tree_predict, 1, 0)) %>%
    df_stats(~ same_class, mean, sum)

}


 We can run this function once, and it should generate the prediction accuracy and the number of correctly classified passengers for our resampled data.

In [0]:
calc_predict_acc()


 To do the bootstrap, this process can be replicated many times. In this case, I'm going to do 500. In practice, we would likely want to do a few more.

In [0]:
predict_accuracy_fulldata <- map(1:500, calc_predict_acc) %>%
  bind_rows()


 This can be plotted to inspect.

In [0]:
gf_density(~ mean_same_class, data = predict_accuracy_fulldata)


 Let's do the same, but now split the data into training/test data. The model will be fitted to the training data and the model predictions will be evaluated with the test data. The function created above will be modified to do the training/test split; note that each replication draws a new random 70/30 split (without replacement) rather than a bootstrap sample.

In [0]:
# The argument is unused; it only absorbs the index that map() passes in.
calc_predict_acc_split <- function(...) {
  titanic_split <- initial_split(titanic, prop = .7)
  titanic_train <- training(titanic_split)
  titanic_test <- testing(titanic_split)

  class_model <- rpart(survived ~ Pclass + Sex + Age + Fare + SibSp + Parch, 
        method = 'class', data = titanic_train, cp = .02)

  titanic_predict <- titanic_test %>%
    mutate(tree_predict = predict(class_model, newdata = titanic_test, type = 'class'))
  titanic_predict %>%
    mutate(same_class = ifelse(survived == tree_predict, 1, 0)) %>%
    df_stats(~ same_class, mean, sum)

}

calc_predict_acc_split()


 This seems to be working. Let's now do this 500 times as well.

In [0]:
predict_accuracy_traintest <- map(1:500, calc_predict_acc_split) %>%
  bind_rows()

gf_density(~ mean_same_class, data = predict_accuracy_traintest)


 We can combine these two objects and see how these look on a single figure.

In [0]:
bind_rows(
  mutate(predict_accuracy_fulldata, type = "Full Data"),
  mutate(predict_accuracy_traintest, type = "Train/Test")
) %>%
  gf_density(~ mean_same_class, color = ~ type, fill = NA, size = 1.25)
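
 Alongside the figure, it can help to put numbers on each distribution. The sketch below is one possible way to do that, using a hypothetical helper, `summarise_resamples()`, that returns the mean and a 95% percentile interval for a vector of accuracies. The accuracies used to demonstrate it are simulated and are not results from the titanic models.

```r
# Summarise a vector of resampled accuracies:
# the mean plus a 95% percentile interval.
summarise_resamples <- function(acc) {
  c(mean  = mean(acc),
    lower = unname(quantile(acc, 0.025)),
    upper = unname(quantile(acc, 0.975)))
}

# Hypothetical accuracies, simulated only to show the function running.
set.seed(2019)
fake_acc <- rbinom(500, size = 250, prob = 0.8) / 250
summarise_resamples(fake_acc)
```

 With the objects created above, the same helper could be applied to `predict_accuracy_fulldata$mean_same_class` and `predict_accuracy_traintest$mean_same_class` to compare the center and spread of the two approaches.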