 ## Classification - Decision Trees


 ## Setup

 In this example, we will explore data on Titanic passengers from Kaggle (https://www.kaggle.com/c/titanic/data). The attributes in the data are described at that link. The following code installs a few new packages that we will use in this section of the course: the titanic package contains the data, and the rpart package includes the functions for fitting the tree-based models we will employ.

 ### Loading R packages

In [0]:
.libPaths('../RPackages')

install.packages(c("titanic", "rpart", "caret", "rsample", "rpart.plot"))

library(tidyverse)
library(ggformula)
library(mosaic)
library(titanic)
library(rpart)
library(caret)
library(rpart.plot)

theme_set(theme_bw())

titanic <- bind_rows(titanic_train, titanic_test) %>% 
  mutate(survived = ifelse(Survived == 1, 'Survived', 'Died')) %>% 
  select(-Survived) %>%
  drop_na(survived)

head(titanic)


 ## Introduction to Decision Trees
 A decision tree is a method for predicting an outcome from a set of decision rules, and it is named after the tree-like structure it creates. The method makes "decisions" based on the data to optimize some criterion that we choose. Common criteria include minimizing error or maximizing the number of correct predictions. An example may help here. The New York Times published an interactive online quiz that asks a series of questions to explore whether an individual is more likely to be a Democrat or a Republican. The questions are relatively simple, and the quiz uses a decision tree to predict the likelihood that someone is a Democrat or a Republican (https://www.nytimes.com/interactive/2019/08/08/opinion/sunday/party-polarization-quiz.html).
 ![decision-tree](../images/nytimes-decision-tree.png)
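
 To make the idea of choosing a rule by a criterion concrete, here is a minimal sketch in base R. It uses a small made-up data set (the `toy` values are hypothetical, not Titanic data) and a hypothetical helper, `split_error()`, to find the single age cutoff with the fewest misclassifications. Note that `rpart()` uses the Gini index as its default splitting criterion for classification rather than raw misclassification error, so this illustrates the idea and not the exact computation.

```r
# Toy data: hypothetical values, not taken from the Titanic data
toy <- data.frame(
  age = c(4, 8, 15, 22, 30, 41, 56, 63),
  outcome = c("Survived", "Survived", "Survived", "Died",
              "Survived", "Died", "Died", "Died")
)

# Misclassification error of the rule
# "age below the threshold -> Survived, otherwise -> Died"
split_error <- function(threshold, data) {
  predicted <- ifelse(data$age < threshold, "Survived", "Died")
  mean(predicted != data$outcome)
}

# Candidate thresholds: midpoints between consecutive sorted ages
ages <- sort(unique(toy$age))
thresholds <- (head(ages, -1) + tail(ages, -1)) / 2

# Evaluate every candidate and keep the first one with the lowest error
errors <- sapply(thresholds, split_error, data = toy)
best_threshold <- thresholds[which.min(errors)]

data.frame(threshold = thresholds, error = errors)
best_threshold
```

 A tree repeats this kind of search within each resulting group, which is how a single rule grows into a set of nested rules.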

 ## Predict Survival
 Let's try to predict whether a passenger survived the Titanic disaster. Which attributes might be important in predicting whether an individual survived? What would be a good performance measure for understanding how well the model identifies whether a person survived?

 Let's first look at how many individuals survived. For this, we can use the `count()` function. The first argument is the data; the second is the attribute whose unique values we want to count.

In [0]:
count(titanic, survived)


 As you can see, about 38% survived the disaster, 342 out of 891.
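
 The percentage can also be computed directly rather than by hand. The sketch below is one way to do it, assuming the dplyr and titanic packages are installed. The names `labeled` and `survival_props` are illustrative, and only `titanic_train` is used because the test rows have no recorded outcome.

```r
library(dplyr)
library(titanic)

# Label the outcome the same way as in the setup code
labeled <- titanic_train %>%
  mutate(survived = ifelse(Survived == 1, "Survived", "Died"))

# Count each outcome, then divide by the total to get proportions
survival_props <- labeled %>%
  count(survived) %>%
  mutate(proportion = n / sum(n))

survival_props
```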

 We can also create a bar chart that shows the number that survived.

In [0]:
gf_bar(~ survived, data = titanic)


 ## Fitting a Classification Tree
 Let's fit our first classification tree to predict the dichotomous attribute, survival. For this, we will use the `rpart()` function from the rpart package. The first argument to `rpart()` is a formula in which the outcome of interest is specified to the left of the `~` and the attributes used to predict the outcome are specified to the right of the `~`, separated by `+` signs. The second argument specifies the method for the analysis; in this case we want to classify individuals based on the values in the data, so we specify `method = 'class'`. The final argument is the data, in this case `titanic`.

 In this example, I picked a handful of attributes that seemed likely to be important. These can be numeric or categorical; the method accepts either type of attribute. Notice that I save the fitted model to the object `class_tree`.

In [0]:
class_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch, 
   method = 'class', data = titanic)
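
 One way to check whether the chosen attributes actually matter is to look at the `variable.importance` component of the fitted object. The sketch below refits the same formula on the labeled training rows so that it runs on its own; the names `train` and `demo_tree` are illustrative and are used to avoid overwriting `class_tree`.

```r
library(rpart)
library(titanic)

# Rebuild the labeled training data used in this section
train <- titanic_train
train$survived <- ifelse(train$Survived == 1, "Survived", "Died")

# Same formula as above, stored under a different (illustrative) name
demo_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
                   method = "class", data = train)

# Importance scores for attributes that appear in primary or surrogate
# splits; attributes that are never used are not listed
demo_tree$variable.importance
```

 Larger values indicate attributes that contributed more to the splits in the tree.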


 Visualizing the model object can be a good way to understand what is happening with the model. Here we are using the `rpart.plot()` function to create a nicer looking visualization than the default. The primary argument for this function is the model object that was saved above.

In [0]:
rpart.plot(class_tree, roundint = FALSE, type = 3, branch = .3)


 We can also print out a list of the decision rules based on the classification model using the `rpart.rules()` function. This function again takes the model object fitted above as the primary argument.

In [0]:
rpart.rules(class_tree, cover = TRUE)


 ### Pruning Trees
 One downside of decision trees is that they tend to overfit the data, capitalizing on chance variation in our sample that does not generalize to another sample. In other words, the tree picks up on features of the current sample that would not be present in another sample of data. There are a few ways to overcome this. One is to prune the tree so that it keeps only the splits on the attributes that are most important and improve the classification accuracy. One statistic used for this is the complexity parameter (CP), which balances the complexity of the tree against how much each additional level of the tree improves the classification accuracy. We can view these statistics with the `printcp()` and `plotcp()` functions, where the only argument that needs to be specified is the classification tree saved in the previous step.

In [0]:
printcp(class_tree)
plotcp(class_tree)
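
 The table printed by `printcp()` is also stored in the `cptable` component of the fitted object, which makes it easy to work with directly. The `xerror` column comes from cross-validation (10 folds by default), and the folds are assigned at random, so its values can change slightly from run to run. The sketch below fixes the random seed before fitting to make the table reproducible. The seed value and the names `train`, `demo_tree`, and `cp_table` are arbitrary choices for illustration.

```r
library(rpart)
library(titanic)

# Rebuild the labeled training data used in this section
train <- titanic_train
train$survived <- ifelse(train$Survived == 1, "Survived", "Died")

# Fix the seed so the cross-validation folds, and therefore xerror,
# are the same each time this runs (2024 is an arbitrary value)
set.seed(2024)
demo_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
                   method = "class", data = train)

# The CP table as a data frame: one row per candidate tree size
cp_table <- as.data.frame(demo_tree$cptable)
cp_table

# Row with the smallest cross-validated error
cp_table[which.min(cp_table$xerror), ]
```

 Each row describes a tree with `nsplit` splits; the first row is the tree with no splits at all.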


 ### Perform the Pruning
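 To perform the pruning, the `prune()` function is used. Its `cp` argument sets the complexity threshold; here we use the CP value from the row of the CP table with the smallest cross-validated error (`xerror`).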

In [0]:
# Use the CP value with the smallest cross-validated error (xerror)
best_cp <- class_tree$cptable[which.min(class_tree$cptable[, "xerror"]), "CP"]
prune_class_tree <- prune(class_tree, cp = best_cp)

# plot the pruned tree and the decision rules


In [0]:
rpart.plot(prune_class_tree, roundint = FALSE, type = 3, branch = .3)
rpart.rules(prune_class_tree, cover = TRUE)
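
 To see how much pruning simplified the tree, we can compare the number of terminal nodes (leaves) before and after. This is a minimal sketch that refits and prunes a tree so it runs on its own. `count_leaves()` is a hypothetical helper, and `train`, `demo_tree`, and `demo_pruned` are illustrative names. Depending on the cross-validation results, the pruned tree may be the same size as the full tree.

```r
library(rpart)
library(titanic)

# Rebuild the labeled training data used in this section
train <- titanic_train
train$survived <- ifelse(train$Survived == 1, "Survived", "Died")

set.seed(2024)  # arbitrary seed so the cross-validation is reproducible
demo_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
                   method = "class", data = train)

# Prune at the CP value with the smallest cross-validated error
best_cp <- demo_tree$cptable[which.min(demo_tree$cptable[, "xerror"]), "CP"]
demo_pruned <- prune(demo_tree, cp = best_cp)

# Terminal nodes are marked "<leaf>" in the frame component of the fit
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

c(full = count_leaves(demo_tree), pruned = count_leaves(demo_pruned))
```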


 ### Accuracy
 We can explore model performance by looking at the percentage of correct classifications. In other words, does the classification tree accurately classify passengers as surviving or not surviving? To find out, we use the model to apply the decision rules shown in the figure above and classify each passenger as survived or died. This can be done quickly in R using the `predict()` function. Because no new data are supplied, the predictions are for the same passengers that were used to fit the tree.

In [0]:
titanic_predict <- titanic %>%
  mutate(tree_predict = predict(prune_class_tree, type = 'class')) %>%
  cbind(predict(prune_class_tree, type = 'prob'))
head(titanic_predict, n = 20)
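
 The two `type` values used above return different things: `type = 'class'` gives one predicted label per passenger, while `type = 'prob'` gives a matrix with one column of probabilities per class. The sketch below, which refits an unpruned tree under the illustrative name `demo_tree` so that it runs on its own, shows how the two relate. With the default settings, the predicted class is expected to be the one with the larger probability, and the last line checks how often that holds.

```r
library(rpart)
library(titanic)

# Rebuild the labeled training data used in this section
train <- titanic_train
train$survived <- ifelse(train$Survived == 1, "Survived", "Died")

demo_tree <- rpart(survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
                   method = "class", data = train)

# One row per passenger, one column per class
probs <- predict(demo_tree, type = "prob")
classes <- predict(demo_tree, type = "class")
head(probs)

# Each row of probabilities should sum to 1
range(rowSums(probs))

# Share of passengers whose predicted class is the higher-probability column
most_likely <- colnames(probs)[max.col(probs, ties.method = "first")]
mean(most_likely == as.character(classes))
```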


 We can then create a table to show how the observed and predicted values compare.

In [0]:
titanic_predict %>%
  count(survived, tree_predict)


 This result can also be visualized.

In [0]:
gf_bar(~ survived, fill = ~tree_predict, data = titanic_predict)


 Normalizing the groups can be advantageous as this can help to show percentages more directly.

In [0]:
gf_bar(~ survived, fill = ~tree_predict, data = titanic_predict, position = 'fill')


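 Finally, to get a numeric summary, we can compute the classification accuracy. The mean of `same_class` is the proportion of passengers classified correctly, and the sum is the number classified correctly.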

In [0]:
titanic_predict %>%
  mutate(same_class = ifelse(survived == tree_predict, 1, 0)) %>%
  df_stats(~ same_class, mean, sum)
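
 Overall accuracy can hide differences between the two groups, so it can help to compute accuracy separately for those who died and those who survived. The sketch below shows the arithmetic in base R using ten made-up passengers; the `observed` and `predicted` vectors are hypothetical and are not output from the model above.

```r
# Hypothetical observed and predicted labels for ten passengers
observed  <- c("Died", "Died", "Died", "Died", "Died", "Died",
               "Survived", "Survived", "Survived", "Survived")
predicted <- c("Died", "Died", "Died", "Died", "Died", "Survived",
               "Survived", "Survived", "Survived", "Died")

# Rows are observed outcomes, columns are predicted outcomes
confusion <- table(observed, predicted)
confusion

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy

# Accuracy within each observed group
per_group <- diag(confusion) / rowSums(confusion)
per_group
```

 The same calculation could be applied to the `survived` and `tree_predict` columns of `titanic_predict`.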


