#### Building a simple decision tree
The loans dataset contains 11,312 randomly selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.

You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree's predictions differ for an applicant with good credit versus one with bad credit.

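The dataset loans is already in your workspace. The code below also assumes that good_credit and bad_credit, each holding the data for a single applicant, are available there.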

In [None]:
# Load the rpart package
library(rpart)

str(loans)
# 'data.frame':	11312 obs. of  14 variables:
#  $ loan_amount       : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 2 3 2 3 3 2 1 3 ...
#  $ emp_length        : Factor w/ 5 levels "10+ years","2 - 5 years",..: 1 4 3 2 4 4 2 1 1 4 ...
#  $ home_ownership    : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 3 4 4 4 1 4 4 ...
#  $ income            : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 3 3 2 2 1 1 1 3 ...
#  $ loan_purpose      : Factor w/ 14 levels "car","credit_card",..: 2 1 1 12 10 3 10 7 3 7 ...
#  $ debt_to_income    : Factor w/ 3 levels "AVERAGE","HIGH",..: 2 3 3 3 1 1 3 1 1 3 ...
#  $ credit_score      : Factor w/ 3 levels "AVERAGE","HIGH",..: 1 1 3 1 1 1 1 2 1 1 ...
#  $ recent_inquiry    : Factor w/ 2 levels "NO","YES": 2 2 2 2 1 2 2 1 1 2 ...
#  $ delinquent        : Factor w/ 3 levels "IN PAST 2 YEARS",..: 3 3 3 3 3 3 3 3 3 3 ...
#  $ credit_accounts   : Factor w/ 3 levels "AVERAGE","FEW",..: 2 2 2 1 2 2 3 3 1 1 ...
#  $ bad_public_record : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
#  $ credit_utilization: Factor w/ 3 levels "HIGH","LOW","MEDIUM": 1 2 1 3 3 1 3 2 1 3 ...
#  $ past_bankrupt     : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
#  $ outcome           : Factor w/ 2 levels "default","repaid": 2 1 2 1 1 1 1 2 1 1 ...

# Build a lending model predicting loan outcome from loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
#      1 
# repaid 
# Levels: default repaid

# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
#       1 
# default 
# Levels: default repaid
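
The objects good_credit and bad_credit are not defined in the code above. As a rough sketch of what such single-applicant inputs could look like, the example below simulates a small stand-in dataset (toy_loans and toy_model are hypothetical names, and the simulated data is not the Lending Club data), builds two one-row data frames whose factor levels match the training data, and asks the tree for both a class label and class probabilities.

```r
library(rpart)

# Simulated stand-in for the loans data (an assumption, not the real dataset)
set.seed(42)
n <- 2000
toy_loans <- data.frame(
  loan_amount  = factor(sample(c("HIGH", "LOW", "MEDIUM"), n, replace = TRUE)),
  credit_score = factor(sample(c("AVERAGE", "HIGH", "LOW"), n, replace = TRUE))
)
# Simulated repayment probability rises with credit score
p_repaid <- c(AVERAGE = 0.5, HIGH = 0.8, LOW = 0.2)[as.character(toy_loans$credit_score)]
toy_loans$outcome <- factor(ifelse(runif(n) < p_repaid, "repaid", "default"),
                            levels = c("default", "repaid"))

toy_model <- rpart(outcome ~ loan_amount + credit_score, data = toy_loans,
                   method = "class", control = rpart.control(cp = 0))

# Hypothetical one-row applicants; factor levels must match the training data
good_credit <- data.frame(
  loan_amount  = factor("LOW",  levels = levels(toy_loans$loan_amount)),
  credit_score = factor("HIGH", levels = levels(toy_loans$credit_score))
)
bad_credit <- data.frame(
  loan_amount  = factor("LOW", levels = levels(toy_loans$loan_amount)),
  credit_score = factor("LOW", levels = levels(toy_loans$credit_score))
)

# type = "class" returns the predicted label
good_class <- predict(toy_model, good_credit, type = "class")
# type = "prob" returns one probability column per outcome level
bad_prob <- predict(toy_model, bad_credit, type = "prob")
```

Asking for type = "prob" is often more informative than the label alone, because it shows how confident the tree is in the leaf that the applicant falls into.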

#### Visualizing classification trees
Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps you understand how the tree makes its decisions. The model loan_model that you fit in the last exercise is in your workspace.

In [None]:
# Examine the loan_model object
loan_model
# n= 11312 

# node), split, n, loss, yval, (yprob)
#       * denotes terminal node

#  1) root 11312 5654 repaid (0.4998232 0.5001768)  
#    2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
#      4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
#      5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
#       10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
#       11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
#         22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
#         23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
#    3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *

# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

![loan_model_tree](./figures/tree_1.png)

![loan_model_tree](./figures/tree_2.png)
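
To make the printed tree easier to read, here is a small worked calculation using only the n and loss figures from the terminal nodes in the output above. In each row, loss is the number of training applicants in that node whose actual outcome differs from the node's predicted label, so loss / n is the node's training error rate. The data frame name leaves is just an illustrative choice.

```r
# Figures copied from the terminal nodes (marked *) of the printed tree
leaves <- data.frame(
  node = c(4, 10, 22, 23, 3),
  n    = c(1667, 2472, 1810, 3541, 1822),
  loss = c(631, 1079, 874, 1688, 601)
)

# Share of training applicants in each leaf that the leaf's label gets wrong;
# this equals the smaller of the two yprob values shown for the node
leaves$error_rate <- leaves$loss / leaves$n

# The leaves partition the training data, so their sizes add up to the root n
total_n <- sum(leaves$n)

# Training accuracy of the whole tree: 1 - (total loss / total n)
train_accuracy <- 1 - sum(leaves$loss) / total_n
```

Note that this is accuracy on the data the tree was trained on, which is generally an optimistic estimate of how the tree will do on new applicants.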

#### Creating random test datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

You can use 75% of the observations for training and 25% for testing the model.

The sample() function can be used to generate a random sample of rows to include in the training set. Supply it with the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset loans into training and testing datasets. The dataset loans is loaded in your workspace.

In [None]:
# Determine the number of rows for training
nrow(loans) * 0.75

# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]
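
Because sample() draws different rows on every run, the split above is not repeatable as written. One common way to make it repeatable is to call set.seed() first. The sketch below shows this on a simulated stand-in data frame (toy is a hypothetical name, and the seed value 123 is arbitrary), and adds two sanity checks on the resulting split.

```r
# Fix the random number generator so the same rows are drawn on every run
set.seed(123)

# Simulated stand-in data frame; swap in loans when it is available
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# floor() guards against a fractional row count when nrow * 0.75 is not whole
n_train <- floor(nrow(toy) * 0.75)
sample_rows <- sample(nrow(toy), n_train)

toy_train <- toy[sample_rows, ]
toy_test  <- toy[-sample_rows, ]

# Sanity checks: the sizes add up and no row appears in both sets
sizes_add_up <- nrow(toy_train) + nrow(toy_test) == nrow(toy)
no_overlap   <- length(intersect(toy_train$id, toy_test$id)) == 0
```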

#### Building and evaluating a larger tree
Previously, you created a simple decision tree that used the applicant's credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.

The rpart package is loaded into the workspace and the loans_train and loans_test datasets have been created.

In [None]:
# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$outcome, loans_test$pred)
#           default repaid
#   default     821    632
#   repaid      546    829

# Compute the accuracy on the test dataset
mean(loans_test$outcome == loans_test$pred)
# [1] 0.5834512
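
Accuracy alone hides how the errors are distributed between the two classes. As a short worked example, the counts from the confusion matrix above can be re-entered by hand (rows are actual outcomes, columns are predictions) to recompute the accuracy and to derive recall and precision for the default class. The variable names here are illustrative.

```r
# Counts copied from the confusion matrix above; matrix() fills column by column
conf_mat <- matrix(c(821, 546, 632, 829), nrow = 2,
                   dimnames = list(actual    = c("default", "repaid"),
                                   predicted = c("default", "repaid")))

# Accuracy: correct predictions (the diagonal) over all test applicants
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Recall for "default": share of actual defaults that the tree caught
recall_default <- conf_mat["default", "default"] / sum(conf_mat["default", ])

# Precision for "default": share of predicted defaults that really defaulted
precision_default <- conf_mat["default", "default"] / sum(conf_mat[, "default"])
```

The accuracy computed this way agrees with the 0.5834512 returned by mean() above, since both count the same 1,650 correct predictions out of 2,828 test applicants.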


#### Preventing overgrown trees
The tree grown on the full set of applicant data grew to be extremely large and complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count affect the resulting tree.

The rpart package is loaded.

In [None]:
# Pre-prune with a minimum split of 500 observations instead of a maxdepth limit
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Make predictions on the test dataset and compute the accuracy. How does it change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")

mean(loans_test$pred == loans_test$outcome)
# [1] 0.5922914
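
To see the effect of the two pre-pruning controls side by side, the sketch below fits three trees on a simulated stand-in dataset (toy and the column names x1, x2, x3 are hypothetical, not the Lending Club data) and counts the leaves of each. The helper count_leaves() is an illustrative function, not part of rpart; it relies on rpart marking terminal nodes as "<leaf>" in the model's frame.

```r
library(rpart)

# Simulated stand-in for loans_train (an assumption, not the real dataset)
set.seed(1)
n <- 3000
toy <- data.frame(
  x1 = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x2 = factor(sample(c("HIGH", "LOW", "MEDIUM"), n, replace = TRUE)),
  x3 = runif(n)
)
p_repaid <- ifelse(toy$x1 == "A", 0.65, 0.45)
toy$outcome <- factor(ifelse(runif(n) < p_repaid, "repaid", "default"))

# Illustrative helper: terminal nodes are labelled "<leaf>" in model$frame
count_leaves <- function(model) sum(model$frame$var == "<leaf>")

# No early stopping beyond rpart's defaults
full <- rpart(outcome ~ ., data = toy, method = "class",
              control = rpart.control(cp = 0))

# Stop after at most 3 levels of splits below the root
shallow <- rpart(outcome ~ ., data = toy, method = "class",
                 control = rpart.control(cp = 0, maxdepth = 3))

# Only attempt a split when a node holds at least 500 observations
coarse <- rpart(outcome ~ ., data = toy, method = "class",
                control = rpart.control(cp = 0, minsplit = 500))

leaf_counts <- c(full = count_leaves(full),
                 maxdepth_3 = count_leaves(shallow),
                 minsplit_500 = count_leaves(coarse))
```

A tree limited to maxdepth = 3 can have at most 8 leaves, while minsplit bounds the size of the nodes that are split rather than the depth. Keep in mind that rpart.control() sets minbucket to round(minsplit / 3) by default, so raising minsplit also raises the minimum leaf size unless minbucket is given explicitly.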

#### Creating a nicely pruned tree
Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it might have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree, then prune it to be smaller and more efficient.

In this exercise, you will construct a visualization of the tree's performance versus complexity, and use this information to prune the tree to an appropriate level.

The rpart package is loaded into the workspace, along with loans_test and loans_train.

In [None]:
# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$outcome == loans_test$pred)
# [1] 0.6007779

![post_pruned_tree](./figures/tree_3.png)
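
The cp value passed to prune() above was read off the complexity plot by eye. One possible alternative, sketched here on simulated stand-in data (toy and its columns are hypothetical), is to pick the value from the fitted model's cptable, which holds the same cross-validated error figures that plotcp() draws. Choosing the row with the lowest xerror is only one reasonable rule, not the one used in the exercise.

```r
library(rpart)

# Simulated stand-in for loans_train (an assumption, not the real dataset)
set.seed(7)
n <- 3000
toy <- data.frame(
  x1 = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x2 = factor(sample(c("HIGH", "LOW", "MEDIUM"), n, replace = TRUE)),
  x3 = runif(n)
)
p_repaid <- ifelse(toy$x1 == "A", 0.65, 0.45)
toy$outcome <- factor(ifelse(runif(n) < p_repaid, "repaid", "default"))

# Grow an overly complex tree
full <- rpart(outcome ~ ., data = toy, method = "class",
              control = rpart.control(cp = 0))

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd
cp_table <- full$cptable

# Pick the cp whose subtree has the lowest cross-validated error
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune the large tree back using the selected cp
pruned <- prune(full, cp = best_cp)
```

Because xerror comes from random cross-validation folds, the selected cp can change between runs unless a seed is set before fitting.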

#### Building a random forest model
Although a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

In [None]:
# Load the randomForest package
library(randomForest)

# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
# (type = "response" returns the predicted class labels)
loans_test$pred <- predict(loan_model, loans_test, type = "response")
mean(loans_test$outcome == loans_test$pred)
# [1] 0.6000707
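
Since the forest's results vary from run to run, it can help to fix the seed and to look at the out-of-bag (OOB) error that randomForest records during training. The sketch below uses simulated stand-in data (toy, toy_train and toy_test are hypothetical names), an arbitrary seed, and ntree = 200 as an example setting. It is wrapped in a requireNamespace() guard so that it is skipped when the randomForest package is not installed.

```r
# Skip the example if the randomForest package is not installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  # A fixed seed makes both the simulated data and the forest repeatable
  set.seed(2024)

  # Simulated stand-in for the loans data (an assumption, not the real dataset)
  n <- 2000
  toy <- data.frame(
    x1 = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
    x2 = factor(sample(c("HIGH", "LOW", "MEDIUM"), n, replace = TRUE)),
    x3 = runif(n)
  )
  p_repaid <- ifelse(toy$x1 == "A", 0.65, 0.45)
  toy$outcome <- factor(ifelse(runif(n) < p_repaid, "repaid", "default"))

  # 75% / 25% train-test split
  train_rows <- sample(nrow(toy), floor(nrow(toy) * 0.75))
  toy_train <- toy[train_rows, ]
  toy_test  <- toy[-train_rows, ]

  # ntree sets the number of trees in the forest
  forest <- randomForest::randomForest(outcome ~ ., data = toy_train, ntree = 200)

  # OOB error after the final tree: an internal estimate of the test error
  oob_error <- forest$err.rate[forest$ntree, "OOB"]

  # Accuracy on the held-out test rows
  test_pred <- predict(forest, toy_test)
  test_accuracy <- mean(test_pred == toy_test$outcome)
}
```

Comparing 1 - oob_error with test_accuracy gives a quick check on whether the internal estimate and the held-out estimate roughly agree.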