## Decision tree
- This section introduces several ways to make sure that any model we're asked to build or discuss generalizes well, is evaluated correctly, and is properly selected from among competing models.
- Here we'll tune min_samples_split, the minimum number of samples a node must contain before it can be split further, and max_depth, the maximum depth to which the tree may grow. A deeper tree makes more splits and therefore captures more information about the training data, which also makes it more prone to overfitting.
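
To make the effect of these two hyperparameters concrete, here is a minimal sketch that fits one unconstrained and one constrained tree and compares their size. The data comes from `make_classification`, a synthetic stand-in rather than the loan dataset, and the names `X_demo`, `y_demo`, `full_tree`, and `small_tree` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=123
)

# Unconstrained tree: keeps splitting until no further split is possible
full_tree = DecisionTreeClassifier(random_state=123).fit(X_demo, y_demo)

# Constrained tree: at most 3 levels, and a node needs 20+ samples to be split
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, random_state=123
).fit(X_demo, y_demo)

for name, tree in [("unconstrained", full_tree), ("constrained", small_tree)]:
    print(
        "{}: depth={}, leaves={}, training accuracy={:.3f}".format(
            name, tree.get_depth(), tree.get_n_leaves(), tree.score(X_demo, y_demo)
        )
    )
```

The unconstrained tree will typically score higher on the data it was trained on, which is exactly why training accuracy alone is a poor guide to how well a model generalizes.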

In [2]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
loan_data = pd.read_csv('data/Loan payments data.csv')
# X
# y

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# Instantiate, Fit, Predict
loans_clf = DecisionTreeClassifier() 
loans_clf.fit(X_train, y_train)
y_pred = loans_clf.predict(X_test)

# Evaluation metric
print("Decision Tree Accuracy: {}".format(accuracy_score(y_test,y_pred)))

- Import the correct function to perform cross-validated grid search.
- Instantiate a decision tree classifier and use it with the parameter grid to perform a cross-validated grid search.
- Fit the grid search and print the model evaluation metrics.


In [3]:
# Import modules
from sklearn.model_selection import GridSearchCV

In [None]:
# Create the hyperparameter grid
param_grid = {"criterion": ["gini"], "min_samples_split": [2, 10, 20], 
              "max_depth": [None, 2, 5, 10]}

# Instantiate classifier and GridSearchCV, fit
loans_clf = DecisionTreeClassifier()
dtree_cv = GridSearchCV(loans_clf, param_grid, cv=5)
dtree_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Decision Tree Parameters: {}".format(dtree_cv.best_params_))
print("Tuned Decision Tree Accuracy: {}".format(dtree_cv.best_score_))
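
Note that `best_score_` is the mean cross-validated accuracy on the training folds, while the baseline accuracy was measured on the held-out test set. A like-for-like comparison scores the tuned model on the same test set. The sketch below shows one way to do that; it uses synthetic data from `make_classification` as a stand-in for the loan data, and the names `X_tr`, `X_te`, `baseline`, and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# Baseline: default hyperparameters, no tuning
baseline = DecisionTreeClassifier(random_state=123).fit(X_tr, y_tr)

# Tuned: 5-fold cross-validated grid search over the same kind of grid
grid = {"min_samples_split": [2, 10, 20], "max_depth": [None, 2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=123), grid, cv=5)
search.fit(X_tr, y_tr)

baseline_test_acc = baseline.score(X_te, y_te)
tuned_cv_acc = search.best_score_           # mean accuracy across the CV folds
tuned_test_acc = search.score(X_te, y_te)   # refit best model, held-out test set

print("Baseline test accuracy: {:.3f}".format(baseline_test_acc))
print("Tuned CV accuracy:      {:.3f}".format(tuned_cv_acc))
print("Tuned test accuracy:    {:.3f}".format(tuned_test_acc))
```

Because `GridSearchCV` refits the best parameter combination on the full training set by default, `search.score` and `search.predict` can be used directly on the test data.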

#### Tuning hyperparameters with a k-fold cross-validated grid search improved the accuracy of the decision tree model by more than 10 percent!

### A forest of decision trees
- **Task**: practice using the Random Forest, an ensemble of Decision Trees each trained on a bootstrapped sample of the data. We'll then compare its accuracy to that of a model whose hyperparameters we've tuned with cross-validation.

- This time, we'll tune an additional hyperparameter, **max_features**, which sets **how many features the model considers when looking for the best split**. In older scikit-learn releases it defaulted to auto when not set explicitly; current releases default to sqrt for RandomForestClassifier and None (all features) for DecisionTreeClassifier. Something to keep in mind for an interview is that Decision Trees consider all features by default, whereas Random Forests usually consider the square root of the number of features.
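
The difference in defaults can be checked directly on fitted models. This is a minimal sketch on synthetic data with 16 features (an assumption chosen so the square root is a whole number); it reads the fitted `max_features_` attribute of a single tree and of one tree inside the forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 16 features (an assumption; not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=123
)

# Both models are left at their default max_features setting
tree = DecisionTreeClassifier(random_state=123).fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=10, random_state=123).fit(X_demo, y_demo)

# max_features_ is the number of features actually considered at each split
tree_features_per_split = tree.max_features_
forest_features_per_split = forest.estimators_[0].max_features_

print("Decision tree:", tree_features_per_split)    # 16, all features
print("Random forest:", forest_features_per_split)  # 4, the square root of 16
```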

In [4]:
# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# Instantiate, Fit, Predict
loans_rf = RandomForestClassifier() 
loans_rf.fit(X_train, y_train)
y_pred = loans_rf.predict(X_test)

# Evaluation metric
print("Random Forest Accuracy: {}".format(accuracy_score(y_test,y_pred)))

In [None]:
# Create the hyperparameter grid
# (the max_features values assume X has at least 30 feature columns)
param_grid = {"criterion": ["gini"], "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10], "max_features": [10, 20, 30]}

# Instantiate classifier and GridSearchCV, fit
loans_rf = RandomForestClassifier()
rf_cv = GridSearchCV(loans_rf, param_grid, cv=5)
rf_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Random Forest Parameters: {}".format(rf_cv.best_params_))
print("Tuned Random Forest Accuracy: {}".format(rf_cv.best_score_))
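
The best parameters and best score only describe the winning combination. To see how every combination in the grid performed, `cv_results_` can be loaded into a pandas DataFrame. The sketch below assumes synthetic data and a deliberately small grid and forest so that it runs quickly; `small_grid`, `search`, and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=123
)

# A deliberately small grid and forest, only to keep the example fast
small_grid = {"max_depth": [2, 5], "max_features": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=123), small_grid, cv=5
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Looking at `std_test_score` alongside `mean_test_score` helps judge whether the gap between the top combinations is larger than the fold-to-fold variation.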

#### Although tuning with k-fold cross-validation did not improve the random forest model as much as it did the decision tree, it still gave a 7 percent improvement over the baseline!