### Codio Activity 14.6:  Preventing Overfitting by Limiting Growth

**Expected Time = 60 minutes**

**Total Points = 50**

This activity focuses on the hyperparameters of the scikit-learn decision tree that restrict how far the tree can grow. You will compare different combinations of these hyperparameters and use a test set to evaluate each classifier and identify the best settings.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

### The Data

For this exercise, you will use the credit card default dataset.  Again, the goal is to predict credit card default.  Below, the data is loaded, cleaned, and split for you.

In [2]:
! pip install xlrd



In [3]:
default = pd.read_excel("data/Default.xls", skiprows=1)
default.head()

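| Unnamed: 0 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |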


In [4]:
default = default.rename({"default payment next month": "default"}, axis=1)

In [5]:
# default.info()

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    default.drop("default", axis=1), default.default, random_state=42
)

[Back to top](#-Index)

### Problem 1

#### Unlimited Growth

**10 Points**

Use the default settings for the `DecisionTreeClassifier` to fit the estimator on the training data, then compare the training and test set accuracy scores. Assign the estimator to `dtree` and the training and test scores as floats to `train_acc` and `test_acc`, respectively. Examine the depth of the tree with the `.get_depth()` method and assign it to `depth_1`.

<div class="alert alert-block alert-info"><b>Note: </b> Use <code>random_state = 42</code> for all estimators in this assignment!</div>

In [7]:
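def get_tree(**kwargs):
    """Fit a decision tree on the training split; return it with its depth and accuracies."""
    # random_state=42 keeps every estimator in this activity reproducible
    dtree = DecisionTreeClassifier(random_state=42, **kwargs).fit(X_train, y_train)
    depth = dtree.get_depth()
    # Accuracy on the data the tree was fit on vs. the held-out test split
    train_acc = accuracy_score(y_train, dtree.predict(X_train))
    test_acc = accuracy_score(y_test, dtree.predict(X_test))
    return dtree, depth, train_acc, test_acc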

In [8]:
### GRADED
dtree, depth_1, train_acc, test_acc = get_tree()

### Answer Check
print(f"Training Accuracy: {train_acc: .2f}")
print(f"Test Accuracy: {test_acc: .2f}")
print(f"Depth of tree: {depth_1}")

Training Accuracy:  1.00
Test Accuracy:  0.73
Depth of tree: 41
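
A perfect training score next to a much lower test score is the usual signature of overfitting. As a minimal sketch on synthetic data (the `make_classification` settings and the `*_demo` names are assumptions for illustration, not part of the activity), an unrestricted tree memorizes noisy training labels but does worse on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data; flip_y injects label noise
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Default settings: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc_demo = accuracy_score(y_tr, full_tree.predict(X_tr))
test_acc_demo = accuracy_score(y_te, full_tree.predict(X_te))

print(f"Train accuracy: {train_acc_demo:.2f}")
print(f"Test accuracy:  {test_acc_demo:.2f}")
print(f"Generalization gap: {train_acc_demo - test_acc_demo:.2f}")
print(f"Depth: {full_tree.get_depth()}")
```

The exact numbers depend on the synthetic data, but the pattern should mirror the credit data: the training score reaches 1.0 while the test score lags behind.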


[Back to top](#-Index)

### Problem 2

#### `min_samples_split`

Setting the `min_samples_split` argument controls when a node may be split, expressed either as an absolute number of samples or as a fraction of the data. From the estimator's docstring:

```
min_samples_split : int or float, default=2
    The minimum number of samples required to split an internal node:

    - If int, then consider `min_samples_split` as the minimum number.
    - If float, then `min_samples_split` is a fraction and
      `ceil(min_samples_split * n_samples)` are the minimum
      number of samples for each split.
```
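
To make the float case in the docstring concrete, the sketch below converts a fraction into an absolute sample count. The helper name `effective_min_samples_split`, the training-set size of 1000, and the floor of 2 are assumptions made for illustration; only the `ceil(min_samples_split * n_samples)` rule comes from the docstring.

```python
import math


def effective_min_samples_split(min_samples_split, n_samples):
    """Translate `min_samples_split` into an absolute sample count."""
    if isinstance(min_samples_split, int):
        # Int case: the value already is the minimum number of samples
        return min_samples_split
    # Float case, following the docstring: ceil(fraction * n_samples).
    # The floor of 2 is an assumption of this sketch: a node needs at
    # least two samples before it can be split at all.
    return max(2, math.ceil(min_samples_split * n_samples))


n_train = 1000  # hypothetical training-set size, not the real one
print(effective_min_samples_split(0.05, n_train))  # 5% of the samples
print(effective_min_samples_split(20, n_train))    # absolute count
```

Under this rule, a larger training set raises the absolute threshold for the same fraction, so a float setting scales with the data while an int setting does not.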

Use this to limit the tree's growth so that only nodes containing at least 5% of the samples are split. Assign the estimator to `dtree_samples` and the train and test accuracy as floats to `samples_train_acc` and `samples_test_acc`, respectively. Assign the depth of the tree to `depth_2` below. Remember to set `random_state = 42` in your estimator.

**10 Points**


In [9]:
### GRADED
dtree_samples, depth_2, samples_train_acc, samples_test_acc = get_tree(
    min_samples_split=5 / 100
)

### Answer Check
print(f"Training Accuracy: {samples_train_acc: .2f}")
print(f"Test Accuracy: {samples_test_acc: .2f}")
print(f"Depth of tree: {depth_2}")

Training Accuracy:  0.82
Test Accuracy:  0.82
Depth of tree: 24


[Back to top](#-Index)

### Problem 3

#### `max_depth`

Below, create a tree that grows to a maximum depth of 5.  Assign the estimator as `depth_tree` and the accuracy on the train and test set as floats to `depth_train_acc` and `depth_test_acc` respectively.  Be sure to set `random_state = 42`. 
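
Because scikit-learn decision trees are binary, capping the depth also caps the size of the tree. A small sketch of that arithmetic (the helper names `max_leaves` and `max_nodes` are hypothetical, introduced only for this illustration):

```python
def max_leaves(depth):
    """Upper bound on leaves in a binary tree of the given depth."""
    return 2 ** depth


def max_nodes(depth):
    """Upper bound on total nodes (internal nodes plus leaves)."""
    return 2 ** (depth + 1) - 1


# With max_depth=5 the tree can make at most 32 distinct leaf predictions
print(max_leaves(5), max_nodes(5))
```

These are upper bounds only; a fitted tree usually has fewer nodes because branches stop early once a node is pure or another stopping rule applies.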



**10 Points**


In [10]:
### GRADED
depth_tree, depth_depth, depth_train_acc, depth_test_acc = get_tree(max_depth=5)

### Answer Check
print(f"Training Accuracy: {depth_train_acc: .2f}")
print(f"Test Accuracy: {depth_test_acc: .2f}")
print(f"Depth of tree: {depth_depth}")

Training Accuracy:  0.83
Test Accuracy:  0.82
Depth of tree: 5


[Back to top](#-Index)

### Problem 4

#### `min_impurity_decrease`

**10 Points**

This parameter stops splitting when a split would decrease impurity by less than the given amount. Below, fit a decision tree named `imp_tree` with `min_impurity_decrease = 0.01`, assign its depth to `depth_4`, and assign the train and test scores as floats to `imp_train_acc` and `imp_test_acc`, respectively. Set `random_state = 42` in your estimator.
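
The quantity compared against `min_impurity_decrease` is a weighted impurity decrease, which the scikit-learn docstring gives as `N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)`. The sketch below works through that formula by hand with Gini impurity (the default criterion). The class counts are made-up numbers and the function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the node's share of all samples."""
    n_t = sum(parent)
    n_l, n_r = sum(left), sum(right)
    return (n_t / n_total) * (
        gini(parent) - (n_r / n_t) * gini(right) - (n_l / n_t) * gini(left)
    )


# Hypothetical strong split at the root of a 10-sample dataset
strong = weighted_impurity_decrease([5, 5], [4, 0], [1, 5], n_total=10)

# Hypothetical weak split deep in a 1000-sample tree: the children are
# barely purer than the parent and the node holds only 10% of the data
weak = weighted_impurity_decrease([50, 50], [26, 24], [24, 26], n_total=1000)

threshold = 0.01
print(f"strong split: {strong:.4f}, kept: {strong >= threshold}")
print(f"weak split:   {weak:.5f}, kept: {weak >= threshold}")
```

The `N_t / N` weighting means that the same purity gain counts for less in a small node, which is one reason a threshold such as 0.01 can prune a tree down to just a few levels.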




In [11]:
### GRADED
imp_tree, depth_4, imp_train_acc, imp_test_acc = get_tree(min_impurity_decrease=0.01)

### Answer Check
print(f"Training Accuracy: {imp_train_acc: .2f}")
print(f"Test Accuracy: {imp_test_acc: .2f}")
print(f"Depth of tree: {depth_4}")

Training Accuracy:  0.82
Test Accuracy:  0.82
Depth of tree: 2


[Back to top](#-Index)

### Problem 5

#### Grid Searching Parameters

**10 Points**


Finally, combine the growth-limiting parameters explored above and tune them with a grid search over a decision tree. Assign the best parameters to `best_params` below and the train and test scores to `grid_train_acc` and `grid_test_acc`. Be sure to set `random_state = 42`.

In [12]:
params = {
    "min_impurity_decrease": [0.01, 0.02, 0.03, 0.05],
    "max_depth": [2, 5, 10],
    "min_samples_split": [0.1, 0.2, 0.05],
}

In [13]:
### GRADED
# random_state=42 on the estimator, as required for every model in this activity
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42), param_grid=params
).fit(X_train, y_train)
best_params = grid.best_params_
_, _, grid_train_acc, grid_test_acc = get_tree(**best_params)

### Answer Check
print(f"Training Accuracy: {grid_train_acc: .2f}")
print(f"Test Accuracy: {grid_test_acc: .2f}")
print(f"Best parameters of tree: {best_params}")

Training Accuracy:  0.82
Test Accuracy:  0.82
Best parameters of tree: {'max_depth': 2, 'min_impurity_decrease': 0.01, 'min_samples_split': 0.1}
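
With the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training data and exposes it as `best_estimator_`, so a fitted search can be scored directly. A minimal sketch on synthetic data (the dataset, the reduced `small_grid`, and the variable names are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the search finishes quickly
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Deliberately tiny grid: 2 x 2 = 4 candidate settings
small_grid = {"max_depth": [2, 5], "min_samples_split": [0.05, 0.1]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid=small_grid
).fit(X_tr, y_tr)

# The best configuration, already refit on all of X_tr
best_tree = search.best_estimator_

print(search.best_params_)
print(f"Train accuracy: {search.score(X_tr, y_tr):.2f}")
print(f"Test accuracy:  {search.score(X_te, y_te):.2f}")
print(f"Depth of best tree: {best_tree.get_depth()}")
```

Calling `search.score(...)` delegates to the refit estimator, so it should give the same accuracy as fitting a new tree with `best_params_` and the same `random_state`.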


Note how long even this basic grid search takes. Every value added to the grid multiplies the number of models that must be trained, so you likely don't want an overly exhaustive set of parameters.
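
The cost of a grid search can be estimated before running it. The sketch below counts the candidate settings in the `params` grid used above and the resulting number of model fits, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not passed, plus one final refit of the best candidate.

```python
from math import prod

params = {
    "min_impurity_decrease": [0.01, 0.02, 0.03, 0.05],
    "max_depth": [2, 5, 10],
    "min_samples_split": [0.1, 0.2, 0.05],
}

# Every combination of values is one candidate: 4 * 3 * 3
n_candidates = prod(len(values) for values in params.values())

cv_folds = 5  # assumed: the GridSearchCV default when cv is not specified
n_fits = n_candidates * cv_folds + 1  # +1 for the final refit on all training data

print(f"Candidates: {n_candidates}")
print(f"Total fits: {n_fits}")
```

Adding a fourth parameter with three values would triple both numbers, which is why the grid size grows so quickly.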