# Week07 - Prediction using Decision Tree (with Hyperparameter Tuning)

> NOTE: The census data proved too difficult to model well, so this notebook uses the Universal Bank dataset instead.


## Introduction and Overview


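Earlier notebooks in this series used the census income data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income). In this notebook, we instead load the Universal Bank dataset (`UniversalBank.csv`) and predict the `Personal Loan` column.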

Previously, we modeled the census data using k-NN (searching k from 1 through root N) and an unpruned decision tree. As previously discussed, for that particular context (relatively equal costs and benefits for false negatives (FN) vs. false positives (FP), and imbalanced classes) we argued that F1 was the best metric to maximize.
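
As a reminder of how these metrics relate, here is a minimal sketch that derives precision, recall, and F1 from the four cells of a confusion matrix. The labels `y_true` and `y_hat` are made-up values for illustration only; they are not taken from any of our datasets.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels for an imbalanced problem: 1 is the rare positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is built only from TP, FP, and FN, the many true negatives in an imbalanced dataset cannot inflate it the way they inflate accuracy.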

In previous attempts we found that a default decision tree (not pruned, with no hyperparameter tuning applied) produced an F1 score of approximately 0.49, while k-NN produced a recall of approximately 0.70 at k=19 (see a3_template_approach.ipynb and the accompanying video).

In this notebook, we will explore whether hyperparameter tuning can produce a better-performing model.

## Step 1: Install and import necessary packages

In [1]:
# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import numpy as np

In [2]:
random_seed = 1
np.random.seed(random_seed)


## Step 2: Load, clean and prepare data


### 2.1 Read data (UniversalBank.csv)

In [3]:
df = pd.read_csv('https://github.com/timcsmith/MIS536-Public/raw/master/Data/UniversalBank.csv')
df

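|  | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |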


### 2.2 Explore the dataset

In [4]:
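# Explore the dataset
# show the first five rows, the column names, summary statistics, and column types
print(df.head())
print(df.columns)
print(df.describe())
df.info()  # info() prints its report directly and returns None, so no print() is needed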

   ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  \
0   1   25           1      49     91107       4    1.6          1         0   
1   2   45          19      34     90089       3    1.5          1         0   
2   3   39          15      11     94720       1    1.0          1         0   
3   4   35           9     100     94112       1    2.7          2         0   
4   5   35           8      45     91330       4    1.0          2         0   

   Personal Loan  Securities Account  CD Account  Online  CreditCard  
0              0                   1           0       0           0  
1              0                   1           0       0           0  
2              0                   0           0       0           0  
3              0                   0           0       0           0  
4              0                   0           0       0           1  
Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education'

### 2.3 Clean/transform data (where necessary)

In [5]:
# based on findings from data exploration, we need to clean up the column names, as some have leading whitespace characters
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

Drop the columns we are not using as predictors (see previous notebooks; we are given a subset of input variables to consider).

In [6]:
df = df.drop(columns=['ID', 'ZIP Code'])
df

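|  | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 29 | 3 | 40 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 30 | 4 | 15 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 63 | 39 | 24 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 65 | 40 | 49 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |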


In [7]:
# translate the Education categories into dummy variables
df['Education'] = df['Education'].astype('category')
df = pd.get_dummies(df, prefix_sep='_', drop_first=False)
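
To make the effect of the dummy-variable step concrete, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical). It shows that only the column cast to `category` is expanded, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one coded categorical column.
toy = pd.DataFrame({"Income": [49, 34, 100], "Education": [1, 2, 3]})

# Casting to 'category' is what tells get_dummies to expand this column;
# without the cast, the integer codes would be left as a single numeric column.
toy["Education"] = toy["Education"].astype("category")
encoded = pd.get_dummies(toy, prefix_sep="_", drop_first=False)

print(list(encoded.columns))
```

Each row ends up with exactly one of the `Education_*` indicator columns switched on.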

## Step 3: Split data into training and validation sets

In [8]:
# construct datasets for analysis
target = 'Personal Loan'
predictors = list(df.columns)
predictors.remove(target)
X = df[predictors]
y = df[target]


In [9]:
# create the training set and the validation set (30% of the rows are held out)
# note: the validation labels are stored in y_test
train_X, valid_X, train_y, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(train_X)
print(valid_X)

      Age  Experience  Income  Family  CCAvg  Mortgage  Securities Account  \
1334   47          22      35       2    1.3         0                   0   
4768   38          14      39       1    2.0         0                   0   
65     59          35     131       1    3.8         0                   0   
177    29           3      65       4    1.8       244                   0   
4489   39          13      21       3    0.2         0                   0   
...   ...         ...     ...     ...    ...       ...                 ...   
2895   60          36      39       4    1.3       140                   0   
2763   55          31      13       4    0.7         0                   0   
905    46          22      28       1    1.0        84                   0   
3980   46          22      89       4    1.4         0                   0   
235    38           8      71       4    1.8         0                   0   

      CD Account  Online  CreditCard  Education_1  Education_2 
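
With imbalanced classes, one optional variation is a stratified split, which keeps the share of positive cases the same in both partitions. The sketch below is an assumption-laden illustration on made-up labels (`y_toy` with 10% positives); the split used in this notebook does not pass `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 positives out of 1,000 rows.
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the class proportions to be preserved in each partition.
_, _, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(f"positive share in training:   {y_tr.mean():.3f}")
print(f"positive share in validation: {y_va.mean():.3f}")
```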

## Step 4: Prediction with Decision Tree (using default parameters)



You can find details about scikit-learn's DecisionTreeClassifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

### 4.1 Create a decision tree using all of the default parameters

In [10]:
dtree = DecisionTreeClassifier(random_state=random_seed)

### 4.2 Fit the model to the training data

In [11]:
_ = dtree.fit(train_X, train_y)

### 4.3 Review the performance of the model on the validation/test data

In [12]:
dtree.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1,
 'splitter': 'best'}

In [13]:
print(dtree.get_depth())
print(dtree.get_n_leaves())

10
47


In [14]:
y_pred = dtree.predict(valid_X)
print("************************************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred)}")
print("************************************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")
print("************************************")

************************************
Recall Score:     0.87248322147651
************************************
Accuracy Score:   0.9766666666666667
Precision Score:  0.8904109589041096
F1 Score:         0.8813559322033899
************************************


> NOTE: One of our notebooks from last week already fit a default decision tree, but I've fit it again in this notebook to remind us of how a default tree performs. In the following sections we will use hyperparameter tuning to potentially find a better decision tree than the default.

## Step 5: Prediction with Decision Tree (using hyperparameter tuning)

This section demonstrates how to refine the performance of a decision tree using hyperparameter tuning techniques. 

This section doesn't duplicate the data loading, cleaning, and splitting of the first section. It shows how to create and test a decision tree classifier using the RandomizedSearchCV and GridSearchCV techniques.

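Both RandomizedSearchCV and GridSearchCV test different combinations of model parameters using cross-validation. RandomizedSearchCV samples a fixed number of combinations (`n_iter`) from the ranges provided, while GridSearchCV tries every combination. These help to determine the parameters that produce the best performing model.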
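
The difference between the two searches is easiest to see on a tiny example. The sketch below uses synthetic data from `make_classification` and a deliberately small, made-up grid (both are assumptions for illustration) and simply counts how many candidates each search evaluates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the Universal Bank data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=1
)

# A small illustrative grid: 4 x 3 = 12 possible combinations.
grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}

# Random search evaluates only n_iter sampled combinations.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=grid,
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=1,
)
random_search.fit(X_demo, y_demo)

# Grid search evaluates every combination.
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid=grid, scoring="recall", cv=3
)
grid_search.fit(X_demo, y_demo)

print("random search candidates:", len(random_search.cv_results_["params"]))
print("grid search candidates:  ", len(grid_search.cv_results_["params"]))
```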

### 5.1: Determine the parameters that can be "tuned"

You can review the parameters of the model that you're trying to "tune". In this case, we're using a DecisionTreeClassifier. Begin by reviewing the parameters for this model, found [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

After reviewing these parameters (while also understanding something about DecisionTrees), we can identify the following parameters that could affect model fit. 

* criterion
* max_depth
* min_samples_split
* min_samples_leaf
* max_leaf_nodes
* min_impurity_decrease
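
To get a feel for how these parameters constrain a tree, here is a minimal sketch on synthetic data. The `make_classification` dataset and the specific values `max_depth=3` and `min_samples_leaf=10` are illustrative assumptions, not values used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Universal Bank data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

# Default settings: the tree grows until its leaves are pure.
unconstrained = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Two of the tunable parameters, set to restrictive illustrative values.
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=1
).fit(X_demo, y_demo)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name:14} depth={tree.get_depth():3} leaves={tree.get_n_leaves():4}")
```

A depth limit of 3 caps the tree at 8 leaves, so the constrained tree is much simpler; whether it also predicts better is exactly what the search in the following steps is meant to find out.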



### 5.2: Create an initial 'wide' range of possible hyperparameter values

Here we create a wide range of possible parameter values for each of the hyperparameters we've listed above. 


In [15]:
# Criterion used to guide data splits
criterion = ['gini', 'entropy', 'log_loss']

# Maximum number of levels in tree. If None, then nodes are expanded until all leaves are pure or until all 
# leaves contain less than min_samples_split samples.
# default = None
max_depth = [int(x) for x in np.linspace(1, 40000, 50)]
max_depth.append(None)

# Minimum number of samples required to split a node
# default is 2
min_samples_split = [int(x) for x in np.linspace(2, 5000, 50)]

# Minimum number of samples required at each leaf node
# default = 1 
min_samples_leaf = [int(x) for x in np.linspace(1, 10000, 50)]

# max_leaf_nodes  - Grow trees with max_leaf_nodes in best-first fashion.
# If None then unlimited number of leaf nodes.
# default=None 
max_leaf_nodes = [int(x) for x in np.linspace(2, len(y_test), 50)]
max_leaf_nodes.append(None)

# min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
# default=0.0
min_impurity_decrease = [x for x in np.arange(0.0, 0.01, 0.0001).round(5)]

# Create the random grid
param_grid_random = { 'criterion': criterion,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf' : min_samples_leaf,
                      'max_leaf_nodes' : max_leaf_nodes,
                      'min_impurity_decrease' : min_impurity_decrease,
                     }
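
A wide grid like this one is far too large to search exhaustively, which is why we start with a random search. The sketch below rebuilds the same value lists and counts the combinations; `n_valid = 1500` is an assumed stand-in for `len(y_test)` (30% of 5,000 rows), and `n_iter` mirrors the value used in the next step.

```python
from math import prod

import numpy as np

n_valid = 1500  # assumption: stand-in for len(y_test) in this notebook

# Same construction as the wide grid above, rebuilt here so the block is self-contained.
grid_demo = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [int(x) for x in np.linspace(1, 40000, 50)] + [None],
    "min_samples_split": [int(x) for x in np.linspace(2, 5000, 50)],
    "min_samples_leaf": [int(x) for x in np.linspace(1, 10000, 50)],
    "max_leaf_nodes": [int(x) for x in np.linspace(2, n_valid, 50)] + [None],
    "min_impurity_decrease": list(np.arange(0.0, 0.01, 0.0001).round(5)),
}

# The number of grid points is the product of the list lengths.
n_combinations = prod(len(values) for values in grid_demo.values())
n_iter = 250_000

print(f"{n_combinations:,} possible combinations")
print(f"random search with n_iter={n_iter:,} samples {n_iter / n_combinations:.4%} of them")
```

Even a quarter of a million random draws covers only a tiny fraction of the grid, so the random search is best read as a way to find a promising region rather than a guaranteed optimum.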



### 5.3: Use Randomized Search to narrow the possible range of parameter values


In [16]:
# n_iter sets how many parameter combinations are sampled; 250_000 is a long run, so lower it for quick tests
best_random_search_model = RandomizedSearchCV(
        estimator=DecisionTreeClassifier(random_state=random_seed), 
        scoring='recall', 
        param_distributions=param_grid_random, 
        n_iter = 250_000, 
        cv=10, 
        verbose=0, 
        n_jobs = -1
    )
_ = best_random_search_model.fit(train_X, train_y)

In [17]:
random_search_best_params = best_random_search_model.best_params_
print('Best parameters found: ', random_search_best_params)

Best parameters found:  {'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0015, 'max_leaf_nodes': 460, 'max_depth': 37551, 'criterion': 'gini'}
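
Beyond the single best combination, a fitted search also keeps the cross-validated score of every candidate in `cv_results_`. The sketch below is one way to look at the top few; it uses synthetic data and a small made-up parameter set so that it runs quickly, and the name `search` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the Universal Bank data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=1
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={
        "max_depth": [2, 4, 6, 8, None],
        "min_samples_leaf": [1, 5, 10, 20],
    },
    n_iter=8,
    scoring="recall",
    cv=3,
    random_state=1,
).fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts cleanly to a DataFrame.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

If several quite different combinations score almost the same, that suggests the score is not very sensitive to those parameters, which is useful to know before narrowing the ranges.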


### 5.4: Test the performance of the selected parameters

In [18]:
y_pred = best_random_search_model.predict(valid_X)
print("************************************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred)}")
print("************************************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")
print("************************************")

************************************
Recall Score:     0.8791946308724832
************************************
Accuracy Score:   0.9813333333333333
Precision Score:  0.9290780141843972
F1 Score:         0.9034482758620689
************************************


### 5.5: Use knowledge gained from random search to create a new 'narrow' range of possible hyperparameter values and use GridSearchCV to fine-tune the model

The best parameters found using RandomizedSearchCV were:

In [19]:
random_search_best_params

{'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_impurity_decrease': 0.0015,
 'max_leaf_nodes': 460,
 'max_depth': 37551,
 'criterion': 'gini'}

Let's now use these current best parameters as a starting point for a more refined grid search. We'll use the same parameters as before, but we'll use a much smaller range of values for each parameter.

NOTE: Depending on the speed of your computer, the following code could take 5-20 minutes to run.

In [20]:
plus_minus = 10  # half-width of the search window around each best value; 10-15 suits a final run, smaller values run faster
increment = 2    # step between candidate values for the integer-valued parameters

# shorthand for the random search results
# (assumes the random search selected integers, not None, for max_depth and max_leaf_nodes)
best = random_search_best_params

param_grid = { 'min_samples_split': [x for x in range(best['min_samples_split']-plus_minus, best['min_samples_split']+plus_minus, increment) if x >= 2],
              'min_samples_leaf': [x for x in range(best['min_samples_leaf']-plus_minus, best['min_samples_leaf']+plus_minus, increment) if x > 0],
              'min_impurity_decrease': [x for x in np.arange(best['min_impurity_decrease']-0.001, best['min_impurity_decrease']+0.001, 0.0001).round(5) if x >= 0.0],
              'max_leaf_nodes': [x for x in range(best['max_leaf_nodes']-plus_minus, best['max_leaf_nodes']+plus_minus, increment) if x > 1],
              'max_depth': [x for x in range(best['max_depth']-plus_minus, best['max_depth']+plus_minus, increment) if x > 1],
              'criterion': [best['criterion']]
              }

best_grid_search_model = GridSearchCV(estimator=DecisionTreeClassifier(random_state=random_seed), 
                                    scoring='recall', param_grid=param_grid, cv=10, verbose=0,  n_jobs = -1)
_ = best_grid_search_model.fit(train_X, train_y)
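
The integer ranges in the grid above all follow the same pattern: a window around the best value, stepped by 2, with invalid values filtered out. A small helper can make that pattern explicit; `window` is a hypothetical name introduced here for illustration, not part of the notebook's code.

```python
def window(center, half_width, step, minimum):
    """Integer candidates from center - half_width up to (but excluding)
    center + half_width, keeping only values that are at least `minimum`."""
    return [
        x
        for x in range(center - half_width, center + half_width, step)
        if x >= minimum
    ]


# min_samples_split was 2 in the random search, and 2 is its smallest valid value.
print(window(2, 10, 2, minimum=2))
# max_leaf_nodes was 460 in the random search.
print(window(460, 10, 2, minimum=2))
```

Note that `range` excludes its upper bound, so the window is not symmetric: it reaches 10 below the center but only 8 above it.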

In [21]:
print('Best parameters found: ', best_grid_search_model.best_params_)

Best parameters found:  {'criterion': 'gini', 'max_depth': 37541, 'max_leaf_nodes': 450, 'min_impurity_decrease': 0.0013000000000000004, 'min_samples_leaf': 1, 'min_samples_split': 2}


### 5.6: Test the performance of the model using identified parameters

In [22]:
y_pred = best_grid_search_model.predict(valid_X)
print("************************************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred)}")
print("************************************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")
print("************************************")

************************************
Recall Score:     0.8791946308724832
************************************
Accuracy Score:   0.9813333333333333
Precision Score:  0.9290780141843972
F1 Score:         0.9034482758620689
************************************


### 5.7: If necessary, repeat steps 5.2 through 5.6 with more granular and targeted parameter search

Continuing to tune your model is optional, but if there is evidence that there is room for continued improvement in the model, use the information from Step 5.6 to create a more refined set of parameter ranges and rerun GridSearchCV.

In this case, we increased recall on the validation data from 0.87248322147651 to 0.8791946308724832 (F1 rose from 0.881 to 0.903). The importance of this increase is context-dependent. In some contexts, every little increase is worth the effort; in other contexts, such increases would have minimal benefit. In this case, we'll assume that we have been sufficiently thorough and any increase from further search would be small.
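
It can help to translate a change in recall into a count of cases. Recall is TP / (TP + FN), so both reported values must be whole-number fractions of the same number of actual positives. The sketch below searches for the smallest count consistent with both values; treating that smallest count as the real number of positives in the validation data is an assumption, and `smallest_consistent_count` is a hypothetical helper.

```python
recall_default = 0.87248322147651   # default tree, from Step 4.3
recall_tuned = 0.8791946308724832   # tuned tree, from Step 5.6


def smallest_consistent_count(recalls, max_n=1500, tol=1e-6):
    """Smallest n for which every recall * n is (almost) a whole number."""
    for n in range(1, max_n + 1):
        if all(abs(r * n - round(r * n)) < tol for r in recalls):
            return n
    return None


n_pos = smallest_consistent_count([recall_default, recall_tuned])
tp_default = round(recall_default * n_pos)
tp_tuned = round(recall_tuned * n_pos)

print(f"actual positives: {n_pos}")
print(f"true positives:   {tp_default} (default) vs {tp_tuned} (tuned)")
```

Under that assumption, the recall gain corresponds to a single additional positive case being identified, which is a useful reality check when deciding whether further tuning is worth the compute.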

NOTE: The decision to continue further depends on the expected incremental increase in performance, the importance of any small incremental increase (to the business), and the resources you have (time and computing power). Recognize that you generally face diminishing returns: it takes an increasingly large amount of computation (training) to gain an increasingly small improvement in performance.

## Step 6: Summarize results    

As usual, in this section you provide a recap of your approach, your results, and a discussion of your findings.
