# Homework 4: Hyperparameter tuning
### Customer churn prediction 

In this assignment, you will apply hyperparameter tuning to machine learning models. You will work with decision trees and random forests, using scikit-learn's functionality to find the best combination of hyperparameters for 1) a decision tree and 2) a random forest. Once you have found the best hyperparameters for both, choose the better of your two tuned models.

The datasets, named ```churn_train.csv``` and ```churn_test.csv```, contain information about telecom customers and their churn behavior (switching to another telecommunications provider).  
The training set has 3500 samples with 19 input features and a boolean target variable called "churn". The goal in this assignment is to predict with high accuracy which customers are likely to churn. For your convenience, the code for loading and encoding the training data is already provided.

Hyperparameters to be tuned for RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- n_estimators = number of trees in the forest
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)

Hyperparameters to be tuned for DecisionTreeClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Look at the Jupyter notebook from class for reference.

Evaluate the model you select on ```churn_test.csv```.

In [2]:
import pandas as pd
df = pd.read_csv("churn_train.csv")
df[0:20]

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,OH,146,area_code_408,no,yes,31,202.5,91,34.43,241.4,108,20.52,169.6,77,7.63,7.8,2,2.11,1,no
1,NC,126,area_code_415,no,no,0,103.7,93,17.63,127.0,107,10.8,329.3,66,14.82,14.4,1,3.89,0,no
2,AL,61,area_code_415,no,yes,20,254.4,133,43.25,161.7,96,13.74,251.4,91,11.31,10.5,4,2.84,0,no
3,MN,116,area_code_408,no,no,0,197.9,84,33.64,168.1,113,14.29,239.8,145,10.79,12.0,6,3.24,1,no
4,MO,103,area_code_408,no,yes,24,111.8,85,19.01,239.6,102,20.37,268.3,81,12.07,6.9,4,1.86,1,no
5,MO,147,area_code_415,yes,no,0,157.0,79,26.69,103.1,94,8.76,211.8,96,9.53,7.1,6,1.92,0,no
6,WY,37,area_code_510,no,yes,33,225.7,133,38.37,223.9,115,19.03,192.7,131,8.67,11.2,5,3.02,3,no
7,RI,93,area_code_415,no,no,0,98.4,78,16.73,249.6,129,21.22,248.2,114,11.17,14.2,4,3.83,1,no
8,OH,70,area_code_510,no,yes,23,205.5,95,34.94,224.3,80,19.07,196.1,76,8.82,9.7,4,2.62,1,no
9,CT,23,area_code_510,no,no,0,321.6,107,54.67,251.6,115,21.39,141.1,158,6.35,11.3,3,3.05,2,yes


In [3]:
df.shape


(3500, 20)

In [4]:
# Set X and y
X = df.drop(['churn'], axis=1)
y = df['churn']

In [5]:
#Encode categorical variables as dummy variables
# Select non-numeric columns
non_numeric_cols = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=non_numeric_cols)
X.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,...,state_WI,state_WV,state_WY,area_code_area_code_408,area_code_area_code_415,area_code_area_code_510,international_plan_no,international_plan_yes,voice_mail_plan_no,voice_mail_plan_yes
0,146,31,202.5,91,34.43,241.4,108,20.52,169.6,77,...,False,False,False,True,False,False,True,False,False,True
1,126,0,103.7,93,17.63,127.0,107,10.8,329.3,66,...,False,False,False,False,True,False,True,False,True,False
2,61,20,254.4,133,43.25,161.7,96,13.74,251.4,91,...,False,False,False,False,True,False,True,False,False,True
3,116,0,197.9,84,33.64,168.1,113,14.29,239.8,145,...,False,False,False,True,False,False,True,False,True,False
4,103,24,111.8,85,19.01,239.6,102,20.37,268.3,81,...,False,False,False,True,False,False,True,False,False,True


In [6]:
# Prepare the test set
tf = pd.read_csv("churn_test.csv")
X_test = tf.drop(['churn'], axis=1)
y_test = tf['churn']
non_numeric_cols = X_test.select_dtypes(include=['object']).columns
X_test = pd.get_dummies(X_test, columns=non_numeric_cols)
#X_test.head()
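
Encoding the train and test sets separately only yields matching columns when every category (for example, every `state`) appears in both files. The following is a minimal sketch of aligning the test columns to the training columns with `DataFrame.reindex`. The toy frames and the names `X_train_enc`, `X_test_enc`, and `X_test_aligned` are made up for illustration and stand in for `X` and `X_test`.

```python
import pandas as pd

# Toy stand-ins for the raw train and test frames (hypothetical data).
train = pd.DataFrame({"account_length": [146, 126, 61], "state": ["OH", "NC", "AL"]})
test = pd.DataFrame({"account_length": [70, 23], "state": ["OH", "CT"]})

X_train_enc = pd.get_dummies(train, columns=["state"])
X_test_enc = pd.get_dummies(test, columns=["state"])

# Reorder the test columns to match training, add dummy columns the test set
# is missing (filled with 0), and drop categories never seen in training.
X_test_aligned = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

print(list(X_test_aligned.columns) == list(X_train_enc.columns))
```

If the two files already share all categories, the reindex leaves the values unchanged and only fixes the column order.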

In [7]:
# Tune hyperparameters, find the best set of hyperparameters

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rstate = 20001104

# 1) Decision trees
#  DecisionTreeClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
tree_params = {
    'criterion': ['gini', 'entropy'], 
    'max_depth': [2,3,4,5,6,7,8,9,10, 25 , 100, None],
    #'min_samples_leaf': range(5, 50, 5)
    'min_samples_split': [2, 5, 10, 25, 30, 40, 50]
    }
dtree = DecisionTreeClassifier(random_state=rstate)

# Perform Grid Search Cross-Validation
grid_search_tree = GridSearchCV(estimator=dtree, param_grid=tree_params, cv=5)  # I checked, cv=5 is the default value
grid_search_tree.fit(X, y)
print("Best Hyperparameters: ", grid_search_tree.best_params_)

Best Hyperparameters:  {'criterion': 'entropy', 'max_depth': 8, 'min_samples_split': 2}
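
Besides `best_params_`, a fitted search also stores the cross-validated score of every combination, which helps judge how much the winner actually stands out. The sketch below shows `best_score_` and `cv_results_` on a small grid. It uses synthetic data from `make_classification` as a stand-in for the churn data, and the names `X_demo`, `y_demo`, `demo_grid`, and `search` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption: any binary task works here).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the winning combination.
print("Best CV accuracy:", search.best_score_)

# One row per combination, sorted so the best-ranked ones come first.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

When several combinations score within one standard deviation of each other, the simpler model (for example, a shallower tree) is often a reasonable choice.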


In [8]:
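# Scratch cell: grid_search_tree2 is constructed but never fitted,
# so the best parameters printed below still come from grid_search_tree.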
grid_search_tree2 = GridSearchCV(estimator=dtree, param_grid=tree_params, cv=5)  # I checked, cv=5 is the default value
print("Best Hyperparameters: ", grid_search_tree.best_params_)
print(grid_search_tree2)



Best Hyperparameters:  {'criterion': 'entropy', 'max_depth': 8, 'min_samples_split': 2}
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=20001104),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 100,
                                       None],
                         'min_samples_split': [2, 5, 10, 25, 30, 40, 50]})


In [9]:
print(grid_search_tree.get_params())
print(dtree.get_params())

grid_search_tree.get_params()

{'cv': 5, 'error_score': nan, 'estimator__ccp_alpha': 0.0, 'estimator__class_weight': None, 'estimator__criterion': 'gini', 'estimator__max_depth': None, 'estimator__max_features': None, 'estimator__max_leaf_nodes': None, 'estimator__min_impurity_decrease': 0.0, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2, 'estimator__min_weight_fraction_leaf': 0.0, 'estimator__random_state': 20001104, 'estimator__splitter': 'best', 'estimator': DecisionTreeClassifier(random_state=20001104), 'n_jobs': None, 'param_grid': {'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 100, None], 'min_samples_split': [2, 5, 10, 25, 30, 40, 50]}, 'pre_dispatch': '2*n_jobs', 'refit': True, 'return_train_score': False, 'scoring': None, 'verbose': 0}
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_frac

{'cv': 5,
 'error_score': nan,
 'estimator__ccp_alpha': 0.0,
 'estimator__class_weight': None,
 'estimator__criterion': 'gini',
 'estimator__max_depth': None,
 'estimator__max_features': None,
 'estimator__max_leaf_nodes': None,
 'estimator__min_impurity_decrease': 0.0,
 'estimator__min_samples_leaf': 1,
 'estimator__min_samples_split': 2,
 'estimator__min_weight_fraction_leaf': 0.0,
 'estimator__random_state': 20001104,
 'estimator__splitter': 'best',
 'estimator': DecisionTreeClassifier(random_state=20001104),
 'n_jobs': None,
 'param_grid': {'criterion': ['gini', 'entropy'],
  'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 100, None],
  'min_samples_split': [2, 5, 10, 25, 30, 40, 50]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}

In [11]:
# 2) Random forest
# Hyperparameters to be tuned for RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# n_estimators = number of trees in the forest
# max_depth = max number of levels in each decision tree
# min_samples_split = min number of data points placed in a node before the node is split
# min_samples_leaf = min number of data points allowed in a leaf node. bootstrap = method for sampling data points (with or without replacement)

forest_params = {
    #'criterion': ['gini', 'entropy'], 
    'n_estimators': [5, 10, 25, 35],  # Also tried 50 and 100: the runtime quadrupled and 25 was still the best option.
    'max_depth': [2, 5, 10, 20, 25, 30, 50, None],  # Tested many variations; 20-30 seems to be the best in every run.
    'min_samples_leaf': [2, 3, 4, 7, 25, 70],  # In every previous run (see below) the smallest value was the best, but runtime grows very fast, so only a few small values are checked.
    'min_samples_split': [5, 10, 25, 30, 50],  # Many large values can triple the runtime; not worth it since the best values are usually between 5 and 30.
    'bootstrap': [True, False]
    }
unparadforest = RandomForestClassifier(random_state=rstate)

grid_search_forest = GridSearchCV(estimator=unparadforest, param_grid=forest_params, cv=5)
grid_search_forest.fit(X, y)
print("Best Hyperparameters: ", grid_search_forest.best_params_)

KeyboardInterrupt: 
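
The long runtime follows from the size of the grid: 4 x 8 x 6 x 5 x 2 = 1920 combinations, each fitted 5 times under `cv=5`, gives 9600 forest fits. The sketch below counts the combinations with `ParameterGrid` and then shows `RandomizedSearchCV` as one possible way to cap the cost by sampling a fixed number of combinations. The synthetic data and the values `n_iter=10` and `cv=3` are assumptions chosen to keep the example fast, not recommendations for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Same grid as in the cell above.
forest_params = {
    "n_estimators": [5, 10, 25, 35],
    "max_depth": [2, 5, 10, 20, 25, 30, 50, None],
    "min_samples_leaf": [2, 3, 4, 7, 25, 70],
    "min_samples_split": [5, 10, 25, 30, 50],
    "bootstrap": [True, False],
}
n_combinations = len(ParameterGrid(forest_params))
print(n_combinations, "combinations,", n_combinations * 5, "fits with cv=5")

# Synthetic stand-in for the churn data (assumption, for a runnable sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Sample 10 combinations from the grid instead of trying all of them.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=forest_params,
    n_iter=10,
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)
print("Best Hyperparameters: ", random_search.best_params_)
```

Both `GridSearchCV` and `RandomizedSearchCV` also accept `n_jobs=-1`, which runs the fits on all available CPU cores.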

### Previous (other) runs

In [None]:
#Around 9.5 mins of runtime on my laptop
forest_params = {
    #'criterion': ['gini', 'entropy'], 
    'n_estimators': [5, 10, 25, 35,], # I tried ..., 50, 100 but the runtime quadrupled, and still 25 was the best option.
    'max_depth': [2, 5, 10, 20, 25, 30, 50, None], #Tested multiple, 25-30 seems to be the best
    'min_samples_leaf': [4, 7, 10, 25, 70], # Small numbers are the best parameters, but runtime is too long
    'min_samples_split': [5, 10, 25, 30, 40, 50, 80],   #Big & many numbers: can triple runtime
    'bootstrap': [True, False]
    }
unparadforest = RandomForestClassifier(random_state=rstate)

grid_search_forest = GridSearchCV(estimator=unparadforest, param_grid=forest_params, cv=5)
grid_search_forest.fit(X, y)
print("Best Hyperparameters: ", grid_search_forest.best_params_)

Best Hyperparameters:  {'bootstrap': False, 'max_depth': 30, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 25}


In [12]:
#Dummy
forest_params = {
    #'criterion': ['gini', 'entropy'],
    'n_estimators': [5, 10, 25, ], #Way too long
    'max_depth': [25, 30,], #Tested multiple, 25-30 seems to be the best
    'min_samples_leaf': [3, 30,], # I tried smaller numbers, but the runtime was too long #25, 50,
    'min_samples_split': [5, 10, 25,],   #Big & many numbers: can triple runtime
    'bootstrap': [True, False]
    }
unparadforest = RandomForestClassifier(random_state=rstate)

grid_search_forest = GridSearchCV(estimator=unparadforest, param_grid=forest_params, cv=5)
grid_search_forest.fit(X, y)
print("Best Hyperparameters: ", grid_search_forest.best_params_)

Best Hyperparameters:  {'bootstrap': False, 'max_depth': 25, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 25}


In [None]:
#Took 12 mins of runtime on my laptop
forest_params = {
    #'criterion': ['gini', 'entropy'], 
    'n_estimators': [5, 10, 25, 50, 100,], #Way too long
    'max_depth': [2, 5, 10, 20, 25, 30, 50, None], #Tested multiple, 25-30 seems to be the best
    'min_samples_leaf': [10, 20,  80], # I tried smaller numbers, but the runtime was too long #25, 50,
    'min_samples_split': [5, 10, 25, 30, 40, 50, 80],   #Big & many numbers: can triple runtime
    'bootstrap': [True, False]
    }
unparadforest = RandomForestClassifier(random_state=rstate)

grid_search_forest = GridSearchCV(estimator=unparadforest, param_grid=forest_params, cv=5)
grid_search_forest
grid_search_forest.fit(X, y)
print("Best Hyperparameters: ", grid_search_forest.best_params_)

Best Hyperparameters:  {'bootstrap': False, 'max_depth': 25, 'min_samples_leaf': 10, 'min_samples_split': 30, 'n_estimators': 25}


In [13]:
# Train the model with the best combination of hyperparameters
dtreebest = DecisionTreeClassifier(criterion='entropy', max_depth=8, min_samples_split=2 , random_state=rstate)
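# Every forest value below appears in the In [11] grid; presumably they come from a completed run of that search.
rforestbest = RandomForestClassifier(bootstrap=False, max_depth=20, min_samples_leaf=2, min_samples_split=5, n_estimators=35, random_state=rstate)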
dtreebest.fit(X, y)
rforestbest.fit(X, y); # Semi-colon to suppress output
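
Retyping the winning hyperparameters by hand is easy to get wrong. Because the searches above use `refit=True` (the default, visible in the `get_params()` output), a fitted search already holds a model trained on the full training data in `best_estimator_`. The sketch below shows this on synthetic stand-in data; `X_demo`, `y_demo`, `search`, `best_tree`, and `manual_tree` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption, for a runnable sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]},
    cv=5,
)
search.fit(X_demo, y_demo)

# With refit=True, this model is already trained on all of X_demo.
best_tree = search.best_estimator_
print(best_tree.get_params()["max_depth"] == search.best_params_["max_depth"])

# Equivalent manual route: unpack the winning parameters into a new model.
manual_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual_tree.fit(X_demo, y_demo)
```

In this notebook the same idea would read `grid_search_tree.best_estimator_`, provided the search has been fitted in the current session.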

In [14]:
# Evaluate the best model on the test set
y_pred_train = dtreebest.predict(X)                   # Decision tree predictions for the train set
print("Decision tree: Accuracy on train data: ", metrics.accuracy_score(y, y_pred_train))

y_pred_test = dtreebest.predict(X_test)               # Decision tree predictions for the test set
print("Decision tree: Accuracy on test data: ", metrics.accuracy_score(y_test, y_pred_test))

y_pred_train_forest = rforestbest.predict(X)          # Random forest predictions for the train set
print("Random forest: Accuracy on train data: ", metrics.accuracy_score(y, y_pred_train_forest))

y_pred_test_forest = rforestbest.predict(X_test)      # Random forest predictions for the test set
print("Random forest: Accuracy on test data: ", metrics.accuracy_score(y_test, y_pred_test_forest))



Decision tree: Accuracy on train data:  0.9771428571428571
Decision tree: Accuracy on test data:  0.936
Random forest: Accuracy on train data:  0.9891428571428571
Random forest: Accuracy on test data:  0.9306666666666666
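
Accuracy alone can look high when one class dominates, which may be the case here if churners are a minority. A confusion matrix and per-class precision and recall show how well the positive class is actually caught. The sketch below uses an imbalanced synthetic dataset as a stand-in; the 85/15 class split and the names `X_demo`, `y_demo`, `model`, and `cm` are assumptions for illustration, not properties of the churn files.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic stand-in: roughly 85% of samples in the negative class.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

print("Accuracy:", accuracy_score(y_te, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_te, y_pred))
```

Applied to this notebook, the same calls would take `y_test` together with `y_pred_test` or `y_pred_test_forest`.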


In [15]:
type(y_pred_test_forest)

numpy.ndarray