<a href="https://colab.research.google.com/github/ojasnadkar96/cs273p_project/blob/master/decisionTree1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree Classification

In [0]:
#Importing all the necessary libraries
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [0]:
#Function for importing preprocessed data
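def import_pkl(df,name):
    # df is unused; the parameter is kept so the existing calls below still work
    fullname = name+'.pkl'
    # Use a context manager so the file handle is closed after loading
    with open(fullname, 'rb') as f:
        df = pickle.load(f)
    return df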

In [0]:
#Declaring training, validation and testing pandas dataframes
df_train = pd.DataFrame()
df_valid = pd.DataFrame()
df_test = pd.DataFrame()
df_train_l = pd.DataFrame()
df_valid_l = pd.DataFrame()
df_test_l = pd.DataFrame()

In [0]:
#Loading the preprocessed data into pandas dataframes
df_train = import_pkl(df_train,'train_x')
df_valid = import_pkl(df_valid,'valid_x')
df_test = import_pkl(df_test,'test_x')
df_train_l = import_pkl(df_train_l,'train_x_l')
df_valid_l = import_pkl(df_valid_l,'valid_x_l')
df_test_l = import_pkl(df_test_l,'test_x_l')

In [0]:
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)
print(df_train_l.shape)
print(df_valid_l.shape)
print(df_test_l.shape)

(77854, 168)
(13737, 168)
(10175, 168)
(77854, 1)
(13737, 1)
(10175, 1)


In [0]:
#import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [0]:
# Decision Tree Model Training
from sklearn.tree import DecisionTreeClassifier
DecisionTree_model = DecisionTreeClassifier(random_state=0)
DecisionTree_model.fit(df_train,np.ravel(df_train_l))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')

Here, we create a Decision Tree classifier with default hyperparameters (fixing random_state=0 for reproducibility) and fit it on the training data.

In [0]:
# Decision Tree Model Training Score
Train_score = DecisionTree_model.score(df_train, np.ravel(df_train_l))
print('Train score:', Train_score*100)

Train score: 99.99614663344208


In [0]:
# Decision Tree Model Validation Score
score = DecisionTree_model.score(df_valid, np.ravel(df_valid_l))
print(score*100)

47.4848948096382


In [0]:
from pprint import pprint
pprint(DecisionTree_model.get_params())

{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': 0,
 'splitter': 'best'}


The Decision Tree has the hyperparameters listed above, all at their default values except random_state. To tune them, we use RandomizedSearchCV as below:

In [0]:
from sklearn.model_selection import RandomizedSearchCV

max_features = ['auto', 'sqrt']
max_depth = [10,20,30,40,50]
max_depth.append(None)
min_samples_split = [2, 6, 10]
min_samples_leaf = [1, 2, 4]

random_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
pprint(random_grid)

{'max_depth': [10, 20, 30, 40, 50, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 6, 10]}


We evaluate the classifier on combinations sampled at random from the values in 'random_grid' to find the combination that gives the highest validation score.
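To make the sampling step concrete, the following minimal sketch counts the combinations in the grid above and draws 100 of them with `ParameterSampler`, which is the same kind of sampling RandomizedSearchCV performs when every entry is a list. The `random_state=0` seed is an assumption added for repeatability; the notebook's search does not set one.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as random_grid in the notebook
random_grid = {'max_features': ['auto', 'sqrt'],
               'max_depth': [10, 20, 30, 40, 50, None],
               'min_samples_split': [2, 6, 10],
               'min_samples_leaf': [1, 2, 4]}

# Total number of distinct combinations: 2 * 6 * 3 * 3
grid_size = len(ParameterGrid(random_grid))

# Draw 100 combinations; with list-valued entries they are drawn without replacement
# (random_state=0 is an assumption for repeatability)
sampled = list(ParameterSampler(random_grid, n_iter=100, random_state=0))

print('combinations in grid:', grid_size)
print('combinations sampled:', len(sampled))
print('first sample:', sampled[0])
```

Because 100 of the 108 combinations are drawn, the randomized search here covers almost the whole grid.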

In [0]:
import numpy as np
from sklearn.model_selection import PredefinedSplit
Dec_tree = DecisionTreeClassifier()
train_len = len(df_train)
valid_len = len(df_valid)
df_tv = pd.concat([df_train, df_valid], ignore_index = True)
df_tv_l = pd.concat([df_train_l, df_valid_l], ignore_index = True)
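# PredefinedSplit fold labels: -1 keeps a row in the training set for every split,
# 0 puts it in the single validation fold
bound = np.array([(i < train_len) * -1 for i in range(train_len + valid_len)])
split = PredefinedSplit(bound)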
Dec_tree = RandomizedSearchCV(estimator = Dec_tree, param_distributions = random_grid, n_iter = 100, n_jobs = 2, verbose = 1, cv = split)

Dec_tree.fit(df_tv,np.ravel(df_tv_l))

Dec_tree.best_params_

Fitting 1 folds for each of 100 candidates, totalling 100 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   35.4s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:  1.1min finished


{'min_samples_split': 6,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 10}

Of the sampled hyperparameter combinations, the values listed above give the highest validation score.
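The search reports "Fitting 1 folds" because `PredefinedSplit` yields a single train/validation split. The sketch below reproduces the labelling on a toy example; the sizes 5 and 3 are assumptions chosen for readability, not the notebook's real row counts.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy sizes (assumed): 5 training rows followed by 3 validation rows
train_len, valid_len = 5, 3

# -1: row is never placed in a test fold; 0: row belongs to test fold 0
test_fold = np.concatenate([np.full(train_len, -1), np.zeros(valid_len, dtype=int)])

split = PredefinedSplit(test_fold)
folds = list(split.split())
train_idx, valid_idx = folds[0]

print('number of splits:', split.get_n_splits())
print('train indices:', train_idx.tolist())
print('validation indices:', valid_idx.tolist())
```

Each candidate is therefore fitted on the training rows and scored on the validation rows only. With the default `refit=True`, the search then refits the best candidate on all rows passed to `fit`, which here means training and validation data together.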

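#### We now train a model with max_depth=10 and max_features='sqrt', as found above, and check the score. Note that the cell below sets min_samples_split=2, min_samples_leaf=2 and class_weight='balanced', which differ from the search result: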

In [0]:
Dec_tree_new = DecisionTreeClassifier(class_weight='balanced',
 min_samples_split=2,
 min_samples_leaf=2,
 max_features='sqrt',
 max_depth=10)
Dec_tree_new.fit(df_train,np.ravel(df_train_l))
Train_score = Dec_tree_new.score(df_train,np.ravel(df_train_l))
print(Train_score*100)

33.962288385953194


In [0]:
Valid_score = Dec_tree_new.score(df_valid,np.ravel(df_valid_l))
print(Valid_score*100)

32.68544805998398


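#### As we can observe here, both scores have dropped: the training score fell from about 99.996% to about 33.96%, and the validation score from about 47.48% to about 32.69%. The gap between them has shrunk from roughly 52.5 points to about 1.3 points, so the model no longer overfits, but its validation score is lower than before. We therefore refine the hyperparameters further using GridSearchCV.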
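The relationship between tree depth and the train/validation gap can be sketched on synthetic data. Everything in this example (the `make_classification` dataset, its size, and the depth values) is an assumption for illustration and is unrelated to the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with label noise (assumed, for illustration only)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_classes=3, flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Record (train accuracy, validation accuracy, gap) for each depth limit
gaps = {}
for depth in [2, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    valid_acc = tree.score(X_valid, y_valid)
    gaps[depth] = (train_acc, valid_acc, train_acc - valid_acc)
    print(f'max_depth={str(depth):>4}  train={train_acc:.3f}  '
          f'valid={valid_acc:.3f}  gap={train_acc - valid_acc:.3f}')
```

A small gap alone does not mean a good model: a very shallow tree can have a small gap and still score poorly on both sets, which is why the validation score itself is the quantity to maximise.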

In [0]:
from sklearn.model_selection import GridSearchCV


# Number of features to consider at every split
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = [10,20]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2,3]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 4]

# Create the search grid
search_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

pprint(search_grid)

{'max_depth': [10, 20, None],
 'max_features': ['sqrt'],
 'min_samples_leaf': [2, 4],
 'min_samples_split': [2, 3]}


Here, we narrow the hyperparameter ranges around the values obtained from RandomizedSearchCV.

In [0]:
# Create a base model
Dec_tree_3 = DecisionTreeClassifier()
# Instantiate the grid search model
dec3_grid = GridSearchCV(estimator = Dec_tree_3, param_grid = search_grid, n_jobs = 2, verbose = 1, cv = split)

# Fit the grid search model
dec3_grid.fit(df_tv, np.ravel(df_tv_l))

dec3_grid.best_params_

Fitting 1 folds for each of 12 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:    8.0s finished


{'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 3}

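#### Thus, we obtained the hyperparameter values that give the best validation score within this grid. Now we calculate the final scores of the classifier. Note that the cell below sets min_samples_leaf=4 and min_samples_split=2, whereas the grid search returned min_samples_leaf=2 and min_samples_split=3.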
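One way to avoid retyping the selected values is to unpack `best_params_` into the final estimator. The following is a minimal sketch on synthetic data, not the notebook's own run; the dataset, the 450/150 split, and the small grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed): first 450 rows for training, last 150 for validation
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
train_len = 450
test_fold = np.where(np.arange(len(X)) < train_len, -1, 0)

# Small illustrative grid (assumed values)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 10, None],
                                  'min_samples_leaf': [2, 4]},
                      cv=PredefinedSplit(test_fold))
search.fit(X, y)

# Build the final model directly from the search result instead of retyping values
final_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
final_tree.fit(X[:train_len], y[:train_len])
valid_acc = final_tree.score(X[train_len:], y[train_len:])

print('best params:', search.best_params_)
print('validation accuracy:', valid_acc)
```

Unpacking the dictionary keeps the final model in step with whatever the search selected, so a transcription slip between cells cannot change the result.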

In [0]:
Dec_tree_final = DecisionTreeClassifier(max_depth=10, max_features='sqrt', min_samples_leaf=4, min_samples_split=2)
Dec_tree_final.fit(df_train, np.ravel(df_train_l))
Train_score = Dec_tree_final.score(df_train,np.ravel(df_train_l))
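# Print Train_score; the original cell printed the stale `score` variable
# (the untuned tree's validation score), which is what the recorded output below shows
print('Final Train score: ',Train_score*100)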

Final Train score:  47.4848948096382


In [0]:
Valid_score = Dec_tree_final.score(df_valid,np.ravel(df_valid_l))
print('Final Validation score: ', Valid_score*100)

Final Validation score:  55.80548882579893


In [0]:
# Decision Tree Model Test Score
Test_score = Dec_tree_final.score(df_test, np.ravel(df_test_l))
print('Final Test score: ',Test_score*100)

Final Test score:  56.00000000000001


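As we can see, the validation accuracy has increased after hyperparameter tuning, from about 47.5% for the default tree to about 55.8%, and the tuned model reaches about 56.0% on the test set.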