# Decision Tree Classifier

A decision tree is one of the most widely used supervised machine learning algorithms and can perform both regression and classification tasks. The intuition behind the algorithm is simple, yet the method is very powerful.

The algorithm builds the tree one node at a time: at each node it picks the attribute (and split value) that best separates the classes, so the most informative attribute ends up at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition, or "decision", our sample satisfies. This process continues until a leaf node is reached, which contains the prediction or the outcome of the decision tree.
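To make "best separates the classes" concrete, here is a minimal sketch of how a candidate split can be scored with Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini` and `split_impurity`, the threshold, and the toy values are assumptions made for illustration; they are not part of this notebook or of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy values, loosely modeled on the Fare and Survived columns.
fare = [7.25, 71.28, 7.93, 53.10, 8.05, 30.00]
survived = [0, 1, 1, 1, 0, 0]

before = gini(survived)                                 # impurity with no split
after = split_impurity(fare, survived, threshold=40.0)  # impurity after one split
print("Impurity before split:", before)
print("Impurity after split: ", after)
```

On this toy data the split at 40.0 halves the impurity, from 0.5 to 0.25. A tree learner evaluates many such candidate splits and keeps the one with the lowest weighted impurity.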

In [5]:
import pickle as pkl

with open('../data/titanic_tansformed.pkl', 'rb') as f:
    df_data = pkl.load(f)

In [6]:
df_data.head()

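| Unnamed: 0 | Survived | Age | SibSp | Parch | Fare | 1 | 2 | 3 | female | male | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |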


In [3]:
df_data.shape

(889, 10)

<img src='img/decision_tree.png'>

In [7]:
data = df_data.drop("Survived",axis=1)
label = df_data["Survived"]

In [8]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size=0.2, random_state=101)

In [9]:
import time

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Fit a tree with default hyperparameters and time the training step.
tic = time.time()
dt_cla = DecisionTreeClassifier()
dt_cla.fit(data_train, label_train)
print('Time taken for training Decision Tree', (time.time() - tic), 'secs')

# Evaluate on the held-out test set.
predictions = dt_cla.predict(data_test)
print('Accuracy', dt_cla.score(data_test, label_test))

# Rows of the confusion matrix are true classes, columns are predicted classes.
print(confusion_matrix(label_test, predictions))
print(classification_report(label_test, predictions))

Time taken for training Decision Tree 0.0022280216217041016 secs
Accuracy 0.7921348314606742
[[86 21]
 [16 55]]
             precision    recall  f1-score   support

          0       0.84      0.80      0.82       107
          1       0.72      0.77      0.75        71

avg / total       0.80      0.79      0.79       178
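
The numbers in the report can be recomputed by hand from the confusion matrix printed above. The short sketch below does this for class 1 (survived), assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above.
cm = [[86, 21],
      [16, 55]]

tn, fp = cm[0]  # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = cm[1]  # true class 1: wrongly predicted 0, correctly predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of all predicted survivors, how many survived
recall_1 = tp / (tp + fn)     # of all actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("Accuracy   ", round(accuracy, 4))
print("Precision 1", round(precision_1, 2))
print("Recall 1   ", round(recall_1, 2))
print("F1 1       ", round(f1_1, 2))
```

The rounded values agree with the accuracy and the class 1 row of the classification report.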



### Hyperparameters for Decision Tree
- A decision tree has a number of [hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- The most commonly tuned parameters are
    - __max_depth__ - The maximum depth of the tree. Defaults to `None`, which expands nodes until all leaves are pure or contain fewer than `min_samples_split` samples
    - __min_samples_split__ - The minimum number of samples required to split an internal node. Defaults to 2
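
As a rough illustration of what these two parameters do, the sketch below fits three trees and compares their size. It uses synthetic data from `make_classification` rather than the Titanic data, and the specific values (`max_depth=3`, `min_samples_split=50`) are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the Titanic data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Default settings: the tree grows until its leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# Cap the depth: no path from root to leaf has more than 3 splits.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Require at least 50 samples in a node before it may be split.
coarse = DecisionTreeClassifier(min_samples_split=50, random_state=0).fit(X, y)

for name, tree in [("default", full),
                   ("max_depth=3", shallow),
                   ("min_samples_split=50", coarse)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Both parameters limit how far the tree can grow, which is the usual lever against overfitting: a smaller tree fits the training data less closely but tends to generalize better.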

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values for the two hyperparameters being tuned.
max_depth = [2, 3, 4, 5, 6, 7, 8]
min_samples_split = [2, 3, 4, 5, 10, 20]
score_func = 'accuracy'

# Exhaustive search over all 7 x 6 combinations, each scored with 5-fold cross-validation.
dt_cla = DecisionTreeClassifier()
dt_grid = GridSearchCV(estimator=dt_cla,
                       param_grid={'max_depth': max_depth, 'min_samples_split': min_samples_split},
                       cv=5,
                       scoring=score_func)
dt_grid.fit(data_train, label_train)

# best_score_ is the mean cross-validated accuracy of the best combination.
print('Best Score', dt_grid.best_score_)
print('Best Max Depth', dt_grid.best_estimator_.max_depth)
print('Best Split Samples', dt_grid.best_estimator_.min_samples_split)

Best Score 0.8185654008438819
Best Max Depth 7
Best Split Samples 5
