# Decision Tree
![Matrix Image](pictures/DecisionTreeModel/SololearnMachineLearningDecisionTreeModel.png "Decision Tree")

### Gini

$gini = 2*p*(1-p)$

Where $p$ is the proportion of positive samples in the node (a value between 0 and 1).

![Matrix Image](pictures/DecisionTreeModel/SololearnMachineLearningDecisionTreeModelGinyGraph.png "Gini Graph")
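
As a quick check of the formula, here is a minimal sketch that evaluates the Gini impurity at a few values of $p$. The function name `gini` and the sample values are illustrative choices, not part of the original notebook.

```python
def gini(p):
    """Gini impurity of a binary node; p is the fraction of positive samples."""
    return 2 * p * (1 - p)

# Impurity is 0 for a pure node and peaks at p = 0.5
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, gini(p))
```

The value is 0 when a node contains only one class and reaches its maximum of 0.5 when the classes are evenly mixed.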

### Entropy

$entropy = -[p*\log_2p+(1-p)*\log_2{(1-p)}]$

![Matrix Image](pictures/DecisionTreeModel/SololearnMachineLearningDecisionTreeModelEntropyGraph.png "Entropy Graph")
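
The entropy formula can be sketched the same way. The helper name `entropy` is an illustrative choice. The explicit check for $p = 0$ and $p = 1$ is an assumption of this sketch: it applies the usual convention $0 \cdot \log_2 0 = 0$, since `math.log2(0)` raises an error.

```python
import math

def entropy(p):
    """Binary entropy in bits; p is the fraction of positive samples."""
    if p in (0.0, 1.0):
        # Convention: 0 * log2(0) is treated as 0, so a pure node has entropy 0
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, entropy(p))
```

Like Gini, entropy is 0 for a pure node and largest at $p = 0.5$, where it equals 1.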

### Information Gain

$Information\ Gain = H(S)-\dfrac{|A|}{|S|}*H(A)-\dfrac{|B|}{|S|}*H(B)$

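Where:
- H is the impurity function (Gini here; entropy can be used the same way)
- S is the original dataset, and $|S|$ is its number of samples
- A is the subset of S on one side of the split (where the split condition holds)
- B is the subset of S on the other side of the split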
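
A small worked example may make the formula more concrete. The sketch below treats A and B as the two groups that a candidate split produces and uses Gini as H. The label lists and the helper names (`positive_fraction`, `information_gain`) are hypothetical, chosen only for illustration.

```python
def gini(p):
    """Gini impurity of a binary node; p is the fraction of positive samples."""
    return 2 * p * (1 - p)

def positive_fraction(labels):
    """Fraction of 1s in a list of 0/1 labels."""
    return sum(labels) / len(labels)

def information_gain(parent, left, right, impurity=gini):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    h_parent = impurity(positive_fraction(parent))
    h_left = impurity(positive_fraction(left))
    h_right = impurity(positive_fraction(right))
    return h_parent - len(left) / n * h_left - len(right) / n * h_right

# Hypothetical labels: S is split into A and B
S = [1, 1, 1, 1, 0, 0, 0, 0]
A = [1, 1, 1, 0]
B = [1, 0, 0, 0]
print("mixed split:", information_gain(S, A, B))

# A split that separates the classes perfectly gives the largest possible gain
print("perfect split:", information_gain(S, [1, 1, 1, 1], [0, 0, 0, 0]))
```

The tree-building algorithm evaluates candidate splits like these and keeps the one with the highest information gain.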


In [1]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(model.predict([[3, True, 22, 1, 0, 7.25]]))

[0]


### Comparing LogisticRegression vs DecisionTree

In [3]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
print("DecisionTree:")
print("accuracy:", model.score(X_test, y_test))
y_pred = model.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print()
print("Logistic Regression:")
print("accuracy:", lr.score(X_test, y_test))
y_pred_lr = lr.predict(X_test)
print("precision:", precision_score(y_test, y_pred_lr))
print("recall:", recall_score(y_test, y_pred_lr))

DecisionTree:
accuracy: 0.7882882882882883
precision: 0.7415730337078652
recall: 0.7333333333333333

Logistic Regression:
accuracy: 0.7522522522522522
precision: 0.7058823529411765
recall: 0.6666666666666666


### Using entropy

In [4]:
model = DecisionTreeClassifier(criterion='entropy')

#### Comparing

In [5]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import numpy as np
kf = KFold(n_splits=5, shuffle=True)
for criterion in ['gini', 'entropy']:
    print("Decision Tree - {}".format(criterion))
    accuracy = []
    precision = []
    recall = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        dt = DecisionTreeClassifier(criterion=criterion)
        dt.fit(X_train, y_train)
        y_pred = dt.predict(X_test)
        accuracy.append(accuracy_score(y_test, y_pred))
        precision.append(precision_score(y_test, y_pred))
        recall.append(recall_score(y_test, y_pred))
    print("accuracy:", np.mean(accuracy))
    print("precision:", np.mean(precision))
    print("recall:", np.mean(recall))

Decision Tree - gini
accuracy: 0.7801053767536342
precision: 0.7105406907264492
recall: 0.7156788400939873
Decision Tree - entropy
accuracy: 0.7722909921919634
precision: 0.7110742497839271
recall: 0.7063991030113105


### Exporting image of tree

In [6]:
from sklearn.tree import export_graphviz
feature_names = ['Pclass', 'male']
dt = DecisionTreeClassifier()
X = df[feature_names].values
dt.fit(X, y)
dot_file = export_graphviz(dt, feature_names=feature_names)

X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values

In [7]:
# Disabled: rendering does not work here because of a PyCharm problem

# import graphviz
# graph = graphviz.Source(dot_file)
# graph.render(filename='tree', format='png', cleanup=True)

![Matrix Image](pictures/DecisionTreeModel/tree.png "Can be generated with GenerateTreeImage.py")

### Decision trees are prone to overfitting
![Matrix Image](pictures/DecisionTreeModel/treeFull.png "Can be generated with GenerateTreeImageFull.py")

This is why we <b>prune the tree</b>, using either <i>pre-pruning</i> or <i>post-pruning</i>.

#### Pre-pruning
- <b>max depth</b>: Only grow the tree up to a certain depth (height).
If the max depth is 3, each data point passes through at most 3 splits.
- <b>leaf size</b>: Don’t split a node if the number of samples at that node is under a threshold.
- <b>number of leaf nodes</b>: Limit the total number of leaf nodes allowed in the tree.

Pruning is a balance. For example, if you set the max depth too small, you won’t have much of a
tree and it will have little predictive power. This is called underfitting. Similarly, if the leaf
size is too large or the number of leaf nodes is too small, you’ll have an underfit model.

In [8]:
dt1 = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2, max_leaf_nodes=10)
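
To see what these limits do in practice, the sketch below compares an unrestricted tree with a pre-pruned one. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the example runs without the Titanic file), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=22)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=22)

# One tree grown without limits, one with the pre-pruning limits from above
full = DecisionTreeClassifier(random_state=22).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=2, max_leaf_nodes=10, random_state=22
).fit(X_tr, y_tr)

for name, tree in [("full", full), ("pre-pruned", pruned)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| train accuracy:", tree.score(X_tr, y_tr),
          "| test accuracy:", tree.score(X_te, y_te))
```

A large gap between train and test accuracy for the unrestricted tree is the usual sign of overfitting. How much pruning helps depends on the data.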

#### Finding the best limits
We pass four arguments to GridSearchCV:
1. The model (in this case a DecisionTreeClassifier)
2. Param grid: a dictionary of the parameter names and all the values to try for each
3. What metric to use (the default for a classifier is accuracy)
4. How many folds to use for k-fold cross validation

In [9]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth':[5, 15, 25],
    'min_samples_leaf': [1,3],
    'max_leaf_nodes': [10, 20, 35, 50]}
dt = DecisionTreeClassifier()
gs = GridSearchCV(dt, param_grid, scoring='f1', cv=5)
gs.fit(X,y)
print("best params:", gs.best_params_)
print("best score:", gs.best_score_)

best params: {'max_depth': 15, 'max_leaf_nodes': 35, 'min_samples_leaf': 1}
best score: 0.7709600688632559


![Matrix Image](pictures/DecisionTreeModel/treeBest.png "Can be generated with GenerateTreeImageBest.py")

- Decision trees are slow to build, but very fast at making predictions.
- Decision trees are prone to overfitting.
- Decision trees are easy to interpret, which makes them well suited to explaining predictions to a non-technical audience.