# Decision Tree Model

Decision trees are used for classification and for predicting class probabilities. They are often more intuitive to
interpret than regression models. We are going to use some sample data to illustrate this model.
In this example, we specify the depth of the tree. If the depth is too low, the model may underfit.
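
To make the underfitting point concrete, here is a minimal sketch on a hypothetical four-row toy data set (not the bank loan file), where the label is 1 when either feature is 1. A tree limited to `max_depth=1` can ask only one question, so it cannot separate all four rows, while `max_depth=2` can.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is 1 if either feature is 1 (logical OR).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]

# A depth-1 tree can only split on one feature, so one leaf stays mixed.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# A depth-2 tree can split on both features and separate every row.
deeper = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

shallow_acc = shallow.score(X, y)  # accuracy on the training rows
deeper_acc = deeper.score(X, y)
print(shallow_acc, deeper_acc)
```

The shallow tree scores lower even on the rows it was trained on, which is the typical sign of an underfit model.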

In [None]:
# import the things we need first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [None]:
# read in the provided CSV file; change the path passed to read_csv() if the file is stored elsewhere
df = pd.read_csv('Decision_Tree_bankloan.csv')
df.head() # show the first five rows

### Preprocess the data
In this sample data set, `Outcome` is determined by `Age`, `Has_job` and `Own_house`.


In [None]:
# change the type of these features to `category` for mapping in the next step
df['Age'] = df['Age'].astype('category')
df['Has_job'] = df['Has_job'].astype('category') 
df['Own_house'] = df['Own_house'].astype('category')
df['Outcome'] = df['Outcome'].astype('category') 

# use .cat.codes on `category` type to map all literals to numeric values
df['Age'] = df['Age'].cat.codes
df['Has_job'] = df['Has_job'].cat.codes
df['Own_house'] = df['Own_house'].cat.codes
df['Outcome'] = df['Outcome'].cat.codes

df
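
It can help to know which number each label was mapped to. As a sketch on a hypothetical `Age` column (the values `young`, `middle` and `old` are assumed for illustration, not taken from the CSV), `.cat.categories` lists the labels in code order, so enumerating it recovers the mapping. For string labels pandas sorts the categories alphabetically, so the codes do not follow a natural ordering such as young < middle < old.

```python
import pandas as pd

# Hypothetical column; the real labels in the CSV may differ.
age = pd.Series(['young', 'middle', 'old', 'young']).astype('category')

# Categories are stored in sorted order; position i corresponds to code i.
code_to_label = dict(enumerate(age.cat.categories))
codes = age.cat.codes.tolist()

print(code_to_label)
print(codes)
```

In the notebook, a lookup like this has to be built before the column is overwritten with `.cat.codes`, because the labels are lost after that step.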

In [None]:
# check if the categories are balanced
df['Outcome'].value_counts()

In [None]:
# extract data and target from our dataframe
data = df[['Age', 'Has_job', 'Own_house']] # independent variables
target = df['Outcome']  # dependent variable: y
data

In [None]:
target

### Build the Decision Tree Model

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, target, random_state = 42)
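
When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below uses a hypothetical 20-row frame to show the resulting sizes. The `stratify` argument is optional and keeps the class ratio as similar as possible in both parts, which can help when the classes are unbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, one feature, 10 samples per class.
toy = pd.DataFrame({'feature': range(20), 'label': [0, 1] * 10})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy[['feature']], toy['label'],
    random_state=42,
    stratify=toy['label'],  # keep the class ratio as close as possible in both parts
)

print(len(x_tr), len(x_te))  # sizes of the training and test parts
print(y_te.value_counts().to_dict())
```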

In [None]:
# import decision tree model from sklearn
from sklearn.tree import DecisionTreeClassifier

# instantiate a decision tree model. All parameters can be omitted to use the defaults.
# for details, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

#dt = DecisionTreeClassifier() 
#dt.fit(x_train, y_train) # train our model

dt = DecisionTreeClassifier(max_depth = 1, random_state = 0) 
dt.fit(x_train, y_train) 

In [None]:
x_train

In [None]:
y_train

### Evaluate the model

In [None]:
y_pred = dt.predict(x_test) # let the model predict the test data

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
print(y_pred) # labels predicted by the model
print(y_test) # true labels


Compare the predicted labels with the true labels. The accuracy score is the fraction of test samples whose predicted label matches the true label:

$$ accuracy\_score = \frac{number\_of\_matches}{number\_of\_samples} $$
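
As a worked instance of the formula, the sketch below computes the score by hand on hypothetical labels (invented for illustration, not the notebook's output). It should agree with what `accuracy_score` returns for the same inputs.

```python
# Hypothetical predictions and true labels, for illustration only.
y_true_example = [1, 0, 1, 1, 0]
y_pred_example = [1, 0, 0, 1, 0]

# Count positions where the prediction equals the true label.
matches = sum(p == t for p, t in zip(y_pred_example, y_true_example))
accuracy = matches / len(y_true_example)
print(matches, accuracy)
```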



### Visualize the Decision Tree

We can use `graphviz` to see what the decision tree looks like.

First, install it by running this in a terminal, with the conda environment used for this notebook active:
```
conda install python-graphviz
```

In [None]:
# show the decision tree model
# import graphviz and sklearn.tree first
from sklearn import tree
import graphviz
from graphviz import Source

Source(tree.export_graphviz(dt, out_file=None, class_names=True, feature_names= x_train.columns)) # display the tree, with no output file

In [None]:
from sklearn import tree
import graphviz
from graphviz import Source

Source(tree.export_graphviz(dt, out_file=None, class_names=['No', 'Yes'], feature_names= x_train.columns)) # display the tree, with no output file
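
If `graphviz` is not available, scikit-learn's `export_text` prints a fitted tree as indented plain-text rules. A minimal sketch on hypothetical toy data (two features named after the columns above, labels invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: outcome is 1 if the applicant has a job or owns a house.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Render the fitted tree as text, one line per split or leaf.
rules = export_text(toy_tree, feature_names=['Has_job', 'Own_house'])
print(rules)
```

For the model in this notebook, the corresponding call would be `export_text(dt, feature_names=list(x_train.columns))`.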

### Predict Some Values

In [None]:
# we can use the model to predict new samples; the feature order is Age, Has_job, Own_house

print(dt.predict([[1, 0, 1]]))
print(dt.predict([[1, 0, 0]]))
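
Besides a hard label, the classifier can report class probabilities with `predict_proba`. The sketch below uses a hypothetical toy frame rather than the bank loan data. Passing a DataFrame with the same column names used in training also avoids scikit-learn's warning about missing feature names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data, invented for illustration.
train = pd.DataFrame({
    'Has_job':   [0, 0, 1, 1],
    'Own_house': [0, 1, 0, 1],
    'Outcome':   [0, 1, 1, 1],
})
features = ['Has_job', 'Own_house']
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(train[features], train['Outcome'])

# New samples use the same column names and order as the training data.
new = pd.DataFrame([[1, 0], [0, 0]], columns=features)
labels = clf.predict(new)
proba = clf.predict_proba(new)  # one row per sample, one column per class in clf.classes_
print(labels)
print(proba)
```

Each row of `proba` sums to 1. A leaf that still contains mixed training labels yields fractional probabilities, which is common for a tree as shallow as `max_depth=1`.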