## Decision Tree Model

In this example, two separate files are used to train and validate the machine learning model. Categorical feature values are mapped to numeric codes explicitly, so that the training and test dataframes share the same encoding.
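
To illustrate why the mapping is written out by hand, the sketch below compares it with pandas' automatic category codes. The two small series are hypothetical and are not taken from the bank-loan files; the test series simply lacks one category.

```python
import pandas as pd

# Hypothetical columns: the test split happens to lack the "middle" category.
train_age = pd.Series(["young", "middle", "old"])
test_age = pd.Series(["young", "old"])

# Automatic codes are assigned per column, from the sorted categories that
# are present, so the same label can receive a different code in each split.
auto_train = dict(zip(train_age, train_age.astype("category").cat.codes))
auto_test = dict(zip(test_age, test_age.astype("category").cat.codes))

# An explicit dictionary gives every label the same code in both splits.
age_codes = {"young": 0, "middle": 1, "old": 2}
explicit_train = train_age.map(age_codes).tolist()
explicit_test = test_age.map(age_codes).tolist()

print(auto_train)
print(auto_test)
print(explicit_train, explicit_test)
```

With automatic codes, "old" is encoded differently in the two series, which would silently change what the model sees at test time.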

In [None]:
# import the things we need first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [None]:
# read in the provided CSV files; the paths passed to read_csv() can be changed as needed
df_train = pd.read_csv('Decision_Tree_bankloan-train.csv') # read in training data file
df_test = pd.read_csv('Decision_Tree_bankloan-test.csv') # read in testing data file
df_test.head() # show the first five rows of the test data

### Preprocess the data

The training and testing data are kept in two separate dataframes.

In [None]:
df_train

In [None]:
df_train['Age']

In [None]:
# when the dataframes are constructed, pandas automatically sets the type of
# the Has_job and Own_house values to Boolean
# we can change that by mapping them to strings

# dictionary for mapping
boolDict = {
    True: 'True',
    False: 'False'
}

# only these two columns (positions 1 and 2) should be mapped;
# assigning by column name replaces the Boolean column with a string column
for df in (df_train, df_test):
    for col in df.columns[[1, 2]]:
        df[col] = df[col].map(boolDict)

In [None]:
# to control which code each feature value is mapped to,
# we can provide a dictionary that holds all the mapping rules

mydict = {
    "Yes": 1,
    "No": 0,
    # note: "True" is coded 0 and "False" 1, the reverse of Yes/No
    "True": 0,
    "False": 1,
    "young": 0,
    "middle": 1,
    "old": 2,
    "fair" : 0,
    "good" : 1,
    "excellent" : 2    
}

# construct a function that can take a dataframe and
# map all the categorical values in each column according
# to our dictionary
def outcomeTrans(X):
    cols = list(X)
    for i in cols:
        X[i] = X[i].map(mydict)
    return X
        

# apply the mapping to both datasets
outcomeTrans(df_train)
outcomeTrans(df_test)
df_train.head()

In [None]:
df_test

We have the data prepared. 
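
One thing worth checking at this point: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or unexpected category would go unnoticed. A minimal sketch of such a check, using a hypothetical frame with a deliberate typo rather than the real data:

```python
import pandas as pd

# Hypothetical frame with one misspelled category ("midle").
df = pd.DataFrame({
    "Age": ["young", "midle", "old"],
    "Outcome": ["Yes", "No", "Yes"],
})
codes = {"young": 0, "middle": 1, "old": 2, "Yes": 1, "No": 0}

# Map every column through the same dictionary.
encoded = df.apply(lambda col: col.map(codes))

# Count unmapped cells per column; any nonzero count needs attention.
missing = encoded.isna().sum()
print(missing)
```

Running the same `isna().sum()` on `df_train` and `df_test` after the mapping would show whether every value was covered by `mydict`.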

In [None]:
# RUN THIS
# split the training data into features (first four columns) and target (Outcome)
# do the same for the testing data
x_train = df_train.iloc[:, 0:4]
y_train = df_train['Outcome']
x_test = df_test.iloc[:, 0:4]
y_test = df_test['Outcome']

In [None]:
x_train

### Train the Model

In [None]:
# import decision tree model from sklearn
from sklearn.tree import DecisionTreeClassifier

# instantiate a decision tree model. All parameters can be omitted to use the defaults.
# for details, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
dt = DecisionTreeClassifier() 
dt.fit(x_train, y_train) # train our model
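
With the defaults, the tree keeps splitting until the leaves are pure, and the choice between equally good splits can depend on the random seed. As a rough sketch of how a few constructor parameters could be set, the block below fits a tree on hypothetical encoded rows (not the bank-loan files); the values chosen for `criterion`, `max_depth`, and `random_state` are illustrative assumptions, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows, already encoded as four numeric features.
X = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 1, 0],
    [1, 1, 1, 0],
]
y = [0, 0, 1, 1, 0, 0]

clf = DecisionTreeClassifier(
    criterion="entropy",  # split quality measured by information gain
    max_depth=3,          # cap the depth to limit overfitting
    random_state=0,       # fix the seed so repeated runs give the same tree
)
clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```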

### Evaluate the model

In [None]:
x_train

In [None]:
y_train

In [None]:
y_pred = dt.predict(x_test) # let the model predict the test data

In [None]:
y_test

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
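
Accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it by hand and adds a confusion matrix to show which kind of error was made; the two label lists are hypothetical stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_hat)

# The same number by hand: matching positions divided by the total.
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

print(acc, manual)
print(cm)
```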

In [None]:
print(y_pred) # labels predicted by the model
print(y_test) # true labels


In [None]:
# we can use the model to predict any sample encoded with the same mapping

print(dt.predict([[1, 0, 1, 1]]))
print(dt.predict([[1, 0, 0, 1]]))

In [None]:
x_test_simple = [[1, 0, 1, 1]]
y_test_simple = [1]
y_pred_simple = dt.predict(x_test_simple)
print(y_test_simple)
print(y_pred_simple)
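
When a model has been fitted on a dataframe, scikit-learn warns if it is later given a plain list, because the feature names are missing. One way to avoid that is to wrap the sample in a dataframe with the same columns. The sketch below uses hypothetical data, and the column name `Credit_rating` is an assumption; in the notebook, `x_train.columns` would supply the real names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Column names are partly assumed; "Credit_rating" is hypothetical.
cols = ["Age", "Has_job", "Own_house", "Credit_rating"]

# Hypothetical encoded training rows.
X = pd.DataFrame(
    [[0, 1, 1, 0], [1, 0, 1, 1], [2, 1, 0, 2], [1, 1, 1, 0]],
    columns=cols,
)
y = pd.Series([0, 1, 1, 0])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Build the new sample with the same column names used for fitting.
sample = pd.DataFrame([[1, 0, 1, 1]], columns=cols)
pred = clf.predict(sample)
print(pred)
```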


### Visualize the Decision Tree

We can use `graphviz` to see what the decision tree looks like.

First, install it in the active conda environment:
```
conda install python-graphviz
```

In [None]:
# show the decision tree model
# import graphviz and sklearn.tree first
from sklearn import tree
import graphviz
from graphviz import Source

# class_names=True labels the classes symbolically; feature_names is passed as a plain list
Source(tree.export_graphviz(dt, out_file=None, class_names=True, feature_names=list(x_train.columns))) # display the tree, with no output file

In [None]:
from sklearn import tree
import graphviz
from graphviz import Source

# class_names are listed in ascending order of the encoded labels (0 = No, 1 = Yes)
Source(tree.export_graphviz(dt, out_file=None, class_names=['No', 'Yes'], feature_names=list(x_train.columns))) # display the tree, with no output file
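
If `graphviz` is not available, scikit-learn's `export_text` prints the same kind of tree as indented plain text. A minimal sketch on hypothetical data follows; the feature names are assumptions, and in the notebook the call would take `dt` and `list(x_train.columns)` instead.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names and encoded rows.
feature_names = ["Age", "Has_job", "Own_house", "Credit_rating"]
X = [[0, 1, 1, 0], [1, 0, 1, 1], [2, 1, 0, 2], [1, 1, 1, 0]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Render the fitted tree as text: one line per split or leaf.
text = export_text(clf, feature_names=feature_names)
print(text)
```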