# Divide and Conquer
## Classification Using Decision Trees

* Collect the data from: [Github Machine-Learning-with-R](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/credit.csv)
* Use pandas to open the file

In [2]:
import pandas as pd
# Load the CSV data into a DataFrame
file = pd.read_csv("credit.csv")

file.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,...,property,age,installment_plan,housing,existing_credits,default,dependents,telephone,foreign_worker,job
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,...,real estate,67,none,own,2,1,1,yes,yes,skilled employee
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,...,real estate,22,none,own,1,2,1,none,yes,skilled employee
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,...,real estate,49,none,own,1,1,2,none,yes,unskilled resident
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,...,building society savings,45,none,for free,1,1,2,none,yes,skilled employee
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,...,unknown/none,53,none,for free,2,2,2,none,yes,skilled employee


* Encode the non-numeric columns as integers

In [3]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode every column that is not numeric
for col in ['checking_balance', 'credit_history', 'purpose', 'savings_balance',
            'personal_status', 'employment_length', 'other_debtors', 'property',
            'installment_plan', 'housing', 'telephone', 'foreign_worker', 'job']:
    file[col] = le.fit_transform(file[col])
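`LabelEncoder` numbers the distinct values of a column in sorted order, so the integer codes carry no meaning beyond alphabetical position. Because the loop re-fits the same encoder for each column, only the mapping of the last column remains on `le` afterwards. The following minimal sketch shows how a mapping can be inspected; the `toy` frame and its values are illustrative assumptions, not rows from `credit.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values modeled on credit_history)
toy = pd.DataFrame({"credit_history": ["critical", "repaid", "delayed", "repaid"]})

encoder = LabelEncoder()
toy["credit_history_code"] = encoder.fit_transform(toy["credit_history"])

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Keeping one encoder per column (for example in a dict keyed by column name) would make it possible to decode predictions or tree splits later with `inverse_transform`.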


## Data preparation 
* Creating random training and test datasets

In [4]:
from sklearn.model_selection import train_test_split

#Split the data into two parts 
#90 percent of the data will be used for training and 10 percent for test
credit_train, credit_test = train_test_split(file, test_size = 0.1,
                                             random_state = 123)

In [5]:
# Separate the target column ('default') from the features
train_labels = credit_train.pop('default')
test_labels = credit_test.pop('default')
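The split above is purely random, so the share of each `default` class in the test set can drift from its share in the full data. One possible variation, not used in this notebook, is a stratified split through the `stratify` parameter of `train_test_split`. The sketch below uses a hypothetical `toy` frame with a 70/30 class balance to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the credit data: 70 rows of class 1, 30 of class 2
toy = pd.DataFrame({"amount": range(100), "default": [1] * 70 + [2] * 30})

# stratify keeps the class proportions (roughly) equal in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.1, random_state=123, stratify=toy["default"]
)

print(len(toy_train), len(toy_test))
print(toy_test["default"].value_counts())
```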

* A function to return error metrics.

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

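def measure_error(y_true, y_pred, label):
    """Return accuracy, precision, recall and F1 as a Series named `label`."""
    # precision, recall and F1 use scikit-learn's default pos_label=1
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                     name=label)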

## Training a model on the data

In [7]:
from sklearn.tree import DecisionTreeClassifier

DTC = DecisionTreeClassifier(max_depth = 4)
DTC = DTC.fit(credit_train, train_labels)
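After fitting, it can be useful to check how large the tree actually grew and which features it split on. The sketch below does this on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained; the same calls (`get_depth`, `get_n_leaves`, `feature_importances_`) should apply to `DTC`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit features and labels
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

depth = tree.get_depth()                 # never exceeds max_depth
n_leaves = tree.get_n_leaves()           # at most 2 ** max_depth
importances = tree.feature_importances_  # one weight per feature, summing to 1

print(depth, n_leaves)
print(importances)
```

Setting `random_state` on the classifier makes the fitted tree reproducible between runs, which the cell above does not do.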


## Evaluating model performance

In [8]:
#apply the decision tree on the test dataset
y_predict = DTC.predict(credit_test)

In [9]:
#Display the confusion matrix for the model
quadro = (confusion_matrix(test_labels, y_predict))
frame_default = pd.DataFrame(quadro, index = ['Actual Positive', 'Actual Negative'],
                             columns = ['Predicted Positive' ,'Predicted Negative'])
frame_default

Unnamed: 0,Predicted Positive,Predicted Negative
Actual Positive,57,9
Actual Negative,17,17


In [10]:
# Error metrics on the test set
train_test_full_error = pd.concat([measure_error(test_labels, y_predict, 'test')],
                              axis=1)
train_test_full_error

Unnamed: 0,test
accuracy,0.74
f1,0.814286
precision,0.77027
recall,0.863636
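The four metrics follow directly from the confusion matrix shown earlier. As a check, the sketch below recomputes them from the four counts, treating label 1 (the first row and column) as the positive class, which is what scikit-learn's default `pos_label=1` implies here.

```python
# Counts taken from the confusion matrix above
tp, fn = 57, 9    # actual positive: predicted positive / predicted negative
fp, tn = 17, 17   # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```

The negative class is predicted much less reliably: only 17 of the 34 actual negatives are identified, which the headline accuracy of 0.74 does not reveal.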


## Exporting the tree as a graph

In [19]:
from sklearn.tree import export_graphviz
# Use a separate name so the DataFrame `file` is not overwritten
with open('DTC.dot', 'w') as dot_file:
    export_graphviz(DTC, out_file=dot_file)


* Paste the contents of `DTC.dot` into [Webgraphviz](http://www.webgraphviz.com/) to visualize the resulting decision tree
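If an external site is not convenient, scikit-learn can also print a tree as indented text rules with `export_text`. The sketch below uses the bundled iris dataset as a stand-in, which is an assumption for the sake of a self-contained example; for this notebook the call would presumably be `export_text(DTC, feature_names=list(credit_train.columns))`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small stand-in model; any fitted DecisionTreeClassifier works the same way
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```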