## Dataset
This notebook uses the **German Credit data**, in which credit applicants are rated as Good or Bad according to whether they have repaid their credit. The dataset consists of:

Train Data: (793 samples X 50 attributes)

Test Data:  (207 samples X 50 attributes)

## Task
Given the German Credit data, the task is to train a binary decision tree classifier that predicts whether a credit applicant will be rated as good or bad.

## 1) Load training dataset

In [1]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 


# Read data from file 'credit-german-train_num.csv' using read_csv() function
germanCreditTrain = pd.read_csv("credit-german-train_num.csv")

## 2) Understanding the dataset

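pandas provides several attributes and methods for inspecting a dataset. The cell below uses ndim, shape, dtypes and head() to show the dimensionality, the shape, the column names with their types, and the first rows.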

In [2]:
# Print the number of dimensions
print('Data Dimensionality: ')
print(germanCreditTrain.ndim)
print('\n\n')

# Print the shape as (rows, columns)
print('Data Shape: ')
print(germanCreditTrain.shape)
print('\n\n')

# Print the attribute names together with their data types
print('Attribute Names: ')
print(germanCreditTrain.dtypes)
print('\n\n')


# Print the first 5 rows of the dataset
print('Head of Data: ')
print(germanCreditTrain.head(5))
print('\n\n')



Data Dimensionality: 
2



Data Shape: 
(793, 50)



Attribute Names: 
id                                               int64
duration                                         int64
credit_amount                                    int64
installment_commitment                           int64
residence_since                                  int64
age                                              int64
existing_credits                                 int64
num_dependents                                   int64
label                                            int64
f_worker                                         int64
checking_status_<0                               int64
checking_status_>=200                            int64
checking_status_no checking                      int64
credit_history_critical/other existing credit    int64
credit_history_delayed previously                int64
credit_history_existing paid                     int64
credit_history_no credits/all paid               
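
Beyond the calls above, a few other pandas methods are often useful at this stage. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical stand-ins, not rows from the German Credit files); the same calls should apply to `germanCreditTrain`.

```python
import pandas as pd

# Toy frame standing in for germanCreditTrain; the values are made up.
toy = pd.DataFrame({
    "duration": [6, 48, 12, 42],
    "credit_amount": [1169, 5951, 2096, 7882],
    "label": [1, 0, 1, 1],
})

# Column names only (dtypes prints the names together with their types)
column_names = list(toy.columns)
print(column_names)

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
print(toy.describe())

# How many rows fall into each class of the "label" column
label_counts = toy["label"].value_counts()
print(label_counts)

# Number of missing values per column
missing = toy.isna().sum()
print(missing)
```

Checking the class balance with `value_counts()` is worthwhile before training, because accuracy alone can be misleading when one class dominates.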

## 4) Shuffle and split the training data into train (80%) and validation (20%) sets

In [3]:
from sklearn.utils import shuffle
import math
# Shuffle the training data
ShuffledGermanCreditTrain = shuffle(germanCreditTrain)

# Split point: the first 80% of the rows go to training
rowCount = ShuffledGermanCreditTrain.shape[0]
splitFrom = math.floor(rowCount * 0.8)

# First 80%
trainGermanCreditTrain = ShuffledGermanCreditTrain.iloc[:splitFrom]

# Remaining 20%
valGermanCreditTrain = ShuffledGermanCreditTrain.iloc[splitFrom:]
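
The manual shuffle and slice above can also be done in one call. The sketch below uses scikit-learn's `train_test_split` on a synthetic frame (the `toy` data and the seed value 42 are assumptions for illustration, not part of the original notebook). Passing `random_state` makes the split repeatable, which the unseeded `shuffle` call does not guarantee.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for germanCreditTrain (100 rows, made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "duration": rng.integers(6, 60, size=100),
    "label": rng.integers(0, 2, size=100),
})

# Shuffle and split in one call; random_state fixes the shuffle order
train_part, val_part = train_test_split(
    toy, test_size=0.2, shuffle=True, random_state=42
)

print(train_part.shape, val_part.shape)
```

With an unseeded shuffle, the validation accuracies reported later can change from run to run, so fixing the seed helps when comparing models.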



## 5) Train DecisionTreeClassifier 

Train three decision trees with different values of "min_samples_split", the minimum number of samples required to split an internal node:

min_samples_split = [2 (default), 3, 5]

In [4]:
# Training a decision tree needs two inputs:
#   (1) attributes: the input columns
#   (2) labels: the output column to predict

# The classifier should predict whether a credit applicant
# is rated as good or bad.

# This notebook takes the "credit_history_existing paid" column
# (column index 15) as the target, on the assumption that it records
# whether the applicant has repaid the credit, which in turn
# determines the Good or Bad rating.

# Split the training data into "attributes" and "labels"
targetColumn = 'credit_history_existing paid'
attributes = trainGermanCreditTrain.drop(columns=[targetColumn])
labels = trainGermanCreditTrain[targetColumn]


# Create the decision tree classifiers
from sklearn.tree import DecisionTreeClassifier

tree2split = DecisionTreeClassifier(min_samples_split=2)
tree3split = DecisionTreeClassifier(min_samples_split=3)
tree5split = DecisionTreeClassifier(min_samples_split=5)

# Fit each classifier on the training attributes and labels

tree2split.fit(attributes, labels)
tree3split.fit(attributes, labels)
tree5split.fit(attributes, labels)



DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
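
To get a feel for what "min_samples_split" changes, the sketch below fits three trees on synthetic data from `make_classification` (an assumption for illustration, not the German Credit data) and records the size of each fitted tree. A larger value stops splitting earlier, so it typically yields a smaller tree, although the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with some label noise
X, y = make_classification(
    n_samples=300, n_features=10, flip_y=0.1, random_state=0
)

node_counts = {}
for mss in (2, 3, 5):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    tree.fit(X, y)
    # tree_.node_count counts internal nodes plus leaves
    node_counts[mss] = tree.tree_.node_count
    print(mss, node_counts[mss], tree.get_depth())
```

A fully grown tree (min_samples_split=2) can memorise noise in the training data, which is one reason to compare several values on a validation set.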

## 6) Testing your trained classifier on Validation set
Test the **three** trained classifiers on the validation set and print the accuracy of each.


In [5]:
# Split the validation data into attributes and labels
attributesVal = valGermanCreditTrain.drop(columns=['credit_history_existing paid'])
labelsVal = valGermanCreditTrain['credit_history_existing paid']

predictions2split = tree2split.predict(attributesVal)
predictions3split = tree3split.predict(attributesVal)
predictions5split = tree5split.predict(attributesVal)


from sklearn.metrics import accuracy_score

accuracy2split = 100.0 * accuracy_score(labelsVal, predictions2split)

accuracy3split = 100.0 * accuracy_score(labelsVal, predictions3split)

accuracy5split = 100.0 * accuracy_score(labelsVal, predictions5split)

print("The accuracy of your decision tree with min_samples_split=2 on validation data is: " + str(accuracy2split))
print("The accuracy of your decision tree with min_samples_split=3 on validation data is: " + str(accuracy3split))
print("The accuracy of your decision tree with min_samples_split=5 on validation data is: " + str(accuracy5split))

The accuracy of your decision tree with min_samples_split=2 on validation data is: 95.59748427672956
The accuracy of your decision tree with min_samples_split=3 on validation data is: 93.08176100628931
The accuracy of your decision tree with min_samples_split=5 on validation data is: 92.45283018867924
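
The comparison above can be written as a loop that selects the best value automatically. This is a minimal sketch on synthetic data (the dataset, the seed and the names `val_accuracy` and `best_mss` are assumptions for illustration); it is not the code that produced the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into train and validation parts
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one tree per candidate value and score it on the validation part
val_accuracy = {}
for mss in (2, 3, 5):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    tree.fit(X_train, y_train)
    val_accuracy[mss] = accuracy_score(y_val, tree.predict(X_val))

# Keep the candidate with the highest validation accuracy
best_mss = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy, best_mss)
```

Keeping the scores in a dictionary makes it easy to add more candidate values without adding more variables.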


## 7) Testing your trained classifier on Test set

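Predict the labels of the test data **using the model that performed best in step 6** and report the accuracy. The tree with min_samples_split=2 had the highest validation accuracy, so it is the one used here.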

In [6]:
# Load test data
germanCreditTest = pd.read_csv("credit-german-test_num.csv")


attributesTest = germanCreditTest.drop(columns=['credit_history_existing paid'])
labelsTest = germanCreditTest['credit_history_existing paid']


# Predict

predictionsTest = tree2split.predict(attributesTest)

accuracyTest = 100.0 * accuracy_score(labelsTest, predictionsTest)
print("The accuracy of your decision tree with min_samples_split=2 on testing data is: " + str(accuracyTest))


The accuracy of your decision tree with min_samples_split=2 on testing data is: 89.3719806763285
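
A single accuracy figure does not show which class the errors fall on. As a sketch of one way to look further, the example below builds a confusion matrix with scikit-learn from hand-written label lists (`y_true` and `y_pred` are made-up values, not the notebook's predictions); in the notebook the same call could presumably be applied to `labelsTest` and `predictionsTest`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for eight samples
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes, ordered as [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

Here the off-diagonal entries count the misclassified samples of each class, which helps when the two kinds of error have different costs.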
