<center>
    <H1> DECISION TREE K-FOLD VALIDATION </H1>
    <br>
======================================================================================================================
<br>
A Decision Tree is a simple classification algorithm that learns if-else rules from the training data. It forms a tree-like structure in which each internal node is a rule or condition that splits the data set into subsets. To cross-validate the performance of the DT classifier, k-fold validation was applied with k=10.
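
To make the if-else structure concrete, here is a minimal hand-written sketch of a two-level tree. The function name `classify_account`, the two features chosen and both thresholds are invented for illustration. They are not rules learned from the dataset, and a fitted `DecisionTreeClassifier` would pick its own splits.

```python
def classify_account(followers_count, statuses_count):
    """Toy two-level decision tree written out as nested if-else rules.

    The thresholds are made up for illustration; they are NOT learned
    from the Twitter dataset used in this notebook.
    """
    # Root node: first condition splits the data on followers_count.
    if followers_count <= 100:
        # Internal node: second condition splits the left branch further.
        if statuses_count <= 50:
            return 1  # leaf: predict class 1
        return 0      # leaf: predict class 0
    # Right branch of the root is a leaf.
    return 0          # leaf: predict class 0


print(classify_account(followers_count=5, statuses_count=43))
print(classify_account(followers_count=2293, statuses_count=38420))
```

Each `if` corresponds to an internal node and each `return` to a leaf, which is why a trained tree can be read as a set of rules.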

## STEP 1: IMPORT LIBRARIES

In [29]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score

## STEP 2: LOAD DATASET

In [30]:
dataset = pd.read_csv('data/twitter_dataset.csv', encoding = 'latin-1')
dataset.head()

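| | name_wt | statuses_count | followers_count | friends_count | favourites_count | listed_count | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.9375 | 43 | 5 | 34 | 0 | 0 | 1 |
| 1 | 0.909091 | 12204 | 1182 | 1327 | 0 | 4 | 1 |
| 2 | 0.909091 | 42 | 3 | 34 | 0 | 0 | 1 |
| 3 | 1.0 | 215 | 1158 | 1545 | 0 | 21 | 1 |
| 4 | 0.285714 | 38420 | 2293 | 2198 | 1987 | 2 | 0 |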


In [31]:
# Independent attributes: every column except the target 'label'
features = [column for column in dataset.columns if column != 'label']
features

['name_wt',
 'statuses_count',
 'followers_count',
 'friends_count',
 'favourites_count',
 'listed_count']

In [32]:
# Split the dataset into features and target variable
X = dataset[features] # Features
y = dataset.label # Target variable

## STEP 3: CROSS VALIDATION 

The dataset is divided into k splits (folds). At each iteration, one split is used as the test set and the remaining k-1 splits are used as the training set. Here the dataset was cross-validated with 10 folds, so 10 iterations were performed and the performance measures were evaluated separately at each iteration.
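
The splitting scheme described above can be sketched in plain Python. The helper name `kfold_indices` is hypothetical. It builds contiguous, unshuffled folds, which is how scikit-learn's `KFold` behaves when `shuffle=False`, and it is meant as an illustration rather than a replacement for `KFold`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The current fold is the test set; everything else is the training set.
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Small demonstration: 10 samples, 5 folds.
for i, (train, test) in enumerate(kfold_indices(n_samples=10, n_splits=5), start=1):
    print(f"fold {i}: test={test} train={train}")
```

Every sample lands in exactly one test fold, so each row of the dataset is used for testing once and for training k-1 times.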

In [36]:
# Divide the entire dataset into k consecutive folds, where k is the number of folds.
# Note: KFold does not shuffle the rows unless shuffle=True is passed.

number_of_splits=10   # defines k (value chosen empirically)
X = np.array(dataset[features]) # Features
y = np.array(dataset.label)
kf = KFold(n_splits=number_of_splits)
kf.get_n_splits(X)

print(kf)

KFold(n_splits=10, random_state=None, shuffle=False)


In [37]:
#perform Decision Tree classification k times
k = 0
#initializing each performance metric with 0
recall_avg=0
precision_avg=0
fmeasure_avg=0
accuracy_avg=0

for train_index, test_index in kf.split(X):
    k = k+1   #number of iteration
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Create Decision Tree classifier object
    clf = DecisionTreeClassifier(min_impurity_decrease=0.001)

    # Train Decision Tree classifier
    clf = clf.fit(X_train,y_train)

    # Predict the response for the test set
    y_predict = clf.predict(X_test)

    ## Performance evaluation of the model

    #true negatives is C(0,0), false negatives is C(1,0), false positives is C(0,1) and true positives is C(1,1) 
    conf_matrix = confusion_matrix(y_test, y_predict)

    #true_negative
    TN = conf_matrix[0][0]
    #false_negative
    FN = conf_matrix[1][0]
    #false_positive
    FP = conf_matrix[0][1]
    #true_positive
    TP = conf_matrix[1][1]

    # Recall is the number of correctly classified positive examples divided by the total number of positive examples.
    # High recall indicates that the positive class is correctly recognized (small number of FN).
    recall = (TP)/(TP + FN)
    recall_avg += recall
    
    # Precision is the number of correctly classified positive examples divided by the total number of predicted positive examples.
    # High precision indicates that an example labeled as positive is indeed positive (small number of FP).
    precision = (TP)/(TP + FP)
    precision_avg += precision
    
    # F-measure is the harmonic mean of recall and precision.
    fmeasure = (2*recall*precision)/(recall+precision)
    fmeasure_avg += fmeasure
    
    accuracy = (TP + TN)/(TN + FN + FP + TP)
    accuracy_avg += accuracy
    #accuracy_score(y_test, y_predict)
    
    print("\n------ CLASSIFICATION PERFORMANCE FOR ITERATION ", k, " ------ ",
      "\n Recall : ", (recall*100) ,"%" \
      "\n Precision : ", (precision*100) ,"%" \
      "\n Accuracy : ", (accuracy*100) ,"%" \
      "\n F-measure : ", (fmeasure*100) ,"%" )



------ CLASSIFICATION PERFORMANCE FOR ITERATION  1  ------  
 Recall :  98.08743169398907 %
 Precision :  97.289972899729 %
 Accuracy :  97.55395683453237 %
 F-measure :  97.68707482993196 %

------ CLASSIFICATION PERFORMANCE FOR ITERATION  2  ------  
 Recall :  96.7551622418879 %
 Precision :  98.79518072289156 %
 Accuracy :  97.84172661870504 %
 F-measure :  97.7645305514158 %

------ CLASSIFICATION PERFORMANCE FOR ITERATION  3  ------  
 Recall :  97.61904761904762 %
 Precision :  97.91044776119404 %
 Accuracy :  97.84172661870504 %
 F-measure :  97.7645305514158 %

------ CLASSIFICATION PERFORMANCE FOR ITERATION  4  ------  
 Recall :  98.78419452887537 %
 Precision :  98.18731117824774 %
 Accuracy :  98.56115107913669 %
 F-measure :  98.48484848484848 %

------ CLASSIFICATION PERFORMANCE FOR ITERATION  5  ------  
 Recall :  95.17045454545455 %
 Precision :  96.26436781609196 %
 Accuracy :  95.68345323741008 %
 F-measure :  95.71428571428571 %

------ CLASSIFICATION PERFORMANCE 
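
As a worked example of the metric formulas used in the loop, the sketch below computes recall, precision, F-measure and accuracy from a single 2x2 confusion matrix. The helper name `binary_metrics` and the counts are hypothetical and are not taken from any fold of this run. The zero-denominator guards are an addition that the loop above does not have.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute recall, precision, F-measure and accuracy from confusion-matrix counts."""
    # Guard against empty denominators (e.g. a fold with no positive examples).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if recall + precision:
        fmeasure = 2 * recall * precision / (recall + precision)
    else:
        fmeasure = 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"recall": recall, "precision": precision,
            "fmeasure": fmeasure, "accuracy": accuracy}


# Hypothetical counts, laid out as scikit-learn's [[TN, FP], [FN, TP]].
conf_matrix = [[50, 10],
               [5, 35]]
(tn, fp), (fn, tp) = conf_matrix
metrics = binary_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f} %")
```

With these counts, recall is 35/40, precision is 35/45 and accuracy is 85/100, so the F-measure falls between recall and precision, as a harmonic mean should.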

## STEP 4: AVERAGE PERFORMANCE

In [38]:
recall_avg = recall_avg/number_of_splits
precision_avg = precision_avg/number_of_splits
fmeasure_avg = fmeasure_avg/number_of_splits
accuracy_avg = accuracy_avg/number_of_splits

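print("\n------ AVERAGE CLASSIFICATION PERFORMANCE OF DECISION TREE MODEL ------ \n"\
      "\n Recall : ", (recall_avg*100) ,"%" \
      "\n Precision : ", (precision_avg*100) ,"%" \
      "\n Accuracy : ", (accuracy_avg*100) ,"%" \
      "\n F-measure : ", (fmeasure_avg*100) ,"%" )

The output recorded below came from a run that printed the last fold's `recall`, `precision`, `accuracy` and `fmeasure` instead of the `*_avg` variables, so it shows the scores of iteration 10 rather than the 10-fold averages.
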
------ AVERAGE CLASSIFICATION PERFORMANCE OF DECISION TREE MODEL ------ 

 Recall :  98.88268156424581 %
 Precision :  98.33333333333333 %
 Accuracy :  98.55907780979827 %
 F-measure :  98.60724233983287 %
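
The per-fold loop and manual averaging can also be expressed with scikit-learn's `cross_validate`, which returns one score per fold for each requested metric. The sketch below is only an illustration: it runs on synthetic stand-in data from `make_classification`, because the Twitter CSV is not available here, and `X_demo`, `y_demo` and `random_state=0` are assumptions. The scores it prints are not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 6 features and a binary label.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Same classifier setting and fold count as in the notebook.
clf = DecisionTreeClassifier(min_impurity_decrease=0.001, random_state=0)
kf = KFold(n_splits=10)

metric_names = ["recall", "precision", "f1", "accuracy"]
scores = cross_validate(clf, X_demo, y_demo, cv=kf, scoring=metric_names)

# cross_validate stores the per-fold values under "test_<metric>".
for name in metric_names:
    fold_scores = scores["test_" + name]
    print(f"{name:>9}: mean={fold_scores.mean() * 100:.2f} %  "
          f"std={fold_scores.std() * 100:.2f} %")
```

Reporting the standard deviation next to the mean shows how much the scores vary from fold to fold, which the average alone hides.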
