# Advanced Evaluation of Clustering Results

### 1. Basic Considerations

In the previous example, we used KMeans to try to cluster the NSL-KDD dataset into five groups: benign (regular traffic), DoS, probe, U2R, and R2L. However, as the following picture shows, this does not work very well: the clusters contain more than one traffic category.

![Clustering Analysis Process](imgs/cluster_results.jpg)


For example, the red delimited area shows that cluster 3 contains more than one category: DoS, benign, and R2L. **How should we analyze the data in this situation?**


### 2. Clustering Analysis (Cross-Tabulation)

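Cross-tabulation, also known as a cross-tab or contingency table, is a statistical tool for categorical data. Categorical data takes values from a fixed set of mutually exclusive categories. A cross-tabulation counts how often each combination of two categorical variables occurs; here, that means how many records of each class fall into each cluster.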
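
As a minimal sketch of the idea, the toy example below counts co-occurrences with `pd.crosstab`. The column names `cluster` and `label` and the six records are made up for illustration; they are not taken from NSL-KDD.

```python
import pandas as pd

# Hypothetical toy data: one cluster id and one class label per record.
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1, 2],
    'label': ['attack', 'normal', 'attack', 'attack', 'normal', 'normal'],
})

# Rows are clusters, columns are labels, cells are co-occurrence counts.
toy_table = pd.crosstab(toy['cluster'], toy['label'])
print(toy_table)
```

Each row of the resulting table describes the class mix of one cluster, which is the view used later in this notebook.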

To understand how cross-tabulation works, we first repeat the clustering process using K-Means. We divide this task into three phases:
- Load and adjust the dataset;
- Prepare and clean the dataset; and
- Train and predict values using KMeans.

#### 2.1. Load and Adjust the dataset

In [103]:
import pandas as pd

def load_dataset(trainfile, testfile,
                 header_names):
    train_df = pd.read_csv(trainfile, names=header_names)
    print('train shape:',train_df.shape)
    
    test_df = pd.read_csv(testfile, names=header_names)
    print('test shape:',test_df.shape)
    
    return train_df, test_df

header_names = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
                    'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
                    'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations',
                    'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login',
                    'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
                    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
                    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
                    'dst_host_srv_diff_host_rate',
                    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                    'dst_host_srv_rerror_rate',
                    'attack_type', 'success_pred']

train_file_name = 'resources/ndd/KDDTrain+.txt'
test_file_name='resources/ndd/KDDTest+.txt'

print('loading dataset...')
train_df, test_df= load_dataset(train_file_name, test_file_name,header_names)


from collections import defaultdict

def load_dictionary(dictionary_name):
    category = defaultdict(list)
    category['benign'].append('normal')

    with open(dictionary_name, 'r') as f:
        for line in f.readlines():
            attack, cat = line.strip().split(' ')
            category[cat].append(attack)

    mapping_dict = dict((v, k)
                        for k in category for v in category[k])
    return mapping_dict

print('loading dictionary ...')
dict_file = 'resources/ndd/training_attack_types.txt'
dicta = load_dictionary(dict_file)

def adjust_datasets(train_ds, test_ds, dicta, att_type_lb, catg_attk_lab,drop_list_label):
    train_ds[catg_attk_lab] = train_ds[att_type_lb].map(lambda x: dicta[x])
    test_ds[catg_attk_lab] = test_ds[att_type_lb].map(lambda x: dicta[x])

    # drop every column listed in drop_list_label from both datasets
    for label in drop_list_label:
        train_ds.drop([label], axis=1, inplace=True)
        test_ds.drop([label], axis=1, inplace=True)
      
        
import numpy as np
def getColunssByClass(header_names):
    nominal_idx = [1, 2, 3]
    bin_idx = [6, 11, 13, 14, 20, 21]
    numeric_idx = list(set(range(41)).difference(nominal_idx).difference(bin_idx))
    col_names = np.array(header_names)
    nominal_cols = col_names[nominal_idx].tolist()
    binary_cols = col_names[bin_idx].tolist()
    numeric_cols = col_names[numeric_idx].tolist()
    return nominal_cols, binary_cols, numeric_cols


print('adjusting the dataset ...')
nominal_cols, binary_cols, numeric_cols = getColunssByClass(header_names)
drop_list_label=[]
drop_list_label.append('success_pred')
adjust_datasets(train_df,test_df,dicta,'attack_type', 'attack_category',drop_list_label)

loading dataset...
train shape: (125973, 43)
test shape: (22544, 43)
loading dictionary ...
adjusting the dataset ...


#### 2.2. Prepare and clean the dataset

Up to this point, the process is the same as in the 'cyber_cluster' Jupyter notebook. The next step is to clean the data and split the train and test datasets into labels and feature values.

In the cited notebook, the labels take these values: benign (regular traffic), DoS, probe, U2R, and R2L. However, because the resulting clusters are not very accurate, the new idea is to use only two values (attack and normal). This is sufficient because the final goal is to identify attacks, not to classify them by type.

The main difference between the previous **'prepare_data' method** and the current implementation is that we create a new label, "transaction_class", that indicates whether a record is an attack or benign (regular traffic).
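
The collapse from five categories to two can be sketched on a small, hypothetical sample of category labels. The variable names and the sample values below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample of five-class category labels.
sample_categories = pd.Series(['benign', 'dos', 'probe', 'benign', 'r2l', 'u2r'])

# Anything that is not 'benign' is treated as an attack.
sample_binary = sample_categories.apply(
    lambda c: 'normal' if c == 'benign' else 'attack')
print(sample_binary.value_counts())
```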

In [104]:
def cleaning_data (train_df, test_df, num_cols):
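    # assign the result back: a chained inplace replace may not update the frame in recent pandas
    train_df['su_attempted'] = train_df['su_attempted'].replace(2, 0)
    test_df['su_attempted'] = test_df['su_attempted'].replace(2, 0)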
    train_df.drop('num_outbound_cmds', axis=1, inplace=True)
    test_df.drop('num_outbound_cmds', axis=1, inplace=True)
    num_cols.remove('num_outbound_cmds')

print('cleaning data ...')
cleaning_data(train_df,test_df,numeric_cols)

def prepare_dataset(train_df, test_df, target_name,new_target_name, nominal_cols):

    #split the dataset into target_labels and values
    train_label = train_df[new_target_name]
    train_x_raw = train_df.drop([new_target_name, target_name], axis=1)

    test_label = test_df[new_target_name]
    test_x_raw = test_df.drop([new_target_name, target_name], axis=1)

    #convert the categorical values to numeric values
    combined_df_raw = pd.concat([train_x_raw, test_x_raw])
    combined_df = pd.get_dummies(combined_df_raw, columns=nominal_cols, drop_first=True)

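    # copy the slices so the later in-place scaling does not write into a view of combined_df
    train_x = combined_df[:len(train_x_raw)].copy()
    test_x = combined_df[len(train_x_raw):].copy()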
    
    # main change compared with the previous code: a binary 'normal'/'attack' label
    train_Y = train_label.apply(lambda x: 'normal' if x == 'benign' else 'attack')
    train_df['transaction_class'] = train_Y

    test_Y = test_label.apply(lambda x: 'normal' if x == 'benign' else 'attack')
    test_df['transaction_class'] = test_Y
   
    return train_x,train_label, test_x, test_label, train_Y, test_Y

print('preparing data ...')
train_x,train_label, test_x, test_label, train_Y, test_Y = prepare_dataset(train_df, test_df, 'attack_type','attack_category', nominal_cols)


from sklearn.preprocessing import StandardScaler
def scalling(train_x, test_x, num_cols):
    standard_scaler = StandardScaler().fit(train_x[num_cols])
    train_x[num_cols] = standard_scaler.transform(train_x[num_cols])
    test_x[num_cols] = standard_scaler.transform(test_x[num_cols])
    
    return train_x,  test_x

print('scalling data ...')
train_x,  test_x = scalling(train_x, test_x,numeric_cols)

from sklearn.preprocessing import LabelEncoder
def enconding_label(train_label, test_label):
    # encode the labels as numeric values
    # this is important for analyzing the result
    label_encoder = LabelEncoder()
    train_label_np = train_label.to_numpy()
    train_label_encod = label_encoder.fit_transform(train_label_np)

    test_label_np = test_label.to_numpy()
    # reuse the mapping learned on the train labels instead of refitting on the test labels
    test_label_encod = label_encoder.transform(test_label_np)

    return train_label_encod, test_label_encod, label_encoder

print('encoding data ...')
train_label_encod, test_label_encod, label_encoder = enconding_label(train_label,test_label)

cleaning data ...
preparing data ...
scalling data ...
encoding data ...
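
To illustrate why the encoder should be fitted once and then reused, here is a small sketch with made-up labels (the `demo_*` names are hypothetical). `LabelEncoder` sorts the classes it sees during fitting and assigns integer codes in that order, so reusing the fitted encoder keeps the train and test codes consistent.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label arrays, for illustration only.
demo_train = np.array(['benign', 'dos', 'probe', 'dos'])
demo_test = np.array(['probe', 'benign'])

demo_encoder = LabelEncoder()
demo_train_codes = demo_encoder.fit_transform(demo_train)  # learn the mapping on train
demo_test_codes = demo_encoder.transform(demo_test)        # reuse the same mapping

print(demo_encoder.classes_)
print(demo_train_codes, demo_test_codes)
# inverse_transform maps the codes back to the original strings
print(demo_encoder.inverse_transform(demo_test_codes))
```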


#### 2.3. Training and Predict values, using KMeans

In [105]:
from sklearn.cluster import KMeans

def trainning (train_x, cluster_number):
    kmeans = KMeans(n_clusters=cluster_number, random_state=17).fit(train_x)
    return kmeans

number_of_cluster = 5
print('running K-Means....')
kmeans = trainning(train_x,number_of_cluster)

running K-Means....
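
The following sketch, on synthetic data rather than NSL-KDD, shows the two ways a fitted `KMeans` model exposes cluster assignments: `labels_` for the rows it was trained on and `predict()` for new rows. The two blobs and all `demo_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
demo_x = np.vstack([blob_a, blob_b])

demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=17).fit(demo_x)

# labels_ holds the cluster of each training row; predict() assigns new rows.
demo_train_clusters = demo_kmeans.labels_
demo_new_clusters = demo_kmeans.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
print(demo_train_clusters)
print(demo_new_clusters)
```

Note that the cluster ids themselves are arbitrary: which blob becomes cluster 0 depends on the initialization, so only the grouping is meaningful.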


#### 2.4. Cross-Tabulation

After the model is trained, the next step is to generate the cross-tabulation. To do so, we first add a new column (*'kmeans_y'*) to the train and test datasets that holds the cluster assigned to each record.

In [119]:
print('setting the new labels ....')
# cluster assigned to each training row during fitting
kmeans_train_y = kmeans.labels_
train_df['kmeans_y'] = kmeans_train_y

kmeans_test_y = kmeans.predict(test_x)
test_df['kmeans_y'] = kmeans_test_y

pd.crosstab(test_df.kmeans_y, test_df.transaction_class)

setting the new labels ....


| kmeans_y / transaction_class | attack | normal |
|---|---|---|
| 0 | 485 | 526 |
| 1 | 543 | 958 |
| 2 | 2036 | 5 |
| 3 | 4643 | 8136 |
| 4 | 5126 | 86 |

### 3. Improve the Cluster Results Using a Decision Tree

Analyzing the result, we can see that ***cluster 2 is an attack cluster***: only 5 of its 2,041 test records are normal. We cannot assume the same for the other clusters, so the idea is to classify the records inside each of them using a Decision Tree.

First, we create a decision tree classifier and train it on the records that belong to a specific cluster.
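
The per-cluster idea can also be sketched generically as a loop that trains one tree per cluster id. This is only an illustration on a tiny synthetic frame, not the notebook's implementation; the helper `fit_tree_per_cluster` and the columns `f1`, `cluster_id`, and `target` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def fit_tree_per_cluster(df, feature_cols, cluster_col, target_col):
    """Train one decision tree per cluster id; returns {cluster_id: classifier}."""
    trees = {}
    for cluster_id, group in df.groupby(cluster_col):
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(group[feature_cols], group[target_col])
        trees[cluster_id] = tree
    return trees

# Tiny synthetic frame (illustrative only): the attack/normal boundary
# sits at a different feature value in each cluster.
demo_df = pd.DataFrame({
    'f1': [0.1, 0.2, 0.9, 1.0, 5.0, 5.1, 5.9, 6.0],
    'cluster_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'target': ['normal', 'normal', 'attack', 'attack',
               'attack', 'attack', 'normal', 'normal'],
})
demo_trees = fit_tree_per_cluster(demo_df, ['f1'], 'cluster_id', 'target')
print(demo_trees[0].predict(pd.DataFrame({'f1': [0.15]})))
```

Training a separate tree per cluster lets each tree learn a boundary that only has to hold within that cluster.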

In [122]:
from sklearn.tree import DecisionTreeClassifier

def train_decision_tree(train_x, train_Y):
    classifier = DecisionTreeClassifier(random_state=0)
    classifier.fit(train_x, train_Y)
    
    return classifier

print('defining the decision tree classifier...')

defining the decision tree classifier...


#### 3.1 Calculation for Cluster 0

In [123]:
from sklearn.metrics import confusion_matrix, zero_one_loss

train_y0 = train_df[train_df.kmeans_y==0]
train_y0=train_y0.drop(['kmeans_y'], axis=1)

test_y0 = test_df[test_df.kmeans_y==0]
test_y0=test_y0.drop(['kmeans_y'], axis=1)

print ('instantiating and fitting the decision tree to cluster 0')
dtc0 = train_decision_tree(train_y0[numeric_cols],train_y0['transaction_class'])

print ('prediction values to cluster 0')
dtc0_pred_y = dtc0.predict(test_y0[numeric_cols])

print ('confusion matrix to cluster 0')
conf_mat = confusion_matrix(test_y0['transaction_class'], dtc0_pred_y)
print(conf_mat)

instantiating and fitting the decision tree to cluster 0
prediction values to cluster 0
confusion matrix to cluster 0
[[309 176]
 [ 14 512]]
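
When reading this matrix, it helps to know the row and column order. In scikit-learn, rows are the true classes and columns are the predicted classes; when `labels` is not given, the classes are taken in sorted order, so with these string labels 'attack' comes before 'normal'. The small sketch below, on made-up predictions, passes `labels` explicitly to make that order visible.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
demo_true = ['attack', 'attack', 'normal', 'normal', 'normal']
demo_pred = ['attack', 'normal', 'normal', 'normal', 'attack']

# Row 0 = true 'attack', row 1 = true 'normal'; columns follow the same order.
demo_cm = confusion_matrix(demo_true, demo_pred, labels=['attack', 'normal'])
print(demo_cm)
```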


#### 3.2 Calculation for Cluster 1

In [124]:
train_y1 = train_df[train_df.kmeans_y==1]
train_y1=train_y1.drop(['kmeans_y'], axis=1)

test_y1 = test_df[test_df.kmeans_y==1]
test_y1=test_y1.drop(['kmeans_y'], axis=1)

print ('instantiating and fitting the decision tree to cluster 1')
dtc1 = train_decision_tree(train_y1[numeric_cols],train_y1['transaction_class'])

print ('prediction values to cluster 1')
dtc1_pred_y = dtc1.predict(test_y1[numeric_cols])

print ('confusion matrix to cluster 1')
conf_mat = confusion_matrix(test_y1['transaction_class'], dtc1_pred_y)
print(conf_mat)

instantiating and fitting the decision tree to cluster 1
prediction values to cluster 1
confusion matrix to cluster 1
[[543   0]
 [  4 954]]


#### 3.3 Calculation for Cluster 3

In [126]:
train_y3 = train_df[train_df.kmeans_y==3]
train_y3=train_y3.drop(['kmeans_y'], axis=1)

test_y3 = test_df[test_df.kmeans_y==3]
test_y3=test_y3.drop(['kmeans_y'], axis=1)

print ('instantiating and fitting the decision tree to cluster 3')
dtc3 = train_decision_tree(train_y3[numeric_cols],train_y3['transaction_class'])

print ('prediction values to cluster 3')
dtc3_pred_y = dtc3.predict(test_y3[numeric_cols])

print ('confusion matrix to cluster 3')
conf_mat = confusion_matrix(test_y3['transaction_class'], dtc3_pred_y)
print(conf_mat)

instantiating and fitting the decision tree to cluster 3
prediction values to cluster 3
confusion matrix to cluster 3
[[1496 3147]
 [ 193 7943]]
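
To compare the clusters, the confusion matrices printed so far can be turned into accuracy and attack recall. The sketch below copies the three matrices from the outputs above and assumes the default scikit-learn layout (rows are true classes, columns are predicted classes, in the order attack, normal); the names `matrices` and `summary` are hypothetical.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
# Assumed layout: rows = true class, columns = predicted class, order: attack, normal.
matrices = {
    0: np.array([[309, 176], [14, 512]]),
    1: np.array([[543, 0], [4, 954]]),
    3: np.array([[1496, 3147], [193, 7943]]),
}

summary = {}
for cluster_id, cm in matrices.items():
    accuracy = np.trace(cm) / cm.sum()      # correct predictions / all records
    attack_recall = cm[0, 0] / cm[0].sum()  # detected attacks / all true attacks
    summary[cluster_id] = (accuracy, attack_recall)
    print(f'cluster {cluster_id}: accuracy={accuracy:.3f}, '
          f'attack recall={attack_recall:.3f}')
```

Under these assumptions, the tree for cluster 1 is nearly perfect, while the tree for cluster 3 misses about two thirds of the attacks in that cluster, so accuracy alone would overstate how well it works.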


#### 3.4 Calculation for Cluster 4

In [None]:
train_y4 = train_df[train_df.kmeans_y==4]
train_y4=train_y4.drop(['kmeans_y'], axis=1)

test_y4 = test_df[test_df.kmeans_y==4]
test_y4=test_y4.drop(['kmeans_y'], axis=1)

print ('instantiating and fitting the decision tree to cluster 4')
dtc4 = train_decision_tree(train_y4[numeric_cols],train_y4['transaction_class'])

print ('prediction values to cluster 4')
dtc4_pred_y = dtc4.predict(test_y4[numeric_cols])

print ('confusion matrix to cluster 4')
conf_mat = confusion_matrix(test_y4['transaction_class'], dtc4_pred_y)
print(conf_mat)