# Exercise

Dataset: Aposemat IoT-23 (8.8 GB). <br>
Link: https://www.stratosphereips.org/datasets-iot23 <br>

**Activities:**<br>
Test at least 5 ML techniques, applying feature selection methods. <br>
Use K-Fold (K=3).<br>
Generate the following evaluation metrics: Accuracy, Recall, F1, Precision, Training Time and Detection Time.<br>

The dataset consists of 23 captures. The data extracted from these captures was combined into a single file, iot23_combined.csv, which is used for the activities below.

### Import and visualize dataset

In [3]:
import warnings
import pandas as pd
import numpy as np
# np.warnings was removed in NumPy 1.24; use the standard-library module instead
warnings.filterwarnings('ignore')

In [4]:
df = pd.read_csv('iot23_combined.csv')
df

Unnamed: 0,duration,orig_bytes,resp_bytes,missed_bytes,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label,proto_icmp,...,conn_state_RSTOS0,conn_state_RSTR,conn_state_RSTRH,conn_state_S0,conn_state_S1,conn_state_S2,conn_state_S3,conn_state_SF,conn_state_SH,conn_state_SHR
0,2.998796,0,0,0.0,3.0,180.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
1,0.000000,0,0,0.0,1.0,60.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
2,0.000000,0,0,0.0,1.0,60.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
3,2.998804,0,0,0.0,3.0,180.0,0.0,0.0,Benign,0,...,0,0,0,1,0,0,0,0,0,0
4,0.000000,0,0,0.0,1.0,60.0,0.0,0.0,Benign,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1444669,0.000000,0,0,0.0,1.0,40.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
1444670,0.000000,0,0,0.0,1.0,40.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
1444671,0.000000,0,0,0.0,1.0,40.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0
1444672,0.000000,0,0,0.0,1.0,40.0,0.0,0.0,PartOfAHorizontalPortScan,0,...,0,0,0,1,0,0,0,0,0,0


### K-fold (k=3) + Feature Selection + Model implementation

- K-fold cross-validation splits the dataset into training and test sets several times, so the metrics for each model can be computed on every split.
- Feature selection must be fitted after the training/test split, on the training fold only, and then applied to the matching test fold. Fitting the selector on the full dataset would leak information from the test data into the model.
- Metrics will be generated during cross-validation.
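One way to satisfy these points together is to wrap the selector and the model in a scikit-learn `Pipeline` and pass it to `cross_validate`, which refits the selector on each training fold. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the pairing of `SelectKBest` with a decision tree are assumptions for illustration, not the run recorded in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=10
)

# The selector is a pipeline step, so it is fitted on each training fold only
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=13)),
    ("model", DecisionTreeClassifier(random_state=10)),
])

folds = KFold(n_splits=3, shuffle=True, random_state=10)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv_demo = cross_validate(pipe, X_demo, y_demo, cv=folds, scoring=scoring)

# One score per fold for every metric, plus fit_time and score_time
for name in scoring:
    print(name, cv_demo[f"test_{name}"].round(3))
```

With this layout, `fit_time` and `score_time` in the result cover the whole pipeline (selection plus model), which is worth keeping in mind when comparing them with timings measured around the model alone.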

In [6]:
X = df.drop(['label'], axis=1)
y = df['label']

### Feature Selection

In [7]:
from sklearn.model_selection import KFold
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, f_classif, SelectPercentile

##### Variance Threshold

In [8]:
# Split dataset in 3-folds and shuffle (shuffle=True):
kf = KFold(n_splits=3, shuffle = True, random_state=10) 
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Drop low-variance features (threshold=0.01, fitted on the training fold)
    constant_filter = VarianceThreshold(threshold=0.01)
    constant_filter.fit(X_train)

    # get_support() flags the columns that are kept, so despite the label
    # the list printed here holds the retained features, not the constant ones
    print('\n Constant columns: ')
    constant_columns = X_train.columns[constant_filter.get_support()].tolist()
    print(constant_columns)
    
    X_train_new_vt = constant_filter.transform(X_train)
    X_test_new_vt = constant_filter.transform(X_test) 


 Constant columns: 
['duration', 'orig_bytes', 'resp_bytes', 'missed_bytes', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_S0', 'conn_state_SF']

 Constant columns: 
['duration', 'orig_bytes', 'resp_bytes', 'missed_bytes', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_S0', 'conn_state_SF']

 Constant columns: 
['duration', 'orig_bytes', 'resp_bytes', 'missed_bytes', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_S0', 'conn_state_SF']
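The lists printed above are the columns that pass the filter: `get_support()` returns a mask of retained features, not of removed ones. A toy example makes the direction of the mask explicit; the `toy` frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "const": [5.0] * 6,                         # variance 0
    "varied": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],   # variance well above 0.01
    "binary": [0, 1, 0, 1, 0, 1],               # variance 0.25
})

vt = VarianceThreshold(threshold=0.01).fit(toy)
mask = vt.get_support()                  # True marks a retained column

kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()
print("kept:", kept)        # kept: ['varied', 'binary']
print("dropped:", dropped)  # dropped: ['const']
```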


##### SelectKBest

In [9]:
kf = KFold(n_splits=3, shuffle = True, random_state=10) 
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Select best features with SelectKBest technique 
    select = SelectKBest(f_classif, k=13)
    X_train_skb = select.fit_transform(X_train, y_train)
    X_test_skb = select.transform(X_test)

    # get_support() flags the k columns that are kept, so the list printed
    # under this label holds the selected features
    print('\n Constant columns: ')
    constant_columns = X_train.columns[select.get_support()].tolist()
    print(constant_columns)


 Constant columns: 
['duration', 'missed_bytes', 'resp_pkts', 'proto_icmp', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTO', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']

 Constant columns: 
['duration', 'resp_bytes', 'missed_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']

 Constant columns: 
['duration', 'missed_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_icmp', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']


##### SelectPercentile

In [10]:
kf = KFold(n_splits=3, shuffle = True, random_state=10) 
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Select best features with SelectPercentile technique 
    select = SelectPercentile(f_classif, percentile=55)
    X_train_sp = select.fit_transform(X_train, y_train)
    X_test_sp = select.transform(X_test)

    # get_support() flags the columns that are kept, so the list printed
    # under this label holds the selected features
    print('\n Constant columns: ')
    constant_columns = X_train.columns[select.get_support()].tolist()
    print(constant_columns)


 Constant columns: 
['duration', 'missed_bytes', 'resp_pkts', 'proto_icmp', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTO', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']

 Constant columns: 
['duration', 'resp_bytes', 'missed_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']

 Constant columns: 
['duration', 'missed_bytes', 'resp_pkts', 'resp_ip_bytes', 'proto_icmp', 'proto_tcp', 'proto_udp', 'conn_state_OTH', 'conn_state_REJ', 'conn_state_RSTR', 'conn_state_S0', 'conn_state_S3', 'conn_state_SF']


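Within each fold, SelectKBest (k=13) and SelectPercentile (percentile=55) select the same 13 features. The selected set does vary slightly from fold to fold: for example, the first fold keeps `proto_icmp` and `conn_state_RSTO`, while the second keeps `resp_bytes` and `resp_ip_bytes` instead.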
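The agreement is expected, since both selectors rank features by the same `f_classif` score and differ only in how the cut-off is expressed. A minimal sketch for checking this programmatically follows; it uses synthetic data with an assumed 24 features (`X_demo`, `y_demo` are stand-ins, not the IoT-23 data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic stand-in with 24 features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=24, n_informative=8, n_classes=3, random_state=10
)

kbest_mask = SelectKBest(f_classif, k=13).fit(X_demo, y_demo).get_support()
pct_mask = SelectPercentile(f_classif, percentile=55).fit(X_demo, y_demo).get_support()

# Both selectors use the same scores, so they agree whenever the
# percentile cut-off keeps the same number of features as k
same_selection = bool(np.array_equal(kbest_mask, pct_mask))
print("features kept:", int(kbest_mask.sum()), int(pct_mask.sum()))
print("identical selection:", same_selection)
```

Running the same comparison inside the K-fold loop, with the masks of the two fitted selectors, would confirm the observation per fold.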

### Evaluating Models

- Decision Tree;
- Gradient Boosting (scikit-learn's `GradientBoostingClassifier`, used here in place of the XGBoost library);
- Random Forest;
- Logistic Regression;
- Gaussian Naive Bayes;
- MultiLayer Perceptron Classifier

In [11]:
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
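Every model cell in this section repeats the same pattern: time the fit, time the prediction, print the report. As a sketch, that pattern could be folded into one helper. The name `timed_fit_predict` and the synthetic data are hypothetical; the helper uses `time.perf_counter()` and keeps `print` calls outside the timed regions so that only `fit` and `predict` are measured.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier


def timed_fit_predict(model, X_train, y_train, X_test):
    """Fit `model`, predict on X_test, return (y_pred, train_seconds, detect_seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    detect_seconds = time.perf_counter() - start
    return y_pred, train_seconds, detect_seconds


# Synthetic stand-in for one train/test split (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=10)
X_tr, X_te, y_tr, y_te = X_demo[:300], X_demo[300:], y_demo[:300], y_demo[300:]

y_pred, train_s, detect_s = timed_fit_predict(
    DecisionTreeClassifier(random_state=10), X_tr, y_tr, X_te
)
print(f"Training time: {train_s:.2f}s, detection time: {detect_s:.2f}s")
print(classification_report(y_te, y_pred, zero_division=0))
```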

#### Decision Tree

In [14]:
# Selected features: Variance Threshold
# (X_train_new_vt / X_test_new_vt hold the split from the last fold of the loop above)

# Training time - start:
start_treino = time.time()

# DT
DT = DecisionTreeClassifier()
DT.fit(X_train_new_vt,y_train) 
    
# Training time - end:
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

# Detection time - start:
start_deteccao = time.time()

print('\n Prediction:')
y_pred = DT.predict(X_test_new_vt)
print(y_pred)

# Detection time - end:
end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 1.15s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 0.05s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.99      1.00      1.00      1262
                   Benign       0.96      0.55      0.70     66064
                      C&C       0.56      0.13      0.22      5092
         C&C-FileDownload       0.71      0.88      0.79        17
            C&C-HeartBeat       0.72      0.53      0.61       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.00      0.00      0.00         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       1.00      0.33      0.50         3
                    Okiru       0.70      0.00      0.00     87526


In [15]:
# Selected features: SelectKBest

start_treino = time.time()

DT = DecisionTreeClassifier()
DT.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = DT.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 1.11s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 0.04s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.97      1.00      0.98      1262
                   Benign       0.96      0.55      0.70     66064
                      C&C       0.55      0.13      0.21      5092
         C&C-FileDownload       0.65      0.88      0.75        17
            C&C-HeartBeat       0.70      0.52      0.60       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.00      0.00      0.00         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       1.00      0.33      0.50         3
                    Okiru       0.10      0.00      0.00     87526


#### Gradient Boosting

In [16]:
# Selected features: Variance Threshold

start_treino = time.time()

# Gradient boosting via scikit-learn's GradientBoostingClassifier (not the XGBoost library)
clf = GradientBoostingClassifier()
clf.fit(X_train_new_vt,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')


start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_new_vt)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection Time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 437.22s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection Time: 5.21s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       1.00      0.96      0.98      1262
                   Benign       0.96      0.51      0.67     66064
                      C&C       0.26      0.14      0.18      5092
         C&C-FileDownload       0.60      0.71      0.65        17
            C&C-HeartBeat       0.05      0.59      0.10       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.33      0.11      0.17         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       0.00      0.00      0.00         3
                    Okiru       0.00      0.00      0.00     8752

In [18]:
# Selected features: SelectKBest

start_treino = time.time()

clf = GradientBoostingClassifier()
clf.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 435.24s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 5.1s

 Classification Report :
                            precision    recall  f1-score   support

                    Attack       0.97      1.00      0.98      1262
                    Benign       0.95      0.56      0.70     66064
                       C&C       0.98      0.11      0.20      5092
          C&C-FileDownload       0.62      0.29      0.40        17
             C&C-HeartBeat       0.00      0.00      0.00       114
C&C-HeartBeat-FileDownload       0.00      0.00      0.00         0
                 C&C-Mirai       0.00      0.00      0.00         1
                 C&C-Torii       0.33      0.11      0.17         9
                      DDoS       1.00      0.82      0.90     46192
              FileDownload       0.33      0.33      0.3

#### Random Forest

In [19]:
# Variance Threshold

start_treino = time.time()

clf = RandomForestClassifier()
clf.fit(X_train_new_vt,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_new_vt)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 24.91s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 4.31s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       1.00      1.00      1.00      1262
                   Benign       0.96      0.55      0.70     66064
                      C&C       0.53      0.14      0.22      5092
         C&C-FileDownload       0.77      1.00      0.87        17
            C&C-HeartBeat       0.65      0.53      0.58       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.50      0.11      0.18         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       1.00      0.67      0.80         3
                    Okiru       0.72      0.00      0.00     87526

In [20]:
# SelectKBest

start_treino = time.time()

clf = RandomForestClassifier()
clf.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 22.92s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 4.04s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.97      1.00      0.99      1262
                   Benign       0.96      0.55      0.70     66064
                      C&C       0.51      0.14      0.22      5092
         C&C-FileDownload       0.71      1.00      0.83        17
            C&C-HeartBeat       0.68      0.52      0.59       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.00      0.00      0.00         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       1.00      0.33      0.50         3
                    Okiru       0.09      0.00      0.00     87526

#### Logistic Regression

In [24]:
# Variance Threshold

start_treino = time.time()

clf = LogisticRegression()
clf.fit(X_train_new_vt,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_new_vt)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 93.0s

 Prediction:
['Benign' 'Benign' 'Benign' ... 'Benign' 'Benign' 'Benign']

 Detection time: 0.05s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.00      0.00      0.00      1262
                   Benign       0.14      0.92      0.24     66064
                      C&C       0.00      0.00      0.00      5092
         C&C-FileDownload       0.47      1.00      0.64        17
            C&C-HeartBeat       0.00      0.00      0.00       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.00      0.00      0.00         9
                     DDoS       0.01      0.00      0.00     46192
             FileDownload       0.00      0.00      0.00         3
                    Okiru       0.00      0.00      0.00     87526
PartOfAHorizontalPortScan       0.01      0.00      0.00    275278

                 accuracy                       

In [23]:
# SelectKBest

start_treino = time.time()

clf = LogisticRegression()
clf.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 118.13s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 0.05s

 Classification Report :
                            precision    recall  f1-score   support

                    Attack       0.67      0.23      0.34      1262
                    Benign       0.12      0.00      0.00     66064
                       C&C       0.01      0.00      0.00      5092
          C&C-FileDownload       0.50      0.47      0.48        17
             C&C-HeartBeat       0.00      0.00      0.00       114
C&C-HeartBeat-FileDownload       0.00      0.00      0.00         0
                 C&C-Mirai       0.00      0.00      0.00         1
                 C&C-Torii       0.00      0.11      0.01         9
                      DDoS       0.00      0.00      0.00     46192
              FileDownload       0.00      0.00      0.

#### Naive Bayes

In [25]:
# Variance Threshold

start_treino = time.time()

clf = GaussianNB()
clf.fit(X_train_new_vt,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_new_vt)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 0.96s

 Prediction:
['Okiru' 'Okiru' 'Okiru' ... 'Okiru' 'Okiru' 'Okiru']

 Detection time: 0.54s

 Classification Report :
                            precision    recall  f1-score   support

                    Attack       0.90      0.96      0.93      1262
                    Benign       0.86      0.00      0.00     66064
                       C&C       0.46      0.11      0.18      5092
          C&C-FileDownload       0.54      0.76      0.63        17
             C&C-HeartBeat       0.00      0.00      0.00       114
C&C-HeartBeat-FileDownload       0.00      0.00      0.00         0
                 C&C-Mirai       0.00      0.00      0.00         1
                 C&C-Torii       0.17      0.11      0.13         9
                      DDoS       0.60      0.00      0.00     46192
              FileDownload       0.20      0.33      0.25         3
                     Okiru       0.18      1.00      0.31     87526
 PartOfAHorizontalPortScan       0.44      

In [26]:
# SelectKBest

start_treino = time.time()

clf = GaussianNB()
clf.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 1.09s

 Prediction:
['Okiru' 'DDoS' 'Okiru' ... 'Okiru' 'Okiru' 'Okiru']

 Detection time: 0.6s

 Classification Report :
                            precision    recall  f1-score   support

                    Attack       0.83      0.96      0.89      1262
                    Benign       0.58      0.00      0.00     66064
                       C&C       0.57      0.11      0.18      5092
          C&C-FileDownload       0.58      0.82      0.68        17
             C&C-HeartBeat       0.08      0.18      0.11       114
C&C-HeartBeat-FileDownload       0.00      0.00      0.00         0
                 C&C-Mirai       0.00      0.00      0.00         1
                 C&C-Torii       0.14      0.11      0.12         9
                      DDoS       0.47      0.82      0.60     46192
              FileDownload       1.00      0.33      0.50         3
                     Okiru       0.22      1.00      0.36     87526
 PartOfAHorizontalPortScan       0.05      0.

#### MLP Classifier

In [27]:
# Variance Threshold

start_treino = time.time()

clf = MLPClassifier()
clf.fit(X_train_new_vt,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_new_vt)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 112.33s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 0.39s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.78      0.96      0.86      1262
                   Benign       0.95      0.55      0.70     66064
                      C&C       0.73      0.08      0.15      5092
         C&C-FileDownload       0.40      0.71      0.51        17
            C&C-HeartBeat       0.00      0.00      0.00       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.50      0.11      0.18         9
                     DDoS       0.98      0.82      0.89     46192
             FileDownload       0.20      0.67      0.31         3
                    Okiru       0.65      0.00      0.00     8752

In [28]:
# SelectKBest

start_treino = time.time()

clf = MLPClassifier()
clf.fit(X_train_skb,y_train) 
    
end_treino = time.time()
print(f'\n Training time: {round(end_treino - start_treino,2)}s')

start_deteccao = time.time()

print('\n Prediction:')
y_pred = clf.predict(X_test_skb)
print(y_pred)

end_deteccao = time.time()
print(f'\n Detection time: {round(end_deteccao - start_deteccao,2)}s')

print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Training time: 90.35s

 Prediction:
['PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' ... 'PartOfAHorizontalPortScan'
 'PartOfAHorizontalPortScan' 'PartOfAHorizontalPortScan']

 Detection time: 0.34s

 Classification Report :
                           precision    recall  f1-score   support

                   Attack       0.00      0.00      0.00      1262
                   Benign       0.90      0.56      0.69     66064
                      C&C       0.00      0.00      0.00      5092
         C&C-FileDownload       0.00      0.00      0.00        17
            C&C-HeartBeat       0.00      0.00      0.00       114
                C&C-Mirai       0.00      0.00      0.00         1
                C&C-Torii       0.00      0.00      0.00         9
                     DDoS       1.00      0.82      0.90     46192
             FileDownload       0.00      0.00      0.00         3
                    Okiru       0.00      0.00      0.00     87526

### Another way to verify metrics using cross_validate and RFE (Recursive Feature Elimination) feature selection.

- Decision Tree

In [30]:
X = df.drop(['label'], axis=1)
y = df['label']

In [36]:
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Decision Tree

In [37]:
metricas = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

folds = KFold(n_splits = 3, shuffle = True, random_state = 10)

estimator = DecisionTreeClassifier()

selector = RFE(estimator, n_features_to_select=13)

cv_results = cross_validate(selector, X, y, cv=folds, scoring=metricas)
cv_results

{'fit_time': array([15.86511731, 15.57997274, 16.05305052]),
 'score_time': array([6.1320951 , 6.03219414, 6.39355588]),
 'test_accuracy': array([0.72975841, 0.72918942, 0.7295466 ]),
 'test_precision_macro': array([0.749764  , 0.74989071, 0.70602235]),
 'test_recall_macro': array([0.54313935, 0.59549495, 0.48680803]),
 'test_f1_macro': array([0.57035252, 0.59451515, 0.51490568])}
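The per-fold arrays are easier to compare as a mean and standard deviation. The sketch below copies the test scores from the output above into a plain dictionary (the name `cv_scores` is an assumption) and summarizes them.

```python
import numpy as np

# Test scores copied from the cross_validate output above
cv_scores = {
    "test_accuracy": np.array([0.72975841, 0.72918942, 0.7295466]),
    "test_precision_macro": np.array([0.749764, 0.74989071, 0.70602235]),
    "test_recall_macro": np.array([0.54313935, 0.59549495, 0.48680803]),
    "test_f1_macro": np.array([0.57035252, 0.59451515, 0.51490568]),
}

# Mean and standard deviation across the 3 folds for each metric
summary = {name: (scores.mean(), scores.std()) for name, scores in cv_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

Accuracy is stable at about 0.73 across folds, while macro recall averages about 0.54. The macro average weights every class equally, so this gap is consistent with weak recall on several classes that accuracy alone hides.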

In [38]:
kf = KFold(n_splits=3, shuffle = True, random_state=10) 
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    estimator = DecisionTreeClassifier()
    
    selector = RFE(estimator, n_features_to_select=13)
    selector = selector.fit(X_train, y_train)
    
    constant_columns = [column for column in X_train.columns if column in X_train.columns[selector.get_support()]]
    
    X_train_new = selector.transform(X_train)
    X_test_new = selector.transform(X_test) 

# These lines run after the loop, so the report covers the last fold only
DT = DecisionTreeClassifier()
DT.fit(X_train_new,y_train) 
y_pred = DT.predict(X_test_new)
print("\n Classification Report :")
print(classification_report(y_test, y_pred, zero_division=0))


 Classification Report :
                            precision    recall  f1-score   support

                    Attack       1.00      1.00      1.00      1262
                    Benign       0.96      0.55      0.70     66064
                       C&C       0.56      0.13      0.22      5092
          C&C-FileDownload       0.71      0.71      0.71        17
             C&C-HeartBeat       0.72      0.53      0.61       114
C&C-HeartBeat-FileDownload       0.00      0.00      0.00         0
                 C&C-Mirai       0.00      0.00      0.00         1
                 C&C-Torii       0.50      0.11      0.18         9
                      DDoS       1.00      0.82      0.90     46192
              FileDownload       0.50      0.67      0.57         3
                     Okiru       0.72      0.00      0.00     87526
 PartOfAHorizontalPortScan       0.68      1.00      0.81    275278

                  accuracy                           0.73    481558
                 macr