## README ##
This repository presents a solution for detecting anomalies in data center networks. It uses the Grey Wolf Optimizer (GWO) [1] to select features and Machine Learning (ML) algorithms to build models that use as few features as possible while still reaching good accuracy. We compare the results with a baseline that uses no feature selection.

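This code implements the solution reported in the paper "Detecção Eficiente de Anomalias em Redes de Data Centers Apoiada por Aprendizado de Máquina e Otimizador do Lobo Cinzento para Seleção de Atributos" ("Efficient Anomaly Detection in Data Center Networks Supported by Machine Learning and the Grey Wolf Optimizer for Feature Selection"), published at SBRC 2023.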

# Features #
 - Creates detection models with ML algorithms using as few features as possible, through feature selection.
 - Uses GWO to reduce the number of features needed to build the models without a significant impact on model accuracy.
 - Improves the utilization of network resources, because less data needs to be transferred for classification.
 - Improves the utilization of processing resources, because fewer features need to be analyzed.
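GWO moves wolves through a continuous search space, so each position has to be turned into a set of feature indices before a model can be trained. The notebook's code does this by thresholding every coordinate at roughly 0.5. The sketch here illustrates that step on its own; the function name `select_features` and the default threshold are assumptions made for the example, not part of the original code.

```python
import numpy as np


def select_features(position, threshold=0.5):
    """Return the indices of the coordinates that reach the threshold.

    `position` holds one value per feature, typically in [0, 1].
    The name and the default threshold are illustrative assumptions.
    """
    return np.where(np.asarray(position) >= threshold)[0]


if __name__ == "__main__":
    # A hypothetical wolf position over four features
    print(select_features([0.1, 0.5, 0.9, 0.49]))
```

The resulting index array can be passed to `DataFrame.iloc[:, indices]` to keep only the selected columns.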

# How to run #


To run this code, you must first download the following datasets:

*   **KDDCup99**: file 'kddcup.data_10_percent.gz', available at 'https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html'. This file contains 10% of the flows in the full dataset, and it is the one we use in our experiments.
* **UNSW-NB15**: file 'UNSW_NB15_training-set.csv', available at 'https://research.unsw.edu.au/projects/unsw-nb15-dataset'. It contains 175,341 records in total.

Save them to your Google Drive, then set their paths at the places marked '# Load the data' in the functions 'load_KDD99()' and 'load_UNSW_NB15()', respectively.
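Before running the experiments, it can help to confirm that the files are where the loaders expect them. This is a minimal sketch under stated assumptions: `DATA_DIR` is a hypothetical folder, and `dataset_available` is a helper introduced only for this example. Replace the folder with the one you actually used.

```python
from pathlib import Path

# Hypothetical location; replace it with the folder you used on Google Drive.
DATA_DIR = Path("/content/drive/MyDrive/datasets")


def dataset_available(path):
    """Return True if the file exists and is not empty."""
    path = Path(path)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    for name in ("kddcup.data_10_percent.gz", "UNSW_NB15_training-set.csv"):
        status = "found" if dataset_available(DATA_DIR / name) else "missing"
        print(f"{name}: {status}")
```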



Once the datasets are set, the code first runs an experiment without feature selection (that is, without the Grey Wolf Optimizer), which serves as the reference for comparison. It then runs two experiments in which GWO selects the features: one whose fitness function is based on the F1-score and one whose fitness function is based on accuracy.
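Both fitness functions in this notebook share one shape: a weighted sum of the model score (accuracy or F1-score) and the inverse of the number of selected features, negated because GWO minimizes its objective. The sketch here isolates that formula. The name `weighted_fitness` and the handling of an empty selection are assumptions added for illustration; the weight of 0.9 matches the value used in the code.

```python
def weighted_fitness(score, n_selected, w=0.9):
    """Fitness to minimize: rewards a high score and a small feature subset.

    score: accuracy or F1-score in [0, 1]
    n_selected: number of selected features
    w: weight given to the score (0.9 in the notebook)
    """
    if n_selected == 0:
        # Assumption: an empty subset cannot train a model, so return the worst value.
        return float("inf")
    return -(w * score + (1 - w) * (1 / n_selected))


if __name__ == "__main__":
    # Same score, fewer features: the smaller subset gets the lower (better) value.
    print(weighted_fitness(0.98, 5))
    print(weighted_fitness(0.98, 50))
```

With `w = 0.9`, the score dominates the result, and the feature-count term mainly separates subsets that reach similar scores.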

<br>

# Contact #

Henrique Salvador (PhD Student)

e-mail: henriquesalvador@ime.usp.br


<br>

# Reference #
[1] Seyedali Mirjalili, Seyed Mohammad Mirjalili, Andrew Lewis. Grey Wolf Optimizer. **Advances in Engineering Software**. Volume 69, 2014, Pages 46-61, ISSN 0965-9978. https://doi.org/10.1016/j.advengsoft.2013.12.007

In [None]:
import random
import numpy
import math
import time

import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score


from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:


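def GWO(objf,lb,ub,dim,SearchAgents_no,Max_iter):
    """Minimize objf with the Grey Wolf Optimizer and return the best position found.

    objf: objective function that receives a position vector and returns a score to minimize
    lb, ub: lower and upper bounds (a scalar or a list with one value per dimension)
    dim: number of dimensions (here, the number of features)
    SearchAgents_no: number of wolves (search agents)
    Max_iter: number of iterations
    """

    # Initialize the positions and scores of the alpha, beta, and delta wolves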
    Alpha_pos=numpy.zeros(dim)
    Alpha_score=float("inf")

    Beta_pos=numpy.zeros(dim)
    Beta_score=float("inf")

    Delta_pos=numpy.zeros(dim)
    Delta_score=float("inf")

    if not isinstance(lb, list):
        lb = [lb] * dim
    if not isinstance(ub, list):
        ub = [ub] * dim

    #Initialize the positions of search agents
    Positions = numpy.zeros((SearchAgents_no, dim))
    for i in range(dim):
        Positions[:, i] = numpy.random.uniform(0,1, SearchAgents_no) * (ub[i] - lb[i]) + lb[i]     # => RANDOM UNIFORM
    Convergence_curve=numpy.zeros(Max_iter)

    # Report which objective function is being optimized
    print("GWO is optimizing  \""+objf.__name__+"\"")

    timerStart=time.time()
    # Main loop
    for l in range(0,Max_iter):
        for i in range(0,SearchAgents_no):

            # Return back the search agents that go beyond the boundaries of the search space
            for j in range(dim):
                Positions[i,j]=numpy.clip(Positions[i,j], lb[j], ub[j])

            # Calculate objective function for each search agent
            fitness=objf(Positions[i,:])

            # Update Alpha, Beta and Delta wolves
            if fitness<Alpha_score :
                Alpha_score=fitness; # Update alpha
                Alpha_pos=Positions[i,:].copy()


            if (fitness>Alpha_score and fitness<Beta_score ):
                Beta_score=fitness  # Update beta
                Beta_pos=Positions[i,:].copy()


            if (fitness>Alpha_score and fitness>Beta_score and fitness<Delta_score):
                Delta_score=fitness # Update delta
                Delta_pos=Positions[i,:].copy()




        a=2-l*((2)/Max_iter); # a decreases linearly from 2 to 0

        # Update the Position of search agents including omegas
        for i in range(0,SearchAgents_no):  # number of wolves (agents)
            for j in range (0,dim):         #dim = n_features

                r1=random.random() # r1 is a random number in [0,1]
                r2=random.random() # r2 is a random number in [0,1]

                A1=2*a*r1-a; # Equation (3.3)
                C1=2*r2; # Equation (3.4)

                D_alpha=abs(C1*Alpha_pos[j]-Positions[i,j]); # Equation (3.5)-part 1
                X1=Alpha_pos[j]-A1*D_alpha; # Equation (3.6)-part 1

                r1=random.random()
                r2=random.random()

                A2=2*a*r1-a; # Equation (3.3)
                C2=2*r2; # Equation (3.4)

                D_beta=abs(C2*Beta_pos[j]-Positions[i,j]); # Equation (3.5)-part 2
                X2=Beta_pos[j]-A2*D_beta; # Equation (3.6)-part 2

                r1=random.random()
                r2=random.random()

                A3=2*a*r1-a; # Equation (3.3)
                C3=2*r2; # Equation (3.4)

                D_delta=abs(C3*Delta_pos[j]-Positions[i,j]); # Equation (3.5)-part 3
                X3=Delta_pos[j]-A3*D_delta; # Equation (3.6)-part 3

                Positions[i,j]=(X1+X2+X3)/3  # Equation (3.7)




        Convergence_curve[l]=Alpha_score;

        #if (l%1==0):
               #print(['At iteration '+ str(l)+ ' the best fitness is '+ str(Alpha_score)]);
              # print('alpha:', numpy.where(Alpha_pos>0.5)[0])

    timerEnd=time.time()
    print('Completed in', (timerEnd - timerStart))


    return Alpha_pos
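The inner loop of `GWO` applies the update that its comments label as equations (3.3) to (3.7) one coordinate at a time. As an illustration, the same update can be written for a whole wolf at once with NumPy. This is a sketch under stated assumptions, not a replacement for the function: the name `gwo_update` and the `rng` argument are introduced only for this example.

```python
import numpy as np


def gwo_update(position, alpha, beta, delta, a, rng):
    """Compute one new position for a single wolf, vectorized over dimensions.

    position, alpha, beta, delta: 1-D arrays of the same length
    a: exploration coefficient (decreases from 2 to 0 over the iterations)
    rng: a numpy.random.Generator used to draw r1 and r2
    """
    candidates = []
    for leader in (alpha, beta, delta):
        r1 = rng.random(position.shape)
        r2 = rng.random(position.shape)
        A = 2 * a * r1 - a                  # Equation (3.3)
        C = 2 * r2                          # Equation (3.4)
        D = np.abs(C * leader - position)   # Equation (3.5)
        candidates.append(leader - A * D)   # Equation (3.6)
    return np.mean(candidates, axis=0)      # Equation (3.7)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wolf = np.array([0.9, 0.1, 0.4])
    alpha = np.array([0.8, 0.2, 0.6])
    beta = np.array([0.7, 0.3, 0.5])
    delta = np.array([0.6, 0.1, 0.7])
    print(gwo_update(wolf, alpha, beta, delta, a=1.0, rng=rng))
```

When `a` reaches 0, every `A` is 0 and the new position is simply the mean of the three leaders, which is why the search narrows toward the best wolves in late iterations.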



In [None]:
num_features = -1


def fitness_function_accur(positions):
    features = np.where(positions>=0.4999)[0] # A feature is selected when its position value is at least 0.4999 (approximately 0.5)
    print('selected_features (partial):', features)
    train_xf = train_x.iloc[:, features]
    test_xf = test_x.iloc[:, features]

    knn_classifier = KNeighborsClassifier(n_neighbors=7)
    knn_classifier.fit(train_xf, train_y)
    accuracy = knn_classifier.score(test_xf, test_y)
    print('Accuracy (partial):', accuracy)
    print("------------ ")
    w = 0.9
    return -(w*accuracy + (1-w) * 1/(len(features)))


def fitness_function_f1(positions):
    features = np.where(positions>=0.4999)[0] # A feature is selected when its position value is at least 0.4999 (approximately 0.5)
    print('selected_features (partial):', features)
    train_xf = train_x.iloc[:, features]
    test_xf = test_x.iloc[:, features]
    knn_classifier = KNeighborsClassifier(n_neighbors=7)
    knn_classifier.fit(train_xf, train_y)
    result = knn_classifier.predict(test_xf)
    f1 = f1_score(test_y, result, average='weighted')
    print("F1-Score (partial): ", f1)
    print("------------ ")
    y = 0.9
    return -(y*f1 + (1-y)* 1/(len(features)))

def load_KDD99():
    print("########### Loading KDD Cup 99 dataset ##############")
    # Load the data
    # The KDD Cup 99 dataset has 41 features
    n_features = 41
    # Set the path below to the Google Drive folder where you saved the dataset.
    df_full = pd.read_csv("/content/drive/MyDrive/04_experimentos/01_gwo/kddcup.data_10_percent_corrected", header=None)
    df = df_full.sample(frac=.05, replace=True, random_state=1)
    print(df.shape)

    #Transforming data
    # Categorize columns: "protocols", "services", "flags", "attacks"
    df[1], protocols= pd.factorize(df[1])
    df[2], services = pd.factorize(df[2])
    df[3], flags    = pd.factorize(df[3])
    df[41], attacks = pd.factorize(df[41])

    #Split the dataset on train_data (75%) and test_data (25%)
    train_data, test_data = train_test_split(df)

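    # Use the local n_features (not the global num_features) so the label column is never included in the features
    return n_features, train_data.iloc[:, :n_features], test_data.iloc[:, :n_features], train_data.iloc[:, -1], test_data.iloc[:, -1]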


def load_UNSW_NB15():
    print("########### Loading UNSW-NB15 dataset ##############")
    # Load the data
    n_features = 58
    df_full = pd.read_csv('/content/drive/MyDrive/05_UNSW_NB15/datasets/UNSW_NB15.csv')
    #df = df_full
    df = df_full.sample(frac=0.2, replace=True, random_state=1)

    # Treat '-' in the 'service' column as a missing value and drop the rows that contain it
    df['service'] = df['service'].replace('-',np.nan)
    df.dropna(inplace=True)

    features = pd.read_csv('/content/drive/MyDrive/05_UNSW_NB15/datasets/UNSW_NB15_features.csv')
    # selecting column names of all data types
    nominal_names = features['Name'][features['Type ']=='nominal']
    integer_names = features['Name'][features['Type ']=='integer']
    binary_names = features['Name'][features['Type ']=='binary']
    float_names = features['Name'][features['Type ']=='float']

    # selecting common column names from dataset and feature dataset
    cols = df.columns
    nominal_names = cols.intersection(nominal_names)
    integer_names = cols.intersection(integer_names)
    binary_names = cols.intersection(binary_names)
    float_names = cols.intersection(float_names)

    # Converting integer columns to numeric (the result must be assigned back to take effect)
    for c in integer_names:
      df[c] = pd.to_numeric(df[c])

    # Converting binary columns to numeric
    for c in binary_names:
      df[c] = pd.to_numeric(df[c])

    # Converting float columns to numeric
    for c in float_names:
      df[c] = pd.to_numeric(df[c])

    num_col = df.select_dtypes(include='number').columns

    # selecting categorical data attributes
    cat_col = df.columns.difference(num_col)
    # Skip the first name in sorted order ('attack_cat' here) so that it is not one-hot encoded
    cat_col = cat_col[1:]

    # creating a DF with only categorical attributes
    data_cat = df[cat_col].copy()

    ### Convert categorical attributes to binary ones, named as "column_value"
    data_cat = pd.get_dummies(data_cat,columns=cat_col)
    df = pd.concat([df, data_cat],axis=1)
    df.drop(columns=cat_col,inplace=True)


    ## Data Normalization
    # selecting numeric attributes columns from data
    num_col = list(df.select_dtypes(include='number').columns)
    num_col.remove('id')
    num_col.remove('label')
    print(num_col)

    # Normalizing data
    minmax_scale = MinMaxScaler(feature_range=(0, 1))
    def normalization(df,col):
      for i in col:
        arr = df[i]
        arr = np.array(arr)
        df[i] = minmax_scale.fit_transform(arr.reshape(len(arr),1))
      return df

    df = normalization(df.copy(),num_col)
    X = df.drop(columns=['label','attack_cat','id'],axis=1)
    Y = df['label']
    X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.25, random_state=1)

    return n_features,  X_train,X_test,y_train,y_test



# To run the experiments with the KDD Cup 99 dataset, uncomment the line below and comment out the 'load_UNSW_NB15()' line that follows it.
#num_features, train_x, test_x, train_y, test_y = load_KDD99()

# To run the experiments with the UNSW-NB15 dataset, keep the line below uncommented and comment out the 'load_KDD99()' line above.
num_features, train_x, test_x, train_y, test_y = load_UNSW_NB15()


print ('train_x shape:', train_x.shape)
print ('train_y shape:', train_y.shape)
print ('test_x shape:', test_x.shape)
print ('test_y shape:', test_y.shape)

print ('############ Begin  ############ Result without GWO:')
timerStart=time.time()
# Baseline: train and evaluate KNN once on all features (no feature selection)
knn_classifier = KNeighborsClassifier(n_neighbors=7)
knn_classifier.fit(train_x, train_y)

predicted = knn_classifier.predict(test_x)
print('accuracy_score (final):',accuracy_score(test_y, predicted))
timerEnd=time.time()
print('Completed in', (timerEnd - timerStart))

######### end  ############ Result without GWO

# Feature selection using GWO+F1 Score
print ('\n############ Begin  ############ Result GWO with F1-Score implementation:')
fit = GWO(fitness_function_f1, 0, 1, num_features, 10, 20)

selected_features = np.where(fit>0.5)[0]
print('selected_features (final):',len(selected_features), selected_features)
train_xf = train_x.iloc[:, selected_features]
test_xf = test_x.iloc[:, selected_features]
knn_classifier = KNeighborsClassifier(n_neighbors=7)
knn_classifier.fit(train_xf, train_y)

predicted = knn_classifier.predict(test_xf)
print('accuracy_score (final):',accuracy_score(test_y, predicted))

############ end  ############ Result GWO with F1-Score

# Feature selection using GWO+Accuracy
print ('\n############ Begin  ############ Result GWO with Accuracy implementation:')

fit = GWO(fitness_function_accur, 0, 1, num_features, 10, 20)
selected_features = np.where(fit>0.5)[0]
print('selected_features (final):',len(selected_features), selected_features)


train_xf = train_x.iloc[:, selected_features]
test_xf = test_x.iloc[:, selected_features]

knn_classifier = KNeighborsClassifier(n_neighbors=7)
knn_classifier.fit(train_xf, train_y)
predicted = knn_classifier.predict(test_xf)
#print('confusion_matrix:\n',confusion_matrix(test_y, predicted))
print('accuracy_score (final):',accuracy_score(test_y, predicted))




########### Loading UNSW-NB15 dataset ##############
['dur', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports', 'proto_tcp', 'proto_udp', 'service_dhcp', 'service_dns', 'service_ftp', 'service_ftp-data', 'service_http', 'service_irc', 'service_pop3', 'service_radius', 'service_smtp', 'service_snmp', 'service_ssh', 'service_ssl', 'state_CON', 'state_FIN', 'state_INT', 'state_REQ', 'state_RST']
train_x shape: (12195, 58)
train_y shape: (12195,)
test_x shape: (4065, 58)
test_y shape: (4065,)
############ Begin  ############ Result withou GWO:
accuracy_score (final): 0.984009840098401
Completed in