# **GAN-based Autoencoder for Anomaly Detection in SIoT**

This study examines SIoT networks, which are characterized by diverse data modalities, sensor data, device interactions, and social connections. To address evolving threats, a comprehensive approach is proposed that integrates several ML models, namely CNN (Convolutional Neural Network), GAN (Generative Adversarial Network), and LR (Logistic Regression), to detect intrusions in SIoT environments. The method encompasses data collection, preprocessing, feature selection, and model training. Performance evaluation shows that CNN+GAN achieves 85% accuracy, surpassing the other models. Reported metrics include precision, accuracy, recall, ROC AUC, and F1-score, which underline the effectiveness of the proposed approach. This research advances SIoT security and offers insights for designing secure and resilient networks in an increasingly interconnected landscape.
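To make the reported metrics concrete, here is a minimal sketch of how accuracy, precision, recall, and F1-score follow from confusion-matrix counts. It is illustrative only and is not the notebook's evaluation code: the function name `binary_metrics` and the toy label lists are assumptions, and the positive class is taken to be 1 (Attack), matching the encoding used in preprocessing. ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, not taken from the dataset
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because the data is balanced by undersampling before training, accuracy is less misleading here than it would be on the raw, attack-dominated traffic, but precision and recall still show which kind of error dominates.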

## **Data and Dependencies Load**

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

pd.options.display.max_rows = 100
pd.options.display.max_columns = None

In [2]:
# Data Loader

import os

def data_maker(directory_path):
    """
    This function reads all CSV files from a specified directory and concatenates them into a single DataFrame.

    Parameters:
    directory_path (str): The path to the directory containing the CSV files.

    Returns:
    pd.DataFrame: The concatenated DataFrame.
    """
    # List to hold the individual DataFrames
    dataframes = []

    # Iterate over all files in the directory (sorted so the row order is reproducible)
    for filename in sorted(os.listdir(directory_path)):
        if filename.endswith(".csv"):
            file_path = os.path.join(directory_path, filename)
            # Read the CSV file into a DataFrame
            df = pd.read_csv(file_path)
            # Append the DataFrame to the list
            dataframes.append(df)

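    # Fail with a clear message if the directory holds no CSV files
    if not dataframes:
        raise FileNotFoundError(f"No CSV files found in {directory_path!r}")

    # Concatenate all DataFrames in the list
    concatenated_df = pd.concat(dataframes, ignore_index=True)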

    return concatenated_df

# Directory containing the CSV files
directory_path = "./data/"

# Get the concatenated DataFrame
df = data_maker(directory_path)
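
Reading tens of millions of rows as float64 can exhaust memory before any later downcast runs. A minimal sketch, assuming every column other than `label` is numeric (as the column preview in this notebook suggests): the hypothetical helper `read_csv_float32` parses those columns as float32 at load time. The helper name, the temporary file, and its two rows are illustrative and not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_float32(file_path, label_col="label"):
    """Read one CSV, parsing every column except `label_col` as float32."""
    # Read only the header row to learn the column names cheaply.
    columns = pd.read_csv(file_path, nrows=0).columns
    dtypes = {col: "float32" for col in columns if col != label_col}
    return pd.read_csv(file_path, dtype=dtypes)


# Hypothetical two-row file used only to exercise the helper
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "part-0.csv")
    with open(sample_path, "w") as handle:
        handle.write("flow_duration,Rate,label\n")
        handle.write("0.037456,10001.102371,DDoS-UDP_Flood\n")
        handle.write("0.0,0.0,BenignTraffic\n")
    sample = read_csv_float32(sample_path)
```

Such a helper could stand in for the `pd.read_csv(file_path)` call inside `data_maker`. The trade-off is that float32 keeps only about seven significant decimal digits.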

In [3]:
# Data head

df.head(10)

Unnamed: 0,flow_duration,Header_Length,Protocol Type,Duration,Rate,Srate,Drate,fin_flag_number,syn_flag_number,rst_flag_number,psh_flag_number,ack_flag_number,ece_flag_number,cwr_flag_number,ack_count,syn_count,fin_count,urg_count,rst_count,HTTP,HTTPS,DNS,Telnet,SMTP,SSH,IRC,TCP,UDP,DHCP,ARP,ICMP,IPv,LLC,Tot sum,Min,Max,AVG,Std,Tot size,IAT,Number,Magnitue,Radius,Covariance,Variance,Weight,label
0,0.037456,15099.0,17.0,64.0,10001.102371,10001.102371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,525.0,50.0,50.0,50.0,0.0,50.0,83102150.0,9.5,10.0,0.0,0.0,0.0,141.55,DDoS-UDP_Flood
1,0.0,54.0,6.0,64.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,83331770.0,9.5,10.392305,0.0,0.0,0.0,141.55,DDoS-PSHACK_Flood
2,0.010346,9662.5,17.0,64.0,21380.056228,21380.056228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,525.0,50.0,50.0,50.0,0.0,50.0,83098790.0,9.5,10.0,0.0,0.0,0.0,141.55,DDoS-UDP_Flood
3,0.0,54.0,6.0,64.0,241.333973,241.333973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,82951120.0,9.5,10.392305,0.0,0.0,0.0,141.55,DoS-TCP_Flood
4,0.195109,95.58,6.0,64.0,6.762174,6.762174,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.77,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,83365400.0,9.5,10.392305,0.0,0.0,0.0,141.55,DDoS-SynonymousIP_Flood
5,0.0,54.0,6.0,64.0,1.502265,1.502265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,83067230.0,9.5,10.392305,0.0,0.0,0.0,141.55,DDoS-TCP_Flood
6,0.0,54.0,6.0,64.0,60.667438,60.667438,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,83348610.0,9.5,10.392305,0.0,0.0,0.0,141.55,DDoS-RSTFINFlood
7,0.0,54.0,6.0,64.0,163.291443,163.291443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,567.0,54.0,54.0,54.0,0.0,54.0,83034000.0,9.5,10.392305,0.0,0.0,0.0,141.55,DDoS-TCP_Flood
8,0.0,0.0,1.0,64.0,2.062152,2.062152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,441.0,42.0,42.0,42.0,0.0,42.0,83149750.0,9.5,9.165151,0.0,0.0,0.0,141.55,DDoS-ICMP_Flood
9,0.036378,1618.78,1.05,64.0,46.947385,46.947385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,1.71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,470.44,42.0,71.44,43.510737,6.49508,56.72,83124650.0,9.5,9.291007,9.296842,2160.781828,0.02,141.55,DDoS-ICMP_Flood


In [4]:
# Dataset shape and missing values check

print(f"There are {df.isnull().sum().sum()} Missing Values\n")
print(f"The dataset consists of {df.shape[0]} rows and {df.shape[1]} columns")

There are 0 Missing Values

The dataset consists of 46686579 rows and 47 columns


## **Data Preprocessing**

In [5]:
# Preprocess the "label" column
df['label'] = df['label'].apply(lambda x: 'Benign' if x == 'BenignTraffic' else 'Attack')
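
On a frame this large, a row-by-row `apply` with a Python lambda can be slow. As a hedged alternative, the sketch below shows a vectorized version of the same relabeling on a small made-up Series; the variable names and sample values are assumptions for illustration, and `BenignTraffic` is the only label treated as benign, as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw labels
labels = pd.Series(["BenignTraffic", "DDoS-UDP_Flood", "DoS-TCP_Flood", "BenignTraffic"])

# Vectorized equivalent of the lambda: one comparison over the whole column
binary_text = pd.Series(
    np.where(labels == "BenignTraffic", "Benign", "Attack"),
    index=labels.index,
)

# The 0/1 encoding (Benign = 0, Attack = 1) can also be produced directly
binary_int = (labels != "BenignTraffic").astype("int8")
```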

In [6]:
# Undersampling the data

from imblearn.under_sampling import RandomUnderSampler

# Balance the data using undersampling
rus = RandomUnderSampler(random_state=42)
X = df.drop(columns=['label'])
y = df['label']

X_resampled, y_resampled = rus.fit_resample(X, y)

# Combine the resampled features and labels into a single dataframe
df_balanced = pd.concat([X_resampled, y_resampled], axis=1)

# Check results (printed explicitly, since only a cell's last expression is displayed)
print(df_balanced['label'].value_counts())

# Change labels from "Benign" to 0 and "Attack" to 1
df_balanced['label'] = df_balanced['label'].apply(lambda x: 0 if x == 'Benign' else 1)
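
To illustrate what random undersampling does, the following sketch downsamples every class to the size of the smallest one using pandas alone. It is an assumption-laden stand-in, not the imbalanced-learn implementation: the function name `undersample` and the toy frame are hypothetical, and the rows it selects will generally differ from those chosen by `RandomUnderSampler` even with the same seed.

```python
import pandas as pd


def undersample(frame, label_col="label", random_state=42):
    """Randomly downsample each class to the size of the smallest class."""
    n_min = frame[label_col].value_counts().min()
    return (
        frame.groupby(label_col, group_keys=False)
        .sample(n=n_min, random_state=random_state)
        .reset_index(drop=True)
    )


# Hypothetical imbalanced toy data: 8 attack rows, 2 benign rows
toy = pd.DataFrame({
    "Rate": range(10),
    "label": ["Attack"] * 8 + ["Benign"] * 2,
})
balanced = undersample(toy)
```

After the call, each class in `balanced` has as many rows as the minority class had in `toy`; the surplus majority-class rows are simply discarded.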

In [7]:
# Convert all float64 columns to float32
float64_cols = df_balanced.select_dtypes(include=['float64']).columns
df_balanced[float64_cols] = df_balanced[float64_cols].astype('float32')

# Export the dataset as a pickle file
df_balanced.to_pickle('iot2023.pkl')
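
The effect of the float32 cast and the pickle export can be checked on a small synthetic frame. This is a sketch under stated assumptions: the column values, the frame size, and the file name `toy.pkl` are made up, and the file is written to a temporary directory rather than next to the notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame with two float64 feature columns and an integer label
toy = pd.DataFrame({
    "Rate": np.linspace(0.0, 1.0, 1000),
    "IAT": np.full(1000, 83102150.0),
    "label": np.zeros(1000, dtype="int64"),
})
bytes_before = toy.memory_usage(index=False).sum()

# Downcast only the float64 columns, leaving the label untouched
float64_cols = toy.select_dtypes(include=["float64"]).columns
toy = toy.astype({col: "float32" for col in float64_cols})
bytes_after = toy.memory_usage(index=False).sum()

# Round-trip through a pickle file to confirm the dtypes survive
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)
```

The cast halves the storage of each float column, and the pickle preserves the float32 dtypes on reload. Note that float32 carries only about seven significant decimal digits, so large values such as those in `IAT` are stored rounded.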