# Real-Time Network Intrusion Detection System (NIDS)

In this notebook, we prepare flow-based network traffic data for training the anomaly detection component of our Real-Time NIDS. We begin by loading and exploring the **CSE-CIC-IDS2018** dataset, then engineer flow-level features such as `packet_rate` and `byte_rate`. These features are used to train an unsupervised machine learning model (**IsolationForest**) that detects **anomalies** in real-time traffic. The resulting model will be integrated into the RealTimeNIDS system to flag suspicious activity based on flow behavior.

**Imports**:

In [1]:
import pandas as pd
import numpy as np

## Exploratory data analysis (EDA)

In this section, we examine the structure and quality of the dataset to understand the available features, identify missing or inconsistent values, and ensure the data is suitable for feature engineering. This step helps us prepare the dataset for training a robust anomaly detection model.

**Load Dataset**:

In [2]:
df = pd.read_csv('../dataset/02-14-2018.csv')
df.head()

Unnamed: 0,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,0,0,14/02/2018 08:31:01,112641719,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56320859.5,139.300036,56320958,56320761,Benign
1,0,0,14/02/2018 08:33:50,112641466,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56320733.0,114.551299,56320814,56320652,Benign
2,0,0,14/02/2018 08:36:39,112638623,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56319311.5,301.934596,56319525,56319098,Benign
3,22,6,14/02/2018 08:40:13,6453966,15,10,1239,2273,744,0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,Benign
4,22,6,14/02/2018 08:40:23,8804066,14,11,1143,2209,744,0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,Benign


In [3]:
df.columns

Index(['Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Tot Fwd Pkts',
       'Tot Bwd Pkts', 'TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Max',
       'Fwd Pkt Len Min', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Std',
       'Bwd Pkt Len Max', 'Bwd Pkt Len Min', 'Bwd Pkt Len Mean',
       'Bwd Pkt Len Std', 'Flow Byts/s', 'Flow Pkts/s', 'Flow IAT Mean',
       'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Tot',
       'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
       'Bwd IAT Tot', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
       'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags',
       'Bwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
       'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
       'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
       'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
       'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
      

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 80 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1048575 non-null  int64  
 1   Protocol           1048575 non-null  int64  
 2   Timestamp          1048575 non-null  object 
 3   Flow Duration      1048575 non-null  int64  
 4   Tot Fwd Pkts       1048575 non-null  int64  
 5   Tot Bwd Pkts       1048575 non-null  int64  
 6   TotLen Fwd Pkts    1048575 non-null  int64  
 7   TotLen Bwd Pkts    1048575 non-null  int64  
 8   Fwd Pkt Len Max    1048575 non-null  int64  
 9   Fwd Pkt Len Min    1048575 non-null  int64  
 10  Fwd Pkt Len Mean   1048575 non-null  float64
 11  Fwd Pkt Len Std    1048575 non-null  float64
 12  Bwd Pkt Len Max    1048575 non-null  int64  
 13  Bwd Pkt Len Min    1048575 non-null  int64  
 14  Bwd Pkt Len Mean   1048575 non-null  float64
 15  Bwd Pkt Len Std    1048575 non-n

In [17]:
df.describe()


Unnamed: 0,Dst Port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,...,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min
count,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,...,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0
mean,4876.262,8.107557,6255555.0,6.206622,7.211191,447.9936,4521.803,174.5736,8.389535,38.79579,...,2.793536,23.2797,51524.49,21361.51,87891.57,39954.77,3101206.0,729721.8,4812391.0,2126920.0
std,14443.44,4.460625,1260291000.0,44.47851,104.8682,15735.41,151502.1,287.6713,19.48279,53.31882,...,5.557106,11.06185,581558.6,218640.5,739572.5,560269.3,541478000.0,382003100.0,1522117000.0,18170130.0
min,0.0,0.0,-919011000000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,22.0,6.0,7.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,53.0,6.0,1023.0,2.0,1.0,36.0,55.0,34.0,0.0,25.66667,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,443.0,6.0,406669.0,7.0,6.0,455.0,768.0,199.0,0.0,55.5,...,4.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,65533.0,17.0,120000000.0,5115.0,9198.0,8591554.0,13397730.0,64440.0,1460.0,11217.03,...,1031.0,48.0,110240100.0,57234460.0,110240100.0,110240100.0,339450300000.0,243268200000.0,979781000000.0,12603000000.0


**Check Missing Values and Duplicates**:

In [19]:
df.isna().sum().sort_values(ascending=False)

Flow Byts/s     2277
Dst Port           0
Timestamp          0
Protocol           0
Tot Fwd Pkts       0
                ... 
Idle Mean          0
Idle Std           0
Idle Max           0
Idle Min           0
Label              0
Length: 80, dtype: int64
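
The count above shows 2277 missing values, all in `Flow Byts/s`. A minimal sketch of one way to handle them follows, run on a toy frame rather than the real data. It assumes that rate columns may also contain infinite values (for example when a flow's duration is zero); the output above does not show this, so treat it as a precaution rather than a finding. Dropping the affected rows is also an assumption; imputing them is a reasonable alternative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Flow Duration": [0, 1023, 406669, 7],
    "Flow Byts/s": [np.nan, 88954.0, np.inf, 1200.5],
    "Flow Pkts/s": [np.inf, 2932.5, 31.9, 285714.3],
})

rate_cols = ["Flow Byts/s", "Flow Pkts/s"]

# Treat +/-inf as missing so a single mask covers both cases.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
missing_per_col = cleaned[rate_cols].isna().sum()

# Drop every row where at least one rate column is missing.
toy_clean = cleaned.dropna(subset=rate_cols).reset_index(drop=True)

print(missing_per_col)
print(len(toy_clean), "rows kept out of", len(toy))
```

On the real frame, the same two calls (`replace` followed by `dropna(subset=...)`) could be applied to `df` directly.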

In [21]:
len(df)

1048575

In [20]:
df.duplicated().sum()

np.int64(225628)
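
With 225,628 of 1,048,575 rows flagged, roughly 21.5% of the file consists of exact repeats of an earlier row. Whether to remove them is a judgment call: repeated flows can be genuine traffic, but they can also bias the model toward over-represented patterns. The sketch below, on a toy frame with illustrative values, shows how the share of duplicates could be measured and how the repeats could be dropped.

```python
import pandas as pd

# Toy stand-in for df; the second row is an exact copy of the first.
toy = pd.DataFrame({
    "Dst Port": [22, 22, 53, 22],
    "Flow Duration": [6453966, 6453966, 1023, 8804066],
    "Label": ["Benign", "Benign", "Benign", "Benign"],
})

# duplicated() marks the second and later copies of a row, not the first one.
dup_mask = toy.duplicated()
dup_share = dup_mask.mean()  # fraction of rows that repeat an earlier row

# Keep the first occurrence of each distinct row.
toy_dedup = toy.drop_duplicates().reset_index(drop=True)

print(f"{dup_mask.sum()} duplicate row(s), {dup_share:.1%} of the frame")
print(len(toy_dedup), "rows after de-duplication")
```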

**Separate into `benign` and `attack`**:

In [11]:
df_benign = df[df['Label'].str.strip().str.lower() == 'benign']
df_benign['Label'].nunique()

1

In [13]:
df_attack = df[df['Label'].str.strip().str.lower() != 'benign']
df_attack['Label'].nunique()

2
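
The benign subset is the natural input for the `packet_rate` and `byte_rate` features mentioned in the introduction. The notebook has not defined them at this point, so the sketch below rests on assumptions: `Flow Duration` is taken to be in microseconds (consistent with its maximum of 120,000,000 in `df.describe()`), and each rate is total packets or bytes in both directions divided by the duration in seconds. The helper `add_rate_features` is hypothetical. Rows with a non-positive duration are dropped, since `df.describe()` shows a negative minimum and a rate is undefined for them.

```python
import pandas as pd

# Toy rows using column names from df.columns; the values are illustrative.
toy = pd.DataFrame({
    "Flow Duration": [2_000_000, 500_000, -5, 0],  # assumed to be microseconds
    "Tot Fwd Pkts": [15, 3, 1, 2],
    "Tot Bwd Pkts": [5, 2, 0, 0],
    "TotLen Fwd Pkts": [1200, 100, 0, 40],
    "TotLen Bwd Pkts": [800, 400, 0, 0],
    "Label": ["Benign", " benign ", "Benign", "Attack"],  # placeholder labels
})


def add_rate_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with per-second packet_rate and byte_rate columns.

    Hypothetical helper: rows whose duration is zero or negative are dropped.
    """
    out = frame[frame["Flow Duration"] > 0].copy()
    seconds = out["Flow Duration"] / 1e6
    out["packet_rate"] = (out["Tot Fwd Pkts"] + out["Tot Bwd Pkts"]) / seconds
    out["byte_rate"] = (out["TotLen Fwd Pkts"] + out["TotLen Bwd Pkts"]) / seconds
    return out


# Same label normalisation as the cells above, computed once.
is_benign = toy["Label"].str.strip().str.lower() == "benign"
benign_features = add_rate_features(toy[is_benign])

print(benign_features[["packet_rate", "byte_rate"]])
```

Working on a `.copy()` of the filtered frame avoids pandas' chained-assignment warning when the new columns are added.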