# NSL-KDD -- Preprocessing

This notebook covers the preprocessing steps required to prepare the data for model training. The insights and visualizations from the EDA notebook can serve as a basis for identifying promising features; that feature-selection step is described in detail in the report document.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Own modules
from dataprepper import DataPrepper

Load and inspect the dataset in the state in which the EDA notebook exported it.

In [34]:
concatenated_df = pd.read_csv('../output/EDA.csv')

data = DataPrepper(df=concatenated_df, target_col=['attack_category', 'attack_label', 'anomaly_bool'], name="NSL-KDD Combined Dataset")
data.inspect()


=== NSL-KDD Combined Dataset: SHAPE ===
(144157, 45)

=== NSL-KDD Combined Dataset: HEAD ===
   duration  protocol_type  service  flag  src_bytes  dst_bytes  land  \
0        13              1       60     9        118       2425     0   
1         0              2       49     9         53         55     0   
2         0              1       49    10          0          0     0   
3         0              1       24     9      54540       8314     0   
4         0              1       28     1          0          0     0   

   wrong_fragment  urgent  hot  ...  dst_host_srv_diff_host_rate  \
0               0       0    0  ...                          0.0   
1               0       0    0  ...                          0.0   
2               0       0    0  ...                          0.0   
3               0       0    2  ...                          0.0   
4               0       0    0  ...                          0.0   

   dst_host_serror_rate  dst_host_srv_se

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| duration | 144157.0 | 255.59397 | 2455.803 | 0.0 | 0.0 | 0.0 | 0.0 | 54451.0 |
| protocol_type | 144157.0 | 1.056265 | 0.4202767 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| service | 144157.0 | 31.426986 | 16.1883 | 0.0 | 20.0 | 24.0 | 49.0 | 69.0 |
| flag | 144157.0 | 6.993486 | 2.7541 | 0.0 | 5.0 | 9.0 | 9.0 | 10.0 |
| src_bytes | 144157.0 | 40560.318923 | 5487682.0 | 0.0 | 0.0 | 45.0 | 278.0 | 1379964000.0 |
| dst_bytes | 144157.0 | 17592.985648 | 3759113.0 | 0.0 | 0.0 | 0.0 | 607.0 | 1309937000.0 |
| land | 144157.0 | 0.000201 | 0.01418205 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| wrong_fragment | 144157.0 | 0.020963 | 0.2431196 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| urgent | 144157.0 | 0.000125 | 0.01580234 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| hot | 144157.0 | 0.194586 | 2.042863 | 0.0 | 0.0 | 0.0 | 0.0 | 101.0 |


As described in the EDA notebook, we apply `StandardScaler` to standardize the feature values, so that each feature has zero mean and unit variance. Before scaling, we exclude the target columns, along with `difficulty` and `attack_number`, so that only input features are transformed. This leaves the target values unaltered and guarantees that the scaler operates solely on the actual feature set.

In [35]:
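# Scale every column except the targets and the 'difficulty' and 'attack_number' columns
feature_cols = [c for c in data.df.columns if c not in data.target_col + ['difficulty', 'attack_number']]

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
data.df[feature_cols] = scaler.fit_transform(data.df[feature_cols])

# Round all values to four decimal places
data.df = data.df.round(4)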
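
To make the scaling step concrete, the following is a minimal sketch on a toy frame. The values and the `toy` name are hypothetical and not taken from NSL-KDD. It checks that `StandardScaler` matches a manual z-score computed with the population standard deviation (`ddof=0`), and that the excluded target column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; 'anomaly_bool' plays the role of a target column.
toy = pd.DataFrame({
    "duration": [0.0, 13.0, 0.0, 250.0],
    "src_bytes": [118.0, 53.0, 0.0, 54540.0],
    "anomaly_bool": [0, 0, 1, 1],
})
target_cols = ["anomaly_bool"]
feature_cols = [c for c in toy.columns if c not in target_cols]

# Scale only the feature columns.
scaled = StandardScaler().fit_transform(toy[feature_cols])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (toy[feature_cols] - toy[feature_cols].mean()) / toy[feature_cols].std(ddof=0)

same = np.allclose(scaled, manual.to_numpy())
targets_untouched = toy["anomaly_bool"].tolist() == [0, 0, 1, 1]
print(same, targets_untouched)
```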

We inspect the final dataset and save it to a `.csv` file.

In [36]:
data.inspect()


=== NSL-KDD Combined Dataset: SHAPE ===
(144157, 45)

=== NSL-KDD Combined Dataset: HEAD ===
   duration  protocol_type  service    flag  src_bytes  dst_bytes    land  \
0   -0.0988        -0.1339   1.7650  0.7286    -0.0074    -0.0040 -0.0142   
1   -0.1041         2.2455   1.0855  0.7286    -0.0074    -0.0047 -0.0142   
2   -0.1041        -0.1339   1.0855  1.0917    -0.0074    -0.0047 -0.0142   
3   -0.1041        -0.1339  -0.4588  0.7286     0.0025    -0.0025 -0.0142   
4   -0.1041        -0.1339  -0.2117 -2.1762    -0.0074    -0.0047 -0.0142   

   wrong_fragment  urgent     hot  ...  dst_host_srv_diff_host_rate  \
0         -0.0862 -0.0079 -0.0953  ...                      -0.2836   
1         -0.0862 -0.0079 -0.0953  ...                      -0.2836   
2         -0.0862 -0.0079 -0.0953  ...                      -0.2836   
3         -0.0862 -0.0079  0.8838  ...                      -0.2836   
4         -0.0862 -0.0079 -0.0953  ...                      -0.2836   

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| duration | 144157.0 | -4.731793e-18 | 1.000003 | -0.104078 | -0.104078 | -0.104078 | -0.104078 | 22.068383 |
| protocol_type | 144157.0 | 2.391527e-16 | 1.000003 | -2.51327 | -0.133877 | -0.133877 | -0.133877 | 2.245516 |
| service | 144157.0 | 3.233392e-17 | 1.000003 | -1.941346 | -0.705882 | -0.458789 | 1.085542 | 2.321006 |
| flag | 144157.0 | -1.624582e-16 | 1.000003 | -2.539309 | -0.723827 | 0.728558 | 0.728558 | 1.091654 |
| src_bytes | 144157.0 | -9.857901e-20 | 1.000003 | -0.007391 | -0.007391 | -0.007383 | -0.007341 | 251.459196 |
| dst_bytes | 144157.0 | -7.886321e-19 | 1.000003 | -0.00468 | -0.00468 | -0.00468 | -0.004519 | 348.46637 |
| land | 144157.0 | -3.647423e-18 | 1.000003 | -0.014185 | -0.014185 | -0.014185 | -0.014185 | 70.497738 |
| wrong_fragment | 144157.0 | -1.97158e-17 | 1.000003 | -0.086226 | -0.086226 | -0.086226 | -0.086226 | 12.253422 |
| urgent | 144157.0 | -9.857901e-19 | 1.000003 | -0.007902 | -0.007902 | -0.007902 | -0.007902 | 189.838089 |
| hot | 144157.0 | -8.674953e-18 | 1.000003 | -0.095252 | -0.095252 | -0.095252 | -0.095252 | 49.345338 |
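
The `std` column above reads 1.000003 rather than exactly 1. A likely explanation, assuming the summary comes from pandas' `describe()`, is that pandas reports the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The two differ by a factor of sqrt(n / (n - 1)), which the short check below evaluates for the row count reported above.

```python
import numpy as np

n = 144157  # number of rows reported by data.inspect()

# Ratio between the sample std (ddof=1) and the population std (ddof=0)
ratio = np.sqrt(n / (n - 1))
print(round(ratio, 6))  # 1.000003
```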


In [37]:
data.df.to_csv("../output/SCA_processed.csv", index=False)

A balanced dataset is important in machine learning because it ensures that all classes are represented in comparable numbers. When each class contains a similar number of samples, the model is less likely to develop a bias toward the majority class. An imbalanced dataset can cause weak performance, especially for minority classes, as the model struggles to recognize and predict them accurately.
Since the dataset is imbalanced, we undersample the normal traffic to match the number of intrusions and then draw a random subset of 15,000 rows. This results in an approximately balanced dataset for binary classification (`anomaly_bool`) that can be used for training the classification models.

In [38]:
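# Create a balanced dataset from the scaled (non-PCA) data
normal_traffic = data.df.loc[data.df['anomaly_bool'] == 0]
intrusions = data.df.loc[data.df['anomaly_bool'] == 1]

# Undersample normal traffic to the number of intrusion rows (no row is drawn twice)
normal_traffic = normal_traffic.sample(n=len(intrusions), replace=False)

ids_data = pd.concat([intrusions, normal_traffic])
# Ensure the label is strictly binary: 0 = normal, anything else = 1
ids_data['anomaly_bool'] = np.where(ids_data['anomaly_bool'] == 0, 0, 1)
# Draw a random subset of 15,000 rows; the class counts are therefore only approximately equal
bc_data = ids_data.sample(n=15000)

print(bc_data['anomaly_bool'].value_counts())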

anomaly_bool
0    7534
1    7466
Name: count, dtype: int64
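
The balancing cell above draws rows at random without a fixed seed, so the exact counts (7534 vs. 7466 here) will change between runs. As a hedged alternative, the sketch below undersamples every class to the size of the smallest one with a fixed `random_state`, which gives exactly equal class counts and a reproducible result. The toy frame and the seed value are assumptions chosen for illustration; this is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 6 normal rows (0) and 3 intrusion rows (1).
toy = pd.DataFrame({
    "src_bytes": range(9),
    "anomaly_bool": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Size of the smallest class determines how many rows to keep per class.
minority_size = toy["anomaly_bool"].value_counts().min()

# Draw the same number of rows from every class, without replacement.
balanced = (
    toy.groupby("anomaly_bool", group_keys=False)
       .sample(n=minority_size, random_state=42)
)

counts = balanced["anomaly_bool"].value_counts().to_dict()
print(counts)
```

Fixing the seed trades randomness across runs for reproducibility, which makes later model comparisons easier to repeat.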


In [39]:
bc_data.to_csv("../output/SCA_balanced.csv", index=False)

At this stage, preprocessing is complete, and the dataset is fully prepared for model development.