# CICIDS2017 -- Preprocessing

This notebook covers the preprocessing steps required to prepare the data for model training. The insights and visualizations from the EDA notebook serve as a basis for identifying promising features; that selection step is described in detail in the report document.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler
import joblib

# Own modules
from dataprepper import DataPrepper

Load and inspect the dataset in the state it was saved in at the end of the EDA notebook.

In [28]:
concatenated_df = pd.read_csv('../output/EDA.csv')

data_sca = DataPrepper(df=concatenated_df, target_col=['anomaly_bool', 'Attack Number', 'Attack Type'], name="CICIDS2017 Combined Dataset")

data_sca.inspect()


=== CICIDS2017 Combined Dataset: SHAPE ===
(2522362, 73)

=== CICIDS2017 Combined Dataset: HEAD ===
   Destination Port  Flow Duration  Total Fwd Packets  Total Backward Packets  \
0             54865              3                  2                       0   
1             55054            109                  1                       1   
2             55055             52                  1                       1   
3             46236             34                  1                       1   
4             54863              3                  2                       0   

   Total Length of Fwd Packets  Total Length of Bwd Packets  \
0                           12                            0   
1                            6                            6   
2                            6                            6   
3                            6                            6   
4                           12                            0   

   Fwd Packet L

Feature,count,mean,std,min,25%,50%,75%,max
Destination Port,2522362.0,8.704762e+03,1.902507e+04,0.0,53.0,80.0,443.00,65535.0
Flow Duration,2522362.0,1.658132e+07,3.522426e+07,-13.0,208.0,50577.0,5329717.25,119999998.0
Total Fwd Packets,2522362.0,1.027627e+01,7.941738e+02,1.0,2.0,2.0,6.00,219759.0
Total Backward Packets,2522362.0,1.156596e+01,1.056594e+03,0.0,1.0,2.0,5.00,291922.0
Total Length of Fwd Packets,2522362.0,6.115751e+02,1.058499e+04,0.0,12.0,66.0,332.00,12900000.0
...,...,...,...,...,...,...,...,...
Active Min,2522362.0,6.542300e+04,6.109712e+05,0.0,0.0,0.0,0.00,110000000.0
Idle Mean,2522362.0,9.331578e+06,2.484157e+07,0.0,0.0,0.0,0.00,120000000.0
Idle Std,2522362.0,5.654433e+05,4.872678e+06,0.0,0.0,0.0,0.00,76900000.0
Idle Max,2522362.0,9.757716e+06,2.561067e+07,0.0,0.0,0.0,0.00,120000000.0


As described in the EDA notebook, we apply the StandardScaler to standardize the feature values, so that each feature column has zero mean and unit variance. Before scaling, we exclude the target columns so that only the numerical input features are transformed. This leaves the class labels unaltered and guarantees that the scaler operates solely on the actual feature set.

In [29]:
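# Every column that is not a target column is treated as an input feature
feature_cols = [c for c in data_sca.df.columns if c not in data_sca.target_col]

# Fit the scaler on the feature columns and overwrite them with the scaled values
scaler = StandardScaler()
data_sca.df[feature_cols] = scaler.fit_transform(data_sca.df[feature_cols])

# Round all values to 4 decimal places
data_sca.df = data_sca.df.round(4)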
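As a quick sanity check of the idea above, the following minimal sketch scales a small, made-up frame and confirms that the feature columns end up with mean 0 and standard deviation 1 while the label column is untouched. The frame `toy` and its values are hypothetical and only loosely mirror the CICIDS2017 column names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two numeric features and one label column
toy = pd.DataFrame({
    "Flow Duration": [3.0, 109.0, 52.0, 34.0],
    "Total Fwd Packets": [2.0, 1.0, 1.0, 1.0],
    "Attack Type": ["BENIGN", "BENIGN", "DDoS", "BENIGN"],
})
toy_targets = ["Attack Type"]
toy_features = [c for c in toy.columns if c not in toy_targets]

# Scale only the feature columns; the label column is never passed to the scaler
toy_scaler = StandardScaler()
toy[toy_features] = toy_scaler.fit_transform(toy[toy_features])

# StandardScaler uses the population standard deviation, hence ddof=0
col_means = toy[toy_features].mean()
col_stds = toy[toy_features].std(ddof=0)
labels_after = toy["Attack Type"].tolist()

print(col_means.round(6).to_dict())
print(col_stds.round(6).to_dict())
print(labels_after)
```

The same check can be run on `data_sca.df[feature_cols]` after scaling; the summary statistics printed by `inspect()` below show the same pattern.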

Since the CICIDS2017 dataset is relatively large (about 2.5 million rows and 70 feature columns), PCA will be applied to reduce its dimensionality.  
Before performing PCA, the scaled dataset is saved as a `.csv` file.  
This ensures that both the scaled version and the scaled + PCA-transformed version are available during model training.

In [None]:
data_sca.inspect()
data_sca.df.to_csv("../output/SCA_processed.csv", index=False)


=== CICIDS2017 Combined Dataset: SHAPE ===
(2522362, 73)

=== CICIDS2017 Combined Dataset: HEAD ===
   Destination Port  Flow Duration  Total Fwd Packets  Total Backward Packets  \
0            2.4263        -0.4707            -0.0104                 -0.0109   
1            2.4362        -0.4707            -0.0117                 -0.0100   
2            2.4363        -0.4707            -0.0117                 -0.0100   
3            1.9727        -0.4707            -0.0117                 -0.0100   
4            2.4262        -0.4707            -0.0104                 -0.0109   

   Total Length of Fwd Packets  Total Length of Bwd Packets  \
0                      -0.0566                      -0.0076   
1                      -0.0572                      -0.0076   
2                      -0.0572                      -0.0076   
3                      -0.0572                      -0.0076   
4                      -0.0566                      -0.0076   

   Fwd Packet L

Feature,count,mean,std,min,25%,50%,75%,max
Destination Port,2522362.0,-3.461497e-17,1.0,-0.457542,-0.454756,-0.453337,-0.434257,2.987125
Flow Duration,2522362.0,2.307665e-17,1.0,-0.470736,-0.470730,-0.469300,-0.319428,2.936008
Total Fwd Packets,2522362.0,-9.014316e-19,1.0,-0.011680,-0.010421,-0.010421,-0.005385,276.701111
Total Backward Packets,2522362.0,1.802863e-19,1.0,-0.010946,-0.010000,-0.009054,-0.006214,276.274878
Total Length of Fwd Packets,2522362.0,-4.687444e-18,1.0,-0.057778,-0.056644,-0.051542,-0.026412,1218.648808
...,...,...,...,...,...,...,...,...
Active Min,2522362.0,5.769162e-18,1.0,-0.107080,-0.107080,-0.107080,-0.107080,179.934185
Idle Mean,2522362.0,-1.038449e-16,1.0,-0.375644,-0.375644,-0.375644,-0.375644,4.454971
Idle Std,2522362.0,-3.461497e-17,1.0,-0.116044,-0.116044,-0.116044,-0.116044,15.665835
Idle Max,2522362.0,0.000000e+00,1.0,-0.381002,-0.381002,-0.381002,-0.381002,4.304546


We apply dimensionality reduction using IncrementalPCA.
The number of components is set to half the number of feature columns.
To handle the large dataset efficiently, the scaled features are split into batches of roughly 500 rows, and each batch updates the PCA model step by step via `partial_fit`.
We must verify that the share of variance preserved by the reduced feature space is high enough to maintain data quality.

In [None]:
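data_pca = DataPrepper(df=data_sca.df, target_col=data_sca.target_col, name="CICIDS2017 Scaled Dataset for PCA")

# Separate the input features from the target columns
features = data_pca.df.drop(data_sca.target_col, axis=1)
attacks = data_pca.df[data_sca.target_col]

# Keep half as many principal components as there are feature columns
size = len(features.columns) // 2
ipca = IncrementalPCA(n_components=size, batch_size=500)

# Update the model one chunk at a time (len(features) // 500 chunks of about 500 rows each)
for batch in np.array_split(features, len(features) // 500):
    ipca.partial_fit(batch)

# Share of the total variance captured by the retained components
print(f'information retained: {sum(ipca.explained_variance_ratio_):.2%}')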


information retained: 99.23%
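To make the variance check above more concrete, here is a minimal sketch of how the cumulative explained variance could be inspected to find the smallest number of components that reaches a chosen threshold. The synthetic data, the 99% threshold, and the names `X_demo`, `demo_ipca`, and `k` are assumptions for illustration; the notebook itself simply fixes the component count at half the feature columns.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

# Hypothetical data: 6 observed columns driven by 2 latent factors plus small noise
latent = rng.normal(size=(2000, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(2000, 6))

# Fit incrementally, keeping all components so the full variance spectrum is visible
demo_ipca = IncrementalPCA(n_components=6)
for chunk in np.array_split(X_demo, 10):
    demo_ipca.partial_fit(chunk)

# Cumulative share of variance explained by the first 1, 2, ... components
cumulative = np.cumsum(demo_ipca.explained_variance_ratio_)

threshold = 0.99  # assumed target, not a value taken from the notebook
# Index of the first cumulative value that reaches the threshold, converted to a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 4))
print(f"components needed for {threshold:.0%} variance: {k}")
```

On the real data the same `np.cumsum(ipca.explained_variance_ratio_)` call would show how quickly the 35 retained components accumulate the 99.23% reported above.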


With 99.23% of the variance retained, which we consider sufficient, we apply the fitted IncrementalPCA to the scaled dataset.
The transformed output is stored in a new DataFrame, where each column represents one of the extracted principal components.
Finally, the corresponding target columns are added back to the DataFrame so the reduced feature set remains paired with the correct class information.

In [23]:
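# Project the scaled features onto the fitted principal components
trans_features = ipca.transform(data_pca.df.drop(data_sca.target_col, axis=1)) #.astype('float32')
pca_data = pd.DataFrame(trans_features, columns=[f'PC{i+1}' for i in range(size)])

# Round all components to 4 decimal places
pca_data = pca_data.round(4)

# Re-attach the target columns; row order is unchanged, so values align by position
pca_data[attacks.columns] = attacks.values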

To keep the preprocessing consistent and reproducible, the fitted scaler and the IncrementalPCA (IPCA) model are saved as `.pkl` files. This matters because exactly the same scaling parameters and PCA components must be applied during inference as were used during training. Any deviation in these preprocessing steps would produce inconsistent feature distributions and could significantly degrade the model’s performance.


In [37]:
joblib.dump(scaler, './scaler.pkl')
joblib.dump(ipca, './ipca.pkl')

['./ipca.pkl']
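The following minimal sketch illustrates how the saved objects might be reused at inference time: load both files, then call only `transform`, never `fit`. The random arrays `X_train` and `X_new` and the temporary directory are assumptions for illustration; in this project the files would be `./scaler.pkl` and `./ipca.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))  # hypothetical training features
X_new = rng.normal(size=(5, 6))      # hypothetical unseen samples

# Training time: fit both preprocessing steps and persist them
fitted_scaler = StandardScaler().fit(X_train)
fitted_ipca = IncrementalPCA(n_components=3).fit(fitted_scaler.transform(X_train))

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    ipca_path = os.path.join(tmp, "ipca.pkl")
    joblib.dump(fitted_scaler, scaler_path)
    joblib.dump(fitted_ipca, ipca_path)

    # Inference time: load the persisted objects instead of refitting
    loaded_scaler = joblib.load(scaler_path)
    loaded_ipca = joblib.load(ipca_path)

# Apply the stored parameters in the same order as during training
X_new_reduced = loaded_ipca.transform(loaded_scaler.transform(X_new))

# The reloaded pipeline reproduces the original transformation
expected = fitted_ipca.transform(fitted_scaler.transform(X_new))
print(X_new_reduced.shape, np.allclose(X_new_reduced, expected))
```

Note that the notebook also rounds the scaled values to four decimal places before PCA, so an inference pipeline that wants to match training exactly would apply the same rounding between the two steps.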

Finally, we inspect the PCA-transformed data and save it to a `.csv` file.

In [24]:
pca_data = DataPrepper(df=pca_data, target_col='Attack Type', name="CICIDS2017 PCA Transformed Dataset")
pca_data.inspect()


=== CICIDS2017 PCA Transformed Dataset: SHAPE ===
(2522362, 38)

=== CICIDS2017 PCA Transformed Dataset: HEAD ===
      PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9  \
0 -2.3111 -0.0526  0.5159  0.6165  3.8407  0.3953 -0.0179  0.1866  0.3700   
1 -2.2465 -0.0492  0.4679  0.3955  2.0015 -0.1410 -0.0165 -0.7810 -0.8900   
2 -2.2588 -0.0495  0.4736  0.4086  2.0814 -0.1329 -0.0167 -0.7697 -0.8775   
3 -2.2492 -0.0507  0.4671  0.3468  2.0138 -0.1065 -0.0162 -0.7452 -0.8402   
4 -2.3111 -0.0526  0.5159  0.6165  3.8407  0.3953 -0.0179  0.1865  0.3701   

     PC10  ...    PC29    PC30    PC31    PC32    PC33    PC34    PC35  \
0 -0.6803  ... -0.5398 -0.0353  0.0231  0.0017  0.0457  0.1513  0.0519   
1  2.6606  ...  0.7856  0.2130  0.0309  0.0012  0.0259  0.0092 -0.0583   
2  2.6341  ...  0.7809  0.2036  0.0347  0.0012  0.0259  0.0059 -0.0644   
3  2.5068  ...  0.8028  0.0639  0.0476  0.0011  0.0093  0.0190 -0.0333   
4 -0.6804  ... -0.5397 -0.0353  0.

Component,count,mean,std,min,25%,50%,75%,max
PC1,2522362.0,5.712899e-08,3.879151,-4.437,-2.0245,-1.9469,0.1371,54.498
PC2,2522362.0,8.198665e-08,2.680336,-5.0348,-0.0494,-0.0363,0.0048,758.2354
PC3,2522362.0,6.117282e-08,2.46569,-14.4236,-0.122,0.2113,0.392,68.2345
PC4,2522362.0,-6.105389e-08,2.058735,-61.9632,-0.6509,-0.3237,0.3772,306.5292
PC5,2522362.0,9.368203e-08,1.896683,-103.0344,-1.1305,-0.767,0.7982,20.7916
PC6,2522362.0,-9.150154e-08,1.754657,-39.8059,-0.7817,-0.1247,0.5792,146.0638
PC7,2522362.0,1.247244e-07,1.601963,-1873.4573,-0.0649,-0.0166,0.0866,0.7912
PC8,2522362.0,-3.762347e-08,1.564327,-19.4929,-0.6548,-0.4978,0.6324,92.0547
PC9,2522362.0,6.275864e-08,1.529373,-84.2699,-0.9682,0.2653,0.7673,141.3122
PC10,2522362.0,4.039864e-08,1.439131,-57.2884,-0.4661,-0.2357,0.4009,77.8411


### Why we do not delete the duplicates introduced by PCA

PCA reduces the dimensionality of the dataset by projecting the original features into a lower-dimensional space that captures most of the variance. As a consequence of this projection, different samples can collapse onto the same coordinates even if they were distinct in the original space, and rounding the components to four decimal places makes such collisions more likely. These duplicates are a natural mathematical outcome of the information loss. They still correspond to distinct original samples, so they should not be removed.
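The effect can be reproduced on a tiny, made-up example. In the sketch below, four distinct 2-D points differ only slightly along the low-variance axis; after projecting onto one principal component and rounding, they collapse into two pairs of identical rows. For brevity the sketch uses plain `PCA` rather than `IncrementalPCA`, and the array `points` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical points: almost all variance lies along x, tiny differences along y
points = np.array([
    [-2.0,  0.001],
    [-2.0, -0.001],
    [ 2.0,  0.001],
    [ 2.0, -0.001],
])

# No duplicates exist in the original 2-D space
original_dupes = int(pd.DataFrame(points).duplicated().sum())

# Project onto the single dominant component and round as in the notebook
projected = PCA(n_components=1).fit_transform(points)
projected_df = pd.DataFrame(projected, columns=["PC1"]).round(4)

# Rows that only differed along the discarded direction are now identical
projected_dupes = int(projected_df.duplicated().sum())

print(original_dupes, projected_dupes)
```

Dropping such rows would discard samples that were genuinely different before the projection, which would also shift the class counts.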

In [39]:
pca_data.df.to_csv("../output/PCA_processed.csv", index=False)

A balanced dataset matters in machine learning because it ensures that all classes are represented in comparable numbers. When each class contains a similar number of samples, the model is less likely to develop a bias toward the majority class. An imbalanced dataset can cause weak performance, especially for minority classes, as the model struggles to recognize and predict them accurately.
Since the dataset is highly imbalanced, we undersample the majority class (`BENIGN`) down to the total number of attack samples, relabel the data as binary (0 = benign, 1 = attack), and draw a random subset of 15,000 rows. This yields a roughly balanced dataset for binary classification that can be used for training the classification models.

In [25]:
# Create a balanced dataset from the scaled (non-PCA) data
normal_traffic = data_sca.df.loc[data_sca.df['Attack Type'] == 'BENIGN']
intrusions = data_sca.df.loc[data_sca.df['Attack Type'] != 'BENIGN']

# Undersample benign traffic to the number of attack rows
normal_traffic = normal_traffic.sample(n=len(intrusions), replace=False)

# Combine both groups and encode the label as binary: 0 = benign, 1 = attack
ids_data = pd.concat([intrusions, normal_traffic])
ids_data['Attack Type'] = np.where((ids_data['Attack Type'] == 'BENIGN'), 0, 1)

# Draw a random subset of 15,000 rows
bc_data_sca = ids_data.sample(n=15000)

print(bc_data_sca['Attack Type'].value_counts())

Attack Type
1    7506
0    7494
Name: count, dtype: int64


In [41]:
bc_data_sca.to_csv("../output/SCA_balanced.csv", index=False)

In [26]:
# Create a balanced dataset from the PCA-transformed data
normal_traffic = pca_data.df.loc[pca_data.df['Attack Type'] == 'BENIGN']
intrusions = pca_data.df.loc[pca_data.df['Attack Type'] != 'BENIGN']

# Undersample benign traffic to the number of attack rows
normal_traffic = normal_traffic.sample(n=len(intrusions), replace=False)

# Combine both groups and encode the label as binary: 0 = benign, 1 = attack
ids_data = pd.concat([intrusions, normal_traffic])
ids_data['Attack Type'] = np.where((ids_data['Attack Type'] == 'BENIGN'), 0, 1)

# Draw a random subset of 15,000 rows
bc_data_pca = ids_data.sample(n=15000)

print(bc_data_pca['Attack Type'].value_counts())

Attack Type
1    7505
0    7495
Name: count, dtype: int64


In [43]:
bc_data_pca.to_csv("../output/PCA_balanced.csv", index=False)
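The two balancing cells above share the same logic, so it could be factored into one helper. The sketch below is one possible version, not the notebook's actual code: the function name `make_balanced_binary`, its parameters, and the fixed `random_state` (added so that repeated runs draw the same rows) are assumptions. It also assumes there are at least as many benign rows as attack rows.

```python
import numpy as np
import pandas as pd


def make_balanced_binary(df, label_col="Attack Type", benign_label="BENIGN",
                         n_rows=None, random_state=42):
    """Undersample benign rows to the attack count and binarize the label.

    Hypothetical helper: returns a new frame where label_col is 0 for benign
    and 1 for any attack, optionally subsampled to n_rows.
    """
    benign = df.loc[df[label_col] == benign_label]
    attacks = df.loc[df[label_col] != benign_label]

    # Undersample the majority class without replacement
    benign = benign.sample(n=len(attacks), replace=False, random_state=random_state)

    balanced = pd.concat([attacks, benign])
    balanced[label_col] = np.where(balanced[label_col] == benign_label, 0, 1)

    # Optional random subset, e.g. 15000 rows as in the cells above
    if n_rows is not None:
        balanced = balanced.sample(n=n_rows, random_state=random_state)
    return balanced


# Made-up demo frame: 80 benign rows and 20 attack rows of two types
demo = pd.DataFrame({
    "PC1": np.arange(100, dtype=float),
    "Attack Type": ["BENIGN"] * 80 + ["DDoS"] * 12 + ["PortScan"] * 8,
})
balanced_demo = make_balanced_binary(demo)
counts = balanced_demo["Attack Type"].value_counts().to_dict()
print(counts)
```

With such a helper, both datasets could be produced by calling it once on `data_sca.df` and once on `pca_data.df` with `n_rows=15000`.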

At this stage, preprocessing is complete, and the dataset is fully prepared for model development.