# Data Preprocessing and Feature Engineering

This notebook handles the preprocessing of the CICIDS2017 dataset, including:
- Data loading and exploration
- Cleaning and preprocessing
- Label encoding
- Feature scaling
- Saving processed data for model training

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


## 2. Load and Explore Dataset

In [2]:
df = pd.read_csv("data/CICIDS2017.csv")
print(df.shape)


(547557, 91)

In [3]:
# Examine the dataset structure
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nDataset head:")
print(df.head())
print("\nLabel distribution:")
print(df['Label'].value_counts())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547557 entries, 0 to 547556
Data columns (total 91 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   id                          547557 non-null  int64  
 1   Flow ID                     547557 non-null  object 
 2   Src IP                      547557 non-null  object 
 3   Src Port                    547557 non-null  int64  
 4   Dst IP                      547557 non-null  object 
 5   Dst Port                    547557 non-null  int64  
 6   Protocol                    547557 non-null  int64  
 7   Timestamp                   547557 non-null  object 
 8   Flow Duration               547557 non-null  int64  
 9   Total Fwd Packet            547557 non-null  int64  
 10  Total Bwd packets           547557 non-null  int64  
 11  Total Length of Fwd Packet  547557 non-null  int64  
 12  Total Length of Bwd Packet  547557 non-null  int64  
 13  

In [4]:
# Drop identifier and metadata columns that are not used as features.
# Only columns actually present in the DataFrame are dropped.
potential_drops = ['id', 'Flow ID', 'Src IP', 'Dst IP', 'Timestamp', 'Attempted Category', 'Src Port', 'Dst Port']
columns_to_drop = [col for col in potential_drops if col in df.columns]

if columns_to_drop:
    df.drop(columns=columns_to_drop, inplace=True)
    print(f"Dropped columns: {columns_to_drop}")

print(f"Remaining columns: {len(df.columns)}")
print(f"Dataset shape after dropping columns: {df.shape}")


Dropped columns: ['id', 'Flow ID', 'Src IP', 'Dst IP', 'Timestamp', 'Attempted Category', 'Src Port', 'Dst Port']
Remaining columns: 83
Dataset shape after dropping columns: (547557, 83)


## 3. Data Cleaning and Preprocessing

In [5]:
# Treat infinite values as missing, then drop every row that contains a missing value
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
print("Shape after cleaning:", df.shape)


Shape after cleaning: (547557, 83)
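
The shape is unchanged after cleaning, so no rows in this copy of the data contained infinite or missing values. A minimal sketch of how one might count such values per column before dropping anything is shown here. The frame `toy` and its column names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; names and values are invented.
toy = pd.DataFrame({
    "rate": [10.0, np.inf, 3.5, np.nan],
    "duration": [100, 200, 300, 400],
})

numeric = toy.select_dtypes(include=[np.number])

# Count infinite and missing entries per column before any rows are removed.
inf_counts = np.isinf(numeric).sum()
nan_counts = numeric.isna().sum()
print(inf_counts[inf_counts > 0])
print(nan_counts[nan_counts > 0])

# Same cleaning steps as the notebook, applied without modifying the original frame.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Rows dropped: {rows_dropped}")
```

Reporting the counts first makes it clear which columns are responsible when rows do disappear.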


## 4. Label Encoding

In [6]:
import os
import joblib

# Encode the string class labels as integers
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Label'])

# Save the encoder so integer predictions can be mapped back to class names later
os.makedirs("models", exist_ok=True)
joblib.dump(le, "models/label_encoder.pkl")
print("Label encoding completed and saved to models/label_encoder.pkl")


Label encoding completed and saved to models/label_encoder.pkl
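
The notebook does not print which integer corresponds to which class. A small sketch of how the mapping could be inspected with `classes_` and reversed with `inverse_transform` follows. The class names in `raw_labels` are hypothetical, since the notebook's output does not show the actual label values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names; the real labels are not shown in the notebook output.
raw_labels = ["BENIGN", "DoS", "BENIGN", "PortScan"]

le = LabelEncoder()
encoded = le.fit_transform(raw_labels)

# classes_ is sorted, and each class's position is its integer code.
mapping = {name: int(code) for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Printing the mapping once, right after fitting, makes later confusion matrices and classification reports easier to read.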


In [7]:
# Separate the feature matrix from the encoded target column
X = df.drop(columns="Label")
y = df["Label"]


## 5. Feature Scaling and Data Preparation

In [8]:
# Rescale every feature to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
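
`MinMaxScaler` maps each feature to `(x - min) / (max - min)` using the minimum and maximum seen during `fit`. The cell above fits on the full dataset. If the data is later split into training and test sets, the test rows will already have influenced those minima and maxima. A common alternative, sketched here on random toy data and not the approach this notebook takes, is to split first and fit the scaler on the training portion only. The split ratio and `random_state` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random toy data standing in for the real feature matrix and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split before scaling so the test rows do not influence the fitted min/max.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this arrangement, scaled test values can fall slightly outside [0, 1], which is expected because the test rows were not used to compute the range.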


## 6. Save Processed Data

In [9]:
# Save the fitted scaler and the processed arrays
joblib.dump(scaler, "models/scaler.pkl")
np.save("data/X_scaled.npy", X_scaled)
np.save("data/y.npy", y.to_numpy())

print("Scaler saved to models/scaler.pkl")
print("Scaled features saved to data/X_scaled.npy")
print("Labels saved to data/y.npy")
print(f"Final dataset shape: {X_scaled.shape}")
print(f"Number of classes: {len(np.unique(y))}")


Scaler saved to models/scaler.pkl
Scaled features saved to data/X_scaled.npy
Labels saved to data/y.npy
Final dataset shape: (547557, 82)
Number of classes: 5
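
For the model-training step, the saved artifacts can be read back with `joblib.load` and `np.load`. The sketch below is a self-contained round trip on toy data, written to a temporary directory and not to the notebook's `models/` and `data/` paths. The array values and the names `X_toy` and `new_sample` are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    features_path = os.path.join(tmp, "X_scaled.npy")

    # Fit and save, mirroring the notebook's save step on a tiny array.
    X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    scaler = MinMaxScaler().fit(X_toy)
    joblib.dump(scaler, scaler_path)
    np.save(features_path, scaler.transform(X_toy))

    # Load both artifacts back, as a training notebook would.
    loaded_scaler = joblib.load(scaler_path)
    loaded_X = np.load(features_path)

# The reloaded scaler applies the same min/max to unseen rows.
new_sample = np.array([[2.5, 25.0]])
scaled_sample = loaded_scaler.transform(new_sample)
print(loaded_X.shape)
print(scaled_sample)
```

Reusing the saved scaler at inference time keeps new samples on the same scale as the training features.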
