# 02 – Data Preprocessing

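In this notebook, we prepare the extracted URL features for model training.
We will:
- Load the saved features dataset
- Split the data into input features `X` and labels `y`
- Create stratified training and testing sets
- Standardize the features with `StandardScaler`
- Save the processed arrays for the next notebook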


In [None]:
import pandas as pd

# Load the dataset containing the engineered URL features
features_df = pd.read_csv("../data/features_dataset.csv")

# Display the first few rows to inspect the structure
features_df.head()


,url_length,num_digits,num_special_chars,num_dots,num_subdirs,has_ip,has_https,label
0,16,0,3,2,0,0,0,1
1,35,1,4,2,2,0,0,0
2,31,1,5,2,3,0,0,0
3,88,7,16,3,3,0,0,1
4,235,22,13,2,3,0,0,1


## Separate features and labels

We split the dataset into:
- `X`: input features
- `y`: target label (0 = benign, 1 = malicious)

In [None]:
# Split the dataset into features (X) and target labels (y)
X = features_df.drop('label', axis=1)  # All columns except 'label'
y = features_df['label']              # The target column


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
# - 75% for training, 25% for testing
# - Stratify ensures the class distribution (benign/malicious) is preserved
# - random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Training set size: (488393, 7)
Test set size: (162798, 7)
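To see what `stratify=y` does, the sketch below compares class proportions before and after a split. It runs on a small synthetic dataset, not the real one: `X_demo`, `y_demo`, and the 30% positive rate are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows with 30% positive labels (not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"url_length": rng.integers(10, 200, size=1000)})
y_demo = pd.Series([1] * 300 + [0] * 700, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Share of the positive class in the full set and in each subset
full_ratio = y_demo.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"full={full_ratio:.3f} train={train_ratio:.3f} test={test_ratio:.3f}")
```

The same comparison can be run on the real `y`, `y_train`, and `y_test` to confirm that the benign/malicious balance survived the split.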


In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the feature values to have zero mean and unit variance
# Important: fit the scaler only on the training data to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # Fit + transform on training set
X_test_scaled = scaler.transform(X_test)         # Only transform on test set
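The following sketch illustrates why only the training set is used to fit the scaler. `transform` applies the mean and standard deviation learned from the training data, so the test set never influences those statistics. The arrays `X_train_demo` and `X_test_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays standing in for X_train and X_test
X_train_demo = np.array([[10.0, 0.0], [20.0, 1.0], [30.0, 5.0], [40.0, 2.0]])
X_test_demo = np.array([[25.0, 3.0], [100.0, 0.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Statistics computed by hand from the training data only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Applying the training statistics manually reproduces scaler.transform on the test set
manual_test = (X_test_demo - train_mean) / train_std
sklearn_test = scaler_demo.transform(X_test_demo)

print(np.allclose(manual_test, sklearn_test))
```

Because the test set is scaled with training statistics, its own mean and standard deviation will generally not be exactly 0 and 1, which is expected.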


In [None]:
import numpy as np

# Check that the training features are properly standardized
# The mean should be ~0 and the standard deviation ~1 for each feature
print("Mean of features (X_train_scaled):", np.mean(X_train_scaled, axis=0))
print("Standard deviation of features (X_train_scaled):", np.std(X_train_scaled, axis=0))


Mean of features (X_train_scaled): [ 2.23175303e-17  1.88404183e-17 -5.69431640e-17 -4.71956116e-17
 -2.85734221e-17 -3.31707751e-17  4.18999265e-18]
Standard deviation of features (X_train_scaled): [1. 1. 1. 1. 1. 1. 1.]
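The means printed above are tiny nonzero values rather than exact zeros, which is ordinary floating-point rounding. Instead of reading the numbers by eye, the check can be made programmatic with a tolerance. This is a minimal sketch on synthetic data; `X_demo` and the chosen tolerance are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 1000 rows, 3 features, far from zero mean and unit variance
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=12.0, size=(1000, 3))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

means = X_demo_scaled.mean(axis=0)
stds = X_demo_scaled.std(axis=0)

# Compare against 0 and 1 within a tolerance instead of testing exact equality
means_ok = np.allclose(means, 0.0, atol=1e-8)
stds_ok = np.allclose(stds, 1.0)
print(means_ok, stds_ok)
```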


In [None]:
# Display the first 5 rows of the scaled training data
# This gives a sense of what the standardized features look like
print("Example of scaled data:")
print(X_train_scaled[:5])


Example of scaled data:
[[-0.65200784 -0.21517581 -0.63675689 -0.80300455 -1.02535078 -0.13976043
  -0.15766416]
 [-0.69675067 -0.38602931 -0.63675689 -0.13080544 -0.4995363  -0.13976043
  -0.15766416]
 [-0.40592226 -0.47145606 -0.38020784 -0.13080544  0.55209266 -0.13976043
  -0.15766416]
 [-0.38355084 -0.47145606 -0.38020784 -0.13080544 -0.4995363  -0.13976043
  -0.15766416]
 [-0.27169376 -0.30060256 -0.25193331 -0.80300455  0.02627818 -0.13976043
  -0.15766416]]


In [None]:
# Save the processed datasets as .npy files to reuse them in the next notebook (Model Training)
import numpy as np

np.save("X_train_scaled.npy", X_train_scaled)
np.save("X_test_scaled.npy", X_test_scaled)
np.save("y_train.npy", y_train)
np.save("y_test.npy", y_test)
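As a rough sketch of how the next notebook might read these files back, the example below does a save and load round trip with `np.save` and `np.load`. It uses a temporary directory and small hypothetical arrays (`X_demo_scaled`, `y_demo`) so that it runs on its own; the real file names are the ones used in the cell above.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the arrays saved above
X_demo_scaled = np.array([[0.5, -1.2], [-0.5, 1.2]])
y_demo = pd.Series([0, 1], name="label")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = Path(tmp_dir) / "X_demo_scaled.npy"
    y_path = Path(tmp_dir) / "y_demo.npy"

    np.save(x_path, X_demo_scaled)
    # Explicit conversion: only the values are stored, not the Series index or name
    np.save(y_path, y_demo.to_numpy())

    # Read the arrays back while the temporary files still exist
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, y_loaded.shape)
```

Note that the labels come back as a plain NumPy array, so any code in the next notebook that relies on the original pandas index would need to be adapted.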


## Preprocessing Completed

The dataset has been successfully preprocessed and is now ready for model training.

We performed the following steps:
- Loaded the engineered feature dataset
- Separated the input features `X` from the labels `y`
- Split the data into stratified training (75%) and testing (25%) sets
- Standardized the feature values using `StandardScaler`, fitted on the training set only
- Saved the processed arrays as `.npy` files for later use

You can now proceed to the next notebook, **Model_training.ipynb**, to build and evaluate classification models.
