## Descriptive analysis of your data ##

In [2]:
import pandas as pd
import numpy as np 

# 1. Load data
intrusion_data = pd.read_csv("cybersecurity_intrusion_data.csv", sep=',')

print("="*50)
print("BASIC INFORMATION (intrusion_data.info())\n")
intrusion_data.info()
print("="*50)

print("\nDESCRIPTIVE STATISTICS (intrusion_data.describe())\n")
print(intrusion_data.describe())
print("="*50)

print("\nDATA OVERVIEW (intrusion_data.head())\n")
print(intrusion_data.head())
print("="*50)

BASIC INFORMATION (intrusion_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   session_id           9537 non-null   object 
 1   network_packet_size  9537 non-null   int64  
 2   protocol_type        9537 non-null   object 
 3   login_attempts       9537 non-null   int64  
 4   session_duration     9537 non-null   float64
 5   encryption_used      7571 non-null   object 
 6   ip_reputation_score  9537 non-null   float64
 7   failed_logins        9537 non-null   int64  
 8   browser_type         9537 non-null   object 
 9   unusual_time_access  9537 non-null   int64  
 10  attack_detected      9537 non-null   int64  
dtypes: float64(2), int64(5), object(4)
memory usage: 819.7+ KB

DESCRIPTIVE STATISTICS (intrusion_data.describe())

       network_packet_size  login_attempts  session_duration  \
count          95

- The info() output shows that the column "encryption_used" has missing values: only 7571 of its 9537 entries are non-null, so 1966 are missing.
- The column "attack_detected" is of type int64 and takes only the values 0 and 1, so we can conclude that this is a binary classification problem.
- The "session_id" column is a unique identifier and carries no predictive information, so we drop it later.
- The column "session_duration" needs scaling, because there is a large gap between its maximum value and its 75th percentile.
- The problem is relatively balanced, with approximately 45% of sessions classified as attacks.
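
The observations above can be checked directly rather than read off the printed tables. The sketch below is a minimal illustration on a hypothetical toy frame (the name `toy` and all of its values are made up for the example); the same three calls would apply to `intrusion_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for intrusion_data (values are made up)
toy = pd.DataFrame({
    "session_duration": [30.5, 120.0, 15.2, 980.4, 60.0],
    "encryption_used": ["AES", np.nan, "DES", np.nan, "AES"],
    "attack_detected": [0, 1, 0, 1, 0],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of each class in the target, to judge how balanced the problem is
class_share = toy["attack_detected"].value_counts(normalize=True)
print(class_share)

# Compare the 75th percentile with the maximum to spot a long right tail
duration_stats = toy["session_duration"].describe()
print(duration_stats[["75%", "max"]])
```

On the real data, `intrusion_data.isna().sum()` should report 1966 missing values for "encryption_used", consistent with the non-null counts in the info() output.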

## Implementation of the necessary pre-processing ##

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

# Drop the unique identifier column and use attack_detected as the target variable
X = intrusion_data.drop(['attack_detected', 'session_id'], axis=1)
y = intrusion_data['attack_detected']

#Splitting dataset into training and testing 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) 

print(f"Size of X_Train(X_train): {X_train.shape}")
print(f"Size of X_Test(X_test): {X_test.shape}")

# Identifying columns by type
numerical_features = X.select_dtypes(include=np.number).columns
categorical_features = X.select_dtypes(include='object').columns

Size of X_Train(X_train): (7629, 9)
Size of X_Test(X_test): (1908, 9)
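
The split above passes `stratify=y`, which keeps the proportion of attacks the same in the training and test sets. The sketch below illustrates this on hypothetical toy labels (45 attacks out of 100 sessions, chosen only to mirror the roughly 45% attack rate noted earlier); the names `toy_X` and `toy_y` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 45 attacks (1) and 55 normal sessions (0)
toy_y = np.array([1] * 45 + [0] * 55)
toy_X = np.arange(100).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both subsets keep the original attack share
print(f"Attack share in train: {y_tr.mean():.2f}")
print(f"Attack share in test:  {y_te.mean():.2f}")
```

Without `stratify`, the two shares could drift apart by chance, which would make recall on the test set harder to compare with training behaviour.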


In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Numerical features: standardisation (scaling)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Categorical features: fill missing values with the most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), 
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [20]:
from sklearn.compose import ColumnTransformer

# Combine the two pipelines, each applied to its own set of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
)

In [22]:
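# Fit the preprocessor on the training set only, then transform the training set
X_train_processed = preprocessor.fit_transform(X_train)

# Apply the already-fitted preprocessor to the test set (no refitting)
X_test_processed = preprocessor.transform(X_test)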

print("\nAfter pre-processing")
print(f"Shape of X_train_processed: {X_train_processed.shape}")
print(f"Shape of X_test_processed: {X_test_processed.shape}")

# Optional: show the number of features created
num_new_features = X_train_processed.shape[1]
print(f"Total number of features after encoding: {num_new_features}")


After pre-processing
Shape of X_train_processed: (7629, 16)
Shape of X_test_processed: (1908, 16)
Total number of features after encoding: 16


## Formalisation of the problem ##

The project addresses a binary classification problem using supervised learning. The main objective is to build a model that predicts, for each network session, whether it is an intrusion/attack (class 1) or normal traffic (class 0). The target variable is the attack_detected column. The predictor variables are all other columns except session_id, after imputation, encoding, and scaling. Model evaluation focuses primarily on recall and the F1-score for the 'Attack' class. These metrics are prioritized because the most critical risk in cybersecurity is the false negative (an undetected attack), and accuracy alone does not show how many attacks are missed.
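
For reference, these metrics have the standard definitions below, where TP, FP, and FN are the true positive, false positive, and false negative counts for the attack class (class 1).

$$
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$

Every undetected attack adds to FN and lowers recall directly, whereas accuracy averages correct predictions over both classes and can hide such misses.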

## Selection of a baseline model and implementation of the model ##

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import time

# Model Training 
start_time = time.time()
baseline_model = LogisticRegression(random_state=42, solver='liblinear', max_iter=500) 

baseline_model.fit(X_train_processed, y_train)
end_time = time.time()

# Prediction 
y_pred = baseline_model.predict(X_test_processed)

# Predict probabilities for AUC-ROC calculation
y_proba = baseline_model.predict_proba(X_test_processed)[:, 1] 

# Evaluation
print("="*60)
print("             BASELINE MODEL: LOGISTIC REGRESSION RESULTS")
print("="*60)

# 1. Classification Report (F1, Precision, Recall, Accuracy)
print("\n1. CLASSIFICATION REPORT:\n")
print(classification_report(y_test, y_pred, target_names=['Normal (0)', 'Attack (1)']))

# 2. AUC-ROC Score
auc_roc = roc_auc_score(y_test, y_proba)
print(f"2. AUC-ROC Score: {auc_roc:.4f}")

# 3. Training Time
print(f"\n3. Training Time: {end_time - start_time:.4f} seconds")

# 4. Confusion Matrix (Visualizing Errors)
conf_matrix = confusion_matrix(y_test, y_pred)
print("\n4. CONFUSION MATRIX:")
# Label each cell of the matrix so the false negatives (FN) are easy to spot
print(f"True Negatives (TN): {conf_matrix[0, 0]}")
print(f"False Positives (FP): {conf_matrix[0, 1]}")
print(f"False Negatives (FN - CRITICAL ERROR): {conf_matrix[1, 0]}")
print(f"True Positives (TP): {conf_matrix[1, 1]}")
print("\n----------------------------------------------------------")

             BASELINE MODEL: LOGISTIC REGRESSION RESULTS

1. CLASSIFICATION REPORT:

              precision    recall  f1-score   support

  Normal (0)       0.74      0.79      0.76      1055
  Attack (1)       0.72      0.65      0.68       853

    accuracy                           0.73      1908
   macro avg       0.73      0.72      0.72      1908
weighted avg       0.73      0.73      0.73      1908

2. AUC-ROC Score: 0.7873

3. Training Time: 0.0426 seconds

4. CONFUSION MATRIX:
True Negatives (TN): 837
False Positives (FP): 218
False Negatives (FN - CRITICAL ERROR): 300
True Positives (TP): 553

----------------------------------------------------------
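
As a sanity check, the attack-class metrics in the classification report can be recomputed by hand from the four confusion matrix counts printed above. The sketch below uses only those counts; the variable names are chosen for the example.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 837, 218, 300, 553

recall = tp / (tp + fn)                       # share of attacks detected
precision = tp / (tp + fp)                    # share of alerts that are real attacks
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
miss_rate = fn / (tp + fn)                    # share of attacks that go undetected

print(f"Recall (Attack):    {recall:.2f}")
print(f"Precision (Attack): {precision:.2f}")
print(f"F1 (Attack):        {f1:.2f}")
print(f"Accuracy:           {accuracy:.2f}")
print(f"Miss rate (Attack): {miss_rate:.2f}")
```

The recomputed values agree with the report (recall 0.65, precision 0.72, F1 0.68, accuracy 0.73). The miss rate makes the cost of the baseline explicit: 300 of the 853 attacks in the test set, about 35%, are classified as normal traffic.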
