# Machine Learning Project

In this step, we use the dataset from my group's project: Cybersecurity Intrusion Detection, from Kaggle.

### 1. Descriptive analysis of our data

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [42]:
data = pd.read_csv("cybersecurity_intrusion_data.csv")

data.head()

|  | session_id | network_packet_size | protocol_type | login_attempts | session_duration | encryption_used | ip_reputation_score | failed_logins | browser_type | unusual_time_access | attack_detected |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | SID_00001 | 599 | TCP | 4 | 492.983263 | DES | 0.606818 | 1 | Edge | 0 | 1 |
| 1 | SID_00002 | 472 | TCP | 3 | 1557.996461 | DES | 0.301569 | 0 | Firefox | 0 | 0 |
| 2 | SID_00003 | 629 | TCP | 3 | 75.044262 | DES | 0.739164 | 2 | Chrome | 0 | 1 |
| 3 | SID_00004 | 804 | UDP | 4 | 601.248835 | DES | 0.123267 | 0 | Unknown | 0 | 1 |
| 4 | SID_00005 | 453 | TCP | 5 | 532.540888 | AES | 0.054874 | 1 | Firefox | 0 | 0 |


In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   session_id           9537 non-null   object 
 1   network_packet_size  9537 non-null   int64  
 2   protocol_type        9537 non-null   object 
 3   login_attempts       9537 non-null   int64  
 4   session_duration     9537 non-null   float64
 5   encryption_used      7571 non-null   object 
 6   ip_reputation_score  9537 non-null   float64
 7   failed_logins        9537 non-null   int64  
 8   browser_type         9537 non-null   object 
 9   unusual_time_access  9537 non-null   int64  
 10  attack_detected      9537 non-null   int64  
dtypes: float64(2), int64(5), object(4)
memory usage: 819.7+ KB


In [32]:
data.describe()

|  | network_packet_size | login_attempts | session_duration | ip_reputation_score | failed_logins | unusual_time_access | attack_detected |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 | 9537.0 |
| mean | 500.430639 | 4.032086 | 792.745312 | 0.331338 | 1.517773 | 0.149942 | 0.447101 |
| std | 198.379364 | 1.963012 | 786.560144 | 0.177175 | 1.033988 | 0.357034 | 0.49722 |
| min | 64.0 | 1.0 | 0.5 | 0.002497 | 0.0 | 0.0 | 0.0 |
| 25% | 365.0 | 3.0 | 231.953006 | 0.191946 | 1.0 | 0.0 | 0.0 |
| 50% | 499.0 | 4.0 | 556.277457 | 0.314778 | 1.0 | 0.0 | 0.0 |
| 75% | 635.0 | 5.0 | 1105.380602 | 0.453388 | 2.0 | 0.0 | 1.0 |
| max | 1285.0 | 13.0 | 7190.392213 | 0.924299 | 5.0 | 1.0 | 1.0 |


In [33]:
data.apply(lambda x: x.unique(), axis=0)

session_id             [SID_00001, SID_00002, SID_00003, SID_00004, S...
network_packet_size    [599, 472, 629, 804, 453, 815, 653, 406, 608, ...
protocol_type                                           [TCP, UDP, ICMP]
login_attempts               [4, 3, 5, 2, 6, 9, 8, 1, 7, 10, 12, 13, 11]
session_duration       [492.9832634426563, 1557.9964611204384, 75.044...
encryption_used                                          [DES, AES, nan]
ip_reputation_score    [0.606818080396889, 0.3015689675960893, 0.7391...
failed_logins                                         [1, 0, 2, 3, 4, 5]
browser_type                    [Edge, Firefox, Chrome, Unknown, Safari]
unusual_time_access                                               [0, 1]
attack_detected                                                   [1, 0]
dtype: object

In [1]:
data.isnull().sum()

NameError: name 'data' is not defined

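So we have:  
- 11 columns in total (10 input columns, including the `session_id` identifier, plus the target)
- numeric columns: `network_packet_size`, `login_attempts`, `session_duration`, `ip_reputation_score`, `failed_logins`, `unusual_time_access`, `attack_detected`
- categorical columns: `session_id`, `protocol_type`, `encryption_used`, `browser_type`
- target: `attack_detected`

Only `encryption_used` has missing values: 7,571 of its 9,537 entries are non-null in the `data.info()` output, so 1,966 are missing.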
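
The missing-value check can be sketched on a small stand-in frame. The `toy` DataFrame and its values below are illustrative assumptions, not rows from the real dataset; the same two calls would apply to `data` once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "protocol_type": ["TCP", "UDP", "TCP", "ICMP"],
    "encryption_used": ["DES", np.nan, "AES", np.nan],
    "failed_logins": [1, 0, 2, 0],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Share of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share.round(2))
```

Looking at the share as well as the count helps decide whether a column should be imputed, given its own "missing" category, or dropped.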

| Column | Description | Type | Example Values |
| :--- | :--- | :--- | :--- |
| session_id | Unique identifier for each network session | string | SID_00001 |
| network_packet_size | Average packet size during the session | int | 472 |
| protocol_type | Network protocol used | categorical | TCP, UDP, ICMP |
| login_attempts | Number of login attempts made during session | int | 4 |
| session_duration | Total session length (in seconds) | float | 492.98 |
| encryption_used | Encryption algorithm applied | categorical | DES, AES, NaN (missing) |
| ip_reputation_score | IP trust score between 0 (bad) and 1 (good) | float | 0.60 |
| failed_logins | Number of failed login attempts | int | 1 |
| browser_type | Client browser used during session | categorical | Chrome, Firefox |
| unusual_time_access | Indicates off-hour access (1) or normal (0) | binary | 0 / 1 |
| attack_detected | Target variable: 1 = attack detected, 0 = normal | binary | 0 / 1 |

**Data Quality Assessment:**  
**Accuracy**: All variables seem plausible  
**Completeness**: As noted before, only `encryption_used` has missing values  
**Consistency**: Strong internal consistency; the categorical columns take only a few distinct values (3 protocols, 2 encryption algorithms, 5 browsers)  
**Timeliness**: The only time-related feature is `unusual_time_access`, and it is binary  
**Believability**: Values appear internally coherent  
**Interpretability**: Columns follow clear naming conventions
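
Some of these quality criteria can be turned into simple programmatic checks. The sketch below uses a small hypothetical frame (`toy`) with the same column names as the dataset; the specific checks are assumptions about what "consistent" means here, not rules taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical sample with the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "session_id": ["SID_00001", "SID_00002", "SID_00003"],
    "ip_reputation_score": [0.61, 0.30, 0.74],
    "unusual_time_access": [0, 0, 1],
    "attack_detected": [1, 0, 1],
})

binary_cols = ["unusual_time_access", "attack_detected"]

# Each entry maps a human-readable rule to a boolean result.
checks = {
    "session_id is unique": bool(toy["session_id"].is_unique),
    "ip_reputation_score within [0, 1]": bool(toy["ip_reputation_score"].between(0, 1).all()),
    "binary columns contain only 0/1": bool(toy[binary_cols].isin([0, 1]).all().all()),
    "no fully duplicated rows": not bool(toy.duplicated().any()),
}

for name, passed in checks.items():
    print(f"{name}: {passed}")
```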

### 2. Implementation of the necessary pre-processing

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Handle missing values :

In [35]:
for col in data.columns:
    # Assign the result back instead of calling fillna(..., inplace=True) on a
    # column selection, which pandas warns will stop working in pandas 3.0.
    if data[col].dtype == 'object':
        # Categorical columns: fill with the most frequent value
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        # Numeric columns: fill with the column mean
        data[col] = data[col].fillna(data[col].mean())


Encode categorical variables and remove duplicates

In [36]:
label_encoders = {}
for col in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

data.drop_duplicates(inplace=True)

In [37]:
data.head()

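|  | session_id | network_packet_size | protocol_type | login_attempts | session_duration | encryption_used | ip_reputation_score | failed_logins | browser_type | unusual_time_access | attack_detected |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 599 | 1 | 4 | 492.983263 | 1 | 0.606818 | 1 | 1 | 0 | 1 |
| 1 | 1 | 472 | 1 | 3 | 1557.996461 | 1 | 0.301569 | 0 | 2 | 0 | 0 |
| 2 | 2 | 629 | 1 | 3 | 75.044262 | 1 | 0.739164 | 2 | 0 | 0 | 1 |
| 3 | 3 | 804 | 2 | 4 | 601.248835 | 1 | 0.123267 | 0 | 4 | 0 | 1 |
| 4 | 4 | 453 | 1 | 5 | 532.540888 | 0 | 0.054874 | 1 | 2 | 0 | 0 |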
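
To read the encoded table, it helps to recover which integer stands for which category. `LabelEncoder` assigns codes in sorted order of the labels, which is why TCP appears as 1 and UDP as 2 above. A minimal sketch on a short list of protocol values (the list itself is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative protocol values; the real column contains TCP, UDP and ICMP.
protocols = ["TCP", "TCP", "UDP", "ICMP"]

le = LabelEncoder()
codes = le.fit_transform(protocols)

# classes_ is sorted, and the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)
print(list(le.inverse_transform(codes)))
```

In the notebook, the same lookup should be available through `label_encoders[col].classes_` for each encoded column. Note that integer codes imply an ordering between categories, which a linear model will treat as meaningful; one-hot encoding is a common alternative when that is not intended.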


Split into features and target

In [38]:
X = data.drop(columns=['attack_detected'])
y = data['attack_detected']

Feature scaling

In [39]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Split into train/test sets

In [40]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")

Training samples: 7629, Testing samples: 1908
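
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training part only. All names (`X_demo`, `demo_scaler`, etc.) and the random data are assumptions for illustration; this is not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a balanced binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.array([0, 1] * 50)

# 1) Split first, keeping the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Fit the scaler on the training rows only, then apply it to both parts.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(X_tr_scaled.mean(axis=0).round(6))
```

With this ordering, the test set stays unseen until evaluation, so the reported scores are less likely to be optimistic.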


### 3. Formalisation of the problem

#### 1. Problem Definition

This project is developing a **machine learning model** that can identify and categorize potential **cybersecurity intrusions** through data generated from network traffic or system activity.

The project uses the dataset `cybersecurity_intrusion_data.csv`, which contains **session-level features** (e.g. protocol type, packet size, session duration, login attempts, IP reputation score). It also includes a **label**, `attack_detected`, indicating whether the activity is normal or a potential cyber attack.

Thus, from this information, we can formulate the problem as a **supervised classification problem**, where each observation (a network session) is assigned a class label that must be predicted.

---

#### 2. Inputs and Outputs

- **Input (X):**
  The session-level columns of the dataset, for example:
  - Session duration (`session_duration`)
  - Protocol type (TCP, UDP, ICMP)
  - Network packet size (`network_packet_size`)
  - Login attempts and failed logins (`login_attempts`, `failed_logins`)
  - Other behavioral indicators: `encryption_used`, `ip_reputation_score`, `browser_type`, `unusual_time_access`

- **Output (y):**
  - A binary label, `attack_detected`, which is either:
    - 0, which indicates legitimate (normal) behavior, or
    - 1, which indicates a detected attack.

#### 3. Type of Learning

This is a **Supervised Machine Learning** task, specifically:
- **Type:** Binary classification
- **Evaluation Metrics:** Because the two classes are not perfectly balanced, we will use:
  - Accuracy
  - Precision, Recall, and F1-score
  - Confusion Matrix
  - ROC-AUC (applicable here because the target is binary)
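
As an illustration of how ROC-AUC differs from the label-based metrics, the sketch below computes it from predicted probabilities rather than hard predictions. It runs on synthetic data from `make_classification`; the data, the variable names, and the model settings are assumptions and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for the intrusion data).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# ROC-AUC needs a score per sample: here, the probability of the positive class.
proba_positive = clf.predict_proba(X_b)[:, 1]
auc = roc_auc_score(y_b, proba_positive)

# F1 is computed from hard 0/1 predictions at the default 0.5 threshold.
f1 = f1_score(y_b, clf.predict(X_b))

print(f"ROC-AUC: {auc:.3f}, F1: {f1:.3f}")
```

ROC-AUC summarizes ranking quality across all thresholds, while precision, recall, and F1 describe one specific threshold, so the two views complement each other.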

---

#### 4. Expected Outcome

In short, the trained model should determine whether a session is **normal** or **malicious**, and it should generalize to **unseen data**, i.e. new network traffic.

This formalization serves as a basis for the subsequent stages: **Data Preprocessing**, **Model Training**, and **Model Evaluation**.

### 4. Baseline model selection and implementation
Before constructing more intricate machine learning architectures, it is essential to create a **baseline model**.
A baseline model provides a **point of reference** against which we can check whether later, more elaborate models actually improve performance.

To begin, we adopt a **basic, interpretable classifier** for our intrusion detection task, trained on the scaled numerical features and the label-encoded categorical features.

In this case, we chose **Logistic Regression** as the baseline model because:

- It is one of the **simplest supervised classification algorithms**.
- It provides **probabilistic predictions** which are helpful for understanding model confidence.
- It is **quick to fit** and therefore provides a **good reference point**.
- It performs well on **linearly separable** data.

Because `attack_detected` is binary, the model is fitted on two classes; **Multinomial Logistic Regression** is the variant that would handle a **multi-class** target with several attack types.

Since the dataset is **moderately imbalanced** (about 45% of sessions are labeled as attacks), we cannot simply rely on **accuracy**.
We will evaluate the performance of the baseline model using:
- **Accuracy**
- **Precision**
- **Recall**
- **F1-score**
- **Confusion Matrix**

These metrics will reflect the model's strengths and weaknesses.
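
A useful floor for these metrics is the accuracy of always predicting the majority class. The sketch below reproduces that floor with scikit-learn's `DummyClassifier`, using labels built from the test-set class supports shown in the classification report of this section (1,055 normal, 853 attack). The arrays `y_ref` and `X_ref` are constructed for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the reported test-set supports: 1055 normal, 853 attack.
y_ref = np.array([0] * 1055 + [1] * 853)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_ref = np.zeros((len(y_ref), 1))

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_ref, y_ref)
floor_acc = accuracy_score(y_ref, dummy.predict(X_ref))

print(f"Majority-class accuracy: {floor_acc:.4f}")
```

Any trained model should clearly beat this value; a majority-class predictor also has zero recall on attacks, which is why accuracy alone is not enough.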

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# pandas, numpy, train_test_split and the preprocessing classes were already
# imported in the cells above, so they are not imported again here.

# Initialize and train Logistic Regression model on the scaled training set.
# Note: the multi_class argument is deprecated in recent scikit-learn releases.
model = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Baseline Model Evaluation")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Baseline Model Evaluation
Accuracy: 0.7248

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.78      0.76      1055
           1       0.71      0.65      0.68       853

    accuracy                           0.72      1908
   macro avg       0.72      0.72      0.72      1908
weighted avg       0.72      0.72      0.72      1908


Confusion Matrix:
 [[827 228]
 [297 556]]




The Logistic Regression model sets the baseline at an accuracy of 72.5%, with precision and recall between 0.65 and 0.78 for both classes. It separates normal sessions from attacks reasonably well but still misclassifies a substantial share of instances: recall for class 1 is 0.65, meaning 297 of the 853 attacks in the test set are missed (false negatives), while 228 of the 1,055 normal sessions are flagged as attacks (false positives). Overall, the baseline shows that the model captures part of the structure in the data, and that more robust models will be needed to increase accuracy and reduce false negatives.
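
The reported scores for the attack class can be recomputed directly from the confusion matrix above, which makes the link between the four cell counts and each metric explicit. Only the matrix values come from the notebook output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the baseline output: rows = true class, columns = predicted.
cm = np.array([[827, 228],
               [297, 556]])

# For labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted attacks that are real attacks
recall = tp / (tp + fn)      # share of real attacks that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The results match the classification report for class 1 (precision 0.71, recall 0.65, F1 0.68) and the overall accuracy of 0.7248.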