## 🧪 Anomaly Detection using Gaussian Mixture Models (GMM)

In this notebook, we explore an unsupervised anomaly detection approach using **Gaussian Mixture Models (GMM)**.  
The goal is to detect legitimate traffic (normal behavior) as outliers, having trained the model exclusively on malicious traffic (attacks).

This setup reflects an intrusion detection scenario in which attack data dominates: in this dataset, attack flows outnumber normal flows 100 to 1 (8,700 vs. 87), so only the attack class has enough samples to model.  
The model learns the underlying distribution of the attack class and treats any deviation from it as an anomaly.

### 🎯 Objective

- Train a GMM using only attack samples.
- Evaluate the model on a test set containing:
  - Unseen attacks (positive class)
  - All available normal traffic (to be detected as anomalies)

### ⚙️ Method

- Use `sklearn.mixture.GaussianMixture` to learn the probabilistic distribution of attack flows.
- Score test samples using the log-likelihood under the learned model.
- Apply a threshold to the log-likelihood to classify samples as attack (inlier) or normal (outlier).
- Evaluate performance using precision, recall, F1-score, ROC AUC, and confusion matrix.

This approach is well-suited for highly imbalanced datasets where the minority class (normal traffic) is rare or partially labeled.
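
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the two-dimensional Gaussian clusters, the sample sizes, and the variable names are all assumptions chosen so the example runs on its own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): "attack" flows cluster around the origin,
# "normal" flows sit far away from that cluster.
attacks = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
normals = rng.normal(loc=8.0, scale=1.0, size=(20, 2))

# Train on attacks only; hold out the last 100 attacks for testing.
train, test_attacks = attacks[:400], attacks[400:]
gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gmm.fit(train)

# Threshold = 5th percentile of the training log-likelihoods.
threshold = np.percentile(gmm.score_samples(train), 5)

# 1 = attack (inlier), 0 = normal (outlier).
pred_attacks = (gmm.score_samples(test_attacks) > threshold).astype(int)
pred_normals = (gmm.score_samples(normals) > threshold).astype(int)

print("Held-out attacks kept as inliers:", pred_attacks.mean())
print("Normal flows flagged as outliers:", 1 - pred_normals.mean())
```

In this toy setup the normal cluster is deliberately far from the attack cluster, so separation is easy; real flow features may overlap much more.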

### 📦 Import libraries and configure environment

We import the necessary libraries for data handling, modeling with Gaussian Mixture Models (GMM), and evaluation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.mixture import GaussianMixture
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split

sns.set(style="whitegrid")

### 📂 Load and prepare the dataset

We load the CSV file and prepare the data by:
- Converting the `label` column to a binary variable (`1` = attack, `0` = normal).
- Dropping irrelevant columns such as timestamps, IPs, and ports.

In [2]:
# Load dataset
df = pd.read_csv("20240625_Flooding_Heartbeat_filtered_ordered_OcppFlows_120_labelled.csv")

# Convert label to binary: 1 = attack, 0 = normal
y = df['label'].apply(lambda x: 1 if x == 'cyberattack_ocpp16_dos_flooding_heartbeat' else 0)

# Drop irrelevant columns
columns_to_drop = ['flow_id', 'flow_start_timestamp', 'flow_end_timestamp', 'src_ip', 'dst_ip', 'src_port', 'dst_port', 'label']

X = df.drop(columns=columns_to_drop)

# Check shapes and distribution
print("Feature matrix shape:", X.shape)
print("Label distribution:\n", y.value_counts())

Feature matrix shape: (8787, 48)
Label distribution:
 label
1    8700
0      87
Name: count, dtype: int64


### 🧪 Split data into training and test sets

We train the Gaussian Mixture Model only on attack samples (`y = 1`).  
The test set includes:
- 20% of the attack samples, held out from training
- 100% of normal traffic

This allows us to evaluate how well the model distinguishes between known attack behavior and previously unseen normal traffic.

In [None]:
# Separate attack and normal samples
X_attack = X[y == 1]
X_legit = X[y == 0]

# Split attack samples: 80% for training, 20% for testing
X_train, X_test_attack = train_test_split(X_attack, test_size=0.2, random_state=13)

# Use all normal samples for test
X_test_legit = X_legit.copy()

# Combine test set
X_test = pd.concat([X_test_attack, X_test_legit])
y_test = [1] * len(X_test_attack) + [0] * len(X_test_legit)

# Confirm shapes
print("Training set (attacks only):", X_train.shape)
print("Test set (attacks + normal):", X_test.shape)
print("Test label distribution:", pd.Series(y_test).value_counts())
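
As a quick sanity check, the expected split sizes follow from the label counts printed earlier (8,700 attacks, 87 normal flows). The sketch below only does the arithmetic; it assumes scikit-learn's behavior of rounding a fractional `test_size` up to the next whole sample, and the variable names are illustrative.

```python
import math

n_attack, n_normal = 8700, 87   # label counts printed after loading the data
test_percent = 20               # corresponds to test_size=0.2

# train_test_split rounds the test share up; the rest goes to training.
n_test_attack = math.ceil(n_attack * test_percent / 100)
n_train = n_attack - n_test_attack
n_test = n_test_attack + n_normal

print("Training attacks:", n_train)
print("Test attacks:", n_test_attack)
print("Test set total:", n_test)
```

The test totals should agree with the support values reported in the classification summary later in the notebook.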

### 🧠 Train the Gaussian Mixture Model (GMM)

We train a `GaussianMixture` model using only attack samples.  
With `n_components=1` and `covariance_type='full'`, the model is a single multivariate Gaussian with an unrestricted covariance matrix, fitted to the attack traffic.  
Later, we will use the log-likelihood scores of the test samples to detect outliers (normal traffic).

In [None]:
# Initialize and train the GMM
gmm = GaussianMixture(n_components=1, covariance_type='full', random_state=13)
gmm.fit(X_train)

print("GMM trained on attack samples only.")
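
Because the model uses a single component, `score_samples` should match the log-density of a plain multivariate Gaussian evaluated with the fitted mean and covariance. The sketch below checks this on synthetic data; the data, the names, and the use of `scipy.stats.multivariate_normal` are assumptions for illustration and not part of the notebook.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))  # synthetic stand-in for attack features

gmm_demo = GaussianMixture(n_components=1, covariance_type="full", random_state=1)
gmm_demo.fit(X_demo)

# Log-likelihood per sample as reported by the fitted mixture.
gmm_scores = gmm_demo.score_samples(X_demo)

# Same quantity from a plain multivariate normal with the fitted parameters.
gaussian_scores = multivariate_normal(
    mean=gmm_demo.means_[0], cov=gmm_demo.covariances_[0]
).logpdf(X_demo)

print("Scores agree:", np.allclose(gmm_scores, gaussian_scores))
```

Increasing `n_components` would let the model capture several attack modes, at the cost of more parameters to estimate.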

### 📉 Score test samples and apply threshold

We compute the log-likelihood of each test sample using the trained GMM.  
To classify samples, we define a threshold on the log-score:

- Samples with log-likelihood **above the threshold** are considered **attacks** (inliers).
- Samples with log-likelihood **below the threshold** are considered **normal traffic** (anomalies).

We set the threshold to the **5th percentile** of the log-scores on the training (attack) data.  
By construction, 95% of the training attacks score above it; held-out attacks should be retained at a similar rate as long as they follow the same distribution.

In [None]:
# Get log-likelihoods for training (to define threshold)
train_scores = gmm.score_samples(X_train)
threshold = np.percentile(train_scores, 5)  # 5th percentile

# Get log-likelihoods for test samples
test_scores = gmm.score_samples(X_test)

# Predict: 1 = attack (log-score above threshold), 0 = normal (at or below threshold)
y_pred = (test_scores > threshold).astype(int)

# Show threshold value
print(f"Log-likelihood threshold (5th percentile of training data): {threshold:.2f}")
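
To see how the percentile controls the trade-off, the sketch below applies several percentiles to a set of simulated log-likelihoods and reports the share of training samples that stay above each threshold. The simulated scores and the percentile values other than 5 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated training log-likelihoods (assumption: real ones come from gmm.score_samples).
sim_scores = rng.normal(loc=-50.0, scale=5.0, size=1000)

retained = {}
for pct in (1, 5, 10):
    thr = np.percentile(sim_scores, pct)
    # Share of training samples classified as inliers at this threshold.
    retained[pct] = float(np.mean(sim_scores > thr))
    print(f"percentile={pct:>2}  threshold={thr:8.2f}  retained={retained[pct]:.3f}")
```

A lower percentile retains more attacks but moves the threshold toward the region where normal traffic may lie.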

### 📊 Evaluate classification performance

We evaluate the predictions made by the GMM using standard classification metrics:
- Precision, Recall, F1-score
- Confusion Matrix
- ROC AUC Score

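This shows how well the GMM flags normal traffic as anomalous while retaining attack samples as inliers. Attack (`1`) is the positive class, and attacks are expected to receive higher log-likelihoods, so the raw scores can be passed to `roc_auc_score` unchanged.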

In [None]:
# Convert y_test to Series for consistency
y_test_series = pd.Series(y_test)

# Classification report
print("=== Classification Report ===")
print(classification_report(y_test_series, y_pred, target_names=["Normal", "Attack"]))

# Confusion matrix
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test_series, y_pred))

# ROC AUC
roc_score = roc_auc_score(y_test_series, test_scores)
print(f"ROC AUC Score (using log-likelihoods): {roc_score:.4f}")

### 📈 Visualize ROC curve and Confusion Matrix

We plot:
- The ROC Curve using the log-likelihood scores to evaluate the model’s ability to separate attacks and normal traffic.
- The Confusion Matrix to visualize correct and incorrect classifications.

The ROC curve gives a threshold-independent view of performance, while the confusion matrix shows the final decision quality based on the chosen threshold.

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test_series, test_scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})", linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Gaussian Mixture Model")
plt.legend()
plt.grid(True)
plt.show()

# Confusion matrix heatmap
cm = confusion_matrix(y_test_series, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted Normal', 'Predicted Attack'],
            yticklabels=['Actual Normal', 'Actual Attack'])
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.title("Confusion Matrix – Gaussian Mixture Model")
plt.show()


### ✅ Gaussian Mixture Model – Performance Analysis

We evaluated the GMM-based anomaly detector by training it exclusively on attack samples  
and testing it on both unseen attacks and normal traffic.

#### 📊 Interpretation of Results:

- **Log-likelihood threshold**: defined as the 5th percentile of the log-scores on the training (attack) data.
- **Prediction rule**:
  - If log-score > threshold → classified as **Attack**
  - Else → classified as **Normal (anomaly)**

#### 📋 Key Metrics:
- **Precision (Attack class)**: High precision indicates few false positives (normal samples mistakenly classified as attack).
- **Recall (Attack class)**: Measures the ability to correctly identify attack traffic.
- **F1-score**: Balance between precision and recall.
- **ROC AUC**: Evaluates how well the model separates the two classes across all thresholds.

#### 🧠 Model Behavior:
- If **recall (attack) is high**, the model keeps most attacks above the threshold.
- If **precision (attack) is low**, many normal flows score above the threshold and go undetected as anomalies.
- If **ROC AUC is high**, the model ranks attacks above legitimate flows in log-likelihood, regardless of threshold.
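
The ranking interpretation of ROC AUC can be checked directly: it equals the fraction of (attack, normal) pairs in which the attack receives the higher score. The sketch below compares that pairwise fraction with `roc_auc_score` on simulated scores; the score distributions and names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
# Simulated log-likelihoods (assumption): attacks tend to score higher than normal flows.
attack_scores = rng.normal(loc=0.0, scale=1.0, size=200)
normal_scores = rng.normal(loc=-1.5, scale=1.0, size=50)

# Fraction of (attack, normal) pairs where the attack scores strictly higher.
pairwise_auc = np.mean(attack_scores[:, None] > normal_scores[None, :])

# Same quantity via scikit-learn, with attack as the positive class (label 1).
labels = np.concatenate([np.ones(200, dtype=int), np.zeros(50, dtype=int)])
scores = np.concatenate([attack_scores, normal_scores])
sklearn_auc = roc_auc_score(labels, scores)

print(f"pairwise={pairwise_auc:.4f}  roc_auc_score={sklearn_auc:.4f}")
```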

#### 📌 Conclusion:

The GMM approach is effective for modeling attack behavior and identifying deviations (i.e., normal traffic).  
The choice of threshold is critical and can be tuned further to balance false negatives and false positives depending on the use case (e.g., security sensitivity vs. alert fatigue).

This method complements One-Class SVM and is especially valuable in highly imbalanced or partially labeled datasets.
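
One possible way to tune the threshold, not used in this notebook, is to pick the ROC operating point that maximizes Youden's J statistic (true positive rate minus false positive rate). Doing so requires labeled normal samples for validation, which is an assumption here; the scores below are simulated and all names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
# Simulated validation log-likelihoods (assumption): label 1 = attack, 0 = normal.
val_scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(-6.0, 1.0, 50)])
val_labels = np.concatenate([np.ones(500, dtype=int), np.zeros(50, dtype=int)])

fpr, tpr, thresholds = roc_curve(val_labels, val_scores)

# Youden's J: height of each ROC point above the chance diagonal.
j_stat = tpr - fpr
best = int(np.argmax(j_stat))
best_threshold = thresholds[best]

# roc_curve treats score >= threshold as the positive (attack) class.
val_pred = (val_scores >= best_threshold).astype(int)
print(f"threshold={best_threshold:.2f}  TPR={tpr[best]:.3f}  FPR={fpr[best]:.3f}")
```

Youden's J weights both error types equally; a deployment that cares more about one of them would choose a different operating point on the same curve.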

### ✅ Gaussian Mixture Model – Final Performance Analysis

#### 📊 Classification Summary:

| Class      | Precision | Recall | F1-score | Support |
|------------|-----------|--------|----------|---------|
| **Normal** | 0.51      | 1.00   | 0.68     | 87      |
| **Attack** | 1.00      | 0.95   | 0.98     | 1740    |
| **Accuracy**         | –         | –       | **0.96** | 1827    |
| **ROC AUC (log-scores)** | –     | –       | **1.00** |         |

#### 📌 Confusion Matrix:

|                         | Predicted Normal | Predicted Attack |
|-------------------------|------------------|------------------|
| **Actual Normal**       | 87 (True Negative) | 0 (False Positive) |
| **Actual Attack**       | 82 (False Negative) | 1658 (True Positive) |
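
The figures in the classification summary can be re-derived from the confusion matrix. The sketch below does the arithmetic with the four counts reported above; the variable names and the helper `f1` are illustrative.

```python
# Counts from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 87, 0      # actual normal: predicted normal, predicted attack
fn, tp = 82, 1658   # actual attack: predicted normal, predicted attack

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Attack is the positive class.
precision_attack = tp / (tp + fp)
recall_attack = tp / (tp + fn)

# Normal class: the roles of the cells are mirrored.
precision_normal = tn / (tn + fn)
recall_normal = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Normal: P={precision_normal:.2f} R={recall_normal:.2f} "
      f"F1={f1(precision_normal, recall_normal):.2f}")
print(f"Attack: P={precision_attack:.2f} R={recall_attack:.2f} "
      f"F1={f1(precision_attack, recall_attack):.2f}")
print(f"Accuracy: {accuracy:.2f}")
```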

---

#### 🧠 Interpretation

- ✅ **All normal traffic** was correctly flagged as anomalous (recall 1.00 on the normal class, no false positives).
- ✅ **High recall on attacks**: 1658 of 1740 attack samples (95%) were correctly identified.
- ⚠️ **Moderate precision on the normal class (0.51)**: the 82 attacks that fell below the threshold were labeled normal, so only 87 of the 169 samples predicted normal are truly normal.
- ✅ **ROC AUC of 1.00**: at the reported precision, the model ranks the normal samples below the attacks in log-likelihood space.

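The **5th percentile threshold** on the log-scores of attack samples flagged every normal flow in this dataset, at the cost of labeling 82 of 1740 held-out attacks (about 4.7%) as normal, which is close to the 5% expected by construction. Given the ROC AUC of 1.00, a lower threshold would likely recover most of those attacks without letting normal traffic through.

For this dataset, the result indicates that a GMM works well for one-class anomaly detection when the majority class is well-defined and clean.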

### 📊 Final Comparative Results: One-Class SVM vs Gaussian Mixture Model (GMM)

This table compares the performance of both models after hyperparameter tuning, using identical train/test splits.

| Metric                    | One-Class SVM       | Gaussian Mixture Model |
|---------------------------|---------------------|-------------------------|
| **Precision (Normal)**    | 0.46                | 0.51                    |
| **Recall (Normal)**       | 0.94                | 1.00                    |
| **F1-score (Normal)**     | 0.62                | 0.68                    |
| **Precision (Attack)**    | 1.00                | 1.00                    |
| **Recall (Attack)**       | 0.95                | 0.95                    |
| **F1-score (Attack)**     | 0.97                | 0.98                    |
| **Overall Accuracy**      | 0.95                | 0.96                    |
| **ROC AUC**               | 0.944               | **1.000**               |

---

### 🧠 Insights:

- ✅ Both models achieve a **precision of 1.00** for attacks and an overall accuracy of at least 0.95.
- ✅ **GMM** has **higher precision, recall, and F1-score** for the normal class (0.51 / 1.00 / 0.68 vs. 0.46 / 0.94 / 0.62), meaning it is better at flagging legitimate traffic as anomalous.
- ⚖️ Both models reach the **same recall for attacks** (`0.95`); GMM's attack **F1-score** is marginally higher (0.98 vs. 0.97).
- ✅ **ROC AUC is 1.000 for GMM** versus 0.944 for One-Class SVM, indicating near-ideal separation on this test set.

**Conclusion:**  
While One-Class SVM performs very well after tuning, the **Gaussian Mixture Model offers slightly better overall detection**, particularly for the legitimate traffic class and in terms of ranking ability.