# Lab 03: Network Anomaly Detection

Build an anomaly detection system for network traffic using Isolation Forest, One-Class SVM, and Local Outlier Factor.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab03_anomaly_detection.ipynb)

## Learning Objectives
- Network flow feature engineering
- Isolation Forest for anomaly detection
- One-Class SVM and Local Outlier Factor
- Precision-Recall evaluation for imbalanced data

In [None]:
# Install dependencies (uncomment for Colab)
# !pip install scikit-learn pandas numpy matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

plt.style.use("seaborn-v0_8-whitegrid")
np.random.seed(42)

## 1. Generate Network Flow Data

In [None]:
# Generate synthetic network flow data with diverse attack patterns
n_normal = 1900  # Sum of the six normal profile sizes below (informational only)

# ============================================================
# NORMAL TRAFFIC - Multiple enterprise traffic profiles
# ============================================================

# Web browsing (HTTP/HTTPS)
n_web = 600
web_traffic = {
    "bytes_sent": np.random.lognormal(7, 0.8, n_web),  # Requests
    "bytes_recv": np.random.lognormal(10, 1.5, n_web),  # Responses (pages, images)
    "packets_sent": np.random.poisson(30, n_web),
    "packets_recv": np.random.poisson(100, n_web),
    "duration": np.random.exponential(3, n_web),
    "dst_port": np.random.choice([80, 443], n_web, p=[0.2, 0.8]),
    "protocol": np.full(n_web, "TCP"),
    "src_ip_count": np.ones(n_web, dtype=int),  # Single source per flow
    "dst_ip_count": np.ones(n_web, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# Email traffic (SMTP/IMAP/POP3)
n_email = 200
email_traffic = {
    "bytes_sent": np.random.lognormal(8, 1.0, n_email),
    "bytes_recv": np.random.lognormal(9, 1.2, n_email),
    "packets_sent": np.random.poisson(40, n_email),
    "packets_recv": np.random.poisson(60, n_email),
    "duration": np.random.exponential(2, n_email),
    "dst_port": np.random.choice([25, 587, 993, 995, 143], n_email),
    "protocol": np.full(n_email, "TCP"),
    "src_ip_count": np.ones(n_email, dtype=int),
    "dst_ip_count": np.ones(n_email, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# DNS queries (normal)
n_dns = 400
dns_traffic = {
    "bytes_sent": np.random.normal(70, 15, n_dns).clip(40, 200),
    "bytes_recv": np.random.normal(150, 40, n_dns).clip(80, 400),
    "packets_sent": np.ones(n_dns, dtype=int),  # Single query
    "packets_recv": np.random.choice([1, 2, 3], n_dns, p=[0.7, 0.2, 0.1]),
    "duration": np.random.uniform(0.001, 0.2, n_dns),  # Fast
    "dst_port": np.full(n_dns, 53),
    "protocol": np.full(n_dns, "UDP"),
    "src_ip_count": np.ones(n_dns, dtype=int),
    "dst_ip_count": np.ones(n_dns, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# SSH sessions
n_ssh = 100
ssh_traffic = {
    "bytes_sent": np.random.lognormal(9, 1.5, n_ssh),
    "bytes_recv": np.random.lognormal(10, 1.8, n_ssh),
    "packets_sent": np.random.poisson(200, n_ssh),
    "packets_recv": np.random.poisson(250, n_ssh),
    "duration": np.random.exponential(300, n_ssh),  # Long sessions
    "dst_port": np.full(n_ssh, 22),
    "protocol": np.full(n_ssh, "TCP"),
    "src_ip_count": np.ones(n_ssh, dtype=int),
    "dst_ip_count": np.ones(n_ssh, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# Database connections
n_db = 200
db_traffic = {
    "bytes_sent": np.random.lognormal(7, 1.2, n_db),
    "bytes_recv": np.random.lognormal(11, 1.5, n_db),
    "packets_sent": np.random.poisson(50, n_db),
    "packets_recv": np.random.poisson(150, n_db),
    "duration": np.random.exponential(1, n_db),
    "dst_port": np.random.choice([3306, 5432, 1433, 27017], n_db),
    "protocol": np.full(n_db, "TCP"),
    "src_ip_count": np.ones(n_db, dtype=int),
    "dst_ip_count": np.ones(n_db, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# API traffic
n_api = 400
api_traffic = {
    "bytes_sent": np.random.lognormal(7, 0.8, n_api),
    "bytes_recv": np.random.lognormal(8, 1.0, n_api),
    "packets_sent": np.random.poisson(10, n_api),
    "packets_recv": np.random.poisson(15, n_api),
    "duration": np.random.exponential(0.5, n_api),
    "dst_port": np.random.choice([443, 8443, 8080], n_api),
    "protocol": np.full(n_api, "TCP"),
    "src_ip_count": np.ones(n_api, dtype=int),
    "dst_ip_count": np.ones(n_api, dtype=int),
    "attack_type": "normal",
    "label": 0,
}

# ============================================================
# ATTACK TRAFFIC - Multiple attack categories with MITRE mapping
# ============================================================

# Attack Type 1: PORT SCANNING (T1046 - Network Service Discovery)
n_scan = 50
port_scan = {
    "bytes_sent": np.random.normal(60, 10, n_scan),  # Small SYN packets
    "bytes_recv": np.random.choice([0, 40], n_scan, p=[0.7, 0.3]),  # Mostly no response
    "packets_sent": np.random.randint(100, 1000, n_scan),  # Many probes
    "packets_recv": np.random.randint(0, 100, n_scan),
    "duration": np.random.uniform(1, 30, n_scan),
    "dst_port": np.random.randint(1, 65536, n_scan),  # Random ports (upper bound is exclusive)
    "protocol": np.full(n_scan, "TCP"),
    "src_ip_count": np.ones(n_scan, dtype=int),
    "dst_ip_count": np.random.randint(50, 500, n_scan),  # Many destinations
    "attack_type": "port_scan",
    "label": 1,
}

# Attack Type 2: BRUTE FORCE SSH (T1110 - Brute Force)
n_brute = 40
brute_force = {
    "bytes_sent": np.random.normal(500, 100, n_brute),  # Login attempts
    "bytes_recv": np.random.normal(200, 50, n_brute),
    "packets_sent": np.random.randint(50, 200, n_brute),  # Repeated attempts
    "packets_recv": np.random.randint(50, 200, n_brute),
    "duration": np.random.uniform(60, 600, n_brute),  # Long duration
    "dst_port": np.random.choice([22, 3389, 21, 23], n_brute),  # Auth services
    "protocol": np.full(n_brute, "TCP"),
    "src_ip_count": np.ones(n_brute, dtype=int),
    "dst_ip_count": np.ones(n_brute, dtype=int),
    "attack_type": "brute_force",
    "label": 1,
}

# Attack Type 3: C2 BEACONING (T1071 - Application Layer Protocol)
n_c2 = 50
c2_beacon = {
    "bytes_sent": np.random.normal(256, 50, n_c2),  # Regular beacon size
    "bytes_recv": np.random.normal(512, 100, n_c2),  # Command responses
    "packets_sent": np.random.poisson(5, n_c2),
    "packets_recv": np.random.poisson(8, n_c2),
    "duration": np.random.uniform(0.1, 2, n_c2),  # Short transactions
    "dst_port": np.random.choice([443, 80, 8080, 8443], n_c2),  # Blend with web
    "protocol": np.full(n_c2, "TCP"),
    "src_ip_count": np.ones(n_c2, dtype=int),
    "dst_ip_count": np.ones(n_c2, dtype=int),
    "attack_type": "c2_beacon",
    "label": 1,
}

# Attack Type 4: DATA EXFILTRATION (T1048 - Exfiltration Over Alternative Protocol)
n_exfil = 30
data_exfil = {
    "bytes_sent": np.random.lognormal(14, 1, n_exfil),  # Large uploads (median ~1.2 MB, heavy tail)
    "bytes_recv": np.random.normal(500, 100, n_exfil),  # Small ACKs
    "packets_sent": np.random.randint(1000, 10000, n_exfil),
    "packets_recv": np.random.randint(100, 500, n_exfil),
    "duration": np.random.uniform(60, 3600, n_exfil),  # Long transfers
    "dst_port": np.random.choice([443, 53, 21, 22], n_exfil),
    "protocol": np.full(n_exfil, "TCP"),
    "src_ip_count": np.ones(n_exfil, dtype=int),
    "dst_ip_count": np.ones(n_exfil, dtype=int),
    "attack_type": "data_exfil",
    "label": 1,
}

# Attack Type 5: DNS TUNNELING (T1071.004 - DNS)
n_dns_tunnel = 40
dns_tunnel = {
    "bytes_sent": np.random.randint(200, 500, n_dns_tunnel),  # Large DNS queries
    "bytes_recv": np.random.randint(300, 800, n_dns_tunnel),  # Large TXT responses
    "packets_sent": np.random.randint(50, 200, n_dns_tunnel),  # Many queries
    "packets_recv": np.random.randint(50, 200, n_dns_tunnel),
    "duration": np.random.uniform(60, 600, n_dns_tunnel),
    "dst_port": np.full(n_dns_tunnel, 53),
    "protocol": np.full(n_dns_tunnel, "UDP"),
    "src_ip_count": np.ones(n_dns_tunnel, dtype=int),
    "dst_ip_count": np.ones(n_dns_tunnel, dtype=int),
    "attack_type": "dns_tunnel",
    "label": 1,
}

# Attack Type 6: DDoS VOLUMETRIC (T1498 - Network Denial of Service)
n_ddos = 30
ddos_attack = {
    "bytes_sent": np.random.lognormal(13, 0.5, n_ddos),  # High volume
    "bytes_recv": np.random.normal(0, 10, n_ddos).clip(0),  # Little response
    "packets_sent": np.random.randint(10000, 100000, n_ddos),  # Massive packets
    "packets_recv": np.random.randint(0, 100, n_ddos),
    "duration": np.random.uniform(30, 300, n_ddos),
    "dst_port": np.random.choice([80, 443, 53], n_ddos),
    "protocol": np.random.choice(["TCP", "UDP"], n_ddos),
    "src_ip_count": np.random.randint(100, 1000, n_ddos),  # Spoofed sources
    "dst_ip_count": np.ones(n_ddos, dtype=int),
    "attack_type": "ddos",
    "label": 1,
}

# Attack Type 7: LATERAL MOVEMENT SMB (T1021.002 - SMB/Windows Admin Shares)
n_lateral = 40
lateral_movement = {
    "bytes_sent": np.random.lognormal(10, 1.2, n_lateral),
    "bytes_recv": np.random.lognormal(11, 1.5, n_lateral),
    "packets_sent": np.random.poisson(200, n_lateral),
    "packets_recv": np.random.poisson(250, n_lateral),
    "duration": np.random.exponential(10, n_lateral),
    "dst_port": np.random.choice([445, 135, 139, 5985], n_lateral),  # SMB/RPC/WinRM
    "protocol": np.full(n_lateral, "TCP"),
    "src_ip_count": np.ones(n_lateral, dtype=int),
    "dst_ip_count": np.random.randint(2, 20, n_lateral),  # Multiple internal hosts
    "attack_type": "lateral_movement",
    "label": 1,
}

# Attack Type 8: CRYPTO MINING (T1496 - Resource Hijacking)
n_mining = 30
crypto_mining = {
    "bytes_sent": np.random.normal(1000, 200, n_mining),  # Share submissions
    "bytes_recv": np.random.normal(500, 100, n_mining),  # Work units
    "packets_sent": np.random.poisson(100, n_mining),
    "packets_recv": np.random.poisson(80, n_mining),
    "duration": np.random.uniform(3600, 86400, n_mining),  # Very long (hours)
    "dst_port": np.random.choice([3333, 4444, 8333, 14444], n_mining),  # Mining pools
    "protocol": np.full(n_mining, "TCP"),
    "src_ip_count": np.ones(n_mining, dtype=int),
    "dst_ip_count": np.ones(n_mining, dtype=int),
    "attack_type": "crypto_mining",
    "label": 1,
}

# ============================================================
# Combine all traffic
# ============================================================
all_traffic = [
    # Normal
    web_traffic,
    email_traffic,
    dns_traffic,
    ssh_traffic,
    db_traffic,
    api_traffic,
    # Attacks
    port_scan,
    brute_force,
    c2_beacon,
    data_exfil,
    dns_tunnel,
    ddos_attack,
    lateral_movement,
    crypto_mining,
]

df = pd.concat([pd.DataFrame(t) for t in all_traffic], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("📊 Dataset Statistics:")
print(f"   Total flows: {len(df)}")
print(f"   Normal traffic: {len(df[df['label'] == 0])}")
print(f"   Attack traffic: {len(df[df['label'] == 1])}")
print(f"   Attack percentage: {100 * df['label'].mean():.1f}%")

print("\n🎯 Attack Type Distribution:")
attack_counts = df[df["label"] == 1]["attack_type"].value_counts()
for attack_type, count in attack_counts.items():
    print(f"   {attack_type}: {count}")

print("\n📈 Normal Traffic by Destination Port (top 8):")
normal_counts = df[df["label"] == 0].groupby("dst_port").size().sort_values(ascending=False).head(8)
for port, count in normal_counts.items():
    print(f"   Port {port}: {count} flows")

## 2. Feature Engineering

In [None]:
# Engineer comprehensive network features
df["duration"] = df["duration"].clip(lower=0.001)

# ============================================================
# Traffic Volume Features
# ============================================================
df["total_bytes"] = df["bytes_sent"] + df["bytes_recv"]
df["total_packets"] = df["packets_sent"] + df["packets_recv"]

# ============================================================
# Rate Features (key for detecting high-volume attacks)
# ============================================================
df["bytes_per_second"] = df["total_bytes"] / df["duration"]
df["packets_per_second"] = df["total_packets"] / df["duration"]
df["bytes_per_packet"] = df["total_bytes"] / (df["total_packets"] + 1)

# ============================================================
# Ratio Features (asymmetric traffic is suspicious)
# ============================================================
df["bytes_ratio"] = df["bytes_sent"] / (df["total_bytes"] + 1)  # >0.5 = more sent than recv
df["packets_ratio"] = df["packets_sent"] / (df["total_packets"] + 1)
df["send_recv_ratio"] = (df["bytes_sent"] + 1) / (df["bytes_recv"] + 1)

# ============================================================
# Port Features
# ============================================================
WELL_KNOWN_PORTS = [80, 443, 22, 25, 53, 143, 993, 995, 587, 3306, 5432]
SUSPICIOUS_PORTS = [4444, 8888, 31337, 6667, 1337, 3333, 14444, 8333, 4443]

df["is_well_known_port"] = df["dst_port"].isin(WELL_KNOWN_PORTS).astype(int)
df["is_suspicious_port"] = df["dst_port"].isin(SUSPICIOUS_PORTS).astype(int)
df["is_high_port"] = (df["dst_port"] >= 1024).astype(int)  # Ports 0-1023 are the IANA well-known range

# ============================================================
# Connection Pattern Features (fan-out detection)
# ============================================================
df["is_multi_dest"] = (df["dst_ip_count"] > 1).astype(int)  # Many destinations = scan
df["is_multi_src"] = (df["src_ip_count"] > 1).astype(int)  # Many sources = DDoS

# ============================================================
# Protocol Features
# ============================================================
df["is_tcp"] = (df["protocol"] == "TCP").astype(int)
df["is_udp"] = (df["protocol"] == "UDP").astype(int)

# ============================================================
# Log-transformed features (handle extreme values)
# ============================================================
df["log_bytes"] = np.log1p(df["total_bytes"])
df["log_packets"] = np.log1p(df["total_packets"])
df["log_duration"] = np.log1p(df["duration"])
df["log_bps"] = np.log1p(df["bytes_per_second"])
df["log_pps"] = np.log1p(df["packets_per_second"])

print("📊 Engineered Features Summary:")
print("   Total features created: 20")
print("\n📈 Key Feature Statistics:")
key_features = [
    "bytes_per_second",
    "packets_per_second",
    "bytes_ratio",
    "send_recv_ratio",
    "duration",
]
for feat in key_features:
    print(f"\n   {feat}:")
    print(f"      Normal mean: {df[df['label']==0][feat].mean():.2f}")
    print(f"      Attack mean: {df[df['label']==1][feat].mean():.2f}")

print("\n🎯 Attack-specific patterns detected in features:")
print(f"   Multi-dest flows (scanning): {df['is_multi_dest'].sum()}")
print(f"   Multi-source flows (DDoS): {df['is_multi_src'].sum()}")
print(f"   Suspicious port flows: {df['is_suspicious_port'].sum()}")
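
To make the ratio features concrete, the sketch below applies the same formulas to two hypothetical flows. The byte counts and durations are made up for illustration and are not drawn from the dataset: one is a download-heavy web flow, the other an upload-heavy exfiltration flow. `ratio_features` is a helper name introduced only for this example.

```python
def ratio_features(bytes_sent, bytes_recv, duration):
    """Compute a few of the lab's rate and ratio features for a single flow."""
    duration = max(duration, 0.001)  # same floor as the clip applied to df["duration"]
    total_bytes = bytes_sent + bytes_recv
    return {
        "bytes_per_second": total_bytes / duration,
        "bytes_ratio": bytes_sent / (total_bytes + 1),
        "send_recv_ratio": (bytes_sent + 1) / (bytes_recv + 1),
    }


# Hypothetical flows (illustrative numbers only)
web = ratio_features(bytes_sent=1_000, bytes_recv=20_000, duration=3.0)
exfil = ratio_features(bytes_sent=1_200_000, bytes_recv=500, duration=600.0)

for name, feats in [("web", web), ("exfil", exfil)]:
    print(name, {k: round(v, 3) for k, v in feats.items()})
```

In this example the exfiltration flow has a lower byte rate than the web flow because it runs for ten minutes, yet its `bytes_ratio` is close to 1 and its `send_recv_ratio` is in the thousands. That is the reason for keeping ratio features alongside rate features.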

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

features_to_plot = ["bytes_per_second", "packets_per_second", "bytes_ratio", "duration"]
for ax, feature in zip(axes.flatten(), features_to_plot):
    for label, color in [(0, "green"), (1, "red")]:
        subset = df[df["label"] == label][feature]
        ax.hist(
            subset,
            alpha=0.5,
            label=f"{'Normal' if label==0 else 'Anomaly'}",
            bins=30,
            color=color,
            density=True,
        )
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    ax.set_title(f"{feature} Distribution")

plt.tight_layout()
plt.show()

## 3. Prepare Features for Anomaly Detection

In [None]:
# Select comprehensive features for anomaly detection
feature_cols = [
    # Rate features (most discriminative)
    "log_bps",
    "log_pps",
    "bytes_per_packet",
    # Ratio features (detect asymmetric traffic)
    "bytes_ratio",
    "packets_ratio",
    "send_recv_ratio",
    # Duration (detect long-running or burst attacks)
    "log_duration",
    # Volume features
    "log_bytes",
    "log_packets",
    # Port-based features
    "is_well_known_port",
    "is_suspicious_port",
    "is_high_port",
    # Connection pattern features
    "is_multi_dest",
    "is_multi_src",
    # Protocol features
    "is_tcp",
    "is_udp",
]

X = df[feature_cols].values
y = df["label"].values
attack_types = df["attack_type"].values

# Use RobustScaler for outlier-robust scaling
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print("📊 Feature Matrix:")
print(f"   Shape: {X_scaled.shape}")
print(f"   Features: {len(feature_cols)}")
print("\n📋 Features used:")
for i, col in enumerate(feature_cols):
    print(f"   {i+1}. {col}")
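
The choice of `RobustScaler` over `StandardScaler` matters when a feature contains extreme values. The following sketch uses a single made-up feature with one large outlier and compares the two scalers. It also reproduces `RobustScaler`'s default behavior by hand (subtract the median, divide by the interquartile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature: nine typical values plus one extreme outlier (illustrative data)
x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 10_000], dtype=float).reshape(-1, 1)

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# RobustScaler defaults: center on the median, scale by the IQR (75th - 25th percentile)
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)

# Spread of the nine typical values after each scaling
robust_spread = robust[:9].max() - robust[:9].min()
standard_spread = standard[:9].max() - standard[:9].min()

print("robust  :", robust.ravel().round(2))
print("standard:", standard.ravel().round(2))
print(f"typical-value spread: robust={robust_spread:.3f}, standard={standard_spread:.4f}")
```

With `StandardScaler`, the outlier inflates the standard deviation and squeezes the typical values into a tiny range. The median and IQR are barely affected by it, so the typical values keep a usable spread.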

## 4. Isolation Forest

In [None]:
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected proportion of anomalies (attacks are ~14% of this dataset)
    random_state=42,
)

# Predict: -1 for anomaly, 1 for normal
iso_pred = iso_forest.fit_predict(X_scaled)

# Convert to binary (1 for anomaly, 0 for normal)
iso_pred_binary = (iso_pred == -1).astype(int)

print("Isolation Forest Results:")
print(f"Predicted anomalies: {iso_pred_binary.sum()}")
print(f"Actual anomalies: {y.sum()}")
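
The `contamination` parameter does not change how the trees are built. It only sets the score cutoff so that roughly that fraction of the training samples is flagged. The sketch below illustrates this on synthetic two-dimensional data, which is an assumption of the example and not the lab's flow features. It compares the fitted `offset_` with the 10th percentile of the training scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = np.vstack(
    [
        rng.normal(0, 1, size=(950, 2)),  # dense "normal" cluster
        rng.uniform(-8, 8, size=(50, 2)),  # scattered points
    ]
)

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
pred = model.fit_predict(X_demo)

flagged_fraction = np.mean(pred == -1)
scores = model.score_samples(X_demo)  # lower = more anomalous
cutoff = np.percentile(scores, 10)  # 10th percentile of the training scores

print(f"Flagged fraction: {flagged_fraction:.3f}")
print(f"offset_: {model.offset_:.4f}   10th percentile of scores: {cutoff:.4f}")
```

Because the flagged fraction is fixed by `contamination`, a value below the true attack rate caps recall, and a value above it forces false positives.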

In [None]:
# Evaluate Isolation Forest
precision = precision_score(y, iso_pred_binary)
recall = recall_score(y, iso_pred_binary)
f1 = f1_score(y, iso_pred_binary)

print("Isolation Forest Metrics:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")

# Confusion matrix
cm = confusion_matrix(y, iso_pred_binary)
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Normal", "Anomaly"],
    yticklabels=["Normal", "Anomaly"],
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Isolation Forest Confusion Matrix")
plt.show()
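
As a reminder of how the metrics above relate to the confusion matrix, here is a small worked example. The counts are hypothetical and are not results from this lab. They were chosen only so that the row totals resemble an imbalanced dataset.

```python
# Hypothetical confusion-matrix counts (illustrative, not results from this lab)
tn, fp = 1800, 100  # actual normal: correctly passed / falsely flagged
fn, tp = 160, 150  # actual attack: missed / detected

precision = tp / (tp + fp)  # of the flagged flows, how many were attacks
recall = tp / (tp + fn)  # of the attacks, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

Accuracy comes out high here even though more than half of the attacks are missed, which is why precision and recall are the preferred metrics for imbalanced data.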

## 5. One-Class SVM

In [None]:
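# Train One-Class SVM
# Note: to compare with the other detectors, the model is fit on the full
# (contaminated) dataset here. For proper one-class learning you would fit on
# normal traffic only and then score unseen flows.
ocsvm = OneClassSVM(
    kernel="rbf",
    gamma="scale",
    nu=0.1,  # Upper bound on the fraction of training points treated as outliers
)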

ocsvm_pred = ocsvm.fit_predict(X_scaled)
ocsvm_pred_binary = (ocsvm_pred == -1).astype(int)

print("One-Class SVM Results:")
print(f"Predicted anomalies: {ocsvm_pred_binary.sum()}")
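
For reference, a minimal sketch of the normal-only training workflow is shown below. It uses synthetic two-dimensional data as a stand-in for clean traffic, and the names `X_normal`, `X_new_normal`, and `X_new_attack` are assumptions of this example. Both the scaler and the model are fit on the normal data only and then applied to unseen flows.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(500, 2))  # stand-in for known-clean traffic
X_new_normal = rng.normal(0, 1, size=(100, 2))  # unseen normal flows
X_new_attack = rng.normal(6, 1, size=(20, 2))  # unseen, clearly shifted flows

# Fit the scaler and the model on normal data only
scaler_demo = RobustScaler().fit(X_normal)
ocsvm_demo = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm_demo.fit(scaler_demo.transform(X_normal))

# predict() returns -1 for outliers and 1 for inliers
pred_normal = ocsvm_demo.predict(scaler_demo.transform(X_new_normal))
pred_attack = ocsvm_demo.predict(scaler_demo.transform(X_new_attack))

fp_rate = np.mean(pred_normal == -1)
detection_rate = np.mean(pred_attack == -1)
print(f"False positive rate on unseen normal flows: {fp_rate:.2f}")
print(f"Detection rate on shifted flows:            {detection_rate:.2f}")
```

In this setup `nu` roughly controls the false positive rate on normal traffic. It no longer needs to absorb the attacks that are present when fitting on mixed data.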

## 6. Local Outlier Factor

In [None]:
# Train Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)

lof_pred = lof.fit_predict(X_scaled)
lof_pred_binary = (lof_pred == -1).astype(int)

print("LOF Results:")
print(f"Predicted anomalies: {lof_pred_binary.sum()}")
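
Note that `fit_predict` only labels the samples LOF was fit on. To score flows that arrive later, scikit-learn's `LocalOutlierFactor` can be created with `novelty=True`, which enables `predict` and `score_samples` on new data. The sketch below shows that mode on synthetic data, where the points and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X_train = rng.normal(0, 1, size=(500, 2))  # stand-in for historical flows

# Two new points: one inside the dense region, one far away from it
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train)

new_pred = lof_novelty.predict(X_new)  # 1 = inlier, -1 = outlier
new_scores = lof_novelty.score_samples(X_new)  # lower = more anomalous

print("Predictions:", new_pred)
print("Scores:     ", new_scores.round(2))
```

In novelty mode the model should be fit on data that is as clean as possible, since the training set defines what counts as a normal local density.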

## 7. Compare All Models

In [None]:
# Compare all models
models = {
    "Isolation Forest": iso_pred_binary,
    "One-Class SVM": ocsvm_pred_binary,
    "Local Outlier Factor": lof_pred_binary,
}

results = []
for name, pred in models.items():
    results.append(
        {
            "Model": name,
            "Precision": precision_score(y, pred),
            "Recall": recall_score(y, pred),
            "F1": f1_score(y, pred),
        }
    )

results_df = pd.DataFrame(results)
print("Model Comparison:")
print(results_df.to_string(index=False))

# Plot comparison
results_df.set_index("Model")[["Precision", "Recall", "F1"]].plot(kind="bar", figsize=(10, 5))
plt.title("Anomaly Detection Model Comparison")
plt.ylabel("Score")
plt.xticks(rotation=45)
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

## 8. Anomaly Score Analysis

In [None]:
# Get anomaly scores from Isolation Forest
anomaly_scores = -iso_forest.score_samples(X_scaled)
df["anomaly_score"] = anomaly_scores

# Plot score distribution
fig, ax = plt.subplots(figsize=(10, 5))

for label, color, name in [(0, "green", "Normal"), (1, "red", "Anomaly")]:
    subset = df[df["label"] == label]["anomaly_score"]
    ax.hist(subset, alpha=0.5, label=name, bins=30, color=color, density=True)

ax.axvline(
    x=np.percentile(anomaly_scores, 90),
    color="black",
    linestyle="--",
    label="90th percentile threshold",
)
ax.set_xlabel("Anomaly Score")
ax.set_ylabel("Density")
ax.set_title("Anomaly Score Distribution")
ax.legend()
plt.show()
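
A fixed percentile is only one possible threshold. Since the anomaly score is continuous, the whole precision-recall trade-off can be inspected with `precision_recall_curve` and summarized with `average_precision_score`. The sketch below does this on synthetic data, where the clusters and labels are assumptions of the example. It also picks the threshold with the highest F1, which is only possible when labels are available.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(1)
X_demo = np.vstack(
    [
        rng.normal(0, 1, size=(900, 3)),  # normal cluster
        rng.normal(4, 1, size=(100, 3)),  # shifted "attack" cluster
    ]
)
y_demo = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]

model = IsolationForest(n_estimators=100, random_state=1).fit(X_demo)
scores = -model.score_samples(X_demo)  # higher = more anomalous

ap = average_precision_score(y_demo, scores)
precision, recall, thresholds = precision_recall_curve(y_demo, scores)

# F1 at each candidate threshold; the final precision/recall pair has no threshold
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))

print(f"Average precision: {ap:.3f} (random baseline = {y_demo.mean():.2f})")
print(f"Best-F1 threshold: {thresholds[best]:.3f}")
print(f"  precision={precision[best]:.3f}, recall={recall[best]:.3f}, F1={f1[best]:.3f}")
```

In the lab, the same calls could be applied to `y` and `anomaly_scores`. A threshold tuned on labeled data should be validated on held-out flows before it is trusted.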

In [None]:
# Show top anomalies with attack type classification
print("🚨 Top 15 Most Anomalous Flows:")
print("=" * 80)

top_anomalies = df.nlargest(15, "anomaly_score")[
    [
        "attack_type",
        "dst_port",
        "bytes_sent",
        "bytes_recv",
        "packets_sent",
        "duration",
        "anomaly_score",
        "label",
    ]
]
print(top_anomalies.to_string())

# Detection by attack type
print("\n\n📊 Detection Performance by Attack Type:")
print("=" * 80)

df["predicted"] = iso_pred_binary

for attack_type in df["attack_type"].unique():
    subset = df[df["attack_type"] == attack_type]
    if attack_type == "normal":
        # For normal, we want low false positive rate
        fp = subset["predicted"].sum()
        fp_rate = 100 * fp / len(subset)
        print(f"   {attack_type:18s}: {len(subset):4d} flows, False Positive Rate: {fp_rate:.1f}%")
    else:
        # For attacks, we want high detection rate
        detected = subset["predicted"].sum()
        detection_rate = 100 * detected / len(subset)
        print(
            f"   {attack_type:18s}: {len(subset):4d} flows, Detection Rate: {detection_rate:.1f}%"
        )

# Summary of which attacks are hardest to detect
print("\n\n🎯 Attack Detection Summary:")
hardest_to_detect = []
for attack_type in df[df["label"] == 1]["attack_type"].unique():
    subset = df[df["attack_type"] == attack_type]
    detected = subset["predicted"].sum()
    detection_rate = 100 * detected / len(subset)
    hardest_to_detect.append((attack_type, detection_rate, len(subset)))

hardest_to_detect.sort(key=lambda x: x[1])
print("   Attacks ranked by detection difficulty (hardest first):")
for attack, rate, count in hardest_to_detect:
    status = "🔴" if rate < 50 else "🟡" if rate < 80 else "🟢"
    print(f"   {status} {attack}: {rate:.1f}% ({count} samples)")

## Summary

In this lab, we built a network anomaly detection pipeline on synthetic flow data and compared three unsupervised detectors across eight simulated attack types.

### Dataset Characteristics
- **Normal Traffic Types**: Web browsing, Email, DNS, SSH, Database, API
- **Attack Types Simulated**:
  - Port Scanning (T1046)
  - Brute Force Authentication (T1110)
  - C2 Beaconing (T1071)
  - Data Exfiltration (T1048)
  - DNS Tunneling (T1071.004)
  - DDoS Attacks (T1498)
  - Lateral Movement via SMB (T1021.002)
  - Cryptomining (T1496)

### Feature Engineering
- **Rate Features**: bytes/second, packets/second (detect high-volume attacks)
- **Ratio Features**: send/receive ratio (detect asymmetric exfiltration)
- **Pattern Features**: multi-destination, multi-source (detect scans and DDoS)
- **Port Features**: suspicious ports, well-known ports
- **Log-transformed**: Handle extreme outliers gracefully

### Models Compared
- **Isolation Forest**: Fast, tree-based anomaly isolation
- **One-Class SVM**: Boundary-based detection using kernel methods
- **Local Outlier Factor**: Density-based local anomaly detection

### Key Takeaways
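1. Detection rates vary by attack type: low-volume C2 beaconing that blends in with web traffic is harder to detect than high-volume DDoS
2. Feature engineering is crucial: rate and ratio features expose high-volume and asymmetric traffic that raw counts hide
3. Combining multiple detectors can improve robustness, since each model makes different errors
4. The contamination parameter should match the expected anomaly rate (here attacks make up about 14% of flows, while the models use 0.1)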

### MITRE ATT&CK Coverage
| Attack Type | MITRE Technique | Detection Approach |
|-------------|-----------------|-------------------|
| Port Scan | T1046 | High dst_ip_count, many small packets |
| Brute Force | T1110 | Long duration, auth port, many packets |
| C2 Beacon | T1071 | Small, short flows on web ports (beacon timing is not modeled in this lab) |
| Exfiltration | T1048 | High bytes_sent, asymmetric ratio |
| DNS Tunnel | T1071.004 | Large DNS packets, high volume |
| DDoS | T1498 | Massive packets, multi-source |
| Lateral Move | T1021.002 | SMB ports, internal multi-dest |
| Cryptomining | T1496 | Very long duration, mining ports |

### Next Steps
1. Add time-series features (hourly patterns, seasonality)
2. Implement ensemble voting from multiple detectors
3. Build real-time streaming detection pipeline
4. Add supervised learning for known attack classification
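
As a starting point for the ensemble voting step, here is a minimal sketch of majority voting over binary predictions. The vote arrays are hypothetical. In the lab they would be `iso_pred_binary`, `ocsvm_pred_binary`, and `lof_pred_binary`.

```python
import numpy as np

# Hypothetical binary predictions from three detectors (1 = anomaly), illustrative only
iso_votes = np.array([1, 0, 1, 0, 1, 0])
ocsvm_votes = np.array([1, 0, 0, 0, 1, 1])
lof_votes = np.array([0, 0, 1, 0, 1, 0])

votes = np.vstack([iso_votes, ocsvm_votes, lof_votes])
vote_count = votes.sum(axis=0)  # number of detectors flagging each flow

majority = (vote_count >= 2).astype(int)  # at least two of three agree
any_flag = (vote_count >= 1).astype(int)  # flagged by any detector

print("Votes per flow:", vote_count)
print("Majority vote :", majority)
print("Any detector  :", any_flag)
```

Majority voting tends to trade some recall for precision, while flagging on any single vote does the opposite. Which rule is appropriate depends on how costly missed attacks are compared with false alerts.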