# Lab 11: Advanced Malware Sample Clustering

Use unsupervised learning to cluster malware samples by behavior and identify families.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab11_malware_clustering.ipynb)

## Learning Objectives
- Comprehensive feature extraction from PE files (imports, sections, resources)
- Dynamic behavioral features (API sequences, registry, network, file operations)
- K-Means, DBSCAN, HDBSCAN, and hierarchical clustering
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Cluster evaluation and malware family identification
- Threat intelligence enrichment

## Malware Families Covered

This lab includes samples from major malware categories:
- **Banking Trojans**: Emotet, TrickBot, Dridex, QakBot, IcedID
- **Ransomware**: LockBit, BlackCat, Conti, Royal, REvil
- **RATs**: Remcos, AsyncRAT, njRAT, Quasar
- **Loaders**: Bumblebee, GuLoader, SocGholish
- **Info Stealers**: RedLine, Raccoon, Vidar, LummaC2
- **APT Tools**: Cobalt Strike, Sliver, Havoc

**Next:** Lab 32 (Anomaly Detection)

In [None]:
# Colab: Install dependencies (skip this cell locally - packages already in venv)
# %pip install -q scikit-learn pandas numpy matplotlib seaborn plotly

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Plotly for interactive visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")
np.random.seed(42)

# Plotly template for Colab
PLOTLY_TEMPLATE = "plotly_white"

## 1. Load and Explore Malware Features

In [None]:
# Comprehensive malware feature dataset with realistic characteristics
np.random.seed(42)

# Malware family profiles with realistic characteristics
MALWARE_PROFILES = {
    # Banking Trojans
    "Emotet": {
        "category": "banking_trojan",
        "entropy_range": (7.0, 7.8),
        "import_range": (200, 400),
        "section_range": (5, 8),
        "file_size_mean": 300000,
        "has_resources": True,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "key_apis": ["CreateRemoteThread", "VirtualAllocEx", "WriteProcessMemory"],
    },
    "TrickBot": {
        "category": "banking_trojan",
        "entropy_range": (6.8, 7.5),
        "import_range": (150, 350),
        "section_range": (4, 7),
        "file_size_mean": 500000,
        "has_resources": True,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "key_apis": ["HttpSendRequest", "InternetConnect", "CryptEncrypt"],
    },
    "Dridex": {
        "category": "banking_trojan",
        "entropy_range": (7.2, 7.9),
        "import_range": (180, 320),
        "section_range": (5, 7),
        "file_size_mean": 250000,
        "has_resources": False,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "key_apis": ["NtMapViewOfSection", "NtUnmapViewOfSection", "RegSetValueEx"],
    },
    "QakBot": {
        "category": "banking_trojan",
        "entropy_range": (7.1, 7.7),
        "import_range": (120, 280),
        "section_range": (4, 6),
        "file_size_mean": 400000,
        "has_resources": True,
        "packing": "upx",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "key_apis": ["CreateProcessA", "VirtualAllocEx", "GetProcAddress"],
    },
    # Ransomware
    "LockBit": {
        "category": "ransomware",
        "entropy_range": (6.5, 7.5),
        "import_range": (100, 250),
        "section_range": (4, 6),
        "file_size_mean": 180000,
        "has_resources": False,
        "packing": "none",
        "network_behavior": False,
        "registry_mods": True,
        "process_injection": False,
        "file_encryption": True,
        "shadow_copy_del": True,
        "key_apis": ["CryptEncrypt", "FindFirstFile", "MoveFileEx", "DeleteFileW"],
    },
    "BlackCat": {
        "category": "ransomware",
        "entropy_range": (6.8, 7.6),
        "import_range": (80, 200),
        "section_range": (3, 5),
        "file_size_mean": 2500000,  # Rust binary = larger
        "has_resources": False,
        "packing": "none",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": False,
        "file_encryption": True,
        "shadow_copy_del": True,
        "key_apis": ["BCryptEncrypt", "FindFirstFile", "CreateThread"],
    },
    "Conti": {
        "category": "ransomware",
        "entropy_range": (6.6, 7.4),
        "import_range": (90, 220),
        "section_range": (4, 6),
        "file_size_mean": 200000,
        "has_resources": False,
        "packing": "none",
        "network_behavior": True,  # Exfiltration
        "registry_mods": True,
        "process_injection": False,
        "file_encryption": True,
        "shadow_copy_del": True,
        "key_apis": ["ChaCha20", "FindFirstFileW", "GetLogicalDrives"],
    },
    # RATs
    "Remcos": {
        "category": "rat",
        "entropy_range": (6.0, 7.2),
        "import_range": (100, 250),
        "section_range": (5, 8),
        "file_size_mean": 600000,
        "has_resources": True,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "keylogging": True,
        "key_apis": ["GetAsyncKeyState", "SetWindowsHookEx", "recv", "send"],
    },
    "AsyncRAT": {
        "category": "rat",
        "entropy_range": (5.5, 6.8),
        "import_range": (150, 300),
        "section_range": (4, 6),
        "file_size_mean": 45000,  # .NET = smaller
        "has_resources": True,
        "packing": "none",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": False,
        "keylogging": True,
        "dotnet": True,
        "key_apis": ["Socket", "TcpClient", "WebClient"],
    },
    # Info Stealers
    "RedLine": {
        "category": "stealer",
        "entropy_range": (5.8, 6.9),
        "import_range": (80, 200),
        "section_range": (4, 6),
        "file_size_mean": 150000,
        "has_resources": True,
        "packing": "none",
        "network_behavior": True,
        "registry_mods": False,
        "browser_theft": True,
        "crypto_theft": True,
        "dotnet": True,
        "key_apis": ["CryptUnprotectData", "SQLite", "HttpWebRequest"],
    },
    "Raccoon": {
        "category": "stealer",
        "entropy_range": (6.2, 7.1),
        "import_range": (100, 220),
        "section_range": (5, 7),
        "file_size_mean": 250000,
        "has_resources": True,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": False,
        "browser_theft": True,
        "crypto_theft": True,
        "key_apis": ["InternetReadFile", "CryptUnprotectData", "RegEnumKeyEx"],
    },
    # APT Tools
    "CobaltStrike": {
        "category": "apt_tool",
        "entropy_range": (7.0, 7.95),
        "import_range": (30, 80),  # Reflective loading = few imports
        "section_range": (3, 5),
        "file_size_mean": 300000,
        "has_resources": False,
        "packing": "custom",
        "network_behavior": True,
        "registry_mods": True,
        "process_injection": True,
        "reflective_loading": True,
        "key_apis": ["VirtualAlloc", "CreateThread", "RtlMoveMemory"],
    },
}


def generate_malware_samples(num_samples: int = 500) -> pd.DataFrame:
    """Generate realistic malware sample features."""
    from datetime import datetime, timedelta

    samples = []
    families = list(MALWARE_PROFILES.keys())

    # Base date for sample discovery (the offsets below span roughly a 64-day collection window)
    base_date = datetime(2024, 1, 1)

    for i in range(num_samples):
        family = families[i % len(families)]  # Rotate through families
        profile = MALWARE_PROFILES[family]

        # Generate features based on profile
        entropy = np.random.uniform(*profile["entropy_range"])
        num_imports = np.random.randint(*profile["import_range"])
        num_sections = np.random.randint(*profile["section_range"])
        file_size = int(np.random.lognormal(np.log(profile["file_size_mean"]), 0.3))

        # Generate first_seen timestamp (malware campaigns cluster in waves)
        # Different families emerge at different times
        family_idx = families.index(family)
        campaign_start = family_idx * 4  # Stagger family emergence
        day_offset = campaign_start + np.random.randint(0, 20)
        hour = np.random.randint(0, 24)
        first_seen = base_date + timedelta(days=day_offset, hours=hour, minutes=np.random.randint(0, 60))

        # Behavioral features
        sample = {
            "sha256": f"sample_{i:04d}_{family.lower()[:3]}",
            "family": family,
            "category": profile["category"],
            "first_seen": first_seen,
            "file_size": file_size,
            "entropy": entropy,
            "num_imports": num_imports,
            "num_sections": num_sections,
            "has_debug": np.random.choice([0, 1], p=[0.85, 0.15]),
            "has_signature": np.random.choice([0, 1], p=[0.95, 0.05]),
            "has_resources": 1 if profile.get("has_resources") else 0,
            "is_packed": 1 if profile.get("packing") != "none" else 0,
            "is_dotnet": 1 if profile.get("dotnet") else 0,
            "network_behavior": 1 if profile.get("network_behavior") else 0,
            "registry_mods": 1 if profile.get("registry_mods") else 0,
            "process_injection": 1 if profile.get("process_injection") else 0,
            "file_encryption": 1 if profile.get("file_encryption") else 0,
            "keylogging": 1 if profile.get("keylogging") else 0,
            "browser_theft": 1 if profile.get("browser_theft") else 0,
            "reflective_loading": 1 if profile.get("reflective_loading") else 0,
        }

        # Add noise
        sample["entropy"] += np.random.normal(0, 0.1)
        sample["num_imports"] += np.random.randint(-20, 20)

        samples.append(sample)

    return pd.DataFrame(samples)


# Generate comprehensive dataset
df = generate_malware_samples(num_samples=500)

print(f"Generated {len(df)} malware samples")
print(f"Time range: {df['first_seen'].min().strftime('%Y-%m-%d')} to {df['first_seen'].max().strftime('%Y-%m-%d')}")
print("\nFamily distribution:")
print(df["family"].value_counts())
print("\nCategory distribution:")
print(df["category"].value_counts())

# Show sample
print("\nSample features:")
print(df[["sha256", "family", "first_seen", "entropy", "num_imports"]].head(10))

In [None]:
# Interactive feature distributions with Plotly
# Box plots show distribution: median (line), quartiles (box), range (whiskers), outliers (dots)

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        "📊 Entropy (Randomness/Packing)",
        "📦 Import Count (API Usage)",
        "📁 File Size (Bytes)",
        "🔢 PE Sections"
    ],
    vertical_spacing=0.18,
    horizontal_spacing=0.1,
)

features = ["entropy", "num_imports", "file_size", "num_sections"]
positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for feature, (row, col) in zip(features, positions):
    for family in df["family"].unique():
        subset = df[df["family"] == family]
        fig.add_trace(
            go.Box(
                y=subset[feature],
                name=family,
                legendgroup=family,
                showlegend=(row == 1 and col == 1),
                hovertemplate=(
                    f"<b>{family}</b><br>"
                    f"{feature}: %{{y:.2f}}<br>"
                    "<extra></extra>"
                ),
            ),
            row=row, col=col
        )

fig.update_layout(
    height=800,
    width=1100,
    template=PLOTLY_TEMPLATE,
    title=dict(
        text="🦠 Feature Distributions by Malware Family<br><sup>How do different malware families differ in their characteristics?</sup>",
        font=dict(size=16),
    ),
    showlegend=True,
    legend=dict(
        orientation="h",
        yanchor="top",
        y=-0.08,
        xanchor="center",
        x=0.5,
        font=dict(size=10),
    ),
    margin=dict(b=120),  # More space for legend
)

# Rotate x-axis labels to prevent cutoff
for row in [1, 2]:
    for col in [1, 2]:
        fig.update_xaxes(tickangle=45, tickfont=dict(size=9), row=row, col=col)

fig.show()

# Interpretation guide
print("\n📖 How to Read Box Plots:")
print("=" * 55)
print("  ┌─────┐  ← Upper whisker (max, excluding outliers)")
print("  │     │")
print("  ├─────┤  ← 75th percentile (Q3)")
print("  │  ─  │  ← Median (50th percentile)")
print("  ├─────┤  ← 25th percentile (Q1)")
print("  │     │")
print("  └─────┘  ← Lower whisker (min, excluding outliers)")
print("     •     ← Outliers (unusual samples)")
print()
print("🔍 What Each Feature Tells Us:")
print("-" * 55)
print("  📊 Entropy (0-8): Higher = more random/encrypted/packed")
print("     • >7.5: Likely packed or encrypted")
print("     • 5-7: Normal executable")
print("     • <5: Lots of resources or padding")
print()
print("  📦 Imports: Number of Windows API functions used")
print("     • High (>300): Feature-rich, many capabilities")
print("     • Low (<100): Minimal or uses dynamic loading")
print("     • Very low + high entropy = suspicious (reflective loading)")
print()
print("  📁 File Size: Varies by malware type")
print("     • Large (>1MB): May include resources, Rust/Go binary")
print("     • Small (<100KB): Droppers, shellcode loaders")
print()
print("  🔢 Sections: PE file structure")
print("     • 3-5: Normal")
print("     • >7: May have added sections (packing, injection)")
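
The 0-8 range quoted for entropy corresponds to Shannon entropy over byte values, measured in bits per byte. The lab's dataset draws entropy values synthetically, so the following is only a minimal sketch of how such a value could be computed from raw bytes. The helper name `shannon_entropy` and the three toy inputs are illustrative assumptions, not part of the lab code.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # Sum of p * log2(1/p) over the byte values that actually occur
    return sum((c / total) * math.log2(total / c) for c in counts.values())


padding = b"\x00" * 4096           # one repeated byte: no randomness
uniform = bytes(range(256)) * 16   # every byte value equally often
random_like = os.urandom(65536)    # stand-in for encrypted or packed data

print(f"padding:     {shannon_entropy(padding):.2f}")
print(f"uniform:     {shannon_entropy(uniform):.2f}")
print(f"random-like: {shannon_entropy(random_like):.2f}")
```

Constant padding sits at the bottom of the scale and random-looking bytes sit near the top, which is why high entropy is commonly read as a hint of packing or encryption.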

In [None]:
# Malware Discovery Timeline - when were samples first seen?
if 'first_seen' in df.columns:
    df['discovery_date'] = pd.to_datetime(df['first_seen']).dt.date
    daily_discoveries = df.groupby(['discovery_date', 'family']).size().reset_index(name='count')
    daily_discoveries['discovery_date'] = pd.to_datetime(daily_discoveries['discovery_date'])

    fig = make_subplots(
        rows=2, cols=1,
        subplot_titles=['Daily Malware Discoveries by Family', 'Cumulative Samples Over Time'],
        vertical_spacing=0.15,
    )

    # Stacked area chart of discoveries
    for family in df['family'].unique():
        family_data = daily_discoveries[daily_discoveries['family'] == family]
        fig.add_trace(
            go.Scatter(
                x=family_data['discovery_date'],
                y=family_data['count'],
                name=family,
                mode='lines',
                stackgroup='one',
            ),
            row=1, col=1
        )

    # Cumulative discoveries
    cumulative = df.sort_values('first_seen').reset_index(drop=True)
    cumulative['cumulative'] = range(1, len(cumulative) + 1)
    fig.add_trace(
        go.Scatter(
            x=cumulative['first_seen'],
            y=cumulative['cumulative'],
            name='Total Samples',
            mode='lines',
            line=dict(color='#e74c3c', width=3),
            fill='tozeroy',
            fillcolor='rgba(231,76,60,0.2)',
        ),
        row=2, col=1
    )

    fig.update_layout(
        title='🦠 Malware Discovery Timeline',
        template=PLOTLY_TEMPLATE,
        height=550,
        showlegend=True,
        legend=dict(orientation='h', yanchor='bottom', y=1.02),
    )
    fig.update_xaxes(title_text='Date', row=1, col=1)
    fig.update_xaxes(title_text='Date', row=2, col=1)
    fig.update_yaxes(title_text='Samples Discovered', row=1, col=1)
    fig.update_yaxes(title_text='Cumulative Samples', row=2, col=1)
    fig.show()

    # Family emergence analysis
    family_first = df.groupby('family')['first_seen'].min().sort_values()
    print('📅 Malware Family Emergence Order:')
    for family, date in family_first.items():
        print(f'   {date.strftime("%Y-%m-%d")}: {family}')
else:
    print('Note: first_seen column not available in this dataset')


## 2. Feature Engineering

In [None]:
# Prepare features for clustering
feature_cols = ["entropy", "num_imports", "num_sections", "has_debug", "has_signature"]

# Log transform file_size (highly skewed)
df["log_file_size"] = np.log1p(df["file_size"])
feature_cols.append("log_file_size")

# Create feature matrix
X = df[feature_cols].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Feature matrix shape: {X_scaled.shape}")
print(f"Features: {feature_cols}")
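
To see why the cell above log-transforms `file_size` and standardizes every column, here is a small sketch on three made-up samples (the values and the helper `dist` are hypothetical, chosen only for illustration). Distance-based methods such as K-Means compare samples with Euclidean distance, so a column measured in hundreds of thousands of bytes can swamp a column measured on a 0-8 scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy samples: columns are [entropy, file_size_bytes]
toy = np.array([
    [7.5, 300_000.0],  # sample A
    [5.5, 310_000.0],  # sample B: very different entropy, similar size
    [7.4, 900_000.0],  # sample C: similar entropy, much larger file
])


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Raw distances: the byte-count column dominates
raw_ab, raw_ac = dist(toy[0], toy[1]), dist(toy[0], toy[2])
print(f"raw    A-B: {raw_ab:12.1f}   A-C: {raw_ac:12.1f}")

# Same preprocessing as the lab cell: log1p on file size, then z-score each column
toy_t = toy.copy()
toy_t[:, 1] = np.log1p(toy_t[:, 1])
toy_scaled = StandardScaler().fit_transform(toy_t)

scaled_ab, scaled_ac = dist(toy_scaled[0], toy_scaled[1]), dist(toy_scaled[0], toy_scaled[2])
print(f"scaled A-B: {scaled_ab:12.3f}   A-C: {scaled_ac:12.3f}")
```

After scaling, each column has zero mean and unit variance, so both features contribute on a comparable footing.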

## 3. Dimensionality Reduction

In [None]:
# PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# t-SNE for better separation
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

In [None]:
# Interactive PCA and t-SNE with Plotly
df["pca_1"] = X_pca[:, 0]
df["pca_2"] = X_pca[:, 1]
df["tsne_1"] = X_tsne[:, 0]
df["tsne_2"] = X_tsne[:, 1]

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=[
        "📐 PCA (Linear projection)",
        "🎯 t-SNE (Non-linear, preserves clusters)"
    ]
)

# PCA scatter
for family in df["family"].unique():
    mask = df["family"] == family
    fig.add_trace(
        go.Scatter(
            x=df.loc[mask, "pca_1"],
            y=df.loc[mask, "pca_2"],
            mode="markers",
            name=family,
            legendgroup=family,
            marker=dict(size=8, opacity=0.7),
            hovertemplate=f"<b>{family}</b><br>PC1: %{{x:.2f}}<br>PC2: %{{y:.2f}}<extra></extra>",
        ),
        row=1, col=1
    )

# t-SNE scatter
for family in df["family"].unique():
    mask = df["family"] == family
    fig.add_trace(
        go.Scatter(
            x=df.loc[mask, "tsne_1"],
            y=df.loc[mask, "tsne_2"],
            mode="markers",
            name=family,
            legendgroup=family,
            showlegend=False,
            marker=dict(size=8, opacity=0.7),
            hovertemplate=f"<b>{family}</b><br>t-SNE 1: %{{x:.2f}}<br>t-SNE 2: %{{y:.2f}}<extra></extra>",
        ),
        row=1, col=2
    )

fig.update_xaxes(title_text="PC1", row=1, col=1)
fig.update_yaxes(title_text="PC2", row=1, col=1)
fig.update_xaxes(title_text="t-SNE 1", row=1, col=2)
fig.update_yaxes(title_text="t-SNE 2", row=1, col=2)

fig.update_layout(
    height=500,
    width=1100,
    template=PLOTLY_TEMPLATE,
    title=dict(
        text="🔬 Visualizing High-Dimensional Malware Features in 2D<br><sup>Each dot = 1 malware sample | Colors = known malware families</sup>",
        font=dict(size=16),
    ),
    legend=dict(orientation="h", yanchor="bottom", y=-0.2),
)
fig.show()

# Interpretation
print("\n📖 Understanding These Projections:")
print("=" * 55)
print(f"\n  We have {len(feature_cols)} features per sample (entropy, imports, etc.)")
print("  Humans can only see 2-3 dimensions, so we compress!")
print()
print("  📐 PCA (Principal Component Analysis):")
print("     • Linear projection: finds axes of maximum variance")
print("     • Fast, preserves global structure")
print("     • May not show cluster separation well")
print()
print("  🎯 t-SNE (t-distributed Stochastic Neighbor Embedding):")
print("     • Non-linear: focuses on preserving local neighborhoods")
print("     • Better at revealing clusters")
print("     • Slower, distances between clusters less meaningful")
print()
print("💡 What to Look For:")
print("   • Tight, separated color groups = features distinguish families")
print("   • Overlapping colors = families share similar characteristics")
print("   • Outliers = unusual samples (worth investigating!)")
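
The "axes of maximum variance" description of PCA can be made concrete with a singular value decomposition. The sketch below uses random correlated data as a stand-in for the lab's feature matrix (the names `X_demo`, `explained`, and `projected` are assumptions for illustration) and checks the result against scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 samples, 6 correlated features (not the lab's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

# PCA by hand: center the data, then take the SVD
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal axes; squared singular values give the variance per axis
explained = S**2 / np.sum(S**2)
projected = X_centered @ Vt[:2].T  # 2D coordinates on the first two axes

# Reference result from scikit-learn
pca_demo = PCA(n_components=2, svd_solver="full").fit(X_demo)

print("manual explained variance ratio: ", explained[:2])
print("sklearn explained variance ratio:", pca_demo.explained_variance_ratio_)
```

The two projections agree up to the sign of each axis, since the direction of a principal axis is arbitrary.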

In [None]:
# t-SNE colored by malware CATEGORY (broader grouping)
# Categories group families by their primary purpose/behavior

# Define category colors and explanations
category_info = {
    "banking_trojan": {"color": "#e74c3c", "emoji": "🏦", "desc": "Steals banking credentials"},
    "ransomware": {"color": "#9b59b6", "emoji": "🔒", "desc": "Encrypts files for ransom"},
    "rat": {"color": "#3498db", "emoji": "🖥️", "desc": "Remote Access Trojan"},
    "stealer": {"color": "#f39c12", "emoji": "🔑", "desc": "Steals passwords/data"},
    "apt_tool": {"color": "#1abc9c", "emoji": "🎯", "desc": "Advanced persistent threat"},
}

fig = px.scatter(
    df,
    x="tsne_1",
    y="tsne_2",
    color="category",
    symbol="category",
    template=PLOTLY_TEMPLATE,
    color_discrete_map={k: v["color"] for k, v in category_info.items()},
    hover_data=["family", "entropy", "num_imports", "file_size"],
)

fig.update_traces(marker=dict(size=10, opacity=0.7))
fig.update_layout(
    height=550,
    width=900,
    title=dict(
        text="🎯 Malware Categories: Do Similar Threats Cluster Together?<br><sup>Grouping by malware PURPOSE rather than specific family</sup>",
        font=dict(size=16),
    ),
    xaxis_title="t-SNE Dimension 1",
    yaxis_title="t-SNE Dimension 2",
    legend_title="Malware Category",
)

fig.show()

# Category explanation
print("\n📚 Malware Category Guide:")
print("=" * 55)
for cat, info in category_info.items():
    print(f"   {info['emoji']} {cat:15s}: {info['desc']}")

print("\n🔍 How to Interpret This Chart:")
print("-" * 55)
print("   • Each DOT = one malware sample")
print("   • POSITION = feature similarity (close = similar feature values)")
print("   • COLOR = malware category (purpose)")
print()
print("💡 What the Clustering Reveals:")
print("   • Tight category clusters: Categories have distinct behaviors")
print("   • Category overlap: Different malware types using similar techniques")
print("   • Mixed regions: Multi-purpose malware (e.g., stealer + RAT)")
print()
print("🔬 Security Application:")
print("   • Unknown sample near ransomware cluster → likely ransomware")
print("   • Sample between clusters → may have multiple capabilities")
print("   • Isolated sample → novel/unique threat variant")

## 4. Clustering with K-Means

### 🤔 Why Cluster When We Already Have Labels?

You might wonder: "We already know the malware families - why bother with unsupervised clustering?"

**Great question!** In real-world scenarios:

1. **New/Unknown Malware**: Most samples you encounter are *unlabeled*. Clustering helps group similar threats before you can analyze them.

2. **Discover Variants**: Even within known families, clustering can reveal sub-variants with different behaviors.

3. **Campaign Detection**: Samples from the same attack campaign often cluster together, revealing threat actor patterns.

4. **Validate Labels**: Sometimes vendor labels are wrong! Clustering can expose mislabeled samples.

5. **Prioritize Analysis**: Instead of analyzing 500 samples, analyze 1 representative from each cluster.

### 🎯 How K-Means Works (Step by Step)

K-Means is one of the simplest clustering algorithms. Here's how it works:

```
Step 1: INITIALIZE
   Pick K random points as initial cluster centers (centroids)
   
   Example with K=3:
   ┌───────────────────────────┐
   │     ●₁                    │  ● = data points
   │  ●        ★               │  ★ = centroids (cluster centers)
   │      ●          ★         │
   │    ●     ●                │
   │              ★     ●      │
   │   ●                       │
   └───────────────────────────┘

Step 2: ASSIGN
   Assign each point to the NEAREST centroid
   
   ┌───────────────────────────┐
   │     🔴₁                   │  
   │  🔴       ★🔴             │  Points colored by nearest centroid
   │      🔴         ★🟢       │
   │    🔴    🟡               │
   │             ★🟡    🟢     │
   │   🔴                      │
   └───────────────────────────┘

Step 3: UPDATE
   Move each centroid to the CENTER of its assigned points
   
   ┌───────────────────────────┐
   │     🔴                    │  
   │  🔴    ★                  │  Centroids moved to cluster centers
   │      🔴              🟢   │
   │    🔴    🟡         ★     │
   │          ★   🟡    🟢     │
   │   🔴                      │
   └───────────────────────────┘

Step 4: REPEAT Steps 2-3 until centroids stop moving
```

**Key Insight**: K-Means works best when clusters are roughly spherical and similar in size. Malware feature clusters often aren't, which is why we also try DBSCAN!
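
The four steps above can be sketched in a few lines of NumPy. This is a teaching sketch under simple assumptions (random initialization from the data points, a fixed iteration cap, toy 2D blobs), not the implementation the lab uses; the lab relies on scikit-learn's `KMeans`. The function name `kmeans_sketch` is hypothetical.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means: initialize, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: three 2D blobs (illustrative only)
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

labels, centroids = kmeans_sketch(blobs, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("centroids:\n", centroids.round(2))
```

Because the starting centroids are random, different seeds can converge to different groupings. That is the sensitivity the lab works around with `n_init=10`.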

In [None]:
# Find optimal k using elbow method and silhouette score
k_range = list(range(2, 11))
inertias = []
silhouettes = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

optimal_k = k_range[np.argmax(silhouettes)]

# Interactive elbow and silhouette plots
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=["Elbow Method", "Silhouette Score"]
)

# Elbow plot
fig.add_trace(
    go.Scatter(
        x=k_range, y=inertias,
        mode="lines+markers",
        name="Inertia",
        line=dict(color="#3498db", width=2),
        marker=dict(size=10),
        hovertemplate="k=%{x}<br>Inertia: %{y:.0f}<extra></extra>",
    ),
    row=1, col=1
)

# Silhouette plot
fig.add_trace(
    go.Scatter(
        x=k_range, y=silhouettes,
        mode="lines+markers",
        name="Silhouette",
        line=dict(color="#2ecc71", width=2),
        marker=dict(size=10),
        hovertemplate="k=%{x}<br>Silhouette: %{y:.3f}<extra></extra>",
    ),
    row=1, col=2
)

# Highlight optimal k
fig.add_trace(
    go.Scatter(
        x=[optimal_k], y=[max(silhouettes)],
        mode="markers",
        name=f"Optimal (k={optimal_k})",
        marker=dict(size=15, color="#e74c3c", symbol="star"),
        hovertemplate=f"Optimal k={optimal_k}<br>Silhouette: {max(silhouettes):.3f}<extra></extra>",
    ),
    row=1, col=2
)

fig.update_xaxes(title_text="Number of Clusters (k)", row=1, col=1)
fig.update_yaxes(title_text="Inertia (Within-cluster variance)", row=1, col=1)
fig.update_xaxes(title_text="Number of Clusters (k)", row=1, col=2)
fig.update_yaxes(title_text="Silhouette Score (-1 to 1)", row=1, col=2)

# Add annotation for elbow
fig.add_annotation(
    x=5, y=inertias[3],
    text="← Look for 'elbow'<br>where curve bends",
    showarrow=True,
    arrowhead=2,
    ax=60, ay=-30,
    font=dict(size=10),
    row=1, col=1
)

# Add annotation for silhouette
fig.add_annotation(
    x=optimal_k, y=max(silhouettes) + 0.02,
    text=f"Best k={optimal_k}<br>Higher = better",
    showarrow=True,
    arrowhead=2,
    ax=0, ay=-40,
    font=dict(size=10, color="#e74c3c"),
    row=1, col=2
)

fig.update_layout(
    height=450,
    width=1000,
    template=PLOTLY_TEMPLATE,
    title=dict(
        text="🔍 Finding the Right Number of Clusters<br><sup>How many malware families should we group samples into?</sup>",
        font=dict(size=16),
    ),
    showlegend=True,
)
fig.show()

# Interpretation guide
print("\n📖 How to Read These Charts:")
print("=" * 55)
print("\n📉 ELBOW METHOD (Left):")
print("   • Inertia = how spread out samples are within clusters")
print("   • Lower inertia = tighter clusters (good!)")
print("   • Look for the 'elbow' - where adding more clusters")
print("     stops significantly reducing inertia")
print("   • Too few clusters → samples forced into wrong groups")
print("   • Too many clusters → overfitting, less meaningful groups")
print()
print("📈 SILHOUETTE SCORE (Right):")
print("   • Measures how similar samples are to their own cluster")
print("     vs. other clusters (-1 to +1)")
print("   • +1 = perfect clustering (samples very distinct)")
print("   •  0 = overlapping clusters")
print("   • -1 = samples assigned to wrong clusters")
print()
print(f"✅ Recommendation: k={optimal_k} (highest silhouette score)")
print(f"   This suggests {optimal_k} distinct malware behavior patterns")
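
For a single sample, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the members of the nearest other cluster. The sketch below works this out by hand on four made-up points (the data and variable names are illustrative assumptions) and compares it with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four toy points in two obvious clusters (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

i = 0  # sample to inspect

# a: mean distance to the other members of the sample's own cluster
same = (toy_labels == toy_labels[i]) & (np.arange(len(X_toy)) != i)
a = np.linalg.norm(X_toy[same] - X_toy[i], axis=1).mean()

# b: smallest mean distance to the members of any other cluster
b = min(
    np.linalg.norm(X_toy[toy_labels == c] - X_toy[i], axis=1).mean()
    for c in np.unique(toy_labels)
    if c != toy_labels[i]
)

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_toy, toy_labels)[i]
print(f"a={a:.3f}  b={b:.3f}  manual={s_manual:.3f}  sklearn={s_sklearn:.3f}")
```

`silhouette_score`, used in the cell above, is the mean of these per-sample values.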

In [None]:
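# Apply K-Means clustering
# K-Means groups samples into k clusters by minimizing distance to cluster centers
# Note: k is hard-coded to 5 (the dataset has five malware categories);
# this may differ from optimal_k computed in the previous cell
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)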
df["kmeans_cluster"] = kmeans.fit_predict(X_scaled)

print("🎯 K-Means Clustering Results")
print("=" * 55)
print("\n📊 What K-Means Did:")
print("   1. Placed 5 'center points' in feature space")
print("   2. Assigned each malware sample to nearest center")
print("   3. Repeated until clusters stabilized")
print()
print("📦 Cluster Distribution:")
cluster_counts = df["kmeans_cluster"].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    pct = count / len(df) * 100
    bar = "█" * int(pct / 2)
    print(f"   Cluster {cluster_id}: {count:3d} samples ({pct:5.1f}%) {bar}")
print()
print("💡 Interpretation:")
print("   • Similar-sized clusters → behavior patterns are evenly represented")
print("   • One large cluster → many samples share characteristics")
print("   • Very small cluster → rare/unique malware variant")
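
Cluster sizes alone do not say whether clusters line up with known labels. When labels are available, a cross-tabulation and the adjusted Rand index (`adjusted_rand_score`, already imported at the top of the lab) give a quick check. The sketch below uses a tiny hand-made table; the `toy` frame and the `purity` measure are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels and cluster assignments for eight samples
toy = pd.DataFrame({
    "category": ["ransomware"] * 3 + ["rat"] * 3 + ["stealer"] * 2,
    "cluster": [0, 0, 0, 1, 1, 2, 2, 2],
})

# Rows = clusters, columns = known categories
table = pd.crosstab(toy["cluster"], toy["category"])
print(table)

# Purity: share of samples that belong to their cluster's majority category
purity = table.max(axis=1).sum() / len(toy)

# Adjusted Rand index: 1.0 = identical groupings, about 0 = chance-level agreement
ari = adjusted_rand_score(toy["category"], toy["cluster"])
print(f"purity={purity:.3f}  ARI={ari:.3f}")
```

In the lab, the same calls could be applied to `df["kmeans_cluster"]` and `df["category"]` (or `df["family"]`).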

## 5. Clustering with DBSCAN

### 🔬 How DBSCAN Works

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) works completely differently from K-Means:

```
Instead of K centers, DBSCAN asks: "Where are the DENSE regions?"

Key Concepts:
┌────────────────────────────────────────────────────────┐
│  ● Core Point: Has at least min_samples neighbors      │
│                within distance eps                      │
│                                                        │
│  ○ Border Point: Within eps of a core point,           │
│                  but not enough neighbors               │
│                                                        │
│  × Noise Point: Not within eps of any core point       │
│                 (outlier!)                              │
└────────────────────────────────────────────────────────┘

Example with eps=1.0, min_samples=3:

   ╭─────╮
   │ ●  ●│        ×         ● = Core (3+ neighbors)
   │●  ● │                  ○ = Border point
   │ ●○  │                  × = Noise (outlier)
   ╰─────╯
     Cluster 1              Alone = Noise
```
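
The three point types can be read off a fitted scikit-learn `DBSCAN` model. The sketch below uses six hand-placed points (an illustrative assumption) with `eps=1.0` and `min_samples=3`. One detail worth knowing: scikit-learn counts the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],  # dense group
    [1.4, 0.5],                                      # near the group's edge
    [5.0, 5.0],                                      # far from everything
])

db = DBSCAN(eps=1.0, min_samples=3).fit(pts)

# Core points are listed explicitly by the fitted model
is_core = np.zeros(len(pts), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points get the label -1; border points are clustered but not core
is_noise = db.labels_ == -1
is_border = ~is_core & ~is_noise

for p, lab, core, border in zip(pts, db.labels_, is_core, is_border):
    kind = "core" if core else "border" if border else "noise"
    print(f"{p} -> label {lab:2d} ({kind})")
```

The point at `(1.4, 0.5)` has only one other point within `eps`, so it is not core, but it lies within `eps` of a core point and therefore joins the cluster as a border point.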

### 📐 Choosing DBSCAN Parameters

This is the tricky part! The two parameters are:

| Parameter | What it means | Too small | Too large |
|-----------|--------------|-----------|-----------|
| **eps** | Maximum distance between neighbors | Many small clusters + lots of noise | Everything in one cluster |
| **min_samples** | Minimum points to form a dense region | Noise becomes clusters | Real clusters become noise |

**Rules of Thumb:**

1. **eps**: Start with values around 0.5-1.0 for standardized data
   - Look at k-distance plot (see below) to find the "elbow"

2. **min_samples**: Start with `2 * n_features` or at least 3
   - Higher = stricter definition of "dense"
   - For malware: 5-10 is reasonable (rare variants become noise)
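
One way to get a feel for these rules of thumb is to sweep `eps` at a fixed `min_samples` and watch the cluster count and noise count change. The sketch below does that on synthetic blobs from `make_blobs`, which stand in for the lab's data; the specific `eps` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data: 300 samples, 6 features, standardized like the lab's matrix
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 2 * X_demo.shape[1]  # the "2 * n_features" rule of thumb

results = []
for eps in (0.3, 0.5, 1.0, 2.0, 4.0):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    results.append((eps, n_clusters, n_noise))
    print(f"eps={eps:<4} clusters={n_clusters:<3} noise={n_noise}")
```

With `min_samples` held fixed, raising `eps` can only shrink the set of noise points, so the noise column never increases down the table.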

### 🎯 Why DBSCAN for Malware?

| K-Means | DBSCAN |
|---------|--------|
| Must specify K (number of clusters) | Infers the number of clusters from density (but needs `eps` and `min_samples`) |
| Assumes spherical clusters | Finds arbitrarily shaped clusters |
| Forces ALL points into clusters | Identifies outliers as "noise" |
| Sensitive to initialization | Deterministic for core and noise points (border points can depend on sample order) |

**For malware analysis**: DBSCAN is great because:
- Novel malware variants become "noise" → flag for manual review!
- Doesn't force unrelated samples into same cluster
- Handles the irregular shapes of real behavioral clusters
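
The shape and noise differences can be seen on a classic non-spherical toy dataset. The sketch below compares the two algorithms on scikit-learn's `make_moons`; the dataset and the `eps=0.2` setting are assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Agreement with the true moon membership (1.0 = perfect)
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_dbscan = adjusted_rand_score(y_true, db_labels)

print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f}")
print(f"DBSCAN noise points: {int(np.sum(db_labels == -1))}")
```

K-Means splits the plane with a straight boundary and cuts across both moons, while DBSCAN follows each dense arc.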

In [None]:
# K-Distance Plot: Finding the optimal eps for DBSCAN
# This technique helps us choose eps by looking at the distance to the k-th nearest neighbor

from sklearn.neighbors import NearestNeighbors

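# Calculate distance to k-th nearest neighbor (k = min_samples)
k = 5  # min_samples we'll use
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
# Note: because X_scaled is passed explicitly, each sample comes back as its own
# nearest neighbor (distance 0), so column k-1 is the k-th point counting the sample itself.
# scikit-learn's DBSCAN also counts the sample itself toward min_samples, so the two agree.
distances, _ = neighbors.kneighbors(X_scaled)

# Sort distances to k-th neighbor (ascending)
k_distances = np.sort(distances[:, k-1])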

# Plot k-distance graph
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(len(k_distances))),
    y=k_distances,
    mode='lines',
    line=dict(color='#3498db', width=2),
    hovertemplate=f"Sample %{{x}}<br>Distance to {k}th neighbor: %{{y:.3f}}<extra></extra>"
))

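# Add annotation for elbow (eyeballed from the curve; clamped so that
# datasets with fewer than 401 samples don't raise an IndexError)
elbow_idx = min(400, len(k_distances) - 1)  # Approximate elbow location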
fig.add_trace(go.Scatter(
    x=[elbow_idx],
    y=[k_distances[elbow_idx]],
    mode='markers',
    marker=dict(size=15, color='#e74c3c', symbol='star'),
    name='Elbow (suggested eps)',
    hovertemplate=f"Elbow<br>eps ≈ {k_distances[elbow_idx]:.2f}<extra></extra>"
))

fig.add_annotation(
    x=elbow_idx, y=k_distances[elbow_idx] + 0.2,
    text=f"← 'Elbow' point<br>eps ≈ {k_distances[elbow_idx]:.2f}",
    showarrow=True,
    arrowhead=2,
    ax=60, ay=-40,
    font=dict(size=12, color='#e74c3c')
)

fig.update_layout(
    title=dict(
        text="📐 K-Distance Plot: Finding the Right eps for DBSCAN<br><sup>The 'elbow' suggests a good eps value</sup>",
        font=dict(size=16)
    ),
    xaxis_title="Samples (sorted by distance)",
    yaxis_title=f"Distance to {k}th Nearest Neighbor",
    template=PLOTLY_TEMPLATE,
    height=450,
    width=900,
)

fig.show()

print("\n📖 How to Read the K-Distance Plot:")
print("=" * 55)
print(f"  • Each point's distance to its {k}th nearest neighbor")
print("  • Sorted from closest to farthest")
print()
print("  📈 The Curve Tells a Story:")
print("     • FLAT region (left): Dense areas - points have close neighbors")
print("     • STEEP region (right): Sparse areas - points are isolated")
print("     • ELBOW: Transition point - good eps candidate!")
print()
print("  💡 Interpretation:")
print(f"     • Below elbow (eps < {k_distances[elbow_idx]:.2f}): Too strict, many outliers")
print(f"     • At elbow (eps ≈ {k_distances[elbow_idx]:.2f}): Good balance")
print(f"     • Above elbow (eps > {k_distances[elbow_idx]:.2f}): Too loose, few outliers")
print()
print(f"  🎯 Recommendation: Try eps = {k_distances[elbow_idx]:.1f} with min_samples = {k}")

In [None]:
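# DBSCAN clustering - density-based approach
# Unlike K-Means, DBSCAN finds clusters of any shape and identifies outliers
# eps=0.8 is a fixed starting value; compare it with the elbow from the
# k-distance plot and adjust if the noise share looks too high or too low
dbscan = DBSCAN(eps=0.8, min_samples=5)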
df["dbscan_cluster"] = dbscan.fit_predict(X_scaled)

print("🔬 DBSCAN Clustering Results")
print("=" * 55)
print("\n📊 What DBSCAN Did:")
print("   • Found dense regions of similar malware samples")
print("   • Marked isolated samples as 'noise' (outliers)")
print("   • No need to specify number of clusters!")
print()
print("📦 Cluster Distribution:")
cluster_counts = df["dbscan_cluster"].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    pct = count / len(df) * 100
    bar = "█" * int(pct / 2)
    label = "NOISE (outliers)" if cluster_id == -1 else f"Cluster {cluster_id}"
    print(f"   {label}: {count:3d} samples ({pct:5.1f}%) {bar}")

noise_count = (df["dbscan_cluster"] == -1).sum()
noise_pct = noise_count / len(df) * 100
print()
print("🚨 Noise Analysis:")
print(f"   • {noise_count} samples ({noise_pct:.1f}%) marked as outliers")
if noise_pct > 20:
    print("   • High noise % suggests many unique/rare malware variants")
    print("   • Consider: these could be polymorphic or newly emerged threats")
elif noise_pct > 5:
    print("   • Moderate noise suggests some uncommon variants")
else:
    print("   • Low noise = most samples fit into clear behavioral groups")
print()
print("💡 DBSCAN vs K-Means:")
print("   • DBSCAN: Better for finding irregular-shaped clusters")
print("   • K-Means: Better when clusters are roughly spherical")
print("   • Noise points in DBSCAN = samples that don't fit any group")

## 6. Evaluate Clustering Results
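
Before reading the metrics, it helps to see how the Adjusted Rand Index behaves on tiny hand-made labelings. The sketch below uses made-up label lists (not the lab's data) to show three properties: ARI ignores how cluster IDs are numbered, it scores a single all-in-one cluster as no better than chance, and because it is chance-adjusted it can fall below zero.

```python
from sklearn.metrics import adjusted_rand_score

true_families = [0, 0, 0, 1, 1, 1]

# Same grouping with the cluster IDs swapped: ARI ignores label names
ari_relabelled = adjusted_rand_score(true_families, [1, 1, 1, 0, 0, 0])

# Every sample placed in one cluster: no better than chance
ari_single = adjusted_rand_score(true_families, [0, 0, 0, 0, 0, 0])

# A grouping that disagrees with the families more than chance would
ari_discordant = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])

print(f"relabelled: {ari_relabelled:.2f}")
print(f"single cluster: {ari_single:.2f}")
print(f"discordant: {ari_discordant:.2f}")
```

The label-invariance is why K-Means cluster IDs can be compared directly against encoded family names without any matching step.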

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode true labels (actual malware families)
le = LabelEncoder()
true_labels = le.fit_transform(df["family"])

# Calculate metrics
kmeans_silhouette = silhouette_score(X_scaled, df["kmeans_cluster"])
kmeans_ari = adjusted_rand_score(true_labels, df["kmeans_cluster"])

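# DBSCAN (excluding noise points, so the comparison with K-Means is not strictly like-for-like)
dbscan_mask = (df["dbscan_cluster"] != -1).to_numpy()
dbscan_labels = df.loc[dbscan_mask, "dbscan_cluster"]
# silhouette_score needs at least 2 clusters and fewer clusters than samples
if 1 < dbscan_labels.nunique() < len(dbscan_labels):
    dbscan_silhouette = silhouette_score(X_scaled[dbscan_mask], dbscan_labels)
    dbscan_ari = adjusted_rand_score(true_labels[dbscan_mask], dbscan_labels)
else:
    dbscan_silhouette = 0
    dbscan_ari = 0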

print("📊 Clustering Quality Metrics")
print("=" * 55)

print("\n📖 What These Metrics Mean:")
print("-" * 55)
print("  SILHOUETTE SCORE (-1 to +1):")
print("    • How well-separated are the clusters?")
print("    • +1 = perfect (tight clusters, far apart)")
print("    •  0 = overlapping clusters")
print("    • -1 = samples in wrong clusters")
print()
print("  ADJUSTED RAND INDEX (-0.5 to 1):")
print("    • How well do clusters match TRUE malware families?")
print("    • 1.0 = perfect match to known families")
print("    • 0.0 = random grouping (no correlation)")
print("    • < 0 = agreement worse than chance")
print()

print("📈 Results Comparison:")
print("-" * 55)
print(f"                      K-Means     DBSCAN")
print(f"  Silhouette Score:   {kmeans_silhouette:7.3f}     {dbscan_silhouette:7.3f}")
print(f"  Adjusted Rand:      {kmeans_ari:7.3f}     {dbscan_ari:7.3f}")
print()

# Interpret results
print("💡 Interpretation:")
if kmeans_ari > 0.5:
    print(f"   ✅ K-Means ARI={kmeans_ari:.2f}: Good match to actual families!")
    print("      Clustering captures real malware behavioral groups")
elif kmeans_ari > 0.3:
    print(f"   ⚠️ K-Means ARI={kmeans_ari:.2f}: Moderate match")
    print("      Some families cluster together, others mixed")
else:
    print(f"   ❌ K-Means ARI={kmeans_ari:.2f}: Weak match to families")
    print("      Features may not distinguish family-level differences")

if kmeans_silhouette > dbscan_silhouette:
    print(f"   • K-Means produced tighter clusters (higher silhouette)")
else:
    print(f"   • DBSCAN produced tighter clusters (higher silhouette)")

In [None]:
# Interactive clustering comparison with Plotly
# This shows how well our unsupervised clustering matches known malware families
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=[
        "🏷️ True Families (Ground Truth)",
        "🎯 K-Means (k=5)",
        "🔬 DBSCAN (Density-based)"
    ],
    horizontal_spacing=0.08
)

# True labels scatter
for family in df["family"].unique():
    mask = df["family"] == family
    fig.add_trace(
        go.Scatter(
            x=df.loc[mask, "tsne_1"],
            y=df.loc[mask, "tsne_2"],
            mode="markers",
            name=family,
            marker=dict(size=7, opacity=0.7),
            hovertemplate=f"<b>{family}</b><br>t-SNE 1: %{{x:.2f}}<br>t-SNE 2: %{{y:.2f}}<extra></extra>",
        ),
        row=1, col=1
    )

# K-Means clusters
fig.add_trace(
    go.Scatter(
        x=df["tsne_1"],
        y=df["tsne_2"],
        mode="markers",
        marker=dict(
            size=7,
            color=df["kmeans_cluster"],
            colorscale="Viridis",
            opacity=0.7,
            showscale=True,
            colorbar=dict(title="Cluster", x=0.65, len=0.8),
        ),
        showlegend=False,
        hovertemplate="Cluster: %{marker.color}<br>t-SNE 1: %{x:.2f}<br>t-SNE 2: %{y:.2f}<extra></extra>",
    ),
    row=1, col=2
)

# DBSCAN clusters
fig.add_trace(
    go.Scatter(
        x=df["tsne_1"],
        y=df["tsne_2"],
        mode="markers",
        marker=dict(
            size=7,
            color=df["dbscan_cluster"],
            colorscale="Viridis",
            opacity=0.7,
            showscale=True,
            colorbar=dict(title="Cluster", x=1.02, len=0.8),
        ),
        showlegend=False,
        hovertemplate="Cluster: %{marker.color}<br>t-SNE 1: %{x:.2f}<br>t-SNE 2: %{y:.2f}<extra></extra>",
    ),
    row=1, col=3
)

fig.update_layout(
    height=500,
    width=1200,
    template=PLOTLY_TEMPLATE,
    title=dict(
        text="🔍 Clustering Comparison: How Well Did We Group Malware?<br><sup>Left = known families | Middle/Right = our unsupervised clustering results</sup>",
        font=dict(size=16),
    ),
    legend=dict(orientation="h", yanchor="bottom", y=-0.25, x=0.1),
)
fig.show()

# Interpretation guide
print("\n📖 How to Read This Comparison:")
print("=" * 55)
print("  • Each DOT is a malware sample")
print("  • Position (x,y) comes from t-SNE dimensionality reduction")
print("  • Samples close together have similar behaviors")
print()
print("  LEFT (True Families):")
print("    Shows actual malware family labels (our 'answer key')")
print()
print("  MIDDLE (K-Means):")
print("    Colors = clusters found by K-Means algorithm")
print("    Goal: colors should match the LEFT panel")
print()
print("  RIGHT (DBSCAN):")
print("    Colors = clusters found by DBSCAN")
print("    Darkest points (label -1) = 'noise' (outliers that don't fit any cluster)")
print()
print("💡 What to Look For:")
print("   ✅ Good: Cluster colors align with true family groupings")
print("   ⚠️ Mixed: Clusters contain multiple families (need more features)")
print("   ❌ Split: Same family split across clusters (over-clustering)")

## 7. Cluster Analysis
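
Cluster purity can also be read off a single cross-tabulation. The sketch below uses a hypothetical nine-row DataFrame with the same `family` and `kmeans_cluster` column names as the lab's `df`; the rows themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the lab's DataFrame (invented rows)
toy = pd.DataFrame({
    "family": ["Emotet", "Emotet", "Emotet",
               "LockBit", "LockBit", "Remcos",
               "Remcos", "Remcos", "Emotet"],
    "kmeans_cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Rows = clusters, columns = families, cells = sample counts
composition = pd.crosstab(toy["kmeans_cluster"], toy["family"])

# Purity = share of each cluster taken by its most common family
purity = composition.max(axis=1) / composition.sum(axis=1)
dominant = composition.idxmax(axis=1)

summary = pd.DataFrame({"dominant_family": dominant, "purity": purity.round(2)})
print(composition)
print(summary)
```

The same two lines (`max`/`sum` and `idxmax`) apply unchanged to the full crosstab of `df["kmeans_cluster"]` against `df["family"]`.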

In [None]:
# Analyze cluster composition - what malware families ended up together?
print("🔬 Cluster Composition Analysis (K-Means)")
print("=" * 60)
print("\nThis shows which malware families were grouped together.")
print("Similar families in same cluster = good feature selection!")
print()

for cluster_id in sorted(df["kmeans_cluster"].unique()):
    cluster_data = df[df["kmeans_cluster"] == cluster_id]
    dominant_family = cluster_data["family"].mode()[0]
    dominant_pct = (cluster_data["family"] == dominant_family).sum() / len(cluster_data) * 100

    # Determine cluster "purity"
    if dominant_pct > 80:
        purity = "🟢 High purity"
    elif dominant_pct > 50:
        purity = "🟡 Mixed"
    else:
        purity = "🔴 Low purity"

    print(f"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"📦 CLUSTER {cluster_id} | {len(cluster_data)} samples | {purity}")
    print(f"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")

    # Family breakdown
    family_counts = cluster_data["family"].value_counts()
    print("   Family Breakdown:")
    for family, count in family_counts.head(5).items():
        pct = count / len(cluster_data) * 100
        bar = "█" * int(pct / 5)
        print(f"      {family:15s}: {count:3d} ({pct:5.1f}%) {bar}")

    # Behavioral characteristics
    print(f"\n   📊 Behavioral Profile:")
    print(f"      Avg Entropy:     {cluster_data['entropy'].mean():.2f} ", end="")
    if cluster_data['entropy'].mean() > 7.2:
        print("(high - likely packed/encrypted)")
    elif cluster_data['entropy'].mean() > 6.0:
        print("(moderate)")
    else:
        print("(low - likely unpacked, plain code/data)")

    print(f"      Avg Imports:     {cluster_data['num_imports'].mean():.0f} ", end="")
    if cluster_data['num_imports'].mean() > 250:
        print("(feature-rich)")
    elif cluster_data['num_imports'].mean() > 100:
        print("(moderate API usage)")
    else:
        print("(minimal - may use dynamic loading)")

    print(f"      Avg File Size:   {cluster_data['file_size'].mean()/1024:.0f} KB")
    print()

print("💡 Security Insight:")
print("   Clusters with similar behavioral profiles suggest:")
print("   • Shared malware toolkit or framework")
print("   • Code reuse between threat actors")
print("   • Related malware families (same campaign)")

## Summary

In this lab, we:
- Extracted features from malware samples (entropy, imports, sections)
- Applied dimensionality reduction (PCA, t-SNE) for visualization
- Clustered samples using K-Means and DBSCAN
- Evaluated clustering quality with silhouette score and ARI

### Key Insights:
- **High entropy** often indicates packed/encrypted malware
- **Import patterns** can distinguish malware families
- **t-SNE** often gives clearer visual separation than PCA, but distances between t-SNE clusters should not be read as real distances
- **DBSCAN** can identify outliers (noise points)

### Next Steps:
1. Add more features (strings, API calls, PE headers)
2. Try hierarchical clustering for dendrogram visualization
3. Build a classification model using cluster labels
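
As a starting point for step 2, here is a minimal sketch of hierarchical clustering with SciPy. It assumes synthetic `make_blobs` data in place of the lab's `X_scaled`, Ward linkage, and a cut into at most three clusters; all of these are illustrative choices rather than recommendations for the real dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix
X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance; Z has one row per merge
Z = linkage(X_demo, method="ward")

# Cut the tree into at most 3 flat clusters (labels start at 1)
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])

# To draw the tree, pass Z to scipy.cluster.hierarchy.dendrogram
# inside a matplotlib figure.
```

Unlike K-Means, the tree is built once and can then be cut at different heights to explore coarser or finer family groupings.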