# Graph Type and Purpose

This notebook builds a **heterogeneous directed multigraph** with NetworkX's `MultiDiGraph` to model interactions observed on a network. The design suits graph-based security tasks such as:

- **Graph-based threat detection**
- **Anomaly identification in multi-modal behaviors**
- **Learning embeddings for heterogeneous entities**

## Key Characteristics

- **Heterogeneous nodes**  
  Represent diverse entities: IP addresses, domain names, HTTP URIs, SSL certificate subjects/issuers, protocol violation types, etc.

- **Multi-view relationships**  
  Multiple directed edge types between the same pair of nodes allow different interaction views (e.g., flows, DNS queries, HTTP requests).

- **Directed edges**  
  Encode the **direction of each interaction** (e.g., `src_ip ➝ dst_ip`, `IP ➝ domain`), reflecting who initiated what.
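
As a minimal sketch of these characteristics, the snippet below builds a tiny `MultiDiGraph` by hand. It assumes the edge key carries the view name (which matches the `k == "flow"` filter used later in the notebook); the `ntype` node attribute and the addresses are illustrative only.

```python
import networkx as nx

G = nx.MultiDiGraph()

# "ntype" is a hypothetical attribute used here to tag the entity type.
G.add_node("192.168.1.37", ntype="ip")
G.add_node("192.168.1.1", ntype="ip")
G.add_node("www.example.com", ntype="domain")

# Two views leaving the same source node, distinguished by edge key.
G.add_edge("192.168.1.37", "192.168.1.1", key="flow", proto="tcp")
G.add_edge("192.168.1.37", "www.example.com", key="dns_query", qtype=1)

# Re-adding an edge with the same (u, v, key) updates its attributes
# instead of creating a parallel edge.
G.add_edge("192.168.1.37", "192.168.1.1", key="flow", proto="udp")

views = sorted(k for _, _, k in G.edges(keys=True))
```

One consequence of keying edges by view name is that a node pair holds at most one edge per view, so repeated interactions overwrite earlier attributes unless they are aggregated first.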

# Node Types (Entities)

Each node represents a real-world entity, extracted from one or more dataset columns:

| Node Type         | Source Column(s)    | Description                                                                 |
|-------------------|---------------------|-----------------------------------------------------------------------------|
| **IP Address**     | `src_ip`, `dst_ip`  | Devices or interfaces on the network (e.g., `192.168.1.37`).                |
| **Domain Name**    | `dns_query`         | Fully qualified domain names queried by IPs (e.g., `www.example.com`).      |
| **HTTP URI**       | `http_uri`          | HTTP resource paths (e.g., `/login`, `/index.html`).                        |
| **SSL Subject**    | `ssl_subject`       | Distinguished Name of the certificate subject (e.g., `/C=US/O=Let's Encrypt`). |
| **SSL Issuer**     | `ssl_issuer`        | Distinguished Name of the certificate issuer (e.g., `/C=US/O=Google Trust Services`). |
| **Protocol Violation** | `weird_name`     | Descriptive label of detected anomalies (e.g., `bad_TCP_checksum`).         |
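
Because every node identifier is a plain string, selecting nodes of one type needs an explicit check. The sketch below uses the standard-library `ipaddress` module to recognise IP nodes; the helper name `is_ip_node` is hypothetical and not part of the project utilities.

```python
import ipaddress


def is_ip_node(node):
    """Return True if `node` is a string that parses as an IPv4 or IPv6 address."""
    if not isinstance(node, str):
        return False
    try:
        ipaddress.ip_address(node)
        return True
    except ValueError:
        return False


candidates = ["192.168.1.37", "www.example.com", "/index.html", "fe80::1"]
ip_only = [n for n in candidates if is_ip_node(n)]
```

A check like this is stricter than testing for a `.` in the string, which would also accept domain names and URI paths.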

---

# Edge Types (Views)

Each directed edge represents an interaction or behavioral relationship, often enriched with protocol metadata:

## 1. `flow` — (IP ➝ IP)

Represents a network flow between two IP addresses.

- **Source:** `src_ip`  
- **Target:** `dst_ip`  
- **Attributes:**
  - `proto`, `service`, `duration`, `conn_state`
  - `src_bytes`, `dst_bytes`
  - `label`, `attack_type`

**Usefulness:**  
Defines the **structural backbone** of the graph, enabling analysis of traffic patterns and attack topologies.

## 2. `dns_query` — (IP ➝ Domain Name)

Represents a DNS lookup initiated by a host.

- **Source:** `src_ip`  
- **Target:** `dns_query`  
- **Attributes:**
  - `qclass`, `qtype`, `rcode`
  - `dns_AA`, `dns_RD`, `dns_RA`, `dns_rejected`

**Usefulness:**  
Reveals **host intent** and can indicate access to suspicious or malicious domains.

## 3. `http_request` — (IP ➝ HTTP URI)

Captures web resource requests made by a host.

- **Source:** `src_ip`  
- **Target:** `http_uri`  
- **Attributes:**
  - `method`, `version`, `status_code`
  - `trans_depth`, `req_body_len`, `resp_body_len`
  - `user_agent`, `orig_mime`, `resp_mime`

**Usefulness:**  
Reflects **web behavior**; useful for detecting scanning, reconnaissance, and probing activity.

## 4. `protocol_violation` — (IP ➝ Violation Label)

Links an IP to a protocol anomaly observed during communication.

- **Source:** `src_ip`  
- **Target:** `weird_name`  
- **Attributes:**
  - `weird_addl`, `weird_notice`

**Usefulness:**  
Highlights **anomalous or misconfigured hosts**. Such events can be early indicators of compromise or malicious activity.
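
The four views above can be sketched as a table-driven loader. This is an illustration under stated assumptions, not the project's `graph_creator_b_c` implementation: the helper `add_row_edges`, the `VIEWS` table, the abbreviated attribute lists, and the `MISSING` placeholder set (how absent values are encoded) are all hypothetical.

```python
import networkx as nx

# (view name, target column, a few attribute columns) from the descriptions above.
VIEWS = [
    ("flow", "dst_ip", ["proto", "service", "duration", "conn_state"]),
    ("dns_query", "dns_query", ["qclass", "qtype", "rcode"]),
    ("http_request", "http_uri", ["method", "version", "status_code"]),
    ("protocol_violation", "weird_name", ["weird_addl", "weird_notice"]),
]

# Assumption: values that mean "this protocol was not seen in the record".
MISSING = {None, "", "-"}


def add_row_edges(G, row):
    """Add one edge per view whose target column is populated in `row`."""
    src = row["src_ip"]
    for view, target_col, attr_cols in VIEWS:
        target = row.get(target_col)
        if target in MISSING:
            continue
        attrs = {c: row[c] for c in attr_cols if c in row}
        G.add_edge(src, target, key=view, **attrs)


G = nx.MultiDiGraph()
add_row_edges(G, {
    "src_ip": "192.168.1.37", "dst_ip": "192.168.1.1", "proto": "udp",
    "dns_query": "www.example.com", "qtype": 1,
    "http_uri": "-", "weird_name": "-",
})
```

In this example the record produces a `flow` edge and a `dns_query` edge; the HTTP and violation views are skipped because their target columns hold the placeholder value.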

# Semantic Graph Properties

- **IP nodes are central:**  
  Most interaction types originate from or are directed to IP addresses, making them critical in graph topology.

- **Multi-modal behavioral modeling:**  
  Combines HTTP, DNS, SSL, and flow-level information into one unified representation.

- **Multi-view learning ready:**  
  The graph supports training models on **protocol-specific subgraphs or jointly across views**.

- **Temporal/causal interpretation:**  
  Directed edges preserve **who initiated the interaction**, enabling traceability and behavioral profiling.
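
Protocol-specific subgraphs can be pulled out of the multigraph by filtering on the edge key. The sketch below assumes, as elsewhere in this notebook, that the key of each edge is its view name; `view_subgraph` is a hypothetical helper.

```python
import networkx as nx


def view_subgraph(G, view):
    """Return a read-only subgraph containing only edges whose key equals `view`."""
    edges = [(u, v, k) for u, v, k in G.edges(keys=True) if k == view]
    return G.edge_subgraph(edges)


G = nx.MultiDiGraph()
G.add_edge("10.0.0.1", "10.0.0.2", key="flow")
G.add_edge("10.0.0.1", "www.example.com", key="dns_query")
G.add_edge("10.0.0.2", "/login", key="http_request")

dns_view = view_subgraph(G, "dns_query")
```

`edge_subgraph` returns a view that shares data with `G`; calling `.copy()` on the result gives an independent graph if it needs to be modified.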

## Data Preparation and GraphSAGE Evaluation

In [1]:
import sys
import os
import torch
from torch_geometric.data import Data
from sklearn.preprocessing import LabelEncoder, StandardScaler

parent_dir = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
utils_path = os.path.join(parent_dir, "project_utils")
sys.path.append(utils_path)

from project_utils import graph_creator_b_c

In [2]:
G, df = graph_creator_b_c.create_graph_from_file('../datasets/train_test_network.csv')

Graph built with 1605 nodes and 2554 edges.
Edge types (views) include: {'flow', 'ssl_subject', 'dns_query', 'protocol_violation', 'ssl_issuer', 'http_request'}


In [3]:
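# Convert the IP-to-IP part of G (a MultiDiGraph) to PyG format.
# Caveat: the '.' test is a heuristic; it also matches domain names and URIs
# such as "www.example.com" or "/index.html", so stricter filtering may be needed.
ip_nodes = [n for n in G.nodes if isinstance(n, str) and '.' in n]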

node_to_idx = {node: i for i, node in enumerate(ip_nodes)}
edge_index = []

features = []
labels = []

for node in ip_nodes:
    out_deg = len([1 for _, _, k in G.out_edges(node, keys=True) if k == "flow"])
    in_deg = len([1 for _, _, k in G.in_edges(node, keys=True) if k == "flow"])

    features.append([in_deg, out_deg])

    label = "Normal"
    for _, _, k, d in G.out_edges(node, keys=True, data=True):
        if k == "flow" and d.get("label"):
            label = "Attack" if str(d["label"]).lower() != "normal" else "Normal"
            break
    labels.append(label)

# Encode features and labels
X = StandardScaler().fit_transform(features)
y = LabelEncoder().fit_transform(labels)
X = torch.tensor(X, dtype=torch.float)
y = torch.tensor(y, dtype=torch.long)

# Build edge index
for u, v in G.edges():
    if u in node_to_idx and v in node_to_idx:
        edge_index.append([node_to_idx[u], node_to_idx[v]])

edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()

# Define PyG Data
data = Data(x=X, edge_index=edge_index, y=y)

# Split train/test
torch.manual_seed(42)
num_nodes = data.num_nodes
perm = torch.randperm(num_nodes)
train_idx = perm[:int(0.8 * num_nodes)]
test_idx = perm[int(0.8 * num_nodes):]

data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.train_mask[train_idx] = True
data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.test_mask[test_idx] = True



In [6]:
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return x


def train(model, data, train_mask, optimizer, criterion):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = criterion(out[train_mask], data.y[train_mask])
    loss.backward()
    optimizer.step()


from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def test(model, data, test_mask):
    model.eval()
    with torch.no_grad():
        logits = model(data.x, data.edge_index)
        preds = logits[test_mask].argmax(dim=1).cpu()
        labels = data.y[test_mask].cpu()

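    # precision/recall/F1 use sklearn's default pos_label=1. LabelEncoder sorts
    # class names, so with both classes present 1 is "Normal" and these scores
    # describe the Normal class.
    return (
        accuracy_score(labels, preds),
        precision_score(labels, preds, zero_division=0),
        recall_score(labels, preds, zero_division=0),
        f1_score(labels, preds, zero_division=0),
    )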


from sklearn.model_selection import train_test_split, StratifiedKFold


def run_holdout(data, test_sizes=None):
    if test_sizes is None:
        test_sizes = [0.1, 0.3, 0.5]
    results = []
    X = data.x.cpu().numpy()
    y = data.y.cpu().numpy()

    for test_size in test_sizes:
        train_idx, test_idx = train_test_split(
            range(len(y)), test_size=test_size, stratify=y, random_state=42
        )
        train_mask = torch.zeros(len(y), dtype=torch.bool)
        test_mask = torch.zeros(len(y), dtype=torch.bool)
        train_mask[train_idx] = True
        test_mask[test_idx] = True

        model = GraphSAGE(data.num_node_features, 32, int(data.y.max().item()) + 1).to(data.x.device)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
        criterion = torch.nn.CrossEntropyLoss()

        for epoch in range(100):
            train(model, data, train_mask, optimizer, criterion)

        acc, prec, rec, f1 = test(model, data, test_mask)
        label = f"{int((1 - test_size) * 100)}/{int(test_size * 100)}"
        results.append((label, acc, prec, rec, f1))
    return results


def run_cv(data, splits=None):
    if splits is None:
        splits = [5, 10]
    results = []
    X = data.x.cpu().numpy()
    y = data.y.cpu().numpy()

    for k in splits:
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
        accs, precs, recs, f1s = [], [], [], []

        for train_idx, test_idx in skf.split(X, y):
            train_mask = torch.zeros(len(y), dtype=torch.bool)
            test_mask = torch.zeros(len(y), dtype=torch.bool)
            train_mask[train_idx] = True
            test_mask[test_idx] = True

            model = GraphSAGE(data.num_node_features, 32, int(data.y.max().item()) + 1).to(data.x.device)
            optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
            criterion = torch.nn.CrossEntropyLoss()

            for epoch in range(100):
                train(model, data, train_mask, optimizer, criterion)

            acc, prec, rec, f1 = test(model, data, test_mask)
            accs.append(acc)
            precs.append(prec)
            recs.append(rec)
            f1s.append(f1)

        results.append((str(k), sum(accs) / k, sum(precs) / k, sum(recs) / k, sum(f1s) / k))
    return results


# Run evaluations
holdout_results = run_holdout(data)
cv_results = run_cv(data)

print("Split/CV,Accuracy,precision,recall,f1-score")
for r in holdout_results + cv_results:
    print(f"{r[0]},{r[1]:.4f},{r[2]:.4f},{r[3]:.4f},{r[4]:.4f}")

Split/CV,Accuracy,precision,recall,f1-score
90/10,0.9925,0.9924,1.0000,0.9962
70/30,0.9874,0.9874,1.0000,0.9936
50/50,0.9909,0.9909,1.0000,0.9954
5,0.9894,0.9909,0.9985,0.9947
10,0.9894,0.9909,0.9985,0.9947
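
The rows above show recall at or near 1.0 with precision almost equal to accuracy. That pattern typically appears when nearly every test node belongs to the positive class and the model predicts that class almost everywhere, in which case accuracy says little about the minority class. As a sketch, the hypothetical confusion-matrix counts below were chosen for illustration (the notebook does not report the actual counts) and happen to reproduce the 90/10 row.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts: 131 positives all recovered, 2 negatives of which 1 is missed.
acc, prec, rec, f1 = binary_metrics(tp=131, fp=1, tn=1, fn=0)
row = f"{acc:.4f},{prec:.4f},{rec:.4f},{f1:.4f}"
print(row)  # 0.9925,0.9924,1.0000,0.9962
```

With so few negatives, per-class scores for the minority class or a confusion matrix would be more informative than the aggregate numbers.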



## CV and Grid-Search for GCN

In [8]:
from itertools import product
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
import numpy as np


# GCN Model Definition
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x


# Train function (no class weighting needed since we balanced the folds)
def train(model, data, train_idx, test_idx, epochs=100, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    data.train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    data.train_mask[train_idx] = True
    data.test_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    data.test_mask[test_idx] = True

    for _ in range(epochs):
        model.train()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        logits = model(data.x, data.edge_index)
        pred = logits[data.test_mask].argmax(dim=1)
        true = data.y[data.test_mask]
        # LabelEncoder sorts class names, so code 0 is "Attack" and code 1 is "Normal".
        report = classification_report(true.cpu(), pred.cpu(), target_names=["Attack", "Normal"], output_dict=True,
                                       zero_division=0)
        return report['accuracy'], report['weighted avg']['f1-score']


# Create a balanced subset: keep every class-0 node and undersample class 1 to match.
# LabelEncoder sorts class names, so code 0 is "Attack" and code 1 is "Normal".
attack_idx = (y == 0).nonzero(as_tuple=True)[0]
normal_idx = (y == 1).nonzero(as_tuple=True)[0]
num_attacks = len(attack_idx)
undersampled_normal_idx = normal_idx[torch.randperm(len(normal_idx))[:num_attacks]]
balanced_idx = torch.cat([attack_idx, undersampled_normal_idx])
balanced_idx = balanced_idx[torch.randperm(len(balanced_idx))]

X_np = data.x[balanced_idx].cpu().numpy()  # just to satisfy StratifiedKFold
y_np = y[balanced_idx].cpu().numpy()

# Grid Search with Cross-Validation
hidden_sizes = [8, 16]
learning_rates = [0.02, 0.03, 0.05]
epochs_list = [50, 100, 200]
grid = list(product(hidden_sizes, learning_rates, epochs_list))

best_f1 = -1
best_params = None

for hidden, lr, epochs in grid:
    fold_f1s = []
    fold_acc = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for train_idx_np, test_idx_np in skf.split(X_np, y_np):
        train_idx = balanced_idx[train_idx_np]
        test_idx = balanced_idx[test_idx_np]

        torch.manual_seed(42)
        model = GCN(2, hidden, 2)
        acc, f1 = train(model, data, train_idx, test_idx, epochs=epochs, lr=lr)
        fold_f1s.append(f1)
        fold_acc.append(acc)

    avg_f1 = np.mean(fold_f1s)
    avg_acc = np.mean(fold_acc)
    print(f"GCN(hid={hidden}, lr={lr}, epochs={epochs}) → CV acc: {avg_acc:.4f}, F1: {avg_f1:.4f}")

    if avg_f1 > best_f1:
        best_f1 = avg_f1
        best_params = (hidden, lr, epochs)

print(
    f"\nBest GCN config: hidden={best_params[0]}, lr={best_params[1]}, epochs={best_params[2]} with CV F1={best_f1:.4f}")

GCN(hid=8, lr=0.02, epochs=50) → CV acc: 0.7393, F1: 0.7216
GCN(hid=8, lr=0.02, epochs=100) → CV acc: 0.7893, F1: 0.7810
GCN(hid=8, lr=0.02, epochs=200) → CV acc: 0.7393, F1: 0.7276
GCN(hid=8, lr=0.03, epochs=50) → CV acc: 0.7893, F1: 0.7810
GCN(hid=8, lr=0.03, epochs=100) → CV acc: 0.7893, F1: 0.7810
GCN(hid=8, lr=0.03, epochs=200) → CV acc: 0.7143, F1: 0.7022
GCN(hid=8, lr=0.05, epochs=50) → CV acc: 0.7893, F1: 0.7810
GCN(hid=8, lr=0.05, epochs=100) → CV acc: 0.7929, F1: 0.7794
GCN(hid=8, lr=0.05, epochs=200) → CV acc: 0.6893, F1: 0.6743
GCN(hid=16, lr=0.02, epochs=50) → CV acc: 0.7143, F1: 0.6949
GCN(hid=16, lr=0.02, epochs=100) → CV acc: 0.7893, F1: 0.7810
GCN(hid=16, lr=0.02, epochs=200) → CV acc: 0.7357, F1: 0.7184
GCN(hid=16, lr=0.03, epochs=50) → CV acc: 0.7643, F1: 0.7556
GCN(hid=16, lr=0.03, epochs=100) → CV acc: 0.7393, F1: 0.7276
GCN(hid=16, lr=0.03, epochs=200) → CV acc: 0.7393, F1: 0.7276
GCN(hid=16, lr=0.05, epochs=50) → CV acc: 0.7643, F1: 0.7470
GCN(hid=16, lr=0.05, ep

## Grid Search with CV on GraphSAGE

In [None]:
from itertools import product
from sklearn.metrics import classification_report
from torch_geometric.nn import SAGEConv
import torch
import torch.nn.functional as F
import numpy as np


# GraphSAGE Model Definition
class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x


# Compute class weights based on imbalance
def get_class_weights(y):
    counts = torch.bincount(y)
    weights = 1.0 / counts.float()
    weights = weights * (len(y) / weights.sum())  # rescale so the weights sum to len(y)
    return weights


# Training Function (same logic)
def train(model, data, epochs=100, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    class_weights = get_class_weights(data.y[data.train_mask]).to(data.x.device)

    for epoch in range(epochs):
        model.train()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask], weight=class_weights)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        logits = model(data.x, data.edge_index)
        pred = logits[data.test_mask].argmax(dim=1)
        true = data.y[data.test_mask]
        # LabelEncoder sorts class names, so code 0 is "Attack" and code 1 is "Normal".
        report = classification_report(true.cpu(), pred.cpu(), target_names=["Attack", "Normal"], output_dict=True,
                                       zero_division=0)
        return report['accuracy'], report['weighted avg']['f1-score']


# Grid Search Parameters
hidden_sizes = [8, 16, 32]
learning_rates = [0.01, 0.02, 0.03]
epochs_list = [100, 200, 150]

grid = list(product(hidden_sizes, learning_rates, epochs_list))

best_f1 = -1
best_params = None

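# Grid Search Execution for GraphSAGE
# Note: there is no cross-validation loop here; each configuration is scored once
# on whatever data.train_mask / data.test_mask are currently stored on `data`.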
for hidden, lr, epochs in grid:
    torch.manual_seed(42)
    model = GraphSAGE(2, hidden, 2)
    acc, f1 = train(model, data, epochs=epochs, lr=lr)
    print(f"GraphSAGE(hid={hidden}, lr={lr}, epochs={epochs}) → Acc: {acc:.4f}, F1: {f1:.4f}")
    if f1 > best_f1:
        best_f1 = f1
        best_params = (hidden, lr, epochs)

print(
    f"\nBest GraphSAGE config: hidden={best_params[0]}, lr={best_params[1]}, epochs={best_params[2]} with F1={best_f1:.4f}")
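
The inverse-frequency weighting in `get_class_weights` can be checked by hand. The pure-Python sketch below mirrors its arithmetic on made-up class counts (90 and 10 are illustrative, not taken from the dataset); the function name `inverse_frequency_weights` is hypothetical.

```python
import math


def inverse_frequency_weights(counts):
    """Mirror get_class_weights: 1/count per class, rescaled to sum to the sample total."""
    total = sum(counts)
    inverse = [1.0 / c for c in counts]
    scale = total / sum(inverse)
    return [w * scale for w in inverse]


# Illustrative counts: 90 samples of class 0 and 10 samples of class 1.
weights = inverse_frequency_weights([90, 10])
ratio = weights[1] / weights[0]
```

Here the rarer class ends up with nine times the weight of the common one. With the default mean reduction, `F.cross_entropy` divides by the summed weights of the targets, so only this ratio affects the loss and the overall rescaling is cosmetic.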