# GCL-NIDS — Graph Contrastive Learning for NIDS 

### Extension: Contrastive Pretraining
In addition to supervised GNN training, this notebook adds a **Graph Contrastive Learning (GCL)** branch:
1. Pretrain the encoder with a contrastive loss (InfoNCE) on augmented graphs.
2. Fine-tune a classifier on a small labeled set.
3. Compare against the supervised baseline.

## 1. Import & setup GPU

In [1]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.data import Data
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import NearestNeighbors

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)



Device: cuda


## 2. Load & Preprocess Data

### Dataset

**Datasets used in this notebook:**
- BoT-IoT (5% subset) — `./processed/bot/`
- CIC-IDS2018 (9 selected days) — `./processed/cic18/`
- UNSW — `./processed/unsw/`

**Notes for reproducibility**
1. Download data (link in README).
2. This notebook reads processed `.npy` files. If your dataset is too big, set `N_sample` or run on a subset.
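
If a subset is needed, a stratified draw keeps rare attack classes from disappearing. The sketch below is one way to do it, assuming the feature and label arrays are already in memory; `stratified_subsample` and its parameters are hypothetical and not part of this notebook.

```python
import numpy as np


def stratified_subsample(X, y, n_total, seed=42):
    """Draw about n_total rows while keeping each class's share of the data.

    Hypothetical helper: every class keeps at least one row.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    frac = min(1.0, n_total / len(y))
    keep = []
    for c, n_c in zip(classes, counts):
        n_keep = max(1, int(round(n_c * frac)))
        idx_c = np.flatnonzero(y == c)
        keep.append(rng.choice(idx_c, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# Toy arrays standing in for the processed .npy files
X_demo = np.arange(2000, dtype=float).reshape(1000, 2)
y_demo = np.array([0] * 900 + [1] * 90 + [2] * 10)
X_sub, y_sub = stratified_subsample(X_demo, y_demo, n_total=100)
print(X_sub.shape, np.bincount(y_sub))
```

Because of the one-row minimum, the result can be slightly larger than `n_total` when there are many tiny classes.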


In [2]:
# Load the UNSW data
X_train = np.load("./processed/unsw/X_train_unsw.npy")
X_test = np.load("./processed/unsw/X_test_unsw.npy")
y_train = np.load("./processed/unsw/y_train_unsw.npy")
y_test = np.load("./processed/unsw/y_test_unsw.npy")

print("Before merging classes:", np.unique(y_train, return_counts=True))

# Collapse to binary labels: 0 = normal, 1 = attack
y_train = (y_train != 0).astype(int)
y_test = (y_test != 0).astype(int)

print("After merging classes:", np.unique(y_train, return_counts=True))

# Standardize features (fit on train only, then apply to test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Before merging classes: (array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]), array([37000,  3496,   583,  4089, 11132,   677,  6062,    44,   378,
       18871]))
After merging classes: (array([0, 1]), array([37000, 45332]))


## 3. Graph Construction

Design choices and knobs (explained):
- **Node definition:** each node = a network flow (NetFlow record). Alternative: host-level nodes (endpoint-centric).
- **Edge definition:** KNN on feature vectors (approximate/FAISS recommended) or explicit IP→IP edges (flow source→dest). Current notebook: **KNN with k=5** to capture local similarity in feature space.
- **Sampling options:** `N_sample` (for speed), `balance` (stratified sampling to avoid class collapse).
- **Symmetrize edges:** we add both (i→j) and (j→i) for undirected message passing.
- **Performance tradeoffs:** building full KNN on millions of points is expensive — use `faiss-cpu` or sampling.
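
The symmetrization step can also be done without a Python loop, removing duplicate edges on the way. The log in this notebook reports 823320 edges for 82332 nodes, which is exactly 2 × k × N with k=5, so mutual neighbors appear to be stored twice. The sketch below is an illustration in NumPy; `knn_to_edge_index` is a hypothetical helper that expects the neighbor indices with the query point itself already removed.

```python
import numpy as np


def knn_to_edge_index(neighbors):
    """Turn an (N, k) array of neighbor indices into a symmetric edge list.

    neighbors[i] holds the k nearest neighbors of node i (self excluded).
    Returns a (2, E) integer array with both directions and no duplicates.
    """
    n, k = neighbors.shape
    src = np.repeat(np.arange(n), k)
    dst = neighbors.reshape(-1)
    keep = src != dst                      # guard against accidental self-loops
    src, dst = src[keep], dst[keep]
    both = np.concatenate([np.stack([src, dst]), np.stack([dst, src])], axis=1)
    return np.unique(both, axis=1)         # drops mutual-neighbor duplicates


# Toy example: brute-force 2-NN on five 1-D points
pts = np.array([[0.0], [1.0], [2.5], [10.0], [11.0]])
dist = np.abs(pts - pts.T)
np.fill_diagonal(dist, np.inf)             # exclude each point from its own list
neighbors = np.argsort(dist, axis=1)[:, :2]
edge_index = knn_to_edge_index(neighbors)
print(edge_index.shape)
```

On real data the `neighbors` array would come from the nearest-neighbor search (for example, the index array returned by `kneighbors` with the first column dropped), and the result can be wrapped with `torch.from_numpy` before building the `Data` object.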

**What to report in experiments:** nodes, edges, avg degree, build time, peak RAM.
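
A small wrapper can collect these numbers in one place. This is a sketch using only the standard library and NumPy; `graph_stats`, `timed_build`, and `ring_edges` are hypothetical names. The peak figure covers allocations visible to Python's `tracemalloc` and does not include GPU memory.

```python
import time
import tracemalloc

import numpy as np


def graph_stats(num_nodes, edge_index):
    """Basic size statistics for a (2, E) directed edge list."""
    num_edges = edge_index.shape[1]
    return {
        "nodes": num_nodes,
        "edges": num_edges,
        # With both directions stored, E / N equals the mean node degree
        "avg_degree": num_edges / num_nodes,
    }


def timed_build(build_fn, *args, **kwargs):
    """Run build_fn and return (result, seconds, peak MiB seen by tracemalloc)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = build_fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 2**20


def ring_edges(n):
    """Toy graph builder: an undirected ring stored as directed edge pairs."""
    src = np.arange(n)
    dst = (src + 1) % n
    return np.concatenate([np.stack([src, dst]), np.stack([dst, src])], axis=1)


edges, seconds, peak_mib = timed_build(ring_edges, 1000)
stats = graph_stats(1000, edges)
print(stats, f"build_time={seconds:.4f}s", f"peak={peak_mib:.2f} MiB")
```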


In [3]:
def build_graph(X, y=None, k=5, log_every=5000, N_sample=100000, random_state=42):
    """
    X: numpy array of shape (N, d)
    y: labels of shape (N,)
    k: number of nearest neighbors per node
    N_sample: maximum number of samples used to build the graph (default 100k)
    """
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from torch_geometric.data import Data
    import time

    start = time.time()
    N = X.shape[0]
    if N > N_sample:
        rng = np.random.RandomState(random_state)
        idx = rng.choice(N, size=N_sample, replace=False)
        X = X[idx]
        y = y[idx] if y is not None else None
        print(f"[build_graph] Sampled {N_sample}/{N} samples")
    else:
        print(f"[build_graph] Using full dataset: {N} samples")

    print(f"[build_graph] Fitting NearestNeighbors (n_neighbors={k+1})...")
    nbrs = NearestNeighbors(
        n_neighbors=k+1, algorithm='auto', n_jobs=-1).fit(X)
    _, neighbors = nbrs.kneighbors(X)

    edge_index = []
    for i in range(X.shape[0]):
        if i % log_every == 0 and i > 0:
            print(
                f"  processed {i}/{X.shape[0]} (elapsed {time.time()-start:.1f}s)")
        for j in neighbors[i][1:]:
            edge_index.append([i, j])
            edge_index.append([j, i])  # symmetrize

    edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()

    data = Data(
        x=torch.tensor(X, dtype=torch.float),
        edge_index=edge_index,
        y=torch.tensor(y, dtype=torch.long) if y is not None else None
    )
    print(
        f"[build_graph] Done! nodes={data.num_nodes}, edges={data.num_edges}, time={time.time()-start:.1f}s")
    return data


train_data = build_graph(X_train, y_train, k=5).to(device)
test_data = build_graph(X_test,  y_test,  k=5).to(device)

[build_graph] Using full dataset: 82332 samples
[build_graph] Fitting NearestNeighbors (n_neighbors=6)...
  processed 5000/82332 (elapsed 23.2s)
  processed 10000/82332 (elapsed 23.2s)
  processed 15000/82332 (elapsed 23.3s)
  processed 20000/82332 (elapsed 23.4s)
  processed 25000/82332 (elapsed 23.5s)
  processed 30000/82332 (elapsed 23.5s)
  processed 35000/82332 (elapsed 23.5s)
  processed 40000/82332 (elapsed 23.6s)
  processed 45000/82332 (elapsed 23.6s)
  processed 50000/82332 (elapsed 23.6s)
  processed 55000/82332 (elapsed 23.6s)
  processed 60000/82332 (elapsed 23.8s)
  processed 65000/82332 (elapsed 23.8s)
  processed 70000/82332 (elapsed 23.8s)
  processed 75000/82332 (elapsed 23.8s)
  processed 80000/82332 (elapsed 23.8s)
[build_graph] Done! nodes=82332, edges=823320, time=25.8s
[build_graph] Sampled 100000/175341 samples
[build_graph] Fitting NearestNeighbors (n_neighbors=6)...
  processed 5000/100000 (elapsed 33.2s)
  processed 10000/100000 (elapsed 33.2s)
  processed 15

## Graph Contrastive Learning (GCL) — Pretraining Stage

Key idea:
- Instead of relying on labels alone, the encoder learns embeddings by telling apart **positive** pairs (two augmentations of the same node/graph) from **negative** pairs (augmentations of different nodes).
- Common augmentations:
  - Node drop (randomly remove some nodes).
  - Edge drop (randomly remove edges).
  - Feature masking (zero out some attributes).
- Loss: InfoNCE, which maximizes the similarity of positive pairs and minimizes the similarity to negatives.
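
For reference, one common way to write this loss for two views of N nodes is shown below. It assumes the projected embeddings are L2-normalized and stacked as $z_1, \dots, z_{2N}$ (view 1 first, then view 2), with temperature $\tau$; this matches the symmetric form used by `info_nce_loss` in this notebook, whose default is $\tau = 0.5$.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\frac{1}{2N} \sum_{i=1}^{2N}
    \log \frac{\exp\!\left(z_i^{\top} z_{p(i)} / \tau\right)}
              {\sum_{k=1,\, k \neq i}^{2N} \exp\!\left(z_i^{\top} z_k / \tau\right)},
\qquad
p(i) = \begin{cases} i + N, & i \le N \\ i - N, & i > N \end{cases}
```

Here $p(i)$ is the index of the other view of the same node, so row $i$ of both views must refer to the same node for the positives to be meaningful.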

GCL pipeline: Graph → Augment (view1, view2) → Encoder → Projection head → Contrastive loss
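
Node drop is listed above but not implemented in the code cells of this notebook. The sketch below shows one possible version on plain NumPy arrays; `drop_nodes` is a hypothetical helper, and the same indexing pattern carries over to torch tensors.

```python
import numpy as np


def drop_nodes(x, edge_index, drop_prob=0.2, seed=None):
    """Remove a random subset of nodes and every edge touching them.

    x: (N, d) features, edge_index: (2, E) integer array.
    Returns (x_kept, relabelled_edge_index, kept_ids).
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    keep = rng.random(n) >= drop_prob
    if not keep.any():                       # never return an empty graph
        keep[rng.integers(n)] = True
    kept_ids = np.flatnonzero(keep)
    # Map old node ids to new consecutive ids; dropped nodes get -1
    new_id = np.full(n, -1, dtype=np.int64)
    new_id[kept_ids] = np.arange(kept_ids.size)
    # Keep an edge only if both endpoints survived
    edge_keep = keep[edge_index[0]] & keep[edge_index[1]]
    return x[kept_ids], new_id[edge_index[:, edge_keep]], kept_ids


# Toy path graph 0-1-2-3-4-5 with both edge directions stored
x_demo = np.arange(12, dtype=float).reshape(6, 2)
ei_demo = np.array([[0, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                    [1, 0, 2, 1, 3, 2, 4, 3, 5, 4]])
x_aug, ei_aug, kept = drop_nodes(x_demo, ei_demo, drop_prob=0.3, seed=0)
print(x_aug.shape, ei_aug.shape, kept)
```

Unlike edge drop and feature masking, node drop changes which rows exist in each view. For a node-level contrastive loss, the two views would need to be aligned on the nodes kept in both (using `kept_ids`) before pairing positives.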

In [4]:
def drop_edges(data, drop_prob=0.2):
    """Randomly drop edges with probability drop_prob."""
    edge_index = data.edge_index
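    # Build the mask on the same device as the edges (the graph may live on the GPU)
    mask = torch.rand(edge_index.size(1), device=edge_index.device) > drop_prob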
    edge_index = edge_index[:, mask]
    return Data(x=data.x, edge_index=edge_index, y=data.y)


def mask_features(data, mask_prob=0.2):
    """Randomly mask node features."""
    x = data.x.clone()
    mask = torch.rand_like(x) < mask_prob
    x[mask] = 0
    return Data(x=x, edge_index=data.edge_index, y=data.y)


def graph_augment(data):
    """Apply a sequence of augmentations (can be randomized)."""
    data_aug = drop_edges(data, drop_prob=0.2)
    data_aug = mask_features(data_aug, mask_prob=0.2)
    return data_aug

## 4. Model: Supervised baseline (GraphSAGE)

Architecture:
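- Encoder: 3-layer GraphSAGE (two hidden `SAGEConv` layers with BatchNorm, ReLU, and dropout, followed by an output `SAGEConv` layer)
- Classifier: a single linear layer on top of the encoder embeddings (embedding → logits); a 2-layer MLP variant (embedding → hidden → logits) is kept commented out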
- Loss: CrossEntropy (or BCE if binary)
- Training knobs: lr=1e-3, optimizer=Adam, epochs=..., weight decay, class weights if imbalanced
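
To make the class-weight knob concrete, the sketch below works through the "balanced" weighting, `n_samples / (n_classes * count_per_class)`, on the binary label counts printed earlier in this notebook (37000 normal, 45332 attack). It also shows how such weights enter a weighted cross-entropy. `weighted_cross_entropy` is a hypothetical NumPy helper written for illustration; it normalizes by the sum of the target weights, as PyTorch's `F.cross_entropy(..., weight=...)` does with its default mean reduction.

```python
import numpy as np

# Binary label counts from the training split: [normal, attack]
counts = np.array([37000, 45332])
weights = counts.sum() / (len(counts) * counts)
print(weights)  # roughly [1.113, 0.908]: the smaller class gets the larger weight


def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted cross-entropy, normalized by the sum of target weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    w = class_weights[targets]
    return (w * nll).sum() / w.sum()


# With uninformative logits the loss equals log(2), whatever the weights are
demo_loss = weighted_cross_entropy(np.zeros((4, 2)), np.array([0, 1, 1, 0]), weights)
print(demo_loss)
```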

In [5]:
class GraphSAGE(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.5):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.bn1 = torch.nn.BatchNorm1d(hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.bn2 = torch.nn.BatchNorm1d(hidden_dim)
        self.conv3 = SAGEConv(hidden_dim, out_dim)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = self.bn1(x)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout, training=self.training)

        x = self.conv2(x, edge_index)
        x = self.bn2(x)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout, training=self.training)

        x = self.conv3(x, edge_index)
        return x


encoder = GraphSAGE(
    in_dim=train_data.num_node_features,
    hidden_dim=128,
    out_dim=128
).to(device)

In [6]:
# class MLP(torch.nn.Module):
#     def __init__(self, in_dim, hidden_dim, num_classes):
#         super().__init__()
#         self.net = torch.nn.Sequential(
#             torch.nn.Linear(in_dim, hidden_dim),
#             torch.nn.ReLU(),
#             torch.nn.Dropout(0.5),
#             torch.nn.Linear(hidden_dim, num_classes)
#         )

#     def forward(self, x):
#         return self.net(x)


# classifier = MLP(in_dim=128, hidden_dim=128, num_classes=2).to(device)

In [7]:
class ProjectionHead(torch.nn.Module):
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.fc1 = torch.nn.Linear(in_dim, out_dim)
        self.fc2 = torch.nn.Linear(out_dim, out_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
    

projection = ProjectionHead(in_dim=128, out_dim=64).to(device)

In [8]:
def info_nce_loss(z1, z2, temperature=0.5):
    # normalize
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    N = z1.size(0)
    representations = torch.cat([z1, z2], dim=0)  # 2N x d
    sim_matrix = torch.mm(representations, representations.t())  # 2N x 2N

    # mask out self-contrast
    mask = torch.eye(2*N, dtype=torch.bool, device=z1.device)
    sim_matrix = sim_matrix / temperature
    sim_matrix = sim_matrix.masked_fill(mask, -9e15)

    # positives: i-th sample in z1 with i-th in z2
    positives = torch.cat(
        [torch.arange(N, 2*N), torch.arange(0, N)]).to(z1.device)
    logits = sim_matrix
    labels = positives

    loss = F.cross_entropy(logits, labels)
    return loss

## Training & Evaluation

- Train loop outputs per-epoch: loss, training time/epoch.
- Evaluation metrics:
  - Detection metrics: Precision, Recall, F1 (per-class), AUC.
  - Operational metrics: average inference latency per sample (ms), throughput (flows/sec), GPU/CPU memory usage.
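
The per-class detection metrics can be derived directly from a confusion matrix. The sketch below assumes rows are true labels and columns are predictions (the layout `sklearn.metrics.confusion_matrix` returns); `per_class_prf` is a hypothetical helper and the counts are made up for illustration, not results from this notebook.

```python
import numpy as np


def per_class_prf(cm):
    """Per-class precision, recall, and F1 from a square confusion matrix."""
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0)            # column sums: predicted as class c
    true_total = cm.sum(axis=1)            # row sums: actually class c
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Made-up counts: rows = true (normal, attack), columns = predicted
cm_demo = np.array([[90, 10],
                    [5, 95]])
precision, recall, f1 = per_class_prf(cm_demo)
macro_f1 = f1.mean()
print(precision, recall, f1, macro_f1)
```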

**Logging recommendations (in code):**
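- print the time per epoch,
- time inference per batch (for example with `timeit` or `time.perf_counter()`; on GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously) and report the mean ms/sample,
- call `torch.cuda.empty_cache()` after heavy operations to reduce fragmentation (when debugging).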
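
A latency measurement along these lines could look like the sketch below. It uses only the standard library; `measure_latency` and its parameters are hypothetical, and the `sync_fn` hook is where something like `torch.cuda.synchronize` would be passed when the model runs on a GPU.

```python
import time


def measure_latency(predict_fn, batch, n_items, warmup=2, repeats=10, sync_fn=None):
    """Return (mean ms per item, items per second) for predict_fn(batch).

    sync_fn is an optional callable that blocks until pending device work
    has finished, so that asynchronous kernels are included in the timing.
    """
    for _ in range(warmup):                # warm-up runs are not timed
        predict_fn(batch)
    if sync_fn is not None:
        sync_fn()
    t0 = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    if sync_fn is not None:
        sync_fn()
    total = max(time.perf_counter() - t0, 1e-12)   # avoid division by zero
    sec_per_item = total / (repeats * n_items)
    return sec_per_item * 1e3, 1.0 / sec_per_item


# Toy stand-in for a model: doubles every value in the batch
demo_batch = list(range(1000))
ms_per_item, items_per_sec = measure_latency(
    lambda b: [v * 2 for v in b], demo_batch, n_items=len(demo_batch))
print(f"{ms_per_item:.6f} ms/item, {items_per_sec:.0f} items/s")
```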


In [9]:
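from torch_geometric.utils import subgraph


def sample_nodes(data, num_samples=5000):
    # Sort the sampled ids so that x[idx] lines up with the relabelled edge ids
    idx = torch.randperm(data.num_nodes, device=data.x.device)[:num_samples]
    idx = idx.sort().values
    # Keep only edges whose endpoints were both sampled, renumbered to 0..len(idx)-1
    edge_index, _ = subgraph(
        idx, data.edge_index, relabel_nodes=True, num_nodes=data.num_nodes)
    return Data(
        x=data.x[idx],
        edge_index=edge_index,
        y=data.y[idx] if data.y is not None else None
    )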

In [None]:
def pretrain_contrastive(data, encoder, projection, epochs=10, lr=1e-3):
    encoder.train()
    projection.train()
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(projection.parameters()), lr=lr
    )
    for epoch in range(epochs):
        # Sample the nodes once so that row i of both views is the same node
        batch = sample_nodes(data, num_samples=5000)
        data1 = graph_augment(batch)
        data2 = graph_augment(batch)
        z1 = encoder(data1.x.to(device), data1.edge_index.to(device))
        z2 = encoder(data2.x.to(device), data2.edge_index.to(device))
        z1 = projection(z1)
        z2 = projection(z2)
        loss = info_nce_loss(z1, z2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"[Pretrain] Epoch {epoch}, Loss={loss.item():.4f}")
        

print("=== Pretraining with GCL ===")
pretrain_contrastive(train_data, encoder, projection, epochs=20, lr=1e-3)

In [None]:
class Classifier(torch.nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = torch.nn.Linear(in_dim, num_classes)

    def forward(self, x):
        return self.fc(x)


def train_finetune(data, encoder, epochs=10, lr=1e-3):
    # The encoder stays frozen; only the linear classifier is trained
    encoder.eval()
    classifier = Classifier(
        in_dim=encoder.conv3.out_channels, num_classes=2).to(device)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)

    # Embeddings do not change while the encoder is frozen, so compute them once
    with torch.no_grad():
        z = encoder(data.x.to(device), data.edge_index.to(device))

    for epoch in range(epochs):
        logits = classifier(z)
        loss = F.cross_entropy(logits, data.y.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"[Finetune] Epoch {epoch}, Loss={loss.item():.4f}")
    return classifier

print("=== Fine-tuning Classifier ===")
classifier = train_finetune(train_data, encoder, epochs=20, lr=1e-3)

In [None]:
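# Balanced class weights and a joint encoder+classifier optimizer.
# Neither is passed to a loss or training loop in the cells shown here.
class_weights = compute_class_weight(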
    "balanced", classes=np.unique(y_train), y=y_train)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3
)

def evaluate(data):
    encoder.eval()
    classifier.eval()
    with torch.no_grad():
        z = encoder(data.x.to(device), data.edge_index.to(device))
        logits = classifier(z)
        probs = F.softmax(logits, dim=1).cpu().numpy()
        preds = probs.argmax(axis=1)
        y_true = data.y.cpu().numpy()

    f1_w = f1_score(y_true, preds, average="weighted")
    f1_m = f1_score(y_true, preds, average="macro")
    auc = roc_auc_score(y_true, probs[:, 1])
    print(f"[Eval] F1_weighted={f1_w:.4f}, F1_macro={f1_m:.4f}, AUC={auc:.4f}")

    cm = confusion_matrix(y_true, preds)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.show()

    print(classification_report(y_true, preds))
    return f1_w, f1_m, auc

In [None]:
print("=== Evaluation on Test ===")
evaluate(test_data)

## Explainability (XAI)

We will provide multi-level explanations:
1. **GNNExplainer (subgraph + feature mask)** — run per-alert (node). Output: compact subgraph (highlight edges/nodes) and feature importance mask for the target node.
2. **Attention weights** (if using GAT/E-GraphSAGE) — visualize edge attention as thickness / opacity on edges.
3. **Gradient-based heatmap** (Integrated Gradients / GraphGrad-CAM) — color nodes by importance (blue→red scale).
4. **Feature importance (LIME / SHAP)** — per-flow attribute contribution (presented as small bar chart).

**How to read the figures generated here:**
- Subgraph highlight (GNNExplainer): red edges/nodes are most important; node labels show `(IP, port, count)` where available.
- Gradient heatmap: node colors map to importance score (blue low → red high).
- LIME/SHAP bar chart: top-k contributing features for the predicted label.

**Evaluation of explanations (we report):** fidelity, sparsity, and a small human-evaluation checklist (10 alerts).
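
Fidelity and sparsity have several definitions in the literature. The sketch below fixes one simple choice for each as an assumption: sparsity as the fraction of mask entries at or below a threshold, and fidelity as the mean drop in predicted-class probability once the features or edges marked important are removed. Both function names and the numbers are hypothetical.

```python
import numpy as np


def sparsity(mask, threshold=0.5):
    """Fraction of mask entries at or below threshold (higher = more compact)."""
    mask = np.asarray(mask, dtype=float)
    return float((mask <= threshold).mean())


def fidelity_plus(prob_full, prob_without_important):
    """Mean drop in predicted-class probability after removing the explanation.

    prob_full: probability of the predicted class on the original input.
    prob_without_important: same probability with the important parts removed.
    """
    prob_full = np.asarray(prob_full, dtype=float)
    prob_without_important = np.asarray(prob_without_important, dtype=float)
    return float(np.mean(prob_full - prob_without_important))


# Made-up values for two alerts, for illustration only
demo_mask = [0.9, 0.1, 0.2, 0.8, 0.05]
demo_sparsity = sparsity(demo_mask)
demo_fidelity = fidelity_plus([0.95, 0.90], [0.60, 0.70])
print(demo_sparsity, demo_fidelity)
```

A larger fidelity value means the prediction depends more strongly on what the explanation highlighted.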

In [None]:
device_cpu = torch.device("cpu")

In [None]:
# Define a full model that chains the encoder and the classifier
class FullModel(torch.nn.Module):
    def __init__(self, encoder, classifier):
        super().__init__()
        self.encoder = encoder
        self.classifier = classifier

    def forward(self, x, edge_index):
        z = self.encoder(x, edge_index)   # embedding
        out = self.classifier(z)          # logits
        return F.log_softmax(out, dim=1)  # return_type='log_probs'

In [None]:
from torch_geometric.explain import Explainer, GNNExplainer

full_model_cpu = FullModel(encoder, classifier).to(device_cpu)
test_data_cpu = test_data.to(device_cpu)

explainer = Explainer(
    model=full_model_cpu,
    algorithm=GNNExplainer(epochs=100),  # fewer epochs to save time
    explanation_type='model',
    node_mask_type='attributes',
    edge_mask_type='object',
    model_config=dict(
        mode='multiclass_classification',
        task_level='node',
        return_type='log_probs',
    ),
)

node_id = 0
explanation = explainer(
    x=test_data_cpu.x, edge_index=test_data_cpu.edge_index, index=node_id
)

In [None]:
print("Feature mask shape:", explanation.node_mask.shape)
print("Edge mask shape:", explanation.edge_mask.shape)

# Visualize the explanation subgraph
explanation.visualize_graph()