# Reflection: Applying Bayesian Optimisation (BO) to Network Intrusion Detection — **Benign-First Training for Non-Benign Identification**

## Executive summary (non-technical)

- **Goal.** Train detectors to **model benign traffic precisely** (high TNR) so we can **flag non-benign** whenever the model’s benign probability falls below a threshold τ. We **must** still maintain high attack detection (TPR ≥ 90%) and low latency on CPU. Corpora: **UNSW-NB15** and **CIC-IDS2017** labelled PCAPs. ([Kaggle][1], [UNSW Sites][2], [unb.ca][3])  
- **Why BO.** BO **must** tune hyper-parameters, calibration, sampling ratios and τ under noisy/expensive evaluation with constraints (FPR cap, latency). It **should** prefer safe/constrained exploration and **may** reuse priors between datasets for sample efficiency. ([NeurIPS Proceedings][4], [arXiv][5], [PMC][6])  
- **Outcome.** A CPU-deployable detector that **passes benign** (minimises false alerts) and **signals non-benign** reliably at a bounded **FPR** and **p95 latency**, with explainable outputs ready for SOC ingestion.

---

## 1) How the BO skills/code transfer (technical)

### A. Direct applications (FN-aware but **benign-first** modelling)

| BO skill | IDS use-case | **Optimisation target & constraints** | Key refs |
| --- | --- | --- | --- |
| GP surrogate + EI/UCB/KG | Hyper-parameter tuning (LightGBM/CatBoost/SGD, optional 1D-CNN) | **Objective:** maximise **TNR** (benign pass-rate). **Constraints:** **TPR ≥ 0.90**, **FPR ≤ 0.5%**, **p95 latency ≤ 10 ms**. Use **Noisy-EI**. | [4],[5] |
| Noisy BO + replications | Stability under label noise & drift | 3–5× GroupKFold or repeated splits; fold variance → GP noise | [5] |
| **Constrained / Safe BO** | Keep search within SOC limits | **SafeOpt/feasible filters:** reject configs violating latency/FPR constraints | [6],[7] |
| Threshold & calibration BO | Operating point selection | Calibrate (Platt/Isotonic), then BO over τ with **FPR cap**; objective **TNR↑** under **TPR floor** | [5] |
| Mixed discrete/continuous | End-to-end pipeline choice | Search over model family, class-weights, sampling, features | [4] |
| Transfer/meta-BO | UNSW ↔ CIC generalisation | Warm-start from source dataset; compare TNR/TPR stability | [3] |

**Why benign-first?** In production SIEM/SOAR, benign dominates. If the model **knows benign well**, anything that deviates (low benign probability) is a **non-benign candidate**. We still **must** uphold an attack-TPR floor to avoid missing true attacks.
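
The decision rule above can be sketched on synthetic data. This is only an illustration, assuming the detector already outputs a calibrated benign probability; the helper names `pick_tau` and `rates_at_tau` and the Beta-distributed scores are hypothetical and not part of the notebook.

```python
import numpy as np

def pick_tau(p_benign, y_true, fpr_cap=0.005):
    """Largest tau for which at most fpr_cap of benign rows satisfy p_benign < tau.

    y_true uses 0 = benign, 1 = non-benign.
    """
    benign_p = np.sort(p_benign[y_true == 0])
    n_allowed = int(np.floor(fpr_cap * benign_p.size))  # benign rows we may flag
    # The rule is "flag if p < tau", so choosing the (n_allowed)-th smallest benign
    # probability flags at most n_allowed benign rows (those strictly below it).
    return float(benign_p[n_allowed])

def rates_at_tau(p_benign, y_true, tau):
    """TNR, TPR and FPR when rows with p_benign < tau are flagged as non-benign."""
    flagged = p_benign < tau
    benign, attack = (y_true == 0), (y_true == 1)
    tnr = float((~flagged & benign).sum() / benign.sum())
    tpr = float((flagged & attack).sum() / attack.sum())
    return {"TNR": tnr, "TPR": tpr, "FPR": 1.0 - tnr}

# Synthetic stand-in: benign rows get high benign probability, attacks get low.
rng = np.random.default_rng(0)
y = np.r_[np.zeros(2000, dtype=int), np.ones(500, dtype=int)]
p = np.r_[rng.beta(8, 1, 2000), rng.beta(1, 8, 500)]

tau = pick_tau(p, y, fpr_cap=0.005)
rates = rates_at_tau(p, y, tau)
print(f"tau={tau:.3f}", rates)
```

Choosing τ from the benign score distribution keeps the FPR cap satisfied by construction; the TPR floor then has to be checked separately at that τ.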

### B. What is optimised

- **Features.** CICFlowMeter-style flow stats, byte histograms, temporal aggregates; **time-grouped** CV to avoid leakage. ([3],[8])  
- **Models.** Baselines (SGD/LogReg), LightGBM/CatBoost; optional 1D-CNN on payload bytes if present.  
- **Objective/constraints.** **Maximise TNR** under **TPR ≥ 0.90**, **FPR ≤ 0.5%**, **p95 latency ≤ 10 ms**.  
- **Explainability.** SHAP stability on benign motifs; an aid for triage playbooks.  
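
One simple way to hand the objective and constraints to a minimiser is a penalised loss. The sketch below is a plain feasibility penalty, not SafeOpt or a dedicated constrained acquisition function; the names `Limits`, `is_feasible` and `bo_loss` are hypothetical, and the metric dictionaries are made-up numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    tpr_floor: float = 0.90       # attack detection floor
    fpr_cap: float = 0.005        # 0.5 % benign false-alert cap
    p95_latency_ms: float = 10.0  # CPU latency bound

def is_feasible(m, lim=Limits()):
    """True when a trial's metrics satisfy every constraint."""
    return (m["TPR"] >= lim.tpr_floor
            and m["FPR"] <= lim.fpr_cap
            and m["p95_ms"] <= lim.p95_latency_ms)

def bo_loss(m, lim=Limits(), penalty=1.0):
    """Loss for a minimiser: -TNR when feasible, else penalty + total violation.

    -TNR lies in [-1, 0] and the infeasible branch is >= penalty, so with
    penalty > 0 every feasible trial ranks ahead of every infeasible one.
    """
    violation = (max(0.0, lim.tpr_floor - m["TPR"])
                 + max(0.0, m["FPR"] - lim.fpr_cap)
                 + max(0.0, m["p95_ms"] - lim.p95_latency_ms) / lim.p95_latency_ms)
    return -m["TNR"] if violation == 0.0 else penalty + violation

# Made-up trial metrics for illustration only.
good = {"TNR": 0.997, "TPR": 0.93, "FPR": 0.003, "p95_ms": 4.2}
slow = {**good, "p95_ms": 25.0}
print(bo_loss(good), bo_loss(slow))
```

Grading the penalty by the size of the violation gives the surrogate a slope to follow back towards the feasible region, which a flat rejection value would not.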

---

## 2) Questions addressed

- **Benign pass-rate at fixed safety.** Which configurations yield **highest TNR** while keeping **TPR ≥ 0.90**, **FPR ≤ 0.5%**, **p95 latency ≤ 10 ms**?  
- **Cross-dataset robustness.** Do UNSW-tuned τ and calibrations hold on CIC with acceptable TNR/TPR?  
- **Alert budget discipline.** What benign false-alert reduction is achieved at the set FPR cap?  

---

## 3) Datasets

- **Primary:** **UNSW-NB15 & CIC-IDS2017 Labelled PCAPs** (Kaggle/Zenodo). Integer matrices (e.g., **N×1504**) aligned to official CSVs. **Must** be used. ([1],[9])  
- **Metadata:** **UNSW** official description; **CIC-IDS2017** official page; **CICFlowMeter** tooling. ([2],[3],[8])  

**Splitting:** **GroupKFold by day/pcap/session** to prevent temporal leakage. UNSW↔CIC out-of-domain testing for generalisation.
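
A minimal sketch of grouped splitting with scikit-learn's `GroupKFold`, using random stand-in data. The `capture_day` array is a hypothetical group key; in practice it would be whatever day, pcap or session identifier is available for each row.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                  # stand-in features
y = rng.integers(0, 2, size=n)               # stand-in binary labels
capture_day = rng.integers(0, 10, size=n)    # hypothetical group id per row

gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X, y, groups=capture_day))

for k, (train_idx, test_idx) in enumerate(folds):
    train_groups = set(capture_day[train_idx])
    test_groups = set(capture_day[test_idx])
    # No group may appear on both sides of a fold.
    assert train_groups.isdisjoint(test_groups)
    print(f"fold {k}: test groups = {sorted(int(g) for g in test_groups)}")
```

`GroupKFold` keeps groups intact but does not order them in time; if strictly "train on the past, test on the future" evaluation is needed, the groups would have to be sorted and split chronologically instead.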

---

## 4) Alignment

- **SOC/DFIR (must).** High TNR reduces benign alert noise; TPR floor reduces missed attacks.  
- **MLOps (should).** Constrained/safe BO, model cards, reproducible CV.  
- **Deployment (must).** CPU latency bounds, τ at FPR cap, calibration stability.  
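
The CPU latency bound can be checked with a small timing harness. This is a sketch under the assumption that latency means wall-clock time of single-row prediction calls; `p95_latency_ms` and the dummy predictor are hypothetical names, and any fitted model's `predict` or `score_samples` could be passed instead.

```python
import time
import numpy as np

def p95_latency_ms(predict_fn, X, n_rows=200, warmup=10):
    """Time single-row calls to predict_fn; return the 95th percentile in ms."""
    n_rows = min(n_rows, len(X))
    for i in range(min(warmup, len(X))):   # warm caches before timing
        predict_fn(X[i:i + 1])
    timings_ms = []
    for i in range(n_rows):
        row = X[i:i + 1]                   # keep the 2-D shape models expect
        start = time.perf_counter()
        predict_fn(row)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return float(np.percentile(timings_ms, 95))

# Dummy predictor standing in for a real model.
X_demo = np.random.default_rng(0).normal(size=(300, 8))
dummy_predict = lambda rows: rows.sum(axis=1) > 0

p95 = p95_latency_ms(dummy_predict, X_demo)
print(f"p95 latency: {p95:.4f} ms")
```

Timings depend on the machine and its load, so the measurement is best repeated on the target CPU with the thread settings used in deployment.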

---

## Proposed project blueprint (updated)

1. **Problem framing & KPIs (must).** **Objective:** maximise **TNR (benign pass-rate)**. **Constraints:** **TPR ≥ 0.90**, **FPR ≤ 0.5%**, **p95 latency ≤ 10 ms**; secondary: macro-F1, PR-AUC.  
2. **Data engineering (must).** Load labelled PCAPs; derive flow/payload features; de-dup; **time-grouped CV**. ([1],[3])  
3. **Model space (should).** SGD/LogReg; LightGBM/CatBoost; optional 1D-CNN (payload).  
4. **BO setup (must).** GP-Matern(5/2) + **Noisy-EI**; batch 4–8; **feasibility filters** for FPR/latency/TPR. ([4],[5],[6])  
5. **Imbalance handling (must).** BO over class-weights/sampling ratios with benign-first objective and TPR floor.  
6. **Calibration & τ (should).** Platt vs Isotonic; τ picked under **FPR cap**.  
7. **Cross-dataset (must).** UNSW↔CIC transfer; report ΔTNR/ΔTPR and τ drift.  
8. **Explainability (may).** SHAP motifs → defensive heuristics/rules.  
9. **Deployment (should).** CPU perf profile; CI job for small BO refresh on rolling data.  
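
Step 4 can be illustrated with a one-dimensional toy problem. The sketch below fits a GP with a Matern-5/2 kernel plus a white-noise term and evaluates plain Expected Improvement on a grid. It is a simplification: it is not the Noisy-EI acquisition, it is not batched, and `toy_objective` is a hypothetical stand-in for a cross-validated score.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximisation, given posterior mean/std and the incumbent value."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def toy_objective(x, rng):
    """Hypothetical noisy score with its maximum near x = 0.3."""
    return -(x - 0.3) ** 2 + rng.normal(scale=0.01)

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, size=(6, 1))        # six initial evaluations
y_obs = np.array([toy_objective(x[0], rng) for x in X_obs])

# Matern-5/2 models the response surface; WhiteKernel absorbs evaluation noise.
kernel = Matern(length_scale=0.2, nu=2.5) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_obs, y_obs)

grid = np.linspace(0, 1, 201).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)
incumbent = gp.predict(X_obs).max()           # best posterior mean at observed points
ei = expected_improvement(mu, sigma, incumbent)
x_next = float(grid[np.argmax(ei), 0])
print("next point to evaluate:", x_next)
```

Using the posterior mean rather than the raw best observation as the incumbent is one common way to stop a single lucky noisy evaluation from dominating the search.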

---

## Kernel-crash avoidance & reliability controls

- **Row cap for BO:** subsample to `MAX_TRAIN_ROWS` with stratification when datasets are huge.  
- **Aggressive dtype down-cast & constant-column drop.**  
- **Thread control:** limit OpenMP/BLAS threads to avoid oversubscription.  
- **Safe plotting:** skip heavy skopt plots if memory constrained; free figures (`plt.close()`).  
- **Checkpoint trials:** write a light **trials.parquet** to resume analysis after interruption.  
- **Guard optional deps:** only search installed model families.  
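
A few of these controls can be sketched with pandas. The helper names `shrink_frame` and `stratified_cap`, the thread count and the value of `MAX_TRAIN_ROWS` are assumptions chosen for illustration; the demo frame is synthetic.

```python
import os
# Thread limits should be set before NumPy/BLAS-backed libraries are imported.
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")

import numpy as np
import pandas as pd

MAX_TRAIN_ROWS = 200  # hypothetical cap, kept tiny for the demo

def shrink_frame(df, label_col="label"):
    """Drop constant feature columns and down-cast numeric dtypes."""
    feats = df.drop(columns=[label_col])
    out = feats.loc[:, feats.nunique(dropna=False) > 1].copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = out[col].astype(np.float32)   # halves memory, loses precision
    out[label_col] = df[label_col].to_numpy()
    return out

def stratified_cap(df, max_rows, label_col="label", seed=42):
    """Subsample to roughly max_rows while preserving the label proportions."""
    if len(df) <= max_rows:
        return df
    frac = max_rows / len(df)
    return df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "const": np.zeros(1000, dtype=np.int64),               # constant, will be dropped
    "byte0": rng.integers(0, 256, 1000).astype(np.int64),  # fits a smaller int type
    "t_delta": rng.random(1000),
    "label": (rng.random(1000) < 0.25).astype(int),
})
small = shrink_frame(demo)
capped = stratified_cap(small, MAX_TRAIN_ROWS)
print(small.dtypes.to_dict(), len(capped))
```

Per-class sampling rounds each class independently, so the capped frame can differ from `MAX_TRAIN_ROWS` by a row or two.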

---

## Explanation

*We teach a computer what “normal (benign) internet traffic” looks like. If a new connection looks **unlike** benign, we treat it as **non-benign**.*

1. We try different model settings. **Bayesian Optimisation** helps us *choose smartly* rather than guessing randomly.  
2. We **calibrate** model scores so “0.8” really means “~80% chance benign”.  
3. We pick a **threshold** (τ) so that **false alarms on normal traffic** stay under our limit.  
4. We also check that **at least 90% of attacks** are still caught and that **predictions are fast** on a normal CPU.  
5. We save the best settings and a small “model card” (a factsheet) for the SOC team.  

---

## Mathematical/statistical notes (used in code; quick glossary)

- **Sigmoid:** maps any number to 0–1:  \( \sigma(z)=1/(1+e^{-z}) \).  
- **Calibration:** learn a mapping \( g(p) \) so calibrated probabilities match observed frequencies (Platt = logistic; Isotonic = monotone step-wise).  
- **Confusion matrix:** TN/FP/FN/TP counts; **TNR** \(= \text{TN}/(\text{TN}+\text{FP})\) (benign passed), **TPR** \(= \text{TP}/(\text{TP}+\text{FN})\).  
- **Percentiles:** p95 latency = time under which 95% of single-row predictions finish.  
- **Bayesian Optimisation:** build a surrogate (GP with Matern-5/2 kernel) over hyper-params; pick next point by **Expected Improvement (EI)**; handle noise by modelling observation variance.  
- **Feasible set:** we keep only configs with **TPR ≥ 0.90** and **latency ≤ 10 ms**; τ is chosen to satisfy **FPR cap**.  
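
The two calibration options in the glossary can be compared on synthetic scores. This is a sketch only: the data-generating process is invented, and for brevity both calibrators are fitted and inspected on the same rows, whereas a real pipeline would fit them on a held-out calibration split.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
raw = rng.uniform(-3, 3, size=n)                    # uncalibrated model scores
true_p = 1.0 / (1.0 + np.exp(-2.0 * raw))           # hidden benign probability
is_benign = (rng.uniform(size=n) < true_p).astype(int)

# Platt scaling: a logistic regression on the raw score.
platt = LogisticRegression().fit(raw.reshape(-1, 1), is_benign)
p_platt = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

# Isotonic regression: a monotone, step-wise mapping from score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
p_iso = iso.fit(raw, is_benign).predict(raw)

gap_platt = float(np.abs(p_platt - true_p).mean())
gap_iso = float(np.abs(p_iso - true_p).mean())
print(f"mean |calibrated - true|: Platt={gap_platt:.3f}  Isotonic={gap_iso:.3f}")
```

Platt scaling assumes a sigmoid-shaped relationship and needs few samples; isotonic regression makes no shape assumption beyond monotonicity but tends to need more calibration data.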

---

[1]: https://www.kaggle.com/datasets/yasiralifarrukh/unsw-and-cicids2017-labelled-pcap-data/code  
[2]: https://research.unsw.edu.au/projects/unsw-nb15-dataset  
[3]: https://www.unb.ca/cic/datasets/ids-2017.html  
[4]: https://proceedings.neurips.cc/paper/2012/file/05311655a15b75fab86956663e1819cd-Paper.pdf  
[5]: https://arxiv.org/pdf/1807.02811  
[6]: https://pmc.ncbi.nlm.nih.gov/articles/PMC10485113/  
[7]: https://arxiv.org/abs/2403.12948  
[8]: https://github.com/ahlashkari/CICFlowMeter  
[9]: https://zenodo.org/records/7258579  


## 0) Imports & configuration


In [24]:
# %%
import os, json, warnings, joblib
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_recall_curve, average_precision_score, roc_auc_score,
    confusion_matrix
)
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.ensemble import IsolationForest

# HPO
from skopt import BayesSearchCV, gp_minimize
from skopt.space import Real, Integer

warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (7.5, 5)

CSV_PATH     = "archive/Payload_data_UNSW.csv"
ART_DIR      = "ids_artifacts"
MODEL_PATH   = os.path.join(ART_DIR, "model.pkl")
SCALER_PATH  = os.path.join(ART_DIR, "scaler.pkl")
ENCODER_PATH = os.path.join(ART_DIR, "label_encoder.pkl")
META_PATH    = os.path.join(ART_DIR, "metadata.json")
TRIALS_PATH  = os.path.join(ART_DIR, "trials.parquet")

Path(ART_DIR).mkdir(parents=True, exist_ok=True)
Path("reports/figures").mkdir(parents=True, exist_ok=True)
Path("reports/tables").mkdir(parents=True, exist_ok=True)

## 1. Problem & Data
- **Task:** Model **benign** traffic; later, flag **non‑benign** if anomaly score ≥ τ.
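- **Metric for optimisation:** **AP (PR‑AUC)** on validation. (In production traffic attacks are rare, so PR is more informative than ROC. In this CSV, however, non‑benign rows are the majority at about 74%, so AP here starts from a chance level of roughly 0.74.)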
- **Dataset:** `archive/Payload_data_UNSW.csv` with binary `label` (0=benign, 1=non‑benign).
- **Split (demo):** Stratified random train/val/test. In production, prefer **time‑based** or **grouped** splits.


In [25]:
# --- Load CSV ---
df = pd.read_csv(CSV_PATH)
print("Original shape:", df.shape)
assert "label" in df.columns, "Expected a 'label' column with string classes."

# --- Preserve string labels, create binary label ---
df = df.copy()
df["label_str"] = df["label"].astype(str)

# Map: benign = 'normal' -> 0, non-benign (everything else) -> 1
df["label"] = (df["label_str"] != "normal").astype(int)

# Quick checks
print("Unique string labels and counts:")
print(df["label_str"].value_counts())

print("\nBinary label balance (0=benign 'normal', 1=non-benign):")
print(df["label"].value_counts(normalize=True).rename({0:"benign", 1:"non-benign"}).round(3))

# Optional sanity check: no NaNs
assert not df["label"].isna().any(), "Label has NaNs after mapping."

# --- Stratified splits on the NEW binary label ---
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
train_df, val_df  = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["label"]
)

print("\nSplits → train:", train_df.shape, "val:", val_df.shape, "test:", test_df.shape)
print("Train binary balance:\n", train_df["label"].value_counts(normalize=True).round(3))
print("Val   binary balance:\n", val_df["label"].value_counts(normalize=True).round(3))
print("Test  binary balance:\n", test_df["label"].value_counts(normalize=True).round(3))

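# (Re)build features with the build_X helper, which must be defined before this cell runs.
# As summarised in Section 2, it keeps numeric columns, drops 'label', and applies StandardScaler.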
X_train = build_X(train_df, fit_scaler=True)
X_val   = build_X(val_df,   fit_scaler=False)
X_test  = build_X(test_df,  fit_scaler=False)

y_train = train_df["label"].values
y_val   = val_df["label"].values
y_test  = test_df["label"].values

print("\nFeature shapes:", X_train.shape, X_val.shape, X_test.shape)

# Benign-only subset for IF training (0 = benign 'normal')
X_train_benign = X_train[y_train == 0]
print("Benign rows in train:", X_train_benign.shape[0])
if X_train_benign.shape[0] == 0:
    raise ValueError("Still no benign rows after mapping. Check the mapping or splits.")


Original shape: (79881, 1505)
Unique string labels and counts:
label_str
normal            21000
generic           17580
exploits          13992
fuzzers           12722
reconnaissance     7562
dos                3397
backdoor           1239
analysis           1208
shellcode          1088
worms                93
Name: count, dtype: int64

Binary label balance (0=benign 'normal', 1=non-benign):
label
non-benign    0.737
benign        0.263
Name: proportion, dtype: float64

Splits → train: (51123, 1506) val: (12781, 1506) test: (15977, 1506)
Train binary balance:
 label
1    0.737
0    0.263
Name: proportion, dtype: float64
Val   binary balance:
 label
1    0.737
0    0.263
Name: proportion, dtype: float64
Test  binary balance:
 label
1    0.737
0    0.263
Name: proportion, dtype: float64

Feature shapes: (51123, 1503) (12781, 1503) (15977, 1503)
Benign rows in train: 13440


## 1.1 Quick diagnosis cell

In [26]:
# Diagnose label distribution overall and per split
print("Overall balance:", df['label'].value_counts().to_dict())
print("Train balance:",  train_df['label'].value_counts().to_dict())
print("Val balance:",    val_df['label'].value_counts().to_dict())
print("Test balance:",   test_df['label'].value_counts().to_dict())

# Choose the benign label automatically (0 by default; fall back to 1 if needed)
BENIGN_LABEL = 0
if (train_df['label'] == 0).sum() == 0 and (train_df['label'] == 1).sum() > 0:
    BENIGN_LABEL = 1
print(f"Using BENIGN_LABEL = {BENIGN_LABEL}")

# If there are too few benign rows in train, re-split with different ratios
MIN_BENIGN_TRAIN = 50  # tweak if needed
if (train_df['label'] == BENIGN_LABEL).sum() < MIN_BENIGN_TRAIN:
    print("Re-splitting to ensure enough benign rows in train...")
    # Try a smaller test/val to increase train size
    tmp_train, tmp_test = train_test_split(
        df, test_size=0.15, random_state=42, stratify=df['label']
    )
    tmp_train, tmp_val  = train_test_split(
        tmp_train, test_size=0.15, random_state=42, stratify=tmp_train['label']
    )
    # Only adopt if it improves benign count
    if (tmp_train['label'] == BENIGN_LABEL).sum() > (train_df['label'] == BENIGN_LABEL).sum():
        train_df, val_df, test_df = tmp_train, tmp_val, tmp_test
        print("Re-split applied.")
    else:
        print("Re-split did not improve benign count; continuing with original split.")


Overall balance: {1: 58881, 0: 21000}
Train balance: {1: 37683, 0: 13440}
Val balance: {1: 9421, 0: 3360}
Test balance: {1: 11777, 0: 4200}
Using BENIGN_LABEL = 0


## 2. Features & Scaling (preprocessing summary)
For a comparable, simple baseline we:
- keep **numeric** columns only,
- drop `label` from features,
- use `StandardScaler` fit on **train**, then transform **val/test**.
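
The notebook calls `build_X(frame, fit_scaler=...)` but its definition is not shown. A minimal sketch consistent with the three steps above is given here; it is an assumption about what such a helper could look like, not the original implementation, and the module-level `_scaler` and `_feature_cols` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

_scaler = StandardScaler()   # shared so val/test reuse the train statistics
_feature_cols = None         # column order remembered from the training frame

def build_X(frame, fit_scaler=False, label_col="label"):
    """Numeric columns only, label dropped, standardised with train statistics."""
    global _feature_cols
    numeric = (frame.select_dtypes(include=[np.number])
                    .drop(columns=[label_col], errors="ignore"))
    if fit_scaler:
        _feature_cols = list(numeric.columns)
        return _scaler.fit_transform(numeric[_feature_cols].to_numpy(dtype=np.float64))
    if _feature_cols is None:
        raise RuntimeError("Call build_X(..., fit_scaler=True) on the training frame first.")
    return _scaler.transform(numeric[_feature_cols].to_numpy(dtype=np.float64))

# Tiny demo frame: the string column and the label are excluded from the features.
demo = pd.DataFrame({
    "b0": [0, 10, 20, 30],
    "b1": [1.0, 1.5, 2.0, 2.5],
    "protocol": ["tcp", "udp", "tcp", "udp"],
    "label": [0, 1, 0, 1],
})
X_demo = build_X(demo, fit_scaler=True)
print(X_demo.shape)
```

Keeping the scaler outside the function is what lets `fit_scaler=False` calls on validation and test data reuse the training mean and variance, which avoids leaking held-out statistics into preprocessing.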


In [27]:
# Rebuild features after potentially changing the splits
X_train = build_X(train_df, fit_scaler=True)
X_val   = build_X(val_df,   fit_scaler=False)
X_test  = build_X(test_df,  fit_scaler=False)

y_train = train_df["label"].values
y_val   = val_df["label"].values
y_test  = test_df["label"].values

# Safe benign-only selection (falls back if empty)
benign_mask = (y_train == BENIGN_LABEL)
num_benign  = int(benign_mask.sum())
print(f"Benign rows in train: {num_benign}")

if num_benign == 0:
    # Last-resort fallback: treat entire train as "unlabeled benign" for unsupervised fitting.
    # You’ll still evaluate on val/test with labels.
    print("WARNING: No benign rows in train. Fitting IF on ALL training rows (unsupervised fallback).")
    X_train_benign = X_train
else:
    X_train_benign = X_train[benign_mask]


Benign rows in train: 13440


## 3. Baseline Model (Benign‑only Isolation Forest)
Following the baseline → optimise pattern:
- Train IF on **benign‑only** `X_train`.
- Score = `−score_samples(X)` so **higher = more anomalous**.
- Evaluate on **val/test** with PR curves and rank metrics.


In [28]:
# %%
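# X_train_benign is reused from the previous cell, so the BENIGN_LABEL choice
# and its unsupervised fallback are not overwritten by a hard-coded 0 here.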

base_if = IsolationForest(
    n_estimators=300, max_samples=0.7, contamination=0.01, max_features=1.0,
    random_state=42, n_jobs=-1
).fit(X_train_benign)

joblib.dump(base_if, MODEL_PATH)
print("Saved baseline model →", MODEL_PATH)

def precision_at_k(y_true, scores, k):
    idx = np.argsort(scores)[::-1][:k]
    return (y_true[idx] == 1).mean()

def recall_at_k(y_true, scores, k):
    idx = np.argsort(scores)[::-1][:k]
    tp = (y_true[idx] == 1).sum()
    positives = (y_true == 1).sum()
    return tp / positives if positives > 0 else 0.0

def eval_split(name, model, X, y):
    scores = -model.score_samples(X)
    ap  = average_precision_score(y, scores)
    roc = roc_auc_score(y, scores)
    p, r, _ = precision_recall_curve(y, scores)
    plt.figure(); plt.plot(r, p); plt.xlabel("Recall"); plt.ylabel("Precision")
    plt.title(f"PR Curve — {name} (AP={ap:.3f})"); plt.tight_layout()
    plt.savefig(f"reports/figures/pr_{name}.png", dpi=130); plt.close()
    k = min(1000, max(1, int(0.01 * len(scores))))
    print(f"[{name}] AP={ap:.4f} | ROC-AUC={roc:.4f} | P@{k}={precision_at_k(y, scores, k):.4f} | R@{k}={recall_at_k(y, scores, k):.4f}")
    pd.DataFrame({"score": scores, "label": y}).to_csv(f"reports/tables/scores_{name}.csv", index=False)
    return ap, roc

print("Validation:"); _ = eval_split("val_baseline", base_if, X_val, y_val)
print("Test:");       _ = eval_split("test_baseline", base_if, X_test, y_test)

Saved baseline model → ids_artifacts/model.pkl
Validation:
[val_baseline] AP=0.8478 | ROC-AUC=0.7136 | P@127=0.9528 | R@127=0.0128
Test:
[test_baseline] AP=0.8479 | ROC-AUC=0.7157 | P@159=0.9308 | R@159=0.0126


In [29]:
print("Unique labels in dataset:", np.unique(df['label'].values, return_counts=True))
print("Train unique:", np.unique(y_train, return_counts=True))
print("Val unique:",   np.unique(y_val, return_counts=True))
print("Test unique:",  np.unique(y_test, return_counts=True))
print("Any NaNs in label?", df['label'].isna().sum())
print("Numeric feature count:", build_X(df).shape[1])


Unique labels in dataset: (array([0, 1]), array([21000, 58881]))
Train unique: (array([0, 1]), array([13440, 37683]))
Val unique: (array([0, 1]), array([3360, 9421]))
Test unique: (array([0, 1]), array([ 4200, 11777]))
Any NaNs in label? 0
Numeric feature count: 1503


## 4. Hyperparameter Search Space
Define ranges before random/BO search.
For IF we tune **4 dials**:
- `n_estimators` ∈ [100, 800], `max_samples` ∈ [0.3, 1.0],
- `contamination` ∈ [1e‑3, 2e‑2] (log‑uniform), `max_features` ∈ [0.5, 1.0].

In [30]:
# %%
SPACE = dict(
    n_estimators=(100, 800),
    max_samples=(0.3, 1.0),
    contamination=(0.001, 0.02),
    max_features=(0.5, 1.0),
)

## 5. Automated HPO — Random Search (baseline to beat)
We first run a **simple random search**:
1) Sample params → fit on **benign train**.
2) Score **AP on validation**.
3) Track **best** and keep a trials table.

In [None]:
# %%
def random_search_if(X_train_b, X_val, y_val, n_iter=30, seed=42):
    rng = np.random.RandomState(seed)
    trials, best = [], (-np.inf, None)
    for i in range(n_iter):
        params = {
            "n_estimators": rng.randint(SPACE["n_estimators"][0], SPACE["n_estimators"][1] + 1),
            "max_samples":  rng.uniform(*SPACE["max_samples"]),
            "contamination": 10 ** rng.uniform(np.log10(SPACE["contamination"][0]), np.log10(SPACE["contamination"][1])),
            "max_features": rng.uniform(*SPACE["max_features"]),
            "random_state": 42, "n_jobs": -1
        }
        m = IsolationForest(**params).fit(X_train_b)
        ap = average_precision_score(y_val, -m.score_samples(X_val))
        trials.append({**params, "AP": ap})
        if ap > best[0]: best = (ap, params)
        print(f"Iter {i+1}/{n_iter} | AP={ap:.4f} | {params}")
    tdf = pd.DataFrame(trials)
    tdf.to_parquet(TRIALS_PATH, index=False)
    print("Saved trials →", TRIALS_PATH)
    return best, tdf

best_rs, trials_rs = random_search_if(X_train_benign, X_val, y_val, n_iter=30)
print("Best Random Search AP:", round(best_rs[0], 4)); print("Best params:", best_rs[1])


## 6. Bayesian Optimisation with Scikit‑Optimize — `BayesSearchCV`
Search that **learns** where to try next.
We wrap IF so the scorer uses anomaly scores to compute **AP** in CV folds.

In [None]:
# %%
from sklearn.base import BaseEstimator
from sklearn.metrics import make_scorer

class IF_AP_Wrapper(BaseEstimator):
    def __init__(self, **params): self.params, self.model = params, None
    def set_params(self, **params): self.params.update(params); return self
    def get_params(self, deep=True): return dict(**self.params)
    def fit(self, X, y=None):
        Xb = X[y == 0] if y is not None else X
        self.model = IsolationForest(**self.params).fit(Xb)
        return self
    def decision_function(self, X):  # higher = more positive
        return -self.model.score_samples(X)  # anomaly scores

# Combine train+val to let CV compute AP on labeled folds
X_cv = np.vstack([X_train, X_val])
y_cv = np.concatenate([y_train, y_val])

search_space = {
    "n_estimators":  Integer(100, 800),
    "max_samples":   Real(0.3, 1.0),
    "contamination": Real(0.001, 0.02, prior="log-uniform"),
    "max_features":  Real(0.5, 1.0),
}
tscv = TimeSeriesSplit(n_splits=4)  # if not time-ordered, a StratifiedKFold would be fine
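# Callable scorer (estimator, X, y) -> AP computed from the wrapper's anomaly scores.
# This replaces make_scorer(..., needs_threshold=True), an argument that newer
# scikit-learn releases no longer accept.
def ap_scorer(estimator, X, y):
    return average_precision_score(y, estimator.decision_function(X))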

bo_est = IF_AP_Wrapper(
    n_estimators=300, max_samples=0.7, contamination=0.01, max_features=1.0,
    random_state=42, n_jobs=-1
)

bayes_opt = BayesSearchCV(
    bo_est, search_spaces=search_space, n_iter=40, cv=tscv,
    scoring=ap_scorer, n_jobs=-1, random_state=42, refit=True
)
bayes_opt.fit(X_cv, y_cv)

print("Best (BayesSearchCV) AP:", bayes_opt.best_score_)
print("Best params:", bayes_opt.best_params_)


[DATA] Loading CSV from: archive/Payload_data_UNSW.csv
[DATA] Dropped 25 constant columns.
[DATA] Done. Features shape: (79881, 1479), Labels: (79881,)
[SPLIT] Train: (59910, 1479)  Test: (19971, 1479)
[BO] Setup: n_calls=30, n_initial_points=10, acq=EI
[BO] Starting Bayesian Optimisation…
[CNN] Epoch 1/6 - avg loss: 0.6629
[CNN] Epoch 2/6 - avg loss: 0.6003
[CNN] Epoch 3/6 - avg loss: 0.5828
[CNN] Epoch 4/6 - avg loss: 0.5788
[CNN] Epoch 5/6 - avg loss: 0.5771
[CNN] Epoch 6/6 - avg loss: 0.5761
[CNN] Epoch 1/6 - avg loss: 0.5758
[CNN] Epoch 2/6 - avg loss: 0.5748
[CNN] Epoch 3/6 - avg loss: 0.5733
[CNN] Epoch 4/6 - avg loss: 0.5727
[CNN] Epoch 5/6 - avg loss: 0.5720
[CNN] Epoch 6/6 - avg loss: 0.5706
[CNN] Epoch 1/6 - avg loss: 0.5684
[CNN] Epoch 2/6 - avg loss: 0.5672
[CNN] Epoch 3/6 - avg loss: 0.5665
[CNN] Epoch 4/6 - avg loss: 0.5658
[CNN] Epoch 5/6 - avg loss: 0.5646
[CNN] Epoch 6/6 - avg loss: 0.5641
Training until validation scores don't improve for 20 rounds
Early stopping, be

[LightGBM] [Fatal] Cannot use bagging in GOSS


LightGBMError: Cannot use bagging in GOSS

### 6.1 BO Convergence
We plot the **best‑so‑far** AP vs iteration to mirror the “learning curve” style output.


In [None]:
# %%
def plot_bo_convergence(bayes_cv, title="Bayesian Optimisation Convergence"):
    # optimizer_results_ holds one result per search space; a single space is assumed here.
    opt_res = bayes_cv.optimizer_results_[0] if isinstance(bayes_cv.optimizer_results_, list) else bayes_cv.optimizer_results_
    f = np.asarray(opt_res.func_vals, dtype=float)
    # skopt minimises, so BayesSearchCV is expected to store the negated score (-AP).
    # AP lies in [0, 1]: if the stored values are non-positive, flip the sign to recover AP.
    ap_vals = -f if f.mean() <= 0 else f
    # Best AP seen up to each iteration.
    ap_series = np.maximum.accumulate(ap_vals)
    plt.figure()
    plt.plot(np.arange(1, len(ap_series)+1), ap_series)
    plt.xlabel("Iteration")
    plt.ylabel("Best AP so far")
    plt.title(title); plt.tight_layout()
    plt.savefig("reports/figures/bo_convergence.png", dpi=130); plt.close()

plot_bo_convergence(bayes_opt, "BO Convergence — Best AP over iterations")


## 7. Bayesian Optimisation using Gaussian Processes — `gp_minimize`
To mirror the “under the hood” view, we directly optimise a **2‑D slice**:
minimise `loss = −AP(val)` over `(n_estimators, max_samples)`.

In [None]:
# %%
def gp_objective(vec):
    ne, ms = int(vec[0]), float(vec[1])
    m = IsolationForest(
        n_estimators=ne, max_samples=ms, contamination=0.01,
        max_features=1.0, random_state=42, n_jobs=-1
    ).fit(X_train_benign)
    ap = average_precision_score(y_val, -m.score_samples(X_val))
    return -ap

res_gp = gp_minimize(
    gp_objective,
    dimensions=[(100, 800), (0.3, 1.0)],
    n_calls=20, random_state=42
)
print("GP best loss (−AP):", res_gp.fun)
print("GP best params [n_estimators, max_samples]:", res_gp.x)

## 8. Compare Methods → Select Winner → Final Evaluation
Take **best from each method**, refit on **benign train**, and
report **validation/test** performance side‑by‑side.

In [None]:
# %%
candidates = []

# Random Search winner
if best_rs[1] is not None:
    candidates.append(("random_search", best_rs[1]))

# BayesSearchCV winner
candidates.append(("bayes_opt", bayes_opt.best_params_.copy()))

# GP winner (fill remaining params)
gp_params = {"n_estimators": int(res_gp.x[0]), "max_samples": float(res_gp.x[1]),
             "contamination": 0.01, "max_features": 1.0}
candidates.append(("gp_minimize", gp_params))

summary = []
for tag, p in candidates:
    m = IsolationForest(**{**p, "random_state": 42, "n_jobs": -1}).fit(X_train_benign)
    ap_val = average_precision_score(y_val, -m.score_samples(X_val))
    summary.append({"method": tag, "val_AP": ap_val, **p})
    print(f"{tag}: val AP={ap_val:.4f} params={p}")

summary_df = pd.DataFrame(summary).sort_values("val_AP", ascending=False)
display(summary_df.style.format(precision=4))
summary_df.to_csv("reports/tables/summary_val.csv", index=False)

best_row = summary_df.iloc[0]
best_params = {k: best_row[k] for k in ["n_estimators","max_samples","contamination","max_features"] if k in best_row}
final_model = IsolationForest(**{**best_params, "random_state": 42, "n_jobs": -1}).fit(X_train_benign)
joblib.dump(final_model, MODEL_PATH)

print("\nFinal Validation:")
_ = eval_split("val_final", final_model, X_val, y_val)
print("Final Test:")
_ = eval_split("test_final", final_model, X_test, y_test)

## 9. Thresholds for Later Use (not deploying here)
In production you’ll pick τ on anomaly scores to meet an **alert budget**:
- τ for **Precision ≥ 90%** (clean analyst queue)
- τ for **Benign Specificity ≥ 99%** (minimise benign false alerts)


In [None]:
# %%
def threshold_for_precision(y_true, scores, target_precision=0.90):
    p, r, t = precision_recall_curve(y_true, scores)
    if len(t) == 0: return None
    t_pad = np.concatenate([t, [t[-1]]])
    mask = p >= target_precision
    if not np.any(mask): return None
    idx = np.argmax(mask)
    return t_pad[idx]

# Benign specificity (TNR): share of truly benign rows that are NOT flagged (score < tau).
def specificity_at_threshold(y_true, scores, tau, benign_label=BENIGN_LABEL):
    y_pred_attack = (scores >= tau).astype(int)            # 1 = flagged as non-benign
    y_true_attack = (y_true != benign_label).astype(int)   # 1 = truly non-benign
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent at this tau
    tn, fp, fn, tp = confusion_matrix(y_true_attack, y_pred_attack, labels=[0, 1]).ravel()
    return tn / (tn + fp) if (tn + fp) > 0 else 0.0


def threshold_for_specificity_on_benign(y_true, scores, target_spec=0.99):
    for tau in np.unique(scores):
        if specificity_at_threshold(y_true, scores, tau) >= target_spec:
            return tau
    return None

scores_val = -final_model.score_samples(X_val)
tau_p = threshold_for_precision(y_val, scores_val, target_precision=0.90)
tau_s = threshold_for_specificity_on_benign(y_val, scores_val, target_spec=0.99)
print("τ (Precision≥0.90):", tau_p, " | τ (Benign Specificity≥0.99):", tau_s)
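
Scanning every unique score with a confusion matrix becomes slow on large validation sets. A vectorised alternative is sketched below, assuming benign specificity means the share of benign rows whose anomaly score is strictly below τ; `tau_for_benign_specificity` is a hypothetical helper and the scores are synthetic.

```python
import numpy as np

def tau_for_benign_specificity(scores, y_true, target_spec=0.99, benign_label=0):
    """Smallest observed score tau such that at least target_spec of the
    benign rows have score < tau (those rows are not flagged)."""
    benign_sorted = np.sort(scores[y_true == benign_label])
    candidates = np.unique(scores)                      # ascending thresholds
    # For each candidate tau, count benign rows strictly below it.
    below = np.searchsorted(benign_sorted, candidates, side="left")
    ok = below / benign_sorted.size >= target_spec
    return float(candidates[np.argmax(ok)]) if ok.any() else None

# Synthetic anomaly scores: non-benign rows (label 1) score higher on average.
rng = np.random.default_rng(1)
y_demo = (rng.random(500) < 0.7).astype(int)
s_demo = rng.normal(loc=y_demo * 1.5, scale=1.0)

tau_demo = tau_for_benign_specificity(s_demo, y_demo, target_spec=0.99)
spec_demo = float((s_demo[y_demo == 0] < tau_demo).mean())
print(f"tau={tau_demo:.3f}  benign specificity={spec_demo:.4f}")
```

Because specificity can only increase as τ rises, the first candidate that meets the target is also the one that flags the most traffic, and so keeps attack recall as high as the specificity target allows.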

## 10. Discussion (Koehrsen‑style wrap‑up)
- **Baseline vs HPO:** Random search provides a quick baseline. BO models the objective (AP) with a surrogate and trades off exploration against exploitation, so it can reach good configurations in fewer evaluations; the side‑by‑side comparison in Section 8 shows whether it does so on this dataset.
- **Benign‑first angle:** Fitting on benign traffic only needs no attack labels and can therefore flag attack types not seen before,
   but we still evaluate against labels on validation/test to set a realistic operating point.
- **Next steps:** richer features (flows, temporal stats), Safe/Constrained BO (cap FPR / latency), cross‑dataset checks.


## 11. Save Metadata & Repro Bundle

In [None]:
meta = {
    "objective": "benign-first; model benign precisely; later flag non-benign via τ",
    "dataset": os.path.basename(CSV_PATH),
    "search": {"random_search": True, "bayessearchcv": True, "gp_minimize": True},
    "artifacts": {"model": MODEL_PATH, "scaler": SCALER_PATH, "trials": TRIALS_PATH},
    "figures": ["reports/figures/pr_val_baseline.png",
                "reports/figures/pr_test_baseline.png",
                "reports/figures/bo_convergence.png"],
    "notes": "Notebook follows the format and outputs style of Will Koehrsen's BO notebook (adapted to IF).",
    "format_references": [
        "WillKoehrsen/hyperparameter-optimization (GitHub repo)",
        "Kaggle version of Bayesian HPO of GBM"
    ]
}
with open(META_PATH, "w") as f:
    json.dump(meta, f, indent=2)
print("Saved metadata →", META_PATH)