## UNSW-NB15 Experiments (BT-TWD + Baselines)

This notebook walks through a cost-sensitive experiment workflow on the moderately imbalanced UNSW-NB15 dataset.

### Step 0: Basic setup
- Import dependencies
- Load the YAML config
- Print the key parameters

In [None]:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from bttwdlib import load_dataset, load_yaml_cfg, prepare_features_and_labels
from bttwdlib.cv_runner import run_kfold_experiments
from bttwdlib.utils_logging import log_info

root_path = Path('..').resolve()
cfg_path = root_path / 'configs' / 'unsw_nb15.yaml'
cfg = load_yaml_cfg(cfg_path)
log_info(f"[Config] Loaded: {cfg_path}")
print(pd.Series(cfg['THRESHOLDS']['costs'], name='cost config'))
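
The cells in this notebook read a handful of config keys (`DATA.target_col`, `DATA.positive_label`, `THRESHOLDS.costs`, `OUTPUT.results_dir`, and optionally `PREPROCESS`). The sketch below checks a config dict for those keys before any long-running step starts. The example values, the cost key names, and the `missing_keys` helper are all hypothetical; they are not part of `bttwdlib` and do not reflect the real `unsw_nb15.yaml`.

```python
# Hypothetical config that mirrors only the keys this notebook reads.
# Every value below is a placeholder, not the project's real setting.
example_cfg = {
    "DATA": {"target_col": "label", "positive_label": 1},
    "PREPROCESS": {
        "continuous_cols": ["dur"],
        "categorical_cols": ["proto"],
        "drop_cols": ["id"],
    },
    "THRESHOLDS": {"costs": {"false_negative": 5.0, "false_positive": 1.0}},
    "OUTPUT": {"results_dir": "results/unsw_nb15"},
}

# (section, key) pairs that the notebook accesses with cfg[...][...]
REQUIRED_KEYS = [
    ("DATA", "target_col"),
    ("DATA", "positive_label"),
    ("THRESHOLDS", "costs"),
    ("OUTPUT", "results_dir"),
]


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the (section, key) pairs that are absent from cfg."""
    # `or {}` also covers a YAML section that is present but empty (None).
    return [(s, k) for s, k in required if k not in (cfg.get(s) or {})]


print(missing_keys(example_cfg))
```

Calling `missing_keys(cfg)` right after `load_yaml_cfg` would surface a misnamed key as a short list rather than as a `KeyError` several cells later.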


### Step 1: Data loading and basic statistics
- Use the explicit train/test CSV split
- Inspect the sample counts and the positive/negative class ratio

In [None]:

df_raw, target_col = load_dataset(cfg)
split_col = 'split' if 'split' in df_raw.columns else None
if split_col is None:
    raise RuntimeError("The UNSW config requires an explicit train/test split (no 'split' column found)")

df_train = df_raw[df_raw[split_col].str.lower() == 'train'].reset_index(drop=True)
df_test = df_raw[df_raw[split_col].str.lower() == 'test'].reset_index(drop=True)

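log_info(f"[UNSW] Training set size: {len(df_train)}, test set size: {len(df_test)}")
# Compare against the configured positive label (as Step 2 does) so the share is valid even if labels are not 0/1
pos_label = cfg['DATA']['positive_label']
pos_train = (df_train[target_col] == pos_label).mean()
pos_test = (df_test[target_col] == pos_label).mean()
log_info(f"[UNSW] Positive share in training set: {pos_train:.2%}, in test set: {pos_test:.2%}")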

print(df_train.head())

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].bar(['Train positive', 'Train negative'], [pos_train, 1 - pos_train], color=['tomato', 'steelblue'])
ax[0].set_title('Training set class ratio')
ax[1].bar(['Test positive', 'Test negative'], [pos_test, 1 - pos_test], color=['tomato', 'steelblue'])
ax[1].set_title('Test set class ratio')
plt.tight_layout()
plt.show()
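
To put a number on "moderately imbalanced", it can help to report the majority-to-minority ratio alongside the positive share. The following is a minimal sketch on toy labels, written in plain Python so it runs without the dataset; `class_balance` is a hypothetical helper, and the toy counts are not UNSW-NB15 statistics.

```python
def class_balance(labels, positive_label=1):
    """Return sample count, positive share, and majority/minority ratio."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    n_pos = sum(1 for y in labels if y == positive_label)
    n_neg = n - n_pos
    minority = min(n_pos, n_neg)
    return {
        "n": n,
        "positive_share": n_pos / n,
        # A single-class sample has no finite imbalance ratio.
        "imbalance_ratio": max(n_pos, n_neg) / minority if minority else float("inf"),
    }


# Toy labels: 3 positives and 7 negatives (illustrative only).
toy_labels = [1] * 3 + [0] * 7
stats = class_balance(toy_labels)
print(stats)
```

With the real data, the same quantities could be obtained from `df_train[target_col].value_counts()`.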


### Step 2: Run the baseline models (LogReg / RF / KNN / XGB)
- Use the `baselines` list from the config
- Report the key metrics in a uniform format (Recall / BAC / Regret, etc.)

In [None]:

# Preprocess the training/test features
X_train, y_train, meta = prepare_features_and_labels(df_train, cfg)
prep_cfg = cfg.get('PREPROCESS', {})
bucket_cols = (prep_cfg.get('continuous_cols') or []) + (prep_cfg.get('categorical_cols') or [])
bucket_df_train = df_train[bucket_cols].reset_index(drop=True)

pipeline = meta['preprocess_pipeline']
from scipy import sparse
X_test_raw = df_test.drop(columns=list(prep_cfg.get('drop_cols', [])) + [cfg['DATA']['target_col']], errors='ignore')
X_test = pipeline.transform(X_test_raw)
if sparse.issparse(X_test):
    X_test = X_test.toarray()
bucket_df_test = df_test[bucket_cols].reset_index(drop=True)

test_labels = (df_test[cfg['DATA']['target_col']] == cfg['DATA']['positive_label']).astype(int).values
test_data = (X_test, test_labels, bucket_df_test)

log_info('[UNSW] Starting 5-fold cross-validation for BT-TWD and the baseline models...')
results = run_kfold_experiments(X_train, y_train, bucket_df_train, cfg, test_data=test_data)

metrics_path = Path(cfg['OUTPUT']['results_dir']) / 'metrics_overview.csv'
if metrics_path.exists():
    metrics_df = pd.read_csv(metrics_path)
    display(metrics_df.sort_values('Recall', ascending=False))
else:
    print('metrics_overview.csv not found; please check the run log.')
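
For readers who want to sanity-check the metric columns, here is a small sketch that computes Recall and balanced accuracy (BAC) from a confusion matrix on toy predictions. How `bttwdlib` defines Regret is not shown in this notebook, so the `mean_cost` function below is only an assumed stand-in (average misclassification cost per sample), and the cost values are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels encoded as 0/1."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def recall(tp, fn):
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the true positive rate and the true negative rate."""
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall(tp, fn) + tnr) / 2


def mean_cost(tp, fp, tn, fn, cost_fn, cost_fp):
    """Assumed stand-in for a cost metric: average error cost per sample."""
    return (cost_fn * fn + cost_fp * fp) / (tp + fp + tn + fn)


# Toy example (not UNSW-NB15 results).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(recall(tp, fn), balanced_accuracy(tp, fp, tn, fn))
print(mean_cost(tp, fp, tn, fn, cost_fn=5.0, cost_fp=1.0))
```

BAC weights both classes equally, which is why it is more informative than plain accuracy when one class dominates.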


### Step 3: BT-TWD model evaluation
- Bucket tree construction and threshold search
- Report the test-set metrics

In [None]:

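# Read the BT-TWD row directly from the metrics table produced in the previous step
if 'metrics_df' in locals():
    bttwd_row = metrics_df[metrics_df['model'] == 'BTTWD']
    print(bttwd_row)
else:
    print('Run the experiment steps above first to generate the metrics.')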
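
The threshold search itself happens inside `bttwdlib` and is not shown in this notebook. As background, and assuming the "TWD" in BT-TWD refers to three-way decisions, the sketch below shows the textbook closed-form thresholds from decision-theoretic rough sets: a sample is accepted as positive when its estimated probability is at least `alpha`, rejected when it is at most `beta`, and deferred otherwise. The cost names and values are hypothetical and this is not the library's search procedure.

```python
def three_way_thresholds(costs):
    """Closed-form (alpha, beta) from a 3x2 cost table.

    Keys are <action><true class>: P = accept, B = defer, N = reject;
    second letter P = sample is positive, N = sample is negative.
    """
    alpha = (costs["PN"] - costs["BN"]) / (
        (costs["PN"] - costs["BN"]) + (costs["BP"] - costs["PP"])
    )
    beta = (costs["BN"] - costs["NN"]) / (
        (costs["BN"] - costs["NN"]) + (costs["NP"] - costs["BP"])
    )
    return alpha, beta


def decide(p, alpha, beta):
    """Map a positive-class probability to accept / defer / reject."""
    if p >= alpha:
        return "accept"
    if p <= beta:
        return "reject"
    return "defer"


# Hypothetical costs: missing an attack (NP) is the most expensive error.
toy_costs = {"PP": 0.0, "BP": 1.0, "NP": 6.0, "PN": 2.0, "BN": 0.5, "NN": 0.0}
alpha, beta = three_way_thresholds(toy_costs)
print(alpha, beta)
print([decide(p, alpha, beta) for p in (0.7, 0.3, 0.05)])
```

Raising the cost of a missed positive lowers `beta`, so fewer samples are rejected outright; this is the mechanism by which a cost table can trade precision for recall.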


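### Step 4: Bucket structure and a brief interpretability analysis
- Print the sample counts and positive-class ratio of the buckets at each level
- Save the bucket summary as `buckets_summary.csv` in the configured results directory (results/unsw_nb15/buckets_summary.csv)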

In [None]:

bucket_metrics_path = Path(cfg['OUTPUT']['results_dir']) / 'bucket_metrics.csv'
if bucket_metrics_path.exists():
    bucket_df = pd.read_csv(bucket_metrics_path)
    print(bucket_df.head())
    out_path = Path(cfg['OUTPUT']['results_dir']) / 'buckets_summary.csv'
    bucket_df.to_csv(out_path, index=False)
    log_info(f'[BTTWD-UNSW] Bucket-level summary saved: {out_path}')
else:
    print('bucket_metrics.csv not found; you may need to run the full training pipeline first.')
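
The per-level view mentioned above can be derived from per-bucket rows by aggregating on tree depth. The sketch below does this on toy rows in plain Python. The column names (`bucket_id`, `level`, `n_samples`, `n_pos`) are assumptions for illustration and may not match the actual columns of `bucket_metrics.csv`.

```python
from collections import defaultdict

# Hypothetical per-bucket rows; the real CSV columns may differ.
toy_rows = [
    {"bucket_id": "root", "level": 0, "n_samples": 1000, "n_pos": 300},
    {"bucket_id": "root|proto=tcp", "level": 1, "n_samples": 600, "n_pos": 240},
    {"bucket_id": "root|proto=udp", "level": 1, "n_samples": 400, "n_pos": 60},
]


def summarize_by_level(rows):
    """Aggregate bucket count, sample count, and positive rate per tree level."""
    totals = defaultdict(lambda: {"n_buckets": 0, "n_samples": 0, "n_pos": 0})
    for row in rows:
        agg = totals[row["level"]]
        agg["n_buckets"] += 1
        agg["n_samples"] += row["n_samples"]
        agg["n_pos"] += row["n_pos"]
    return {
        level: {
            "n_buckets": agg["n_buckets"],
            "n_samples": agg["n_samples"],
            "pos_rate": agg["n_pos"] / agg["n_samples"] if agg["n_samples"] else 0.0,
        }
        for level, agg in sorted(totals.items())
    }


summary = summarize_by_level(toy_rows)
for level, agg in summary.items():
    print(level, agg)
```

In the toy data the two level-1 buckets have very different positive rates (0.40 and 0.15) even though the level as a whole matches the root; spotting such contrasts is the point of inspecting buckets individually as well as per level.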


### Step 5: Summary of results
- Compare BT-TWD with the baselines on Recall / BAC / Regret
- Summarize performance in the moderately imbalanced setting

In [None]:

if 'metrics_df' in locals():
    best_bac = metrics_df.loc[metrics_df['BAC'].idxmax()]
    print('[UNSW summary] Model with the highest BAC on the test set:', best_bac['model'])
    print(metrics_df[['model', 'Recall', 'BAC', 'Regret']])
else:
    print('Run the experiment steps above first to generate the metrics.')
