Online Shoppers 数据集 BT-TWD 可行性实验

本 notebook 按步骤运行：加载配置 → 读取数据 → 预处理 → 桶树划分 → 基线与 BTTWD k 折实验 → 桶级分析。

In [1]:
# 步骤0：环境与路径设置
import os, sys
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# 将项目根目录加入路径，便于导入 bttwdlib
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from bttwdlib import (
    load_yaml_cfg,
    show_cfg,
    load_dataset,
    prepare_features_and_labels,
    BucketTree,
    run_kfold_experiments,
    log_info,
    set_global_seed,
)

cfg_path = Path(root_path) / "configs" / "online_shoppers.yaml"
cfg = load_yaml_cfg(cfg_path)
set_global_seed(cfg.get('SEED', {}).get('global_seed', 42))
log_info('【步骤0摘要】环境准备完毕，路径与随机种子已设置。')

【INFO】【2025-12-11 19:37:25】【配置加载】已读取 e:\yan\组\三支决策\机器学习\BT_TWD\configs\online_shoppers.yaml
【INFO】【2025-12-11 19:37:30】【步骤0摘要】环境准备完毕，路径与随机种子已设置。


In [2]:
# 步骤1：加载配置
show_cfg(cfg)
log_info('【步骤1摘要】配置文件加载完成，关键参数检查通过。')

【INFO】【2025-12-11 19:37:30】【配置-数据】数据集=online_shoppers, k折=5, 目标列=Revenue, 正类="True"
【INFO】【2025-12-11 19:37:30】【配置-BTTWD】阈值模式=None, 全局模型=xgb, 桶内模型=none, 后验估计器(兼容字段)=logreg
【INFO】【2025-12-11 19:37:30】【配置-基线】LogReg启用=True, RandomForest启用=True, KNN启用=True, XGBoost启用=True
【INFO】【2025-12-11 19:37:30】【步骤1摘要】配置文件加载完成，关键参数检查通过。


In [3]:
# 步骤2：加载原始数据
df_raw, target_col_model = load_dataset(cfg)  # 这里返回的是用于建模的标签列，例如 "label"

display(df_raw.head())
print("用于建模的标签列:", target_col_model)

# 1）画 0/1 标签（购买/未购买）的比例
class_counts = df_raw[target_col_model].value_counts(normalize=True)
ax = class_counts.plot(kind='bar', title='购买 vs 未购买比例')
plt.ylabel('比例')

fig_path = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'class_distribution.png')
os.makedirs(os.path.dirname(fig_path), exist_ok=True)
plt.savefig(fig_path, bbox_inches='tight')
plt.close()

# 2）如果想看原始标签列的分布，可以另外单独分析：
raw_target_col = cfg['DATA']['target_col']  # 这里是原始标签列
print("原始目标列:", raw_target_col)
print(df_raw[raw_target_col].describe())

log_info('【步骤2摘要】Online Shoppers 原始数据加载与基本统计完成。')


【INFO】【2025-12-11 19:37:30】【数据加载】文本表格 ..\data\shopper\online_shoppers_intention.csv 已读取，样本数=12330，列数=18
【INFO】【2025-12-11 19:37:30】【数据集信息】名称=online_shoppers，样本数=12330，目标列=Revenue，正类比例=15.47%


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,0
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,0
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,0
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,0
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,0


用于建模的标签列: Revenue
原始目标列: Revenue
count    12330.000000
mean         0.154745
std          0.361676
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Revenue, dtype: float64
【INFO】【2025-12-11 19:37:30】【步骤2摘要】Online Shoppers 原始数据加载与基本统计完成。


In [4]:
# 步骤3：预处理与特征工程
X, y, meta = prepare_features_and_labels(df_raw, cfg)
log_info(f'【预处理】编码特征维度={X.shape[1]}，样本数={X.shape[0]}')
log_info(f"【步骤3摘要】特征预处理完成：连续={len(meta['continuous_cols'])}，类别={len(meta['categorical_cols'])}，编码维度={X.shape[1]}。")

【INFO】【2025-12-11 19:37:30】【预处理】连续特征=10个，类别特征=7个
【INFO】【2025-12-11 19:37:30】【预处理】编码后维度=68
【INFO】【2025-12-11 19:37:30】【预处理】编码特征维度=68，样本数=12330
【INFO】【2025-12-11 19:37:30】【步骤3摘要】特征预处理完成：连续=10，类别=7，编码维度=68。


In [5]:
# 步骤4：构建桶树并检查划分
feature_cols_for_bucket = [c for c in df_raw.columns if c != target_col_model]

bucket_tree = BucketTree(
    cfg['BTTWD']['bucket_levels'],
    feature_names=feature_cols_for_bucket
)

bucket_ids_full = bucket_tree.assign_buckets(df_raw[feature_cols_for_bucket])

bucket_df = bucket_ids_full.value_counts().reset_index()
bucket_df.columns = ['bucket_id', 'count']

bucket_df['pos_rate'] = (
    df_raw.groupby(bucket_ids_full)[target_col_model]
    .apply(lambda s: (s == 1).mean())
    .values
)
display(bucket_df.head())
bucket_df.set_index('bucket_id')['count'].plot(kind='bar', figsize=(12,4), title='桶样本数分布')
fig_bucket = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'bucket_metrics_bar.png')
plt.savefig(fig_bucket, bbox_inches='tight')
plt.close()
log_info(f'【步骤4摘要】桶树划分完成，共有 {bucket_ids_full.nunique()} 个叶子桶。')


【INFO】【2025-12-11 19:37:30】【桶树】已为样本生成桶ID，共 28 个组合


Unnamed: 0,bucket_id,count,pos_rate
0,L1_VisitorType=returning|L2_Month=high_season|...,1922,0.0
1,L1_VisitorType=returning|L2_Month=mid_season|L...,1749,0.076923
2,L1_VisitorType=returning|L2_Month=low_season|L...,1300,0.0
3,L1_VisitorType=returning|L2_Month=mid_season|L...,1164,0.258621
4,L1_VisitorType=returning|L2_Month=high_season|...,1145,0.333333


【INFO】【2025-12-11 19:37:31】【步骤4摘要】桶树划分完成，共有 28 个叶子桶。


In [6]:
# 步骤5：运行基线模型 k 折实验
# 基线部分在 run_kfold_experiments 内统一调度
log_info('【步骤5】基线模型将在整体交叉验证中一并运行。')
log_info('【步骤5摘要】基线模型性能将作为后续对比基准。')

【INFO】【2025-12-11 19:37:31】【步骤5】基线模型将在整体交叉验证中一并运行。
【INFO】【2025-12-11 19:37:31】【步骤5摘要】基线模型性能将作为后续对比基准。


In [7]:
import numpy as np
import pandas as pd

print("y 全局标签分布：", np.unique(y, return_counts=True))

print("原始标签列分布：")
print(df_raw[cfg['DATA']['target_col']].value_counts())


y 全局标签分布： (array([0, 1]), array([10422,  1908], dtype=int64))
原始标签列分布：
Revenue
0    10422
1     1908
Name: count, dtype: int64


In [8]:
# 步骤6：运行 BTTWD k 折实验（含基线）
results = run_kfold_experiments(X, y, df_raw.drop(columns=[cfg['DATA']['target_col']]), cfg)
summary_df = pd.read_csv(os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'metrics_kfold_summary.csv'))
display(summary_df)
summary_df.plot(x='model', kind='bar', figsize=(8,4), title='模型指标对比')
fig_compare = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'metrics_compare.png')
plt.savefig(fig_compare, bbox_inches='tight')
plt.close()
log_info('【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。')

【INFO】【2025-12-11 19:37:31】【基线-LogReg】使用决策阈值=0.400（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-11 19:37:34】【基线-LogReg】整体指标：AUC_mean=0.893, AUC_std=0.011, BAC_mean=0.710, BAC_std=0.015, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.555, F1_std=0.028, Kappa_mean=0.495, Kappa_std=0.030, MCC_mean=0.513, MCC_std=0.030, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.721, Precision_std=0.031, Recall_mean=0.452, Recall_std=0.030, Regret_mean=0.366, Regret_std=0.019
【INFO】【2025-12-11 19:37:34】【基线-RF】使用决策阈值=0.400（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-11 19:37:38】【基线-RF】整体指标：AUC_mean=0.926, AUC_std=0.006, BAC_mean=0.804, BAC_std=0.014, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.671, F1_std=0.018, Kappa_mean=0.612, Kappa_std=0.021, MCC_mean=0.612, MCC_std=0.021, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.678, Precision_std=0.018, Recall_mean=0.665, Recall_std=0.028, Regret_mean=0.256, Regret_std=0.017
【INFO】【2025-12-11 19:37:38】【基线-KNN】使用决策阈值=0.400（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:37:41】【基线-KNN】整体指标：AUC_mean=0.808, AUC_std=0.013, BAC_mean=0.701, BAC_std=0.014, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.524, F1_std=0.025, Kappa_mean=0.453, Kappa_std=0.027, MCC_mean=0.461, MCC_std=0.027, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.623, Precision_std=0.024, Recall_mean=0.453, Recall_std=0.027, Regret_mean=0.381, Regret_std=0.017
【INFO】【2025-12-11 19:37:41】【基线-XGB】使用决策阈值=0.400（fixed 模式）


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:37:43】【基线-XGB】整体指标：AUC_mean=0.930, AUC_std=0.005, BAC_mean=0.808, BAC_std=0.010, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.676, F1_std=0.016, Kappa_mean=0.617, Kappa_std=0.019, MCC_mean=0.617, MCC_std=0.019, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.676, Precision_std=0.019, Recall_mean=0.676, Recall_std=0.020, Regret_mean=0.251, Regret_std=0.013
【INFO】【2025-12-11 19:37:43】【K折实验】正在执行第 1/5 折...
【INFO】【2025-12-11 19:37:43】[BT] 使用桶评分配置：mode=f1_regret_bnd, f1_weight=1.0, regret_weight=1.0, bnd_weight=0.5
【INFO】【2025-12-11 19:37:44】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 19:37:44] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=9864
[INFO][BT][2025-12-11 19:37:44] 创建桶 bucket_id=L1_VisitorType=new，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="new"，n_samples=1373
[INFO][BT][2025-12-11 19:37:44] 创建桶 bucket_id=L1_VisitorType=returning，level=1，



【INFO】【2025-12-11 19:37:45】【阈值】桶 ROOT（n_val=605）使用本地阈值 α=0.2000, β=0.1000
[INFO][BT][2025-12-11 19:37:45] 桶 bucket_id=ROOT level=0：
    n_train=1420, n_val=605,
    BAC=0.772, F1=0.622, AUC=0.907,
    Regret=0.263, BND_ratio=0.060, POS_coverage=0.205,
    Score(f1_regret_bnd )=0.329
【INFO】【2025-12-11 19:37:46】【阈值】桶 L1_VisitorType=new（n_val=87）使用本地阈值 α=0.2000, β=0.1000
[INFO][BT][2025-12-11 19:37:46] 桶 bucket_id=L1_VisitorType=new level=1：
    n_train=204, n_val=87,
    BAC=0.815, F1=0.764, AUC=0.930,
    Regret=0.282, BND_ratio=0.057, POS_coverage=0.345,
    Score(f1_regret_bnd )=0.453
[INFO][BT][2025-12-11 19:37:46] 桶 bucket_id=L1_VisitorType=new：
    parent_id=ROOT，parent_Score=0.329, bucket_Score=0.453,
    Gain=+0.124, is_weak=False
【INFO】【2025-12-11 19:37:46】【阈值】桶 L1_VisitorType=returning 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 19:37:46] 桶 bucket_id=L1_VisitorType=returning level=1：
    n_train=1178, n_val=556,
    BAC=0.680, F1=0.519, AUC=0.864,
    Regret=0.311, BND_ratio=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:37:54】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=9864
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=L1_VisitorType=new，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="new"，n_samples=1348
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=L1_VisitorType=returning，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="returning"，n_samples=8445
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=L1_VisitorType=others，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="others"，n_samples=71
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=L1_VisitorType=new|L2_Month=high_season，level=2，parent_id=L1_VisitorType=new，split_name=L2_Month，split_type=categorical_group，split_rule="high_season"，n_samples=597
[INFO][BT][2025-12-11 19:37:54] 创建桶 bucket_id=L1_VisitorType



【INFO】【2025-12-11 19:37:56】【阈值】桶 ROOT（n_val=607）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-11 19:37:56] 桶 bucket_id=ROOT level=0：
    n_train=1421, n_val=607,
    BAC=0.790, F1=0.678, AUC=0.926,
    Regret=0.222, BND_ratio=0.051, POS_coverage=0.152,
    Score(f1_regret_bnd )=0.430
【INFO】【2025-12-11 19:37:56】【阈值】桶 L1_VisitorType=new（n_val=88）使用本地阈值 α=0.4000, β=0.3000
[INFO][BT][2025-12-11 19:37:56] 桶 bucket_id=L1_VisitorType=new level=1：
    n_train=197, n_val=88,
    BAC=0.881, F1=0.857, AUC=0.907,
    Regret=0.284, BND_ratio=0.000, POS_coverage=0.250,
    Score(f1_regret_bnd )=0.573
[INFO][BT][2025-12-11 19:37:56] 桶 bucket_id=L1_VisitorType=new：
    parent_id=ROOT，parent_Score=0.430, bucket_Score=0.573,
    Gain=+0.143, is_weak=False
【INFO】【2025-12-11 19:37:56】【阈值】桶 L1_VisitorType=returning 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 19:37:56] 桶 bucket_id=L1_VisitorType=returning level=1：
    n_train=1183, n_val=562,
    BAC=0.753, F1=0.577, AUC=0.917,
    Regret=0.210, BND_ratio=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:38:04】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=9864
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=L1_VisitorType=new，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="new"，n_samples=1344
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=L1_VisitorType=returning，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="returning"，n_samples=8448
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=L1_VisitorType=others，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="others"，n_samples=72
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=L1_VisitorType=new|L2_Month=high_season，level=2，parent_id=L1_VisitorType=new，split_name=L2_Month，split_type=categorical_group，split_rule="high_season"，n_samples=606
[INFO][BT][2025-12-11 19:38:04] 创建桶 bucket_id=L1_VisitorType



【INFO】【2025-12-11 19:38:05】【阈值】桶 ROOT（n_val=609）使用本地阈值 α=0.2000, β=0.1000
[INFO][BT][2025-12-11 19:38:05] 桶 bucket_id=ROOT level=0：
    n_train=1420, n_val=609,
    BAC=0.853, F1=0.722, AUC=0.951,
    Regret=0.172, BND_ratio=0.048, POS_coverage=0.213,
    Score(f1_regret_bnd )=0.527
【INFO】【2025-12-11 19:38:06】【阈值】桶 L1_VisitorType=new（n_val=87）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-11 19:38:06] 桶 bucket_id=L1_VisitorType=new level=1：
    n_train=199, n_val=87,
    BAC=0.885, F1=0.852, AUC=0.924,
    Regret=0.236, BND_ratio=0.011, POS_coverage=0.299,
    Score(f1_regret_bnd )=0.610
[INFO][BT][2025-12-11 19:38:06] 桶 bucket_id=L1_VisitorType=new：
    parent_id=ROOT，parent_Score=0.527, bucket_Score=0.610,
    Gain=+0.083, is_weak=False
【INFO】【2025-12-11 19:38:06】【阈值】桶 L1_VisitorType=returning 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 19:38:06] 桶 bucket_id=L1_VisitorType=returning level=1：
    n_train=1183, n_val=555,
    BAC=0.809, F1=0.600, AUC=0.911,
    Regret=0.195, BND_ratio=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:38:15】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=9864
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=L1_VisitorType=new，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="new"，n_samples=1360
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=L1_VisitorType=returning，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="returning"，n_samples=8442
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=L1_VisitorType=others，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="others"，n_samples=62
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=L1_VisitorType=new|L2_Month=high_season，level=2，parent_id=L1_VisitorType=new，split_name=L2_Month，split_type=categorical_group，split_rule="high_season"，n_samples=602
[INFO][BT][2025-12-11 19:38:15] 创建桶 bucket_id=L1_VisitorType



【INFO】【2025-12-11 19:38:16】【阈值】桶 ROOT（n_val=604）使用本地阈值 α=0.2000, β=0.1000
[INFO][BT][2025-12-11 19:38:16] 桶 bucket_id=ROOT level=0：
    n_train=1417, n_val=604,
    BAC=0.771, F1=0.661, AUC=0.931,
    Regret=0.241, BND_ratio=0.091, POS_coverage=0.199,
    Score(f1_regret_bnd )=0.374
【INFO】【2025-12-11 19:38:16】【阈值】桶 L1_VisitorType=new 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 19:38:16] 桶 bucket_id=L1_VisitorType=new level=1：
    n_train=208, n_val=80,
    BAC=0.744, F1=0.727, AUC=0.871,
    Regret=0.425, BND_ratio=0.100, POS_coverage=0.250,
    Score(f1_regret_bnd )=0.252
[INFO][BT][2025-12-11 19:38:16] 桶 bucket_id=L1_VisitorType=new：
    parent_id=ROOT，parent_Score=0.374, bucket_Score=0.252,
    Gain=-0.122, is_weak=True
【INFO】【2025-12-11 19:38:17】【阈值】桶 L1_VisitorType=returning（n_val=567）使用本地阈值 α=0.2000, β=0.1000
[INFO][BT][2025-12-11 19:38:17] 桶 bucket_id=L1_VisitorType=returning level=1：
    n_train=1176, n_val=567,
    BAC=0.794, F1=0.642, AUC=0.922,
    Regret=0.243, BND_ratio=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:38:23】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=9864
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=L1_VisitorType=new，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="new"，n_samples=1351
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=L1_VisitorType=returning，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="returning"，n_samples=8445
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=L1_VisitorType=others，level=1，parent_id=ROOT，split_name=L1_VisitorType，split_type=categorical_group，split_rule="others"，n_samples=68
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=L1_VisitorType=new|L2_Month=high_season，level=2，parent_id=L1_VisitorType=new，split_name=L2_Month，split_type=categorical_group，split_rule="high_season"，n_samples=608
[INFO][BT][2025-12-11 19:38:23] 创建桶 bucket_id=L1_VisitorType



【INFO】【2025-12-11 19:38:24】【阈值】桶 ROOT（n_val=605）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-11 19:38:24] 桶 bucket_id=ROOT level=0：
    n_train=1420, n_val=605,
    BAC=0.780, F1=0.671, AUC=0.917,
    Regret=0.231, BND_ratio=0.051, POS_coverage=0.136,
    Score(f1_regret_bnd )=0.414
【INFO】【2025-12-11 19:38:25】【阈值】桶 L1_VisitorType=new（n_val=83）使用本地阈值 α=0.5000, β=0.1000
[INFO][BT][2025-12-11 19:38:25] 桶 bucket_id=L1_VisitorType=new level=1：
    n_train=204, n_val=83,
    BAC=0.814, F1=0.848, AUC=0.905,
    Regret=0.223, BND_ratio=0.108, POS_coverage=0.169,
    Score(f1_regret_bnd )=0.571
[INFO][BT][2025-12-11 19:38:25] 桶 bucket_id=L1_VisitorType=new：
    parent_id=ROOT，parent_Score=0.414, bucket_Score=0.571,
    Gain=+0.157, is_weak=False
【INFO】【2025-12-11 19:38:25】【阈值】桶 L1_VisitorType=returning 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 19:38:25] 桶 bucket_id=L1_VisitorType=returning level=1：
    n_train=1176, n_val=510,
    BAC=0.811, F1=0.644, AUC=0.921,
    Regret=0.218, BND_ratio=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 19:38:31】[BASELINE] 全局 XGB 模型训练完成
【INFO】【2025-12-11 19:38:31】[BASELINE] 阈值搜索开始
【INFO】【2025-12-11 19:38:31】[BASELINE] 最佳阈值找到: alpha=0.3000, beta=0.1000, regret=0.2135
【INFO】【2025-12-11 19:38:31】【桶树】已为样本生成桶ID，共 28 个组合
【INFO】【2025-12-11 19:38:31】[BASELINE] 测试集桶映射完成，共 28 个桶


  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=returning|L2_Month=mid_season|L3_Region=region_5_plus: BAC=0.7623, Regret=0.1610, Precision=0.7333, Recall=0.5500
【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=new|L2_Month=low_season|L3_Region=region_3_4: BAC=0.9667, Regret=0.0789, Precision=0.8000, Recall=1.0000
【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=returning|L2_Month=low_season|L3_Region=region_5_plus: BAC=0.9567, Regret=0.0474, Precision=0.9231, Recall=0.9231
【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=returning|L2_Month=low_season|L3_Region=region_3_4: BAC=0.8990, Regret=0.1423, Precision=0.4400, Recall=0.9167
【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=new|L2_Month=high_season|L3_Region=region_3_4: BAC=0.7442, Regret=0.3804, Precision=0.7500, Recall=0.5455
【INFO】【2025-12-11 19:38:32】[BASELINE] 桶 L1_VisitorType=returning|L2_Month=high_season|L3_Region=region_1_2: BAC=0.7563, Regret=0.3769, Precision=0.5536, Recall=0.6739
【INFO】【2

  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


Unnamed: 0,model,Precision_mean,Precision_std,Recall_mean,Recall_std,F1_mean,F1_std,BAC_mean,BAC_std,AUC_mean,...,MCC_mean,MCC_std,Kappa_mean,Kappa_std,BND_ratio_mean,BND_ratio_std,POS_Coverage_mean,POS_Coverage_std,Regret_mean,Regret_std
0,BTTWD,0.568936,0.021415,0.79297,0.030985,0.662324,0.022103,0.841408,0.016335,0.927485,...,0.60045,0.027441,0.588093,0.027168,0.066423,0.007523,0.183455,0.020059,0.22794,0.019542
1,LogReg,0.720862,0.031437,0.452283,0.029923,0.555333,0.027949,0.710069,0.015189,0.893282,...,0.513061,0.02962,0.495192,0.030444,0.0,0.0,,,0.36618,0.018833
2,RandomForest,0.677831,0.017713,0.665078,0.028105,0.671061,0.018497,0.803562,0.013506,0.925705,...,0.611817,0.020947,0.611569,0.021102,0.0,0.0,,,0.256285,0.016522
3,KNN,0.623272,0.02383,0.452809,0.027446,0.5243,0.024916,0.701361,0.014094,0.807894,...,0.460997,0.026879,0.453174,0.027451,0.0,0.0,,,0.381022,0.01749
4,XGBoost,0.676468,0.019417,0.676093,0.019574,0.676105,0.016387,0.808397,0.010278,0.930428,...,0.616933,0.019353,0.616822,0.019441,0.0,0.0,,,0.250608,0.013115


【INFO】【2025-12-11 19:38:32】【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。


In [9]:
# 步骤7：桶级别分析
bucket_metrics_path = os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'bucket_metrics.csv')
if os.path.exists(bucket_metrics_path):
    bucket_metrics_df = pd.read_csv(bucket_metrics_path)
    display(bucket_metrics_df.head())
    bucket_metrics_df.plot(x='bucket_id', y='pos_rate_all', kind='bar', figsize=(12,4), title='桶正类比例')
    plt.ylabel('正类比例')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig(fig_bucket, bbox_inches='tight')
    plt.close()
log_info('【步骤7摘要】桶级指标已整理，可用于局部化分析。')

Unnamed: 0,bucket_id,layer,parent_bucket_id,n_train,n_val,pos_rate_train,pos_rate_val,alpha,beta,regret_val,...,is_weak,threshold_source_bucket,parent_with_threshold,n_test,pos_rate_test,BND_ratio_test,POS_Coverage_test,regret_test,fold,pos_rate
0,ROOT,L1,,1420,605,0.148592,0.161983,0.2,0.1,0.26281,...,False,ROOT,,,,,,,1,0.154805
1,L1_VisitorType=returning,L1,ROOT,1178,556,0.129032,0.131295,0.2,0.1,0.311151,...,True,ROOT,,,,,,,1,0.138295
2,L1_VisitorType=returning|L2_Month=high_season,L2,L1_VisitorType=returning,442,181,0.19457,0.237569,0.2,0.1,0.414365,...,True,ROOT,,,,,,,1,0.194755
3,L1_VisitorType=returning|L2_Month=mid_season,L2,L1_VisitorType=returning,411,187,0.097324,0.144385,0.2,0.1,0.254011,...,False,L1_VisitorType=returning|L2_Month=mid_season,,,,,,,1,0.097634
4,L1_VisitorType=returning|L2_Month=low_season,L2,L1_VisitorType=returning,296,121,0.141892,0.057851,0.2,0.1,0.057851,...,False,L1_VisitorType=returning|L2_Month=low_season,,,,,,,1,0.11956


  plt.tight_layout()


【INFO】【2025-12-11 19:38:34】【步骤7摘要】桶级指标已整理，可用于局部化分析。


In [None]:
# 步骤8：结果汇总
log_info('【步骤8】检查结果文件与图表。')
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['results_dir'])))
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['figs_dir'])))
log_info('【全部步骤完成】Online Shoppers 数据集上的 BT-TWD 实验结束。')

【INFO】【2025-12-11 19:38:34】【步骤8】检查结果文件与图表。
['baseline_bucket_metrics.csv', 'bucket_fallback_stats.csv', 'bucket_metrics.csv', 'bucket_metrics_gain.csv', 'bucket_thresholds.csv', 'bucket_thresholds_per_fold.csv', 'bucket_tree_structure.csv', 'metrics_kfold_per_fold.csv', 'metrics_kfold_summary.csv', 'metrics_overview.csv']
['bank_class_distribution.png', 'bucket_metrics_bar.png', 'class_distribution.png', 'metrics_compare.png']
【INFO】【2025-12-11 19:38:34】【全部步骤完成】Online Shoppers 数据集上的 BT-TWD 实验结束。


: 