 数据集 BT-TWD 可行性实验

本 notebook 按步骤运行：加载配置 → 读取数据 → 预处理 → 桶树划分 → 基线与 BTTWD k 折实验 → 桶级分析。

In [1]:
# 步骤0：环境与路径设置
import os, sys
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# 将项目根目录加入路径，便于导入 bttwdlib
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from bttwdlib import (
    load_yaml_cfg,
    show_cfg,
    load_dataset,
    prepare_features_and_labels,
    BucketTree,
    run_kfold_experiments,
    log_info,
    set_global_seed,
)

cfg_path = os.path.join(root_path, 'configs', 'synth_strong_v2.yaml')
cfg = load_yaml_cfg(cfg_path)
set_global_seed(cfg.get('SEED', {}).get('global_seed', 42))
log_info('【步骤0摘要】环境准备完毕，路径与随机种子已设置。')

【INFO】【2025-12-16 18:56:12】【配置加载】已读取 e:\yan\组\三支决策\机器学习\BT_TWD\configs\synth_strong_v2.yaml
【INFO】【2025-12-16 18:56:16】【步骤0摘要】环境准备完毕，路径与随机种子已设置。


In [2]:
# 步骤1：加载配置
show_cfg(cfg)
log_info('【步骤1摘要】配置文件加载完成，关键参数检查通过。')

【INFO】【2025-12-16 18:56:17】【配置-数据】数据集=synth_strong_v2, k折=None, 目标列=target, 正类="1"
【INFO】【2025-12-16 18:56:17】【配置-BTTWD】阈值模式=bucket_wise, 全局模型=xgb, 桶内模型=lr, 后验估计器(兼容字段)=logreg
【INFO】【2025-12-16 18:56:17】【配置-基线】LogReg启用=True, RandomForest启用=True, KNN启用=True, XGBoost启用=True
【INFO】【2025-12-16 18:56:17】【步骤1摘要】配置文件加载完成，关键参数检查通过。


In [3]:
# 步骤2：加载原始数据
df_raw, target_col_model = load_dataset(cfg)  # 这里返回的是用于建模的标签列，例如 "label"

display(df_raw.head())
print("用于建模的标签列:", target_col_model)

# 1）画 0/1 标签（延误/不延误）的比例
class_counts = df_raw[target_col_model].value_counts(normalize=True)
ax = class_counts.plot(kind='bar', title='正 vs 负比例')
plt.ylabel('比例')

fig_path = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'class_distribution.png')
os.makedirs(os.path.dirname(fig_path), exist_ok=True)
plt.savefig(fig_path, bbox_inches='tight')
plt.close()

# 2）如果想看原始 DepDelay 的分布，可以另外单独分析：
raw_target_col = cfg['DATA']['target_col']  # 这里是 "DepDelay"
print("原始目标列:", raw_target_col)
print(df_raw[raw_target_col].describe())

log_info('【步骤2摘要】Airlines 原始数据加载与基本统计完成。')


【INFO】【2025-12-16 18:56:18】【合成数据v2加载】文件=..\data\synth\synth_strong_v2.csv，样本数=200000，全局正例率=26.06%
【INFO】【2025-12-16 18:56:18】组别 A: 样本数=49807，正例率=12.40%
【INFO】【2025-12-16 18:56:18】组别 B: 样本数=49948，正例率=20.91%
【INFO】【2025-12-16 18:56:18】组别 C: 样本数=50323，正例率=28.39%
【INFO】【2025-12-16 18:56:18】组别 D: 样本数=49922，正例率=42.48%
【INFO】【2025-12-16 18:56:18】【数据加载】检测到 synth_strong_v2 元数据 ..\data\synth\synth_strong_v2_meta.json，已注入 500 条桶级 cost 配置
【INFO】【2025-12-16 18:56:18】【数据集信息】名称=synth_strong_v2，样本数=200000，目标列=target，正类比例=26.06%


Unnamed: 0,target,group,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,z1,z2,z3,z4,z5
0,0,A,1.064588,-1.305676,0.557827,0.009463,2.368387,0.719543,0.007857,1.028664,0.58272,0.761297,1.054418,-0.801457,0.300131,-0.073666,-1.663032
1,0,D,0.512215,-0.375004,-0.383665,0.704062,-0.619332,0.18144,0.511407,-0.562699,0.456859,0.038571,0.511402,-0.505405,-1.402719,-0.680446,-0.0827
2,0,C,-0.370929,0.782064,-1.249109,-1.254847,0.056107,-1.11155,1.299804,-0.322134,-1.384439,-0.148576,-0.67066,0.892483,-0.741191,-1.742448,-0.781772
3,0,B,1.791067,-0.872657,-0.609798,0.56537,1.414436,-0.988554,-0.749953,-1.541402,-0.011446,0.252859,-0.328013,-0.757283,-0.919904,-1.548651,-1.364378
4,1,B,0.917348,0.4332,0.110524,0.265732,0.60297,-1.176482,0.260115,-1.65568,1.147569,1.809379,-0.471701,-0.053854,-0.310862,0.740968,1.703402


用于建模的标签列: target
原始目标列: target
count    200000.000000
mean          0.260570
std           0.438947
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: target, dtype: float64
【INFO】【2025-12-16 18:56:19】【步骤2摘要】Airlines 原始数据加载与基本统计完成。


In [4]:
# 步骤3：预处理与特征工程
X, y, meta = prepare_features_and_labels(df_raw, cfg)
log_info(f'【预处理】编码特征维度={X.shape[1]}，样本数={X.shape[0]}')
log_info(f"【步骤3摘要】特征预处理完成：连续={len(meta['continuous_cols'])}，类别={len(meta['categorical_cols'])}，编码维度={X.shape[1]}。")

【INFO】【2025-12-16 18:56:19】【预处理】连续特征=15个，类别特征=1个
【INFO】【2025-12-16 18:56:19】【预处理】编码后维度=18
【INFO】【2025-12-16 18:56:19】【预处理】编码特征维度=18，样本数=200000
【INFO】【2025-12-16 18:56:19】【步骤3摘要】特征预处理完成：连续=15，类别=1，编码维度=18。


In [5]:
# 步骤4：构建桶树并检查划分
bucket_tree = BucketTree(cfg['BTTWD']['bucket_levels'], feature_names=df_raw.drop(columns=[cfg['DATA']['target_col']]).columns.tolist())
bucket_ids_full = bucket_tree.assign_buckets(df_raw.drop(columns=[cfg['DATA']['target_col']]))
bucket_df = bucket_ids_full.value_counts().reset_index()
bucket_df.columns = ['bucket_id', 'count']
bucket_df['pos_rate'] = df_raw.groupby(bucket_ids_full)[cfg['DATA']['target_col']].apply(lambda s: (s == cfg['DATA']['positive_label']).mean()).values
display(bucket_df.head())
bucket_df.set_index('bucket_id')['count'].plot(kind='bar', figsize=(12,4), title='桶样本数分布')
fig_bucket = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'bucket_metrics_bar.png')
plt.savefig(fig_bucket, bbox_inches='tight')
plt.close()
log_info(f'【步骤4摘要】桶树划分完成，共有 {bucket_ids_full.nunique()} 个叶子桶。')

【INFO】【2025-12-16 18:56:19】【桶树】已为样本生成桶ID，共 500 个组合


Unnamed: 0,bucket_id,count,pos_rate
0,L1_group=C|L2_x1=b3|L3_x2=b2|L4_x3=b2,967,0.076923
1,L1_group=A|L2_x1=b3|L3_x2=b2|L4_x3=b2,949,0.050209
2,L1_group=A|L2_x1=b2|L3_x2=b2|L4_x3=b2,937,0.119266
3,L1_group=D|L2_x1=b4|L3_x2=b2|L4_x3=b2,935,0.07947
4,L1_group=C|L2_x1=b2|L3_x2=b2|L4_x3=b2,925,0.181818


【INFO】【2025-12-16 18:56:25】【步骤4摘要】桶树划分完成，共有 500 个叶子桶。


In [6]:
# 步骤5：运行基线模型 k 折实验
# 基线部分在 run_kfold_experiments 内统一调度
log_info('【步骤5】基线模型将在整体交叉验证中一并运行。')
log_info('【步骤5摘要】基线模型性能将作为后续对比基准。')

【INFO】【2025-12-16 18:56:25】【步骤5】基线模型将在整体交叉验证中一并运行。
【INFO】【2025-12-16 18:56:25】【步骤5摘要】基线模型性能将作为后续对比基准。


In [7]:
import numpy as np
import pandas as pd

print("y 全局标签分布：", np.unique(y, return_counts=True))

print("原始 income 列分布：")
print(df_raw[cfg['DATA']['target_col']].value_counts())


y 全局标签分布： (array([0, 1]), array([147886,  52114], dtype=int64))
原始 income 列分布：
target
0    147886
1     52114
Name: count, dtype: int64


In [9]:
# 步骤6：运行 BTTWD k 折实验（含基线）
results = run_kfold_experiments(X, y, df_raw.drop(columns=[cfg['DATA']['target_col']]), cfg)
summary_df = pd.read_csv(os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'metrics_kfold_summary.csv'))
display(summary_df)
summary_df.plot(x='model', kind='bar', figsize=(8,4), title='模型指标对比')
fig_compare = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'metrics_compare.png')
plt.savefig(fig_compare, bbox_inches='tight')
plt.close()
log_info('【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。')

【INFO】【2025-12-16 19:04:06】【基线-LogReg】使用决策阈值=0.300（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-16 19:04:08】【基线-LogReg】整体指标：AUC_mean=0.811, AUC_std=0.002, BAC_mean=0.735, BAC_std=0.004, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.596, F1_std=0.005, Kappa_mean=0.426, Kappa_std=0.007, MCC_mean=0.434, MCC_std=0.007, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.523, Precision_std=0.004, Recall_mean=0.693, Recall_std=0.004, Regret_mean=0.404, Regret_std=0.005
【INFO】【2025-12-16 19:04:08】【基线-RF】使用决策阈值=0.300（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-16 19:07:12】【基线-RF】整体指标：AUC_mean=0.871, AUC_std=0.002, BAC_mean=0.798, BAC_std=0.002, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.678, F1_std=0.003, Kappa_mean=0.544, Kappa_std=0.003, MCC_mean=0.554, MCC_std=0.004, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.601, Precision_std=0.002, Recall_mean=0.778, Recall_std=0.004, Regret_mean=0.308, Regret_std=0.003
【INFO】【2025-12-16 19:07:12】【基线-KNN】使用决策阈值=0.300（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:07:35】【基线-KNN】整体指标：AUC_mean=0.786, AUC_std=0.003, BAC_mean=0.720, BAC_std=0.004, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.572, F1_std=0.004, Kappa_mean=0.372, Kappa_std=0.006, MCC_mean=0.393, MCC_std=0.007, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.468, Precision_std=0.003, Recall_mean=0.736, Recall_std=0.006, Regret_mean=0.425, Regret_std=0.006
【INFO】【2025-12-16 19:07:35】【基线-XGB】使用决策阈值=0.300（fixed 模式）


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-16 19:07:41】【基线-XGB】整体指标：AUC_mean=0.878, AUC_std=0.002, BAC_mean=0.802, BAC_std=0.003, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.688, F1_std=0.004, Kappa_mean=0.562, Kappa_std=0.006, MCC_mean=0.568, MCC_std=0.005, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.624, Precision_std=0.004, Recall_mean=0.768, Recall_std=0.004, Regret_mean=0.302, Regret_std=0.004
【INFO】【2025-12-16 19:07:41】【K折实验】正在执行第 1/5 折...
【INFO】【2025-12-16 19:07:41】[BT] 使用桶评分配置：mode=f1_regret_bnd, f1_weight=1.0, regret_weight=1.0, bnd_weight=0.5


Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:07:42】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=160000
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=L1_group=A，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="A"，n_samples=39874
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=L1_group=B，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="B"，n_samples=39988
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=L1_group=C，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="C"，n_samples=40123
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=L1_group=D，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="D"，n_samples=40015
[INFO][BT][2025-12-16 19:07:44] 创建桶 bucket_id=L1_group=A|L2_x1=b1，level=2，parent_id=L1_group=A，split_name=L2_x1，split_type=numeric_bin，split_rule="b1"，n_samples=4578
[INF



[INFO][BT][2025-12-16 19:07:58] 桶 ROOT采样：原始样本 N=22398 → 使用样本 n=15678
【INFO】【2025-12-16 19:07:58】【阈值】桶 ROOT（n_val=9599）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:07:58] 桶 bucket_id=ROOT level=0：
    n_train=22398, n_val=9599,
    BAC=0.588, F1=0.568, AUC=0.812,
    Regret=0.396, BND_ratio=0.247, POS_coverage=0.233,
    Score(f1_regret_bnd )=0.048
[INFO][BT][2025-12-16 19:07:58] 桶 L1_group=A采样：原始样本 N=5570 → 使用样本 n=3898
【INFO】【2025-12-16 19:07:59】【阈值】桶 L1_group=A（n_val=2402）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:07:59] 桶 bucket_id=L1_group=A level=1：
    n_train=5570, n_val=2402,
    BAC=0.630, F1=0.488, AUC=0.828,
    Regret=0.233, BND_ratio=0.145, POS_coverage=0.092,
    Score(f1_regret_bnd )=0.183
[INFO][BT][2025-12-16 19:07:59] 桶 bucket_id=L1_group=A：
    parent_id=ROOT，parent_Score=0.048, bucket_Score=0.183,
    Gain=+0.135, is_weak=False
[INFO][BT][2025-12-16 19:07:59] 桶 L1_group=B采样：原始样本 N=5591 → 使用样本 n=3913
【INFO】【2025-12-16 19:07:59】【阈值】桶 L1_group=B（n_val=2403

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:09:10】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=160000
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=L1_group=A，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="A"，n_samples=39817
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=L1_group=B，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="B"，n_samples=39904
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=L1_group=C，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="C"，n_samples=40277
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=L1_group=D，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="D"，n_samples=40002
[INFO][BT][2025-12-16 19:09:12] 创建桶 bucket_id=L1_group=A|L2_x1=b1，level=2，parent_id=L1_group=A，split_name=L2_x1，split_type=numeric_bin，split_rule="b1"，n_samples=4564
[INF



[INFO][BT][2025-12-16 19:09:25] 桶 ROOT采样：原始样本 N=22398 → 使用样本 n=15678
【INFO】【2025-12-16 19:09:26】【阈值】桶 ROOT（n_val=9598）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:09:26] 桶 bucket_id=ROOT level=0：
    n_train=22398, n_val=9598,
    BAC=0.588, F1=0.566, AUC=0.808,
    Regret=0.398, BND_ratio=0.245, POS_coverage=0.234,
    Score(f1_regret_bnd )=0.045
[INFO][BT][2025-12-16 19:09:26] 桶 L1_group=A采样：原始样本 N=5568 → 使用样本 n=3897
【INFO】【2025-12-16 19:09:26】【阈值】桶 L1_group=A（n_val=2392）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:09:26] 桶 bucket_id=L1_group=A level=1：
    n_train=5568, n_val=2392,
    BAC=0.622, F1=0.476, AUC=0.846,
    Regret=0.223, BND_ratio=0.133, POS_coverage=0.075,
    Score(f1_regret_bnd )=0.186
[INFO][BT][2025-12-16 19:09:26] 桶 bucket_id=L1_group=A：
    parent_id=ROOT，parent_Score=0.045, bucket_Score=0.186,
    Gain=+0.141, is_weak=False
[INFO][BT][2025-12-16 19:09:26] 桶 L1_group=B采样：原始样本 N=5578 → 使用样本 n=3904
【INFO】【2025-12-16 19:09:26】【阈值】桶 L1_group=B（n_val=2398

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:10:36】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=160000
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=L1_group=A，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="A"，n_samples=39802
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=L1_group=B，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="B"，n_samples=40042
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=L1_group=C，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="C"，n_samples=40381
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=L1_group=D，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="D"，n_samples=39775
[INFO][BT][2025-12-16 19:10:38] 创建桶 bucket_id=L1_group=A|L2_x1=b1，level=2，parent_id=L1_group=A，split_name=L2_x1，split_type=numeric_bin，split_rule="b1"，n_samples=4524
[INF



[INFO][BT][2025-12-16 19:10:51] 桶 ROOT采样：原始样本 N=22399 → 使用样本 n=15679
【INFO】【2025-12-16 19:10:52】【阈值】桶 ROOT（n_val=9599）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:10:52] 桶 bucket_id=ROOT level=0：
    n_train=22399, n_val=9599,
    BAC=0.591, F1=0.574, AUC=0.810,
    Regret=0.394, BND_ratio=0.247, POS_coverage=0.238,
    Score(f1_regret_bnd )=0.056
[INFO][BT][2025-12-16 19:10:52] 桶 L1_group=A采样：原始样本 N=5578 → 使用样本 n=3904
【INFO】【2025-12-16 19:10:52】【阈值】桶 L1_group=A（n_val=2379）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:10:52] 桶 bucket_id=L1_group=A level=1：
    n_train=5578, n_val=2379,
    BAC=0.584, F1=0.414, AUC=0.826,
    Regret=0.239, BND_ratio=0.145, POS_coverage=0.061,
    Score(f1_regret_bnd )=0.103
[INFO][BT][2025-12-16 19:10:52] 桶 bucket_id=L1_group=A：
    parent_id=ROOT，parent_Score=0.056, bucket_Score=0.103,
    Gain=+0.046, is_weak=False
[INFO][BT][2025-12-16 19:10:52] 桶 L1_group=B采样：原始样本 N=5613 → 使用样本 n=3929
【INFO】【2025-12-16 19:10:52】【阈值】桶 L1_group=B（n_val=2391

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:12:01】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=160000
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=L1_group=A，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="A"，n_samples=39841
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=L1_group=B，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="B"，n_samples=39882
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=L1_group=C，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="C"，n_samples=40381
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=L1_group=D，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="D"，n_samples=39896
[INFO][BT][2025-12-16 19:12:02] 创建桶 bucket_id=L1_group=A|L2_x1=b1，level=2，parent_id=L1_group=A，split_name=L2_x1，split_type=numeric_bin，split_rule="b1"，n_samples=4527
[INF



[INFO][BT][2025-12-16 19:12:15] 桶 ROOT采样：原始样本 N=22398 → 使用样本 n=15678
【INFO】【2025-12-16 19:12:15】【阈值】桶 ROOT（n_val=9598）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:12:15] 桶 bucket_id=ROOT level=0：
    n_train=22398, n_val=9598,
    BAC=0.599, F1=0.581, AUC=0.814,
    Regret=0.395, BND_ratio=0.238, POS_coverage=0.252,
    Score(f1_regret_bnd )=0.067
[INFO][BT][2025-12-16 19:12:15] 桶 L1_group=A采样：原始样本 N=5590 → 使用样本 n=3912
【INFO】【2025-12-16 19:12:16】【阈值】桶 L1_group=A（n_val=2375）使用本地阈值 α=0.4000, β=0.3000
[INFO][BT][2025-12-16 19:12:16] 桶 bucket_id=L1_group=A level=1：
    n_train=5590, n_val=2375,
    BAC=0.644, F1=0.442, AUC=0.796,
    Regret=0.231, BND_ratio=0.050, POS_coverage=0.071,
    Score(f1_regret_bnd )=0.186
[INFO][BT][2025-12-16 19:12:16] 桶 bucket_id=L1_group=A：
    parent_id=ROOT，parent_Score=0.067, bucket_Score=0.186,
    Gain=+0.119, is_weak=False
[INFO][BT][2025-12-16 19:12:16] 桶 L1_group=B采样：原始样本 N=5590 → 使用样本 n=3912
【INFO】【2025-12-16 19:12:16】【阈值】桶 L1_group=B（n_val=2383

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:13:24】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=160000
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=L1_group=A，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="A"，n_samples=39894
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=L1_group=B，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="B"，n_samples=39976
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=L1_group=C，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="C"，n_samples=40130
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=L1_group=D，level=1，parent_id=ROOT，split_name=L1_group，split_type=categorical_group，split_rule="D"，n_samples=40000
[INFO][BT][2025-12-16 19:13:25] 创建桶 bucket_id=L1_group=A|L2_x1=b1，level=2，parent_id=L1_group=A，split_name=L2_x1，split_type=numeric_bin，split_rule="b1"，n_samples=4635
[INF



[INFO][BT][2025-12-16 19:13:39] 桶 ROOT采样：原始样本 N=22398 → 使用样本 n=15678
【INFO】【2025-12-16 19:13:40】【阈值】桶 ROOT（n_val=9598）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:13:40] 桶 bucket_id=ROOT level=0：
    n_train=22398, n_val=9598,
    BAC=0.596, F1=0.577, AUC=0.808,
    Regret=0.396, BND_ratio=0.239, POS_coverage=0.243,
    Score(f1_regret_bnd )=0.062
[INFO][BT][2025-12-16 19:13:40] 桶 L1_group=A采样：原始样本 N=5584 → 使用样本 n=3908
【INFO】【2025-12-16 19:13:40】【阈值】桶 L1_group=A（n_val=2392）使用本地阈值 α=0.4000, β=0.2000
[INFO][BT][2025-12-16 19:13:40] 桶 bucket_id=L1_group=A level=1：
    n_train=5584, n_val=2392,
    BAC=0.620, F1=0.473, AUC=0.823,
    Regret=0.237, BND_ratio=0.133, POS_coverage=0.077,
    Score(f1_regret_bnd )=0.170
[INFO][BT][2025-12-16 19:13:40] 桶 bucket_id=L1_group=A：
    parent_id=ROOT，parent_Score=0.062, bucket_Score=0.170,
    Gain=+0.109, is_weak=False
[INFO][BT][2025-12-16 19:13:40] 桶 L1_group=B采样：原始样本 N=5618 → 使用样本 n=3932
【INFO】【2025-12-16 19:13:40】【阈值】桶 L1_group=B（n_val=2373

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-16 19:14:46】[BASELINE] 全局 XGB 模型训练完成
【INFO】【2025-12-16 19:14:46】[BASELINE] 阈值搜索开始
【INFO】【2025-12-16 19:14:48】[BASELINE] 最佳阈值找到: alpha=0.4000, beta=0.2000, regret=0.2998
【INFO】【2025-12-16 19:14:48】【桶树】已为样本生成桶ID，共 500 个组合
【INFO】【2025-12-16 19:14:48】[BASELINE] 测试集桶映射完成，共 500 个桶


  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=A|L2_x1=b2|L3_x2=b2|L4_x3=b4: BAC=0.5000, Regret=0.1556, Precision=0.0000, Recall=0.0000
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=C|L2_x1=b5|L3_x2=b2|L4_x3=b2: BAC=0.8015, Regret=0.2534, Precision=0.8878, Recall=0.8529
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=C|L2_x1=b5|L3_x2=b5|L4_x3=b1: BAC=0.5000, Regret=0.3382, Precision=0.0000, Recall=0.0000
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=D|L2_x1=b4|L3_x2=b1|L4_x3=b5: BAC=0.6000, Regret=0.1216, Precision=0.8889, Recall=1.0000
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=C|L2_x1=b4|L3_x2=b2|L4_x3=b3: BAC=0.8027, Regret=0.3618, Precision=0.7571, Recall=0.8030
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=B|L2_x1=b2|L3_x2=b4|L4_x3=b2: BAC=0.5884, Regret=0.1231, Precision=0.2500, Recall=0.2000
【INFO】【2025-12-16 19:15:04】[BASELINE] 桶 L1_group=D|L2_x1=b1|L3_x2=b1|L4_x3=b5: BAC=0.6571, Regret=0.4318, Precision=0.4545, Recall=0.7143
【INFO】【2025-12-16 19:15:04】[BASELI

Unnamed: 0,model,Precision_mean,Precision_std,Recall_mean,Recall_std,F1_mean,F1_std,BAC_mean,BAC_std,AUC_mean,...,MCC_mean,MCC_std,Kappa_mean,Kappa_std,BND_ratio_mean,BND_ratio_std,POS_Coverage_mean,POS_Coverage_std,Regret_mean,Regret_std
0,BTTWD,0.589161,0.003902,0.779273,0.005359,0.670993,0.002251,0.793877,0.001786,0.868271,...,0.542862,0.003286,0.532146,0.003546,0.118205,0.014392,0.275585,0.005134,0.314587,0.002497
1,LogReg,0.523174,0.004491,0.693192,0.004472,0.5963,0.004555,0.735274,0.003516,0.811006,...,0.434423,0.006709,0.42575,0.006716,0.0,0.0,,,0.404465,0.00539
2,RandomForest,0.601121,0.00209,0.778179,0.003758,0.678284,0.002531,0.798107,0.002068,0.871166,...,0.553529,0.00364,0.5443,0.003494,0.0,0.0,,,0.30795,0.003213
3,KNN,0.467856,0.003497,0.735522,0.006312,0.571919,0.004382,0.720357,0.003838,0.785755,...,0.393377,0.006793,0.37183,0.006344,0.0,0.0,,,0.424735,0.005932
4,XGBoost,0.623738,0.004028,0.767625,0.004069,0.688238,0.00383,0.802219,0.002812,0.878252,...,0.568449,0.005467,0.562427,0.005504,0.0,0.0,,,0.302315,0.004323


【INFO】【2025-12-16 19:15:04】【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。


In [10]:
# 步骤7：桶级别分析
bucket_metrics_path = os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'bucket_metrics.csv')
if os.path.exists(bucket_metrics_path):
    bucket_metrics_df = pd.read_csv(bucket_metrics_path)
    display(bucket_metrics_df.head())
    bucket_metrics_df.plot(x='bucket_id', y='pos_rate_all', kind='bar', figsize=(12,4), title='桶正类比例')
    plt.ylabel('正类比例')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig(fig_bucket, bbox_inches='tight')
    plt.close()
log_info('【步骤7摘要】桶级指标已整理，可用于局部化分析。')

Unnamed: 0,bucket_id,layer,parent_bucket_id,n_train,n_val,pos_rate_train,pos_rate_val,alpha,beta,regret_val,...,is_weak,threshold_source_bucket,parent_with_threshold,n_test,pos_rate_test,BND_ratio_test,POS_Coverage_test,regret_test,fold,pos_rate
0,ROOT,L1,,22398,9599,0.260202,0.260548,0.4,0.2,0.3965,...,False,ROOT,,,,,,,1,0.260575
1,L1_group=C,L1,ROOT,5603,2417,0.273782,0.288788,0.4,0.2,0.376707,...,False,L1_group=C,,,,,,,1,0.283827
2,L1_group=D,L1,ROOT,5627,2372,0.431491,0.413997,0.4,0.3,0.416948,...,False,L1_group=D,,,,,,,1,0.424416
3,L1_group=B,L1,ROOT,5591,2403,0.212127,0.223887,0.4,0.2,0.315855,...,False,L1_group=B,,,,,,,1,0.208438
4,L1_group=A,L1,ROOT,5570,2402,0.1386,0.119484,0.4,0.2,0.233139,...,False,L1_group=A,,,,,,,1,0.125044


【INFO】【2025-12-16 19:15:24】【步骤7摘要】桶级指标已整理，可用于局部化分析。


In [11]:
# 步骤8：结果汇总
log_info('【步骤8】检查结果文件与图表。')
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['results_dir'])))
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['figs_dir'])))
log_info('【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。')

【INFO】【2025-12-16 19:15:24】【步骤8】检查结果文件与图表。
['baseline_bucket_metrics.csv', 'bucket_fallback_stats.csv', 'bucket_metrics.csv', 'bucket_metrics_gain.csv', 'bucket_thresholds.csv', 'bucket_thresholds_per_fold.csv', 'bucket_tree_structure.csv', 'metrics_kfold_per_fold.csv', 'metrics_kfold_summary.csv', 'metrics_overview.csv']
['bank_class_distribution.png', 'bucket_metrics_bar.png', 'class_distribution.png', 'metrics_compare.png']
【INFO】【2025-12-16 19:15:24】【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。
