 数据集 BT-TWD 可行性实验

本 notebook 按步骤运行：加载配置 → 读取数据 → 预处理 → 桶树划分 → 基线与 BTTWD k 折实验 → 桶级分析。

In [1]:
# 步骤0：环境与路径设置
import os, sys
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# 将项目根目录加入路径，便于导入 bttwdlib
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from bttwdlib import (
    load_yaml_cfg,
    show_cfg,
    load_adult_raw,
    prepare_features_and_labels,
    BucketTree,
    run_kfold_experiments,
    log_info,
    set_global_seed,
)

cfg_path = os.path.join(root_path, 'configs', 'airlines_delay.yaml')
cfg = load_yaml_cfg(cfg_path)
set_global_seed(cfg.get('SEED', {}).get('global_seed', 42))
log_info('【步骤0摘要】环境准备完毕，路径与随机种子已设置。')

【INFO】【2025-11-24 20:54:34】【配置加载】已读取 e:\yan\组\三支决策\机器学习\BT_TWD\configs\airlines_delay.yaml
【INFO】【2025-11-24 20:54:37】【步骤0摘要】环境准备完毕，路径与随机种子已设置。


In [2]:
# 步骤1：加载配置
show_cfg(cfg)
log_info('【步骤1摘要】配置文件加载完成，关键参数检查通过。')

【INFO】【2025-11-24 20:54:37】【配置-数据】数据集=airlines_delay_1m, k折=5, 目标列=DepDelay, 正类="1"
【INFO】【2025-11-24 20:54:37】【配置-BTTWD】阈值模式=None, 全局模型=xgb, 桶内模型=knn, 后验估计器(兼容字段)=logreg
【INFO】【2025-11-24 20:54:37】【配置-基线】LogReg启用=True, RandomForest启用=False, KNN启用=True, XGBoost启用=True
【INFO】【2025-11-24 20:54:37】【步骤1摘要】配置文件加载完成，关键参数检查通过。


In [3]:
# 步骤2：加载原始数据
df_raw = load_adult_raw(cfg)
display(df_raw.head())
target_col = cfg['DATA']['target_col']
class_counts = df_raw[target_col].value_counts(normalize=True)
ax = class_counts.plot(kind='bar', title='正负类比例')
plt.ylabel('比例')
fig_path = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'class_distribution.png')
os.makedirs(os.path.dirname(fig_path), exist_ok=True)
plt.savefig(fig_path, bbox_inches='tight')
plt.close()
log_info('【步骤2摘要】Adult 原始数据加载与基本统计完成。')

ParserError: Error tokenizing data. C error: Expected 15 fields in line 8, saw 30


In [None]:
# 步骤3：预处理与特征工程
X, y, meta = prepare_features_and_labels(df_raw, cfg)
log_info(f'【预处理】编码特征维度={X.shape[1]}，样本数={X.shape[0]}')
log_info(f"【步骤3摘要】特征预处理完成：连续={len(meta['continuous_cols'])}，类别={len(meta['categorical_cols'])}，编码维度={X.shape[1]}。")

【INFO】【2025-11-23 21:21:44】【预处理】连续特征=6个，类别特征=8个
【INFO】【2025-11-23 21:21:44】【预处理】编码后维度=100
【INFO】【2025-11-23 21:21:44】【预处理】编码特征维度=100，样本数=32561
【INFO】【2025-11-23 21:21:44】【步骤3摘要】特征预处理完成：连续=6，类别=8，编码维度=100。


In [None]:
# 步骤4：构建桶树并检查划分
bucket_tree = BucketTree(cfg['BTTWD']['bucket_levels'], feature_names=df_raw.drop(columns=[cfg['DATA']['target_col']]).columns.tolist())
bucket_ids_full = bucket_tree.assign_buckets(df_raw.drop(columns=[cfg['DATA']['target_col']]))
bucket_df = bucket_ids_full.value_counts().reset_index()
bucket_df.columns = ['bucket_id', 'count']
bucket_df['pos_rate'] = df_raw.groupby(bucket_ids_full)[cfg['DATA']['target_col']].apply(lambda s: (s == cfg['DATA']['positive_label']).mean()).values
display(bucket_df.head())
bucket_df.set_index('bucket_id')['count'].plot(kind='bar', figsize=(12,4), title='桶样本数分布')
fig_bucket = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'bucket_metrics_bar.png')
plt.savefig(fig_bucket, bbox_inches='tight')
plt.close()
log_info(f'【步骤4摘要】桶树划分完成，共有 {bucket_ids_full.nunique()} 个叶子桶。')

【INFO】【2025-11-23 21:21:44】【桶树】已为样本生成桶ID，共 16 个组合


Unnamed: 0,bucket_id,count,pos_rate
0,L1_age=old|L2_education=high,5328,0.235657
1,L1_age=mid|L2_education=high,4218,0.018779
2,L1_age=old|L2_education=mid,4166,0.097167
3,L1_age=mid|L2_education=mid,3530,0.45208
4,L1_age=young|L2_education=high,3216,0.411974


【INFO】【2025-11-23 21:21:44】【步骤4摘要】桶树划分完成，共有 16 个叶子桶。


In [None]:
# 步骤5：运行基线模型 k 折实验
# 基线部分在 run_kfold_experiments 内统一调度
log_info('【步骤5】基线模型将在整体交叉验证中一并运行。')
log_info('【步骤5摘要】基线模型性能将作为后续对比基准。')

【INFO】【2025-11-23 21:21:44】【步骤5】基线模型将在整体交叉验证中一并运行。
【INFO】【2025-11-23 21:21:44】【步骤5摘要】基线模型性能将作为后续对比基准。


In [None]:
# 步骤6：运行 BTTWD k 折实验（含基线）
results = run_kfold_experiments(X, y, df_raw.drop(columns=[cfg['DATA']['target_col']]), cfg)
summary_df = pd.read_csv(os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'metrics_kfold_summary.csv'))
display(summary_df)
summary_df.plot(x='model', kind='bar', figsize=(8,4), title='模型指标对比')
fig_compare = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'metrics_compare.png')
plt.savefig(fig_compare, bbox_inches='tight')
plt.close()
log_info('【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。')

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-11-23 21:21:51】【基线-LogReg】整体指标：AUC_mean=0.907, AUC_std=0.002, BAC_mean=0.767, BAC_std=0.005, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.662, F1_std=0.007, Kappa_mean=0.568, Kappa_std=0.009, MCC_mean=0.573, MCC_std=0.008, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.735, Precision_std=0.008, Recall_mean=0.602, Recall_std=0.011, Regret_mean=0.340, Regret_std=0.007


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-11-23 21:21:59】【基线-RF】整体指标：AUC_mean=0.906, AUC_std=0.002, BAC_mean=0.778, BAC_std=0.004, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.678, F1_std=0.007, Kappa_mean=0.587, Kappa_std=0.010, MCC_mean=0.590, MCC_std=0.010, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.739, Precision_std=0.009, Recall_mean=0.627, Recall_std=0.007, Regret_mean=0.323, Regret_std=0.007


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:02】【基线-KNN】整体指标：AUC_mean=0.869, AUC_std=0.006, BAC_mean=0.729, BAC_std=0.007, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.604, F1_std=0.012, Kappa_mean=0.502, Kappa_std=0.014, MCC_mean=0.511, MCC_std=0.014, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.711, Precision_std=0.011, Recall_mean=0.525, Recall_std=0.012, Regret_mean=0.395, Regret_std=0.010


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:05】【基线-XGB】整体指标：AUC_mean=0.929, AUC_std=0.002, BAC_mean=0.799, BAC_std=0.004, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.714, F1_std=0.006, Kappa_mean=0.634, Kappa_std=0.007, MCC_mean=0.639, MCC_std=0.007, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.786, Precision_std=0.007, Recall_mean=0.655, Recall_std=0.007, Regret_mean=0.292, Regret_std=0.005
【INFO】【2025-11-23 21:22:05】【K折实验】正在执行第 1/5 折...
【INFO】【2025-11-23 21:22:05】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1_age=mid|L2_education=high 向父桶 L1_age=mid 贡献 681 个典型样本
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1_age=mid|L2_education=low 样本太少(n=166)，全部并入父桶 L1_age=mid
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1_age=mid|L2_education=mid 向父桶 L1_age=mid 贡献 561 个典型样本
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1_age=mid|L2_education=top 向父桶 L1_age=mid 贡献 87 个典型样本
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1_age=old|L2_education=high 向父桶 L1_age=old 贡献 854 个典型样本
【INFO】【2025-11-23 21:22:05】【BTTWD】桶 L1

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:11】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-23 21:22:12】【BTTWD】bucket_estimator=none：不训练桶内局部模型，仅使用全局模型概率做桶内阈值搜索
【INFO】【2025-11-23 21:22:13】【BTTWD】叶子桶 L1_age=old|L2_education=low 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-23 21:22:16】【BTTWD】共生成 16 个叶子桶，其中有效桶 0 个（样本数 ≥ 200）
【INFO】【2025-11-23 21:22:16】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:16】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:16】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:16】【BTTWD】三支指标(含后悔)：Precision=0.618, Recall=0.869, F1=0.722, BAC=0.849, AUC=0.929, MCC=0.631, Kappa=0.613, BND_ratio=0.073, POS_Coverage=0.277, Regret=0.233
【INFO】【2025-11-23 21:22:16】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:16】【K折实验】正在执行第 3/5 折...
【INFO】【2025-11-23 21:22:16】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:16】【BTTWD】桶 L1_age=mid|L2_education=high 向父桶 L1_age=mid 贡献 669 个典型样本
【INFO】【2025-11-23 21:22:16】【BTTWD】桶 L1_age=mid|L2_education=low 样本太少(n=173)，全部并入父桶 L1_age=mid
【INFO】【2025-11-23 21:22:16】【BTTWD】桶 L1_age=mid|L

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:16】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-23 21:22:17】【BTTWD】bucket_estimator=none：不训练桶内局部模型，仅使用全局模型概率做桶内阈值搜索
【INFO】【2025-11-23 21:22:18】【BTTWD】叶子桶 L1_age=old|L2_education=low 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-23 21:22:21】【BTTWD】共生成 16 个叶子桶，其中有效桶 0 个（样本数 ≥ 200）
【INFO】【2025-11-23 21:22:21】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:21】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:21】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:22】【BTTWD】三支指标(含后悔)：Precision=0.615, Recall=0.853, F1=0.715, BAC=0.842, AUC=0.927, MCC=0.620, Kappa=0.604, BND_ratio=0.073, POS_Coverage=0.270, Regret=0.242
【INFO】【2025-11-23 21:22:22】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:22】【K折实验】正在执行第 4/5 折...
【INFO】【2025-11-23 21:22:22】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:22】【BTTWD】桶 L1_age=mid|L2_education=high 向父桶 L1_age=mid 贡献 672 个典型样本
【INFO】【2025-11-23 21:22:22】【BTTWD】桶 L1_age=mid|L2_education=low 样本太少(n=173)，全部并入父桶 L1_age=mid
【INFO】【2025-11-23 21:22:22】【BTTWD】桶 L1_age=mid|L

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:22】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-23 21:22:23】【BTTWD】bucket_estimator=none：不训练桶内局部模型，仅使用全局模型概率做桶内阈值搜索
【INFO】【2025-11-23 21:22:24】【BTTWD】叶子桶 L1_age=old|L2_education=low 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-23 21:22:27】【BTTWD】共生成 16 个叶子桶，其中有效桶 0 个（样本数 ≥ 200）
【INFO】【2025-11-23 21:22:27】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:27】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:27】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:27】【BTTWD】三支指标(含后悔)：Precision=0.606, Recall=0.852, F1=0.709, BAC=0.838, AUC=0.925, MCC=0.612, Kappa=0.594, BND_ratio=0.074, POS_Coverage=0.277, Regret=0.244
【INFO】【2025-11-23 21:22:27】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:27】【K折实验】正在执行第 5/5 折...
【INFO】【2025-11-23 21:22:27】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:27】【BTTWD】桶 L1_age=mid|L2_education=high 向父桶 L1_age=mid 贡献 680 个典型样本
【INFO】【2025-11-23 21:22:27】【BTTWD】桶 L1_age=mid|L2_education=low 样本太少(n=165)，全部并入父桶 L1_age=mid
【INFO】【2025-11-23 21:22:27】【BTTWD】桶 L1_age=mid|L

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-23 21:22:28】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-23 21:22:28】【BTTWD】bucket_estimator=none：不训练桶内局部模型，仅使用全局模型概率做桶内阈值搜索
【INFO】【2025-11-23 21:22:29】【BTTWD】叶子桶 L1_age=old|L2_education=low 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-23 21:22:33】【BTTWD】共生成 16 个叶子桶，其中有效桶 0 个（样本数 ≥ 200）
【INFO】【2025-11-23 21:22:33】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:33】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:33】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:33】【BTTWD】三支指标(含后悔)：Precision=0.612, Recall=0.861, F1=0.715, BAC=0.844, AUC=0.927, MCC=0.621, Kappa=0.604, BND_ratio=0.053, POS_Coverage=0.273, Regret=0.241
【INFO】【2025-11-23 21:22:33】【桶树】已为样本生成桶ID，共 16 个组合
【INFO】【2025-11-23 21:22:33】【K折实验】所有结果已写入 results 目录


Unnamed: 0,model,Precision_mean,Precision_std,Recall_mean,Recall_std,F1_mean,F1_std,BAC_mean,BAC_std,AUC_mean,...,MCC_mean,MCC_std,Kappa_mean,Kappa_std,BND_ratio_mean,BND_ratio_std,POS_Coverage_mean,POS_Coverage_std,Regret_mean,Regret_std
0,BTTWD,0.612886,0.004174,0.856907,0.007809,0.714631,0.004901,0.842612,0.004173,0.927143,...,0.619975,0.007108,0.603221,0.006719,0.070943,0.010465,0.27189,0.005922,0.242038,0.00611
1,LogReg,0.735177,0.008064,0.601964,0.010647,0.661864,0.007215,0.766577,0.004906,0.906752,...,0.572968,0.008306,0.568268,0.00857,0.0,0.0,,,0.339793,0.007071
2,RandomForest,0.738984,0.008985,0.627088,0.006764,0.678445,0.007394,0.77841,0.004453,0.905703,...,0.590452,0.009705,0.587172,0.00959,0.0,0.0,,,0.322748,0.006507
3,KNN,0.711204,0.011278,0.524933,0.012333,0.604007,0.011766,0.728668,0.006979,0.868952,...,0.511495,0.013983,0.502222,0.014163,0.0,0.0,,,0.394521,0.010132
4,XGBoost,0.785942,0.007124,0.654892,0.007286,0.714426,0.005556,0.799149,0.003685,0.929265,...,0.638761,0.006904,0.634395,0.006939,0.0,0.0,,,0.292282,0.005355


【INFO】【2025-11-23 21:22:33】【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。


In [None]:
# 步骤7：桶级别分析
bucket_metrics_path = os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'bucket_metrics.csv')
if os.path.exists(bucket_metrics_path):
    bucket_metrics_df = pd.read_csv(bucket_metrics_path)
    display(bucket_metrics_df.head())
    bucket_metrics_df.plot(x='bucket_id', y='pos_rate_all', kind='bar', figsize=(12,4), title='桶正类比例')
    plt.ylabel('正类比例')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig(fig_bucket, bbox_inches='tight')
    plt.close()
log_info('【步骤7摘要】桶级指标已整理，可用于局部化分析。')

Unnamed: 0,bucket_id,layer,parent_bucket_id,n_train,n_val,pos_rate_train,pos_rate_val,alpha,beta,regret_val,...,threshold_n_samples,n_all,pos_rate_all,n_test,pos_rate_test,BND_ratio_test,POS_Coverage_test,regret_test,fold,pos_rate
0,L1_age=old|L2_education=high,L2,L1_age=old,2992,1280,0.408422,0.422656,0.35,0.2,0.305469,...,1280,4272,0.412687,1056.0,0.409091,0.0,0.508523,0.34375,1,0.412687
1,L1_age=mid|L2_education=high,L2,L1_age=mid,2416,990,0.235099,0.243434,0.35,0.25,0.291414,...,990,3406,0.237522,812.0,0.227833,0.0,0.257389,0.275246,1,0.237522
2,L1_age=old|L2_education=mid,L2,L1_age=old,2344,991,0.213737,0.211907,0.4,0.2,0.323411,...,991,3335,0.213193,831.0,0.216606,0.0,0.181709,0.356197,1,0.213193
3,L1_age=mid|L2_education=mid,L2,L1_age=mid,1925,884,0.092987,0.097285,0.45,0.2,0.206448,...,884,2809,0.09434,721.0,0.108183,0.0,0.0319,0.237864,1,0.09434
4,L1_age=young|L2_education=high,L2,L1_age=young,1799,753,0.026126,0.018592,0.65,0.2,0.02656,...,753,2552,0.023903,664.0,0.024096,0.0,0.009036,0.042169,1,0.023903


【INFO】【2025-11-23 21:22:34】【步骤7摘要】桶级指标已整理，可用于局部化分析。


In [None]:
# 步骤8：结果汇总
log_info('【步骤8】检查结果文件与图表。')
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['results_dir'])))
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['figs_dir'])))
log_info('【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。')

【INFO】【2025-11-23 21:22:34】【步骤8】检查结果文件与图表。
['bucket_metrics.csv', 'bucket_thresholds_per_fold.csv', 'metrics_kfold_per_fold.csv', 'metrics_kfold_summary.csv']
['bucket_metrics_bar.png', 'class_distribution.png', 'metrics_compare.png']
【INFO】【2025-11-23 21:22:34】【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。


: 