 数据集 BT-TWD 可行性实验

本 notebook 按步骤运行：加载配置 → 读取数据 → 预处理 → 桶树划分 → 基线与 BTTWD k 折实验 → 桶级分析。

In [1]:
# 步骤0：环境与路径设置
import os, sys
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# 将项目根目录加入路径，便于导入 bttwdlib
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from bttwdlib import (
    load_yaml_cfg,
    show_cfg,
    load_dataset,
    prepare_features_and_labels,
    BucketTree,
    run_kfold_experiments,
    log_info,
    set_global_seed,
)

cfg_path = os.path.join(root_path, 'configs', 'airlines_delay.yaml')
cfg = load_yaml_cfg(cfg_path)
set_global_seed(cfg.get('SEED', {}).get('global_seed', 42))
log_info('【步骤0摘要】环境准备完毕，路径与随机种子已设置。')

【INFO】【2025-11-24 21:07:25】【配置加载】已读取 e:\yan\组\三支决策\机器学习\BT_TWD\configs\airlines_delay.yaml
【INFO】【2025-11-24 21:07:33】【步骤0摘要】环境准备完毕，路径与随机种子已设置。


In [2]:
# 步骤1：加载配置
show_cfg(cfg)
log_info('【步骤1摘要】配置文件加载完成，关键参数检查通过。')

【INFO】【2025-11-24 21:07:33】【配置-数据】数据集=airlines_delay_1m, k折=5, 目标列=DepDelay, 正类="1"
【INFO】【2025-11-24 21:07:33】【配置-BTTWD】阈值模式=None, 全局模型=xgb, 桶内模型=knn, 后验估计器(兼容字段)=logreg
【INFO】【2025-11-24 21:07:33】【配置-基线】LogReg启用=True, RandomForest启用=False, KNN启用=True, XGBoost启用=True
【INFO】【2025-11-24 21:07:33】【步骤1摘要】配置文件加载完成，关键参数检查通过。


In [3]:
# 步骤2：加载原始数据
df_raw, target_col_model = load_dataset(cfg)  # 这里返回的是用于建模的标签列，例如 "label"

display(df_raw.head())
print("用于建模的标签列:", target_col_model)

# 1）画 0/1 标签（延误/不延误）的比例
class_counts = df_raw[target_col_model].value_counts(normalize=True)
ax = class_counts.plot(kind='bar', title='延误 vs 未延误比例')
plt.ylabel('比例')

fig_path = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'class_distribution.png')
os.makedirs(os.path.dirname(fig_path), exist_ok=True)
plt.savefig(fig_path, bbox_inches='tight')
plt.close()

# 2）如果想看原始 DepDelay 的分布，可以另外单独分析：
raw_target_col = cfg['DATA']['target_col']  # 这里是 "DepDelay"
print("原始目标列:", raw_target_col)
print(df_raw[raw_target_col].describe())

log_info('【步骤2摘要】Airlines 原始数据加载与基本统计完成。')


【INFO】【2025-11-24 21:07:46】【数据加载】ARFF 文件 ../data/airline/airlines_train_regression_1000000.arff 已读取，含 1000000 条记录，10 列
【INFO】【2025-11-24 21:07:46】【目标变换】已按阈值 15.0 生成二分类标签列 label，正类取 > 15.0
【INFO】【2025-11-24 21:07:46】【数据集信息】名称=airlines_delay_1m，样本数=1000000，目标列=label，正类比例=15.59%


Unnamed: 0,DepDelay,Month,DayofMonth,DayOfWeek,CRSDepTime,CRSArrTime,UniqueCarrier,Origin,Dest,Distance,label
0,8.0,10.0,11.0,7.0,1300.0,1535.0,AA,LAX,HNL,2556.0,0
1,-3.0,10.0,10.0,6.0,2035.0,2110.0,AA,OGG,HNL,100.0,0
2,6.0,10.0,26.0,1.0,1200.0,1446.0,AA,JFK,LAX,2475.0,0
3,1.0,10.0,9.0,5.0,1145.0,1512.0,AA,JFK,SFO,2586.0,0
4,0.0,10.0,16.0,5.0,930.0,1149.0,AA,SFO,HNL,2399.0,0


用于建模的标签列: label
原始目标列: DepDelay
count    1000000.000000
mean           8.191935
std           28.877186
min        -1197.000000
25%           -3.000000
50%            0.000000
75%            7.000000
max         2119.000000
Name: DepDelay, dtype: float64
【INFO】【2025-11-24 21:07:47】【步骤2摘要】Airlines 原始数据加载与基本统计完成。


In [4]:
# 步骤3：预处理与特征工程
X, y, meta = prepare_features_and_labels(df_raw, cfg)
log_info(f'【预处理】编码特征维度={X.shape[1]}，样本数={X.shape[0]}')
log_info(f"【步骤3摘要】特征预处理完成：连续={len(meta['continuous_cols'])}，类别={len(meta['categorical_cols'])}，编码维度={X.shape[1]}。")

【INFO】【2025-11-24 21:07:47】【预处理】连续特征=6个，类别特征=3个
【INFO】【2025-11-24 21:07:50】【预处理】编码后维度=755
【INFO】【2025-11-24 21:07:50】【预处理】编码特征维度=755，样本数=1000000
【INFO】【2025-11-24 21:07:50】【步骤3摘要】特征预处理完成：连续=6，类别=3，编码维度=755。


In [5]:
# 步骤4：构建桶树并检查划分
bucket_tree = BucketTree(cfg['BTTWD']['bucket_levels'], feature_names=df_raw.drop(columns=[cfg['DATA']['target_col']]).columns.tolist())
bucket_ids_full = bucket_tree.assign_buckets(df_raw.drop(columns=[cfg['DATA']['target_col']]))
bucket_df = bucket_ids_full.value_counts().reset_index()
bucket_df.columns = ['bucket_id', 'count']
bucket_df['pos_rate'] = df_raw.groupby(bucket_ids_full)[cfg['DATA']['target_col']].apply(lambda s: (s == cfg['DATA']['positive_label']).mean()).values
display(bucket_df.head())
bucket_df.set_index('bucket_id')['count'].plot(kind='bar', figsize=(12,4), title='桶样本数分布')
fig_bucket = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'bucket_metrics_bar.png')
plt.savefig(fig_bucket, bbox_inches='tight')
plt.close()
log_info(f'【步骤4摘要】桶树划分完成，共有 {bucket_ids_full.nunique()} 个叶子桶。')

【INFO】【2025-11-24 21:07:53】【桶树】已为样本生成桶ID，共 228 个组合


Unnamed: 0,bucket_id,count,pos_rate
0,L1_UniqueCarrier=WN|L2_Distance=300-800|L3_CRS...,29020,0.064395
1,L1_UniqueCarrier=WN|L2_Distance=300-800|L3_CRS...,28723,0.051562
2,L1_UniqueCarrier=DL|L2_Distance=300-800|L3_CRS...,23817,0.06299
3,L1_UniqueCarrier=DL|L2_Distance=300-800|L3_CRS...,21924,0.052195
4,L1_UniqueCarrier=US|L2_Distance=300-800|L3_CRS...,18645,0.062972


【INFO】【2025-11-24 21:07:56】【步骤4摘要】桶树划分完成，共有 228 个叶子桶。


In [6]:
# 步骤5：运行基线模型 k 折实验
# 基线部分在 run_kfold_experiments 内统一调度
log_info('【步骤5】基线模型将在整体交叉验证中一并运行。')
log_info('【步骤5摘要】基线模型性能将作为后续对比基准。')

【INFO】【2025-11-24 21:07:56】【步骤5】基线模型将在整体交叉验证中一并运行。
【INFO】【2025-11-24 21:07:56】【步骤5摘要】基线模型性能将作为后续对比基准。


In [7]:
# 步骤6：运行 BTTWD k 折实验（含基线）
results = run_kfold_experiments(X, y, df_raw.drop(columns=[cfg['DATA']['target_col']]), cfg)
summary_df = pd.read_csv(os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'metrics_kfold_summary.csv'))
display(summary_df)
summary_df.plot(x='model', kind='bar', figsize=(8,4), title='模型指标对比')
fig_compare = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'metrics_compare.png')
plt.savefig(fig_compare, bbox_inches='tight')
plt.close()
log_info('【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。')

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

【INFO】【2025-11-24 21:48:30】【基线-LogReg】整体指标：AUC_mean=0.652, AUC_std=0.002, BAC_mean=0.500, BAC_std=0.000, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.000, F1_std=0.000, Kappa_mean=0.000, Kappa_std=0.000, MCC_mean=0.003, MCC_std=0.004, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.336, Precision_std=0.239, Recall_mean=0.000, Recall_std=0.000, Regret_mean=0.468, Regret_std=0.000


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-11-25 01:40:41】【基线-KNN】整体指标：AUC_mean=0.655, AUC_std=0.001, BAC_mean=0.501, BAC_std=0.000, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.003, F1_std=0.000, Kappa_mean=0.002, Kappa_std=0.000, MCC_mean=0.019, MCC_std=0.002, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.465, Precision_std=0.044, Recall_mean=0.002, Recall_std=0.000, Regret_mean=0.467, Regret_std=0.000


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-11-25 01:45:57】【基线-XGB】整体指标：AUC_mean=0.686, AUC_std=0.001, BAC_mean=0.501, BAC_std=0.000, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.006, F1_std=0.001, Kappa_mean=0.005, Kappa_std=0.001, MCC_mean=0.034, MCC_std=0.003, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.574, Precision_std=0.026, Recall_mean=0.003, Recall_std=0.000, Regret_mean=0.466, Regret_std=0.000
【INFO】【2025-11-25 01:45:57】【K折实验】正在执行第 1/5 折...
【INFO】【2025-11-25 01:46:26】【桶树】已为样本生成桶ID，共 156 个组合
【INFO】【2025-11-25 01:46:26】【BTTWD】桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 向父桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000 贡献 4880 个典型样本
【INFO】【2025-11-25 01:46:26】【BTTWD】桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000|L3_CRSDepTime=evening 向父桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000 贡献 2621 个典型样本
【INFO】【2025-11-25 01:46:26】【BTTWD】桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000|L3_CRSDepTime=morning 向父桶 L1_UniqueCarrier=AA|L2_Distance=1500-5000 贡献 5000 个典型样本
【INFO】【2025-11-25 

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-25 01:47:23】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-25 01:47:42】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=800-1500|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:47:43】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:47:56】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:47:56】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=morning 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:48:08】【BTTWD】叶子桶 L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:48:20】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:48:23】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=800-1500|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:48:26】【BTTWD】叶子桶 L1_UniqueCarrier=WN|L2_Distance=1500-5000|L3_CRSDepTime=a

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-25 01:51:21】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-25 01:51:40】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=800-1500|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:51:41】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:51:54】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:51:54】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=morning 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:52:06】【BTTWD】叶子桶 L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:52:18】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:52:21】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=800-1500|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:52:24】【BTTWD】叶子桶 L1_UniqueCarrier=WN|L2_Distance=1500-5000|L3_CRSDepTime=a

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-25 01:55:26】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-25 01:55:44】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=800-1500|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:55:45】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:55:56】【BTTWD】叶子桶 L1_UniqueCarrier=MQ|L2_Distance=300-800|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:55:58】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:55:58】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=morning 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:56:09】【BTTWD】叶子桶 L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:56:20】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:56:24】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=800-1500|L3_CRSDepTime=eve

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-25 01:59:21】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-25 01:59:39】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=800-1500|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:59:40】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=<=300|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:59:40】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:59:41】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=300-800|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:59:52】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 01:59:52】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=morning 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:00:05】【BTTWD】叶子桶 L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:00:17】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=1500-5000|L3_CRSDepTime=afternoo

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-11-25 02:03:25】【BTTWD】全局模型训练完成，用于兜底预测
【INFO】【2025-11-25 02:03:44】【BTTWD】叶子桶 L1_UniqueCarrier=AA|L2_Distance=800-1500|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:03:45】【BTTWD】叶子桶 L1_UniqueCarrier=CO|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:03:57】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:03:57】【BTTWD】叶子桶 L1_UniqueCarrier=NW|L2_Distance=1500-5000|L3_CRSDepTime=morning 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:04:10】【BTTWD】叶子桶 L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_CRSDepTime=night 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:04:22】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=1500-5000|L3_CRSDepTime=afternoon 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:04:25】【BTTWD】叶子桶 L1_UniqueCarrier=US|L2_Distance=800-1500|L3_CRSDepTime=evening 训练样本不足或单类，使用父桶/全局阈值
【INFO】【2025-11-25 02:04:27】【BTTWD】叶子桶 L1_UniqueCarrier=WN|L2_Distance=1500-5000|L3_CRSDepTime=a

Unnamed: 0,model,Precision_mean,Precision_std,Recall_mean,Recall_std,F1_mean,F1_std,BAC_mean,BAC_std,AUC_mean,...,MCC_mean,MCC_std,Kappa_mean,Kappa_std,BND_ratio_mean,BND_ratio_std,POS_Coverage_mean,POS_Coverage_std,Regret_mean,Regret_std
0,BTTWD,0.22456,0.001169,0.347215,0.002048,0.272731,0.001387,0.562918,0.001041,0.599452,...,0.106723,0.00177,0.102919,0.001714,0.0217,0.001211,0.053992,0.004633,0.467939,0.001356
1,LogReg,0.336111,0.239018,8.3e-05,5.9e-05,0.000167,0.000119,0.500025,3.7e-05,0.651562,...,0.003001,0.004156,8.5e-05,0.000123,0.0,0.0,,,0.467569,3.9e-05
2,KNN,0.465053,0.044473,0.001501,8.7e-05,0.002993,0.000173,0.500588,4.3e-05,0.654633,...,0.019048,0.001869,0.001982,0.000144,0.0,0.0,,,0.467152,5.3e-05
3,XGBoost,0.573661,0.026263,0.003131,0.000328,0.006228,0.00065,0.501351,0.000158,0.686021,...,0.033584,0.0028,0.004545,0.00053,0.0,0.0,,,0.466478,0.000144


【INFO】【2025-11-25 02:06:06】【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。


In [8]:
# 步骤7：桶级别分析
bucket_metrics_path = os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'bucket_metrics.csv')
if os.path.exists(bucket_metrics_path):
    bucket_metrics_df = pd.read_csv(bucket_metrics_path)
    display(bucket_metrics_df.head())
    bucket_metrics_df.plot(x='bucket_id', y='pos_rate_all', kind='bar', figsize=(12,4), title='桶正类比例')
    plt.ylabel('正类比例')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig(fig_bucket, bbox_inches='tight')
    plt.close()
log_info('【步骤7摘要】桶级指标已整理，可用于局部化分析。')

Unnamed: 0,bucket_id,layer,parent_bucket_id,n_train,n_val,pos_rate_train,pos_rate_val,alpha,beta,regret_val,...,threshold_n_samples,n_all,pos_rate_all,n_test,pos_rate_test,BND_ratio_test,POS_Coverage_test,regret_test,fold,pos_rate
0,L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_...,L3,L1_UniqueCarrier=OTHER|L2_Distance=300-800,23618,5894,0.198027,0.191381,0.5,0.45,0.560825,...,5894,29512,0.1967,17110.0,0.190298,0.0,0.040795,0.570602,1,0.1967
1,L1_UniqueCarrier=OTHER|L2_Distance=300-800|L3_...,L3,L1_UniqueCarrier=OTHER|L2_Distance=300-800,22168,5562,0.106324,0.112549,0.55,0.5,0.338997,...,5562,27730,0.107573,16465.0,0.108837,0.0,0.002065,0.328272,1,0.107573
2,L1_UniqueCarrier=WN|L2_Distance=300-800|L3_CRS...,L3,L1_UniqueCarrier=WN|L2_Distance=300-800,18570,4669,0.078675,0.079889,0.55,0.5,0.239666,...,4669,23239,0.078919,5781.0,0.081474,0.0,0.000692,0.244162,1,0.078919
3,L1_UniqueCarrier=WN|L2_Distance=300-800|L3_CRS...,L3,L1_UniqueCarrier=WN|L2_Distance=300-800,18362,4552,0.221054,0.210237,0.5,0.45,0.639389,...,4552,22914,0.218905,5809.0,0.21088,0.0,0.045963,0.640041,1,0.218905
4,L1_UniqueCarrier=DL|L2_Distance=300-800|L3_CRS...,L3,L1_UniqueCarrier=DL|L2_Distance=300-800,15291,3784,0.158459,0.150634,0.55,0.5,0.457452,...,3784,19075,0.156907,4742.0,0.156474,0.0,0.008857,0.4748,1,0.156907


  plt.tight_layout()


【INFO】【2025-11-25 02:06:16】【步骤7摘要】桶级指标已整理，可用于局部化分析。


In [None]:
# 步骤8：结果汇总
log_info('【步骤8】检查结果文件与图表。')
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['results_dir'])))
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['figs_dir'])))
log_info('【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。')

【INFO】【2025-11-25 02:06:16】【步骤8】检查结果文件与图表。
['bucket_metrics.csv', 'bucket_thresholds_per_fold.csv', 'metrics_kfold_per_fold.csv', 'metrics_kfold_summary.csv']
['bucket_metrics_bar.png', 'class_distribution.png', 'metrics_compare.png']
【INFO】【2025-11-25 02:06:16】【全部步骤完成】Adult 数据集上的 BT-TWD 可行性实验结束。


: 