Airbnb NYC 数据集 BT-TWD 可行性实验

本 notebook 按步骤运行：加载配置 → 读取数据 → 预处理 → 桶树划分 → 基线与 BTTWD k 折实验 → 桶级分析。

In [1]:
# 步骤0：环境与路径设置
import os, sys
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

# 将项目根目录加入路径，便于导入 bttwdlib
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from bttwdlib import (
    load_yaml_cfg,
    show_cfg,
    load_dataset,
    prepare_features_and_labels,
    BucketTree,
    run_kfold_experiments,
    log_info,
    set_global_seed,
)

cfg_path = Path(root_path) / "configs" / "airbnb_nyc.yaml"
cfg = load_yaml_cfg(cfg_path)
set_global_seed(cfg.get('SEED', {}).get('global_seed', 42))
log_info('【步骤0摘要】环境准备完毕，路径与随机种子已设置。')

【INFO】【2025-12-11 17:53:06】【配置加载】已读取 e:\yan\组\三支决策\机器学习\BT_TWD\configs\airbnb_nyc.yaml
【INFO】【2025-12-11 17:53:09】【步骤0摘要】环境准备完毕，路径与随机种子已设置。


In [2]:
# 步骤1：加载配置
show_cfg(cfg)
log_info('【步骤1摘要】配置文件加载完成，关键参数检查通过。')

【INFO】【2025-12-11 17:53:09】【配置-数据】数据集=airbnb_nyc, k折=5, 目标列=price, 正类="1"
【INFO】【2025-12-11 17:53:09】【配置-BTTWD】阈值模式=None, 全局模型=xgb, 桶内模型=none, 后验估计器(兼容字段)=logreg
【INFO】【2025-12-11 17:53:09】【配置-基线】LogReg启用=False, RandomForest启用=True, KNN启用=True, XGBoost启用=True
【INFO】【2025-12-11 17:53:09】【步骤1摘要】配置文件加载完成，关键参数检查通过。


In [3]:
# 步骤2：加载原始数据
df_raw, target_col_model = load_dataset(cfg)  # 这里返回的是用于建模的标签列，例如 "label"

display(df_raw.head())
print("用于建模的标签列:", target_col_model)

# 1）画 0/1 标签（高价/非高价）的比例
class_counts = df_raw[target_col_model].value_counts(normalize=True)
ax = class_counts.plot(kind='bar', title='高价 vs 非高价比例')
plt.ylabel('比例')

fig_path = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'class_distribution.png')
os.makedirs(os.path.dirname(fig_path), exist_ok=True)
plt.savefig(fig_path, bbox_inches='tight')
plt.close()

# 2）如果想看原始标签列的分布，可以另外单独分析：
raw_target_col = cfg['DATA']['target_col']  # 这里是原始标签列
print("原始目标列:", raw_target_col)
print(df_raw[raw_target_col].describe())

log_info('【步骤2摘要】Airbnb NYC 原始数据加载与基本统计完成。')


【INFO】【2025-12-11 17:53:09】【数据加载】文本表格 ..\data\AB_NYC\AB_NYC_2019.csv 已读取，样本数=48895，列数=16
【INFO】【2025-12-11 17:53:09】【目标变换】已按阈值 255 生成二分类标签列 high_price，正类取 > 255
【INFO】【2025-12-11 17:53:09】【数据集信息】名称=airbnb_nyc，样本数=48895，目标列=high_price，正类比例=10.52%


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,high_price
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365,0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,0


用于建模的标签列: high_price
原始目标列: price
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
【INFO】【2025-12-11 17:53:09】【步骤2摘要】Airbnb NYC 原始数据加载与基本统计完成。


In [4]:
# 步骤3：预处理与特征工程
X, y, meta = prepare_features_and_labels(df_raw, cfg)
log_info(f'【预处理】编码特征维度={X.shape[1]}，样本数={X.shape[0]}')
log_info(f"【步骤3摘要】特征预处理完成：连续={len(meta['continuous_cols'])}，类别={len(meta['categorical_cols'])}，编码维度={X.shape[1]}。")

【INFO】【2025-12-11 17:53:09】【预处理】缺失值填充策略=most_frequent
【INFO】【2025-12-11 17:53:09】【预处理】连续特征=7个，类别特征=3个
【INFO】【2025-12-11 17:53:09】【预处理】编码后维度=233
【INFO】【2025-12-11 17:53:09】【预处理】编码特征维度=233，样本数=48895
【INFO】【2025-12-11 17:53:09】【步骤3摘要】特征预处理完成：连续=7，类别=3，编码维度=233。


In [5]:
# 步骤4：构建桶树并检查划分
feature_cols_for_bucket = [c for c in df_raw.columns if c != target_col_model]

bucket_tree = BucketTree(
    cfg['BTTWD']['bucket_levels'],
    feature_names=feature_cols_for_bucket
)

bucket_ids_full = bucket_tree.assign_buckets(df_raw[feature_cols_for_bucket])

bucket_df = bucket_ids_full.value_counts().reset_index()
bucket_df.columns = ['bucket_id', 'count']

bucket_df['pos_rate'] = (
    df_raw.groupby(bucket_ids_full)[target_col_model]
    .apply(lambda s: (s == 1).mean())
    .values
)
display(bucket_df.head())
bucket_df.set_index('bucket_id')['count'].plot(kind='bar', figsize=(12,4), title='桶样本数分布')
fig_bucket = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'bucket_metrics_bar.png')
plt.savefig(fig_bucket, bbox_inches='tight')
plt.close()
log_info(f'【步骤4摘要】桶树划分完成，共有 {bucket_ids_full.nunique()} 个叶子桶。')


【INFO】【2025-12-11 17:53:09】【桶树】已为样本生成桶ID，共 45 个组合


Unnamed: 0,bucket_id,count,pos_rate
0,L1_neighbourhood_group=manhattan|L2_room_type=...,6276,0.100671
1,L1_neighbourhood_group=brooklyn|L2_room_type=p...,5190,0.058824
2,L1_neighbourhood_group=brooklyn|L2_room_type=e...,4857,0.031915
3,L1_neighbourhood_group=manhattan|L2_room_type=...,4311,0.011111
4,L1_neighbourhood_group=manhattan|L2_room_type=...,4020,0.004149


【INFO】【2025-12-11 17:53:10】【步骤4摘要】桶树划分完成，共有 45 个叶子桶。


In [6]:
# 步骤5：运行基线模型 k 折实验
# 基线部分在 run_kfold_experiments 内统一调度
log_info('【步骤5】基线模型将在整体交叉验证中一并运行。')
log_info('【步骤5摘要】基线模型性能将作为后续对比基准。')

【INFO】【2025-12-11 17:53:10】【步骤5】基线模型将在整体交叉验证中一并运行。
【INFO】【2025-12-11 17:53:10】【步骤5摘要】基线模型性能将作为后续对比基准。


In [7]:
import numpy as np
import pandas as pd

print("y 全局标签分布：", np.unique(y, return_counts=True))

print("原始标签列分布：")
print(df_raw[cfg['DATA']['target_col']].value_counts())


y 全局标签分布： (array([0, 1]), array([43751,  5144], dtype=int64))
原始标签列分布：
price
100    2051
150    2047
50     1534
60     1458
200    1401
       ... 
780       1
386       1
888       1
483       1
338       1
Name: count, Length: 674, dtype: int64


In [8]:
# 步骤6：运行 BTTWD k 折实验（含基线）
results = run_kfold_experiments(X, y, df_raw.drop(columns=[cfg['DATA']['target_col']]), cfg)
summary_df = pd.read_csv(os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'metrics_kfold_summary.csv'))
display(summary_df)
summary_df.plot(x='model', kind='bar', figsize=(8,4), title='模型指标对比')
fig_compare = os.path.join(root_path, cfg['OUTPUT']['figs_dir'], 'metrics_compare.png')
plt.savefig(fig_compare, bbox_inches='tight')
plt.close()
log_info('【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。')

【INFO】【2025-12-11 17:53:10】【基线-RF】使用决策阈值=0.400（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-11 17:53:41】【基线-RF】整体指标：AUC_mean=0.864, AUC_std=0.006, BAC_mean=0.666, BAC_std=0.008, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.443, F1_std=0.016, Kappa_mean=0.393, Kappa_std=0.018, MCC_mean=0.404, MCC_std=0.018, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.565, Precision_std=0.022, Recall_mean=0.365, Recall_std=0.016, Regret_mean=0.130, Regret_std=0.003
【INFO】【2025-12-11 17:53:41】【基线-KNN】使用决策阈值=0.400（fixed 模式）


  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-11 17:53:50】【基线-KNN】整体指标：AUC_mean=0.786, AUC_std=0.005, BAC_mean=0.647, BAC_std=0.011, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.393, F1_std=0.021, Kappa_mean=0.334, Kappa_std=0.022, MCC_mean=0.339, MCC_std=0.022, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.466, Precision_std=0.022, Recall_mean=0.340, Recall_std=0.021, Regret_mean=0.145, Regret_std=0.004
【INFO】【2025-12-11 17:53:50】【基线-XGB】使用决策阈值=0.400（fixed 模式）


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.

  summary[f"{col}_mean"] = float(np.nanmean(arr))
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


【INFO】【2025-12-11 17:53:58】【基线-XGB】整体指标：AUC_mean=0.876, AUC_std=0.005, BAC_mean=0.660, BAC_std=0.003, BND_ratio_mean=0.000, BND_ratio_std=0.000, F1_mean=0.435, F1_std=0.005, Kappa_mean=0.386, Kappa_std=0.005, MCC_mean=0.400, MCC_std=0.004, POS_Coverage_mean=nan, POS_Coverage_std=nan, Precision_mean=0.573, Precision_std=0.007, Recall_mean=0.351, Recall_std=0.006, Regret_mean=0.130, Regret_std=0.001
【INFO】【2025-12-11 17:53:58】【K折实验】正在执行第 1/5 折...
【INFO】【2025-12-11 17:53:58】[BT] 使用桶评分配置：mode=f1_regret_bnd, f1_weight=1.0, regret_weight=1.0, bnd_weight=0.5


Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:54:00】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 17:54:00] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=39116
[INFO][BT][2025-12-11 17:54:00] 创建桶 bucket_id=L1_neighbourhood_group=bronx，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="bronx"，n_samples=854
[INFO][BT][2025-12-11 17:54:00] 创建桶 bucket_id=L1_neighbourhood_group=brooklyn，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="brooklyn"，n_samples=16008
[INFO][BT][2025-12-11 17:54:00] 创建桶 bucket_id=L1_neighbourhood_group=manhattan，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="manhattan"，n_samples=17369
[INFO][BT][2025-12-11 17:54:00] 创建桶 bucket_id=L1_neighbourhood_group=queens，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="queens"，n_samples=4589
[INFO][BT][202



【INFO】【2025-12-11 17:54:03】【阈值】桶 ROOT（n_val=2419）使用本地阈值 α=0.5000, β=0.4000
[INFO][BT][2025-12-11 17:54:03] 桶 bucket_id=ROOT level=0：
    n_train=5474, n_val=2419,
    BAC=0.599, F1=0.333, AUC=0.859,
    Regret=0.123, BND_ratio=0.025, POS_coverage=0.037,
    Score(f1_regret_bnd )=0.198
【INFO】【2025-12-11 17:54:04】【阈值】桶 L1_neighbourhood_group=bronx 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:04] 桶 bucket_id=L1_neighbourhood_group=bronx level=1：
    n_train=141, n_val=130,
    BAC=0.542, F1=0.182, AUC=0.843,
    Regret=0.096, BND_ratio=0.031, POS_coverage=0.008,
    Score(f1_regret_bnd )=0.070
[INFO][BT][2025-12-11 17:54:04] 桶 bucket_id=L1_neighbourhood_group=bronx：
    parent_id=ROOT，parent_Score=0.198, bucket_Score=0.070,
    Gain=-0.127, is_weak=True
【INFO】【2025-12-11 17:54:04】【阈值】桶 L1_neighbourhood_group=brooklyn 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:04] 桶 bucket_id=L1_neighbourhood_group=brooklyn level=1：
    n_train=2232, n_val=1039,
    BAC=0.514, F1=0.066, AUC=0

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:54:17】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 17:54:17] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=39116
[INFO][BT][2025-12-11 17:54:17] 创建桶 bucket_id=L1_neighbourhood_group=bronx，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="bronx"，n_samples=869
[INFO][BT][2025-12-11 17:54:17] 创建桶 bucket_id=L1_neighbourhood_group=brooklyn，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="brooklyn"，n_samples=16000
[INFO][BT][2025-12-11 17:54:17] 创建桶 bucket_id=L1_neighbourhood_group=manhattan，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="manhattan"，n_samples=17412
[INFO][BT][2025-12-11 17:54:17] 创建桶 bucket_id=L1_neighbourhood_group=queens，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="queens"，n_samples=4533
[INFO][BT][202



【INFO】【2025-12-11 17:54:20】【阈值】桶 ROOT（n_val=2422）使用本地阈值 α=0.5000, β=0.3000
[INFO][BT][2025-12-11 17:54:20] 桶 bucket_id=ROOT level=0：
    n_train=5473, n_val=2422,
    BAC=0.581, F1=0.331, AUC=0.881,
    Regret=0.113, BND_ratio=0.066, POS_coverage=0.029,
    Score(f1_regret_bnd )=0.185
【INFO】【2025-12-11 17:54:20】【阈值】桶 L1_neighbourhood_group=bronx 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:20] 桶 bucket_id=L1_neighbourhood_group=bronx level=1：
    n_train=139, n_val=145,
    BAC=0.496, F1=0.000, AUC=0.685,
    Regret=0.069, BND_ratio=0.014, POS_coverage=0.000,
    Score(f1_regret_bnd )=-0.076
[INFO][BT][2025-12-11 17:54:20] 桶 bucket_id=L1_neighbourhood_group=bronx：
    parent_id=ROOT，parent_Score=0.185, bucket_Score=-0.076,
    Gain=-0.261, is_weak=True
【INFO】【2025-12-11 17:54:20】【阈值】桶 L1_neighbourhood_group=brooklyn 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:20] 桶 bucket_id=L1_neighbourhood_group=brooklyn level=1：
    n_train=2249, n_val=1027,
    BAC=0.539, F1=0.147, AUC

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:54:31】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 17:54:32] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=39116
[INFO][BT][2025-12-11 17:54:32] 创建桶 bucket_id=L1_neighbourhood_group=bronx，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="bronx"，n_samples=880
[INFO][BT][2025-12-11 17:54:32] 创建桶 bucket_id=L1_neighbourhood_group=brooklyn，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="brooklyn"，n_samples=16120
[INFO][BT][2025-12-11 17:54:32] 创建桶 bucket_id=L1_neighbourhood_group=manhattan，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="manhattan"，n_samples=17319
[INFO][BT][2025-12-11 17:54:32] 创建桶 bucket_id=L1_neighbourhood_group=queens，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="queens"，n_samples=4494
[INFO][BT][202



【INFO】【2025-12-11 17:54:34】【阈值】桶 ROOT（n_val=2418）使用本地阈值 α=0.5000, β=0.3000
[INFO][BT][2025-12-11 17:54:34] 桶 bucket_id=ROOT level=0：
    n_train=5473, n_val=2418,
    BAC=0.579, F1=0.326, AUC=0.866,
    Regret=0.128, BND_ratio=0.067, POS_coverage=0.033,
    Score(f1_regret_bnd )=0.165
【INFO】【2025-12-11 17:54:35】【阈值】桶 L1_neighbourhood_group=bronx（n_val=140）使用本地阈值 α=0.4000, β=0.3000
[INFO][BT][2025-12-11 17:54:35] 桶 bucket_id=L1_neighbourhood_group=bronx level=1：
    n_train=149, n_val=140,
    BAC=0.639, F1=0.400, AUC=0.821,
    Regret=0.061, BND_ratio=0.000, POS_coverage=0.021,
    Score(f1_regret_bnd )=0.339
[INFO][BT][2025-12-11 17:54:35] 桶 bucket_id=L1_neighbourhood_group=bronx：
    parent_id=ROOT，parent_Score=0.165, bucket_Score=0.339,
    Gain=+0.174, is_weak=False
【INFO】【2025-12-11 17:54:35】【阈值】桶 L1_neighbourhood_group=brooklyn 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:35] 桶 bucket_id=L1_neighbourhood_group=brooklyn level=1：
    n_train=2266, n_val=1031,
    BAC=0.495, F

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:54:46】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 17:54:46] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=39116
[INFO][BT][2025-12-11 17:54:46] 创建桶 bucket_id=L1_neighbourhood_group=bronx，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="bronx"，n_samples=871
[INFO][BT][2025-12-11 17:54:46] 创建桶 bucket_id=L1_neighbourhood_group=brooklyn，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="brooklyn"，n_samples=16148
[INFO][BT][2025-12-11 17:54:46] 创建桶 bucket_id=L1_neighbourhood_group=manhattan，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="manhattan"，n_samples=17262
[INFO][BT][2025-12-11 17:54:46] 创建桶 bucket_id=L1_neighbourhood_group=queens，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="queens"，n_samples=4535
[INFO][BT][202



【INFO】【2025-12-11 17:54:49】【阈值】桶 ROOT（n_val=2409）使用本地阈值 α=0.5000, β=0.4000
[INFO][BT][2025-12-11 17:54:49] 桶 bucket_id=ROOT level=0：
    n_train=5475, n_val=2409,
    BAC=0.609, F1=0.358, AUC=0.846,
    Regret=0.126, BND_ratio=0.020, POS_coverage=0.037,
    Score(f1_regret_bnd )=0.222
【INFO】【2025-12-11 17:54:49】【阈值】桶 L1_neighbourhood_group=bronx 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:49] 桶 bucket_id=L1_neighbourhood_group=bronx level=1：
    n_train=143, n_val=146,
    BAC=0.500, F1=0.000, AUC=0.753,
    Regret=0.031, BND_ratio=0.000, POS_coverage=0.000,
    Score(f1_regret_bnd )=-0.031
[INFO][BT][2025-12-11 17:54:49] 桶 bucket_id=L1_neighbourhood_group=bronx：
    parent_id=ROOT，parent_Score=0.222, bucket_Score=-0.031,
    Gain=-0.253, is_weak=True
【INFO】【2025-12-11 17:54:49】【阈值】桶 L1_neighbourhood_group=brooklyn 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:54:49] 桶 bucket_id=L1_neighbourhood_group=brooklyn level=1：
    n_train=2282, n_val=1019,
    BAC=0.501, F1=0.086, AUC

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:55:01】【BTTWD】全局模型训练完成，用于兜底预测
[INFO][BT][2025-12-11 17:55:01] 创建桶 bucket_id=ROOT，level=0，parent_id=ROOT，split_name=ROOT，split_type=ROOT，split_rule="all"，n_samples=39116
[INFO][BT][2025-12-11 17:55:01] 创建桶 bucket_id=L1_neighbourhood_group=bronx，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="bronx"，n_samples=890
[INFO][BT][2025-12-11 17:55:01] 创建桶 bucket_id=L1_neighbourhood_group=brooklyn，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="brooklyn"，n_samples=16140
[INFO][BT][2025-12-11 17:55:01] 创建桶 bucket_id=L1_neighbourhood_group=manhattan，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="manhattan"，n_samples=17282
[INFO][BT][2025-12-11 17:55:01] 创建桶 bucket_id=L1_neighbourhood_group=queens，level=1，parent_id=ROOT，split_name=L1_neighbourhood_group，split_type=categorical_group，split_rule="queens"，n_samples=4513
[INFO][BT][202



【INFO】【2025-12-11 17:55:04】【阈值】桶 ROOT（n_val=2400）使用本地阈值 α=0.4000, β=0.3000
[INFO][BT][2025-12-11 17:55:04] 桶 bucket_id=ROOT level=0：
    n_train=5474, n_val=2400,
    BAC=0.656, F1=0.465, AUC=0.888,
    Regret=0.119, BND_ratio=0.045, POS_coverage=0.062,
    Score(f1_regret_bnd )=0.323
【INFO】【2025-12-11 17:55:04】【阈值】桶 L1_neighbourhood_group=bronx 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:55:04] 桶 bucket_id=L1_neighbourhood_group=bronx level=1：
    n_train=154, n_val=137,
    BAC=0.500, F1=0.000, AUC=0.834,
    Regret=0.099, BND_ratio=0.000, POS_coverage=0.000,
    Score(f1_regret_bnd )=-0.099
[INFO][BT][2025-12-11 17:55:04] 桶 bucket_id=L1_neighbourhood_group=bronx：
    parent_id=ROOT，parent_Score=0.323, bucket_Score=-0.099,
    Gain=-0.422, is_weak=True
【INFO】【2025-12-11 17:55:05】【阈值】桶 L1_neighbourhood_group=brooklyn 标记为弱桶，阈值将回退使用 ROOT 的阈值
[INFO][BT][2025-12-11 17:55:05] 桶 bucket_id=L1_neighbourhood_group=brooklyn level=1：
    n_train=2249, n_val=976,
    BAC=0.531, F1=0.122, AUC=

Parameters: { "use_label_encoder" } are not used.



【INFO】【2025-12-11 17:55:15】[BASELINE] 全局 XGB 模型训练完成
【INFO】【2025-12-11 17:55:15】[BASELINE] 阈值搜索开始
【INFO】【2025-12-11 17:55:16】[BASELINE] 最佳阈值找到: alpha=0.5000, beta=0.3000, regret=0.1279
【INFO】【2025-12-11 17:55:16】【桶树】已为样本生成桶ID，共 44 个组合
【INFO】【2025-12-11 17:55:16】[BASELINE] 测试集桶映射完成，共 44 个桶


  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expecte

【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=manhattan|L2_room_type=entire_home|L3_availability_365=rare: BAC=0.5484, Regret=0.2179, Precision=0.6176, Recall=0.1094
【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=brooklyn|L2_room_type=shared_room|L3_availability_365=long_term: BAC=0.5000, Regret=0.0385, Precision=0.0000, Recall=0.0000
【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=manhattan|L2_room_type=private_room|L3_availability_365=long_term: BAC=0.7065, Regret=0.0995, Precision=1.0000, Recall=0.4130
【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=brooklyn|L2_room_type=entire_home|L3_availability_365=mid_term: BAC=0.5050, Regret=0.2156, Precision=0.3333, Recall=0.0152
【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=queens|L2_room_type=private_room|L3_availability_365=rare: BAC=0.5000, Regret=0.0144, Precision=0.0000, Recall=0.0000
【INFO】【2025-12-11 17:55:17】[BASELINE] 桶 L1_neighbourhood_group=queens|L2_r

  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)
  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


Unnamed: 0,model,Precision_mean,Precision_std,Recall_mean,Recall_std,F1_mean,F1_std,BAC_mean,BAC_std,AUC_mean,...,MCC_mean,MCC_std,Kappa_mean,Kappa_std,BND_ratio_mean,BND_ratio_std,POS_Coverage_mean,POS_Coverage_std,Regret_mean,Regret_std
0,BTTWD,0.557338,0.015747,0.347976,0.017226,0.428361,0.016775,0.657748,0.008643,0.8726,...,0.390701,0.01694,0.378096,0.017494,0.042888,0.011682,0.04254,0.003196,0.130484,0.002831
1,RandomForest,0.564523,0.022227,0.364886,0.01642,0.443072,0.016438,0.665872,0.008275,0.863615,...,0.404264,0.017795,0.392934,0.017539,0.0,0.0,,,0.12988,0.003346
2,KNN,0.466329,0.021834,0.340392,0.020807,0.393426,0.020866,0.647305,0.010757,0.785738,...,0.339435,0.022216,0.33437,0.022319,0.0,0.0,,,0.145056,0.004241
3,XGBoost,0.573155,0.007293,0.350892,0.006426,0.435225,0.004616,0.660075,0.00279,0.875833,...,0.400131,0.004171,0.386184,0.004515,0.0,0.0,,,0.129942,0.000741


【INFO】【2025-12-11 17:55:18】【步骤6摘要】BTTWD 与基线的 k 折结果已生成并保存。


In [9]:
# 步骤7：桶级别分析
bucket_metrics_path = os.path.join(root_path, cfg['OUTPUT']['results_dir'], 'bucket_metrics.csv')
if os.path.exists(bucket_metrics_path):
    bucket_metrics_df = pd.read_csv(bucket_metrics_path)
    display(bucket_metrics_df.head())
    bucket_metrics_df.plot(x='bucket_id', y='pos_rate_all', kind='bar', figsize=(12,4), title='桶正类比例')
    plt.ylabel('正类比例')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig(fig_bucket, bbox_inches='tight')
    plt.close()
log_info('【步骤7摘要】桶级指标已整理，可用于局部化分析。')

Unnamed: 0,bucket_id,layer,parent_bucket_id,n_train,n_val,pos_rate_train,pos_rate_val,alpha,beta,regret_val,...,is_weak,threshold_source_bucket,parent_with_threshold,n_test,pos_rate_test,BND_ratio_test,POS_Coverage_test,regret_test,fold,pos_rate
0,ROOT,L1,,5474,2419,0.105955,0.094667,0.5,0.4,0.122985,...,False,ROOT,,,,,,,1,0.105225
1,L1_neighbourhood_group=manhattan,L1,ROOT,2435,1036,0.166735,0.169884,0.5,0.4,0.196429,...,True,ROOT,,,,,,,1,0.171109
2,L1_neighbourhood_group=brooklyn,L1,ROOT,2232,1039,0.053763,0.056785,0.5,0.4,0.080366,...,True,ROOT,,,,,,,1,0.059095
3,L1_neighbourhood_group=manhattan|L2_room_type=...,L2,L1_neighbourhood_group=manhattan,1481,636,0.242404,0.259434,0.5,0.3,0.275943,...,False,L1_neighbourhood_group=manhattan|L2_room_type=...,,,,,,,1,0.253068
4,L1_neighbourhood_group=brooklyn|L2_room_type=p...,L2,L1_neighbourhood_group=brooklyn,1124,488,0.008897,0.016393,0.5,0.4,0.02459,...,True,ROOT,,,,,,,1,0.008796


  plt.tight_layout()


【INFO】【2025-12-11 17:55:21】【步骤7摘要】桶级指标已整理，可用于局部化分析。


In [10]:
# 步骤8：结果汇总
log_info('【步骤8】检查结果文件与图表。')
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['results_dir'])))
print(os.listdir(os.path.join(root_path, cfg['OUTPUT']['figs_dir'])))
log_info('【全部步骤完成】Airbnb NYC 数据集上的 BT-TWD 实验结束。')

【INFO】【2025-12-11 17:55:21】【步骤8】检查结果文件与图表。
['baseline_bucket_metrics.csv', 'bucket_fallback_stats.csv', 'bucket_metrics.csv', 'bucket_metrics_gain.csv', 'bucket_thresholds.csv', 'bucket_thresholds_per_fold.csv', 'bucket_tree_structure.csv', 'metrics_kfold_per_fold.csv', 'metrics_kfold_summary.csv', 'metrics_overview.csv']
['bank_class_distribution.png', 'bucket_metrics_bar.png', 'class_distribution.png', 'metrics_compare.png']
【INFO】【2025-12-11 17:55:21】【全部步骤完成】Airbnb NYC 数据集上的 BT-TWD 实验结束。
