# Llama OCEAN Feature Generation: End-to-End Pipeline
## Using the Llama-3.1-8B-Instruct model (free)

---

### Pipeline Overview

1. **Label with Llama** (500 samples) → OCEAN ground truth
2. **Learn weights** (Ridge Regression) → categorical → OCEAN mapping
3. **Generate features** (full dataset) → 10,000 rows × 5 OCEAN columns
4. **XGBoost comparison** → Baseline vs. Baseline+OCEAN

**Cost**: $0 (Llama is free)  
**Time**: ~20 minutes

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Project modules
from utils.io import load_lending_club_data, prepare_binary_target
from utils.seed import set_seed
from utils.metrics import compute_all_metrics, delong_test
from text_features.ocean_llama_labeler import OceanLlamaLabeler, OCEAN_DIMS
from utils.ocean_weight_learner import OceanWeightLearner
from utils.ocean_feature_generator import OceanFeatureGenerator
from utils.ocean_evaluator import OceanEvaluator

set_seed(42)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 0. Load Data

In [None]:
# Download the dataset
path = kagglehub.dataset_download("ethon0426/lending-club-20072020q1")
file_path = path + "/Loan_status_2007-2020Q3.gzip"

# Load the data (start with 10,000 rows for testing)
ROW_LIMIT = 10000

df = load_lending_club_data(file_path, row_limit=ROW_LIMIT)
df = prepare_binary_target(df, target_col="loan_status")

print(f"\nData shape: {df.shape}")
print(f"Default rate: {df['target'].mean():.2%}")

## 1. Label with Llama (Generate Ground Truth)

**Description**: Use the Llama model to assign OCEAN personality scores to 500 samples  
**Cost**: $0 (free)  
**Time**: ~10-15 minutes  

**⚠️ Important**: Set `HF_TOKEN` in the `.env` file first
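
The labeling cell that follows requests a stratified sample (`stratified=True`). How `OceanLlamaLabeler.label_batch` implements this is not shown in this notebook, so the snippet below is only a minimal sketch of one common approach: sample the same fraction from each class of the target column so the default rate of the sample matches the full data. The helper name `stratified_sample` and the toy frame are hypothetical.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, target_col: str, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw roughly n rows while preserving the class ratio of target_col."""
    frac = min(1.0, n / len(df))
    # Sample the same fraction from every class so proportions are preserved.
    parts = [
        group.sample(frac=frac, random_state=seed)
        for _, group in df.groupby(target_col)
    ]
    # Shuffle the concatenated parts so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy data (hypothetical): 20% positives, like an imbalanced default label.
toy = pd.DataFrame({"target": [0] * 80 + [1] * 20, "loan_amnt": range(100)})
sample = stratified_sample(toy, "target", n=50)
print(len(sample), sample["target"].mean())
```

Because each class is sampled separately, the returned size can differ from `n` by a row or two when class sizes do not divide evenly.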

In [None]:
# Initialize the labeler
labeler = OceanLlamaLabeler()

# Label in batch (500 samples, stratified sampling)
df_truth = labeler.label_batch(
    df, 
    sample_size=500, 
    stratified=True,
    rate_limit_delay=0.5  # delay between calls to respect API rate limits
)

# Save the ground truth
df_truth.to_csv('../artifacts/ground_truth_llama.csv', index=False)
print("\n✅ Ground truth saved to: artifacts/ground_truth_llama.csv")

### 1.1 Evaluate Ground Truth Quality

In [None]:
evaluator = OceanEvaluator()
truth_quality = evaluator.evaluate_ground_truth_quality(df_truth)

## 2. Learn Weights (Ridge Regression)

**Description**: Learn the mapping from categorical variables to OCEAN scores  
**Method**: Ridge Regression (L2 regularization)  
**Output**: Weight coefficients for each OCEAN dimension

In [None]:
# Define the categorical variables
CATEGORICAL_VARS = [
    'grade', 'purpose', 'term', 'home_ownership',
    'emp_length', 'verification_status', 'application_type'
]

# Keep only the columns that exist in the data
CATEGORICAL_VARS = [c for c in CATEGORICAL_VARS if c in df_truth.columns]

print(f"Categorical variables in use: {CATEGORICAL_VARS}")

In [None]:
# Initialize the learner
learner = OceanWeightLearner(method='ridge', alpha=0.1)

# Learn the weights (5-fold cross-validation)
weights, encoder = learner.fit(
    X_categorical=df_truth[CATEGORICAL_VARS],
    y_ocean_truth=df_truth[[f'{d}_truth' for d in OCEAN_DIMS]],
    cv=5
)

# Save the weights together with the fitted encoder
joblib.dump(
    {'weights': weights, 'encoder': encoder},
    '../artifacts/ocean_weights_llama.pkl'
)
print("\n✅ Weights saved to: artifacts/ocean_weights_llama.pkl")

### 2.1 View the Learning Summary

In [None]:
learner.get_summary()

### 2.2 View Top Features per Dimension

In [None]:
for dim in OCEAN_DIMS:
    print(f"\n{'=' * 60}")
    print(f"{dim.upper()} - Top 10 features")
    print(f"{'=' * 60}")
    display(learner.get_top_features(dim, top_n=10))

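## 3. Generate OCEAN Features for the Full Dataset

**Description**: Use the learned weights to generate OCEAN features for the full dataset (10,000 rows)  
**Cost**: $0 (local computation)  
**Time**: Instant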
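
Once the weights are learned, scoring is a linear map: each OCEAN score is the dot product of a row's one-hot vector with that dimension's weights, plus an intercept. The sketch below shows this with NumPy on hand-picked numbers. The weight matrix, intercept, and one-hot rows are hypothetical, and `OceanFeatureGenerator` may apply extra steps (such as clipping or rescaling) that are not shown in this notebook.

```python
import numpy as np

# Hypothetical learned parameters: 3 one-hot columns mapped to 5 OCEAN scores.
W = np.array([
    [0.1, 0.0, -0.1],
    [0.2, 0.0, -0.2],
    [0.0, 0.0, 0.0],
    [-0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
])                   # shape (5, 3): one row of weights per OCEAN dimension
b = np.full(5, 0.5)  # shape (5,): one intercept per OCEAN dimension

# Two rows of one-hot encoded categorical data (hypothetical).
X_onehot = np.array([
    [1, 0, 0],
    [0, 0, 1],
])

# Linear scoring: (n_rows, 3) @ (3, 5) + (5,) gives (n_rows, 5).
scores = X_onehot @ W.T + b
print(scores)
```

Because this is a single matrix product, scoring the full dataset costs almost nothing compared with calling the LLM for every row.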

In [None]:
# Initialize the generator
generator = OceanFeatureGenerator(weights, encoder)

# Generate the OCEAN features
df_full = generator.generate_features(df)

print("\n✅ OCEAN features added to the dataset")
print(f"New columns: {OCEAN_DIMS}")

### 3.1 Evaluate the Predictive Power of the Generated Features

In [None]:
predictive_power = evaluator.evaluate_predictive_power(df_full, target_col='target')

In [None]:
# Summary report
summary = evaluator.generate_summary_report()
display(summary)

## 4. XGBoost A/B Comparison

**Variants compared**:
- **Variant A**: Baseline (structured variables)
- **Variant B**: Baseline + OCEAN (5 personality features)

**Evaluation metrics**: ROC-AUC, PR-AUC, KS, Brier Score
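
For reference, the four metrics can be computed with scikit-learn as sketched below. This is not the code of `compute_all_metrics`, which is not shown here. In particular, using average precision for PR-AUC and taking KS as the maximum gap between TPR and FPR are assumptions, and the function name `sketch_metrics` is hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
    roc_curve,
)


def sketch_metrics(y_true, y_proba):
    """Compute ROC-AUC, PR-AUC, KS and Brier score for binary predictions."""
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    return {
        "roc_auc": roc_auc_score(y_true, y_proba),
        # Average precision is one common estimate of the area under the PR curve.
        "pr_auc": average_precision_score(y_true, y_proba),
        # KS statistic: largest vertical gap between the TPR and FPR curves.
        "ks": float(np.max(tpr - fpr)),
        # Brier score: mean squared error of the predicted probabilities.
        "brier": brier_score_loss(y_true, y_proba),
    }


y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])
metrics = sketch_metrics(y_true, y_proba)
print(metrics)
```

Higher is better for ROC-AUC, PR-AUC, and KS, while a lower Brier score indicates better-calibrated probabilities.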

In [None]:
# Define the baseline features
numeric_features = [
    "loan_amnt", "int_rate", "installment", "annual_inc", "dti",
    "inq_last_6mths", "open_acc", "pub_rec", "revol_bal", "revol_util",
    "total_acc"
]
numeric_features = [c for c in numeric_features if c in df_full.columns]

categorical_features_model = [c for c in CATEGORICAL_VARS if c in df_full.columns]

baseline_features = numeric_features + categorical_features_model
ocean_features = OCEAN_DIMS

print(f"Number of baseline features: {len(baseline_features)}")
print(f"Number of OCEAN features: {len(ocean_features)}")
print(f"Total features (Baseline+OCEAN): {len(baseline_features) + len(ocean_features)}")

In [None]:
# Preprocessing: one-hot encode the categorical features
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Numeric features: impute missing values with the median
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

# Categorical features: impute with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=True))
])

# Variant A: Baseline
preprocessor_A = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features_model),
    ],
    remainder="drop"
)

# Variant B: Baseline + OCEAN
preprocessor_B = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("ocean", "passthrough", ocean_features),  # OCEAN columns pass through unchanged
        ("cat", categorical_transformer, categorical_features_model),
    ],
    remainder="drop"
)

In [None]:
# Train-Test Split
y = df_full['target'].values

X_train, X_test, y_train, y_test = train_test_split(
    df_full, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Training set default rate: {y_train.mean():.2%}")
print(f"Test set default rate: {y_test.mean():.2%}")

### 4.1 Train Variant A: Baseline

In [None]:
print("\nTraining variant A: Baseline (no OCEAN)...\n")

# Preprocess
X_train_A = preprocessor_A.fit_transform(X_train)
X_test_A = preprocessor_A.transform(X_test)

# Train XGBoost; weight the positive class by the negative/positive ratio
pos = int((y_train == 1).sum())
neg = int((y_train == 0).sum())
scale_pos_weight = neg / max(1, pos)

model_A = XGBClassifier(
    objective="binary:logistic",
    tree_method="hist",
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric="auc"
)

model_A.fit(X_train_A, y_train, verbose=False)

# Predict
y_proba_A = model_A.predict_proba(X_test_A)[:, 1]
metrics_A = compute_all_metrics(y_test, y_proba_A)

print("\nVariant A results:")
for k, v in metrics_A.items():
    print(f"  {k}: {v:.4f}")

### 4.2 Train Variant B: Baseline + OCEAN

In [None]:
print("\nTraining variant B: Baseline + OCEAN...\n")

# Preprocess
X_train_B = preprocessor_B.fit_transform(X_train)
X_test_B = preprocessor_B.transform(X_test)

# Train XGBoost with the same hyperparameters as variant A
# (scale_pos_weight is reused from the variant A cell)
model_B = XGBClassifier(
    objective="binary:logistic",
    tree_method="hist",
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric="auc"
)

model_B.fit(X_train_B, y_train, verbose=False)

# Predict
y_proba_B = model_B.predict_proba(X_test_B)[:, 1]
metrics_B = compute_all_metrics(y_test, y_proba_B)

print("\nVariant B results:")
for k, v in metrics_B.items():
    print(f"  {k}: {v:.4f}")

### 4.3 Comparison Results and Statistical Significance
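
What `evaluator.compare_models` does internally is not shown in this notebook (the imports include a `delong_test` helper, which suggests a DeLong test). As a simpler alternative that is easy to verify by hand, the sketch below uses a paired bootstrap: it resamples the test rows with replacement and recomputes AUC(B) minus AUC(A) on each resample to get a 95% percentile interval. The function name `bootstrap_auc_diff` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_diff(y_true, proba_a, proba_b, n_boot=1000, seed=42):
    """Paired bootstrap of AUC(B) - AUC(A) evaluated on the same test rows."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    proba_a = np.asarray(proba_a)
    proba_b = np.asarray(proba_b)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        y_s = y_true[idx]
        if y_s.min() == y_s.max():
            continue  # AUC is undefined when a resample contains one class only
        diffs.append(
            roc_auc_score(y_s, proba_b[idx]) - roc_auc_score(y_s, proba_a[idx])
        )
    observed = roc_auc_score(y_true, proba_b) - roc_auc_score(y_true, proba_a)
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    return observed, ci_low, ci_high


# Synthetic example: model A is pure noise, model B separates the classes.
data_rng = np.random.default_rng(0)
y_demo = data_rng.integers(0, 2, size=200)
proba_a_demo = data_rng.uniform(size=200)
proba_b_demo = 0.7 * y_demo + 0.3 * data_rng.uniform(size=200)

observed, ci_low, ci_high = bootstrap_auc_diff(
    y_demo, proba_a_demo, proba_b_demo, n_boot=200
)
print(f"AUC diff = {observed:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the AUC difference is unlikely to be due to test-set sampling noise alone. With real predictions, the call would be `bootstrap_auc_diff(y_test, y_proba_A, y_proba_B)`.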

In [None]:
comparison = evaluator.compare_models(y_test, y_proba_A, y_proba_B)

### 4.4 Visual Comparison

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curve
fpr_A, tpr_A, _ = roc_curve(y_test, y_proba_A)
fpr_B, tpr_B, _ = roc_curve(y_test, y_proba_B)

axes[0].plot(fpr_A, tpr_A, label=f"Baseline (AUC={auc(fpr_A, tpr_A):.3f})", linewidth=2)
axes[0].plot(fpr_B, tpr_B, label=f"+ OCEAN (AUC={auc(fpr_B, tpr_B):.3f})", linewidth=2)
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1)
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve Comparison')
axes[0].legend()
axes[0].grid(True)

# Precision-Recall Curve
prec_A, rec_A, _ = precision_recall_curve(y_test, y_proba_A)
prec_B, rec_B, _ = precision_recall_curve(y_test, y_proba_B)

axes[1].plot(rec_A, prec_A, label=f"Baseline (PR-AUC={metrics_A['pr_auc']:.3f})", linewidth=2)
axes[1].plot(rec_B, prec_B, label=f"+ OCEAN (PR-AUC={metrics_B['pr_auc']:.3f})", linewidth=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve Comparison')
axes[1].legend()
axes[1].grid(True)

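plt.tight_layout()
# Make sure the output directory exists; savefig fails if it is missing
import os
os.makedirs('../artifacts/results', exist_ok=True)
plt.savefig('../artifacts/results/llama_ocean_comparison.png', dpi=150)
plt.show()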

## 5. Save Results

In [None]:
# Save the Baseline+OCEAN model (variant B)
joblib.dump(model_B, '../artifacts/xgb_ocean_llama.pkl')
print("✅ Model saved: artifacts/xgb_ocean_llama.pkl")

# Save the comparison results
results_df = pd.DataFrame([
    {'model': 'Baseline', **metrics_A},
    {'model': 'Baseline+OCEAN', **metrics_B}
])
results_df.to_csv('../artifacts/results/llama_ocean_results.csv', index=False)
print("✅ Results saved: artifacts/results/llama_ocean_results.csv")

display(results_df)

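## 6. Summary

### ✅ Completed Work

1. Generated OCEAN ground truth for 500 samples with the Llama model
2. Learned the categorical → OCEAN mapping weights with Ridge Regression
3. Generated OCEAN features for all 10,000 samples
4. Ran the XGBoost A/B comparison

### 📊 Results Summary

(Fill in after running)

- **ROC-AUC gain**: +_____
- **PR-AUC gain**: +_____
- **KS gain**: +_____
- **Statistical significance**: p = _____

### 💰 Cost

- **Total cost**: $0 (Llama is free)
- **Total time**: ~20 minutes

### 🎯 Next Steps

1. If the effect is significant: scale up to the complete dataset (100k+ samples)
2. Try other text sources (if a `desc` field is available)
3. Feature attribution analysis (SHAP)
4. Production deployment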