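# Financial Fraud Detection - CatBoost Model Training and Evaluation

This notebook trains a CatBoost model for financial fraud detection on the preprocessed `reduced_data.csv` dataset. It covers the following steps:
1. Data loading and exploration
2. Data splitting and preprocessing
3. Baseline CatBoost model training
4. Hyperparameter optimization
5. Model evaluation and visualization
6. Feature importance analysis
7. Model saving
8. Conclusions and recommendations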

In [14]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
import catboost as cb
from imblearn.over_sampling import SMOTE
from collections import Counter
import time
import warnings
warnings.filterwarnings('ignore')

## 1. Data Loading and Exploration

In [15]:
# Load the dataset
data = pd.read_csv('reduced_data.csv')

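## 2. Data Splitting and Preprocessing

In [16]:
# Prepare the features and the target variable
# Note: 'Typrep ' is written with a trailing space, which must match the column name in the CSV
X = data.drop(['Stkcd', 'Accper', 'Typrep ', 'isviolation'], axis=1)
y = data['isviolation']

# Split into training and test sets (80/20, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)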

In [17]:
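# Handle class imbalance with SMOTE (applied to the training split only)
# Note: resampling before cross-validation lets synthetic samples reach the validation
# folds, so the cross-validation scores in section 4 are likely optimistic.
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print('Training set shape after SMOTE:', X_train_resampled.shape)
print('Training set class distribution after SMOTE:')
print(Counter(y_train_resampled))

Training set shape after SMOTE: (162916, 27)
Training set class distribution after SMOTE:
Counter({1: 81458, 0: 81458})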
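
To make the resampling step less of a black box, here is a minimal sketch of the idea behind SMOTE: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The function name `smote_like_samples` and the toy data are hypothetical, and this is a simplified illustration rather than the `imblearn` implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances between minority rows
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a base row, one of its k neighbors, and a position between them
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: the four corners of the unit square (illustrative only)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row is derived from real training rows, it carries information about them, which is why resampling is normally confined to the data a model is fitted on.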


## 3. Baseline CatBoost Model Training

In [18]:
# Define the baseline CatBoost model
base_model = cb.CatBoostClassifier(
    random_state=42,
    iterations=100,
    depth=6,
    learning_rate=0.1,
    loss_function='Logloss',
    eval_metric='AUC',
    verbose=20
)

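# Train the baseline model and record the training time
start_time = time.time()
base_model.fit(X_train_resampled, y_train_resampled)
base_training_time = time.time() - start_time

# Predict on the test set (class labels and positive-class probabilities)
y_pred_base = base_model.predict(X_test)
y_pred_prob_base = base_model.predict_proba(X_test)[:, 1]

# Evaluate the baseline model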
base_accuracy = accuracy_score(y_test, y_pred_base)
base_precision = precision_score(y_test, y_pred_base)
base_recall = recall_score(y_test, y_pred_base)
base_f1 = f1_score(y_test, y_pred_base)
base_auc = roc_auc_score(y_test, y_pred_prob_base)

print('Baseline model performance:')
print(f'Training time: {base_training_time:.2f}s')
print(f'Accuracy: {base_accuracy:.4f}')
print(f'Precision: {base_precision:.4f}')
print(f'Recall: {base_recall:.4f}')
print(f'F1 score: {base_f1:.4f}')
print(f'AUC: {base_auc:.4f}')

0:	total: 22.4ms	remaining: 2.22s
20:	total: 397ms	remaining: 1.49s
40:	total: 857ms	remaining: 1.23s
60:	total: 1.34s	remaining: 856ms
80:	total: 1.9s	remaining: 446ms
99:	total: 2.37s	remaining: 0us
Baseline model performance:
Training time: 2.52s
Accuracy: 0.6495
Precision: 0.2304
Recall: 0.6075
F1 score: 0.3341
AUC: 0.6861


## 4. Hyperparameter Optimization

In [20]:
# Define the CatBoost search space
param_grid = {
    'iterations': [100, 200, 300, 400, 500],
    'depth': [4, 5, 6, 7, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'random_strength': [1, 2, 3],
    'bagging_temperature': [0.0, 0.5, 1.0],
    'border_count': [32, 64, 128],
    'l2_leaf_reg': [1, 3, 5, 7],
    'subsample': [0.7, 0.8, 0.9, 1.0]
}

# A smaller search space for quick demonstration runs (not used by the search below)
param_grid_limited = {
    'iterations': [100, 200],
    'depth': [5, 7],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 0.9],
    'border_count': [32, 64]
}
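
To put the two search spaces in perspective, the short sketch below counts the combinations in each grid. The helper `grid_size` is a hypothetical name introduced here; the grids are copied from the cell above.

```python
from math import prod

param_grid = {
    'iterations': [100, 200, 300, 400, 500],
    'depth': [4, 5, 6, 7, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'random_strength': [1, 2, 3],
    'bagging_temperature': [0.0, 0.5, 1.0],
    'border_count': [32, 64, 128],
    'l2_leaf_reg': [1, 3, 5, 7],
    'subsample': [0.7, 0.8, 0.9, 1.0],
}
param_grid_limited = {
    'iterations': [100, 200],
    'depth': [5, 7],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 0.9],
    'border_count': [32, 64],
}

def grid_size(grid):
    """Number of distinct parameter combinations in a discrete grid."""
    return prod(len(values) for values in grid.values())

full_size = grid_size(param_grid)
limited_size = grid_size(param_grid_limited)
n_iter = 50  # number of sampled candidates used in the random search

print(f'Full grid: {full_size} combinations')
print(f'Limited grid: {limited_size} combinations')
print(f'Share of the full grid covered by {n_iter} samples: {n_iter / full_size:.2%}')
```

The full grid has 43,200 combinations, so a 50-candidate random search explores only a small fraction of it. The limited grid has 32 combinations, fewer than 50, so with that grid scikit-learn's sampler should simply evaluate every combination once.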

In [22]:
# Use StratifiedKFold for cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define the CatBoost classifier to tune
catboost_model = cb.CatBoostClassifier(
    random_state=42,
    loss_function='Logloss',
    eval_metric='AUC',
    verbose=False
)

# Run the hyperparameter search with RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=catboost_model,
    param_distributions=param_grid,
    n_iter=50,  # Sample 50 parameter combinations
    cv=skf,
    scoring='roc_auc',
    random_state=42,
    verbose=2,
    n_jobs=-1
)

# Run the search and record how long it takes
start_time = time.time()
random_search.fit(X_train_resampled, y_train_resampled)
optimization_time = time.time() - start_time

# Report the best parameters and score
print('Hyperparameter optimization complete!')
print(f'Optimization time: {optimization_time:.2f}s')
print('Best parameters:')
for param, value in random_search.best_params_.items():
    print(f'  {param}: {value}')

print(f'Best cross-validation score (ROC AUC): {random_search.best_score_:.4f}')

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Hyperparameter optimization complete!
Optimization time: 1103.57s
Best parameters:
  subsample: 0.8
  random_strength: 2
  learning_rate: 0.2
  l2_leaf_reg: 3
  iterations: 500
  depth: 8
  border_count: 64
  bagging_temperature: 0.5
Best cross-validation score (ROC AUC): 0.8944
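
The cross-validation AUC above (0.8944) is well above the test-set AUC of 0.7500 reported in section 5. One plausible contributor is that the folds were cut from data that had already been oversampled, so validation folds contain synthetic points derived from training rows. The sketch below shows the alternative ordering: split first, then resample only the training part of each fold. It is a NumPy-only illustration with hypothetical helper names, and it uses plain random duplication as a stand-in for SMOTE.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs that keep class proportions per fold."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for fold in range(n_splits):
        yield np.flatnonzero(fold_of != fold), np.flatnonzero(fold_of == fold)

def oversample_minority(train_idx, y, seed=42):
    """Duplicate minority rows (a stand-in for SMOTE) until classes are balanced."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y[train_idx], return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = train_idx[y[train_idx] == cls]
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    return np.concatenate(parts)

# Toy labels with roughly the imbalance seen in the test set (illustrative only)
y_demo = np.array([0] * 85 + [1] * 15)

folds = []
for train_idx, val_idx in stratified_folds(y_demo):
    balanced_idx = oversample_minority(train_idx, y_demo)
    folds.append((balanced_idx, val_idx))
    # A model would be fitted on balanced_idx and scored on the untouched val_idx.

for balanced_idx, val_idx in folds:
    print(np.bincount(y_demo[balanced_idx]), np.bincount(y_demo[val_idx]))
```

With this ordering each validation fold keeps the original class ratio and contains no resampled rows, so its score is closer to what the held-out test set measures.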


In [23]:
# Train the final model with the best parameters
best_params = random_search.best_params_
best_model = cb.CatBoostClassifier(
    **best_params,
    random_state=42,
    loss_function='Logloss',
    eval_metric='AUC',
    verbose=20
)

# Fit the final model on the resampled training set
best_model.fit(X_train_resampled, y_train_resampled)

# Predict on the test set (class labels and positive-class probabilities)
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:, 1]

0:	total: 21.2ms	remaining: 10.6s
20:	total: 577ms	remaining: 13.2s
40:	total: 1.19s	remaining: 13.4s
60:	total: 1.8s	remaining: 12.9s
80:	total: 2.4s	remaining: 12.4s
100:	total: 3.03s	remaining: 12s
120:	total: 3.66s	remaining: 11.5s
140:	total: 4.33s	remaining: 11s
160:	total: 5s	remaining: 10.5s
180:	total: 5.65s	remaining: 9.96s
200:	total: 6.26s	remaining: 9.32s
220:	total: 6.9s	remaining: 8.71s
240:	total: 7.49s	remaining: 8.05s
260:	total: 8.08s	remaining: 7.4s
280:	total: 8.66s	remaining: 6.75s
300:	total: 9.25s	remaining: 6.12s
320:	total: 9.83s	remaining: 5.48s
340:	total: 10.4s	remaining: 4.85s
360:	total: 11s	remaining: 4.24s
380:	total: 11.6s	remaining: 3.63s
400:	total: 12.2s	remaining: 3.02s
420:	total: 12.9s	remaining: 2.42s
440:	total: 13.4s	remaining: 1.8s
460:	total: 14s	remaining: 1.18s
480:	total: 14.6s	remaining: 575ms
499:	total: 15.1s	remaining: 0us


## 5. Model Evaluation and Visualization

In [27]:
# Evaluate the final model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_prob)

print('Final model performance:')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 score: {f1:.4f}')
print(f'AUC: {auc:.4f}')

# Print the detailed classification report
print('Classification report:')
print(classification_report(y_test, y_pred))

Final model performance:
Accuracy: 0.7482
Precision: 0.3002
Recall: 0.5553
F1 score: 0.3897
AUC: 0.7500
Classification report:
              precision    recall  f1-score   support

           0       0.91      0.78      0.84     20365
           1       0.30      0.56      0.39      3447

    accuracy                           0.75     23812
   macro avg       0.61      0.67      0.62     23812
weighted avg       0.82      0.75      0.78     23812
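
As a worked check of how these metrics relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they were back-calculated from the rounded figures in the report (3,447 positives, 20,365 negatives) and were not read from the model, so they may be off by a sample or two. The function name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the standard binary classification metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Approximate counts inferred from the rounded report above (assumption)
tp, fn = 1914, 1533    # the 3,447 actual fraud cases
fp, tn = 4462, 15903   # the 20,365 actual non-fraud cases

metrics = metrics_from_counts(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')

# False alarms raised for each fraud case that is correctly flagged
print(f'False positives per true positive: {fp / tp:.2f}')
```

Under these assumed counts, the model flags a little over half of the fraud cases while raising roughly two false alarms for each one it catches, which is what a precision near 0.30 means in practice.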



## 6. Feature Importance Analysis

In [25]:
# Get the feature importances from the final model
feature_importance = best_model.get_feature_importance()
feature_names = X.columns

# Build a DataFrame of feature importances
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

# Sort by importance in descending order
importance_df = importance_df.sort_values('Importance', ascending=False)

print('Feature importance (top 15):')
print(importance_df.head(15))

Feature importance (top 15):
   Feature  Importance
0     year   13.127930
26  pca_24    7.071976
3    pca_1    5.851326
5    pca_3    4.663223
19  pca_17    4.264592
25  pca_23    4.246397
11   pca_9    3.702893
13  pca_11    3.647322
22  pca_20    3.606870
4    pca_2    3.453686
21  pca_19    3.305625
15  pca_13    3.269745
17  pca_15    3.260701
10   pca_8    3.248057
20  pca_18    3.241329
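
To read the table above as shares, the sketch below accumulates the listed importances. It assumes `get_feature_importance()` returned CatBoost's default importance type, whose values are normalized to sum to 100 across all features; if a different type was used, the totals should not be read as percentages. The values are copied from the printed output.

```python
# (feature, importance) pairs copied from the top-15 table above
top15 = [
    ('year', 13.127930), ('pca_24', 7.071976), ('pca_1', 5.851326),
    ('pca_3', 4.663223), ('pca_17', 4.264592), ('pca_23', 4.246397),
    ('pca_9', 3.702893), ('pca_11', 3.647322), ('pca_20', 3.606870),
    ('pca_2', 3.453686), ('pca_19', 3.305625), ('pca_13', 3.269745),
    ('pca_15', 3.260701), ('pca_8', 3.248057), ('pca_18', 3.241329),
]

# Running total of importance as features are added in ranked order
cumulative = []
running_total = 0.0
for name, value in top15:
    running_total += value
    cumulative.append((name, running_total))

for name, total in cumulative:
    print(f'{name:>7}: {total:6.2f}')
```

Under that assumption, the 15 listed features account for about 70 of the 100 importance points, `year` alone contributes about 13, and the remaining 12 features share the other 30 or so.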


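## 7. Model Saving

In [29]:
# Save the best model to a file
# Note: save_model writes CatBoost's native binary format (cbm) by default, regardless of
# the '.pkl' extension; reload it with the load_model method of a new CatBoostClassifier.
model_filename = 'catboost_fraud_detection_model.pkl'
best_model.save_model(model_filename)

print(f'Model saved to: {model_filename}')

Model saved to: catboost_fraud_detection_model.pkl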


## 8. Conclusions and Recommendations

In [30]:
# Summarize the performance of the baseline and tuned models side by side
results_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 score', 'AUC'],
    'Base model': [base_accuracy, base_precision, base_recall, base_f1, base_auc],
    'Tuned model': [accuracy, precision, recall, f1, auc]
})

print('Model performance comparison:')
print(results_df.round(4))

Model performance comparison:
      Metric  Base model  Tuned model
0   Accuracy      0.6495       0.7482
1  Precision      0.2304       0.3002
2     Recall      0.6075       0.5553
3   F1 score      0.3341       0.3897
4        AUC      0.6861       0.7500
