## **Building an XGBoost Model for Prediction**

### **1. Setup**

In [25]:
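# General-purpose packages
import pandas as pd
import numpy as np

# Dimensionality reduction (PCA and LDA are used in the data-loading cell)
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# XGBoost and model-selection utilities
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score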

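- I could not work out why the data imported here kept giving results that differed from the earlier ones, so I copied the last step of the previous feature engineering (the dimensionality reduction) directly into this notebook. With that, the results match.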

In [35]:
# Load the data
x_train = pd.read_csv('x_train.csv')
x_test = pd.read_csv('x_test.csv')

columns_to_drop = ['num_var12_0_5', 'num_var12_0_7', 'num_var12_4', 'num_var24_0_4', 'num_var30_10',
                   'num_var39_0_7', 'num_var41_0_6', 'num_var41_0_7', 'num_var42_0_7', 'num_var42_0_9',
                   'num_var42_7', 'num_var5_0_6', 'num_var5_6', 'num_var12_0_6', 'num_var13_0_8',
                   'num_var13_8', 'num_var30_9', 'num_var39_0_11', 'num_var39_0_9', 'num_var41_0_11',
                   'num_var41_0_9', 'num_var42_0_8', 'num_var4_10', 'num_var4_9', 'num_var5_0_5',
                   'num_var5_5']

# Drop the listed columns from both frames; errors='ignore' skips any that are absent
x_train = x_train.drop(columns=columns_to_drop, errors='ignore')
x_test = x_test.drop(columns=columns_to_drop, errors='ignore')

# Separate features and labels
X_train = x_train.drop(['TARGET', 'ID'], axis=1)
y_train = x_train['TARGET']
X_test = x_test.drop('ID', axis=1)

# Find how many PCA components are needed to explain at least 98% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.98) + 1  # +1 because indices start at 0

print(f"Number of components to explain 98% variance: {d}")

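# Refit PCA with the computed number of components
pca = PCA(n_components=d)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Apply LDA (supervised; with a binary target it keeps a single component)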
lda = LDA()
X_train_lda = lda.fit_transform(X_train_pca, y_train)
X_test_lda = lda.transform(X_test_pca)


Number of components to explain 98% variance: 305
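The component count above can also be obtained by passing the variance threshold to `PCA` directly. The following is a minimal sketch on synthetic data (the `make_classification` data and its sizes are assumptions, not the notebook's `x_train.csv`). It also shows that LDA on a two-class target returns a single column, which is worth keeping in mind when reading the scores on the reduced dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Synthetic stand-in data (hypothetical; not the notebook's CSV files)
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

# Manual approach: cumulative explained variance, first index reaching 98%
full_pca = PCA(svd_solver="full").fit(X)
cumsum = np.cumsum(full_pca.explained_variance_ratio_)
d_manual = int(np.argmax(cumsum >= 0.98) + 1)

# Shortcut: a float n_components lets PCA choose the count itself
pca = PCA(n_components=0.98, svd_solver="full")
X_pca = pca.fit_transform(X)

# LDA keeps at most n_classes - 1 components, so binary labels give one column
X_lda = LDA().fit_transform(X_pca, y)

print(d_manual, pca.n_components_, X_lda.shape)
```

Both routes should agree on the number of components; the float form just saves the second fit.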


### **2. Evaluating with the Mean Cross-Validation Score**

- Reduced dataset

In [36]:
# Create the XGBoost classifier
model_lda = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', use_label_encoder=False)

# Run 5-fold cross-validation
scores = cross_val_score(model_lda, X_train_lda, y_train, cv=5, scoring='accuracy')

# Print the mean cross-validation score and twice its standard deviation
print(f'Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})')

Accuracy: 0.96 (+/- 0.00)
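In the cells above, PCA and LDA are fitted on the whole training set before cross-validation, so every validation fold has already contributed to the supervised LDA projection, which may make the score somewhat optimistic. One possible way around this is to wrap the reduction steps and the classifier in a pipeline so that they are refitted inside each fold. The sketch below uses synthetic data and `LogisticRegression` as a lightweight stand-in classifier (both are assumptions for illustration; `XGBClassifier` could take the last slot).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=600, n_features=30,
                                     n_informative=8, random_state=0)

# Stand-in classifier at the end; the notebook's model is XGBClassifier
pipe = make_pipeline(
    PCA(n_components=0.98, svd_solver="full"),
    LinearDiscriminantAnalysis(),
    LogisticRegression(max_iter=1000),
)

# PCA and LDA are refitted on the training part of every fold
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"AUC: {pipe_scores.mean():.3f} (+/- {pipe_scores.std() * 2:.3f})")
```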


- Original dataset

In [38]:
# Features and target taken from the full training data.
# train_1 is assumed to be the full-feature training DataFrame from the earlier steps;
# it is not defined in this section.
X_train_full = train_1.drop(['TARGET'], axis=1)
y_train = train_1['TARGET']

# Create the XGBoost classifier
model_full = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', use_label_encoder=False)

# Run 5-fold cross-validation
scores_full = cross_val_score(model_full, X_train_full, y_train, cv=5, scoring='accuracy')

# Print the mean cross-validation score and twice its standard deviation
print(f'Accuracy with full features: {scores_full.mean():.2f} (+/- {scores_full.std() * 2:.2f})')


Accuracy with full features: 0.96 (+/- 0.00)


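- Both mean scores are 0.96, which at first glance suggests the model fits very well. But accuracy alone can be misleading, especially if the classes are imbalanced, and it leaves open the question of whether the model is overfitting. To guard against that risk, we do not rely on this single metric and also score the model with AUC.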
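To see why a second metric is useful, here is a small sketch of how a model that learns nothing can still reach 0.96 accuracy when positives are rare. The 4% positive rate below is a made-up figure chosen for illustration; the notebook does not report the class balance, which could be checked with `y_train.value_counts(normalize=True)`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 40 positives out of 1000 (4%), for illustration only
y_toy = np.zeros(1000, dtype=int)
y_toy[:40] = 1
X_toy = np.zeros((1000, 1))  # the dummy model ignores the features

# Always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
baseline_auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])
print(f"accuracy={baseline_acc:.2f}, AUC={baseline_auc:.2f}")
# prints: accuracy=0.96, AUC=0.50
```

Accuracy rewards the majority-class guess, while AUC stays at chance level because the scores do not rank positives above negatives.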

### **3. Scoring with AUC**

- Reduced data

In [20]:
# X_train_lda and y_train are assumed to be the reduced training data and labels;
# X_test_lda is assumed to be the reduced test data

# Split the data into training and validation sets
X_train_lda_split, X_val_lda, y_train_lda_split, y_val = train_test_split(
    X_train_lda, y_train, stratify=y_train, test_size=0.15, random_state=42)

# Set up the XGBoost model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Parameter grid
param_grid = {
    'max_depth': [3, 5, 7],  # maximum tree depth
    'learning_rate': [0.01, 0.1, 0.2],  # learning rate
    'n_estimators': [50, 100, 150]  # number of trees
}

# Create the grid search object with 3-fold cross-validation
grid = GridSearchCV(model, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)

# Run the grid search
grid.fit(X_train_lda_split, y_train_lda_split)

# Print the best parameters and the best score
print("Best parameters:", grid.best_params_)
print("Best AUC: {:.3f}".format(grid.best_score_))

# Predict on the validation set with the best model
best_model = grid.best_estimator_
val_predictions = best_model.predict_proba(X_val_lda)[:, 1]

# Compute the AUC score
auc_score = roc_auc_score(y_val, val_predictions)
print(f'Validation AUC Score: {auc_score:.3f}')


Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Best AUC: 0.765
Validation AUC Score: 0.787


- Original data

In [18]:
# Set up the model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Parameter grid
param_grid = {
    'max_depth': [3, 5, 7],  # maximum tree depth
    'learning_rate': [0.01, 0.1, 0.2],  # learning rate
    'n_estimators': [50, 100, 150]  # number of trees
}

# Create the grid search object with 3-fold cross-validation
grid = GridSearchCV(model, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)

# Split the data into training and validation sets.
# X_full and y_full are assumed to be the full feature matrix and labels
# prepared earlier; they are not defined in this section.
X_train_full, X_val_full, y_train_full, y_val_full = train_test_split(X_full, y_full, stratify=y_full, test_size=0.15, random_state=42)

# Run the grid search
grid.fit(X_train_full, y_train_full)

# Print the best parameters and the best score
print("Best parameters:", grid.best_params_)
print("Best AUC: {:.3f}".format(grid.best_score_))

# Predict on the validation set with the best model
best_model = grid.best_estimator_
val_predictions = best_model.predict_proba(X_val_full)[:, 1]

# Compute the AUC score
auc_score = roc_auc_score(y_val_full, val_predictions)
print(f'Validation AUC Score: {auc_score:.3f}')


Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Best AUC: 0.824
Validation AUC Score: 0.842
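Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps per-combination scores and fit times in `cv_results_`, which can help judge how much each setting costs and gains. Below is a minimal sketch on synthetic data, with `DecisionTreeClassifier` as a fast stand-in for `XGBClassifier` (both the data and the stand-in are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical)
X_g, y_g = make_classification(n_samples=400, n_features=20, random_state=0)

# Fast stand-in estimator; the notebook searches over XGBClassifier
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_g, y_g)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_depth", "mean_test_score", "std_test_score",
        "mean_fit_time", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```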


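- As the results above show, the reduced data fits faster and uses less memory, but its AUC is lower (validation AUC 0.787 versus 0.842).
- Using all the features directly scores higher, but fitting is very slow.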


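Note that my laptop's graphics card is a 3050, and a problem with uploading the dataset to Kaggle meant I could not use Kaggle's GPU. To keep things simple, I therefore limited the hyperparameter search to a very small grid and 3-fold cross-validation. Improving the model further would require a more powerful machine.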
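When compute is the limiting factor, one option worth considering is a randomized search, which samples a fixed number of combinations from wider ranges instead of trying every grid point. The sketch below is only an illustration under assumptions: the data are synthetic, `GradientBoostingClassifier` stands in for `XGBClassifier`, and the ranges and `n_iter` are arbitrary choices rather than tuned recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical)
X_r, y_r = make_classification(n_samples=300, n_features=15, random_state=0)

# Distributions to sample from (illustrative ranges)
param_distributions = {
    "max_depth": randint(2, 6),            # integers 2..5
    "learning_rate": uniform(0.01, 0.19),  # floats in [0.01, 0.20]
    "n_estimators": randint(20, 61),       # integers 20..60
}

# n_iter caps the number of sampled combinations, and so the total cost
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
random_search.fit(X_r, y_r)
print(random_search.best_params_, round(random_search.best_score_, 3))
```

The cost is `n_iter` times the number of folds, regardless of how wide the ranges are, so the budget can be set to whatever the machine can handle.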

## **All Steps Are Now Complete**