# Model Summary



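### Project Background
The goal of this project is to predict the outcomes of NCAA (National Collegiate Athletic Association) basketball games with machine learning models. The dataset contains detailed regular-season and tournament game data for both the men's and women's competitions, including game scores, team information, and tournament seeds. We train several models and combine them into an ensemble to improve the accuracy and robustness of the predictions.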

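### Data Processing and Feature Engineering
Data processing and feature engineering were mainly carried out by other team members. They cleaned and merged the raw data and extracted the feature set used for training. The features include tournament seeds, the seed difference, and aggregate box-score statistics from historical games between each pair of teams. Inside each model pipeline the features are mean-imputed and standardized, and the data is split with time-series cross-validation (`TimeSeriesSplit`), so that every fold is evaluated on rows that come after its training rows.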
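As a minimal sketch of how `TimeSeriesSplit` partitions rows, the toy example below splits eight samples into three folds. The array `X_toy` is a hypothetical stand-in for the real feature matrix and is assumed to be sorted chronologically already.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight toy samples (hypothetical), assumed to be in chronological order.
X_toy = np.arange(8).reshape(-1, 1)

splits = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_toy):
    # Each fold trains on a growing prefix and tests on the block right after it.
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

The split only looks at row positions, so the "no future data in training" property holds only if the rows are ordered by time before splitting.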

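### Model Construction and Ensembling
#### Model Selection
To draw on the strengths of different model families, we selected the following models:
1. **XGBoost regressor** (`XGBRegressor`): a gradient-boosted tree ensemble with strong fitting capacity and fast training.
2. **CatBoost regressor** (`CatBoostRegressor`): a gradient boosting framework that handles categorical features well and deals with missing values automatically.
3. **Random forest regressor** (`RandomForestRegressor`): an ensemble of decision trees that is relatively resistant to overfitting and scales to large datasets.
4. **LightGBM regressor** (`LGBMRegressor`): a lightweight gradient boosting framework with fast training and low memory usage, suited to large datasets.

- **Multilayer perceptron regressor** (`MLPRegressor`): dropped because it performed clearly worse than the tree-based models (Brier score of 0.253 in the first cross-validation fold, against 0.208 to 0.218 for the others).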

#### Hyperparameter Optimization
The hyperparameters of all models were tuned with `optuna` to improve each model's performance on this dataset. For example:
- XGBoost: `n_estimators=346`, `learning_rate=0.01`, `max_depth=3`.
- CatBoost: `iterations=322`, `depth=3`, `learning_rate=0.025`.
- LightGBM: `num_leaves=63`, `learning_rate=0.0076`, `n_estimators=537`, among others.

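#### Model Ensembling
To further improve predictive performance, we combine the models into an ensemble. The steps are:
1. **Individual training**: each model is trained separately.
2. **Prediction fusion**: the predictions of the individual models are combined with a simple unweighted average:

\[
\text{Ensemble prediction} = \frac{\text{XGBoost prediction} + \text{CatBoost prediction} + \text{Random forest prediction} + \text{LightGBM prediction}}{4}
\]

   Averaging models whose errors are only partly correlated mainly reduces the variance of the prediction compared with relying on a single model.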
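The averaging step can be sketched in a few lines. The prediction values below are hypothetical; the clipping to `[0.001, 0.999]` mirrors what the notebook cells do, since a regressor can return values outside the probability range.

```python
import numpy as np

# Hypothetical raw outputs of the four regressors for three games.
raw_preds = {
    "xgb": np.array([0.70, 0.40, 1.05]),
    "cat": np.array([0.60, 0.50, 0.95]),
    "rf": np.array([0.80, 0.30, 0.90]),
    "lgb": np.array([0.70, 0.40, -0.02]),
}

# Clip each model's output into a valid probability range before averaging.
clipped = {name: p.clip(0.001, 0.999) for name, p in raw_preds.items()}

# Simple unweighted mean across models, one value per game.
ensemble = np.mean(list(clipped.values()), axis=0)
print(ensemble)
```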

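### Model Evaluation
We use the Brier score as the main evaluation metric. It is the mean squared difference between the predicted probabilities and the actual outcomes (0 or 1), so lower is better. As a reference point, always predicting 0.5 gives a Brier score of 0.25.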
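For reference, the Brier score can be computed by hand as the mean of squared errors. The helper `brier_score` below is a hypothetical name used for illustration; the notebook itself relies on `sklearn.metrics.brier_score_loss`, which computes the same quantity for binary outcomes.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

outcomes = [1, 0, 1, 1]  # hypothetical results

# An uninformative constant forecast of 0.5 scores 0.25 whatever the outcomes are.
baseline = brier_score(outcomes, [0.5, 0.5, 0.5, 0.5])

# A forecast that leans the right way scores lower (better).
informed = brier_score(outcomes, [0.9, 0.2, 0.8, 0.6])
print(baseline, informed)
```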

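In each fold of the time-series cross-validation we compute the Brier score of every model and of the ensemble. On the final hold-out split (the last 3% of rows), the logged scores were:
- XGBoost: 0.149
- CatBoost: 0.142
- Random forest: 0.151
- LightGBM: 0.145
- Ensemble (mean of the four): 0.146

The ensemble scored better than XGBoost and the random forest and close to the best single model (CatBoost), but it was not the best model in every split. In the first cross-validation fold, for example, the random forest (0.208) beat the five-model average that still included the MLP (0.213). The main benefit of averaging here is stability rather than a guaranteed improvement over the best individual model.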

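### Model Application
Finally, we refit the four models on all available tournament games, predicted the test set (`SampleSubmissionStage2`), and wrote a submission file (`skl.csv`) in the format required by the Kaggle competition. The average of the four models' predictions is the final output. Many submission rows share the same predicted value (0.542405 in the rows displayed below), which probably reflects matchups with no head-to-head history, whose pair-level features are all filled with -1.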

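### Summary
This project trained several tuned models and combined them by simple averaging to predict NCAA basketball game outcomes. The ensemble performs close to, and in some splits better than, the individual models, and it is less dependent on any single model's errors. In future work we plan to explore more elaborate ensembling methods, such as weighted averaging and stacking, to further improve predictive performance.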
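As a rough sketch of the weighted-averaging idea mentioned above, one option is to weight each model by the inverse of its validation Brier score. This rule is an assumption chosen for illustration, not something the project has implemented. The scores are the hold-out values logged later in this notebook, and the prediction matrix `preds` is hypothetical.

```python
import numpy as np

# Hold-out Brier scores logged in this notebook, in the order xgb, cat, rf, lgb.
val_brier = np.array([0.149, 0.142, 0.151, 0.145])

# Assumed weighting rule: inverse Brier score, normalized so the weights sum to 1.
inv = 1.0 / val_brier
weights = inv / inv.sum()

# Hypothetical predictions: one row per model, one column per game.
preds = np.array([
    [0.70, 0.40],
    [0.60, 0.50],
    [0.80, 0.30],
    [0.70, 0.40],
])

# Weighted mean across models for each game.
weighted = np.average(preds, axis=0, weights=weights)
print(weights.round(3), weighted.round(3))
```

Because the four scores are so close, these weights stay near 0.25 each and the result differs little from the simple mean. Weights should also be derived from a validation split that is separate from the data used to report the final score, otherwise the reported score is optimistic.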


In [2]:
import numpy as np
import pandas as pd 
from scipy.stats import linregress
from tqdm import tqdm
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.model_selection import train_test_split
import os
from itertools import combinations
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import brier_score_loss, log_loss, mean_absolute_error
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from xgboost import XGBRegressor
import glob
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor


from sklearn import *
#import redisAI
import optuna
from sklearn import ensemble
from sklearn.metrics import *
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import random_split
from torch import nn
import torch.optim as optim

from sklearn.neural_network import MLPRegressor  # MLP regressor, tried as a candidate model

regular_m = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/MRegularSeasonCompactResults.csv')
tourney_m = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv')
teams_m = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/MTeams.csv')
# Load and Process Data Women's Tourney
regular_w = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/WRegularSeasonCompactResults.csv')
tourney_w = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv')
teams_w = pd.read_csv('./kaggle/input/march-machine-learning-mania-2025/WTeams.csv')
#print(teams_w.columns, regular_w.columns, tourney_m.columns)
# print(len(regular_m), len(tourney_m))

# Getting all files
path = "./kaggle/input/march-machine-learning-mania-2025/**"
# Key each file by its base name without extension (works with both / and \ path separators)
data = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p, encoding='latin-1') for p in glob.glob(path)}
df = data["SampleSubmissionStage2"]
# Extract the left and right team IDs from the sample-submission ID format (YYYY_LLLL_RRRR)
df['LTeam'] = [int(L[5:9]) for L in df['ID']]
df['RTeam'] = [int(R[10:14]) for R in df['ID']]

# Lots of feature selecting and engineering
teams = pd.concat([data['MTeams'], data['WTeams']])
teams_spelling = pd.concat([data['MTeamSpellings'], data['WTeamSpellings']])
teams_spelling = teams_spelling.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
teams_spelling.columns = ['TeamID', 'TeamNameCount']
teams = pd.merge(teams, teams_spelling, how='left', on=['TeamID'])
del teams_spelling
season_cresults = pd.concat([data['MRegularSeasonCompactResults'], data['WRegularSeasonCompactResults']])
season_dresults = pd.concat([data['MRegularSeasonDetailedResults'], data['WRegularSeasonDetailedResults']])
tourney_cresults = pd.concat([data['MNCAATourneyCompactResults'], data['WNCAATourneyCompactResults']])
tourney_dresults = pd.concat([data['MNCAATourneyDetailedResults'], data['WNCAATourneyDetailedResults']])
slots = pd.concat([data['MNCAATourneySlots'], data['WNCAATourneySlots']])
seeds = pd.concat([data['MNCAATourneySeeds'], data['WNCAATourneySeeds']])
gcities = pd.concat([data['MGameCities'], data['WGameCities']])
seasons = pd.concat([data['MSeasons'], data['WSeasons']])

seeds = {'_'.join(map(str,[int(k1),k2])):int(v[1:3]) for k1, v, k2 in seeds[['Season', 'Seed', 'TeamID']].values}
cities = data['Cities']
sub = data['SampleSubmissionStage2']
del data

season_cresults['ST'] = 'S'
season_dresults['ST'] = 'S'
tourney_cresults['ST'] = 'T'
tourney_dresults['ST'] = 'T'
games = pd.concat((season_dresults, tourney_dresults), axis=0, ignore_index=True)  # detailed results only exist from 2003 onward; compact results are not used
games.reset_index(drop=True, inplace=True)
games['WLoc'] = games['WLoc'].map({'A': 1, 'H': 2, 'N': 3})

games['ID'] = games.apply(lambda r: '_'.join(map(str, [r['Season']]+sorted([r['WTeamID'],r['LTeamID']]))), axis=1)  # game ID: season, lower team ID, higher team ID
games['IDTeams'] = games.apply(lambda r: '_'.join(map(str, sorted([r['WTeamID'],r['LTeamID']]))), axis=1)  # pair ID: lower team ID, higher team ID
games['Team1'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[0], axis=1)  # Team1: lower team ID
games['Team2'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[1], axis=1)  # Team2: higher team ID
games['IDTeam1'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)  # season + Team1, used to look up the seed
games['IDTeam2'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)  # season + Team2

games['Team1Seed'] = games['IDTeam1'].map(seeds).fillna(0)
games['Team2Seed'] = games['IDTeam2'].map(seeds).fillna(0)

games['ScoreDiff'] = games['WScore'] - games['LScore']
games['Pred'] = games.apply(lambda r: 1. if sorted([r['WTeamID'],r['LTeamID']])[0]==r['WTeamID'] else 0., axis=1)  # 1.0 if Team1 (lower ID) won, else 0.0
games['ScoreDiffNorm'] = games.apply(lambda r: r['ScoreDiff'] * -1 if r['Pred'] == 0. else r['ScoreDiff'], axis=1)
games['SeedDiff'] = games['Team1Seed'] - games['Team2Seed']
games = games.fillna(-1)

c_score_col = ['NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl',
 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl',
 'LBlk', 'LPF']  # box-score columns to aggregate
c_score_agg = ['sum', 'mean', 'median', 'max', 'min', 'std', 'skew', 'nunique']
gb = games.groupby(by=['IDTeams']).agg({k: c_score_agg for k in c_score_col}).reset_index()
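# groupby: games between the same pair of teams are grouped together
# agg: for every column in c_score_col, compute each statistic in c_score_agg per group (i.e. per pair of teams)
# The aggregated frame has MultiIndex columns; reset_index() turns the group key back into a regular column, and the next line flattens the column names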
gb.columns = [''.join(c) + '_c_score' for c in gb.columns]

games = games[games['ST']=='T']

sub['WLoc'] = 3
sub['Season'] = sub['ID'].map(lambda x: x.split('_')[0])
sub['Season'] = sub['Season'].astype(int)
sub['Team1'] = sub['ID'].map(lambda x: x.split('_')[1])
sub['Team2'] = sub['ID'].map(lambda x: x.split('_')[2])
sub['IDTeams'] = sub.apply(lambda r: '_'.join(map(str, [r['Team1'], r['Team2']])), axis=1)
sub['IDTeam1'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
sub['IDTeam2'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)
sub['Team1Seed'] = sub['IDTeam1'].map(seeds).fillna(0)
sub['Team2Seed'] = sub['IDTeam2'].map(seeds).fillna(0)
sub['SeedDiff'] = sub['Team1Seed'] - sub['Team2Seed']  # the lines above parse the submission IDs and attach the seed features
sub = sub.fillna(-1)

games = pd.merge(games, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')
sub = pd.merge(sub, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')
# Left-join the per-pair aggregate features (gb) onto the tournament games (games) by IDTeams.
# Left-join the same features onto the submission rows (sub) by IDTeams.

col = [c for c in games.columns if c not in ['ID', 'DayNum', 'ST', 'Team1', 'Team2', 'IDTeams', 'IDTeam1', 'IDTeam2',
                                             'WTeamID', 'WScore', 'LTeamID', 'LScore', 'NumOT', 'Pred', 'ScoreDiff', 'ScoreDiffNorm',
                                             'WLoc'] + c_score_col]
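The row-wise `apply` calls above can be slow on large frames. As a hedged sketch, the same matchup key, label, and signed score difference can be built with vectorized pandas operations. The frame `toy` is hypothetical data using the same column names as the real results files.

```python
import numpy as np
import pandas as pd

# Two hypothetical games with the same columns as the results files.
toy = pd.DataFrame({
    "Season": [2024, 2024],
    "WTeamID": [1345, 1104],
    "LTeamID": [1163, 1301],
    "WScore": [75, 80],
    "LScore": [60, 77],
})

# Team1 is always the lower team ID and Team2 the higher one.
toy["Team1"] = toy[["WTeamID", "LTeamID"]].min(axis=1)
toy["Team2"] = toy[["WTeamID", "LTeamID"]].max(axis=1)
toy["IDTeams"] = toy["Team1"].astype(str) + "_" + toy["Team2"].astype(str)

# Label: 1.0 if Team1 won, else 0.0.
toy["Pred"] = (toy["WTeamID"] == toy["Team1"]).astype(float)

# Score difference from Team1's point of view (negative when Team1 lost).
score_diff = toy["WScore"] - toy["LScore"]
toy["ScoreDiffNorm"] = np.where(toy["Pred"] == 1.0, score_diff, -score_diff)
print(toy[["IDTeams", "Pred", "ScoreDiffNorm"]])
```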



In [6]:
# Best Hyperparameters: {'n_estimators': 346, 'learning_rate': 0.010193615589609411, 'max_depth': 3}
X = games[col].fillna(-1)
sub_X = sub[col].fillna(-1)
y = games['Pred']
#X = all_results[['Season', 'WTeamID', 'LTeamID']] # for cat?



# Hyperparameters found by the Optuna search (fixed values, not a search grid)
param_grid_xgb = {
    'n_estimators': 346,
    'learning_rate': 0.01,
    'max_depth': 3
}

param_grid_mlp = {
    'alpha': 0.0001,
    'learning_rate_init': 0.001,
    'max_iter': 200,
    'hidden_layer_sizes': (1000, 1000)
}

param_grid_cat = {
        'iterations': 322,
        'depth': 3,
        'learning_rate': 0.025
    }

param_grid_rf = {
        'n_estimators': 584,
        'max_depth': 6,
        'min_samples_split': 8,
        'min_samples_leaf': 10
    }

param_grid_lgb = {
        'num_leaves': 63,
        'learning_rate': 0.007602565555810267,
        'n_estimators': 537,
        'max_depth': 3,
        'min_child_samples': 82,
        'subsample': 0.8191854478457833,
        'colsample_bytree': 0.8521050862477741
    }


# Define the pipelines
pipeline_xgb = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('xgb', XGBRegressor(**param_grid_xgb, device="gpu", random_state=42))
])

pipeline_mlp = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # mean imputation
    ('scaler', StandardScaler()),  # standardization
    ('model', MLPRegressor(**param_grid_mlp))  # neural-network regressor
])

pipeline_cat = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        #('catboost', CatBoostClassifier(**param_grid_cat, verbose=False, task_type='GPU', loss_function='Logloss'))
        ('catboost', CatBoostRegressor(**param_grid_cat, verbose=False, task_type='GPU'))
    ])

pipeline_rf = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('rf', RandomForestRegressor(**param_grid_rf, random_state=42))
    ])

pipeline_lgb = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('lgb', LGBMRegressor(**param_grid_lgb, random_state=42, device='gpu'))
    ])



n_splits = 5  # 5-fold time-series cross-validation
tscv = TimeSeriesSplit(n_splits=n_splits)
brier = []
for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {fold + 1}/{n_splits}")
    
    # Split into training and test sets by position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = games['Pred'].iloc[train_index], games['Pred'].iloc[test_index]
    #print(len(X_train), len(X_test), len(y_train), len(y_test))
    pipeline_xgb.fit(X_train, y_train)
    pipeline_mlp.fit(X_train, y_train)
    pipeline_cat.fit(X_train, y_train)
    pipeline_rf.fit(X_train, y_train)
    pipeline_lgb.fit(X_train, y_train)


    pred_xgb = pipeline_xgb.predict(X_test).clip(0.001, 0.999)
    pred_mlp = pipeline_mlp.predict(X_test).clip(0.001, 0.999)
    pred_cat = pipeline_cat.predict(X_test).clip(0.001, 0.999)
    pred_rf = pipeline_rf.predict(X_test).clip(0.001, 0.999)
    pred_lgb = pipeline_lgb.predict(X_test).clip(0.001, 0.999)


    brier_score_xgb = brier_score_loss(y_test, pred_xgb)
    brier_score_mlp = brier_score_loss(y_test, pred_mlp)
    brier_score_cat = brier_score_loss(y_test, pred_cat)
    brier_score_rf = brier_score_loss(y_test, pred_rf)
    brier_score_lgb = brier_score_loss(y_test, pred_lgb)



    mean = (pred_xgb + pred_mlp + pred_cat + pred_rf + pred_lgb) / 5.0
    brier_score_mean = brier_score_loss(y_test, mean)
    print(f'xgb: {brier_score_xgb:.3f}, mlp: {brier_score_mlp:.3f}, cat: {brier_score_cat:.3f}, rf: {brier_score_rf:.3f}, lgb: {brier_score_lgb:.3f}, mean: {brier_score_mean:.3f}')






Fold 1/5
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 8106
[LightGBM] [Info] Number of data points in the train set: 381, number of used features: 215
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3060 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 215 dense feature groups (0.08 MB) transferred to GPU in 0.003685 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.480315
xgb: 0.218, mlp: 0.253, cat: 0.212, rf: 0.208, lgb: 0.211, mean: 0.213
Fold 2/5
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 11547
[LightGBM] [Info] Number of data points in the train set: 760, number of used features: 219
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3060 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightG

In [8]:
X = games[col].fillna(-1)
sub_X = sub[col].fillna(-1)
y = games['Pred']
# X = all_results[['Season', 'WTeamID', 'LTeamID']] # for cat?


# Hyperparameters found by the Optuna search (fixed values, not a search grid)
param_grid_xgb = {
    'n_estimators': 346,
    'learning_rate': 0.01,
    'max_depth': 3
}

param_grid_cat = {
        'iterations': 322,
        'depth': 3,
        'learning_rate': 0.025
    }

param_grid_rf = {
        'n_estimators': 584,
        'max_depth': 6,
        'min_samples_split': 8,
        'min_samples_leaf': 10
    }

param_grid_lgb = {
        'num_leaves': 63,
        'learning_rate': 0.007602565555810267,
        'n_estimators': 537,
        'max_depth': 3,
        'min_child_samples': 82,
        'subsample': 0.8191854478457833,
        'colsample_bytree': 0.8521050862477741
    }


# Define the pipelines
pipeline_xgb = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('xgb', XGBRegressor(**param_grid_xgb, device="gpu", random_state=42))
])

pipeline_cat = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        #('catboost', CatBoostClassifier(**param_grid_cat, verbose=False, task_type='GPU', loss_function='Logloss'))
        ('catboost', CatBoostRegressor(**param_grid_cat, verbose=False, task_type='GPU'))
    ])

pipeline_rf = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('rf', RandomForestRegressor(**param_grid_rf, random_state=42))
    ])

pipeline_lgb = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('lgb', LGBMRegressor(**param_grid_lgb, random_state=42, device='gpu'))
    ])


split_index = int(len(X) * 0.97)  # train on the first 97% of rows and hold out the last 3%; adjust as needed

X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

pipeline_xgb.fit(X_train, y_train)
pipeline_cat.fit(X_train, y_train)
pipeline_rf.fit(X_train, y_train)
pipeline_lgb.fit(X_train, y_train)


pred_xgb = pipeline_xgb.predict(X_test).clip(0.001, 0.999)
pred_cat = pipeline_cat.predict(X_test).clip(0.001, 0.999)
pred_rf = pipeline_rf.predict(X_test).clip(0.001, 0.999)
pred_lgb = pipeline_lgb.predict(X_test).clip(0.001, 0.999)


brier_score_xgb = brier_score_loss(y_test, pred_xgb)
brier_score_cat = brier_score_loss(y_test, pred_cat)
brier_score_rf = brier_score_loss(y_test, pred_rf)
brier_score_lgb = brier_score_loss(y_test, pred_lgb)

mean_pred = (
    pred_xgb +
    pred_cat +
    pred_rf +
    pred_lgb
) / 4

brier_score_mean = brier_score_loss(y_test, mean_pred)


print(f"xgb: {brier_score_xgb:.3f}, cat: {brier_score_cat:.3f}, rf: {brier_score_rf:.3f}, lgb: {brier_score_lgb:.3f}, mean: {brier_score_mean:.3f}")


[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 23487
[LightGBM] [Info] Number of data points in the train set: 2207, number of used features: 220
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3060 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 215 dense feature groups (0.45 MB) transferred to GPU in 0.003859 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score 0.502039
xgb: 0.149, cat: 0.142, rf: 0.151, lgb: 0.145, mean: 0.146


In [40]:

# Baseline: a constant prediction of 0.5 for every game.
# Stored under its own name so the XGBoost score from the previous cell is not overwritten.
vector_of_0_5 = np.full(len(pred_xgb), 0.5)

brier_score_baseline = brier_score_loss(y_test, vector_of_0_5)
brier_score_baseline

0.25

In [9]:
pipeline_xgb.fit(X, y)
pipeline_cat.fit(X, y)
pipeline_rf.fit(X, y)
pipeline_lgb.fit(X, y)

sub_pred_xgb = pipeline_xgb.predict(sub_X).clip(0.001, 0.999)
sub_pred_cat = pipeline_cat.predict(sub_X).clip(0.001, 0.999)
sub_pred_rf = pipeline_rf.predict(sub_X).clip(0.001, 0.999)
sub_pred_lgb = pipeline_lgb.predict(sub_X).clip(0.001, 0.999)

sub_final = (sub_pred_xgb + sub_pred_cat + sub_pred_rf + sub_pred_lgb)/4
# Take the IDs from `sub`, the frame the predictions were computed from
submission_df = pd.DataFrame({
    'ID': sub['ID'],
    'Pred': sub_final
})
submission_df.head()




[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 23707
[LightGBM] [Info] Number of data points in the train set: 2276, number of used features: 220
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3060 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 215 dense feature groups (0.47 MB) transferred to GPU in 0.003286 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score 0.501318


|   | ID | Pred |
|---|---|---|
| 0 | 2025_1101_1102 | 0.51698 |
| 1 | 2025_1101_1103 | 0.542405 |
| 2 | 2025_1101_1104 | 0.542405 |
| 3 | 2025_1101_1105 | 0.542405 |
| 4 | 2025_1101_1106 | 0.542405 |


In [17]:
print(submission_df.tail(20))
print(len(submission_df))
submission_df.to_csv('skl.csv', index=False)

                    ID      Pred
131387  2025_3474_3476  0.542405
131388  2025_3474_3477  0.542405
131389  2025_3474_3478  0.542405
131390  2025_3474_3479  0.542405
131391  2025_3474_3480  0.487852
131392  2025_3475_3476  0.542405
131393  2025_3475_3477  0.542405
131394  2025_3475_3478  0.578534
131395  2025_3475_3479  0.542405
131396  2025_3475_3480  0.542405
131397  2025_3476_3477  0.542405
131398  2025_3476_3478  0.538147
131399  2025_3476_3479  0.569235
131400  2025_3476_3480  0.542405
131401  2025_3477_3478  0.542405
131402  2025_3477_3479  0.542405
131403  2025_3477_3480  0.542405
131404  2025_3478_3479  0.569989
131405  2025_3478_3480  0.542405
131406  2025_3479_3480  0.542405
131407
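Before uploading, a quick sanity check on the submission frame can catch formatting mistakes. The helper `check_submission` below is a hypothetical sketch. Its rules (exact column names, the `YYYY_TTTT_TTTT` ID pattern, probabilities in [0, 1]) are assumptions based on the sample rows shown above, not the competition's official validation.

```python
import pandas as pd

def check_submission(sub_df):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(sub_df.columns) != ["ID", "Pred"]:
        problems.append("columns must be exactly ['ID', 'Pred']")
        return problems
    # Assumed ID pattern: season, lower team ID, higher team ID.
    if not sub_df["ID"].str.fullmatch(r"\d{4}_\d{4}_\d{4}").all():
        problems.append("ID must look like YYYY_TTTT_TTTT")
    if sub_df["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if sub_df["Pred"].isna().any() or not sub_df["Pred"].between(0, 1).all():
        problems.append("Pred must be a probability in [0, 1] with no missing values")
    return problems

good = pd.DataFrame({"ID": ["2025_1101_1102", "2025_1101_1103"], "Pred": [0.51698, 0.542405]})
bad = pd.DataFrame({"ID": ["2025_1101_1102", "2025-1101-1103"], "Pred": [0.5, 1.2]})
print(check_submission(good))
print(check_submission(bad))
```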


In [41]:
X

|   | Season | Team1Seed | Team2Seed | SeedDiff | IDTeams_c_score | NumOTsum_c_score | NumOTmean_c_score | NumOTmedian_c_score | NumOTmax_c_score | NumOTmin_c_score | ... | LBlkskew_c_score | LBlknunique_c_score | LPFsum_c_score | LPFmean_c_score | LPFmedian_c_score | LPFmax_c_score | LPFmin_c_score | LPFstd_c_score | LPFskew_c_score | LPFnunique_c_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 16.0 | 16.0 | 0.0 | 1411_1421 | 1 | 1.00 | 1.0 | 1 | 1 | ... | -1.000000 | 1 | 22 | 22.000000 | 22.0 | 22 | 22 | -1.000000 | -1.000000 | 1 |
| 1 | 2003 | 1.0 | 16.0 | -15.0 | 1112_1436 | 0 | 0.00 | 0.0 | 0 | 0 | ... | -1.000000 | 1 | 15 | 15.000000 | 15.0 | 15 | 15 | -1.000000 | -1.000000 | 1 |
| 2 | 2003 | 10.0 | 7.0 | 3.0 | 1113_1272 | 0 | 0.00 | 0.0 | 0 | 0 | ... | -1.000000 | 1 | 18 | 18.000000 | 18.0 | 18 | 18 | -1.000000 | -1.000000 | 1 |
| 3 | 2003 | 11.0 | 6.0 | 5.0 | 1141_1166 | 0 | 0.00 | 0.0 | 0 | 0 | ... | -1.000000 | 2 | 35 | 17.500000 | 17.5 | 21 | 14 | 4.949747 | -1.000000 | 2 |
| 4 | 2003 | 8.0 | 9.0 | -1.0 | 1143_1301 | 1 | 0.25 | 0.0 | 1 | 0 | ... | -0.422521 | 4 | 66 | 16.500000 | 17.0 | 19 | 13 | 2.645751 | -0.863919 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2271 | 2024 | 3.0 | 1.0 | 2.0 | 3163_3425 | 0 | 0.00 | 0.0 | 0 | 0 | ... | -1.000000 | 2 | 32 | 16.000000 | 16.0 | 20 | 12 | 5.656854 | -1.000000 | 2 |
| 2272 | 2024 | 1.0 | 3.0 | -2.0 | 3234_3261 | 0 | 0.00 | 0.0 | 0 | 0 | ... | -1.000000 | 2 | 40 | 20.000000 | 20.0 | 21 | 19 | 1.414214 | -1.000000 | 2 |
| 2273 | 2024 | 3.0 | 1.0 | 2.0 | 3163_3234 | 0 | 0.00 | 0.0 | 0 | 0 | ... | 1.732051 | 2 | 35 | 11.666667 | 10.0 | 18 | 7 | 5.686241 | 1.205659 | 3 |
| 2274 | 2024 | 3.0 | 1.0 | 2.0 | 3301_3376 | 0 | 0.00 | 0.0 | 0 | 0 | ... | 0.190198 | 7 | 104 | 14.857143 | 15.0 | 22 | 8 | 5.014265 | -0.159771 | 6 |


In [48]:
df = X
mask_men = df['IDTeams_c_score'].apply(lambda x: str(x).startswith('1'))

# Rows whose IDTeams_c_score starts with "3" are women's games
mask_women = df['IDTeams_c_score'].apply(lambda x: str(x).startswith('3'))

# Split the data into men's and women's games
df_men = df[mask_men]
df_women = df[mask_women]

# Show the result
print("Men's games:")
print(df_men)

# print("\nWomen's games:")
# print(df_women)

Men's games:
      Season  Team1Seed  Team2Seed  SeedDiff IDTeams_c_score  \
0       2003       16.0       16.0       0.0       1411_1421   
1       2003        1.0       16.0     -15.0       1112_1436   
2       2003       10.0        7.0       3.0       1113_1272   
3       2003       11.0        6.0       5.0       1141_1166   
4       2003        8.0        9.0      -1.0       1143_1301   
...      ...        ...        ...       ...             ...   
1377    2024        4.0       11.0      -7.0       1181_1301   
1378    2024        1.0        2.0      -1.0       1345_1397   
1379    2024        4.0        1.0       3.0       1104_1163   
1380    2024       11.0        1.0      10.0       1301_1345   
1381    2024        1.0        1.0       0.0       1163_1345   

      NumOTsum_c_score  NumOTmean_c_score  NumOTmedian_c_score  \
0                    1           1.000000                  1.0   
1                    0           0.000000                  0.0   
2                    0   