### Data Preprocessing and Feature Selection

This part covers data preprocessing and feature selection. To select important features, we fit an L1-regularized logistic regression with cross-validation (Logistic Regression CV) and keep the features whose coefficients are nonzero.

For data preprocessing, we shuffle the data and engineer new features. The functions are in the Processors package in the same directory.
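
The idea behind L1-based selection can be sketched on synthetic data. This is only an illustration, not the contents of `feature_selection`: the data, the `demo_pipeline` name, the scaler step, and every hyperparameter below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 20 features, only 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Scale first so the L1 penalty treats all features comparably.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="liblinear",
        scoring="neg_log_loss", random_state=0, max_iter=1000,
    )),
])
demo_pipeline.fit(X_demo, y_demo)

# The L1 penalty drives some coefficients to exactly zero;
# the features with nonzero coefficients are the "selected" ones.
coef = demo_pipeline.named_steps["model"].coef_.flatten()
selected_idx = np.flatnonzero(coef)
print(f"kept {selected_idx.size} of {coef.size} features")
```

The cross-validation picks the regularization strength, so the number of surviving features depends on the data rather than on a hand-set threshold.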

In [11]:
import pandas as pd

from Processors.missing_value_processor import ratio
from Processors.feature_engieering_processor import feature_engineering
from Processors.shuffle_processor import shuffle
from Processors.feature_selection_processor import feature_selection
from Processors.get_feature_names_processor import get_feature_names

In [2]:
# Read the Data
df = pd.read_csv("NCAA_Tourney_2002_2022.csv")

# Shuffle the data, since Team_1 wins every game in the raw dataset
df = shuffle(df, 600)

# Constructing new features
df = feature_engineering(df)

# Decrease missing values
df = ratio(df)
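
The source does not show how `shuffle` works. One plausible reading, sketched below under stated assumptions, is that it swaps the `team1_*` and `team2_*` columns on a random subset of rows and flips the label, so the target is no longer constant. The function name `swap_teams_sketch`, the interpretation of the second argument as a row count, and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd

def swap_teams_sketch(df, n_swap, seed=0):
    """Swap team1_*/team2_* columns on n_swap random rows and flip team1_win there."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    rows = rng.choice(out.index.to_numpy(), size=n_swap, replace=False)
    team1_cols = [c for c in out.columns
                  if c.startswith("team1_") and c != "team1_win"]
    for c1 in team1_cols:
        c2 = "team2_" + c1[len("team1_"):]
        if c2 in out.columns:
            # Copy one side before overwriting it.
            tmp = out.loc[rows, c1].to_numpy().copy()
            out.loc[rows, c1] = out.loc[rows, c2].to_numpy()
            out.loc[rows, c2] = tmp
    # The winner is now listed as team 2 on the swapped rows.
    out.loc[rows, "team1_win"] = 1 - out.loc[rows, "team1_win"]
    return out

# Toy data (hypothetical): team 1 is always the winner and the better seed.
demo = pd.DataFrame({
    "team1_seed": [1, 2, 3, 4],
    "team2_seed": [16, 15, 14, 13],
    "team1_win": [1, 1, 1, 1],
})
swapped = swap_teams_sketch(demo, n_swap=2)
print(swapped)
```

Because the notebook calls `feature_engineering` after the shuffle, any derived features would be computed from the already swapped columns.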

In [4]:
y = df['team1_win']
X = df.drop(columns=['team1_win'])

numeric = X.select_dtypes(include=['float', 'int64', 'int32', 'int']).columns.tolist()

categorical = X.drop(columns = numeric).columns.tolist()

In [9]:
# Train the pipeline
LR_pipeline = feature_selection(categorical,numeric,X,y)

preprocessor = LR_pipeline.named_steps['preprocessor']

In [16]:
# Get Feature Names
df_feature = pd.DataFrame(LR_pipeline.named_steps['model'].coef_.flatten(), index=get_feature_names(preprocessor))

selected_features = df_feature[df_feature[0] != 0]



In [18]:
print(selected_features)

                                         0
num__num_ot                      -0.074446
num__team2_lat                   -0.009768
num__team2_long                  -0.012912
num__team2_pt_overall_ncaa       -0.048946
num__team2_pt_coach_season_wins  -0.100660
num__team1_pt_school_s16          0.081467
num__team1_pt_overall_s16         0.060198
num__team1_pt_team_season_wins    0.015116
num__team1_pt_team_season_losses  0.065156
num__team2_oppftpct              -0.015159
num__team2_arate                 -0.065358
num__team2_stlrate               -0.085646
num__team1_arate                 -0.124966
num__team1_oppstlrate            -0.066748
num__team2_oe                    -0.114811
num__team1_oe                     0.090440
num__team1_adjde                 -0.157453
num__sead_diff                   -0.392338
num__exp_win1                     0.381381
num__exp_win2                    -0.528139
onehot__x0_N                      0.048515


### Baseline

In this part, we train several machine learning models on the historical data.

Note: The competition uses log-loss as its metric. For hyperparameter tuning, see the Baseline package.

In [39]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from Baseline.lightGBM import lightGBM
from Baseline.catboost import catboost

In [27]:
# Redefine X for baseline training now that feature selection is done
features = ["sead_diff", "team1_seed", "team1_adjoe", "team2_adjoe", "team1_adjde", "team2_adjde",
            "team1_blockpct", "team2_blockpct",
            "team1_pt_team_season_wins", "team2_pt_team_season_wins",
            "team1_pt_overall_s16", "team2_pt_overall_s16",
            "team1_pt_coach_season_wins", "team2_pt_coach_season_wins",
            "team1_pt_school_ncaa", "team2_pt_school_ncaa"]

X = df[features]
# y does not change

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [36]:
LGBM = lightGBM(X_train, y_train)

In [41]:
CAT = catboost(X_train, y_train)

TypeError: catboost() takes 0 positional arguments but 2 were given

### Performance

In the following, we evaluate each baseline using log-loss and ROC AUC.

Each baseline is scored on the held-out test set produced by train_test_split (20% of the data).


In [37]:
y_pred = LGBM.predict_proba(X_test)[:,1]

In [38]:
print("log_loss is: ", metrics.log_loss(y_test,y_pred))
print("roc_auc is: ", metrics.roc_auc_score(y_test,y_pred))

log_loss is:  0.5753981394987617
roc_auc is:  0.7637502900905082
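
For context on the log-loss value above, it helps to know what an uninformed model scores. The sketch below computes binary log-loss by hand. The helper name `binary_log_loss` and the toy labels are assumptions for illustration only, and the notebook itself uses `metrics.log_loss`.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]

# Always predicting 0.5 scores ln(2), whatever the labels are.
coin_flip = binary_log_loss(labels, [0.5] * 4)
print(round(coin_flip, 4))        # 0.6931

# A confident wrong prediction is penalized heavily: -ln(0.01).
confident_wrong = binary_log_loss([1], [0.01])
print(round(confident_wrong, 4))  # 4.6052
```

A log-loss below roughly 0.693 therefore means the model does better than always predicting 0.5.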
