## Training (XGBoost edition)
After loading and preprocessing the data, we can now train the model.

### First things first
Start by importing the libraries. Make sure they are installed (see the instructions in `README.md`).

Then, let's find the best hyperparameters with a grid search (`GridSearchCV`).

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the processed data
df = pd.read_csv('data/' + 'train_processed.csv')

# Split features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


param_grid = {
    'n_estimators': [300, 500],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [1],
    'colsample_bytree': [0.8, 1],
    'min_child_weight': [1, 5, 10],
    'gamma': [0, 1, 5]
}

grid = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic', random_state=42),
                    param_grid,
                    scoring='roc_auc',
                    cv=3,
                    verbose=1,
                    n_jobs=-1)

grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

Fitting 3 folds for each of 324 candidates, totalling 972 fits
Best params: {'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.01, 'max_depth': 6, 'min_child_weight': 5, 'n_estimators': 500, 'subsample': 1}
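
The candidate and fit counts in the log above follow directly from the grid: the number of candidates is the product of the number of values per hyperparameter, and each candidate is fitted once per fold. A quick sanity check using only the standard library, with `param_grid` copied from the cell above:

```python
from math import prod

# Copy of the grid searched above.
param_grid = {
    'n_estimators': [300, 500],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [1],
    'colsample_bytree': [0.8, 1],
    'min_child_weight': [1, 5, 10],
    'gamma': [0, 1, 5],
}
cv_folds = 3  # matches cv=3 in GridSearchCV

# One candidate per combination of values; one fit per candidate and fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 324 972
```

Adding one more value to any list multiplies the total, so the grid grows quickly.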


### On to training!
Now that we have the best hyperparameters, let's train the classifier.

In [2]:
# XGBoost defaults for reference: learning_rate=0.3, n_estimators=100, max_depth=6
model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    n_estimators=grid.best_params_['n_estimators'],
    learning_rate=grid.best_params_['learning_rate'],
    max_depth=grid.best_params_['max_depth'],
    subsample=grid.best_params_['subsample'],
    colsample_bytree=grid.best_params_['colsample_bytree'],
    gamma=grid.best_params_['gamma'],
    min_child_weight=grid.best_params_['min_child_weight'],
    random_state=67
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))

[0]	validation_0-auc:0.86395
[1]	validation_0-auc:0.87214
[2]	validation_0-auc:0.87163
[3]	validation_0-auc:0.87335
[4]	validation_0-auc:0.87713
[5]	validation_0-auc:0.87814
[6]	validation_0-auc:0.87886
[7]	validation_0-auc:0.87960
[8]	validation_0-auc:0.87980
[9]	validation_0-auc:0.87995
[10]	validation_0-auc:0.87934
[11]	validation_0-auc:0.88067
[12]	validation_0-auc:0.88006
[13]	validation_0-auc:0.88095
[14]	validation_0-auc:0.88161
[15]	validation_0-auc:0.88203
[16]	validation_0-auc:0.88232
[17]	validation_0-auc:0.88333
[18]	validation_0-auc:0.88373
[19]	validation_0-auc:0.88439
[20]	validation_0-auc:0.88420
[21]	validation_0-auc:0.88428
[22]	validation_0-auc:0.88408
[23]	validation_0-auc:0.88491
[24]	validation_0-auc:0.88503
[25]	validation_0-auc:0.88602
[26]	validation_0-auc:0.88579
[27]	validation_0-auc:0.88577
[28]	validation_0-auc:0.88618
[29]	validation_0-auc:0.88649
[30]	validation_0-auc:0.88627
[31]	validation_0-auc:0.88600
[32]	validation_0-auc:0.88569
[33]	validation_0-au
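
The log is long, so the round with the highest validation AUC is easier to find programmatically. Below is a minimal sketch that parses lines in the format printed above. `best_round` is a hypothetical helper, not part of XGBoost, and only the first ten log lines plus the truncated last one are reproduced here.

```python
import re

# First ten lines of the log above, plus the truncated final line.
log_lines = [
    "[0]\tvalidation_0-auc:0.86395",
    "[1]\tvalidation_0-auc:0.87214",
    "[2]\tvalidation_0-auc:0.87163",
    "[3]\tvalidation_0-auc:0.87335",
    "[4]\tvalidation_0-auc:0.87713",
    "[5]\tvalidation_0-auc:0.87814",
    "[6]\tvalidation_0-auc:0.87886",
    "[7]\tvalidation_0-auc:0.87960",
    "[8]\tvalidation_0-auc:0.87980",
    "[9]\tvalidation_0-auc:0.87995",
    "[33]\tvalidation_0-au",
]

PATTERN = re.compile(r"\[(\d+)\]\s+validation_0-auc:([0-9.]+)")

def best_round(lines):
    """Return (round_index, auc) for the highest AUC among parseable lines."""
    scores = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:  # truncated or unrelated lines are skipped
            scores.append((int(match.group(1)), float(match.group(2))))
    return max(scores, key=lambda item: item[1])

best_idx, best_auc = best_round(log_lines)
print(best_idx, best_auc)  # 9 0.87995
```

Note that the AUC here is measured on the held-out split, so choosing the number of rounds this way tunes the model to that split.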

An accuracy of about 0.8 is a strong result. XGBoost clearly outperforms the MLP, which reached only ~0.55 after optimizations.

I'll go with this model for my first submission. The goal is to see how well I'm doing so far, and whether I should double down on this version or keep trying different ways of modelling the problem.

### Find the best threshold

In [3]:
# IMPROVEMENT: Use default threshold (0.5) to avoid overfitting to validation set
# Custom thresholds often don't generalize well to test data

best_t = 0.5
print(f"Using default threshold: {best_t:.2f}")
print(f"Validation accuracy at 0.5: {accuracy_score(y_test, y_proba > best_t):.4f}")

Using default threshold: 0.50
Validation accuracy at 0.5: 0.8120
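
For reference, a threshold search of the kind this section's title refers to could look like the sketch below. It is not what the notebook uses. The labels and probabilities are made-up toy values, and `accuracy_at` and `best_threshold` are hypothetical helper names; in the notebook, `y_test` and `y_proba` would take the place of the toy lists.

```python
def accuracy_at(y_true, y_score, threshold):
    """Fraction of samples where (score > threshold) matches the label."""
    hits = sum((score > threshold) == bool(label)
               for label, score in zip(y_true, y_score))
    return hits / len(y_true)

def best_threshold(y_true, y_score, candidates):
    """Return the candidate with the highest accuracy (first one wins ties)."""
    return max(candidates, key=lambda t: accuracy_at(y_true, y_score, t))

# Toy data for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.10, 0.47, 0.42, 0.80, 0.66, 0.52]

candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
t = best_threshold(y_true, y_score, candidates)

print(f"best threshold: {t:.2f}")
print(f"accuracy at best: {accuracy_at(y_true, y_score, t):.3f}")
print(f"accuracy at 0.5:  {accuracy_at(y_true, y_score, 0.5):.3f}")
```

A threshold picked this way is fitted to the split it was searched on, so the gain may not carry over to the test data. That is the reason for keeping 0.5 here.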


In [4]:
# Load the test data
df_submission_original = pd.read_csv('data/' + 'test.csv')
df_submission = pd.read_csv('data/' + 'test_processed.csv')

# Train multiple models and average their predictions
ensemble_probas = []
seeds = [67, 99, 111, 879, 2026]

for seed in seeds:
    print(f"Training model with seed {seed}...")
    ensemble_model = xgb.XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        n_estimators=grid.best_params_['n_estimators'],
        learning_rate=grid.best_params_['learning_rate'],
        max_depth=grid.best_params_['max_depth'],
        subsample=grid.best_params_['subsample'],
        colsample_bytree=grid.best_params_['colsample_bytree'],
        gamma=grid.best_params_['gamma'],
        min_child_weight=grid.best_params_['min_child_weight'],
        random_state=seed
    )
    
    ensemble_model.fit(X_train, y_train, verbose=False)
    proba = ensemble_model.predict_proba(df_submission)[:, 1]
    ensemble_probas.append(proba)

# Average predictions from all models
y_submission_proba = np.mean(ensemble_probas, axis=0)
y_submission_bool = (y_submission_proba > best_t)

submission_df = pd.DataFrame({
    'PassengerId': df_submission_original['PassengerId'],
    'Transported': y_submission_bool
})

submission_df.to_csv('data/submission.csv', index=False)
print(submission_df)


Training model with seed 67...
Training model with seed 99...
Training model with seed 111...
Training model with seed 879...
Training model with seed 2026...
     PassengerId  Transported
0        0013_01         True
1        0018_01        False
2        0019_01         True
3        0021_01         True
4        0023_01         True
...          ...          ...
4272     9266_02         True
4273     9269_01        False
4274     9271_01         True
4275     9273_01         True
4276     9277_01         True

[4277 rows x 2 columns]
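
The averaging step above is a form of soft voting: the per-seed probabilities are averaged first and the threshold is applied once. Here is a minimal standard-library sketch with made-up probabilities, showing how that can differ from thresholding each model separately and taking a majority. The values and the names `soft_vote` and `hard_vote` are illustrative assumptions.

```python
from statistics import mean

def soft_vote(probas_per_model, threshold=0.5):
    """Average probabilities across models per sample, then threshold once."""
    averaged = [mean(sample) for sample in zip(*probas_per_model)]
    return averaged, [p > threshold for p in averaged]

def hard_vote(probas_per_model, threshold=0.5):
    """Threshold each model first, then take the majority per sample."""
    n_models = len(probas_per_model)
    return [sum(p > threshold for p in sample) > n_models / 2
            for sample in zip(*probas_per_model)]

# Toy probabilities: 3 models (rows) x 2 samples (columns).
probas = [
    [0.90, 0.45],
    [0.40, 0.48],
    [0.45, 0.80],
]

averaged, soft = soft_vote(probas)
hard = hard_vote(probas)
print(soft)  # [True, True]
print(hard)  # [False, False]
```

In both toy samples a single confident model pulls the average above 0.5 even though it is outvoted, which is the behaviour `np.mean(ensemble_probas, axis=0)` followed by `> best_t` gives in the cell above.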
