## Training (XGBoost edition)
After loading and preprocessing the data, we can now train the model.

### First things first
Start by importing the libraries. Make sure they are installed (see the instructions in `README.md`).

Then, let's find the best hyperparameters with scikit-learn's `GridSearchCV`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score


# Load the processed data
df = pd.read_csv('data/' + 'train_processed.csv')

# Split features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1],
    'colsample_bytree': [0.7, 0.8, 1]
}

grid = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic', random_state=42),
                    param_grid,
                    scoring='roc_auc',
                    cv=3,
                    verbose=1,
                    n_jobs=-1)

grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

Fitting 3 folds for each of 243 candidates, totalling 729 fits
Best params: {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300, 'subsample': 0.8}
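
The fit count in the log follows from the grid itself: five hyperparameters with three values each give 3^5 = 243 candidates, and 3-fold cross-validation triples that to 729 fits. The sketch below reproduces the arithmetic with the standard library only. The `count_fits` helper is hypothetical and not part of scikit-learn. As far as I know, the final refit of the best candidate on the full training set (`refit=True` by default) is not included in the logged total.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1],
    'colsample_bytree': [0.7, 0.8, 1],
}

def count_fits(grid, cv):
    """Return (number of candidate combinations, total fits across cv folds)."""
    n_candidates = sum(1 for _ in product(*grid.values()))
    return n_candidates, n_candidates * cv

n_candidates, n_fits = count_fits(param_grid, cv=3)
print(n_candidates, n_fits)  # 243 729
```

Adding one more value to a single hyperparameter would raise the total to 4 * 3^4 * 3 = 972 fits, so it pays to check the grid size before launching a search.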


### On to training!
Now that we have the best hyperparameters, let's train the classifier.

In [2]:
# For reference, recent XGBoost defaults are learning_rate=0.3, n_estimators=100, max_depth=6
model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    n_estimators=grid.best_params_['n_estimators'],
    learning_rate=grid.best_params_['learning_rate'],
    max_depth=grid.best_params_['max_depth'],
    subsample=grid.best_params_['subsample'],
    colsample_bytree=grid.best_params_['colsample_bytree'],
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))

[0]	validation_0-auc:0.83842
[1]	validation_0-auc:0.85933
[2]	validation_0-auc:0.85990
[3]	validation_0-auc:0.86193
[4]	validation_0-auc:0.86384
[5]	validation_0-auc:0.86672
[6]	validation_0-auc:0.86878
[7]	validation_0-auc:0.86949
[8]	validation_0-auc:0.86917
[9]	validation_0-auc:0.86880
[10]	validation_0-auc:0.86941
[11]	validation_0-auc:0.86997
[12]	validation_0-auc:0.86995
[13]	validation_0-auc:0.87193
[14]	validation_0-auc:0.87143
[15]	validation_0-auc:0.87126
[16]	validation_0-auc:0.87198
[17]	validation_0-auc:0.87249
[18]	validation_0-auc:0.87297
[19]	validation_0-auc:0.87361
[20]	validation_0-auc:0.87389
[21]	validation_0-auc:0.87426
[22]	validation_0-auc:0.87508
[23]	validation_0-auc:0.87584
[24]	validation_0-auc:0.87615
[25]	validation_0-auc:0.87675
[26]	validation_0-auc:0.87711
[27]	validation_0-auc:0.87757
[28]	validation_0-auc:0.87799
[29]	validation_0-auc:0.87841
[30]	validation_0-auc:0.87864
[31]	validation_0-auc:0.87978
[32]	validation_0-auc:0.87985
[33]	validation_0-au... (output truncated)

An accuracy of about 0.8 is a strong result. XGBoost clearly outperforms the MLP, which reached only ~0.55 after optimization.

I'll go with this model for my first submission. The goal is to see how well I'm doing so far, and whether I should double down on this version or keep trying different ways of modelling the problem.

### Find the best threshold

In [3]:
# Sweep 100 evenly spaced thresholds and keep the one with the highest hold-out accuracy
thresholds = np.linspace(0, 1, 100)
accuracies = [(t, accuracy_score(y_test, y_proba > t)) for t in thresholds]
best_t, best_acc = max(accuracies, key=lambda x: x[1])
print(f"Best threshold: {best_t:.2f} with accuracy: {best_acc:.4f}")

Best threshold: 0.54 with accuracy: 0.8131
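
Because the threshold is selected on the same hold-out set that reports the accuracy, the 0.8131 figure is probably a little optimistic. One way to check this is to pick the threshold on one half of the hold-out data and score it on the other half. The sketch below does that on synthetic data standing in for `y_test` and `y_proba`. The helper names and the data are hypothetical, and it uses only the standard library so it runs on its own.

```python
import random

def accuracy_at(y_true, proba, t):
    """Fraction of samples where (proba > t) matches the boolean label."""
    hits = sum((p > t) == bool(y) for y, p in zip(y_true, proba))
    return hits / len(y_true)

def best_threshold(y_true, proba, n_steps=100):
    """Sweep n_steps evenly spaced thresholds in [0, 1] and return the most accurate one."""
    thresholds = [i / (n_steps - 1) for i in range(n_steps)]
    return max(thresholds, key=lambda t: accuracy_at(y_true, proba, t))

# Synthetic stand-in for (y_test, y_proba): positives tend to score higher.
rng = random.Random(42)
y = [rng.random() < 0.5 for _ in range(1000)]
p = [min(1.0, max(0.0, rng.gauss(0.65 if label else 0.35, 0.2))) for label in y]

# Tune the threshold on the first half, then report accuracy on the second half.
half = len(y) // 2
t_star = best_threshold(y[:half], p[:half])
print(f"Threshold tuned on half A: {t_star:.2f}")
print(f"Accuracy on half A: {accuracy_at(y[:half], p[:half], t_star):.4f}")
print(f"Accuracy on half B: {accuracy_at(y[half:], p[half:], t_star):.4f}")
```

If the accuracy on the second half is close to the first, the tuned threshold is probably safe to use. A large gap would suggest the threshold is fitting noise in the hold-out set.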


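### Create the submission
Finally, apply the model and the tuned threshold to the test set and write the submission file.

In [4]: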
# Load the raw test data (for PassengerId) and its preprocessed features
df_submission_original = pd.read_csv('data/' + 'test.csv')
df_submission = pd.read_csv('data/' + 'test_processed.csv')

# Apply the tuned threshold instead of predict()'s default 0.5 cutoff
y_submission_proba = model.predict_proba(df_submission)[:, 1]
y_submission_bool = y_submission_proba > best_t

submission_df = pd.DataFrame({
    'PassengerId': df_submission_original['PassengerId'],
    'Transported': y_submission_bool
})

submission_df.to_csv('data/submission.csv', index=False)
print(submission_df)


     PassengerId  Transported
0        0013_01         True
1        0018_01        False
2        0019_01         True
3        0021_01         True
4        0023_01         True
...          ...          ...
4272     9266_02         True
4273     9269_01         True
4274     9271_01         True
4275     9273_01         True
4276     9277_01         True

[4277 rows x 2 columns]
