## Training (XGBoost edition)
After loading and preprocessing the data, we can now train the model.

### First things first
Import the libraries first. Make sure you have them installed (check the instructions in the `README.md`).
Then split the data into training and test sets, train the model, and evaluate it.

In [1]:
import pandas as pd
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Load the processed data
df = pd.read_csv('data/train_processed.csv')

# Split features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

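# Gradient-boosted trees for binary classification, tracking AUC during training
model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

# Train, printing the AUC on the held-out set after each boosting round
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)

# Hard class predictions for accuracy, positive-class probabilities for ROC AUC
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))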

[0]	validation_0-auc:0.86681
[1]	validation_0-auc:0.86802
[2]	validation_0-auc:0.86939
[3]	validation_0-auc:0.87073
[4]	validation_0-auc:0.86976
[5]	validation_0-auc:0.87485
[6]	validation_0-auc:0.87593
[7]	validation_0-auc:0.87699
[8]	validation_0-auc:0.87891
[9]	validation_0-auc:0.87938
[10]	validation_0-auc:0.87906
[11]	validation_0-auc:0.87975
[12]	validation_0-auc:0.88090
[13]	validation_0-auc:0.88158
[14]	validation_0-auc:0.88268
[15]	validation_0-auc:0.88316
[16]	validation_0-auc:0.88292
[17]	validation_0-auc:0.88394
[18]	validation_0-auc:0.88525
[19]	validation_0-auc:0.88583
[20]	validation_0-auc:0.88576
[21]	validation_0-auc:0.88723
[22]	validation_0-auc:0.88753
[23]	validation_0-auc:0.88861
[24]	validation_0-auc:0.88941
[25]	validation_0-auc:0.89017
[26]	validation_0-auc:0.89100
[27]	validation_0-auc:0.89149
[28]	validation_0-auc:0.89210
[29]	validation_0-auc:0.89216
[30]	validation_0-auc:0.89215
[31]	validation_0-auc:0.89224
[32]	validation_0-auc:0.89254
[33]	validation_0-au

An accuracy of about 0.8 is really good! XGBoost is clearly better than the MLP (~0.55 after optimizations).
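
To put an accuracy number in context, it helps to compare it against the score you would get by always predicting the most frequent class. The snippet below is a minimal sketch of that check; the helper name `majority_baseline_accuracy` and the example labels are made up for illustration and are not part of the notebook or the competition data.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

# Hypothetical labels for illustration only (not the competition data)
y_example = np.array([1, 0, 1, 1, 0, 1, 0, 1])
baseline = majority_baseline_accuracy(y_example)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

In the notebook this would be called as `majority_baseline_accuracy(y_test)`. If the classes are roughly balanced, the baseline sits near 0.5.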

I'll go with this model for my first submission. The goal is to see how well I'm doing so far, and whether I should double down on this version or keep trying different ways of modelling the problem.

In [11]:
# Load the raw test data (for the IDs) and the processed test features
df_submission_original = pd.read_csv('data/test.csv')
df_submission = pd.read_csv('data/test_processed.csv')

# Predict and convert the 0/1 labels to booleans
y_submission_pred = model.predict(df_submission)
y_submission_bool = y_submission_pred.astype(bool)

# Build the submission frame: one row per passenger
submission_df = pd.DataFrame({
    'PassengerId': df_submission_original['PassengerId'],  # or whatever ID column you have
    'Transported': y_submission_bool
})

submission_df.to_csv('data/submission.csv', index=False)
print(submission_df)


     PassengerId  Transported
0        0013_01         True
1        0018_01        False
2        0019_01         True
3        0021_01         True
4        0023_01         True
...          ...          ...
4272     9266_02         True
4273     9269_01         True
4274     9271_01         True
4275     9273_01         True
4276     9277_01         True

[4277 rows x 2 columns]
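
Before uploading, a quick sanity check on the submission frame can catch silly mistakes such as a row mismatch or a non-boolean column. This is a rough sketch: `check_submission` is a hypothetical helper, and the expected format (exactly the columns `PassengerId` and `Transported`, boolean predictions, same ID order as the test file) is my assumption based on the output above.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    # Assumed format: exactly these two columns, in this order
    if list(submission.columns) != ['PassengerId', 'Transported']:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    expected = pd.Series(list(expected_ids))
    if len(submission) != len(expected):
        problems.append(f"expected {len(expected)} rows, got {len(submission)}")
    elif not (submission['PassengerId'].values == expected.values).all():
        problems.append("PassengerId order differs from the test file")
    if submission['PassengerId'].duplicated().any():
        problems.append("duplicate PassengerId values")
    if submission['Transported'].dtype != bool:
        problems.append(f"Transported has dtype {submission['Transported'].dtype}, expected bool")
    return problems

# Tiny illustrative frames; the IDs are the first three from the output above
ids = ['0013_01', '0018_01', '0019_01']
good = pd.DataFrame({'PassengerId': ids, 'Transported': [True, False, True]})
bad = good.assign(Transported=[1, 0, 1])  # integers instead of booleans

print(check_submission(good, ids))
print(check_submission(bad, ids))
```

In the notebook the call would be `check_submission(submission_df, df_submission_original['PassengerId'])`, and an empty list means none of these checks failed.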
