# Spaceship Titanic: Model Training and Hyperparameter Tuning

Project: [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/overview)

In this notebook, we will:
- Build baseline models using the processed dataset
- Tune model hyperparameters for better performance
- Compare different models
- Prepare a submission file for the competition

Our goal here is to train, validate, and tune models to get the best performance we can on this problem.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

SEED = 42

train = pd.read_csv('data/processed_train.csv')
test = pd.read_csv('data/processed_test.csv')


df_Y = train['Transported']
df_X = train.drop(columns=['Transported'])

### Baseline models
- Here we naively run a set of standard models with default settings to get a baseline prediction accuracy.
    In particular, we run each model several times with different random states and train/validation splits.

In [2]:

n_runs = 10  # how many times to run
results = []

for run in range(n_runs):
    train_x_run, val_x_run, train_y_run, val_y_run = train_test_split(df_X, df_Y, test_size=0.2, random_state=run, stratify=df_Y)
    
    # our baseline models
    models = {
        'RandomForest': RandomForestClassifier(random_state=run),
        'ExtraTrees': ExtraTreesClassifier(random_state=run),
        'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, random_state=run)),
        'GradientBoosting': GradientBoostingClassifier(random_state=run),
        'XGBoost': XGBClassifier(eval_metric='logloss', random_state=run),
        'LightGBM': LGBMClassifier(random_state=run, verbosity=-1),
        'CatBoost': CatBoostClassifier(verbose=0, random_state=run),
        'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),
        'SVC': make_pipeline(StandardScaler(), SVC(probability=True, random_state=run)),
        'MLP': make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=run))
    }
    
    # We train and evaluate
    for name, model in models.items():
        model.fit(train_x_run, train_y_run)
        y_pred = model.predict(val_x_run)
        acc = accuracy_score(val_y_run, y_pred)
        
        results.append({ 'Model': name, 'Accuracy': acc}) # store our results

results_df = pd.DataFrame(results)

# we take the mean and std of our models
summary_df = results_df.groupby('Model')['Accuracy'].agg(['mean', 'std']).reset_index()
print(summary_df)

                Model      mean       std
0            CatBoost  0.812363  0.011334
1          ExtraTrees  0.779758  0.010921
2    GradientBoosting  0.800403  0.008233
3                 KNN  0.737205  0.012256
4            LightGBM  0.806728  0.009658
5  LogisticRegression  0.795860  0.010294
6                 MLP  0.768660  0.013546
7        RandomForest  0.799080  0.012232
8                 SVC  0.790684  0.010492
9             XGBoost  0.802473  0.009085


#### Results:
- The results show that boosting models such as **CatBoost**, **GradientBoosting**, **LightGBM**, and **XGBoost** perform best on this dataset, with mean validation accuracies of roughly 80.0%-81.2%. Among them, **CatBoost** slightly outperforms the others, making it the top candidate before any hyperparameter tuning.


- From this roundup we pick out three models for hyperparameter tuning: **CatBoost**, **LightGBM** and **XGBoost**.
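
As a rough way to read the baseline table, the sketch below ranks the models by mean accuracy and expresses each model's gap to the leader in units of its own run-to-run standard deviation. The numbers are copied from the printed summary above; the column names `gap_to_best` and `gap_in_std` are illustrative additions, and this is only an informal comparison, not a significance test.

```python
import pandas as pd

# Mean and std of validation accuracy over 10 runs, copied from the summary above
summary_df = pd.DataFrame({
    'Model': ['CatBoost', 'ExtraTrees', 'GradientBoosting', 'KNN', 'LightGBM',
              'LogisticRegression', 'MLP', 'RandomForest', 'SVC', 'XGBoost'],
    'mean': [0.812363, 0.779758, 0.800403, 0.737205, 0.806728,
             0.795860, 0.768660, 0.799080, 0.790684, 0.802473],
    'std': [0.011334, 0.010921, 0.008233, 0.012256, 0.009658,
            0.010294, 0.013546, 0.012232, 0.010492, 0.009085],
})

# Rank models from best to worst mean accuracy
ranked = summary_df.sort_values('mean', ascending=False).reset_index(drop=True)

# Gap to the top model, in absolute terms and relative to each model's own std
ranked['gap_to_best'] = ranked.loc[0, 'mean'] - ranked['mean']
ranked['gap_in_std'] = ranked['gap_to_best'] / ranked['std']

print(ranked.round(4))
```

In this table, LightGBM's gap to CatBoost is smaller than its own standard deviation across splits, so the ordering among the top boosting models should be read with some caution.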

## Tuning
- To tune the hyperparameters we use `RandomizedSearchCV`.

In [3]:
# Splitting the dataset
train_x, val_x, train_y, val_y = train_test_split(df_X,df_Y, test_size=0.25, shuffle=True, random_state=SEED)

# Defining our parameter space
catboost_params = {
    'depth': [3, 5, 8],
    'learning_rate': [0.005, 0.01, 0.05, 0.1],
    'iterations': [250, 500, 1000],
    'l2_leaf_reg': [1, 3, 5, 7],
    'bagging_temperature': [0, 1, 5],
    'random_strength': [1, 5, 10]
}

lightgbm_params = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [500, 1000],
    'max_depth': [4, 6, 8],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

xgb_params = {
    'n_estimators': [250, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1, 0.5],
    'reg_alpha': [0, 0.1, 1],  # l1 regularization
    'reg_lambda': [1, 3, 5]    # l2 regularization
}

mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (50,50), (100,50), (100,100)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'adaptive'],
    'learning_rate_init': [0.001, 0.01]
}
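
To get a feel for how much of each search space a randomized search can cover, the sketch below counts the combinations in the three boosting grids with scikit-learn's `ParameterGrid` and previews a few random draws with `ParameterSampler`. The grids are copied from the cell above; the `n_iter = 200` budget matches the searches that follow, while the names `spaces`, `grid_sizes` and `preview` are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search spaces copied from the notebook cell above
catboost_params = {
    'depth': [3, 5, 8],
    'learning_rate': [0.005, 0.01, 0.05, 0.1],
    'iterations': [250, 500, 1000],
    'l2_leaf_reg': [1, 3, 5, 7],
    'bagging_temperature': [0, 1, 5],
    'random_strength': [1, 5, 10],
}
lightgbm_params = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [500, 1000],
    'max_depth': [4, 6, 8],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
xgb_params = {
    'n_estimators': [250, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1, 0.5],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 3, 5],
}

n_iter = 200  # sampling budget used by the searches in this notebook
spaces = {'CatBoost': catboost_params, 'LightGBM': lightgbm_params, 'XGBoost': xgb_params}

# Number of distinct combinations in each grid
grid_sizes = {name: len(ParameterGrid(space)) for name, space in spaces.items()}
for name, size in grid_sizes.items():
    print(f"{name}: {size} combinations, {min(n_iter, size) / size:.1%} covered by {n_iter} draws")

# Preview a few candidate settings drawn at random from the LightGBM grid
preview = list(ParameterSampler(lightgbm_params, n_iter=3, random_state=42))
for candidate in preview:
    print(candidate)
```

The grids contain 1296 (CatBoost), 216 (LightGBM) and 2916 (XGBoost) combinations, so 200 draws nearly exhaust the LightGBM grid but sample only a small fraction of the other two.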

#### CatBoost

In [4]:
cat_model = CatBoostClassifier(verbose=0, random_state=SEED)

cat_random = RandomizedSearchCV(
    cat_model,
    param_distributions=catboost_params,
    n_iter=200,  # number of random combos to try
    cv=3,       # 3-fold CV
    scoring='accuracy',
    random_state=SEED,
    n_jobs=-1
)

cat_random.fit(train_x, train_y)

print("Best CatBoost parameters:", cat_random.best_params_)
print("Best CatBoost CV Accuracy:", cat_random.best_score_)

Best CatBoost parameters: {'random_strength': 1, 'learning_rate': 0.05, 'l2_leaf_reg': 5, 'iterations': 250, 'depth': 5, 'bagging_temperature': 0}
Best CatBoost CV Accuracy: 0.8142353121644424


#### LightGBM

In [5]:
model = LGBMClassifier(random_state=SEED, verbosity = -1)

random_search = RandomizedSearchCV(
    model,
    param_distributions=lightgbm_params,
    n_iter=200,  # number of random combos to try
    cv=3,        # 3-fold cross-validation
    scoring='accuracy',
    random_state=SEED,
    n_jobs=-1
)
random_search.fit(train_x, train_y)

print("Best LightGBM parameters:", random_search.best_params_)
print("Best LightGBM CV Accuracy:", random_search.best_score_)



Best LightGBM parameters: {'subsample': 0.8, 'num_leaves': 31, 'n_estimators': 1000, 'max_depth': 6, 'learning_rate': 0.01, 'colsample_bytree': 0.8}
Best LightGBM CV Accuracy: 0.8091731860714834


#### XGBoost

In [6]:

model = XGBClassifier(random_state=SEED, eval_metric='logloss')

random_search = RandomizedSearchCV(
    model,
    param_distributions=xgb_params,
    n_iter=200,  # number of random combos to try
    cv=3,        # 3-fold cross-validation
    scoring='accuracy',
    random_state=SEED,
    n_jobs=-1
)
random_search.fit(train_x, train_y)

print("Best XGBoost parameters:", random_search.best_params_)
print("Best XGBoost CV Accuracy:", random_search.best_score_)

Best XGBoost parameters: {'subsample': 0.8, 'reg_lambda': 1, 'reg_alpha': 0.1, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.01, 'gamma': 0.1, 'colsample_bytree': 0.8}
Best XGBoost CV Accuracy: 0.8110139591961958


### Result
- Again, **CatBoost** performs best of the three tuned models, with a cross-validated accuracy of about 81.42%.
- We train a new **CatBoost** model with the chosen parameters and the full training dataset.

Parameters:
* 'random_strength': 1
* 'learning_rate': 0.05
* 'l2_leaf_reg': 5
* 'iterations': 250
* 'depth': 5
* 'bagging_temperature': 0
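
The split in `In [3]` also produces `val_x` and `val_y`, which the searches never see. One optional check before refitting on all the data is to score the tuned search on that held-out split. The sketch below shows the pattern on synthetic data, with scikit-learn's `GradientBoostingClassifier` and a tiny grid standing in for CatBoost so it runs quickly. The dataset, grid and the name `search` are assumptions for illustration; in the notebook the same two final lines would apply to `cat_random`, `val_x` and `val_y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the processed training data (illustrative only)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42
)

# Small randomized search, mirroring the structure used in this notebook
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        'n_estimators': [25, 50],
        'max_depth': [2, 3],
        'learning_rate': [0.05, 0.1],
    },
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
search.fit(train_x, train_y)

# With the default refit=True, predict() uses the best estimator
# refitted on all of train_x, so val_x is genuinely unseen data
val_acc = accuracy_score(val_y, search.predict(val_x))
print("CV accuracy:", search.best_score_)
print("Held-out validation accuracy:", val_acc)
```

A held-out score noticeably below the cross-validated one would suggest the search has partly fit to the CV folds.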

In [7]:
# Running our final model and getting our predictions
final_model = CatBoostClassifier(verbose=0, random_state=SEED,
                                 random_strength=1,learning_rate=0.05,
                                 l2_leaf_reg=5, iterations=250,
                                 depth=5, bagging_temperature=0)


final_model.fit(df_X, df_Y) # we use all our training data for our final model.
test_pred = final_model.predict(test)

test_ids = pd.read_csv('data/test.csv')['PassengerId'] # recover PassengerId (removed during encoding)

submission_catboost = pd.DataFrame({ # format the submission
    'PassengerId': test_ids,
    'Transported': test_pred
})
submission_catboost.to_csv('submissions/submission_cat.csv', index=False) # store the submission as csv

### Result
- This CatBoost model got a score of 0.80406 on Kaggle.
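
Before uploading, it can help to sanity-check the submission frame. The sketch below is a minimal example, assuming the competition expects exactly the two columns used above with boolean `Transported` values. The helpers `to_bool` and `check_submission` are hypothetical, not part of any library, and the IDs and predictions are made-up stand-ins. The string handling is there because, depending on the label dtype and library version, a classifier's predictions may come back as strings rather than booleans.

```python
import numpy as np
import pandas as pd

def to_bool(pred):
    """Normalize predictions to a boolean array (accepts bools, 0/1, or 'True'/'False' strings)."""
    arr = np.asarray(pred).ravel()
    if arr.dtype == bool:
        return arr
    return np.array([str(v).strip().lower() in ('true', '1', '1.0') for v in arr])

def check_submission(submission, expected_rows):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission.columns) == ['PassengerId', 'Transported']
    assert len(submission) == expected_rows
    assert submission['PassengerId'].is_unique
    assert submission['Transported'].dtype == bool
    assert not submission.isna().any().any()

# Illustrative stand-ins for test_ids and test_pred from the cell above
test_ids = pd.Series(['id_a', 'id_b', 'id_c'])
test_pred = np.array(['True', 'False', 'True'])

submission = pd.DataFrame({
    'PassengerId': test_ids,
    'Transported': to_bool(test_pred),
})
check_submission(submission, expected_rows=len(test_ids))
print(submission)
```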