# Spaceship Titanic: Model Training and Hyperparameter Tuning

Project: [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/overview)

In this notebook, we will:
- Build baseline models using the processed dataset
- Tune model hyperparameters for better performance
- Compare different models
- Prepare submission files for Kaggle

Our goal here is to train, validate, and tune models to get the best performance we can on this problem.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

SEED = 42 # define our random seed

train = pd.read_csv('data/processed_train.csv')
test = pd.read_csv('data/processed_test.csv')


df_Y = train['Transported']
df_X = train.drop(columns=['Transported'])

### Baseline models
- Here we naively run a set of standard models to get a baseline validation accuracy and decide which models are worth the time to tune. In particular, we run each model 10 times with different random states and train/validation splits, then report the mean and standard deviation of the accuracy.

In [None]:

n_runs = 10  # number of different random seeds/runs
results = [] # where we will collect the results

for run in range(n_runs):
    # split into train and validation for this run, using stratification to preserve class balance
    train_x_run, val_x_run, train_y_run, val_y_run = train_test_split(df_X, df_Y, test_size=0.2, random_state=run, stratify=df_Y)
    
    # define a set of baseline models to train
    models = {
        'RandomForest': RandomForestClassifier(random_state=run), # basic random forest
        'ExtraTrees': ExtraTreesClassifier(random_state=run), # extra trees ensemble
        'LogisticRegression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, random_state=run)), # logistic regression with scaling; max_iter is raised because the default (100) caused a small error, most likely a convergence warning
        'GradientBoosting': GradientBoostingClassifier(random_state=run), # gradient boosting trees
        'XGBoost': XGBClassifier(eval_metric='logloss', random_state=run), # boosted trees via XGBoost
        'LightGBM': LGBMClassifier(random_state=run, verbosity=-1), # boosted trees via LightGBM
        'CatBoost': CatBoostClassifier(verbose=0, random_state=run), # CatBoost for categorical-friendly boosting
        'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier()),  # k-nearest neighbors with scaling (important because KNN is distance-based)
        'SVC': make_pipeline(StandardScaler(), SVC(probability=True, random_state=run)), # support vector classifier with scaling (important because SVMs are sensitive to feature scales)
        'MLP': make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=run)) # basic multi-layer perceptron with scaling (helps neural nets converge better)
    }
    
    # train and evaluate each model
    for name, model in models.items():
        model.fit(train_x_run, train_y_run) # train the model
        y_pred = model.predict(val_x_run) # predict on the validation split
        acc = accuracy_score(val_y_run, y_pred) # calculate accuracy
        
        results.append({'Model': name, 'Accuracy': acc}) # save the result

# create a DataFrame from all runs
results_df = pd.DataFrame(results)

# compute the mean and standard deviation accuracy for each model across all runs
summary_df = results_df.groupby('Model')['Accuracy'].agg(['mean', 'std']).reset_index()
print(summary_df)

                Model      mean       std
0            CatBoost  0.812363  0.011334
1          ExtraTrees  0.779758  0.010921
2    GradientBoosting  0.800403  0.008233
3                 KNN  0.737205  0.012256
4            LightGBM  0.806728  0.009658
5  LogisticRegression  0.795860  0.010294
6                 MLP  0.768660  0.013546
7        RandomForest  0.799080  0.012232
8                 SVC  0.790684  0.010492
9             XGBoost  0.802473  0.009085


#### Results:

- The results show that boosting models (**CatBoost**, **LightGBM**, **XGBoost**, and **GradientBoosting**) achieve the highest mean validation accuracies on this dataset, roughly 80.0% to 81.2% across the 10 runs. **CatBoost** has the highest mean (81.2%), although its lead over **LightGBM** (80.7%) is smaller than one standard deviation (about 1 percentage point), so it is the strongest untuned candidate only by a small margin.

- Based on these findings, we selected **CatBoost**, **LightGBM**, and **XGBoost** for further hyperparameter optimization in the next stage of the project.
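
To put the ranking above in context, the sketch below expresses each model's gap to the best mean accuracy in units of the best model's run-to-run standard deviation. The numbers are copied from the summary table. `gaps_to_best` is a hypothetical helper, and this is a rough heuristic rather than a significance test.

```python
# Mean and standard deviation of validation accuracy over the 10 runs,
# copied from the printed summary table.
baseline = {
    "CatBoost": (0.812363, 0.011334),
    "ExtraTrees": (0.779758, 0.010921),
    "GradientBoosting": (0.800403, 0.008233),
    "KNN": (0.737205, 0.012256),
    "LightGBM": (0.806728, 0.009658),
    "LogisticRegression": (0.795860, 0.010294),
    "MLP": (0.768660, 0.013546),
    "RandomForest": (0.799080, 0.012232),
    "SVC": (0.790684, 0.010492),
    "XGBoost": (0.802473, 0.009085),
}


def gaps_to_best(summary):
    """Rank models by mean accuracy and report each gap to the best model."""
    best_name, (best_mean, best_std) = max(summary.items(), key=lambda kv: kv[1][0])
    ranked = sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True)
    rows = []
    for name, (mean, _std) in ranked:
        gap = best_mean - mean
        # Gap measured in units of the best model's standard deviation.
        rows.append((name, mean, gap, gap / best_std))
    return best_name, rows


best_name, rows = gaps_to_best(baseline)
for name, mean, gap, gap_in_std in rows:
    print(f"{name:<20} mean={mean:.4f}  gap={gap:.4f}  ({gap_in_std:.2f} std)")
```

Gaps well below one standard deviation suggest that the ordering among the top models could change with a different set of splits.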

## Tuning
- To tune hyperparameters we use `RandomizedSearchCV`.

In [None]:
# Splitting the dataset into training and validation sets
train_x, val_x, train_y, val_y = train_test_split(df_X, df_Y, test_size=0.25, shuffle=True, random_state=SEED) 
# 25% validation split to leave enough data for training while still having a robust validation set
# shuffle=True ensures random mixing
# random_state=SEED for reproducibility across runs


# Defining parameter grids for hyperparameter tuning of our main baseline models

# --- CatBoost hyperparameters ---
catboost_params = {
    'depth': [3, 5, 8], # controls tree depth; deeper trees can capture more complex patterns but may overfit
    'learning_rate': [0.005, 0.01, 0.05, 0.1], # lower learning rates usually mean slower but safer convergence
    'iterations': [250, 500, 1000], # how many boosting rounds; more rounds can fit better but risk overfitting if too many
    'l2_leaf_reg': [1, 3, 5, 7], # L2 regularization to penalize large weights, helps avoid overfitting
    'bagging_temperature': [0, 1, 5], # controls randomness in bagging; higher values add more randomness, can help generalization
    'random_strength': [1, 5, 10] # randomness for feature splits; again helps regularize the model
}

# --- LightGBM hyperparameters ---
lightgbm_params = {
    'num_leaves': [31, 63, 127], # number of leaves controls model complexity; larger values = more complex model
    'learning_rate': [0.01, 0.05, 0.1], # learning rate tradeoff: slower rates may give better results if you can afford more training
    'n_estimators': [500, 1000], # total number of boosting rounds; tied to learning rate
    'max_depth': [4, 6, 8], # limit maximum depth to avoid very large trees that overfit
    'subsample': [0.8, 1.0], # row sampling fraction per tree (bagging); note that LightGBM only applies it when subsample_freq > 0, which is left at its default of 0 here
    'colsample_bytree': [0.8, 1.0] # randomly sample part of features per tree; another regularization method
}

# --- XGBoost hyperparameters ---
xgb_params = {
    'n_estimators': [250, 500, 1000], # number of trees (same logic: more trees = better fit, more overfitting risk)
    'learning_rate': [0.01, 0.05, 0.1], # small learning rates are safer but need more trees
    'max_depth': [3, 5, 7], # smaller depths generalize better; deeper trees can overfit
    'subsample': [0.8, 1.0], # bagging fraction of data; reduces overfitting
    'colsample_bytree': [0.8, 1.0], # feature bagging; forces model to not rely on all features all the time
    'gamma': [0, 0.1, 0.5], # minimum loss reduction required to make a split; adds pruning effect
    'reg_alpha': [0, 0.1, 1], # L1 regularization term (sparsity), useful for feature selection
    'reg_lambda': [1, 3, 5]  # L2 regularization term (weight shrinkage), helps with overfitting
}
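
As a quick check on how much of each search space a randomized search can cover, the sketch below counts the combinations in the three grids defined above. The grids are repeated without comments so the block runs on its own. `grid_size` is a hypothetical helper, and `n_iter = 200` is the budget used in the searches in this notebook. When every parameter is given as a list, `RandomizedSearchCV` samples combinations without replacement, so the ratio is the fraction of the grid actually tried.

```python
from math import prod

catboost_params = {
    'depth': [3, 5, 8],
    'learning_rate': [0.005, 0.01, 0.05, 0.1],
    'iterations': [250, 500, 1000],
    'l2_leaf_reg': [1, 3, 5, 7],
    'bagging_temperature': [0, 1, 5],
    'random_strength': [1, 5, 10],
}
lightgbm_params = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [500, 1000],
    'max_depth': [4, 6, 8],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
xgb_params = {
    'n_estimators': [250, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1, 0.5],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 3, 5],
}


def grid_size(param_grid):
    """Number of distinct combinations in a grid of candidate lists."""
    return prod(len(values) for values in param_grid.values())


grids = {'CatBoost': catboost_params, 'LightGBM': lightgbm_params, 'XGBoost': xgb_params}
n_iter = 200  # search budget per model

for name, grid in grids.items():
    total = grid_size(grid)
    sampled = min(n_iter, total)
    print(f"{name}: {sampled}/{total} combinations ({sampled / total:.1%})")
```

With this budget the LightGBM grid is searched almost exhaustively, while the CatBoost and XGBoost searches see only a fraction of their grids.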

#### CatBoost

In [None]:
# Initialize the base CatBoost model
cat_model = CatBoostClassifier(verbose=0, random_state=SEED) 
# verbose=0 to keep output clean
# setting random_state ensures reproducibility

# Setting up the Randomized Search for hyperparameter tuning
cat_random = RandomizedSearchCV(
    cat_model,
    param_distributions=catboost_params, # search space defined earlier
    n_iter=200,  # number of random combinations to try — high enough for good coverage without taking forever
    cv=3,        # 3-fold cross-validation: balances between speed and a reliable estimate of model performance
    scoring='accuracy', # optimize based on accuracy (since that's the competition metric)
    random_state=SEED, # reproducibility: makes search results consistent if rerun
    n_jobs=-1 # use all available CPU cores for faster search
)

# Run the randomized hyperparameter search
cat_random.fit(train_x, train_y)

# Print the best parameters and cross-validation accuracy score
print("Best CatBoost parameters:", cat_random.best_params_)
print("Best CatBoost CV Accuracy:", cat_random.best_score_)

Best CatBoost parameters: {'random_strength': 1, 'learning_rate': 0.05, 'l2_leaf_reg': 5, 'iterations': 250, 'depth': 5, 'bagging_temperature': 0}
Best CatBoost CV Accuracy: 0.8142353121644424


#### LightGBM

In [None]:
# Initialize the base LightGBM model
model = LGBMClassifier(random_state=SEED, verbosity=-1) 
# random_state for reproducibility
# verbosity=-1 to suppress LightGBM output and keep logs clean

# Setting up the Randomized Search for LightGBM
random_search = RandomizedSearchCV(
    model,
    param_distributions=lightgbm_params, # search space defined earlier
    n_iter=200,  # number of random combinations to try — enough to explore the space without taking too long
    cv=3,        # 3-fold cross-validation: common choice that balances speed and reliability
    scoring='accuracy', # optimizing for accuracy (competition metric)
    random_state=SEED, # reproducibility: same random trials every time
    n_jobs=-1 # use all CPU cores for faster search
)

# Run the randomized hyperparameter search
random_search.fit(train_x, train_y)

# Print the best parameters and cross-validation score
print("Best LightGBM parameters:", random_search.best_params_)
print("Best LightGBM CV Accuracy:", random_search.best_score_)




Best LightGBM parameters: {'subsample': 0.8, 'num_leaves': 31, 'n_estimators': 1000, 'max_depth': 6, 'learning_rate': 0.01, 'colsample_bytree': 0.8}
Best LightGBM CV Accuracy: 0.8091731860714834


#### XGBoost

In [None]:

# Initialize the base XGBoost model
model = XGBClassifier(random_state=SEED, eval_metric='logloss') 
# random_state for reproducibility
# eval_metric='logloss' is set explicitly; it matches the current default for binary classification and avoids the default-metric warning that some older XGBoost versions emit

# Setting up the Randomized Search for XGBoost
random_search = RandomizedSearchCV(
    model,
    param_distributions=xgb_params, # search space defined earlier
    n_iter=200,  # number of random parameter combinations to try — large enough for good coverage, faster than exhaustive grid search
    cv=3,        # 3-fold cross-validation: balances between training time and reliable evaluation
    scoring='accuracy', # targeting accuracy as the optimization metric (fits Kaggle competition goal)
    random_state=SEED, # ensures consistent trial results across runs
    n_jobs=-1 # use all CPU cores to parallelize the search and speed up
)

# Run the randomized hyperparameter search
random_search.fit(train_x, train_y)

# Print the best parameters and cross-validation score
print("Best XGBoost parameters:", random_search.best_params_)
print("Best XGBoost CV Accuracy:", random_search.best_score_)

Best XGBoost parameters: {'subsample': 0.8, 'reg_lambda': 1, 'reg_alpha': 0.1, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.01, 'gamma': 0.1, 'colsample_bytree': 0.8}
Best XGBoost CV Accuracy: 0.8110139591961958
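
The searches above are fit on `train_x` only, so the held-out `val_x` / `val_y` split can serve as an independent check on the tuned models. The sketch below shows the pattern on synthetic data, with a small `GradientBoostingClassifier` grid standing in for the real models. The dataset, the grid, and `stratify=y` are assumptions made for illustration and are not part of the notebook's actual runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

SEED = 42

# Synthetic stand-in for df_X / df_Y so the sketch runs on its own.
X, y = make_classification(n_samples=600, n_features=10, random_state=SEED)

# Same 75/25 split as in the notebook, here with stratification added.
train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    param_distributions={
        'n_estimators': [50, 100],
        'max_depth': [2, 3],
        'learning_rate': [0.05, 0.1],
    },
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=SEED,
)
search.fit(train_x, train_y)

# best_estimator_ is refit on all of train_x (refit=True is the default),
# so val_x played no part in choosing the hyperparameters.
val_acc = accuracy_score(val_y, search.best_estimator_.predict(val_x))
print(f"CV accuracy:       {search.best_score_:.4f}")
print(f"Hold-out accuracy: {val_acc:.4f}")
```

A hold-out accuracy noticeably below the cross-validation score would hint that the search has overfit the cross-validation folds.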


### Result

- Once again, **CatBoost** achieved the best performance among the tuned models, with a cross-validation accuracy of approximately **81.42%** (versus 81.10% for XGBoost and 80.92% for LightGBM).
- Using the selected hyperparameters, we retrain a new **CatBoost** model on the full training dataset so that the final model sees all labeled data.

The best-found parameters were:
- `random_strength`: 1
- `learning_rate`: 0.05
- `l2_leaf_reg`: 5
- `iterations`: 250
- `depth`: 5
- `bagging_temperature`: 0

#### Training the final model and making the submission

In [None]:
# Running our final CatBoost model and generating predictions

# Initialize the final CatBoost model with the best-found hyperparameters
final_model = CatBoostClassifier(
    verbose=0, 
    random_state=SEED,
    random_strength=1,
    learning_rate=0.05,
    l2_leaf_reg=5,
    iterations=250,
    depth=5,
    bagging_temperature=0
)

# Train the model on the full training dataset (no validation split, we use all training data)
final_model.fit(df_X, df_Y)

# Predict Transported status on the test set
test_pred = final_model.predict(test)

# Load the PassengerId column back in (it was dropped during preprocessing)
test_ids = pd.read_csv('data/test.csv')['PassengerId']

# Format the submission file according to Kaggle rules
submission_catboost = pd.DataFrame({
    'PassengerId': test_ids,
    'Transported': test_pred
})

# Save the submission file as a CSV
submission_catboost.to_csv('submissions/submission_cat.csv', index=False)
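
Before uploading, it can help to sanity-check the submission file. The sketch below is one possible check using only the standard library. `check_submission` is a hypothetical helper, the passenger IDs in `sample` are made up, and the expected format (a `PassengerId,Transported` header with `True`/`False` values) is an assumption about what the competition accepts.

```python
import csv
import io


def check_submission(csv_text, expected_rows=None):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ['PassengerId', 'Transported']:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    rows = list(reader)
    ids = [row['PassengerId'] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate PassengerId values")

    # pandas writes boolean predictions as the strings True / False.
    bad = [row for row in rows if row['Transported'] not in ('True', 'False')]
    if bad:
        problems.append(f"{len(bad)} rows with a non-boolean Transported value")

    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems


# Made-up example rows in the assumed format.
sample = "PassengerId,Transported\n0013_01,True\n0018_01,False\n"
print(check_submission(sample, expected_rows=2))  # prints []
```

For the real file, the text could be read from `submissions/submission_cat.csv` and `expected_rows` set to `len(test)`.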

### Result
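- The final **CatBoost** model achieved a Kaggle leaderboard score of **0.80406**. This is slightly below its cross-validation accuracy of about 0.8142, but within the range of validation accuracies seen for the boosting models in the baseline runs.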
