# Achieving High Accuracy in Predicting Passenger Transportation with Ensemble Models

This data is sourced from the Kaggle Competition: [Spaceship Titanic](https://www.kaggle.com/c/spaceship-titanic/overview)

# Context

"Welcome to the year 2912, where data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system."

**Help save them and change history!**

# Dataset Description

In this competition the task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

### File and Data Field Descriptions

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
        
**Variables**

*PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.*

*HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.*

*CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.*

*Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.*

*Destination - The planet the passenger will be debarking to.*

*Age - The age of the passenger.*

*VIP - Whether the passenger has paid for special VIP service during the voyage.*

*RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.*

*Name - The first and last names of the passenger.*

*Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.*
    
**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
    
**sample_submission.csv** - A submission file in the correct format.

*PassengerId - Id for each passenger in the test set.*

*Transported - The target. For each passenger, predict either True or False.*
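
The `PassengerId` and `Cabin` formats described above can be unpacked with plain string splitting. The following is a minimal sketch rather than part of the competition material: the helper names `parse_passenger_id` and `parse_cabin` are hypothetical, and the sample values are taken from the first rows of `train.csv`.

```python
from typing import NamedTuple


class PassengerKey(NamedTuple):
    group: str   # gggg: the group the passenger travels with
    number: int  # pp: the passenger's number within the group


class CabinLocation(NamedTuple):
    deck: str
    num: int
    side: str  # "P" for Port, "S" for Starboard


def parse_passenger_id(passenger_id: str) -> PassengerKey:
    """Split an Id of the form gggg_pp into group and in-group number."""
    group, number = passenger_id.split("_")
    return PassengerKey(group=group, number=int(number))


def parse_cabin(cabin: str) -> CabinLocation:
    """Split a cabin string of the form deck/num/side."""
    deck, num, side = cabin.split("/")
    return CabinLocation(deck=deck, num=int(num), side=side)


key = parse_passenger_id("0003_02")
cabin = parse_cabin("A/0/S")
print(key)
print(cabin)
```

Passengers who share the same `group` value travel together, which is what makes a group-size feature possible.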

# Development

Seven models were trained and compared on a held-out validation split (30% of the training data, stratified by the target). The metrics below are validation-set scores, and the models are listed from highest to lowest accuracy:

### CatBoost Classifier:
- **Accuracy**: 81.48%
- **Precision**: 81.87%
- **Recall**: 81.19%
- **F1 Score**: 81.53%

CatBoost was the best-performing model on the validation split, with the highest accuracy and F1 score of all tested models. It was trained with its default hyperparameters on the same preprocessed features as the other models.

### XGBoost Classifier:
- **Accuracy**: 81.25%
- **Precision**: 81.94%
- **Recall**: 80.50%
- **F1 Score**: 81.21%

XGBoost came a close second, with the highest precision of any model (81.94%) and an accuracy within 0.25 percentage points of CatBoost.

### Stacking Classifier:
- **Accuracy**: 81.02%
- **Precision**: 81.17%
- **Recall**: 81.11%
- **F1 Score**: 81.14%

The Stacking Classifier combined the predictions of CatBoost, XGBoost, LightGBM, the neural network, SVM, and Random Forest through a logistic regression meta-classifier. On the validation split it scored slightly below the two best single models, CatBoost and XGBoost, while keeping precision and recall closely balanced.
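
To illustrate how stacking works in general, here is a minimal sketch using scikit-learn's `StackingClassifier`. It rests on several assumptions: the data is synthetic (`make_classification`), only two simple base learners stand in for the six models listed above, and the variable names with the `_demo` suffix are hypothetical. It is not the configuration that produced the reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the Spaceship Titanic features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Two simple base learners stand in for the larger set of models.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
]

# The meta-classifier is trained on out-of-fold predictions of the base
# learners, so it never sees predictions made on a learner's own training rows.
stack_demo = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
stack_demo.fit(X_tr, y_tr)

val_pred = stack_demo.predict(X_va)
val_accuracy = accuracy_score(y_va, val_pred)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

A stacked model is not guaranteed to beat its best base learner; when the base models make similar errors, the meta-classifier has little extra signal to exploit.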

### Random Forest Classifier:
- **Accuracy**: 80.37%
- **Precision**: 81.61%
- **Recall**: 78.75%
- **F1 Score**: 80.15%

The Random Forest model was tuned with a grid search and delivered strong precision (81.61%), though its recall (78.75%) was the lowest of all models.
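
The cost of an exhaustive grid search grows with the product of the option counts. The sketch below enumerates the Random Forest search space used in this notebook and counts the model fits under 5-fold cross-validation; the variable name `rf_search_space` is illustrative.

```python
from itertools import product

# Random Forest search space, copied from this notebook's tuning code.
rf_search_space = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}
n_folds = 5  # matches the 5-fold stratified cross-validation used for tuning

# Every combination of values is one candidate configuration.
candidates = [
    dict(zip(rf_search_space, values))
    for values in product(*rf_search_space.values())
]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 72 360
```

These counts match the "72 candidates, totalling 360 fits" line printed by the grid search.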

### LightGBM Classifier:
- **Accuracy**: 79.91%
- **Precision**: 79.91%
- **Recall**: 80.27%
- **F1 Score**: 80.09%

LightGBM delivered balanced precision and recall, and was the only model whose recall exceeded its precision.

### Neural Network:
- **Accuracy**: 79.72%
- **Precision**: 80.39%
- **Recall**: 78.98%
- **F1 Score**: 79.68%

The neural network used a residual block with batch normalization and dropout for regularization. Its results were competitive with, but below, those of the tree-based models.
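
A residual connection adds a block's input back to its transformed output before the final activation. The following is a deliberately simplified sketch in plain Python: it uses a single dense layer and omits batch normalization and dropout, and the weights are made-up values chosen for easy arithmetic, so it is not the network used in this notebook.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(values, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, values)) + b
        for row, b in zip(weights, biases)
    ]


def residual_block(values, weights, biases):
    """Compute relu(dense(values) + values).

    The skip path requires the layer output to have the same width as the input.
    """
    transformed = dense(values, weights, biases)
    return relu([t + v for t, v in zip(transformed, values)])


x = [1.0, -2.0]
weights = [[0.5, 0.0], [0.0, 0.5]]  # hypothetical weights
biases = [0.0, 0.0]
print(residual_block(x, weights, biases))  # [1.5, 0.0]
```

Because the input is carried through unchanged on the skip path, the layer only has to learn a correction to it, which tends to make deeper networks easier to train.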

### Support Vector Machine (SVM):
- **Accuracy**: 79.29%
- **Precision**: 79.66%
- **Recall**: 79.06%
- **F1 Score**: 79.36%

SVM, tuned with an extensive grid search over C, gamma, and kernel, had the lowest accuracy of the models tested but consistent precision and recall.
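
For reference, the four metrics reported for each model can all be derived from the counts in a confusion matrix. The sketch below uses hypothetical counts (not taken from this notebook) and then checks that the F1 formula reproduces CatBoost's reported F1 score from its reported precision and recall; the function names are illustrative.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Consistency check against the CatBoost row reported in this notebook.
catboost_f1 = f1_from(precision=0.818740, recall=0.811881)
print(f"CatBoost F1 from precision and recall: {catboost_f1:.4f}")
```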

## Final Kaggle Competition Results

After fine-tuning and stacking the models, the final submission scored 0.80710 in the Kaggle competition, placing 123rd out of 1,603 participants, ranking in the top 7.7%.

The stacked ensemble, refit on the full training set, produced the final submission. CatBoost was its strongest individual component on the validation split.

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, Input, Activation, Add
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
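# Note: this wrapper module has been removed from recent TensorFlow releases,
# so an older TensorFlow version that still ships it is assumed here
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier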
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.feature_selection import RFE
import warnings
warnings.filterwarnings("ignore")

# Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Preview the first five rows of the train dataset
print(train.head())
print(train.shape)

# Preview the first five rows of the test dataset
print(test.head())
print(test.shape)

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  
(8

In [15]:
# Data Preparation - Handling Missing Values
def fill_missing_values(df):
    numeric_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']
    
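    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which is deprecated in recent pandas versions
    for feature in numeric_features:
        df[feature] = df[feature].fillna(df[feature].mean())
    for feature in categorical_features:
        df[feature] = df[feature].fillna(df[feature].mode()[0])
    
    # Placeholder cabin keeps the deck/num/side format so it can be split later
    df['Cabin'] = df['Cabin'].fillna('Z/9999/Z')
    df['Name'] = df['Name'].fillna('Unknown')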
    return df

train = fill_missing_values(train)
test = fill_missing_values(test)

# Feature Engineering - Creating New Features
def feature_engineering(df):
    # Split 'Cabin' into 'Deck', 'CabinNum', 'Side'
    df['Deck'] = df['Cabin'].str.split('/').str[0]
    df['CabinNum'] = df['Cabin'].str.split('/').str[1].astype(int)
    df['Side'] = df['Cabin'].str.split('/').str[2]
    df.drop('Cabin', axis=1, inplace=True)
    
    # Create 'Group' feature from 'PassengerId'
    df['Group'] = df['PassengerId'].str.split('_').str[0]
    
    # Extract 'Surname' from 'Name'
    df['Surname'] = df['Name'].str.split().str[-1]
    df.drop('Name', axis=1, inplace=True)
    
    # Total spend
    spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df['TotalSpend'] = df[spend_cols].sum(axis=1)
    
    # Family size
    df['FamilySize'] = df.groupby('Group')['Group'].transform('count')
    
    # Is alone
    df['IsAlone'] = np.where(df['FamilySize'] == 1, 1, 0)
    
    # Drop identifier columns ('PassengerId', 'Group', 'Surname') that are no longer needed
    df.drop(['PassengerId', 'Group', 'Surname'], axis=1, inplace=True)
    
    return df

train = feature_engineering(train)
test = feature_engineering(test)

# Convert categorical columns to dummy variables
categorical_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']
train = pd.get_dummies(train, columns=categorical_cols, drop_first=True)
test = pd.get_dummies(test, columns=categorical_cols, drop_first=True)

# Separate features and target variable
y = train['Transported'].astype(int)  # Ensure target is integer
X = train.drop('Transported', axis=1)

# Align the features of X and test data
X, test = X.align(test, join='left', axis=1, fill_value=0)

# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
test_scaled = scaler.transform(test)

# Feature Selection using Recursive Feature Elimination (RFE)
selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=20)
selector.fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
test_selected = selector.transform(test_scaled)
selected_features = pd.Series(X.columns[selector.support_])

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_selected, y, test_size=0.3, random_state=42, stratify=y)

# Define cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [16]:
# Build and Evaluate Models

# Initialize a DataFrame to store model performance
performance = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

# 1. Random Forest Classifier with Hyperparameter Tuning
rf_params = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=cv_strategy, n_jobs=-1, verbose=1)
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_
rf_pred = rf_best.predict(X_val)

# Evaluate Random Forest
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'Random Forest',
    'Accuracy': accuracy_score(y_val, rf_pred),
    'Precision': precision_score(y_val, rf_pred),
    'Recall': recall_score(y_val, rf_pred),
    'F1 Score': f1_score(y_val, rf_pred)
}])], ignore_index=True)

# 2. XGBoost Classifier with Hyperparameter Tuning
xgb_params = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 1.0]
}
xgb_grid = GridSearchCV(XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'), xgb_params, cv=cv_strategy, n_jobs=-1, verbose=1)
xgb_grid.fit(X_train, y_train)
xgb_best = xgb_grid.best_estimator_
xgb_pred = xgb_best.predict(X_val)

# Evaluate XGBoost
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'XGBoost',
    'Accuracy': accuracy_score(y_val, xgb_pred),
    'Precision': precision_score(y_val, xgb_pred),
    'Recall': recall_score(y_val, xgb_pred),
    'F1 Score': f1_score(y_val, xgb_pred)
}])], ignore_index=True)

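# 3. LightGBM Classifier with Hyperparameter Tuning
# Note: LightGBM only applies `subsample` when `subsample_freq` is greater than 0,
# so with the default `subsample_freq` the `subsample` values below have no effect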
lgb_params = {
    'n_estimators': [100, 300, 500],
    'num_leaves': [31, 63],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 1.0]
}
lgb_grid = GridSearchCV(LGBMClassifier(random_state=42), lgb_params, cv=cv_strategy, n_jobs=-1, verbose=1)
lgb_grid.fit(X_train, y_train)
lgb_best = lgb_grid.best_estimator_
lgb_pred = lgb_best.predict(X_val)

# Evaluate LightGBM
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'LightGBM',
    'Accuracy': accuracy_score(y_val, lgb_pred),
    'Precision': precision_score(y_val, lgb_pred),
    'Recall': recall_score(y_val, lgb_pred),
    'F1 Score': f1_score(y_val, lgb_pred)
}])], ignore_index=True)

# 4. CatBoost Classifier
cat_model = CatBoostClassifier(verbose=0, random_state=42)
cat_model.fit(X_train, y_train)
cat_pred = cat_model.predict(X_val)

# Evaluate CatBoost
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'CatBoost',
    'Accuracy': accuracy_score(y_val, cat_pred),
    'Precision': precision_score(y_val, cat_pred),
    'Recall': recall_score(y_val, cat_pred),
    'F1 Score': f1_score(y_val, cat_pred)
}])], ignore_index=True)

# 5. Deep Neural Network with Advanced Architecture using KerasClassifier

# Custom KerasClassifier to handle the absence of predict_classes
class MyKerasClassifier(KerasClassifier):
    _estimator_type = "classifier"  # Ensure scikit-learn recognizes it as a classifier

    def predict(self, x, **kwargs):
        """Override the default predict method to handle predict_classes absence."""
        proba = self.model.predict(x)
        if proba.shape[-1] > 1:
            return proba.argmax(axis=-1)
        else:
            # Flatten (n, 1) sigmoid outputs to a 1-D array of class labels
            return (proba > 0.5).astype("int32").ravel()

    def predict_proba(self, x, **kwargs):
        """Return class probabilities."""
        proba = self.model.predict(x)
        if proba.shape[-1] == 1:
            # Binary classification
            return np.hstack([1 - proba, proba])
        else:
            # Multi-class classification
            return proba

def build_advanced_nn():
    inputs = Input(shape=(X_train.shape[1],))
    
    # Encoder
    x = Dense(512, activation='relu')(inputs)
    x = BatchNormalization()(x)
    x = Dropout(0.3)(x)
    
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.3)(x)
    
    # Residual Block
    shortcut = x
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.3)(x)
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Add()([x, shortcut])
    x = Activation('relu')(x)
    
    # Decoder
    x = Dense(128, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.3)(x)
    
    outputs = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Define early stopping and learning rate reduction
early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, verbose=1, min_lr=1e-6)

# Wrap the model using MyKerasClassifier
nn_model_wrapped = MyKerasClassifier(
    build_fn=build_advanced_nn,
    epochs=100,
    batch_size=64,
    verbose=0
)

# Fit the wrapped neural network
nn_model_wrapped.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping, reduce_lr]
)

# Predictions
nn_pred = nn_model_wrapped.predict(X_val)

# Evaluate Neural Network
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'Neural Network',
    'Accuracy': accuracy_score(y_val, nn_pred),
    'Precision': precision_score(y_val, nn_pred),
    'Recall': recall_score(y_val, nn_pred),
    'F1 Score': f1_score(y_val, nn_pred)
}])], ignore_index=True)

# 6. Support Vector Machine with Extensive Hyperparameter Tuning
svm_params = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': ['scale', 'auto', 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'poly', 'sigmoid']
}
svm_grid = GridSearchCV(SVC(probability=True, random_state=42), svm_params, cv=cv_strategy, n_jobs=-1, verbose=1)
svm_grid.fit(X_train, y_train)
svm_best = svm_grid.best_estimator_
svm_pred = svm_best.predict(X_val)

# Evaluate SVM
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'SVM',
    'Accuracy': accuracy_score(y_val, svm_pred),
    'Precision': precision_score(y_val, svm_pred),
    'Recall': recall_score(y_val, svm_pred),
    'F1 Score': f1_score(y_val, svm_pred)
}])], ignore_index=True)

# Display performance
print(performance.sort_values(by='Accuracy', ascending=False))

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Fitting 5 folds for each of 54 candidates, totalling 270 fits
Fitting 5 folds for each of 36 candidates, totalling 180 fits

Epoch 00016: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.

Epoch 00029: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.

Epoch 00034: ReduceLROnPlateau reducing learning rate to 8.000000525498762e-06.

Epoch 00041: ReduceLROnPlateau reducing learning rate to 1.6000001778593287e-06.

Epoch 00046: ReduceLROnPlateau reducing learning rate to 1e-06.
Epoch 00046: early stopping
Fitting 5 folds for each of 75 candidates, totalling 375 fits
            Model  Accuracy  Precision    Recall  F1 Score
3        CatBoost  0.814801   0.818740  0.811881  0.815296
1         XGBoost  0.812500   0.819380  0.805027  0.812140
0   Random Forest  0.803681   0.816101  0.787510  0.801550
2        LightGBM  0.799080   0.799090  0.802742  0.800912
4  Neural Network  0.797163   0.803876  0.789794  0.796773
5             SVM  0.792945   0.796623  0.790556  0.793578

In [17]:
# Ensemble Techniques - Stacking Classifier
estimators = [
    ('rf', rf_best),
    ('xgb', xgb_best),
    ('lgb', lgb_best),
    ('cat', cat_model),
    ('svm', svm_best),
    ('nn', nn_model_wrapped)
]

# Meta-classifier
meta_classifier = LogisticRegression()

stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=meta_classifier,
    cv=cv_strategy,
    n_jobs=1,  # Set n_jobs=1 to avoid pickling issues
    passthrough=False
)

# Fit the stacking classifier
stacking_clf.fit(X_train, y_train)

# Predictions
stacking_pred = stacking_clf.predict(X_val)

# Evaluate Stacking Classifier
performance = pd.concat([performance, pd.DataFrame([{
    'Model': 'Stacking Classifier',
    'Accuracy': accuracy_score(y_val, stacking_pred),
    'Precision': precision_score(y_val, stacking_pred),
    'Recall': recall_score(y_val, stacking_pred),
    'F1 Score': f1_score(y_val, stacking_pred)
}])], ignore_index=True)

# Display performance
print(performance.sort_values(by='Accuracy', ascending=False))

# Final Model Selection
final_model = stacking_clf
final_model.fit(X_selected, y)

# Make predictions on test data
final_predictions = final_model.predict(test_selected)

# Prepare submission file
submission = pd.read_csv('test.csv')[['PassengerId']]
submission['Transported'] = final_predictions.astype(bool)
submission.to_csv('submission.csv', index=False)

                 Model  Accuracy  Precision    Recall  F1 Score
3             CatBoost  0.814801   0.818740  0.811881  0.815296
1              XGBoost  0.812500   0.819380  0.805027  0.812140
6  Stacking Classifier  0.810199   0.811738  0.811120  0.811429
0        Random Forest  0.803681   0.816101  0.787510  0.801550
2             LightGBM  0.799080   0.799090  0.802742  0.800912
4       Neural Network  0.797163   0.803876  0.789794  0.796773
5                  SVM  0.792945   0.796623  0.790556  0.793578
