# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
# Define features
num_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
cat_features = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

# Preprocessor pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features)
], remainder='drop')

# Convert Transported to int (0/1)
spaceship['Transported'] = spaceship['Transported'].astype(int)

X = spaceship.drop(columns=['Transported', 'Name'])  # drop target and Name; PassengerId stays in X but is discarded by remainder='drop'
y = spaceship['Transported']
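
The `Cabin` values above have the form `deck/num/side` (for example `B/0/P`), so one-hot encoding the raw string creates one column per distinct cabin string. A minimal sketch of one optional alternative, which is not part of the original notebook, splits the string into lower-cardinality parts first. The column names `Deck`, `CabinNum`, and `Side` are hypothetical, and the toy frame is only a stand-in for `spaceship`:

```python
import pandas as pd

# Toy rows based on the head() output above; one missing value added for illustration.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", None, "F/1/S"]})

# Split "deck/num/side" into three columns; missing cabins stay missing.
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]  # hypothetical names

toy = pd.concat([toy, parts], axis=1)
toy["CabinNum"] = pd.to_numeric(toy["CabinNum"])  # cabin number as a numeric column
print(toy)
```

If this variation were used, `Deck` and `Side` could take the place of `Cabin` in `cat_features`. That would change the preprocessing, so the scores would no longer be directly comparable with the previous Lab.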

**Perform Train Test Split**

In [4]:
# Train/test split (80/20, stratified on target for balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (6954, 12) (6954,)
Test shape: (1739, 12) (1739,)
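
To illustrate what `stratify=y` does in the split above, here is a small sketch on synthetic labels (an assumption for illustration, not the Spaceship Titanic target). It compares the positive-class rate in each split with and without stratification:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 30% positives, 70% negatives.
y_demo = np.array([1] * 300 + [0] * 700)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
# Plain split: proportions can drift by chance.
_, _, y_tr_p, y_te_p = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("stratified train/test positive rate:", y_tr_s.mean(), y_te_s.mean())
print("plain      train/test positive rate:", y_tr_p.mean(), y_te_p.mean())
```

On a nearly balanced target such as `Transported`, the effect is small, but stratifying keeps the train and test class ratios aligned.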


**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [5]:
# Base estimator
base_clf = DecisionTreeClassifier(random_state=42)

# --- Bagging ---
bagging_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', BaggingClassifier(
        estimator=base_clf,
        n_estimators=200,
        max_samples=0.8,
        bootstrap=True,      # with replacement = Bagging
        n_jobs=-1,
        random_state=42
    ))
])

bagging_clf.fit(X_train, y_train)
y_pred_bag = bagging_clf.predict(X_test)
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))

# --- Pasting ---
pasting_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', BaggingClassifier(
        estimator=base_clf,
        n_estimators=200,
        max_samples=0.8,
        bootstrap=False,     # without replacement = Pasting
        n_jobs=-1,
        random_state=42
    ))
])

pasting_clf.fit(X_train, y_train)
y_pred_pas = pasting_clf.predict(X_test)
print("Pasting Accuracy:", accuracy_score(y_test, y_pred_pas))

Bagging Accuracy: 0.7987349051178838
Pasting Accuracy: 0.7912593444508338
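
The only difference between the two models above is the `bootstrap` flag, which decides whether each base tree's training sample is drawn with or without replacement. The sketch below mimics that idea with the standard library. It is an illustration of the sampling scheme only, not scikit-learn's internal code, and `n_train` is a hypothetical size:

```python
import random

random.seed(42)
n_train = 1000                 # hypothetical training-set size
n_draw = int(0.8 * n_train)    # mirrors max_samples=0.8
indices = range(n_train)

# Bagging: draw with replacement, so some rows repeat and others are left out.
bag_sample = random.choices(indices, k=n_draw)
# Pasting: draw without replacement, so every drawn row is distinct.
paste_sample = random.sample(indices, k=n_draw)

print("bagging unique rows:", len(set(bag_sample)), "of", n_draw, "draws")
print("pasting unique rows:", len(set(paste_sample)), "of", n_draw, "draws")
```

With replacement, a draw of 0.8·n rows is expected to contain about 1 − e^(−0.8) ≈ 55% of the distinct training rows, versus exactly 80% without replacement. Each bagged tree therefore sees a more varied but smaller set of distinct rows.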


- Random Forests

In [6]:
# Random Forest pipeline
rf_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(
        n_estimators=500,
        max_depth=None,
        min_samples_leaf=1,
        max_features='sqrt',
        n_jobs=-1,
        random_state=42
    ))
])

# Train
rf_clf.fit(X_train, y_train)

# Evaluate
y_pred_rf = rf_clf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf, digits=3))

Random Forest Accuracy: 0.7941345600920069
[[702 161]
 [197 679]]
              precision    recall  f1-score   support

           0      0.781     0.813     0.797       863
           1      0.808     0.775     0.791       876

    accuracy                          0.794      1739
   macro avg      0.795     0.794     0.794      1739
weighted avg      0.795     0.794     0.794      1739



- Gradient Boosting

In [7]:
gbr_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(
        learning_rate=0.1,
        n_estimators=200,
        max_depth=3,
        subsample=1.0,
        random_state=42
    ))
])

gbr_clf.fit(X_train, y_train)
y_pred_gbr = gbr_clf.predict(X_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gbr))
print(confusion_matrix(y_test, y_pred_gbr))
print(classification_report(y_test, y_pred_gbr, digits=3))

Gradient Boosting Accuracy: 0.7981598619896493
[[661 202]
 [149 727]]
              precision    recall  f1-score   support

           0      0.816     0.766     0.790       863
           1      0.783     0.830     0.806       876

    accuracy                          0.798      1739
   macro avg      0.799     0.798     0.798      1739
weighted avg      0.799     0.798     0.798      1739



- Adaptive Boosting

In [8]:
ada_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', AdaBoostClassifier(
        n_estimators=200,
        learning_rate=0.5,
        random_state=42
    ))
])

ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ada))
print(confusion_matrix(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada, digits=3))

AdaBoost Accuracy: 0.7607820586543991
[[717 146]
 [270 606]]
              precision    recall  f1-score   support

           0      0.726     0.831     0.775       863
           1      0.806     0.692     0.744       876

    accuracy                          0.761      1739
   macro avg      0.766     0.761     0.760      1739
weighted avg      0.766     0.761     0.760      1739



Which model is the best and why?

Among the ensemble methods tested, Bagging (~79.9%) and Gradient Boosting (~79.8%) are effectively tied on test accuracy: the gap is a single test sample out of 1,739. Random Forest (~79.4%) and Pasting (~79.1%) follow closely, and AdaBoost clearly underperforms at ~76.1%, with a recall of only 0.692 for the transported class. Of the three models with a classification report, Gradient Boosting has the highest recall (0.830) and F1-score (0.806) for the transported class, although its recall for the non-transported class (0.766) is lower than Random Forest's (0.813). Gradient Boosting is therefore a reasonable choice for this dataset, particularly if correctly identifying transported passengers matters most. However, the differences among the top four models are under one percentage point on a single train/test split, which is too small to establish that any one of them generalizes better.
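
As a cross-check on the comparison, the reported metrics can be recomputed directly from the confusion matrices printed above. The sketch below does this and adds a rough binomial standard error for each accuracy. That standard error is an approximation introduced here, not something the notebook computed:

```python
from math import sqrt

# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]].
matrices = {
    "Random Forest": [[702, 161], [197, 679]],
    "Gradient Boosting": [[661, 202], [149, 727]],
    "AdaBoost": [[717, 146], [270, 606]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    n = tn + fp + fn + tp
    acc = (tn + tp) / n
    summary[name] = {
        "accuracy": acc,
        "recall_1": tp / (tp + fn),      # recall for the transported class
        "precision_1": tp / (tp + fp),   # precision for the transported class
        "std_err": sqrt(acc * (1 - acc) / n),  # rough binomial standard error
    }

for name, m in summary.items():
    print(f"{name:18s} acc={m['accuracy']:.3f} +/- {m['std_err']:.3f} "
          f"recall={m['recall_1']:.3f} precision={m['precision_1']:.3f}")
```

The standard error comes out at roughly 0.01 for each model, which is larger than the accuracy gaps between Random Forest and Gradient Boosting. Repeating the comparison with cross-validation would give a firmer ranking.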