# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [11]:
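# One-hot encode every categorical (object) column; boolean columns such as
# Transported are left unchanged.
# Note: this also encodes the high-cardinality columns PassengerId, Cabin and
# Name, which produces a very wide feature matrix.
spaceship_encoded = pd.get_dummies(spaceship, drop_first=True)
features = spaceship_encoded.drop('Transported', axis=1)
target = spaceship_encoded['Transported']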
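
To make the effect of encoding identifier-like columns concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data and the choice of `id_like` columns are assumptions for illustration, not the preprocessing used in the previous Lab). It compares the number of columns produced with and without those columns.

```python
import pandas as pd

# Hypothetical four-row frame that mimics a few Spaceship Titanic columns
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S"],
    "Name": ["A One", "B Two", "C Three", "D Four"],
    "Age": [39.0, 24.0, 58.0, 33.0],
    "Transported": [False, True, False, False],
})

# Encoding everything: each distinct ID, cabin and name becomes its own column
encoded_all = pd.get_dummies(toy, drop_first=True)

# Assumption: identifier-like columns are dropped before encoding
id_like = ["PassengerId", "Cabin", "Name"]
encoded_small = pd.get_dummies(toy.drop(columns=id_like), drop_first=True)

print("all columns encoded:", encoded_all.shape)
print("identifier columns dropped:", encoded_small.shape)
```

On the real data the gap is much larger, because nearly every passenger has a distinct ID and name. A wide matrix of this kind is one plausible reason for long training times with tree ensembles.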

In [12]:
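from sklearn.preprocessing import StandardScaler

# Use the full feature matrix and the real target. (Slicing with iloc here
# would drop the last feature column and treat it as the target.)
X = features
y = target

# Standardize to zero mean and unit variance. Note that the split below is
# made on the unscaled `features`, so X_scaled is not used by the models.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)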

**Perform Train Test Split**

In [14]:
from sklearn.model_selection import train_test_split

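# Hold out 20% of the rows for testing. No random_state is set, so the split
# (and every score below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=0.2
)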
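
When scaling does matter for a model, a common pattern is to split first and fit the scaler on the training rows only, so that test-set statistics do not leak into preprocessing. The sketch below shows that order on synthetic data from `make_classification`; the `*_demo` names and the fixed seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first; stratify keeps the class balance similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training means come out at zero by construction, while the test means are only close to zero. Tree-based ensembles like the ones below are largely insensitive to feature scaling, so this step mainly matters when comparing against scale-sensitive models.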

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [17]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

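# Base learner shared by both ensembles
base_model = DecisionTreeClassifier()

# Bagging: each tree is trained on a bootstrap sample (rows drawn with replacement)
bagging = BaggingClassifier(estimator=base_model, n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)

# Pasting: rows are drawn without replacement. With the default max_samples=1.0,
# every tree is therefore trained on the full training set.
pasting = BaggingClassifier(estimator=base_model, n_estimators=100, bootstrap=False, random_state=42)
pasting.fit(X_train, y_train)
y_pred_paste = pasting.predict(X_test)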

print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))
print("Pasting Accuracy:", accuracy_score(y_test, y_pred_paste))

Bagging Accuracy: 0.7717078780908568
Pasting Accuracy: 0.7510063254744106


- Random Forests

In [16]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.7607820586543991


- Gradient Boosting

In [21]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train_imputed, y_train)
y_pred_gb = gb.predict(X_test_imputed)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.7676825761932144


- Adaptive Boosting

In [19]:
from sklearn.impute import SimpleImputer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada.fit(X_train_imputed, y_train)
y_pred_ada = ada.predict(X_test_imputed)

print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ada))



AdaBoost Accuracy: 0.7607820586543991
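
The two boosting cells repeat the same imputation step. One way to avoid the duplication is to wrap the imputer and the classifier in a scikit-learn pipeline, so the imputer is always fitted on the training rows only. This is a minimal sketch on synthetic data with randomly injected missing values; the data, the 5% missing rate and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with about 5% of the values replaced by NaN
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # The pipeline imputes with training-set means, then fits the classifier
    pipe = make_pipeline(SimpleImputer(strategy="mean"), model)
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)
    print(f"{name}: {scores[name]:.4f}")
```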


Which model is the best and why?

In [22]:
from sklearn.metrics import accuracy_score, f1_score

# Test-set predictions of every model, keyed by display name
predictions = {
    'Bagging': y_pred_bag,
    'Pasting': y_pred_paste,
    'Random Forest': y_pred_rf,
    'Gradient Boosting': y_pred_gb,
    'AdaBoost': y_pred_ada
}

# Report accuracy and class-weighted F1 for each model
for model, preds in predictions.items():
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average='weighted')
    print(f"{model} → Accuracy: {acc:.4f}, F1-score: {f1:.4f}")

Bagging → Accuracy: 0.7717, F1-score: 0.7717
Pasting → Accuracy: 0.7510, F1-score: 0.7509
Random Forest → Accuracy: 0.7608, F1-score: 0.7607
Gradient Boosting → Accuracy: 0.7677, F1-score: 0.7671
AdaBoost → Accuracy: 0.7608, F1-score: 0.7607


In [None]:
# On accuracy alone, Bagging scored highest (0.7717), ahead of Gradient Boosting (0.7677).
# The spread across all five models is only about two percentage points on a single test split, so the ranking is not conclusive.
# Bagging and Pasting were run in the same cell and took approximately 28 minutes, far longer than any of the other models.
# Weighing accuracy against training cost, Gradient Boosting offers a strong balance and could reasonably be considered the best overall choice.
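
Because the scores above come from a single split, a cross-validated comparison with timing can make the accuracy-versus-cost trade-off easier to judge. The sketch below uses `cross_validate` on synthetic data; the candidate settings, the 5 folds and the `*_demo` names are assumptions for illustration, and the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in candidates.items():
    # cross_validate returns per-fold test scores and fit times
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring="accuracy")
    summary[name] = {
        "mean_accuracy": cv["test_score"].mean(),
        "std_accuracy": cv["test_score"].std(),
        "total_fit_seconds": cv["fit_time"].sum(),
    }
    print(
        f"{name}: accuracy {summary[name]['mean_accuracy']:.4f} "
        f"+/- {summary[name]['std_accuracy']:.4f}, "
        f"fit time {summary[name]['total_fit_seconds']:.2f}s"
    )
```

If the standard deviation across folds is as large as the gap between two models, the difference in their mean accuracy is probably not meaningful.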