# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv"
df = pd.read_csv(url)

# Drop PassengerId and Name columns
df.drop(columns=['PassengerId', 'Name'], inplace=True)

# Keep only the first letter of Cabin (the deck category).
# astype(str) turns missing cabins into the string 'nan', so they end up as their own 'n' category.
df['Cabin'] = df['Cabin'].astype(str).str[0]

# Convert categorical variables to dummies
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Define features (X) and target (y)
X = df.drop(columns=['Transported'])
y = df['Transported']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

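# Fill missing numerical values with the training-set median
# (reusing the training statistics on the test set avoids leaking test information)
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)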

# Feature Scaling (Standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling and engineering completed successfully!")

Feature scaling and engineering completed successfully!
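The imputation and scaling steps above can also be expressed as a scikit-learn `Pipeline`, which learns the medians and scaling statistics from the training split and reuses them on the test split. The following is a minimal sketch on a tiny made-up frame; the values in `toy_train` and `toy_test` are illustrative assumptions, not rows from the lab dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with missing values (illustrative only, not the lab dataset)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Spa": [0.0, 100.0, 50.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 25.0],
                         "Spa": [10.0, np.nan]})

# Median imputation followed by standardization, fitted as one unit
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# fit_transform learns medians, means and standard deviations from the training data
toy_train_scaled = prep.fit_transform(toy_train)
# transform reuses those training statistics on the test data
toy_test_scaled = prep.transform(toy_test)

learned_medians = prep.named_steps["impute"].statistics_
print("Medians learned from the training split:", learned_medians)
print(toy_test_scaled)
```

Because the pipeline is fitted only on the training frame, missing test values are filled with training medians, never with statistics computed from the test set.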


**Perform Train Test Split**

In [10]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop(columns=['Transported'])  # Features (independent variables)
y = df['Transported']  # Target variable

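# Split the data into 80% training and 20% testing.
# This repeats the split from the preprocessing cell; the same random_state and stratify=y
# reproduce the same partition, so y_train and y_test still line up with X_train_scaled and X_test_scaled.
# Note: X_train and X_test are reassigned here without imputation or scaling; the models below use the scaled arrays.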
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print shapes to verify split
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

print("Train-Test Split completed successfully!")
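The split above relies on two properties of `train_test_split`: a fixed `random_state` makes the partition reproducible, and `stratify=y` keeps the class ratio the same in both parts. The sketch below checks both on synthetic labels; the 70/30 class balance and the array names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): 70% False, 30% True
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([False] * 700 + [True] * 300)

# Run the same split twice with identical arguments
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Same random_state -> identical train/test arrays
same_partition = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Stratification -> share of True labels is preserved in both parts
train_ratio = split_a[2].mean()
test_ratio = split_a[3].mean()

print(f"Identical partitions: {same_partition}")
print(f"Share of True in train: {train_ratio:.3f}, in test: {test_ratio:.3f}")
```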

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize base estimator (Decision Tree)
base_estimator = DecisionTreeClassifier(random_state=42)

# Bagging Classifier (the `estimator` parameter replaces `base_estimator` from scikit-learn 1.2 onward)
bagging_clf = BaggingClassifier(estimator=base_estimator, n_estimators=50, random_state=42)
bagging_clf.fit(X_train_scaled, y_train)
y_pred_bagging = bagging_clf.predict(X_test_scaled)

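# Pasting Classifier (bagging with bootstrap=False)
# Caveat: with the default max_samples=1.0 and no replacement, every tree is trained on the full training set,
# so the ensemble behaves much like a single decision tree; set max_samples below 1.0 to draw distinct subsamples.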
pasting_clf = BaggingClassifier(estimator=base_estimator, n_estimators=50, bootstrap=False, random_state=42)
pasting_clf.fit(X_train_scaled, y_train)
y_pred_pasting = pasting_clf.predict(X_test_scaled)

# Evaluate models
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
accuracy_pasting = accuracy_score(y_test, y_pred_pasting)
class_report_bagging = classification_report(y_test, y_pred_bagging)
class_report_pasting = classification_report(y_test, y_pred_pasting)

# Print results
print(f"Bagging Accuracy: {accuracy_bagging:.4f}")
print(f"Pasting Accuracy: {accuracy_pasting:.4f}")
print("\nBagging Classification Report:\n", class_report_bagging)
print("\nPasting Classification Report:\n", class_report_pasting)


Bagging Accuracy: 0.7941
Pasting Accuracy: 0.7412

Bagging Classification Report:
               precision    recall  f1-score   support

       False       0.79      0.81      0.80       863
        True       0.80      0.78      0.79       876

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739


Pasting Classification Report:
               precision    recall  f1-score   support

       False       0.75      0.71      0.73       863
        True       0.73      0.77      0.75       876

    accuracy                           0.74      1739
   macro avg       0.74      0.74      0.74      1739
weighted avg       0.74      0.74      0.74      1739
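Bagging and pasting differ only in how each tree's training rows are drawn: with replacement for bagging, without replacement for pasting. The sketch below makes this visible on synthetic data by inspecting `estimators_samples_`. The dataset, the `max_samples=0.5` setting and `n_estimators=25` are assumptions for illustration, not the configuration used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only, not the lab dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Bagging: each tree sees a bootstrap sample (rows drawn with replacement);
# oob_score=True evaluates each tree on the rows it did not see
bagging = BaggingClassifier(tree, n_estimators=25, bootstrap=True,
                            oob_score=True, random_state=0)
# Pasting: each tree sees a 50% subsample drawn without replacement
pasting = BaggingClassifier(tree, n_estimators=25, bootstrap=False,
                            max_samples=0.5, random_state=0)

bagging.fit(X_tr, y_tr)
pasting.fit(X_tr, y_tr)

# estimators_samples_ holds the row indices each tree was trained on
bagging_rows = bagging.estimators_samples_[0]
pasting_rows = pasting.estimators_samples_[0]
print(f"Bagging tree 0: {len(bagging_rows)} rows, {len(set(bagging_rows.tolist()))} unique")
print(f"Pasting tree 0: {len(pasting_rows)} rows, {len(set(pasting_rows.tolist()))} unique")

print(f"Bagging OOB accuracy:  {bagging.oob_score_:.3f}")
print(f"Bagging test accuracy: {bagging.score(X_te, y_te):.3f}")
print(f"Pasting test accuracy: {pasting.score(X_te, y_te):.3f}")
```

The out-of-bag score is only available for bagging, since pasting with a full-size sample leaves no unseen rows per tree.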



- Random Forests

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize Random Forest model
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
random_forest_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = random_forest_clf.predict(X_test_scaled)

# Evaluate model performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
class_report_rf = classification_report(y_test, y_pred_rf)

# Print results
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print("\nRandom Forest Classification Report:\n", class_report_rf)


Random Forest Accuracy: 0.7947

Random Forest Classification Report:
               precision    recall  f1-score   support

       False       0.78      0.82      0.80       863
        True       0.81      0.77      0.79       876

    accuracy                           0.79      1739
   macro avg       0.80      0.79      0.79      1739
weighted avg       0.80      0.79      0.79      1739
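A fitted random forest also exposes `feature_importances_`, which can help interpret a result like the one above. The sketch below shows the idea on synthetic data where only the first three features carry signal; the dataset and the `feat_*` names are hypothetical and stand in for the lab's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only): with shuffle=False the first 3 columns are the informative ones
X_syn, y_syn = make_classification(n_samples=500, n_features=6, n_informative=3,
                                   n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"feat_{i}" for i in range(X_syn.shape[1])]  # hypothetical names

# oob_score=True gives a validation-style estimate from rows left out of each bootstrap sample
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_syn, y_syn)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```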



- Gradient Boosting

In [7]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
gb_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_gb = gb_clf.predict(X_test_scaled)

# Evaluate model performance
accuracy_gb = accuracy_score(y_test, y_pred_gb)
class_report_gb = classification_report(y_test, y_pred_gb)

# Print results
print(f"Gradient Boosting Accuracy: {accuracy_gb:.4f}")
print("\nGradient Boosting Classification Report:\n", class_report_gb)


Gradient Boosting Accuracy: 0.8033

Gradient Boosting Classification Report:
               precision    recall  f1-score   support

       False       0.82      0.77      0.80       863
        True       0.79      0.84      0.81       876

    accuracy                           0.80      1739
   macro avg       0.80      0.80      0.80      1739
weighted avg       0.80      0.80      0.80      1739
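Because gradient boosting adds trees one at a time, accuracy can be tracked after each added tree with `staged_predict`, which is one way to judge whether `n_estimators=100` is more than needed. The sketch below does this on synthetic data; the dataset and the validation split are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only, not the lab dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_val, y_stage) for y_stage in gb.staged_predict(X_val)]

# Number of trees at which validation accuracy peaks (first occurrence)
best_n_trees = int(np.argmax(staged_acc)) + 1
print(f"Best validation accuracy {max(staged_acc):.3f} reached with {best_n_trees} trees")
print(f"Accuracy with all 100 trees: {staged_acc[-1]:.3f}")
```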



- Adaptive Boosting

In [8]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize AdaBoost model with default base estimator (Decision Stump)
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
ada_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test_scaled)

# Evaluate model performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
class_report_ada = classification_report(y_test, y_pred_ada)

# Print results
print(f"Adaptive Boosting Accuracy: {accuracy_ada:.4f}")
print("\nAdaptive Boosting Classification Report:\n", class_report_ada)



Adaptive Boosting Accuracy: 0.7752

Adaptive Boosting Classification Report:
               precision    recall  f1-score   support

       False       0.76      0.81      0.78       863
        True       0.80      0.74      0.77       876

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.77      1739
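AdaBoost's default weak learner is a depth-1 decision tree, and its `learning_rate` shrinks the contribution of each learner, so a low rate such as 0.1 may need more estimators to reach the same fit. The sketch below compares two learning rates on synthetic data; the dataset and the second rate (1.0, the scikit-learn default) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only, not the lab dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Fit the same ensemble with two learning rates and record test accuracy
scores = {}
for lr in (0.1, 1.0):
    ada = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=0)
    ada.fit(X_tr, y_tr)
    scores[lr] = ada.score(X_te, y_te)

# The fitted weak learners are decision stumps (trees of depth 1)
stump_depth = ada.estimators_[0].get_depth()

for lr, acc in scores.items():
    print(f"learning_rate={lr}: test accuracy {acc:.3f}")
print(f"Depth of the first weak learner: {stump_depth}")
```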



Which model is the best and why?

In [9]:
# Based on the accuracy scores and classification reports, Gradient Boosting performed best among the models tested.
#
# Why is Gradient Boosting the best here?
#
# 1. Highest accuracy (80.33%)
#    It edges out Random Forest (79.47%) and Bagging (79.41%), and is clearly ahead of AdaBoost (77.52%) and Pasting (74.12%).
#    The gap to Random Forest and Bagging is under one percentage point on a single 1,739-row test split, so the lead is modest.
# 2. Balanced precision and recall
#    Precision and recall are close for both classes (F1 of 0.80 and 0.81), so the model does not favor one class at the expense of the other.
#    False (not transported) recall: 77%
#    True (transported) recall: 84% (the highest of all models)
# 3. Sequential error correction
#    Bagging and Random Forest train their trees independently and combine them; Gradient Boosting adds trees one at a time,
#    each fitted to the errors of the ensemble built so far.
# 4. Improvement over Adaptive Boosting
#    AdaBoost reached 77.52% accuracy, and its recall for the True class was only 74%, meaning it missed more transported passengers.
#    Gradient Boosting raised this to 84%, giving a better F1-score for that class (0.81 versus 0.77).
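Since the accuracy gaps between the top models are small, a single train/test split may not be enough to rank them reliably. One way to check is k-fold cross-validation, which reports a mean and spread per model. The sketch below runs it on synthetic data; the dataset, the reduced `n_estimators` values and the 5-fold setup are assumptions chosen to keep the example fast, not the lab's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only, not the lab dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=25, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# The same stratified folds are used for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_summary = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="accuracy")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:<18} mean accuracy {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

If the mean accuracies of two models differ by less than their fold-to-fold standard deviation, the ranking between them should be treated as tentative.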