# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [13]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Transported is a boolean target, so the classifier variants are needed
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
)
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
features = spaceship.drop("Transported", axis=1)

features = features.select_dtypes(include=["number"])

target = spaceship["Transported"]

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [6]:
normalizer = MinMaxScaler()

In [9]:
normalizer.fit(X_train)

**Perform Train Test Split**

In [10]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [20]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Base estimator
base_model = DecisionTreeClassifier(random_state=42)

# Bagging
model = BaggingClassifier(
    estimator=base_model,
    n_estimators=100,
    bootstrap=True,
    random_state=42
)

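# Train on the unscaled features: tree splits are unaffected by monotonic
# rescaling, so MinMax scaling is not expected to matter for these models
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)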


In [21]:
# Standalone tree fit; BaggingClassifier clones base_model, so `model` is unaffected
base_model.fit(X_train_norm, y_train)

In [22]:
# Evaluate model's performance

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7786083956296722
              precision    recall  f1-score   support

       False       0.81      0.72      0.76       863
        True       0.75      0.83      0.79       876

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739

[[623 240]
 [145 731]]


- The model correctly predicts ≈77.9% of the examples on the test set.

- This is an overall performance measure: (number of correct predictions) / (total number of examples).
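As a quick check of that definition, the accuracy can be recomputed by hand from the confusion matrix printed for the Bagging model. This is only a minimal sketch; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above
# (rows: actual False/True, columns: predicted False/True)
cm = np.array([[623, 240],
               [145, 731]])

correct = np.trace(cm)   # diagonal entries are the correct predictions
total = cm.sum()         # every test example appears exactly once
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.4f}")  # 1354 / 1739 = 0.7786
```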



Precision: Of all the predictions for this class, how many are correct.

E.g., for True, 75% of True predictions are correct.

Recall: Of all the true instances of this class, how many were correctly detected.

E.g., for True, 83% of the true instances were correctly predicted.

F1-score: Harmonic mean of precision and recall.

E.g., for True, F1 ≈ 0.79.

Support: Number of real examples of each class in the test set.
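The definitions above can be verified for the True class directly from the Bagging confusion matrix. The sketch below assumes scikit-learn's usual layout (rows are actual labels, columns are predicted labels, ordered False then True); the variable names are illustrative.

```python
import numpy as np

cm = np.array([[623, 240],
               [145, 731]])

# For a binary matrix in this layout, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # correct True predictions / all True predictions
recall = tp / (tp + fn)      # correct True predictions / all actual True examples
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn            # number of actual True examples

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
# precision=0.75 recall=0.83 f1=0.79 support=876
```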

✅ We can see that the model is fairly balanced between the two classes (False and True).

**Overall Interpretation**

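The model is fairly balanced: precision is higher for False (0.81 vs 0.75), while recall is higher for True (0.83 vs 0.72). In other words, it misses fewer transported passengers but raises more false alarms (240 false positives vs 145 false negatives).

An accuracy of 77.9% is slightly above the ~77% reported for KNN on the numeric columns, which suggests that Bagging helps here, although the gap is small.

The confusion matrix shows that it still misclassifies 385 of the 1,739 test examples (240 + 145), but it detects both classes reasonably well.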
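The section heading mentions pasting, but the Bagging cell uses `bootstrap=True`, which is bagging (sampling with replacement). A minimal sketch of pasting follows: the only change is `bootstrap=False` together with `max_samples` below 1.0, so each tree sees a different subset drawn without replacement. The data here is a synthetic stand-in from `make_classification` (an assumption, not the Spaceship Titanic data), and the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Pasting: each tree is trained on 50% of the rows, drawn WITHOUT replacement
pasting = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    max_samples=0.5,
    bootstrap=False,
    random_state=42,
)
pasting.fit(X_tr, y_tr)

pasting_acc = pasting.score(X_te, y_te)
print("Pasting accuracy (synthetic data):", pasting_acc)
```

On the lab data, the same estimator could be fitted on `X_train` and scored on `X_test` to compare it with the bagging result.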

- Random Forests

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix



rf_model = RandomForestClassifier(
    n_estimators=100,    # number of trees
    random_state=42
)

# train
rf_model.fit(X_train, y_train)

# test prediction
y_pred_rf = rf_model.predict(X_test)

# evaluate
accuracy = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


Random Forest Accuracy: 0.7837837837837838

Classification Report:
               precision    recall  f1-score   support

       False       0.82      0.72      0.77       863
        True       0.76      0.84      0.80       876

    accuracy                           0.78      1739
   macro avg       0.79      0.78      0.78      1739
weighted avg       0.79      0.78      0.78      1739

Confusion Matrix:
 [[624 239]
 [137 739]]


Random Forest correctly predicts ≈78.4% of the examples on the test set (1,363 of 1,739).

That is about 0.5 percentage points above Bagging (~77.9%), or 9 more correct predictions, so the improvement is marginal and could be within the noise of a single train/test split.
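One way to check whether a small accuracy gap like this holds up is k-fold cross-validation, which reports a mean and spread over several splits instead of a single number. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the Spaceship Titanic data) and smaller ensembles to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

models = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=25, random_state=42
    ),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=42),
}

cv_results = {}
for name, clf in models.items():
    # 5-fold cross-validated accuracy for each model on the same data
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the two means is smaller than the standard deviations, the ranking of the two models should be treated as uncertain.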

- Gradient Boosting

In [26]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [27]:
print(X_train.isnull().sum())

Age             146
RoomService     151
FoodCourt       148
ShoppingMall    172
Spa             152
VRDeck          146
dtype: int64


In [28]:
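# Drop rows with missing values and keep the labels aligned with the remaining rows.
# Note: this also shrinks the test set (1,739 -> 1,542 rows).
X_train = X_train.dropna()
y_train = y_train.loc[X_train.index]
X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]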

In [30]:
from sklearn.ensemble import HistGradientBoostingClassifier

gb_model = HistGradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

gb_model = GradientBoostingClassifier(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # learning rate
    max_depth=3,        # maximum depth of each tree
    random_state=42
)

# train
gb_model.fit(X_train, y_train)

# test prediction
y_pred_gb = gb_model.predict(X_test)

# evaluate
accuracy = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_gb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))


Gradient Boosting Accuracy: 0.7970168612191959

Classification Report:
               precision    recall  f1-score   support

       False       0.83      0.73      0.78       762
        True       0.77      0.86      0.81       780

    accuracy                           0.80      1542
   macro avg       0.80      0.80      0.80      1542
weighted avg       0.80      0.80      0.80      1542

Confusion Matrix:
 [[560 202]
 [111 669]]


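The model correctly predicts ≈79.7% of the examples on the test set (1,229 of 1,542).

This is higher than KNN (~77%) and Random Forest (~78%), but the comparison is not like-for-like: after dropping rows with missing values the test set has 1,542 rows, whereas Bagging and Random Forest were scored on all 1,739.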
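Because gradient boosting adds trees one at a time, the test accuracy can be inspected after each stage with `staged_predict`. The sketch below uses the same hyperparameters as the cell above but a synthetic dataset (an assumption, not the Spaceship Titanic data), so the numbers it prints are not the lab results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: accuracy={staged_acc[n_trees - 1]:.3f}")
```

A curve that flattens early suggests that fewer trees would do; one that is still rising suggests trying a larger `n_estimators`.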

- Adaptive Boosting

In [33]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


base_model = DecisionTreeClassifier(max_depth=1, random_state=42)

# Initialize
ada_model = AdaBoostClassifier(
    estimator=base_model,
    n_estimators=100,    # number of weak classifiers
    learning_rate=1.0,   # weight applied to each classifier's contribution
    random_state=42
)

# train
ada_model.fit(X_train, y_train)

# test predictions
y_pred_ada = ada_model.predict(X_test)

# evaluate
accuracy = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_ada))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ada))


AdaBoost Accuracy: 0.7879377431906615

Classification Report:
               precision    recall  f1-score   support

       False       0.83      0.72      0.77       762
        True       0.76      0.86      0.80       780

    accuracy                           0.79      1542
   macro avg       0.79      0.79      0.79      1542
weighted avg       0.79      0.79      0.79      1542

Confusion Matrix:
 [[547 215]
 [112 668]]




The model correctly predicts ≈78.8% of the examples on the test set (1,215 of 1,542).

That is slightly below Gradient Boosting (~79.7%), which was scored on the same 1,542 rows.

It is nominally above Bagging (~77.9%) and Random Forest (~78.4%), but those were scored on the full 1,739-row test set, so that comparison is only indicative.

---------------------------------------------------

**Which model is the best and why?**

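**Gradient Boosting** has the best accuracy (≈79.7%) and macro F1-score (~0.80):

- It builds trees sequentially, each one fitted to the errors of the current ensemble, which improves detection of both classes.

- It balances precision and recall well for the False and True classes.

AdaBoost is very close (≈78.8%) but slightly less accurate than Gradient Boosting on the same test rows.

Random Forest (≈78.4%) and Bagging (≈77.9%) score lower, but they were evaluated on the full 1,739-row test set while the boosting models were evaluated on the 1,542 rows left after dropping missing values, so the gap between the two groups is only indicative.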

**✅ Best Model: Gradient Boosting**


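- Highest overall accuracy (~79.7%).

- Good balance between precision and recall for both classes.

- Lowest error rate in the confusion matrices: 313 of 1,542 (20.3%), versus 327 of 1,542 (21.2%) for AdaBoost, 376 of 1,739 (21.6%) for Random Forest, and 385 of 1,739 (22.1%) for Bagging.

- Sequential method that progressively corrects errors from previous trees. AdaBoost is sequential too, so this may explain the edge over Bagging and Random Forest rather than the edge over AdaBoost. Pasting was not run here (the Bagging cell uses `bootstrap=True`).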
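For a like-for-like ranking, all four models should be scored on the same test rows. One possible setup is sketched below: missing values are imputed inside a pipeline instead of dropped, so every model sees an identical split. The data is synthetic with NaN injected (an assumption, not the Spaceship Titanic data), median imputation is just one reasonable choice, and the ensemble sizes are reduced to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with ~5% missing cells (assumption: not the lab dataset)
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, clf in candidates.items():
    # The imputer is fitted on the training rows only, then applied to the test rows
    pipe = make_pipeline(SimpleImputer(strategy="median"), clf)
    pipe.fit(X_tr, y_tr)
    results[name] = pipe.score(X_te, y_te)

for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```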