# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship_cleaned = pd.read_csv("spaceship_cleaned.csv")
spaceship_cleaned.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T
0,39,0,0,0,0,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
1,24,109,9,25,549,44,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0
2,58,43,3576,0,6715,49,0,1,0,0,...,0,1,1,0,0,0,0,0,0,0
3,33,0,1283,371,3329,193,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
4,16,303,70,151,565,2,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


**Perform Train Test Split**

In [3]:
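# Separate the features from the target
# Note: 'Unnamed: 0' (likely the row index saved with the CSV) is still a column in X here
X = spaceship_cleaned.drop(columns=['Transported'])
y = spaceship_cleaned['Transported']

# Hold out 20% of the rows for testing (train_test_split was imported above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)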

In [4]:
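from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on the training data only, then transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data with the statistics learned from the training data
X_test_scaled = scaler.transform(X_test)

# Note: the tree-based ensembles below are fit on the unscaled X_train / X_test;
# threshold-based tree splits are largely insensitive to per-feature standardization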


**Model Selection** - now you will apply different ensemble methods to try to get a better model

#### Bagging and Pasting

In [9]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [10]:
tree = DecisionTreeClassifier(random_state=42)


In [None]:
bagging_clf = BaggingClassifier(
    estimator=tree,      # Base model to bag
    n_estimators=100,     # Number of base models to create
    max_samples=0.8,      # Each model gets 80% of the training data
    bootstrap=True,       # With replacement (Bagging)
    random_state=42,
    n_jobs=-1             # Use all available CPUs
)

In [14]:
pasting_clf = BaggingClassifier(
    estimator=tree,
    n_estimators=100,
    max_samples=0.8,
    bootstrap=False,      # Without replacement (Pasting)
    random_state=42,
    n_jobs=-1
)

In [15]:
# Train both models
bagging_clf.fit(X_train, y_train)
pasting_clf.fit(X_train, y_train)

In [16]:
# Predictions
bagging_pred = bagging_clf.predict(X_test)
pasting_pred = pasting_clf.predict(X_test)

In [18]:
bagging_acc = accuracy_score(y_test, bagging_pred)
pasting_acc = accuracy_score(y_test, pasting_pred)
print(f"Bagging Accuracy: {bagging_acc:.2%}")
print(f"Pasting Accuracy: {pasting_acc:.2%}")

Bagging Accuracy: 81.39%
Pasting Accuracy: 80.86%


#### Random Forests

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [23]:
rf_clf = RandomForestClassifier(
    n_estimators=100,      # Number of trees in the forest
    max_depth=None,        # Expand until all leaves are pure or contain fewer than min_samples_split samples
    random_state=42,       # Reproducibility
    n_jobs=-1              # Utilize all available CPUs
)

In [24]:
rf_clf.fit(X_train, y_train)

In [25]:
y_pred_rf = rf_clf.predict(X_test)

In [26]:
rf_accuracy = accuracy_score(y_test, y_pred_rf) * 100
print(f"Random Forest Accuracy: {rf_accuracy:.2f}%")

Random Forest Accuracy: 81.62%
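
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive the predictions. The following is a minimal sketch on synthetic data (an assumption; the column names `feature_0` ... `feature_7` and the variable `rf_demo` are hypothetical). With the lab's model, the same pattern would use `rf_clf.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names (assumption)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1 across features
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances.head(5))
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than a precise measure.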


#### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

gb_clf = GradientBoostingClassifier(
    n_estimators=100,       # Number of trees
    learning_rate=0.1,      # Contribution of each tree
    max_depth=3,            # Depth of each tree
    random_state=42         # Ensures reproducibility
)



In [29]:
gb_clf.fit(X_train, y_train)

In [30]:
y_pred_gb = gb_clf.predict(X_test)

In [31]:
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb:.2%}")

Gradient Boosting Accuracy: 80.56%
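
Gradient boosting adds trees one at a time, so accuracy can be inspected after each stage with `staged_predict`. The sketch below shows the idea on synthetic data (an assumption, along with the names `gb_demo`, `X_tr`, `X_te`, `staged_acc`, and `best_n`); it is meant to illustrate how `n_estimators` affects the score, not to reproduce the lab's result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the lab's split if desired
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Accuracy on the held-out rows after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]
best_n = staged_acc.index(max(staged_acc)) + 1

print(f"Best held-out accuracy {max(staged_acc):.2%} at {best_n} trees")
```

Choosing the number of trees from the test set gives an optimistic estimate; in practice a separate validation split would be the safer place to pick it.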


#### Adaptive Boosting

In [35]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner (decision stump)
    n_estimators=50,       # Number of weak learners
    learning_rate=1.0,     # Contribution of each learner
    random_state=42        # Ensures reproducibility
)

In [36]:
ada_clf.fit(X_train, y_train)



In [37]:
y_pred_ada = ada_clf.predict(X_test)

In [39]:
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"Adaptive Boost Accuracy: {accuracy_ada:.2%}")

Adaptive Boost Accuracy: 79.73%


Which model is the best and why?

## In order of accuracy:

##### 1. Random Forest Accuracy: 81.62%
##### 2. Bagging Accuracy: 81.39%
##### 3. Pasting Accuracy: 80.86%
##### 4. Gradient Boosting Accuracy: 80.56%
##### 5. Adaptive Boost Accuracy: 79.73%


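##### Random Forest had the highest test accuracy (81.62%), although its margin over Bagging (0.23 percentage points) is small for a single train/test split. A likely reason is that it averages many decision trees, each trained on a bootstrap sample with a random subset of features considered at every split, which decorrelates the trees and reduces variance and overfitting.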
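
The accuracies ranked above come from one 80/20 split and differ by at most about two percentage points, so k-fold cross-validation is one way to check whether the ordering is stable. The sketch below compares three of the ensemble types with `cross_val_score`. It uses synthetic data and smaller models to stay quick (both assumptions, as are the names `models` and `cv_results`); with the lab's data you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); replace with X, y from the lab
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # Five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")
```

If the fold-to-fold standard deviation is larger than the gap between two models, the single-split ranking between them should be treated as inconclusive.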

In [41]:
y_pred = rf_clf.predict(X_test)
print("Predicted values:", y_pred[:10])
print("Actual values:   ", y_test[:10].values)

Predicted values: [1 1 1 1 0 0 1 0 0 1]
Actual values:    [1 1 1 0 0 0 1 0 0 1]
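
A confusion matrix shows what kind of mistakes the model makes, not just how many. As a small worked example, the sketch below applies `confusion_matrix` to the ten predictions and labels printed above (the names `y_pred_sample` and `y_true_sample` are introduced here for illustration). For the full test set, the same call would take `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# The ten predicted and actual values printed above
y_pred_sample = [1, 1, 1, 1, 0, 0, 1, 0, 0, 1]
y_true_sample = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true_sample, y_pred_sample)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# Prints:
# [[4 1]
#  [0 5]]
# TN=4, FP=1, FN=0, TP=5
```

In this ten-row sample the single error is a false positive: one passenger predicted as transported who was not.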
