# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [3]:
#your code here
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [4]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [5]:
# Drop rows with missing values; .copy() avoids SettingWithCopyWarning below
spaceship_cleaned = spaceship.dropna().copy()
# Keep only the deck letter (first character) of the cabin code
spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].str[0]
# Drop identifier columns
spaceship_cleaned = spaceship_cleaned.drop(columns=['Name', 'PassengerId'])
# One-hot encode the categorical columns, dropping the first level of each
spaceship_encoded = pd.get_dummies(spaceship_cleaned, columns=['Cabin', 'HomePlanet', 'Destination', 'VIP', 'CryoSleep'], drop_first=True)
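To see what the cleaning cell does, the same steps can be tried on a small hand-made frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and mimic just a few columns of the real data.

```python
import pandas as pd

# Hypothetical toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", None, "A/0/S"],
    "HomePlanet": ["Europa", "Earth", "Earth", "Europa"],
    "Age": [39.0, 24.0, 30.0, 58.0],
    "Transported": [False, True, True, False],
})

# Drop incomplete rows and take an explicit copy before modifying columns
toy_cleaned = toy.dropna().copy()

# Reduce the cabin code to its deck letter
toy_cleaned["Cabin"] = toy_cleaned["Cabin"].str[0]

# One-hot encode; drop_first removes one level per column to avoid redundancy
toy_encoded = pd.get_dummies(
    toy_cleaned, columns=["Cabin", "HomePlanet"], drop_first=True
)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

The row with the missing cabin is removed, and each categorical column loses its alphabetically first level (`Cabin_A`, `HomePlanet_Earth`), which becomes the implicit baseline.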


In [6]:
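# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Separate the predictors from the target
features = spaceship_encoded.drop(columns=['Transported'])
target = spaceship_encoded['Transported']

# Standardize every feature to zero mean and unit variance.
# Caveat: the scaler is fit on all rows before the split, so test-set statistics influence the scaling.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)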

**Perform Train Test Split**

In [7]:
#your code here
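# Default split: 75% train, 25% test. No random_state is set, so the split (and every score below) changes between runs.
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled,
    target,
)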
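A possible variation, sketched here on synthetic data rather than the Spaceship Titanic features, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_demo`, `y_demo` and `model`, and the use of `make_classification`, `stratify` and `random_state=0`, are assumptions made for this illustration and are not part of the Lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and target (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Split before any fitting; stratify keeps the class balance in both parts,
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on X_tr only and reuses those statistics on X_te
model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

With this layout the test rows never influence the scaling parameters, and rerunning the cell gives the same split.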

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [8]:
# Bagging and Pasting
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, r2_score, mean_absolute_error, root_mean_squared_error
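The regression metrics imported above can be computed on boolean labels, but for 0/1 predictions they are tied to accuracy: MAE equals the error rate (1 - accuracy) and RMSE is its square root. The small check below uses made-up labels to show this, and adds a confusion matrix as a classification-oriented alternative; the arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    mean_squared_error,
)

# Hypothetical labels: 8 samples, 2 of them misclassified
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("accuracy:", acc)   # 6 correct out of 8 -> 0.75
print("MAE:", mae)        # 2 errors out of 8 -> 0.25 = 1 - accuracy
print("RMSE:", rmse)      # sqrt(0.25) = 0.5

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Because of this relationship, ranking classifiers by MAE or RMSE on a fixed test set gives the same order as ranking them by accuracy.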

In [9]:
#your code here
# Bagging
bg = BaggingClassifier()
bg.fit(X_train, y_train)
pred_bg = bg.predict(X_test)

print("Bagging accuracy:", accuracy_score(y_test, pred_bg), "\n")

print("Bagging RMSE:", root_mean_squared_error(y_test, pred_bg))
print("Bagging MAE:", mean_absolute_error(y_test, pred_bg))
print("Bagging R2 score:", r2_score(y_test, pred_bg))


Bagging accuracy: 0.7772397094430993 

Bagging RMSE: 0.47197488339624677
Bagging MAE: 0.22276029055690072
Bagging R2 score: 0.10832629306298724
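The cell above covers bagging only. In scikit-learn, pasting uses the same `BaggingClassifier` with `bootstrap=False`, so each base estimator is trained on a subset drawn without replacement. The sketch below compares the two on synthetic data; the dataset, `n_estimators=50`, `max_samples=0.5` and the seeds are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Bagging: each tree sees a sample drawn WITH replacement
bagging = BaggingClassifier(
    n_estimators=50, max_samples=0.5, bootstrap=True, random_state=42
)
# Pasting: each tree sees a sample drawn WITHOUT replacement
pasting = BaggingClassifier(
    n_estimators=50, max_samples=0.5, bootstrap=False, random_state=42
)

scores = {}
for name, model in [("bagging", bagging), ("pasting", pasting)]:
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name} accuracy: {scores[name]:.3f}")
```

Both variants use the default decision tree as base estimator; `max_samples` below 1.0 matters for pasting, since drawing all rows without replacement would give every tree the same training set.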


- Random Forests

In [10]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

In [11]:
#your code here

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print("Random Forest accuracy:", accuracy_score(y_test, pred_rf), "\n")

print("Random Forest RMSE:", root_mean_squared_error(y_test, pred_rf))
print("Random Forest MAE:", mean_absolute_error(y_test, pred_rf))
print("Random Forest R2 score:", r2_score(y_test, pred_rf))


Random Forest accuracy: 0.7905569007263923 

Random Forest RMSE: 0.4576495376088648
Random Forest MAE: 0.20944309927360774
Random Forest R2 score: 0.16163287336900434
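A random forest can also estimate its own generalization accuracy from the rows each tree did not see during bootstrapping (the out-of-bag samples), and it exposes per-feature importances. The following is a minimal sketch on synthetic data; `rf_demo`, the dataset and `n_estimators=200` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

# oob_score=True scores each row using only the trees that did not train on it
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

oob_accuracy = rf_demo.oob_score_
importances = rf_demo.feature_importances_

print("Out-of-bag accuracy:", round(oob_accuracy, 3))
for index, value in enumerate(importances):
    print(f"feature {index}: {value:.3f}")
```

The out-of-bag score gives a second opinion next to the held-out test accuracy without setting aside additional data.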


- Gradient Boosting

In [12]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

In [13]:
#your code here

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
pred_gb = gb.predict(X_test)

print("Gradient Boosting accuracy:", accuracy_score(y_test, pred_gb), "\n")

print("Gradient Boosting RMSE:", root_mean_squared_error(y_test, pred_gb))
print("Gradient Boosting MAE:", mean_absolute_error(y_test, pred_gb))
print("Gradient Boosting R2 score:", r2_score(y_test, pred_gb))


Gradient Boosting accuracy: 0.812953995157385 

Gradient Boosting RMSE: 0.4324881557252349
Gradient Boosting MAE: 0.18704600484261502
Gradient Boosting R2 score: 0.25128484933821493
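Since gradient boosting adds trees one stage at a time, `staged_predict` makes it possible to look at accuracy after each stage and judge how many trees are useful. This sketch uses synthetic data and a separate validation split; the names `gb_demo`, `X_val` and `best_stage` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=1)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=1)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
stage_accuracy = [
    accuracy_score(y_val, stage_pred) for stage_pred in gb_demo.staged_predict(X_val)
]

# Stage with the highest validation accuracy (1-based count of trees)
best_stage = int(np.argmax(stage_accuracy)) + 1
print("Best number of trees on the validation split:", best_stage)
print("Validation accuracy at that stage:", max(stage_accuracy))
```

The stage should be chosen on a validation split rather than on the final test set, otherwise the reported test accuracy becomes optimistic.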


- Adaptive Boosting

In [14]:
# Adaptive Boosting
from sklearn.ensemble import AdaBoostClassifier

In [15]:
#your code here

ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
pred_ada = ada.predict(X_test)

print("Adaptive Boosting accuracy:", accuracy_score(y_test, pred_ada), "\n")

print("Adaptive Boosting RMSE:", root_mean_squared_error(y_test, pred_ada))
print("Adaptive Boosting MAE:", mean_absolute_error(y_test, pred_ada))
print("Adaptive Boosting R2 score:", r2_score(y_test, pred_ada))


Adaptive Boosting accuracy: 0.7893462469733656 

Adaptive Boosting RMSE: 0.4589703182414244
Adaptive Boosting MAE: 0.2106537530266344
Adaptive Boosting R2 score: 0.15678682061391191
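`AdaBoostClassifier` was run with its defaults above. Its two main knobs are `n_estimators` and `learning_rate`, which trade off against each other. The loop below is a rough sketch on synthetic data; the candidate learning rates and the dataset are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=2
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=2)

# Validation accuracy for a few hypothetical learning rates
ada_results = {}
for learning_rate in (0.1, 0.5, 1.0):
    ada_demo = AdaBoostClassifier(
        n_estimators=100, learning_rate=learning_rate, random_state=2
    )
    ada_demo.fit(X_tr, y_tr)
    ada_results[learning_rate] = ada_demo.score(X_val, y_val)
    print(f"learning_rate={learning_rate}: {ada_results[learning_rate]:.3f}")
```

A smaller learning rate shrinks the contribution of each weak learner and usually needs more estimators to reach the same fit.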


Which model is the best and why?

In [16]:
#comment here
# Gradient Boosting scores highest on this split with an accuracy of about 0.813, ahead of Random Forest (0.791), Adaptive Boosting (0.789) and Bagging (0.777).
# RMSE, MAE and R2 rank the models in the same order, but for boolean predictions on the same test set they are all determined by the error rate (MAE = 1 - accuracy, RMSE = sqrt(MAE)), so they add no independent evidence.
# A likely reason for the result is that Gradient Boosting builds the model stage-wise, fitting each new tree to the errors left by the previous ones. Because the split is not seeded, small gaps such as the one between Random Forest and Adaptive Boosting may not hold across runs.
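A single train/test split can make close models swap places from run to run. One way to compare the four ensembles more robustly is cross-validation, sketched below on synthetic data. The `models` dictionary, the seeds and `cv=5` are assumptions for illustration; on the real data, `X_demo` and `y_demo` would be replaced by the prepared features and target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=3
)

models = {
    "Bagging": BaggingClassifier(random_state=3),
    "Random Forest": RandomForestClassifier(random_state=3),
    "Gradient Boosting": GradientBoostingClassifier(random_state=3),
    "Adaptive Boosting": AdaBoostClassifier(random_state=3),
}

# 5-fold cross-validated accuracy: mean and spread per model
cv_results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the mean accuracies of two models differ by less than their fold-to-fold spread, the data does not clearly favor either one.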